Almost every engineer, consciously or subconsciously, wants to make the best architectural decisions from the get-go (hello!👋). The business will want systems that are 100% consistent, 100% reliable, 100% bug-free and cheap as chips (of course!). We want our designs to stand the test of time so that we only ever extend them cleanly by adding new components, as opposed to changing the existing ones. We are afraid of painting ourselves into a corner with our decisions (in architecture and in life), yet the reality is that we won’t know today what we will know tomorrow, so we have to make the best decisions for today and keep evolving them as the context evolves. This last part is where many teams fall by the wayside (mine included). All the good intentions labelled “we should fix this!”, albeit with unarticulated consequences, get buried in the bottomless pit that is the JIRA backlog, competing with feature work, because we’ve made peace with the status quo in the absence of irrefutable, cold, hard data that would convince the business to prioritise them.
This unsettles me, so I will come back to this in a bit!
But, how do I make the best decision for today? The goal of architecture is to provide a safe, flexible and reliable space to solve real business problems over time. Every problem/solution domain will have certain risks associated with it. If my decisions are reducing those risks in a meaningful way whilst solving the problem, then those are the best decisions for today.
The book Just Enough Software Architecture: A Risk-Driven Approach reiterates the well-known, albeit somewhat flawed, risk formula:
Risk = probability of failure × impact of the failure
In complex distributed systems architecture, the probability of failure is never going to be zero, and modern managed cloud systems, reliable and available though they are, often leave us without much control over them. So the only variable that we can somewhat influence is the impact of that failure.
In other words, by incorporating faster recovery patterns into the architecture, the impact of a failure can be reduced, thus lowering the overall risk. For example, by setting up a dead letter queue and establishing a requeuing policy, the impact of failed messages can be reduced because we can recover from such a failure without losing data. With retries and circuit breakers, we can achieve resilience and fault tolerance between services. Some recovery measures are also built into the business processes. For example, if the supplier delivers more stock than was requested in a purchase order, the warehouse might still accept the stock but make a record of a conflicting delivery, so the impact of an incorrect delivery is lowered. Or if we accidentally charge a customer’s credit card twice, we might issue a refund after the fact or a compensatory gift voucher for the same amount. Needless to say, these are edge case scenarios, not the norm!
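To make the dead letter queue idea a little more concrete, here is a minimal in-memory sketch. It is not how any particular broker implements it: the names `Message`, `process_with_dlq` and `requeue`, and the `max_attempts` policy, are all assumptions made up for illustration. The point is only that a message which keeps failing is parked rather than lost, and can be replayed once the fault is fixed.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    attempts: int = 0  # how many times we have tried to handle this message


def process_with_dlq(queue, dead_letters, handler, max_attempts=3):
    """Drain `queue`, retrying failures; park repeat offenders in `dead_letters`."""
    while queue:
        message = queue.popleft()
        message.attempts += 1
        try:
            handler(message.body)
        except Exception:
            if message.attempts >= max_attempts:
                # Give up for now, but keep the data so it can be recovered later.
                dead_letters.append(message)
            else:
                # Put it back at the end of the queue for another attempt.
                queue.append(message)


def requeue(dead_letters, queue):
    """Requeuing policy (hypothetical): replay everything once the fault is fixed."""
    moved = 0
    while dead_letters:
        message = dead_letters.popleft()
        message.attempts = 0  # start the retry budget afresh
        queue.append(message)
        moved += 1
    return moved
```

A real broker would add delays between retries and persist both queues, but the trade is the same: the probability of a failed message has not changed, only what it costs us when it happens.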
Through conversations with the business stakeholders, various risks and failure scenarios can be identified, brainstormed and prioritised. Then it’s a matter of mitigating the topmost risks from both the engineering and the business point of view. Purely engineering risks (e.g. lack of experience in a specific technology, lack of tooling support, a codebase with poor maintainability and evolvability, difficulty of integration with legacy systems, a hard-to-evolve architecture, etc.) will still have to be mitigated by the engineering team because they may indirectly end up impacting the business process.
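As a rough illustration of that prioritisation, the risk formula can be applied to a list of scenarios and the result sorted. This is only a sketch: the `FailureScenario` class and the 0.0 to 1.0 probability and 1 to 5 impact scales are my assumptions for the example, not something the book prescribes, and any numbers you feed it are estimates agreed with stakeholders rather than measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureScenario:
    description: str
    probability: float  # estimated likelihood, 0.0 to 1.0 (assumed scale)
    impact: int         # agreed severity, 1 (minor) to 5 (severe) (assumed scale)

    @property
    def risk(self) -> float:
        # Risk = probability of failure x impact of the failure
        return self.probability * self.impact


def prioritise(scenarios):
    """Return the scenarios ordered from highest to lowest risk."""
    return sorted(scenarios, key=lambda s: s.risk, reverse=True)
```

The ordering matters more than the absolute scores: it gives the conversation a shared, if crude, way of agreeing which scenarios deserve mitigation in the architecture first.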
To have these conversations in a meaningful way, the JESA book recommends that each risk be described as a testable failure scenario that the business stakeholders can also relate to, for example:
Two or more users editing the same purchase order at about the same time could result in one of their changes being lost.
The reason for this is that identifying the failure then becomes easier (i.e. lost data due to a concurrent update), which in turn makes reasoning about the probability and the impact of that failure more concrete. The stakeholders might say, “Oh, the users are only allowed to edit purchase orders they created, not any purchase order, so the probability of the failure is low”, and this will also pull down the impact score. Or they might say, “It’s OK if they end up stepping on each other’s toes, they are supposed to resolve it amongst themselves”, and whilst this increases the probability of failure, it reduces its impact, thus keeping the overall risk lower.
But if they say, “Yeah, we should really try and not have that problem happen, or at least make the users aware that the purchase order has been changed by someone else so they can decide accordingly. Otherwise we might be communicating incorrect data to suppliers, which will affect stock levels in the warehouse negatively”, then we know that both the probability and the impact of this failure are high. That increases the overall risk score and makes it a high-priority risk item that needs mitigation in the architecture.
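One common way to mitigate the lost update scenario above is optimistic concurrency: every write must state which version of the purchase order it was based on, and a stale write is rejected so the user can be told that someone else changed the order. The sketch below is an in-memory illustration under that assumption. `PurchaseOrderStore`, `StaleUpdateError` and the `version` field are hypothetical names, and a real implementation would do the version check atomically in the database.

```python
from dataclasses import dataclass


class StaleUpdateError(Exception):
    """Raised when the order has changed since the caller last read it."""


@dataclass
class PurchaseOrder:
    order_id: str
    quantity: int
    version: int = 1


class PurchaseOrderStore:
    def __init__(self):
        self._orders = {}

    def add(self, order):
        self._orders[order.order_id] = order

    def read(self, order_id):
        # Hand back a copy so callers cannot bypass the version check.
        current = self._orders[order_id]
        return PurchaseOrder(current.order_id, current.quantity, current.version)

    def update(self, order_id, quantity, expected_version):
        current = self._orders[order_id]
        if current.version != expected_version:
            # Someone else saved a change after the caller read the order.
            raise StaleUpdateError(
                f"{order_id} is at version {current.version}, "
                f"but the edit was based on version {expected_version}"
            )
        current.quantity = quantity
        current.version += 1
        return current.version
```

If two users read the same order and both try to save, the first save succeeds and the second raises `StaleUpdateError` instead of silently overwriting the first. What the user interface does next (reload, merge, ask the user) is a business decision, which is exactly the conversation the failure scenario is meant to start.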
Sometimes it won’t be that straightforward because the engineering problem is too far removed from, or too low-level for, the stakeholders’ mental model of what it takes to build a scalable and useful product. The conversation will inevitably hit a translation barrier and you will either get an outright rejection, because they don’t understand the value and we often fear what we don’t understand, OR you will get a disinterested response like, “do whatever you need to do to make it future proof!”, given without fully appreciating the gravity of that statement.
This is where I prefer not to bother them with technical details beyond what’s directly relatable to them, and instead focus on what my proposal enables in general terms. For example, if I need to get some buy-in to refactor one big service into smaller services, I might present arguments around one part of the business process staying up when another service goes down, or around being able to support new use cases more rapidly and improving the overall reliability of the process. Other times, the cost of translating a technical change into a business enablement scenario might just be too great, which means I’d then lump that change under the umbrella of “engineering maintenance and updates”, and that’s usually all the stakeholders will have the patience for anyway!
Point being, this risk-based approach to architecture grounds the decision making in reality and pragmatism rather than in the engineers’ own perception of some hypothetical risk or a dogma-driven assessment of it, which could result in over-engineering while not addressing the important risks enough. The only exception is that if you (engineers and stakeholders) know of typical risks in the domain you are operating in, then you could mitigate those risks upfront. For example, in my domain we’re aware that the data demands of the stakeholders could induce tight coupling with specific databases that we don’t own, so we have to pay attention to that engineering risk in our architecture and enforce ownership boundaries a bit more strongly.
It’s obvious but worth mentioning anyway: no matter how good a decision you think you are making today, it will always have trade-offs and will never be perfect for everything that will be thrown at it over time. In all the examples I’ve highlighted above, I am gaining something at the cost of something else. For example, by adding queues, I am gaining fault tolerance and availability, and accepting complexity and a different programming model as trade-offs. By opting for the serverless paradigm, I am gaining better scalability, lower maintenance overhead and a smaller application footprint, but accepting the trade-offs of the server being a black box, platform-level issues being harder to debug, having to design applications to fit within the resource-constrained environments they run in, and having to think about concurrency a lot more than in more conventional compute environments.
This is why having a discipline around evolving designs as contexts change, or when you know better, is not only critical for a healthy product but also critical for team morale. Teams should feel rewarded for wanting to improve things, not ignored or jerked around because “we have higher priority things to do, we can always do this later!”. Agility is first and foremost about feedback loops and continuous improvement. A sprint is an experiment, and from its outcomes come the learning and knowledge that help us get better in the next round. This also happens to be the core of engineering! It’s important for Product Management and Engineering to work together as partners in successes and failures, as opposed to one side working against the best intentions of the other.
Whilst not a given at any level, this is often easier for organisations that have a level of leadership support around engineering that helps lay the foundation for good practices and build upon them, all the while helping engineers get management buy-in. Some of these practices are codified as organisation-wide Engineering Principles and Strategies which teams adopt, improve upon and drive forward. This makes it easier to use these principles as a guidepost for making architectural (refactoring) decisions during or after iterations. As a Principal Engineer, I am always looking to hoist good practices into our org-wide engineering principles and create guidelines around some of these.
At the team level I have also helped establish an architectural vision and strategy to help set goals for architectural evolution in a piecemeal fashion. As a follow-on, I also help teams do regular architectural reviews and identify areas for improvement going forward. This helps Product Owners understand a proposal well enough to put it on the roadmap accordingly. In the end we all want to do the right thing; we only need to be more accommodating of each other’s perspectives. This can be hard at times, and I falter as well, but understanding that we will always have more to do than we have time for, so we’ve got to prioritise, helps me refocus and persevere on critical items.
Another heuristic I practise, and encourage others to practise as well, is to always make the best engineering decisions we can for today in accordance with our engineering principles, but also to document ways to evolve the design in the future based on the signals we get from the context at the time. In other words, build for known knowns but plan for known unknowns, so we can avoid unnecessary and avoidable re-work. Both these methods ultimately aim for continuous improvement, or Kaizen (as mentioned in the Toyota Production System philosophy).
Command Query Responsibility Unification
For one of the product increments we delivered, we made a decision to split the reads and writes between two services in a CQRS-style architecture to optimise for data reuse. However, over time we had problems where the read side would often "get stuck" on a stale version of the data (story for another post). This led to constant user complaints, degradation of trust and dissatisfaction. We had documented this possibility from the very beginning, so we decided to forgo the CQRS style in favour of reads and writes both going to the source-of-truth service, which increased the likelihood of recovery from out-of-sync data, given that both operations now happen against the same database.
Sometimes dealing with real failures is the only way to identify which part of the system should evolve, how it should evolve, by how much, and what the trade-offs are. These will be the cases where you don’t know what you don’t know (i.e. the unknown unknowns). Real failures can also help put things in perspective for the business so that it wants to put corrective actions in place, because the consequences become more tangible. It’s like breaking your bones to have them heal stronger and more robust, but I would prefer to not have to break bones in the first place if I can avoid it. It also makes for a very painful long-term strategy!😉
I have had discussions with Product Managers in the past where that was their evolution strategy. I wonder if that’s their philosophy for things like car maintenance or looking after their health. I can’t imagine that they would only change their lifestyle and eating habits when their doctor says, “change or die!”. Would I call this smart? Would you?
Blocked Procs
About half a decade ago we (my predecessors, and with good intentions I am sure) created several stored procedures in the org-wide shared database to do something at regular intervals, and for a long time it kinda worked, until about last May when we suffered a 28-hour-long outage of our purchase ordering process. Much of the cause lay in these stored procedures, which updated a large number of rows. This caused long-running transactions and excessive row locking, which ultimately caused the process to start timing out without completing. It also started affecting other critical business processes in the organisation. It was very difficult for the team to debug and diagnose the issues with any certainty (some of it has to do with the way we are set up as an org)! This helped crystallise the problem enough for the business to want to invest some time into improving this part of the architecture, leveraging an asynchronous event-driven paradigm as opposed to a scheduled batch-type process that can be harder to recover from. There are other trade-offs in this approach, but with the logic being in application code, the odds of faster recovery go up. It also helps make the design more explicit and the context boundaries clearer, and having domain concepts represented in programming language code, as opposed to buried in a stored procedure, makes it easier to evolve the system without affecting others.
If you think good architecture is expensive, try bad architecture.—Brian Foote and Joseph Yoder
The way this usually plays out is that management (and sometimes engineers) will keep putting off essential refactoring and architectural improvements because they see them as additional cost overhead which doesn’t benefit users. All the while, new stuff is being forced around the convoluted design, leading to a product that is even more difficult to evolve. Eventually it reaches a point where even a small change causes something else to break entirely or, even worse, changing the colour of a button starts to look like a multi-month project. At this point an external consultant is called in, who charges the org a big fat wad of cash and comes to the exact same conclusion that the engineering team had been voicing all along. Then the management comes up with a revolutionary idea to fix it all: REWRITE (that is, before they decide to outsource it all to a low-income country in order to save costs, and the whole cycle repeats). True story, I kid you not!
There are other examples of similar or worse debacles: the Healthcare.gov hellscape, the Knight Capital nightmare, Volkswagen’s vandalism, the UK Post Office’s poop-up. The internet is chock-full of them, if only one cares to learn from them. These issues are not even unique to software, but software happens to be the easier one to mess up because it’s soft and we’re an immature and unregulated industry! But…that is a matter for another post!
My preference (as is the preference of most professional and responsible software engineers I have spoken to) is to intervene before we reach these extremes: before we break our bones painfully, before our car conks out in the middle of nowhere and before our heart explodes in our chest. My appeal to all engineers is to keep pushing and keep challenging assumptions, and not get “influenced” by management speak. You will also need to sharpen your ability to zoom out from the purely technical to the intersection between business and technical in order to negotiate a positive outcome effectively. This is also hard, but you must keep in mind that as long as the incentives and goals align, both sides have negotiating power. The key is to make sure that both sides gain something from the negotiation for it to have any longevity. I falter here as well at times, but I keep reminding myself of why we are doing this.