A few months back, in March, I was invited to talk at PHP Portugal #[Lisbon] v7 meetup. My talk was titled “Three strategies to split your monolith into services” where I briefly introduced The Strangler Fig, Branch by Abstraction, and The Decorating Collaborator patterns.
Of those three, my favourite is the last one, The Decorating Collaborator, because it is the one that allows programmers to stop developing in the monolith immediately and focus on developing new functionality in a new service. To do so, it requires a proxy application that will be responsible for decorating – hence the name – new behaviour onto the old monolith’s responses, making it look like we’re still calling the monolith when, in fact, we might be interacting with the new service.
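To make that more concrete, here’s a minimal sketch of such a proxy in Go. The upstream URLs, the endpoint, and the idea of attaching loyalty points from the new service to the monolith’s response are all made up for illustration:

```go
package main

import (
	"io"
	"net/http"
)

// Hypothetical upstreams: the legacy monolith and the new service.
const (
	monolithURL = "http://monolith.internal"
	loyaltyURL  = "http://loyalty.internal"
)

func main() {
	http.HandleFunc("/orders/", func(w http.ResponseWriter, r *http.Request) {
		// 1. Forward the original call to the monolith, untouched.
		resp, err := http.Get(monolithURL + r.URL.Path)
		if err != nil {
			http.Error(w, "monolith unavailable", http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()

		// 2. Decorate: ask the new service for the extra behaviour/data.
		if extra, err := http.Get(loyaltyURL + "/points" + r.URL.Path); err == nil {
			defer extra.Body.Close()
			w.Header().Set("X-Loyalty-Points", extra.Header.Get("X-Points"))
		}

		// 3. Return the (now decorated) monolith response as if nothing changed.
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body)
	})
	http.ListenAndServe(":8080", nil)
}
```

The caller keeps talking to a single endpoint; whether the answer came purely from the monolith or was enriched by the new service is the proxy’s business.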
Another trade-off to consider is that the more information is required for interacting with the new service, the more complex and tangled the proxy application implementation becomes.
This pattern is best used to compose responses and side effects from the inbound request and/or the monolith’s response when we can’t, or won’t, change the monolith itself. It can stay feature-frozen, or even outside of our control, and still be evolved.
Maybe we can even, gradually, replace features completely until the monolith ceases to exist.
I love the SOLID design principles. It’s a pity that there is a lot of “hate” towards them, and I think it’s mostly because they are pushed as a dogmatic way of developing software, as a set of inflexible rules that need to be respected at all costs.
That pushes people away from seeing and understanding how useful these principles are. And that’s exactly what they are: a set of principles. It’s literally in the name. But I’m not here to regurgitate, yet again, what SOLID means or represents. I think that every programmer, eventually, will learn exactly that. But take them for what they are: principles!
My love for them lies exactly in that nuance. They help me guide my applications’ architecture, but I don’t feel bad if a function or class has more than one responsibility, or if I decide to modify them instead of extending their functionality, or if I’m not adhering completely to every other principle. But I know and respect that those principles exist for valid reasons: to develop manageable and extensible software.
But I’m also not naive to the point of not understanding that there are many possible contexts, be it architecturally or because of team dynamics, in which one might consciously need to decide that it can’t or shouldn’t be done now.
Huge traffic spikes hitting our applications can be very problematic and scary, especially if we’re not prepared for such situations.
The common solution I see being applied is to have some type of auto-scaling strategy. It’s a good and simple enough solution: launch more instances to deal with more traffic and reduce them when the spike is over. But this automation might take some time to kick in, so the thresholds need to account for that as we don’t want to react too late and too close to the hardware limits or risk losing instances in the process and, consequently, data.
Another solution, although it’s not for every type of request, is to have what’s called a Firehose event stream. This means that instead of having variable loads of requests hitting your application directly, they are first put in an event stream before being consumed by the application. We basically create a virtual dam, and we can control the flow of incoming work.
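As a sketch of what that dam worker could look like – in Go, with a made-up publisher interface standing in for Kafka, Kinesis, Redis Streams, or whatever stream you use – the whole job is: accept the request, append it to the stream, acknowledge:

```go
package main

import (
	"io"
	"log"
	"net/http"
)

// publisher is a stand-in for whatever event stream is used; the interface is made up.
type publisher interface {
	Publish(topic string, payload []byte) error
}

// chanPublisher is a toy in-memory implementation, just to keep the sketch runnable.
type chanPublisher struct{ ch chan []byte }

func (c chanPublisher) Publish(topic string, payload []byte) error {
	c.ch <- payload
	return nil
}

// ingest is the whole "dam" worker: take the request, append it to the stream,
// acknowledge. No business logic lives here, which is why it can usually sustain
// far more requests per second than the application behind it.
func ingest(p publisher) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		if err := p.Publish("incoming-requests", body); err != nil {
			http.Error(w, "stream unavailable", http.StatusServiceUnavailable)
			return
		}
		// 202 Accepted: the application will consume the event at its own pace.
		w.WriteHeader(http.StatusAccepted)
	}
}

func main() {
	p := chanPublisher{ch: make(chan []byte, 1024)}
	http.Handle("/events", ingest(p))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```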
Now, I want to be clear: this dam needs to be robust. And it fundamentally shifts the underlying problem from the application to the workers that compose this dam. But since those workers have a simpler workflow – take the request and store it in the event stream – they can probably handle more requests per second than the application. So they can do more with less.
With this approach, the application won’t hit resource exhaustion because the load of the work is under our control. It won’t need to scale until it can’t keep up with the lag of events to consume, or until it risks breaching any SLA or SLO we have to guarantee.
But, as said before, this is not for every type of request. If you need to deliver a response immediately, it’s an incompatible solution. And that, alongside the additional infrastructure requirements of adding an event or message broker and the actual workers, are pretty much the trade-offs implied.
I love a good monolith. It’s easy to develop and maintain, deploy and scale. It’s a real beauty. But, as with everything, developing in a monolith has its trade-offs and requires some effort to keep it from being a huge bottleneck to our companies and teams.
If no care is taken to keep a separation by business domains (yep, DDD), the monolith ends up creating a lot of friction, coupling, and conflicts between the teams. With well-defined domains, changes in a domain have a very low possibility of impacting other domains.
The ease of deployment carries a huge responsibility. Since the monolith is deployed as a whole, there’s little to no way to really only do it when all teams have finished their work completely. Unfinished features must be protected from executing, or we risk having errors thrown at the user and the application terminating unexpectedly. Feature flags allow the continuous deployment of the monolith, deferring access to unfinished features until a later moment.
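A minimal sketch of that guard, with a hypothetical flag lookup (in real life it would come from config, a database, or a flag service):

```go
package checkout

// enabled is a stand-in for however flags are resolved in the real system:
// an environment variable, a config file, a database row, or a flag service.
func enabled(flag string) bool {
	return false // unfinished features default to "off"
}

// Checkout can be deployed with the rest of the monolith at any time;
// the unfinished path stays unreachable until the flag is flipped.
func Checkout(orderID string) error {
	if enabled("new-discount-engine") {
		return checkoutWithDiscounts(orderID) // new, still in progress
	}
	return legacyCheckout(orderID) // current behaviour
}

func checkoutWithDiscounts(orderID string) error { return nil }
func legacyCheckout(orderID string) error        { return nil }
```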
The development itself gets limited to one or two programming languages. In the context of web applications, for example, this can mean having a backend and a frontend language (in the case of a full-stack app). This can impact how recruitment is done in the company and the talent pool available.
We all know that database transactions ensure that data is only persisted if all write queries are successfully executed. Otherwise, the entire operation is automatically rolled back. This is super helpful, and I believe we’ve all relied on this feature forever!
But transactions have trade-offs. The engine will lock either entire tables or rows, depending on the operations being performed. And depending on the type of locks, even read queries will wait for the transaction to complete before gaining access to the data. In a busy enough service or domain, this can accumulate and become a performance bottleneck.
So be very careful about operations that could keep those transactions open for an extended period of time. As a rule of thumb, we should open transactions as close as possible to the write operations and close them as soon as the work is done, to lower the chances of holding those locks for too long.
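Here’s a rough sketch of that rule of thumb in Go, with hypothetical tables and an external call standing in for the slow work: everything expensive happens before the transaction, and the transaction wraps only the writes:

```go
package billing

import (
	"context"
	"database/sql"
)

// settleInvoice does the slow work (external calls, heavy computation) *before*
// the transaction, so locks are held only for the writes themselves.
func settleInvoice(ctx context.Context, db *sql.DB, invoiceID string) error {
	amount, err := fetchAmountFromProvider(ctx, invoiceID) // slow: no locks held yet
	if err != nil {
		return err
	}

	tx, err := db.BeginTx(ctx, nil) // open as late as possible
	if err != nil {
		return err
	}
	defer tx.Rollback() // harmless no-op if we already committed

	if _, err := tx.ExecContext(ctx,
		`UPDATE invoices SET amount = $1, settled = true WHERE id = $2`,
		amount, invoiceID); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO ledger (invoice_id, amount) VALUES ($1, $2)`,
		invoiceID, amount); err != nil {
		return err
	}
	return tx.Commit() // close as soon as the writes are done
}

func fetchAmountFromProvider(ctx context.Context, invoiceID string) (int64, error) {
	return 0, nil // placeholder for the expensive, lock-free part
}
```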
This is something I don’t see being discussed much within development teams. It’s almost as if there’s some kind of fear of serving stale content, or an internal assumption that information should always be fetched fresh, guaranteed to the second.
Here’s another (kind of) hot take for this calm Monday morning: there’s a very finite number of reasons to not use a caching strategy. In fact, I believe it should be something we default to using unless proven otherwise.
Here’s a basic strategy to cache and refresh data: much of the data I usually see applications manage is mostly updated through specific side effects or endpoints. This is a perfect use case for caching on updates: refresh the cache whenever the data is explicitly updated, and always try to serve from the cache whenever that data is retrieved.
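A minimal sketch of that strategy, with a made-up cache interface and storage helpers:

```go
package profiles

import (
	"context"
	"time"
)

// Cache is a stand-in for whatever you use (Redis, Memcached, in-memory, ...).
type Cache interface {
	Get(ctx context.Context, key string) ([]byte, bool)
	Set(ctx context.Context, key string, value []byte, ttl time.Duration)
}

type Service struct {
	cache Cache
}

// UpdateProfile is the explicit side effect that changes the data,
// so it is also the moment the cache gets refreshed.
func (s *Service) UpdateProfile(ctx context.Context, id string, payload []byte) error {
	if err := writeToDatabase(ctx, id, payload); err != nil {
		return err
	}
	s.cache.Set(ctx, "profile:"+id, payload, 24*time.Hour)
	return nil
}

// GetProfile always tries the cache first and only falls back to the database on a miss.
func (s *Service) GetProfile(ctx context.Context, id string) ([]byte, error) {
	if data, ok := s.cache.Get(ctx, "profile:"+id); ok {
		return data, nil
	}
	data, err := readFromDatabase(ctx, id)
	if err != nil {
		return nil, err
	}
	s.cache.Set(ctx, "profile:"+id, data, 24*time.Hour)
	return data, nil
}

func writeToDatabase(ctx context.Context, id string, payload []byte) error { return nil }
func readFromDatabase(ctx context.Context, id string) ([]byte, error)     { return nil, nil }
```

The write path is the single place that knows the data changed, so it’s also the single place responsible for refreshing the cache.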
Even in more complex workflows, where parts of the data must be fetched or go through some type of real-time calculation/transformation when the request comes in, we can always (see the sketch after this list):
1. Fetch the base data from the cache.
2. Fetch or calculate only the needed “real-time” data from the appropriate data sources.
3. Compose and serve the entire information.
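Something like this, with hypothetical product data where the description is cached and the stock is the only part fetched live:

```go
package catalogue

import "context"

type Product struct {
	ID          string
	Description string
	Stock       int // the "real-time" part
}

// GetProduct follows the three steps above; cachedBase and liveStock are
// hypothetical helpers for the cached and real-time data sources.
func GetProduct(ctx context.Context, id string) (Product, error) {
	base, err := cachedBase(ctx, id) // 1. cached, slow-changing data
	if err != nil {
		return Product{}, err
	}
	stock, err := liveStock(ctx, id) // 2. fresh, per-request data
	if err != nil {
		return Product{}, err
	}
	base.Stock = stock // 3. compose and serve
	return base, nil
}

func cachedBase(ctx context.Context, id string) (Product, error) { return Product{ID: id}, nil }
func liveStock(ctx context.Context, id string) (int, error)      { return 0, nil }
```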
Caching data also creates a temporary redundancy in our systems. As long as the TTL (Time To Live) of the data hasn’t expired, the underlying data source can be offline and we’ll still serve the cached data. Response times will improve, too.
I’m generalising, of course, but this should at least be a push to start a discussion about whether caching could bring benefits in your particular cases.
It’s better to discuss and conclude that it’s not something you want to add than to skip the discussion and keep investing computational power and service communication in always fetching fresh data you don’t need.
Here’s a hot take (I think) to end this beautiful Sunday: avoid using hidden control flows in your applications. I’m referring specifically to throwing exceptions. When we throw an exception, we lose control of the application flow.
It’s also something that can’t really be ignored. Callers need to wrap interactions with methods and functions that throw exceptions even when they don’t really need to do anything with the exception; otherwise, their own flow would be interrupted unnecessarily.
A more sensible solution is to do what Go and Rust do: return the error. Let the caller decide what to do with the returned values instead of being forced to handle everything with awful try/catch blocks.
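In Go, that looks something like this (the parsing function is just an illustration):

```go
package pricing

import (
	"errors"
	"fmt"
	"strconv"
)

var ErrNegativePrice = errors.New("price cannot be negative")

// ParsePrice returns the error as a value instead of throwing:
// the flow stays visible at the call site.
func ParsePrice(raw string) (float64, error) {
	price, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return 0, fmt.Errorf("parsing %q: %w", raw, err)
	}
	if price < 0 {
		return 0, ErrNegativePrice
	}
	return price, nil
}
```

The caller then decides, explicitly, whether to handle, wrap, or ignore the error – no try/catch wrapping needed just to keep its own flow intact.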
Maybe this could be a nice topic to blog about or submit to a meetup/conference?… 🤷♂️
Infrastructure layer: persistence, data transfer, low-level operations.
And there’s an implicit order of dependencies: higher layers should depend on classes/functions from lower layers. If a lower layer needs to communicate with a higher layer, it should do so through events.
This has greatly helped me understand and design highly decoupled, cohesive applications.
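As an illustration of that direction of dependencies, here’s a tiny, made-up in-process event bus: the infrastructure code only knows about the event it emits, never about the higher-layer code that reacts to it:

```go
package app

// UserSaved is the event the infrastructure layer emits.
type UserSaved struct{ ID string }

var userSavedHandlers []func(UserSaved)

// Subscribe is called from a higher layer (application/domain) that wants to react.
func Subscribe(h func(UserSaved)) {
	userSavedHandlers = append(userSavedHandlers, h)
}

// SaveUser lives in the infrastructure layer: it persists the user and emits an
// event instead of calling the higher layer directly.
func SaveUser(id string) {
	// ... write to the database here ...
	for _, h := range userSavedHandlers {
		h(UserSaved{ID: id})
	}
}
```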
I do find event-driven architectures to be the pinnacle of resilience and fail-tolerance in software. In fact, that’s pretty much how the real world works: for every action, there’s a reaction.
In synchronous communication, we have an orchestration of interactions between services. The orchestrator needs to know every single service and the order in which it should be called. High coupling: consumers and producers of information constantly need to know about each other. Adding or removing services, or upgrading their interfaces, is hard, so the owning teams need to constantly sync with each other to know which breaking changes to prepare for. On the other hand, it guarantees data is updated right after the action is executed.
In asynchronous communication, we have a choreography of services reacting to each other’s events. There’s no orchestrator, no single point of failure (theoretically; realistically there’s the event broker). Adding or removing services can be done at any time, without disrupting other services. Data will eventually be consistent, but there’s no guarantee when exactly that’ll be.
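A compressed, made-up comparison of what that difference looks like in code:

```go
package orders

type Order struct{ ID string }

// Stand-ins, all hypothetical, just to contrast the two shapes.
var (
	chargePayment = func(Order) error { return nil }
	reserveStock  = func(Order) error { return nil }
	sendReceipt   = func(Order) error { return nil }
	publishEvent  = func(topic string, o Order) error { return nil }
)

// Orchestration: one place knows every collaborator and the exact order.
func placeOrderOrchestrated(o Order) error {
	if err := chargePayment(o); err != nil {
		return err
	}
	if err := reserveStock(o); err != nil {
		return err
	}
	return sendReceipt(o)
}

// Choreography: we only publish the fact that the order was placed;
// payments, warehouse, and notifications each react to it on their own.
func placeOrderChoreographed(o Order) error {
	return publishEvent("order.placed", o)
}
```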
And here’s the kicker: complex systems will most probably use a mix of both. And it’s very healthy that that happens. We have processes that can be eventually consistent, and asynchronous, but there are other systems that need to do things now and need to do them well, or everything must fail. And even on success, there need to be idempotency guarantees.
When implementing a queueing system, it’s easy to forget that the async benefits only apply after the jobs are dispatched to the queues. The dispatching, itself, is almost always a sync process. If that fails, our systems might return errors to the users. And there are other ways of producing errors besides bugs in the code.
This was a lesson recently learned the hard way. When everything’s working well, we tend to almost forget we have this additional infrastructure that supports some workloads that we don’t want to – and probably don’t need to – make our users wait to be fully executed. When we implement a queueing system we subconsciously assume that the two planes of operation (sync and async; the request and the processing of that request, respectively) are fully independent and that exceptions and problems in one don’t affect the other.
And, to some extent, we’re right!
A basic queue system design
If we research how to implement a queue system, we’ll get to an architecture something like the following:
Don’t read too much into its simplicity – it’s intentional. But it does capture what we might describe as a valid queueing system: we have our Web Application producing (or queueing) jobs in our Queue Cluster, and then Queue Workers (long-lived processes) continuously poll for new jobs, reserve them, and process them.
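For reference, a stripped-down version of that worker loop in Go, with a made-up queue client interface standing in for the real Queue Cluster client:

```go
package worker

import (
	"log"
	"time"
)

// Queue is a stand-in for the Queue Cluster client (SQS, RabbitMQ, Redis, ...);
// the interface here is made up for illustration.
type Queue interface {
	Reserve() (job []byte, ack func(), ok bool)
}

// Run is the long-lived Queue Worker: poll, reserve, process, acknowledge, repeat.
func Run(q Queue) {
	for {
		job, ack, ok := q.Reserve()
		if !ok {
			time.Sleep(time.Second) // nothing queued; poll again shortly
			continue
		}
		if err := process(job); err != nil {
			log.Printf("job failed, leaving it for a retry: %v", err)
			continue // not acked, so the queue can redeliver it
		}
		ack()
	}
}

func process(job []byte) error { return nil } // the actual business work
```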
Nothing exotic; nothing magically complex. So what’s this blog post about?
How a queue system quickly becomes a single point of failure
We can’t blindly consider a queueing system as something secondary, unimportant, and not critical to our system’s architecture. Here’s what I mean: if we look at Fig. 1, above, we can say that everything north of the Queue Cluster is a synchronous flow and everything south of it is asynchronous. We tend to instinctively assume a queueing system is totally asynchronous and fail-tolerant but that will lead to possible headaches and unavailability of part of our system.
You see, the synchronous part can still fail for any number of reasons, and when that happens we’ll probably lose the job that should have been dispatched to the queue unless we take proactive actions.
If we look closely at the aforementioned architecture, we can note a few problems:
The dispatching of the jobs relies on the operation succeeding 100% of the time and on no problem ever occurring in the communication between services (between the Web Application and the Queue Cluster).
Relying on a single cluster to persist data from (possibly) multiple application domains/services.
If any of those problems happen, we’ll probably lose any reference to the jobs that weren’t dispatched to the queue, and we won’t have a way to retry that operation.
How to improve resilience and have a fail-tolerant dispatching logic
So, what can we do to prevent the first issue above and handle failures gracefully, with the option to eventually retry them once the error is fixed? Well, here’s a possible solution:
We can introduce a transactional outbox pattern. What this means, putting it simply, is that a copy of the job is saved in a log before the dispatch itself.
This, in fact, will allow us to re-dispatch jobs whenever needed. If retrying failed dispatches is all we need, we can remove this record after the dispatch is successfully done, or simply only write to this log on dispatch failures. The Queue Cluster can be unavailable for as long as necessary, and once it comes back online, the dispatching of all the jobs can be retried.
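Here’s a rough sketch of that idea in Go, with hypothetical table names: the job copy is written to an outbox table inside the same database transaction as the business change, and the actual dispatch becomes a best-effort step that can always be retried later:

```go
package outbox

import (
	"context"
	"database/sql"
)

// EnqueueWithOutbox persists the business change and a copy of the job atomically,
// then tries to dispatch. Table and column names are hypothetical.
func EnqueueWithOutbox(ctx context.Context, db *sql.DB, orderID string, payload []byte) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	if _, err := tx.ExecContext(ctx,
		`UPDATE orders SET status = 'paid' WHERE id = $1`, orderID); err != nil {
		return err
	}

	// The job copy is saved in the same transaction as the business change.
	var outboxID int64
	if err := tx.QueryRowContext(ctx,
		`INSERT INTO outbox (payload, dispatched) VALUES ($1, false) RETURNING id`,
		payload).Scan(&outboxID); err != nil {
		return err
	}
	if err := tx.Commit(); err != nil {
		return err
	}

	// Best-effort dispatch; if it fails, a background process can re-read the
	// rows where dispatched is still false and retry once the Queue Cluster
	// is back online.
	if err := dispatchToQueue(ctx, payload); err == nil {
		_, _ = db.ExecContext(ctx,
			`UPDATE outbox SET dispatched = true WHERE id = $1`, outboxID)
	}
	return nil
}

func dispatchToQueue(ctx context.Context, payload []byte) error { return nil }
```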
Now, to the next problem.
Minimise a domain’s cluster-specific problems’ impact on other domains
With the dispatching failures being gracefully handled, we can turn to the problem of depending on a single Cluster. If it goes down – for maintenance or because it’s filled up, to name a few possibilities – it’ll take the entire async part of our architecture down, impacting all the consumers of that Cluster.
A possible solution for guaranteeing that the blast radius of the unavailability of a Cluster doesn’t propagate to all domains/services is as follows:
Introducing a Queue Cluster per domain/service, each one independent from the others, limits the impact of a failure to the owning domain/service. So if one Cluster is down for any reason, the others will continue to operate normally.
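In practice this can be as simple as routing each job to its domain’s own cluster; the mapping below is, of course, hypothetical:

```go
package dispatch

import "errors"

// clusters maps each domain to its own Queue Cluster; names/URLs are made up.
var clusters = map[string]string{
	"billing":  "amqp://queue-billing.internal",
	"shipping": "amqp://queue-shipping.internal",
	"emails":   "amqp://queue-emails.internal",
}

// clusterFor picks the cluster owned by the job's domain, so an outage in
// one cluster only affects that domain's jobs.
func clusterFor(domain string) (string, error) {
	url, ok := clusters[domain]
	if !ok {
		return "", errors.New("unknown domain: " + domain)
	}
	return url, nil
}
```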
This brings, of course, other consequences: increased infrastructure to manage and increased costs. As with everything in life, use this wisely and in proportion to the criticality of your architecture.
In conclusion
This is a cautionary tale. I was recently reminded the hard way that a queueing system doesn’t magically detach itself from the application it belongs to and can, actually, bring it to its knees. Adding redundancy and some fallback strategies prevents it from losing data when things aren’t working as expected.
At least, now, we’ve learned a few ways to make queues more resilient, robust, and trustworthy.