A microservice journey - part 2: what type of microservice are you?

Once we had established that we were on the microservice path, the question came up of just how that would look.
We had established the problems and understood what we were trying to achieve, but what exactly is a microservice, and how do you go about building one?

The importance of the first step cannot be overstated. If everyone understands the problem and really gets what we are trying to achieve, then microservices write themselves. Well, not really, but once we understood what we were trying to achieve, the solution options could be assessed against a set of questions that helped drive the decisions being made:

Can this service be deployed on its own, without affecting other parts of prod?
If we had an influx of users using this feature, could we scale it out independently of the other components?
If we did need to scale it out, would we need to scale out any other components as well?
What happens if we have a network issue? Would it still be able to do its job?
If this service did go down for a period, what effect would it have on the rest of production?
Would we be able to recover when it does come back up?

Each time we started thinking about a solution, we would look at each of the 5 million solution options available and ask those simple questions to determine whether it could meet our high-level objectives. Actually, it was never that easy. In fact, everything was always very hard: everyone has different opinions and shares different facts or inferences, which makes the options hard to navigate and hard to choose between. But it was also rewarding and sometimes eye-opening. And in the beginning, because of all the new concepts and counter-intuitive patterns, it was hard to always come to the right solution.

Luckily, we have the internet, and so many people have already trodden this path and have a lot to say about it. Initially, it was very hard to digest exactly what type of microservices people were building and what type we should be building here. But at the end of the day, we used the goals we specified earlier to help guide our decision.

We landed on eventually consistent microservices, based on an event-driven architecture.

So basically, each system would emit an event (an event being anything significant that happened) and put any data it had available in the event payload.
The downstream consumers could choose what they would like to do with the information, but usually they would store the data somewhere and act on it. These microservices would be built around a domain, and start out as just a single code base doing 1 or 2 things.
Once we had published all the important business events, any new microservice could pick the events to subscribe to, build its own data model and away it went, without needing to bother any other team in the process. This ~~eliminated~~ reduced the dependencies between the teams, which meant we could build more stuff with more teams at the same time, with little to no dependencies between squads.
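
As a rough illustration of that flow, here is a minimal in-memory sketch in Python. The `Event`, `InMemoryBus` and `ShippingService` names, the `order_placed` event and its fields are all made up for the example, and a dictionary of handlers stands in for a real Pub/Sub product.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Event:
    """Something significant that happened, with the available data attached."""
    name: str
    payload: Dict[str, Any]


class InMemoryBus:
    """Toy stand-in for a Pub/Sub product (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Callable[[Event], None]) -> None:
        self._handlers[event_name].append(handler)

    def publish(self, event: Event) -> None:
        # Every subscriber receives the full event, payload included.
        for handler in self._handlers[event.name]:
            handler(event)


class ShippingService:
    """A downstream consumer that builds its own data model from events."""

    def __init__(self, bus: InMemoryBus) -> None:
        self.addresses: Dict[str, str] = {}
        bus.subscribe("order_placed", self.on_order_placed)

    def on_order_placed(self, event: Event) -> None:
        # Keep only the fields this domain cares about.
        self.addresses[event.payload["order_id"]] = event.payload["address"]


bus = InMemoryBus()
shipping = ShippingService(bus)
bus.publish(Event("order_placed", {"order_id": "o-1", "address": "1 Example St", "total": 42.0}))
print(shipping.addresses)
```

The publisher never learns who is listening, and the shipping consumer never has to ask the order squad for anything: it simply keeps the slice of the payload it needs.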

We never set out to define all the business objects, and API endpoints available, or the size or scope of the microservice, or even a high-level view of all the microservices. We never set out to define all the events we would emit or the payload they would contain.
I have previously been involved in this type of discussion, and the term analysis paralysis immediately comes to mind. By the time you finish the masterpiece the world has changed, and you have to do it all over again. It's kind of the opposite of what we are trying to achieve in this fast, flexible, adaptable environment.
It also gives you an out if things start to go pear-shaped. By incrementally designing and building the system, we can all learn and grow the concepts, ideas and patterns we use for the next iteration, or the next bit of work. Another critical aspect is evolutionary architecture. There is a good book on it: https://www.thoughtworks.com/books/building-evolutionary-architectures

We all agreed that we would go down the microservice path, and that we would figure it out along the way. This is why it was so critical to get buy-in, and for everyone to understand the core architecture principles and patterns we chose to implement. Delegating the design choices, even at the domain level, gave developers much greater freedom in choosing which way to go and what the scope of the domain would be, and it also instilled ownership of the domain right from the beginning.

To build a complex system, you need really good people. To keep good people, you need to give them enough freedom to be creative and come up with designs and solutions. And surprise, surprise, good people also come up with great, interesting designs.

Eventual Consistency

We had some long discussions around this point. Eventual consistency is a very hard concept to accept. But if you have ever read and understood the CAP theorem [https://en.wikipedia.org/wiki/CAP_theorem], you would know that you can only ever get 2 of the 3 guarantees (consistency, availability and partition tolerance). In a monolithic world you don't have to deal with networks as much, so you can usually guarantee consistency and availability. But whenever there is a network partition, you need to choose between consistency and availability. Since microservices are, at heart, introducing networks between everything, your only choices are:

Choose consistency, and get a system which is well separated but goes down all the time (but is consistent), or
Choose availability, and make sure the system becomes eventually consistent.

So the question is really: do you want the customer to wait and click retry when there is a fault? Or do you accept the request, carry on, and allow the downstream systems to eventually take up the change and act on it?
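
To make the trade-off a bit more concrete, here is a small Python sketch of the two choices. Everything in it is hypothetical: the `Inventory` class plays a downstream system that becomes unreachable during a partition, and the function names are made up for the example.

```python
from collections import deque
from typing import Deque, List


class Inventory:
    """Hypothetical downstream system; unreachable during a network partition."""

    def __init__(self) -> None:
        self.reachable = True
        self.reserved: List[str] = []

    def reserve(self, order_id: str) -> None:
        if not self.reachable:
            raise ConnectionError("inventory unreachable")
        self.reserved.append(order_id)


def place_order_consistent(order_id: str, inventory: Inventory, orders: List[str]) -> None:
    """Consistency first: if inventory is down, the customer gets an error."""
    inventory.reserve(order_id)  # raises during a partition
    orders.append(order_id)


def place_order_available(order_id: str, orders: List[str], pending: Deque[str]) -> None:
    """Availability first: accept now and let downstream catch up later."""
    orders.append(order_id)
    pending.append(order_id)


def drain(pending: Deque[str], inventory: Inventory) -> None:
    """Deliver queued changes once the downstream system is reachable again."""
    while pending and inventory.reachable:
        inventory.reserve(pending[0])
        pending.popleft()  # only dropped from the queue after a successful delivery


inventory = Inventory()
inventory.reachable = False  # simulate the partition
orders: List[str] = []
pending: Deque[str] = deque()

try:
    place_order_consistent("o-1", inventory, orders)
    consistent_ok = True
except ConnectionError:
    consistent_ok = False  # the customer has to click retry

place_order_available("o-2", orders, pending)  # the customer carries on
drain(pending, inventory)                      # still partitioned, nothing delivered
inventory.reachable = True
drain(pending, inventory)                      # downstream eventually catches up
```

In the second variant the order is accepted straight away, and the inventory only learns about it once the network recovers, which is exactly the window in which the system is inconsistent.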

Event Driven

We had some long discussions around the event-driven aspect, and whether we should just publish the event and make the consumer call an API to retrieve the payload, or put the payload on the event.

After some long hours on the whiteboard, we chose to keep the model of putting the data on the event. We were building a distributed system, and we were using Pub/Sub, which was naturally distributing the data downstream to each subscriber's queue.
Traditionally, events emit just the ID, and the consumer knows to call the related API to get the data needed to make the event meaningful. But this would mean that when an order is placed, 1 consumer would call the API to get the data; then, as we sprawl out, 100 consumers could be calling the API at exactly the same time to get the data. 1000 consumers? That's a lot of API calls, very bursty, all at exactly the same time, and it also introduces some nasty dependencies.
Firstly, the Pub/Sub tool needs to be up and running for the consumer to continue. Then n APIs would also need to be up to get the data required for the k events.
Secondly, this would introduce inter-team dependencies, so again we wait. If the order squad didn't have an API ready, we wait. I thought the whole point of this solution was to eliminate waiting.
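
The difference in load can be shown with a toy Python comparison. The `OrderApi` class and both delivery functions are hypothetical and exist only to count the callbacks each style generates for a single order event.

```python
from typing import Any, Dict, List


class OrderApi:
    """Hypothetical API owned by the order squad; counts how often it is hit."""

    def __init__(self, orders: Dict[str, Dict[str, Any]]) -> None:
        self._orders = orders
        self.calls = 0

    def get(self, order_id: str) -> Dict[str, Any]:
        self.calls += 1
        return self._orders[order_id]


def deliver_id_only(order_id: str, api: OrderApi, consumers: int) -> List[Dict[str, Any]]:
    """ID on the event: every consumer has to call back to the API."""
    return [api.get(order_id) for _ in range(consumers)]


def deliver_with_payload(order: Dict[str, Any], consumers: int) -> List[Dict[str, Any]]:
    """Payload on the event: every consumer gets its own copy, no callback."""
    return [dict(order) for _ in range(consumers)]


order = {"order_id": "o-1", "total": 42.0}
api = OrderApi({"o-1": order})

deliver_id_only("o-1", api, consumers=100)
calls_id_only = api.calls

deliver_with_payload(order, consumers=100)
calls_with_payload = api.calls - calls_id_only

print(calls_id_only, calls_with_payload)  # prints "100 0"
```

With the ID-only style the number of callbacks grows with the number of consumers, and every one of them fails if the API is down. With the payload on the event, the order API is not involved at all after publishing.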

Reliability

Given the above two aspects of the solution, reliability was a real concern. If we are going to be eventually consistent, we need to be 100% eventually consistent. 1 or 2 drop-offs were not an acceptable outcome. Again, there were some long, hard discussions to try to get to a place where we could trust that eventual consistency was a guarantee. It took a few incidents to realise that if we lose a message, bad things happen. A solution we are considering is to break down the message lifecycle, ensure the data is stored somewhere reliably, monitor the *%$%# out of it, and have a way to replay any missing events in a timely manner.

Basically, the publisher will write a log to disk or a database (using the outbox pattern) and the consumer will write a log that it received the message. Our central logging ingestion takes all those log entries and makes sure that each message is accounted for, and if one is not, it can be replayed to all parties (rebroadcast the publish) or to a subset of subscribers (targeted publish).
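
A minimal Python sketch of that idea, under some loud assumptions: SQLite stands in for the publisher's database, the table names and message ID format are invented, and the consumer receipt logs are plain sets rather than a real logging pipeline. It shows the outbox write and the reconciliation step only, not the actual publishing or replay.

```python
import json
import sqlite3
from typing import Dict, List, Set


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
    conn.execute("CREATE TABLE outbox (message_id TEXT PRIMARY KEY, body TEXT)")


def place_order(conn: sqlite3.Connection, order_id: str, total: float) -> None:
    """Outbox pattern: the order and its outgoing event commit together."""
    body = json.dumps({"event": "order_placed", "order_id": order_id, "total": total})
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (message_id, body) VALUES (?, ?)",
            (f"order_placed:{order_id}", body),
        )


def find_missing(conn: sqlite3.Connection, receipts: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Compare the publisher's log with each subscriber's receipt log."""
    published = {row[0] for row in conn.execute("SELECT message_id FROM outbox")}
    missing: Dict[str, List[str]] = {}
    for subscriber, seen in receipts.items():
        gap = sorted(published - seen)
        if gap:
            missing[subscriber] = gap  # candidates for a targeted replay
    return missing


conn = sqlite3.connect(":memory:")
create_schema(conn)
place_order(conn, "o-1", 42.0)
place_order(conn, "o-2", 10.0)

# Hypothetical receipt logs: billing never logged the second message.
receipts = {
    "shipping": {"order_placed:o-1", "order_placed:o-2"},
    "billing": {"order_placed:o-1"},
}
missing = find_missing(conn, receipts)
print(missing)  # {'billing': ['order_placed:o-2']}
```

Because the outbox row is written in the same transaction as the order, there is always a durable record of what should have been published, and the reconciliation result says which subscribers need a targeted replay.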

We live on cloud, so reliability is something that needs special attention. I will write up the reliability aspects in more detail in a later post.
