A microservice journey - part 4: Reliability

November 11, 2019

We live in a world that has come to expect everything to just work, all the time.
At the same time, we have started to move to the cloud.

The cloud is a completely different kind of beast to self-managed infrastructure. Self-managed means control over every aspect of the infrastructure, including total control over when hardware gets replaced.
In the cloud, a separate organisation manages the infrastructure, and it does so for many more clients than just you. If a piece of hardware needs changing, they just take it out. Imagine if they had to warn everyone and wait until every client was ready before replacing the hardware. It would never happen!

A Shift in thinking

So we live in a world where people expect everything to work all the time, yet we are moving to a platform that can be pulled out from under us without notice. That seems counterintuitive. However, this is exactly the catalyst needed to ensure our applications are purposefully built with resilience in mind right from the get-go. Even with this understanding in place, solutions still get deployed with little or no thought for resilience. We are used to things being more stable, more inherently resilient.

Observability

The first step to being resilient is to know exactly what is happening in your system.
Along with making things resilient, you also need enough information about what is going on in your solution to understand a problem and fix it. Observability is a critical aspect of any solution, and every single piece needs meticulous thought about which exception scenarios could happen and what each would look like in your monitoring tool. You cannot find every problem ahead of time, but the more you find before you release to production, the fewer times you will be called out of bed at 3am to fix that broken something.
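To make "what would that look like in our monitoring tool" concrete, here is a minimal sketch of emitting one structured log line per outcome, so that an exception scenario shows up as a searchable event rather than a free-text message. The service name "orders", the charge function, and the event and field names are all hypothetical, made up purely for illustration.

```python
import json
import logging


def make_logger(stream):
    """Build a logger that writes one JSON document per line to `stream`."""
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_event(logger, level, event, **fields):
    """Emit a structured event that a monitoring tool can filter and alert on."""
    logger.log(level, json.dumps({"event": event, **fields}, sort_keys=True))


def charge(logger, order_id, amount):
    """Hypothetical business operation with one anticipated failure scenario."""
    try:
        if amount <= 0:
            raise ValueError("amount must be positive")
        log_event(logger, logging.INFO, "charge_succeeded", order_id=order_id)
        return True
    except ValueError as exc:
        # The failure is named and carries context, so an alert can key on it.
        log_event(logger, logging.ERROR, "charge_failed",
                  order_id=order_id, reason=str(exc))
        return False
```

The point of the design is that each failure you thought about up front gets its own event name. An alert on "charge_failed" is far easier to set up and reason about at 3am than a search through unstructured text.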

Recoverability

When something does go down, how long does it take to restore? Will the alternate stack work? Did you test it?

A common pattern companies try to employ is to have multiple data centres and run Active/Active. Google and AWS offer availability zones. Essentially this is Active/Active: they both offer three data centres in a region. Unfortunately, at the time of writing, Azure does not support availability zones. Azure offers multiple regions only, and you have to run Active/Active yourself.

But a simple question to ask is, when something goes down, will the system keep running? Does anybody need to do any manual steps to restore services?

If you build a microservices system, manual steps are the enemy. The more microservices, the more manual steps, the longer it takes to restore service.

Auto healing

A good way to make services resilient is to have them heal themselves automatically. What does this mean exactly? A service doesn't necessarily need to restart itself or fix bugs automatically. But because we rely on fragile hardware, a self-healing system needs to be able to recover from component outages. Everything needs a backup plan, and a great way to achieve this is to run everything on a pub/sub backbone. The benefit of this model is that messages are always on a queue, so if a component is temporarily down, the messages bank up and processing continues automatically when the service is restored. Kubernetes also adds value here, as it natively watches for unhealthy pods and creates new instances to replace services that are in trouble.
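The "messages bank up and continue" behaviour can be sketched with a toy in-memory queue. This is only an illustration of the idea, not a real broker: the Broker and Consumer classes, the "order-N" messages, and the use of ConnectionError to signal an outage are all assumptions made for the example. A real pub/sub system would add durability, acknowledgements over the network, and redelivery policies.

```python
from collections import deque


class Broker:
    """Toy in-memory stand-in for a durable pub/sub queue (hypothetical)."""

    def __init__(self):
        self._queue = deque()

    def publish(self, message):
        # Publishing never depends on the consumer being up.
        self._queue.append(message)

    def backlog(self):
        return len(self._queue)

    def drain(self, consumer):
        """Deliver queued messages in order; stop at the first failure."""
        delivered = 0
        while self._queue:
            message = self._queue[0]       # peek, do not remove yet
            try:
                consumer.handle(message)
            except ConnectionError:
                break                      # leave the message queued for later
            self._queue.popleft()          # acknowledge only after success
            delivered += 1
        return delivered


class Consumer:
    """Hypothetical downstream service that can be switched off."""

    def __init__(self):
        self.healthy = True
        self.processed = []

    def handle(self, message):
        if not self.healthy:
            raise ConnectionError("consumer is down")
        self.processed.append(message)


def demo():
    broker, consumer = Broker(), Consumer()
    consumer.healthy = False               # simulate an outage
    for i in range(3):
        broker.publish(f"order-{i}")
    stalled = broker.drain(consumer)       # nothing delivered, nothing lost
    backlog_during_outage = broker.backlog()
    consumer.healthy = True                # service restored
    recovered = broker.drain(consumer)
    return stalled, backlog_during_outage, recovered, consumer.processed
```

Note that a message is only removed from the queue after the consumer has handled it successfully. That ordering is what lets the system pick up where it left off without anyone performing a manual step.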
