Building applications for the cloud

Designing and building a cloud application offers a significant set of advantages but requires different strategies than building on-premises applications.

In traditional application development, the focus is on preventing the system from failing. In technical terms, that means increasing the Mean Time Between Failures (MTBF). Applications built for the cloud require a different mindset. One must expect failures to happen every now and then and focus on how to recover from them. The focus should shift to reducing the Mean Time To Recovery (MTTR).
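To make the trade-off concrete, here is a minimal sketch using the commonly cited steady-state approximation availability = MTBF / (MTBF + MTTR). The figures (a failure every 500 hours, 5 hours to recover) are hypothetical and only serve as an illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Approximate fraction of time a system is up, given MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: one failure every 500 hours, 5 hours to recover.
baseline = availability(500, 5)

# Option A: fail half as often (double the MTBF).
longer_mtbf = availability(1000, 5)

# Option B: recover ten times faster (MTTR cut to 30 minutes).
shorter_mttr = availability(500, 0.5)

for label, value in [
    ("baseline", baseline),
    ("longer MTBF", longer_mtbf),
    ("shorter MTTR", shorter_mttr),
]:
    print(f"{label}: {value:.4%}")
```

Under these assumed numbers, shortening recovery time improves availability more than doubling the time between failures, which is the intuition behind shifting the focus to MTTR.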

The background

Distributed systems are complex by nature. They connect a broad set of components that all interact to form one solution. A failure in one component can trigger a cascade of issues. Some of these components might even be external services (e.g. payment services). These will most likely not be under your control, and they might be temporarily unavailable.
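One common way to cope with a temporarily unavailable external service is to retry the call with exponential backoff. The article does not prescribe this technique; the sketch below is only an illustration. The function `call_with_retry` and its parameters are hypothetical names, and it assumes the service signals an outage by raising `ConnectionError`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on ConnectionError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            # Give up after the last attempt and let the caller handle it.
            if attempt == attempts - 1:
                raise
            # Double the wait each time, plus a little jitter so that many
            # clients do not all retry at the same moment.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))
```

A caller would wrap the external call, for example `call_with_retry(lambda: charge_customer(order))`, where `charge_customer` stands in for whatever payment client you use. Retries only help with short outages; a longer outage still needs a fallback or a clear error to the user.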

Mitigation strategies

Some resilience features are built into the cloud services you will be using. Other precautions need to be added. Two of those are crucial: monitoring and automation are key capabilities for a cloud workload. By monitoring a workload, you can trigger notifications and track failures. Automation, in turn, should enable recovery processes that work around or repair the failure without manual intervention.
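The interplay between monitoring and automation can be sketched as a single check-and-recover pass. This is a simplified illustration, not a real monitoring product: `Workload`, `check_and_recover` and the `notify` callback are hypothetical names, and the recovery action is assumed to be a plain restart.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Workload:
    name: str
    is_healthy: Callable[[], bool]  # monitoring: a health probe
    restart: Callable[[], None]     # automation: a recovery action


def check_and_recover(workload: Workload, notify: Callable[[str], None]) -> bool:
    """Run one monitoring pass; return True if the workload ends up healthy."""
    if workload.is_healthy():
        return True
    # Monitoring detected a failure: notify so the failure is tracked.
    notify(f"{workload.name}: health check failed, attempting restart")
    # Automation: try to repair the failure without human intervention.
    workload.restart()
    if workload.is_healthy():
        notify(f"{workload.name}: recovered after restart")
        return True
    # Automated recovery did not help, so a human needs to look at it.
    notify(f"{workload.name}: still unhealthy, escalating")
    return False
```

In practice a scheduler or the platform's own health-check mechanism would run such a pass periodically, and `notify` would post to whatever alerting channel the environment uses.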

Monitoring of a cloud environment is not something you want to set up workload by workload. Of course, every workload should be monitored, but you need a consistent monitoring setup across workloads. It is not just monitoring of “an application”; it is monitoring of the whole environment. You need a foundation that covers your specific monitoring requirements.

Resilience as a function of cost

The technical design considerations are only one part of the story. Probably more important is the conversation with the business about availability requirements. Users today expect 24/7 availability. There’s an implicit expectation that by moving workloads to the cloud, they will get 24/7 availability out of the box. It’s crucial to have that conversation during the initial phase, before starting to transition a new workload. How much downtime is acceptable? How much would this downtime cost the business? How much is the business willing to invest to make its application highly available?
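A quick calculation can help frame that conversation by translating an availability target into hours of downtime per year and an estimated cost. The sketch below assumes a 365-day year, and the cost per hour of downtime is a made-up figure; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per year that an availability target still permits."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def downtime_cost(availability_pct: float, cost_per_hour: float) -> float:
    """Yearly cost if the full downtime budget for the target is used up."""
    return allowed_downtime_hours(availability_pct) * cost_per_hour


# Hypothetical business figure: one hour of downtime costs 10,000.
COST_PER_HOUR = 10_000

for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    cost = downtime_cost(target, COST_PER_HOUR)
    print(f"{target}% -> {hours:.2f} h/year, about {cost:,.0f} per year")
```

A 99% target allows roughly 87.6 hours of downtime per year, 99.9% roughly 8.76 hours, and 99.99% under an hour. Comparing the cost of that downtime with the investment needed to reach the next level gives the business a basis for deciding how much availability it actually needs.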

Summary

Designing and building a cloud application offers a significant set of advantages but requires different strategies than building on-premises applications. It calls for a mindset shift from Mean Time Between Failures to Mean Time To Recovery. This shift has both technical and business implications that need to be properly addressed.