Maximize Velocity by Minimizing Risk
Over the past 10 months, my team at Work & Co and I have been designing and building an ad tech platform. Our goal was to rapidly and iteratively develop a high-quality, scalable platform while minimizing the client’s ongoing operational costs and complexity. To accomplish this, we took the following approach.
Use AWS for everything it’s worth.
If it’s good enough for Netflix, it’s good enough for our client. At the end of the day, there are three major cloud service providers, and AWS currently has the most expansive offering. By embracing AWS, we speed up development and minimize operational costs, since many of the services we use are managed by AWS itself. And by using Terraform to manage infrastructure configuration, we leave open the option to adopt additional cloud providers on a tactical basis.
Go event-driven and serverless wherever possible.
Lambda functions enable you to build services quickly without having to worry about capacity planning. Making services event-driven and focused on accomplishing a single task makes it easy to reason about the behavior of each service. And, over time, you can safely add or remove services in isolation from each other, reducing the operational risk of a change introducing faults into the larger system. Both of these benefits make incremental, iterative development cheaper, faster, and easier to test.
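For illustration, here is a minimal sketch of such a single-purpose service: a Python Lambda handler triggered by an SQS queue. The event shape is the standard one SQS delivers to Lambda; the function and field names are hypothetical.

```python
import json

def handler(event, context):
    """Single-purpose Lambda: parse each SQS record and hand it to
    one piece of domain logic. Field names are illustrative."""
    for record in event["Records"]:       # SQS delivers records in batches
        impression = json.loads(record["body"])
        record_impression(impression)     # the one task this service owns

def record_impression(impression):
    # Placeholder for this service's single responsibility,
    # e.g. normalizing and forwarding an ad impression event.
    print(f"impression from campaign {impression['campaign_id']}")
```

Because the service does exactly one thing in response to one event type, it can be added, replaced, or removed without reasoning about the rest of the system.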
Embrace asynchronicity and eventual consistency.
Real-time is brittle and expensive. You have to plan for more edge cases and invest in explicit (and often tricky) error-handling logic that is also challenging to test effectively. Decoupling components of the system, on the other hand, makes development, testing, and scaling easier. Always identify which parts of the system need to be real-time and which can be handled asynchronously. The closer a component sits to the end user (user interfaces, APIs, and the like), the more real-time behavior it tends to require; take special care thinking through the failure modes there, as they will be the faults most visible to your end users.
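As a sketch of what this decoupling can look like, the snippet below accepts a request and enqueues the work instead of performing it in the request path. It assumes boto3; the queue URL and names are hypothetical.

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/reporting-jobs"  # hypothetical

def request_report(report_params):
    """Rather than generating the report synchronously (real-time,
    brittle), enqueue the work for a decoupled consumer to pick up.
    The caller only needs an acknowledgment, not the result."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(report_params),
    )
    return {"status": "accepted"}  # eventual consistency: the work happens later
```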
Favor static files wherever possible.
Save everything. Storage is cheap. Files are fast, easy to cache, and easy to version. They introduce a natural buffer against upstream failures, and old files don’t break. Files are also easy to test, often with a simple diff.
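A minimal sketch of this, assuming boto3 and a hypothetical versioned S3 bucket: publish each report as an immutable file, and verify outputs with a byte-for-byte comparison against a known-good file.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-ad-reports"  # hypothetical bucket with versioning enabled

def publish_report(day, body):
    # Save everything: one immutable file per day. Cheap to store,
    # easy to cache behind a CDN, and versioned by S3 itself.
    s3.put_object(Bucket=BUCKET, Key=f"reports/{day}.csv", Body=body)

def test_report_matches_golden(generated, golden_path):
    # Files are easy to test, often with a simple diff.
    with open(golden_path, "rb") as f:
        assert generated == f.read()
```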
Automate everything.
Configuring AWS is intricate work. If you’re going to build anything at scale with a team, you absolutely need Terraform or CloudFormation to manage changes and automate their application. These tools also provide an excellent form of functional documentation that enables you to recreate any part of the system at will.
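As a small Terraform sketch of what that functional documentation looks like (resource and bucket names are illustrative; the separate versioning resource assumes AWS provider v4 or later):

```hcl
# Hypothetical example: the bucket backing the static report files,
# declared once and recreatable at will.
resource "aws_s3_bucket" "reports" {
  bucket = "example-ad-reports"
}

resource "aws_s3_bucket_versioning" "reports" {
  bucket = aws_s3_bucket.reports.id
  versioning_configuration {
    status = "Enabled"
  }
}
```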
Deploy constantly.
Every push to master should be expected to go live at any time. TATFT (test all the time), and automate failure detection. This demands a level of discipline from the development team that may be uncomfortable for many organizations, but, as with chaos engineering practices, the automation forces you to address defects immediately. It also makes it easier to recover when a defect inevitably enters the system.
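One minimal sketch of automated failure detection: a post-deploy smoke test the pipeline runs against the live system, exiting non-zero so the deploy can be halted or rolled back. The endpoint is hypothetical.

```python
import sys
import urllib.request

HEALTH_URL = "https://api.example.com/health"  # hypothetical endpoint

def main():
    # Every push to master can go live, so verify the live system
    # immediately after deploy and fail loudly if anything is off.
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            assert resp.status == 200
    except Exception as exc:
        print(f"smoke test failed: {exc}", file=sys.stderr)
        sys.exit(1)  # non-zero exit halts the pipeline or triggers rollback

if __name__ == "__main__":
    main()
```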
Failure is a constant, but all failures are temporary.
You should be able to absorb most temporary failures by appropriately decoupling your systems. As with everything, this is a defense-in-depth strategy. Think carefully about the failure modes of each service and their potential impact on the larger system. Assume things will fail constantly, and make sure you’re introducing buffers so that system components can recover automatically from temporary service failures.
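For example, a queue-backed consumer gets this recovery behavior almost for free: if processing fails, the message simply isn’t deleted, and SQS redelivers it after the visibility timeout. A downstream outage delays work instead of losing it. A sketch, with a hypothetical queue:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/impressions"  # hypothetical

def consume_forever():
    """The queue is the buffer between this service and its upstream."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            try:
                process(json.loads(msg["Body"]))
            except Exception:
                continue  # leave the message queued; SQS redelivers it later
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

def process(payload):
    ...  # single-purpose domain logic
```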
Optimize for mean time to recovery.
The best code is no code.
Program in the small. Program with infrastructure. Use languages that are expressive and compact.
Avoid stasis.