When I first started consulting with NuID almost two years ago, one of our objectives was to bake DevOps into the culture. Given our choice to rely heavily on serverless to keep initial costs down, it was essential for the team to be much more familiar with AWS and operations than is typical for developers working exclusively in their language of choice. Instead of building services that run on top of a shared infrastructure, NuID’s AWS resources are inseparable from the code utilizing them. This has significant implications for how NuID delivers software.
So, what is a service?
A service is a collection of AWS resources and code that fulfills a desired function. It has a well-defined interface that other services and users can interact with. And it is something we want to deploy and test as a unit.
Defining a service in this way makes it easier to evolve the architecture over the long term. If a service provides a consistent interface, it can be rearchitected internally as needed. For example, it could be migrated from (potentially many) Lambda functions to a shared EKS cluster without impacting customers or dependent services.
NuID uses Terraform to manage its infrastructure. Each service has its own Terraform state to maintain isolation and minimize blast radius. Each service also has its own pipeline (with the exception of some shared services that don’t change all that often or should only be changed with special care). Terraform enables NuID to maintain consistency from dev to prod, eliminating “works on my machine” issues and simplifying integration across services.
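To make the idea of per-service state concrete, here is a minimal sketch of how a backend location could be derived for each service. It assumes an S3 backend configured through Terraform's partial backend configuration; the bucket naming scheme, the key layout, and the function names are hypothetical and not taken from NuID's setup.

```python
from typing import Dict, List

# Environment names as used in this article.
ENVIRONMENTS = ("dev", "stage", "prod")


def backend_config(service: str, env: str) -> Dict[str, str]:
    """Return backend settings that give each service its own state object."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    return {
        # Hypothetical naming: one state bucket per environment account.
        "bucket": f"example-terraform-state-{env}",
        # Hypothetical layout: one state key per service.
        "key": f"services/{service}/terraform.tfstate",
    }


def init_command(service: str, env: str) -> List[str]:
    """Build a `terraform init` call that passes the backend settings."""
    args = ["terraform", "init", "-input=false"]
    for name, value in sorted(backend_config(service, env).items()):
        args.append(f"-backend-config={name}={value}")
    return args
```

Because two services never share a state key, a mistake while applying one service cannot touch resources tracked in another service's state.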
We wanted to instill continuous delivery in NuID’s culture from the start. We began with the principle that any code in master can go live in prod at any time. This requires developers to consider the operational impact of anything they push to master, both in terms of the service they’re working on and other dependent services. Following from this principle, code review, testing, and monitoring are essential: no one wants to break the build.
NuID uses the GitHub flow model with the exception that feature branches aren’t deployed into production for testing. Instead, pull requests are reviewed in dev by spinning up a unique instance of the service from the feature branch. This is valuable both for testing and for soliciting feedback from other stakeholders.
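One way to keep branch-based instances from colliding is to derive the instance name from the branch name. The sketch below is an illustration only: the naming scheme, the length limit, and the function name are assumptions, not NuID's actual convention.

```python
import re


def instance_name(service: str, branch: str, max_len: int = 32) -> str:
    """Derive a lowercase, hyphen-separated instance name from a branch name."""
    # Collapse every run of characters outside a-z and 0-9 into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    if not slug:
        raise ValueError("branch name has no usable characters")
    # Truncate to a length most resource names tolerate (assumed limit),
    # then drop any hyphen left dangling by the cut.
    return f"{service}-{slug}"[:max_len].rstrip("-")
```

A name like this can then be fed into resource names or tags so that each pull request gets its own copy of the service in dev.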
Integration testing amongst services happens in stage. The stage and prod environments are kept nearly identical. Excluding questions of scale, if a service works in stage, it will likely work in prod. This consistency simplifies the code and increases confidence in a given release.
Each environment is maintained in a separate AWS account. Developers are only able to deploy directly to dev. Deployment to stage and prod is only possible via a service’s pipeline. This restriction eliminates the risks of environmental drift and “doing it live”.
“I can’t imagine a more ideal approach for the way I want to work. Our decision budget is tight—Clojure, Terraform, AWS. And with our infrastructure as code, every seam is exposed. With this approach, I feel like there isn’t anything we couldn’t build rapidly.”
NuID is primarily a Clojure shop, so enabling REPL-driven development was a priority. Interactive development is one of the keys to a Clojure developer’s productivity, so access to AWS resources from the REPL is essential.
Because the AWS resources are inseparable from NuID’s code, it’s not possible to create a self-contained environment on a developer’s individual machine. Instead, we’ve created shell scripts wrapping Terraform that enable a developer to easily spin up a unique instance of any service in dev. This developer interface to Terraform is shared across all services, keeping the developer experience consistent and making it easier for a small team to maintain multiple services. This tooling also makes it easy for developers to rapidly develop new services.
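NuID's wrappers are shell scripts and are not shown in this post, so the following is only a rough sketch of what such a wrapper might do, written in Python for illustration. It assumes each developer instance maps to a Terraform workspace and that the service's configuration accepts a variable named instance; both are assumptions, as are the function names.

```python
import subprocess
from typing import List


def up_commands(instance: str) -> List[List[str]]:
    """Commands to create or update one developer's instance of a service."""
    return [
        ["terraform", "init", "-input=false"],
        # -or-create is available in Terraform 1.4 and later.
        ["terraform", "workspace", "select", "-or-create", instance],
        # `instance` is a hypothetical input variable of the service.
        ["terraform", "apply", "-input=false", "-var", f"instance={instance}"],
    ]


def down_commands(instance: str) -> List[List[str]]:
    """Commands to tear the same instance down again."""
    return [
        ["terraform", "workspace", "select", instance],
        ["terraform", "destroy", "-input=false", "-var", f"instance={instance}"],
    ]


def run(commands: List[List[str]], service_dir: str, dry_run: bool = True) -> None:
    """Print the commands, and execute them in order unless dry_run is set."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, cwd=service_dir, check=True)
```

Keeping the command construction separate from execution means the same wrapper can be shared by every service: only the service directory and the instance name change.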
NuID’s infrastructure is treated as immutable, so there’s no need to SSH into a particular box to make changes. However, some resources such as private APIs on API Gateway and Elasticsearch clusters are not accessible directly from the internet (with good reason). To facilitate developer access from the REPL, we created an on-demand bastion service for securely proxying to resources within a particular VPC. Bastion instances are locked to a particular developer and are kept running only while in use to minimize attack surface.
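The post does not say how the bastion proxies traffic. As one possible illustration, assuming plain SSH local port forwarding, the sketch below builds the tunnel command a developer might run before connecting from the REPL. The host names, the default user, and the function name are hypothetical.

```python
from typing import List


def tunnel_command(
    bastion_host: str,
    target_host: str,
    target_port: int,
    local_port: int,
    user: str = "ec2-user",  # assumed login user on the bastion
) -> List[str]:
    """Build an ssh command that forwards a local port to a private resource."""
    for port in (target_port, local_port):
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid port: {port}")
    return [
        "ssh",
        "-N",  # do not run a remote command, only forward ports
        "-L", f"{local_port}:{target_host}:{target_port}",
        f"{user}@{bastion_host}",
    ]
```

With the tunnel open, code in the REPL would talk to localhost on the chosen local port, and the bastion would relay the traffic to the private endpoint inside the VPC.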
Because the infrastructure and code are inseparable, NuID’s pipelines plan and apply Terraform. Running the same Terraform in every environment ensures consistency across the environments, minimizing unexpected behavior. It also facilitates integration testing across services. If you’d like to see what this looks like in action, check out my pipeline-example.
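As a rough sketch of the plan-then-apply pattern (not a copy of the pipeline-example), the following builds the Terraform steps a pipeline stage might run for one environment. The per-environment variable file name and the plan file name are assumptions.

```python
from typing import List

# Environment names as used in this article.
ENVIRONMENTS = ("dev", "stage", "prod")


def pipeline_steps(env: str, plan_file: str = "tfplan") -> List[List[str]]:
    """Return the ordered Terraform commands for one pipeline stage."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    return [
        ["terraform", "init", "-input=false"],
        # Save the plan so the apply step executes exactly what was planned.
        # `<env>.tfvars` is a hypothetical per-environment variable file.
        ["terraform", "plan", "-input=false", f"-var-file={env}.tfvars", f"-out={plan_file}"],
        ["terraform", "apply", "-input=false", plan_file],
    ]
```

Applying a saved plan file, instead of planning again at apply time, keeps the reviewed change and the deployed change identical, which matters when the only route to stage and prod is the pipeline.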
While Terraform is not intended or recommended for orchestration, it works well enough for NuID for deploying Lambdas and, with judicious use of null_resource, Datomic ions. The same approach can be used for deploying containers if a service requires it, providing NuID with a lot of flexibility to evolve services going forward.
Programming with Infrastructure
As I’ve previously written, the best code is no code. If you can diffuse your logic into the infrastructure declaratively, that’s less code for your team to maintain.
Maintaining portability between cloud environments is a white elephant. For a small team, it’s better to take advantage of your chosen cloud provider as much as possible.
Overall, this approach has given NuID the ability to grow its headcount slowly and deliberately while maintaining high productivity and keeping costs low without foreclosing the future evolution of the platform.