Datomic with Terraform

22 November 2019 · 9 minute read

I recently gave a talk at Clojure/Conj 2019.

Since last fall, I’ve been working with Nolan Smith, the CTO of NuID, to design and build NuID’s production infrastructure. While I have much more to share on that project as a whole, I’d like to take some time to discuss how we’ve integrated Datomic Cloud into the NuID infrastructure. Because Datomic is relatively unknown beyond the Clojure community, finding articles describing how people are using it is a bit of a challenge. Hopefully my description of our experience will be valuable to others.

Architectural Overview

Overall, we’re using a service-oriented architecture for NuID. While I may write a future post on our overall architectural objectives, with regards to Datomic, the goal was to maintain a shared Datomic service while maintaining isolation and independent scalability across Datomic consumers (i.e., other services) and taking advantage of the convenience and performance of Datomic Ions.

In addition to Datomic, we make heavy use of on-demand, serverless resources throughout the NuID platform, and we use Terraform and AWS CodePipeline to keep everything manageable and repeatable. A full service deployment includes both provisioning AWS resources and packaging and deploying code (primarily via AWS Lambda). By relying heavily on AWS resources, we are able to drastically reduce both initial running costs and the amount of Clojure code we need to maintain. However, because the AWS resources are essential to each service, we need to test and deploy everything as a unit.

We have 4 environments across 4 AWS accounts: dev, stage, prod, and tools. dev is our sandbox environment. Thanks to Terraform (and some bash scripting), developers are able to quickly spin up unique instances of entire services to develop against, test, and review. tools houses our Terraform remote state and CodePipelines. stage is primarily for integration testing and review across all services, and prod is our production environment. Again, thanks to Terraform, we are able to maintain nearly identical stage and prod environments (though we can scale down some resources in stage for better cost control if necessary). All service deployments to stage and prod are fully automated via CodePipeline.
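Each service's Terraform points at remote state stored in the tools account. As a minimal sketch (the bucket, key, and table names here are hypothetical placeholders, not our actual values):

```hcl
# Hypothetical S3 backend in the tools account. All names below are
# placeholders; substitute your own bucket, key, lock table, and region.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "services/my-service/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-terraform-locks" # state locking
  }
}
```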

Datomic

I won’t go into the many reasons we went with Datomic as our central database—if you’re familiar with Datomic, you likely already know why we made that choice. I will mention that if you’re particularly concerned about cost, the approach we’ve taken will not appeal to you—there is a lot of redundancy, and Datomic’s production topology is not cheap relative to many other options out there.

We have 3 Datomic clusters running in the production topology, one each in dev, stage, and prod. We decided to stick with the production topology across all three environments to maintain consistency in our services across all stages of development. Our services are isolated within their own VPCs, so each service either uses a VPC endpoint to access the Datomic cluster or spins up its own service-specific Datomic query group to maintain code isolation and independent scalability.

Bootstrapping

Unfortunately, there’s no way to spin up an initial Datomic cluster without manually going through the AWS Marketplace. I had to go through the initial setup process 3 times, once for each account. I also went through the first upgrade process for each cluster since we were eventually going to have to do it anyway, and it saves us from having to update our Terraform wrapper down the road.

Consistency is king.

To make Datomic easily accessible in Terraform, I created a thin wrapper around the CloudFormation stacks that publishes outputs we need to properly configure our other services. I also use this Terraform module to create the inbound security group rule for the Datomic bastion, the VPC endpoint service, and any other VPC endpoints needed to connect to AWS resources.

You can find a starter module in my datomic-service repository.
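As a rough sketch, the heart of such a wrapper is a set of outputs re-exporting what the CloudFormation stacks created. endpoint_service_name is the output referenced later in this post; the other names and expressions here are illustrative:

```hcl
# Illustrative outputs for a Datomic wrapper module. Only
# endpoint_service_name is referenced later in this post; the rest are
# examples of values consuming services tend to need.
output "endpoint_service_name" {
  value = aws_vpc_endpoint_service.datomic.service_name
}

output "node_security_group_id" {
  value = aws_security_group.datomic_nodes.id
}
```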

Once you’ve applied this Terraform to your Datomic clusters, you can use it via the terraform_remote_state data source in your other services.
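For example, a consuming service might read the module's outputs like this (the backend bucket, key, and region are placeholders for wherever your state actually lives):

```hcl
# Pull the Datomic module's outputs out of remote state. The bucket,
# key, and region values are placeholders.
data "terraform_remote_state" "datomic" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"
    key    = "datomic/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  datomic_service_name = data.terraform_remote_state.datomic.outputs.endpoint_service_name
}
```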

Note that in our current system this Terraform isn’t automatically applied via CodePipeline. My thinking is that some things are done infrequently enough, and/or are critical enough, that they warrant the hands-on attention of an Ops team member.

Individual service deployments are likely going to be frequent. Changes to your core database service less so. YMMV.

In any case, it’s certainly possible to build more automation around Datomic—we just chose to invest our time in the automation that’s more immediately valuable to NuID.

Client Applications

If your service (or client application, in Datomic terms) is running in its own VPC and doesn’t warrant its own query group, all you need to do is create a VPC endpoint within the service’s VPC:

resource "aws_vpc_endpoint" "datomic" {
  security_group_ids = [aws_vpc.my_service.default_security_group_id]
  service_name       = var.datomic_service_name

  subnet_ids = [
    aws_subnet.private_az_a.id,
    aws_subnet.private_az_b.id,
    aws_subnet.private_az_c.id,
  ]

  vpc_endpoint_type = "Interface"
  vpc_id            = aws_vpc.my_service.id
}

datomic_service_name corresponds to the endpoint_service_name output from the global datomic Terraform module linked above.

We then inject the Datomic endpoint DNS name (i.e., aws_vpc_endpoint.datomic.dns_entry[0]["dns_name"]) into any Lambda functions that need to communicate with Datomic:

resource "aws_lambda_function" "my_function" {
  # …

  environment {
    variables = {
      DATOMIC_DATABASE_NAME        = var.datomic_database_name
      DATOMIC_ENDPOINT             = "http://${var.datomic_endpoint}:8182"
      DATOMIC_SYSTEM_NAME          = var.datomic_system_name
      # …
    }
  }

  vpc_config {
    # …
  }
}

For NuID services, we typically use an endpoint for Lambda functions that are not particularly sensitive to latency and/or only issue transactions. For everything else, we use Datomic ions deployed to the service’s query group.

Datomic Ions

Because we are sharing the Datomic cluster amongst multiple services, to maintain service isolation as much as possible (i.e., not deploy service-specific code to the shared cluster), we have any service that uses Datomic ions spin up its own query group. This enables multiple developers working on different versions of the service to work in relative isolation in dev (e.g., using multiple feature branches) and independently scale based off service traffic requirements in prod.

You can find a starter example for a service that integrates Datomic ions in my datomic-terraform-example repository.
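Provisioning the query group itself is a thin aws_cloudformation_stack around Datomic's query group template. As a minimal sketch (the template URL variable and parameter names here are illustrative, so check them against the template you're actually using):

```hcl
# Sketch of a Datomic query group stack. The template URL and parameter
# names are illustrative; consult the actual Datomic query group
# template for the real parameter list.
resource "aws_cloudformation_stack" "query_group" {
  name         = "${var.datomic_system_name}-${var.service_name}-query-group"
  capabilities = ["CAPABILITY_NAMED_IAM"]
  template_url = var.query_group_template_url

  parameters = {
    SystemName = var.datomic_system_name
    # …remaining parameters from the template…
  }
}
```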

We ran into a few interesting issues mostly related to our automated deployments and use of Terraform.

While spinning up a query group using the aws_cloudformation_stack resource is trivial, we also need to do an initial deployment of the ions. Otherwise, downstream resources such as API Gateway that are dependent on Lambdas generated by Datomic will fail to create. To resolve this, we use a null_resource:

resource "null_resource" "ions" {
  depends_on = [aws_cloudformation_stack.query_group]

  triggers = {
    ion_dependency_hash  = local.ion_dependency_hash
    query_group_stack_id = aws_cloudformation_stack.query_group.id
  }

  provisioner "local-exec" {
    command = var.env == "dev" ? "cd \"${path.module}/ions\" && DEPLOYMENT_GROUP=${aws_cloudformation_stack.query_group.outputs["CodeDeployDeploymentGroup"]} bash bin/dev-release.sh" : "cd \"${path.module}/ions\" && ASSUME_ROLE_ARN=${var.assume_role_arn} DEPLOYMENT_GROUP=${aws_cloudformation_stack.query_group.outputs["CodeDeployDeploymentGroup"]} UNAME=${var.rev} REGION=${var.region} bash bin/codebuild-release.sh"
  }
}

The triggers block ensures that we only deploy ions in one of two scenarios: if the code for the ions has changed or if the query group’s CloudFormation stack has changed (for example, after a stack update).

To detect changes in the ion code, we use a locals block to calculate the MD5 hash of any ion-related files and then hash the hashes. This is a bit of a hack, and it unfortunately requires you to manually add source files to this list. In the future, I’d like to come up with a more elegant approach, but this works well enough for now.

locals {
  ion_dependency_hashes = [
    md5(file("${path.module}/ions/deps.edn")),
    md5(file("${path.module}/ions/resources/datomic/ion-config.edn")),
    # …
  ]

  ion_dependency_hash = md5(join("\n", local.ion_dependency_hashes))
}

We also trigger a deployment on aws_cloudformation_stack.query_group.id, which changes whenever the CloudFormation stack changes. This should happen relatively infrequently, but it makes it easy to update the query group stack and redeploy our ions to the updated stack. It also helps Terraform perform the stack update and ion deployment in the proper order.

We’ve also wrapped the ion push/deploy process in a combination of shell scripts backed by a Clojure release script. Because we’re using multiple accounts, the shell scripts handle assuming the appropriate IAM role in the appropriate account. We also need to block Terraform application until the ion code is fully deployed.

I couldn’t make blocking work with the existing command-line tooling provided by Datomic, so I’m using the undocumented datomic.ion.dev API (hat-tip to Oliver George for the kickstart):

;; Assumes these requires in the enclosing namespace:
;;   [clojure.pprint :refer [pprint]]
;;   [clojure.stacktrace :as st]
;;   [datomic.ion.dev :as ion-dev]
(defn- release
  [args]
  (pprint args)
  (try
    (let [push-data (ion-dev/push (select-keys args [:uname :creds-profile :region]))]
      (pprint push-data)
      (let [deploy-args (merge (select-keys args [:group :uname :creds-profile :region])
                               (select-keys push-data [:rev]))
            deploy-data (ion-dev/deploy deploy-args)
            status-args (merge (select-keys args [:creds-profile :region])
                               (select-keys deploy-data [:execution-arn]))]
        (pprint deploy-data)
        (loop []
          (let [status-data (ion-dev/deploy-status status-args)]
            (if (or (= "RUNNING" (:deploy-status status-data))
                    (= "RUNNING" (:code-deploy-status status-data)))
              (do
                (pprint status-data)
                (Thread/sleep 5000)
                (recur))
              status-data)))))
    (catch Exception e
      (st/print-cause-trace e)
      {:deploy-status "ERROR"
       :message (.getMessage e)})))

I’m not happy about having to use an undocumented API. It would be nice if Cognitect opened this API up or provided more shell-friendly CLI tooling, but for now, my approach works pretty well.

Also, because there is no Git repo available by default when we’re deploying via CodePipeline, we always have to pass in a :uname. It’s annoying that all of our releases in stage and prod are labelled “unreproducible” even though we’re enforcing reproducibility via CodePipeline. It would be nice to be able to pass in an overriding :revision instead. But we’re still able to capture the Git revision in the build package and use that for the :uname in stage and prod releases.

External Resources

Sometimes you may need to access resources in a VPC other than the one created by Datomic. To facilitate this, we need to create a VPC peering connection between the two VPCs, the necessary routes, and a security group rule (in the below example, a rule for Redis):

resource "aws_vpc_peering_connection" "datomic" {
  auto_accept = true
  peer_vpc_id = aws_vpc.consumer.id
  vpc_id      = var.datomic_vpc_id

  accepter {
    allow_remote_vpc_dns_resolution = true
  }
}

resource "aws_route" "datomic_to_front_end_private_az_a" {
  destination_cidr_block    = "10.0.0.0/19"
  route_table_id            = var.datomic_route_table_id
  vpc_peering_connection_id = aws_vpc_peering_connection.datomic.id
}

resource "aws_route" "datomic_to_front_end_private_az_b" {
  destination_cidr_block    = "10.0.64.0/19"
  route_table_id            = var.datomic_route_table_id
  vpc_peering_connection_id = aws_vpc_peering_connection.datomic.id
}

resource "aws_route" "datomic_to_front_end_private_az_c" {
  destination_cidr_block    = "10.0.128.0/19"
  route_table_id            = var.datomic_route_table_id
  vpc_peering_connection_id = aws_vpc_peering_connection.datomic.id
}

resource "aws_security_group_rule" "datomic_nodes" {
  from_port                = 6379
  protocol                 = "tcp"
  security_group_id        = aws_vpc.consumer.default_security_group_id
  source_security_group_id = var.datomic_node_security_group_id
  to_port                  = 6379
  type                     = "ingress"
}

One thing to keep in mind with peering connections is that you can’t have any address space overlap between the two VPCs. Datomic’s default base is 10.213.0.0. Configure the CIDR blocks in the VPC you’d like to peer with accordingly.
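For example, the consumer VPC can be kept in a range that can’t collide with Datomic’s (the /16 sizing here is an assumption, not a requirement):

```hcl
# Consumer VPC CIDR chosen to stay clear of Datomic's default
# 10.213.0.0 base. The /16 sizing is an assumption; use whatever fits
# your network plan.
resource "aws_vpc" "consumer" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}
```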

Deployment

While I haven’t yet done a proper write-up of how to perform automatic deployment of services with Terraform and CodePipeline, you might find the datomic-service and datomic-terraform-example repositories linked above helpful.

The Future

I’m working on an engineer’s notebook of sorts focusing on how to build services on AWS with Terraform. If non-trivial examples of AWS infrastructure built with Terraform is something you’d like to see more of, sign up for my mailing list—I’ll be announcing early access to my notebook there soon.


If you find my work interesting, sign up for the Statics & Dynamics mailing list so you don’t miss a thing.