Terraforming 101



In which I finally get to grips with infrastructure as code on a real project while trying to set up identity aware proxy to protect a private app from prying eyes.

I’ve poked around with Terraform before, but never forced myself to do something real. The Woolpert team I work with is all-in on Terraform for our production workloads, and increasingly we start with it for client-facing projects as well. When I set out to build a small app to track my team’s professional certifications and training goals a few months ago, I decided it was time to dig in.

What is Terraform and what problem does it solve?

Infrastructure has long been deployed using a procedural, step-by-step approach. Just look at my own README for this project (called traintrack because I’m, well, tracking training!):

Establish an identity for the Cloud Run service to run as:

gcloud iam service-accounts create traintrack-svc-identity

And IAM: this is a one-time step, but updatable. The resulting key can be used during a CI/CD process:

gcloud iam service-accounts keys create \
service-key.json \
--iam-account traintrack-svc-identity@exp-traintrack.iam.gserviceaccount.com

The latter command creates a service-key.json file that is needed to deploy to Cloud Run since that’s the identity we want the service to run as. If you don’t want to run that, just use the Google Cloud Console to generate a key and save the file.

Create a global IP address:

gcloud compute addresses create traintrack-ip \
   --ip-version=IPV4 \
   --global
gcloud compute addresses describe traintrack-ip \
   --format="get(address)" \
   --global
34.117.190.61

That’s a whole lot of typing commands. To be sure, I prefer running these in a script over clicking a GUI, but still: that’s just the first few steps of a longer process. And of course, if I goof up, I have to carefully back out of every step I’ve made and start over. It’s a real drag, this procedural approach.

Infrastructure as Code (IaC) takes a mindset that evolved in ecosystems like Kubernetes. Instead of saying how to build the infrastructure up, you simply declare what the end result should look like. That’s why it’s called a declarative approach, in contrast to the more traditional procedural approach.
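
For example, the global IP address created above with gcloud collapses to a single declaration in Terraform’s HCL. A minimal sketch (the resource label traintrack_ip is my own naming):

resource "google_compute_global_address" "traintrack_ip" {
  # Declare that a global external IP address named traintrack-ip should exist;
  # Terraform works out whether to create it, keep it, or replace it.
  name = "traintrack-ip"
}

Run against an empty project, this creates the address; run again, it does nothing, because reality already matches the declaration.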

On its own, that is neat but not very compelling. Where the idea of declarative (“what not how”) infrastructure becomes a great tool is in the execution. You see, behind the scenes Terraform is actually talking to the various cloud APIs to get the job done. And the authors of Terraform providers like the Google Cloud Platform provider also write the inverse logic of setting something up: the teardown operations.

What does that mean? It means that you can define your infrastructure in a series of declarative statements (we’ll see examples in a minute) and run terraform apply. That one command will build up the entire infrastructure in one go. But the real magic is when you either:

  1. Adapt the infrastructure and retype terraform apply. Terraform figures out what needs to change to make the current infrastructure match what you’ve declared as the desired end state, and just does it. Magic!
  2. Delete the infrastructure, e.g., you had a test environment and you want to wholesale delete it. Just type terraform destroy and it removes everything that it created automatically. Double magic!

This workflow is such an enormous time saver, not to mention the infrastructural equivalent of a unit testing framework. If you try something and screw it up, just terraform destroy then terraform apply again. Or if you just want to make a change, run terraform plan as a dry run to see what is going to happen, and then terraform apply. Terraform can seem onerous the first time you pick it up as a tool, but the savings are truly phenomenal in short order.

Building the app is not Terraform’s job

Terraform is not a build tool, packaging tool, or configuration management tool. That’s why tech like Docker, pip, and others exist.

No, Terraform is for your infrastructure.

My TrainTrack app is really simple: just some Python packages, a Makefile to do some data prep, and a Dockerfile to bring it all together. It uses Datasette to put a functional web interface on top of some data so I can see things like how many certifications we’ve achieved as a team:

A simple way to view CSV files using SQL, courtesy of Datasette

The build and deploy process is pretty much encapsulated in this Makefile:

Makefile

cloud-run-service := svc-traintrack
service-account := svc-acct-traintrack
gcp-project := exp-traintrack-tf
db := ./training.db
image := traintrack-tf
tag = latest
region := us-west1
artifact-repo := traintrack-repo
hostname := $(region)-docker.pkg.dev/$(gcp-project)/$(artifact-repo)

.PHONY: data
data: clean
	csvs-to-sqlite --replace-tables --primary-key code data/courses.csv $(db)
	csvs-to-sqlite --replace-tables --primary-key email data/people.csv $(db)
	csvs-to-sqlite --replace-tables --primary-key id data/certs.csv $(db)
	sqlite-utils add-foreign-key $(db) certs course courses code 
	sqlite-utils add-foreign-key $(db) certs person people email
	sqlite3 $(db) < data/awards.sql

run: data
	datasette $(db) --host 0.0.0.0 --port 1234 --metadata metadata.yaml

clean:
	rm -f $(db)

build: data
	docker build -t $(image) .

push: build
	docker tag $(image) $(hostname)/$(image):$(tag)
	docker push $(hostname)/$(image):$(tag)

deploy: push
	gcloud config set project $(gcp-project)
	gcloud run deploy $(cloud-run-service) \
	  --image $(hostname)/$(image):$(tag) \
	  --platform managed \
	  --region $(region) \
	  --service-account $(service-account) \
	  --no-allow-unauthenticated \
	  --port 8080

docker-auth:
	gcloud auth configure-docker $(region)-docker.pkg.dev

Hopefully it’s obvious that my choice of Google Cloud Run as a deployment platform affects how I package my app (a Dockerfile built and shipped as a deployable image in an artifact repository).

Dockerfile

# Multi-stage build: install dependencies in a throwaway builder layer,
# then copy only the installed packages into the final image.
FROM python:3.7-buster as base

FROM base as builder
RUN mkdir /install
WORKDIR /install
COPY requirements.txt /requirements.txt
RUN pip install --no-cache --prefix="/install" -r /requirements.txt

# Final image: the installed packages plus the app's metadata and database
FROM base
COPY --from=builder /install /usr/local
COPY metadata.yaml ./
COPY training.db ./
CMD ["datasette", "serve", "--port", "8080", "--host", "0.0.0.0", "--metadata", "metadata.yaml", "training.db"]

And hopefully it’s obvious where my app build and config ends, and where Terraform will need to take over: the infrastructure to run the app on.

Picking a target architecture for the app

Here’s a view of the infrastructure I need to deploy my app:

Google Cloud Run behind a load balancer and a correctly configured DNS

It’s a bit of an eye chart, so let’s break it down. I need:

A Cloud Run service. That’s where the containerized app will live, and it will be deployed on demand as I push new images to the Cloud Artifact Repository.

The Cloud Run service with a few revisions, tastefully named

Cloud Artifact Registry. Cloud Run will pull images from it, so I need to create one before I can upload docker images in the first place.

Cloud Artifact Registry is a place to keep build outputs like docker images, self-hosted package repos, etc.

All of the other elements on the diagram are needed to support the app, but aren’t especially interesting from an infra perspective. What is interesting is the interdependencies as revealed by the arrows.

  • Inbound traffic to traintrack.woolpert.io will need a global forwarding rule which points to…
  • A reverse proxy that understands where to find…
  • URL rewriting rules to translate from traintrack.woolpert.io to svc-traintrack.blahblah.a.run.app
  • The proxy needs to understand which External Static IP address is associated with…
  • A GCP-managed DNS entry and managed SSL certificate so that the traffic is all encrypted correctly, so that…
  • When it gets to the Backend Service and associated Network Endpoint Group, the request reaches the correct…
  • Cloud Run app instance.

So there’s a bunch of infrastructure to create. Not to mention a service account for this to all run under.

Terraform resources, variables, and values

There are three files involved in this IaC setup:

  • main.tf which contains the definition or declaration of the infrastructure we want.
  • variables.tf which defines any variables we want to use in the declarations. This helps us avoid typing the same thing twice and introducing inconsistencies.
  • terraform.tfvars which contains the values for the variables. This gives us separation between the declaration of the variables and their types, and the values themselves.

That last bit can sound confusing, so let’s look at the terraform.tfvars file:

terraform.tfvars

project-id        = "exp-traintrack-tf"
gcp-region        = "us-west1"
svc-account-tt    = "svc-acct-traintrack"
artifact-repo     = "traintrack-repo"
cloud-run-service = "svc-traintrack"
site-address      = "traintrack.woolpert.io"

This is our first look at HCL, the language that Terraform uses. Seems pretty obvious, right? Name-and-value declarations. But where are these used? For now, remember the variable named artifact-repo.

Now let’s look at where those variables are actually defined, because this is not it! The declarations are in the variables.tf file.

variables.tf

variable "project-id" {
  description = "GCP project ID"
  type        = string
  sensitive   = false
}

variable "gcp-region" {
  description = "GCP region"
  type        = string
  sensitive   = false
}

variable "svc-account-tt" {
  description = "Service account that Traintrack will run as"
  type        = string
}

variable "artifact-repo" {
  description = "Name of the Google Artifact repository where images are published"
  type        = string
}

variable "cloud-run-service" {
  description = "Name of the Cloud Run service"
  type        = string
}

variable "site-address" {
  description = "Fully qualified domain name where the service will be accessible, e.g., mysite.company.com"
  type        = string
}

See the artifact-repo item there? Yep, that’s where the variable is actually created. It has a type (string), a helpful description, and a name. The name is what we use to refer to the variable in other files.
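
Referencing it elsewhere is just a matter of the var. prefix; this line is lifted straight from the main.tf below:

repository_id = var.artifact-repo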

The real fun begins in the main.tf where the infrastructure is actually defined.

main.tf

terraform {
  backend "local" {
  }

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "3.58.0"
    }
  }
}

provider "google" {
  project = var.project-id
  region  = var.gcp-region
}

resource "google_service_account" "service_account" {
  account_id   = var.svc-account-tt
  display_name = "Service account running Train Track cloud run app"
}


resource "google_artifact_registry_repository" "artifact-repo" {
  provider      = google-beta
  location      = var.gcp-region
  project       = var.project-id
  repository_id = var.artifact-repo
  description   = "Train Track docker repository"
  format        = "DOCKER"
}

resource "google_compute_region_network_endpoint_group" "cloudrun_neg" {
  name                  = "neg-cloudrun"
  network_endpoint_type = "SERVERLESS"
  region                = var.gcp-region
  cloud_run {
    # This confused me. Thought the Cloud Run service has to exist
    # first, but just providing the name is enough. Good! Because otherwise
    # not sure how to create a Cloud Run service without first pushing a docker
    # image...which is not possible if the Artifact Registry is also being
    # created by Terraform. Chicken and egg.
    service = var.cloud-run-service
  }
}

resource "google_compute_backend_service" "backend-service" {
  provider = google-beta
  name     = "bes-traintrack"
  project  = var.project-id

  backend {
    group = google_compute_region_network_endpoint_group.cloudrun_neg.id
  }
}

# This is the load balancer
resource "google_compute_url_map" "urlmap" {
  name        = "urlmap-traintrack"
  description = "Directs all traffic directly to backend service without any URL mapping"

  default_service = google_compute_backend_service.backend-service.id
}

resource "google_dns_record_set" "dns-traintrack" {
  # Learned the dangling period trick
  name = "${var.site-address}."
  type = "A"
  ttl  = 300

  project      = "woolpert-corporate-assets"
  managed_zone = "woolpert-io"


  rrdatas = [google_compute_global_address.external-ip-address.address]
}

resource "google_compute_managed_ssl_certificate" "managed-cert" {
  name = "site-cert"

  managed {
    domains = [google_dns_record_set.dns-traintrack.name]
  }
}

# This is the Front End part of the Load Balancer. It won't be
# visible as such in the Cloud Console UI until the forwarding
# rule below is created to expose it to the outside world
resource "google_compute_target_https_proxy" "default-proxy" {
  name             = "proxy-traintrack"
  url_map          = google_compute_url_map.urlmap.id
  ssl_certificates = [google_compute_managed_ssl_certificate.managed-cert.id]
}

resource "google_compute_global_address" "external-ip-address" {
  name    = "ip-ext-traintrack"
  project = var.project-id
}

resource "google_compute_global_forwarding_rule" "default" {
  name   = "global-rule"
  target = google_compute_target_https_proxy.default-proxy.id
  # The Target Proxy explicitly accepts only SSL traffic
  port_range = "443"
  ip_address = google_compute_global_address.external-ip-address.address
}

Picking a few interesting sections…

  • The backend "local" block. State is where Terraform stores its understanding of which infrastructure it has created and whether what is declared in the main.tf file has changed. In other words, does it need to make any updates?

  • The google entry in required_providers. HashiCorp or the cloud vendor (Google in this case) writes providers that wrap their specific APIs up into a series of creatable and destroyable resources.

  • The google_artifact_registry_repository resource. Here we see an Artifact Repository being defined. Remember, it’s not created here but it is defined here. We never have to worry about how it’s going to be created or managed or destroyed. We just say we want it, and TF will figure out the rest. This and many other resources have parameters. One is the provider. Others are properties like the repository_id which are required by the underlying Google API. What’s interesting is the use of the variable var.artifact-repo; that’s defined in the variables.tf file and gets its value from the terraform.tfvars file.

  • The backend block in google_compute_backend_service. Terraform manages a graph of connected resources it needs to create. Dependencies (edges) on that graph are created by having a property of one resource point to a property of another resource. In this case, the backend.group property of the backend service needs to know about (‘depends on’) the ID of the network endpoint group (NEG): google_compute_region_network_endpoint_group.cloudrun_neg.id. That means the NEG must be created before the Backend, so that the Backend can refer to the ID of the NEG. It’s a directed acyclic graph (DAG) of resources.

    This is another key benefit of TF: as a DAG it can optimize the creation of resources. If one resource chain does not depend on another, TF can spin them up in parallel. Look back at the diagram: in our case the artifact repository, service account, and global forwarding rule can all be created in parallel.
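
Nearly all of these edges are implicit, like the NEG reference above. For the rare case where two resources must be ordered but no attribute of one feeds the other, HCL also offers an explicit depends_on argument. A hypothetical sketch, not something this project needs:

resource "google_compute_backend_service" "backend-service" {
  # ... attributes as before ...

  # Force creation order even when no attribute reference creates an edge
  depends_on = [google_compute_region_network_endpoint_group.cloudrun_neg]
}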

Building the infrastructure

This is taken straight from the README file in the project and describes how it all comes together. Remember, the infrastructure is managed as a prerequisite of the app, not as a build-time dependency.

Manual steps

APIs were turned on via gcloud:

gcloud config set project $PROJECT_NAME
gcloud services enable compute.googleapis.com
gcloud services enable run.googleapis.com
gcloud services enable artifactregistry.googleapis.com
gcloud services enable iap.googleapis.com

Terraform

You need access to the DNS records in our corporate-resources project in order for this to work.

You shouldn’t need to touch this, but the rest of the infrastructure is defined in the infra/main.tf file. The Terraform state is not stored in a GCS bucket because this is a toy project: it’s ok to completely delete the project and start from scratch at any time.
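
If this ever graduates beyond a toy, switching to remote state is a one-block change to the terraform block in main.tf. A sketch, with a hypothetical bucket name:

terraform {
  backend "gcs" {
    bucket = "my-tf-state-bucket" # hypothetical; must exist before terraform init
    prefix = "traintrack"
  }
}

With that in place, terraform init offers to migrate the local state into the bucket, and the whole team shares one view of the infrastructure.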

That said, it’s easy to get started. First, check that the values in infra/terraform.tfvars make sense. Look in infra/variables.tf to understand what each one is used for.

Next, run the usual Terraform dance to validate then apply the changes.

cd infra
terraform init
terraform plan # STOP and READ the output
terraform apply # type `yes` if it makes sense

Piece of cake! And this is sooooooooo much nicer than messing about with gcloud commands and the like.

Cloud IAP

But I don’t want this app to be available publicly. I want it to have Google single-sign on (SSO) for people at Woolpert and to be inaccessible for anyone else.

Unfortunately, at the time I created the project there wasn’t a good IaC/Terraform story for Cloud Identity-Aware Proxy, so I did it using the Google Cloud Console:

  1. Create an OAuth consent screen via APIs & Services > OAuth consent screen
  2. Create an OAuth credential via APIs & Services > Credentials > +Create Credentials > OAuth client ID
  3. Go to Security > Identity-Aware Proxy.
    1. Check the box next to bes-traintrack
    2. Slide the toggle to enable IAP.
    3. In the slide-in window, click ADD MEMBER
    4. Add a Google group by typing the email.
    5. Choose Role > All roles > Cloud IAP > IAP-secured web user.

Indeed the documentation still says that this is not doable:

Only internal org clients can be created via declarative tools. External clients must be manually created via the GCP console. This restriction is due to the existing APIs and not lack of support in this tool.
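
That restriction is about the OAuth clients themselves. The IAM half of the story is expressible in the provider, though; a hedged sketch of what granting a group access to the IAP-protected backend might look like (the group address is hypothetical, and this isn’t part of my project):

resource "google_iap_web_backend_service_iam_member" "iap-access" {
  project             = var.project-id
  web_backend_service = google_compute_backend_service.backend-service.name
  role                = "roles/iap.httpsResourceAccessor"
  member              = "group:team@example.com" # hypothetical Google group
}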

Summary

I hope this quick tour of a simple project using Terraform helps orient you with the basic concepts. Infrastructure as code, declarative infrastructure, and tools to automate it all are a great addition to your toolbox.

