We have a persistent issue at work: my team is client-driven. These clients can be internal or external, which both helps and hinders. We have to balance many priorities, but the people behind those priorities have no sight of each other, so everyone’s request is “important”. On the other hand, we can sometimes hide behind some anonymity, so it is always a balancing act.
As a team, we have developed Terraform modules to use across projects, both internal and external. These have reduced time-to-market for many projects. In the last two years, we have found ourselves inundated with project work from external clients. This work has consistently had priority over our internal-facing work, causing a large amount of tech debt to pile up in the Terraform space. Many tickets were blocked, or slowed by training requirements, simply because they depended on older Terraform versions.
We aimed to solve the Terraform issue by having a “Terraform week”, in which we would focus on paying down some of this tech debt. We started our planning on a team day in the office a week before “Terraform week” and came up with the following plan:
- Assess modules: are they still relevant?
- Sunset what we no longer need (both modules and projects);
- Write an internal Terraform style guide;
- Create the repositories, each with a ticket telling us what lives there;
- Create module templates (not used before; a new feature of GitLab for us);
- Archive the old “Core” repository;
- Agree on a capture methodology so that we can successfully relay the business benefit.
Some of the challenges we faced during the week were of our own making, such as the continued use of Terraform `0.11.15` (or later in most cases), which caused, among many other things, incompatibility with new AWS features (because we could not upgrade the AWS provider). Longer-standing team members have also become adept at writing HCL 1, so the new starters have the upper hand when writing our new HCL 2! I have had to retrain my muscle memory to stop typing `${}` everywhere and to reach for `for_each` instead of `count`.
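To make the muscle-memory shift concrete, here is a minimal sketch of the same resource written the old way and the new way. The variable, the bucket names, and the `example-` prefix are all hypothetical, and a configured AWS provider is assumed.

```hcl
# Hypothetical variable: the bucket names are illustrative only.
variable "bucket_names" {
  type    = set(string)
  default = ["logs", "artifacts"]
}

# HCL 1 habit (Terraform 0.11): every expression wrapped in "${}" and
# repetition driven by count over a list, so instances are addressed
# by index (aws_s3_bucket.this.0, aws_s3_bucket.this.1, ...).
#
#   resource "aws_s3_bucket" "this" {
#     count  = "${length(var.bucket_names)}"
#     bucket = "example-${element(var.bucket_names, count.index)}"
#   }

# HCL 2 (for_each on resources needs Terraform 0.12.6 or later):
# bare expressions, and instances addressed by key,
# e.g. aws_s3_bucket.this["logs"].
resource "aws_s3_bucket" "this" {
  for_each = var.bucket_names

  # "${}" is still needed when interpolating inside a longer string.
  bucket = "example-${each.key}"
}
```

Because `for_each` keys instances by name rather than position, removing one entry from the set only affects that one bucket, whereas with `count` removing an item from the middle of a list shifts every later index.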
Other challenges are Terraform’s responsibility: the wide-ranging changes since `0.11.x` and the continued drive to use centralised modules in the “new” Terraform public registry. Anyone who has seen an error saying that `this_xxx` is not an output of module `security_groups` will know what I mean.
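As an illustration of how that kind of error can arise, here is a hedged sketch. It assumes the module is the public registry's `terraform-aws-modules/security-group/aws`, which the text above does not actually confirm, and it relies on my understanding that the 3.x series of that module exposes outputs prefixed with `this_` while 4.x dropped the prefix. The `vpc_id` variable and the `example` name are hypothetical.

```hcl
# Assumption: the registry module below is only a plausible example;
# the original text does not name the module involved.
variable "vpc_id" {
  type = string
}

module "security_groups" {
  source = "terraform-aws-modules/security-group/aws"

  # Pinning the major version keeps output names from changing
  # underneath us on the next "terraform init -upgrade".
  version = "~> 3.0"

  name   = "example"
  vpc_id = var.vpc_id
}

output "security_group_id" {
  # Assumed valid on 3.x; on 4.x the output is believed to be named
  # security_group_id, so this reference would fail with an
  # "unsupported attribute" style error.
  value = module.security_groups.this_security_group_id
}
```

Leaving `version` unset is what tends to turn an upstream rename into a surprise: the configuration has not changed, but the module it resolves to has.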
As the week progressed, we finished our style guide and built a template repository containing a pipeline config that performs the following actions:
- Every commit (branch or MR, but not `main`) has a `tf_lint` run against it;
- MRs have a `checkov` run against them;
- MRs have a semantic pre-release run for each commit (so that we can test our code without having to rely on local references); this uses a package, `go-semrel-gitlab`;
- Master has a semantic release, again using `go-semrel-gitlab`.
This process ensures we’re getting enough checks against our code whilst providing a relatively frictionless experience for Infrastructure Engineers. The team developed this process collaboratively, so everyone has fully bought in.
After the groundwork, we moved on to the practical work. I can’t say too much about this as it involved some client code/designs, but I can tell you it was more complex than expected. We ran into some of the issues I’ve already described, such as `count` versus `for_each`, and some new challenges. How do we utilise some of the new functions, e.g. `templatefile`? These functions have never been in our development models until now.
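For anyone who has not met `templatefile` yet, here is a minimal sketch of how it can be used. The template path, its contents, and the variable names are all hypothetical; the template file would need to exist alongside the module.

```hcl
# Hypothetical template at templates/motd.tftpl containing:
#
#   Welcome to ${environment}
#   %{ for team in teams ~}
#   - ${team}
#   %{ endfor ~}

locals {
  # templatefile(path, vars) reads the file and renders it using only
  # the values passed in the second argument.
  motd = templatefile("${path.module}/templates/motd.tftpl", {
    environment = "staging"
    teams       = ["platform", "data"]
  })
}

output "motd" {
  value = local.motd
}
```

The function takes the place of the older `template_file` data source, and because the template only sees the map passed to it, its inputs are explicit at the call site.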
The challenges led to a slight overrun on some tickets, so we now have some minor mop-ups to do in our next standard sprint. However, that is OK, because during the week, as a team, we have:
- Discovered so much about Terraform, personally and as a group;
- Got everyone on board with the style guide;
- Ensured we’re programmatically checking our code with tools like `tf_lint` and `checkov`;
- Secured the flexibility, continued planning, and buy-in from leadership to ensure that we’re less likely to be blocked by tech debt again.