Infrastructure as Code and Reliability
How can IaC help build more robust software systems?
Hey folks,
Infrastructure as Code (IaC) is the practice of writing code to provision your infrastructure and manage the resources you create in the cloud or on-prem.
Instead of clicking through the console for AWS, DigitalOcean or Azure, you write code. That code lets you say, "Hey, I want a server to be this size, I want it connected to this load balancer, and I want connections to go through this DNS name."
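To make that concrete, here's a rough sketch in plain Python of what "writing it down" can look like. This isn't a real IaC tool or any provider's API. The `Server` class, its fields, and the example values are all made up for illustration.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class Server:
    """Hypothetical description of one server (not a real provider API)."""
    name: str
    size: str           # "I want a server to be this size"
    load_balancer: str  # "connected to this load balancer"
    dns_name: str       # "connections go through this DNS name"


def render(server: Server) -> str:
    """Serialize the definition so it can be reviewed and versioned."""
    return json.dumps(asdict(server), indent=2, sort_keys=True)


# Illustrative values only.
web = Server(
    name="web-1",
    size="medium",
    load_balancer="public-lb",
    dns_name="app.example.com",
)

print(render(web))
```

The point is that the definition lives in a file anyone on the team can read, review, and re-run, rather than in the memory of whoever clicked through the console.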
But IaC isn't just about saving clicks. IaC also impacts system reliability and maintainability.
Infrastructure is complex and sometimes long-lived. If you're not writing infrastructure config down in a succinct way, then the knowledge of how it was configured stays with one person: whoever operated the console. They know how to provision that infrastructure and those resources, but it's just them. If, instead, the way the infrastructure was created and updated is captured in versioned infrastructure as code, then others on your team, now and in the future, can provision it too.
Let's say a staging database gets deleted through the console for some reason. Mistakes happen. If you don't know all the configuration parameters you need to rebuild it, you have to go through the console of a similar system (prod or dev) and painstakingly recreate it. Hope you don't miss any checkboxes or hidden options!
If that infrastructure was defined instead as IaC, it'd be simple to recreate. It's a terminal command away.
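Here's a minimal sketch of why that's true, assuming the desired infrastructure is written down as data. The `plan` function and the resource names are hypothetical; real IaC tools do something far more involved, but the core idea is comparing what you declared against what actually exists.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compare declared resources with what exists and list what to change."""
    create = sorted(name for name in desired if name not in actual)
    update = sorted(
        name for name in desired
        if name in actual and desired[name] != actual[name]
    )
    delete = sorted(name for name in actual if name not in desired)
    return {"create": create, "update": update, "delete": delete}


# What the code says should exist (illustrative values).
desired = {
    "staging-db": {"engine": "postgres", "storage_gb": 100, "backups": True},
    "staging-api": {"size": "small"},
}

# What actually exists after someone deleted the database in the console.
actual = {
    "staging-api": {"size": "small"},
}

print(plan(desired, actual))
# -> {'create': ['staging-db'], 'update': [], 'delete': []}
```

Every parameter needed to rebuild `staging-db` is already in the definition, so there are no checkboxes to remember.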
Now let’s say that there’s a production outage because an API gateway was accidentally deleted or misconfigured. You can see how site reliability would be positively impacted by being able to recreate this infrastructure from IaC.
I think it's really important for platform engineers and SREs to be able to use infrastructure as code to create repeatable systems and not have all that configuration locked into non-executable documentation or people's heads.
It’s also helpful for root cause analysis. If you have an incident where infrastructure is involved, you can use versioned IaC to track down configuration changes that might have contributed to it.
Of course, you also need to be alerted when systems have issues. Alerting is more often a people/culture/process issue than a technical problem, but IaC can help you manage all of your monitors and alerts. Don't allow changes in the console. That helps you avoid the kind of drift that leaves you asking, "Why does this alert for team X not match the alert for team Y in this weird way?"
If you are running production workloads in any of the cloud ecosystems, IaC is critical because it creates the repeatability you need for the long term.
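As a rough illustration of that kind of investigation, here's a sketch that compares two versions of a config and lists what changed. In practice your version control history gives you this for free. The `config_diff` helper and the gateway settings below are invented for the example.

```python
def config_diff(old: dict, new: dict) -> list:
    """List the settings that differ between two versions of a config."""
    changes = []
    for key in sorted(old.keys() | new.keys()):
        before = old.get(key, "<unset>")
        after = new.get(key, "<unset>")
        if before != after:
            changes.append(f"{key}: {before!r} -> {after!r}")
    return changes


# Two hypothetical versions of an API gateway config.
last_week = {"max_connections": 200, "timeout_s": 30}
today = {"max_connections": 50, "timeout_s": 30, "tls": True}

for change in config_diff(last_week, today):
    print(change)
# max_connections: 200 -> 50
# tls: '<unset>' -> True
```

If the incident started around the time `max_connections` dropped, you now have a lead, plus a commit that tells you who made the change and why.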
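One way to picture this: generate every team's alert from the same definition, then check what's live against it. This is only a sketch. `latency_alert`, `find_drift`, the metric name, and the thresholds are hypothetical, not any monitoring vendor's API.

```python
def latency_alert(team: str, threshold_ms: int = 500) -> dict:
    """Build the same latency alert for any team from one template."""
    return {
        "name": f"{team}-high-latency",
        "metric": "p95_latency_ms",
        "threshold": threshold_ms,
        "notify": f"#{team}-oncall",
    }


def find_drift(desired: list, live: dict) -> list:
    """Return names of alerts that are missing or differ from the code."""
    return sorted(
        alert["name"] for alert in desired
        if live.get(alert["name"]) != alert
    )


desired_alerts = [latency_alert("team-x"), latency_alert("team-y")]

# Pretend team-y's threshold was edited by hand in the console.
live_alerts = {
    "team-x-high-latency": latency_alert("team-x"),
    "team-y-high-latency": latency_alert("team-y", threshold_ms=2000),
}

print(find_drift(desired_alerts, live_alerts))
# -> ['team-y-high-latency']
```

Because both alerts come from one function, any difference between teams has to be written down as an explicit argument instead of appearing silently.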
How has IaC helped you build more reliable systems? Would love to hear any stories.
And may you never be paged in the middle of the night because someone used the AWS console to change a database parameter.
Matt @ Masterpoint.io
PS If you’ve found this newsletter helpful, please forward it to a friend. Or if you want to share it on your company Slack, here’s the archive of all my newsletters.