Ephemeral infrastructure

December 9, 2023

This site is published through an AWS S3 bucket, delivered by AWS CloudFront edge locations. Every time a new version is written, it goes through these steps:

I push my changes to the GitHub repository;
The change automatically triggers a GitHub Actions workflow;
The workflow spins up a brand new GitHub Actions server instance and sets up my AWS credentials on it;
The workflow installs Hugo, which is the build engine of this blog, on the new instance;
On the new instance, the workflow pulls my blog repository from GitHub;
On the new instance, the workflow runs hugo to build the entire site;
On the new instance, the workflow runs hugo deploy to deploy the new changes to the AWS S3 bucket and clear the CloudFront cache. Deployment is finished;
The workflow cleans up and destroys the new instance.
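The steps above can be sketched as a small Python dry run. This is only an illustration, not the actual workflow: the step list mirrors the list above, while run_pipeline and the temporary directory standing in for the runner instance are hypothetical names made up for the example. Nothing is really executed; each step is just recorded.

```python
import os
import tempfile

# Steps mirror the list above. Only "hugo" and "hugo deploy" are commands
# named in the post; the other entries are descriptive placeholders.
STEPS = [
    "set up AWS credentials",
    "install Hugo",
    "pull blog repository",
    "hugo",
    "hugo deploy",
]


def run_pipeline(steps):
    """Dry-run the steps inside a throwaway workspace and return a log.

    The temporary directory stands in for the fresh runner instance:
    it is created for this run only and deleted when the run ends.
    """
    log = []
    with tempfile.TemporaryDirectory() as workspace:
        for step in steps:
            # A real pipeline would execute the step here; we only record it.
            log.append(step)
        existed_during_run = os.path.isdir(workspace)
    # Leaving the "with" block destroys the workspace, like the last step above.
    destroyed_after_run = not os.path.exists(workspace)
    return log, existed_during_run, destroyed_after_run


if __name__ == "__main__":
    log, existed, destroyed = run_pipeline(STEPS)
    print(log)
    print("workspace existed during run:", existed)
    print("workspace destroyed after run:", destroyed)
```

The point of the sketch is that the workspace never outlives a single run, so the next run cannot depend on anything the previous one left behind.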

These steps usually take 30–40 seconds.

GitHub Actions History

Among those, the instance start-up time is usually 10 seconds or so. The runtime is normally about 20 seconds.

GitHub Actions Details

And of those 20 seconds, the build and deploy steps in fact take only 2 seconds.

GitHub Actions Workflow Details
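To put the breakdown into numbers, here is a quick back-of-the-envelope calculation in Python. It uses the rough figures quoted above (10 seconds of start-up, 20 seconds of runtime, 2 seconds of build and deploy), which add up to the low end of the 30–40 second range; the function name setup_overhead is made up for this sketch.

```python
# Rough figures quoted in the post, in seconds.
STARTUP = 10          # instance start-up
RUNTIME = 20          # workflow runtime after start-up
BUILD_AND_DEPLOY = 2  # the hugo build plus hugo deploy


def setup_overhead(startup, runtime, useful):
    """Return (total seconds, seconds not spent on useful work, overhead share)."""
    total = startup + runtime
    overhead = total - useful
    return total, overhead, overhead / total


if __name__ == "__main__":
    total, overhead, share = setup_overhead(STARTUP, RUNTIME, BUILD_AND_DEPLOY)
    print(f"total: {total}s, overhead: {overhead}s ({share:.0%})")
    # prints "total: 30s, overhead: 28s (93%)"
```

With these figures, roughly 28 of every 30 seconds go to setting up and tearing down the environment rather than to building and deploying.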

Judging by this time breakdown, it looks pretty bad: I seem to be wasting a lot of time repeating the same environment setup on every run. If I needed to deploy very frequently, would it be a better idea to keep a long-running instance?

Actually, the answer is no. Even if I deployed a million times a day, having a long-running build server would not be a better idea than what I am doing now, which is technically called ephemeral infrastructure.

The idea of ephemeral infrastructure is to always spin up new resources and throw away the old ones when making infra changes. For example, if this site were deployed on an AWS EC2 instance using NGINX, the ordinary deployment method would be to dump the new static files into the public folder of the instance by some means, without changing the instance itself. The advantage (and actually also a pitfall) of this method is that it keeps the configuration on the instance as it is. The ephemeral infra method would be to spin up a completely new instance with a new NGINX, upload the new data, host the new site, replace the old instance with the new one, and destroy the old one.
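The difference between the two methods can be shown with a toy model in Python. This is a sketch under loud assumptions: Server, DECLARED_CONFIG, deploy_in_place and deploy_ephemeral are all hypothetical names, and a dictionary stands in for a real instance. It only illustrates what happens to a hand-made change on the server under each method.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """A toy stand-in for an instance: its configuration plus the site files."""
    config: dict
    files: dict = field(default_factory=dict)


# Hypothetical configuration that lives in the repository, not on the server.
DECLARED_CONFIG = {"web_server": "nginx", "root": "/public"}


def deploy_in_place(server, new_files):
    """Ordinary method: overwrite the files, leave the server as it is."""
    server.files = dict(new_files)
    return server


def deploy_ephemeral(old_server, new_files):
    """Ephemeral method: build a fresh server from the declared configuration.

    The old server is not reused; the caller drops it once the new one is live.
    """
    return Server(config=dict(DECLARED_CONFIG), files=dict(new_files))


if __name__ == "__main__":
    live = Server(config=dict(DECLARED_CONFIG), files={"index.html": "v1"})
    # Someone tweaks the live server by hand and never writes it down.
    live.config["undocumented_tweak"] = True

    in_place = deploy_in_place(live, {"index.html": "v2"})
    print("undocumented_tweak" in in_place.config)  # True: the tweak lingers

    fresh = deploy_ephemeral(live, {"index.html": "v2"})
    print("undocumented_tweak" in fresh.config)     # False: rebuilt from code
```

In the in-place case the undocumented tweak silently survives every deployment; in the ephemeral case anything not declared in code disappears with the old instance, so it has to be written down to exist at all.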

Why do we want to do this?

As I mentioned above, keeping the configuration on the server could be an advantage, but it is also a pitfall. If your site relies on some configuration on the server, it’s best to keep it in your code. Find a way to document and automate it, instead of leaving it on the server and just letting it work. The pitfall of leaving it there is that nothing gets remembered. We have all had the experience of taking over an old system with nearly no documentation that just works. It is painful to make any changes because you could break things easily. Even if you own the site completely, as in my case, you could still forget what you did to it one month earlier. Having the configuration managed in the code allows you to remember what you have done. It is also a quick way to recover from disaster because you can always recreate your environment faster and with confidence.

Ephemeral infrastructure is more secure, as there is no residue from previous actions. It is less likely to trigger strange environmental issues, and also less likely to allow malicious scripts to gain useful information (not saying it’s impossible though, depending on what kind of sensitive information you hold on your ephemeral infra).

Ephemeral infrastructure is also more cost-effective in most cases. In my case, for example, it is used for the build service. It is actually a very common pattern in modern cloud-based build services (for example, GitHub Actions, Google Cloud Build, etc.). It is usually impossible to use a long-running instance for builds if you use those build services. It is a reasonable assumption that you won’t build your software 24 hours per day. So from a cost-saving perspective, it does make more sense to start up your instance only when you need it and be charged only for the duration that you use it. The extra time spent on start-up generally would not outweigh the saving that it brings. This is only possible with the cloud model. If you manage your own on-premise data centre, you are paying for everything regardless; in that case, you won’t save money with ephemeral infra, but you still benefit in other respects.
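The cost argument can be made concrete with a small Python sketch. It rests on simplifying assumptions that are not from any real price list: both models are billed at the same hypothetical per-second rate, builds run one at a time, and the function names are made up for the example. Real services may bill differently, for instance by rounding up usage.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pay_per_use_cost(builds_per_day, seconds_per_build, rate_per_second):
    """Daily cost when an instance exists only for the duration of each build."""
    return builds_per_day * seconds_per_build * rate_per_second


def always_on_cost(rate_per_second):
    """Daily cost of one instance that runs around the clock."""
    return SECONDS_PER_DAY * rate_per_second


def break_even_builds(seconds_per_build):
    """Builds per day at which back-to-back builds would fill the whole day."""
    return SECONDS_PER_DAY / seconds_per_build


if __name__ == "__main__":
    rate = 0.001   # hypothetical price per second, in arbitrary currency units
    per_build = 30  # seconds, the low end of the range quoted in the post

    for builds in (1, 10, 100):
        cost = pay_per_use_cost(builds, per_build, rate)
        print(f"{builds:>4} builds/day: {cost:.2f} vs always-on {always_on_cost(rate):.2f}")

    print("break-even builds per day:", break_even_builds(per_build))
```

Under these assumptions, a 30-second build would have to run 2,880 times a day, back to back, before the pay-per-use bill caught up with an always-on instance at the same rate.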

I have always been married to the idea of ephemeral infra and I really like it. Only today did I realise that it could look very counter-intuitive. The build history of my own blog is quite interesting too, which is why I thought I should write my thoughts down.