A quick and dirty DIY rsync for S3 websites

2020-08-24

Tags: foss, tooling, website

As you might know, this website is served to you from an S3 bucket, which is great because all my content is static, so I don't have to pay for computation that I don't use. It does mean that every time I want to make a change, for example because I wrote a new blog post, I have to re-upload the relevant files. The crucial thing to know here is that uploads are significantly more expensive than downloads. Getting traffic to my website is pretty cheap, but uploading a lot of changed files gets expensive. To give you an idea: in the AWS Free Usage Tier you get in your first year, there are 20,000 free downloads but only 2,000 free uploads, an order of magnitude less. Until now my upload workflow has been somewhat inefficient, and today I made a quick and dirty solution for it. In the spirit of keeping this blog alive and also defeating perfectionism, I wanted to show you that solution and the journey that got me there.

My old workflow

For those of you unfamiliar with static website generators: I have some content and template files on my local machine, and every time I want to make a change to the website, the entire website gets re-rendered. That is necessary because for a static webpage, re-rendering the entire thing is the only way to make a change like updating the links in index pages. That is an important detail, because it means that even if the content of a file doesn't change, it still gets a fresh timestamp in the file system.
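To make that concrete, here is a minimal sketch of what I mean; my-generator build is just a stand-in for whatever command re-renders the site, and public is the output directory:

stat -c '%y' public/index.html   # modification time before the rebuild
my-generator build               # stand-in: re-renders every page, changed or not
stat -c '%y' public/index.html   # a fresh timestamp, even if the HTML is byte-for-byte the same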

The way I used to upload the files to my S3 bucket is with the aws CLI utility. That has a lot more functionality, but the part I would always use was s3 sync. It takes an S3 URI and a folder path as input and synchronises the two. It decides which files to update based on the file size and the timestamp the file was last modified. Usually that is fine. However, since my site generator regenerates the entire website every time I make a change, file sizes and timestamps can still vary even if I didn't make significant changes. It can be tough to properly control things like whitespace when you generate web pages from templates. That means I end up using a lot more uploads than necessary, and that annoys me. Avoiding those unnecessary uploads also saves me a little bit of money, especially while I'm still on the free usage tier.
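For reference, the kind of invocation I mean looks roughly like this, with the bucket name as a placeholder; --dryrun only prints what would be transferred:

aws s3 sync public s3://<BUCKETNAME> --dryrun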

Luckily, because I use version control (as you do too, right?) and I try to make a commit every time I upload a new version, I have a local copy of the version that is on the web at all times. I figured that this setup and workflow would be pretty universal for people using a static website generator, especially since running one already requires some technical know-how. So my first instinct was to bake it right into the website generator itself.

Over-engineering a solution

If you know me, you'll know that over-engineering is my superpower. Seriously, I will spend weeks programming to save myself minutes of work if I don't keep myself in check. So I happily set out to try and integrate this idea into the site generator. Trying this, I immediately ran into two problems: git and S3. While Rust is an excellent programming language, it's still relatively new and as such lacks mature libraries for dealing with many things, git and S3 included. There are tools out there to work with those systems, but the vast majority are either very immature and unstable or very poorly documented. I am honestly surprised how poorly the aws CLI is written, given the massive technical weight behind it. One of the biggest problems I had with the documentation of both of these systems was telling whether any piece of documentation I was looking at was still relevant or deprecated. I'll spare you the details, but seriously, it was such a pain.

My next go-to was Python. I had already used boto3, the Python SDK for AWS, and while I still had some of the same problems there as I did in Rust, I was a little more comfortable with that tool. However, I wasn't happy with the tooling Python had available for interacting with git. I'm not saying the library itself was horrible, but both the installation process and the documentation for it were pretty bad. At this point I was already getting pretty frustrated, so I quickly moved on from that solution as well. Eventually, I very grumpily reached for the tool I ended up sticking with: bash.

KISS or "Keep it simple, sweety."

Originally I had hoped to make something that integrated nicely with the other systems I use, and that was robust enough that I could give it to other people. However, by this point it was starting to dawn on me that that was probably out of reach. I had run across the acronym KISS a couple of times, which stands for "Keep it simple, sweety" (usually it's actually "stupid" instead of "sweety", but I like this version better). If you search for programming help on the internet, you will inevitably run into this phrase. It's a philosophy that says things work best when they are simple and focused instead of elaborate and complicated. So I went for a quick and dirty solution that would work for me: a bash function to do the job.

To me, bash is the programming equivalent of duct tape and fishing line: it's great for quick and dirty work, but it's as crude as it is simple and it breaks very easily, so I only reach for it as a last resort. The upshot of using bash is that I can use tools like git and the aws CLI directly, instead of having to rely on some intermediate library.

My first idea was to use rsync. rsync is a widely used and well-regarded package on GNU systems. It's made for remote backups and has quite sophisticated ways of determining which files need to be updated, to avoid unnecessary uploads. Sadly, as far as I could tell, it is not compatible with the S3 protocol, so that would not work.
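For comparison, a typical rsync invocation against an ordinary server would look something like this (host and remote path are made up), and it's exactly this kind of target that S3 doesn't offer:

# -a preserves timestamps and permissions, -z compresses, --delete removes files that no longer exist locally
rsync -avz --delete public/ user@example.com:/var/www/mysite/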

A DIY rsync

While rsync itself wasn't going to work, I did want to use the same kind of idea, so I decided to make a straightforward DIY solution that acted kind of the same. Circling back to my earlier point about version control, I realised that I could use git diff HEAD --name-status to figure out which files were modified and how. That would simultaneously tell me which files I'd have to process and what I had to do with them (upload for created or modified files, delete for removed files).
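To give an idea, the output looks something like this (the file names here are made up):

$ git diff HEAD --name-status website
M       website/index.html
A       website/posts/new-post.html
D       website/posts/old-draft.html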

Because at this point I was going for quick and dirty, I decided to use AWK to process the output from git, since the output format suited that quite nicely. After some trial and error I settled on this programme:

BEGIN{FS=" "}
/^[M|A]/ {
  printf "aws s3 cp "$2;
  gsub(/'$root'\//,"",$2);
  print" s3://'$bucket'/"$2" --dryrun "
}
/^D/ {
  gsub(/'$root'\//,"",$2);
  print "aws s3 rm s3://'$bucket'/"$2" --dryrun "
}

(here the --dryrun flag is just for testing).

In case you don't know AWK: this is a little programme that outputs the cp command if the input line starts with an M (for modified) or an A (for added), and outputs the rm command if it starts with a D (for deleted). The gsub is there to strip the local directory name that I passed in, because if I want to sync the directory website and git tells me to sync website/index.html, then I need to execute aws s3 cp website/index.html s3://<BUCKETNAME>/index.html, so the directory name has to go.
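Piping a fabricated status line through the relevant rule shows the idea; the path and bucket name are placeholders:

$ printf 'M\twebsite/index.html\n' | awk '/^[MA]/ { printf "aws s3 cp "$2; gsub(/website\//,"",$2); print " s3://<BUCKETNAME>/"$2" --dryrun" }'
aws s3 cp website/index.html s3://<BUCKETNAME>/index.html --dryrun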

This is done for every line output by the previous git command. At that point all that's left is to pipe those commands back to the shell so it can execute them. Finally, I wrapped all of this in a function that takes the website directory and the bucket name as inputs, and added a git commit at the end so I won't forget to actually commit the changes I just uploaded. The final function looks like this:

function s3-update(){
    local root=$1
    local bucket=$2
    git add . \
      && git diff HEAD --name-status $root \
      | awk  'BEGIN{FS=" "}
        /^[MA]/ { printf "aws s3 cp "$2; gsub(/'$root'\//,"",$2); print" s3://'$bucket'/"$2}
        /^D/ {gsub(/'$root'\//,"",$2); print "aws s3 rm s3://'$bucket'/"$2}' | sh
    git commit;
}
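Using it comes down to calling the function with the directory and the bucket name:

s3-update <DIRNAME> <BUCKETNAME>

The git add . at the start is there because git diff HEAD doesn't list untracked files; staging everything first makes newly created pages show up as A lines.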

Some quick testing and benchmarks

To wrap it all up, I wanted to check that it actually had any benefit. I tested the function on this website, using this very blog post as a test case. With find public | wc -l I found that the whole website contains 216 files at the time of writing (including directories). Using the aws CLI like I normally would:

aws s3 sync <DIRNAME> s3://<BUCKETNAME> --dryrun | wc -l

it decided that it would re-upload 93 files. My little function, however, concluded that I'd only need to upload 11 files. That is almost an order of magnitude fewer. I am not quite sure why the aws CLI decided the others needed to be updated, but I did verify manually with diff that some of the files my function didn't select were indeed identical to the ones already live; the commands below show roughly how to reproduce that comparison. All in all, not the most elegant or robust solution, but pretty good for about an evening of work.
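Reproducing the comparison boils down to roughly this (placeholders again):

# how many files `s3 sync` would re-upload
aws s3 sync <DIRNAME> s3://<BUCKETNAME> --dryrun | wc -l

# how many files the git-based approach would touch
git diff HEAD --name-status <DIRNAME> | wc -l

# spot-check that a file the function skipped is identical to the live copy
aws s3 cp s3://<BUCKETNAME>/index.html - | diff - <DIRNAME>/index.html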