Moving on from last year’s collection of site stats, I set up CloudFront to write its logs to an S3 bucket, instead of downloading the reports every few months. Here’s the documentation on how to set that up.

So, as of July 2020, every request that goes through CloudFront gets logged and written to its own little file. To get one big, useful logfile out of these, you need to go through a few steps:

I wrote some patchy bash scripts for this. This script compiles all the logs into one big logfile; this one filters the logfile to remove bots and crawlers, then runs this one to plot a graph for each day; and finally this one strings all the graphs together into a video.
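
For anyone wanting to do the same, the first step looks roughly like the sketch below. The bucket name and local paths are placeholders, and it assumes the AWS CLI is already configured; CloudFront delivers the logs as gzipped, tab-separated files, each starting with a couple of comment lines.

# Pull down all the little log files CloudFront has written.
aws s3 sync s3://my-log-bucket/ ./raw-logs/
# Concatenate them into one big tab-separated logfile,
# dropping the '#Version' and '#Fields' comment lines from each file.
zcat ./raw-logs/*.gz | grep -v '^#' > big_logfile.tsv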

The scrolling graph here covers a 193-day period from the end of July 2020 to the beginning of January 2021. The red diamonds are posts, the black dots are cache hits, and the grey dots are cache misses.

Over that period I logged 103,992 requests, of which 33,188 count as ‘real’ requests after filtering out bots and crawlers. Of these remaining requests, ~52% were cache hits.
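
If you want to recompute that hit rate, something like the awk one-liner below should do it. It assumes the standard tab-separated CloudFront log layout, in which column 14 holds the x-edge-result-type field (Hit, RefreshHit, Miss and so on), and it runs over the filtered logfile produced by the scripts above:

awk -F"\t" '$14 ~ /^(Hit|RefreshHit)$/ { hits++ } END { printf "%.1f%% cache hit rate\n", 100 * hits / NR }' filtered_logfile.tsv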

How to filter out bots

If you find yourself handling one of these log files and want an easy way to remove automated requests, here are the relevant commands from the scripts above:

awk -F"\t" '$11 !~ /bot|crawl|Datanyze|bitdiscovery/' big_logfile.tsv > filtered_logfile.tsv;
awk -F"\t" '$8 !~ /wp|php/' filtered_logfile.tsv > tmp && mv tmp filtered_logfile.tsv;

These search columns 11 and 8, which hold the user-agent and the requested URL respectively. Anything with the string bot or crawl in the user-agent can be safely deleted. Similarly, anything with wp (WordPress) or php in the URL can also be discarded; as this site doesn’t run WordPress, it’s a safe guess that anything looking for it isn’t human.
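
If you prefer a single pass, the two filters can be folded into one awk call; lower-casing the user-agent also catches capitalised variants like Bot or Crawler. This is just a sketch of the same logic, not what the scripts above actually do:

awk -F"\t" 'tolower($11) !~ /bot|crawl|datanyze|bitdiscovery/ && $8 !~ /wp|php/' big_logfile.tsv > filtered_logfile.tsv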

Looking at the numbers above, about two-thirds of all requests are bots. Is this normal? I have no idea, but it doesn’t bother me, insofar as the cost of the bandwidth is negligible and it has no effect on performance.

Why use a CDN?

What does the cache hit rate tell us about CloudFront? 48% of requests are misses, so to what extent is the CDN redundant? Again, I can’t really answer this, but if half of the requests aren’t being served from the nearest edge cache, that seems like a problem I should address.

The key to understanding this is that (despite enthusiastic marketing claims) your content isn’t mirrored in all edge caches at all times. If someone in a faraway place asks for an infrequently-accessed page, that request will hop across the network if it’s not in the nearest cache. This is of course the exact scenario you’re trying to avoid.

I also make frequent use of cache invalidations when writing new posts and updating old ones. So maybe it’s excessive to use a globally distributed network of servers to serve up small pages on a site of such minor popularity. On the other hand, it doesn’t cost much, and I’m not sure it’s actively harmful?
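
For what it’s worth, an invalidation is a one-liner with the AWS CLI; the distribution ID and paths below are placeholders:

aws cloudfront create-invalidation --distribution-id EXAMPLEID123 --paths "/index.html" "/feed.xml"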

If I’m talking rubbish here, please correct me. And if there’s a better way of configuring CloudFront to tune performance, I’d be happy to take advice!

Ethical considerations

Is it morally okay to keep logs like this? Some companies have a rule to wipe server logs after a fixed period. Should I delete my logfile after writing this post? I won’t, but maybe I should.

I get all the standard information included in an HTTP request (IP address + user-agent), as well as the nearest edge cache, identified by its three-letter international airport code. As I suggested last year, those details collectively make it possible to fingerprint users. It’s not difficult to spot obvious patterns, to the point where a court might argue it counts as personally identifiable data.

Hello , I see you! 👋

Beyond that, I don’t see any information that isn’t already sent by default. There are no trackers here, no ads, and calls out to external resources are minimal.

Still, I can’t help feeling my position on privacy is insincere, because I still make the effort to read and process the limited information I have. Someone who truly respects privacy would maintain an attitude of principled disinterest in who their audience is.

I pivot between ‘hey read this it’s good!’ and ‘oh boy I hope nobody reads this embarrassing nonsense.’

Search Engine Optimisation

I went to the small effort of adding a robots file and a sitemap, and including word counts in article metadata. Slowly, slowly, the site is becoming more machine-readable. I’ve also started using this URL online more, testing the waters to see what effect it has on traffic.
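
For the curious, the robots file is about as simple as a text file gets; this is an illustrative version with a placeholder domain for the sitemap, not necessarily what mine says:

User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml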

I signed up to the Google Search Console, and to the respective portals for Bing and Yandex. Going through the Google index was a good way of finding broken links. Rather than pushing dodgy ‘turbo pages’ or ‘accelerated mobile pages’, I wish search engines would do more to help people make better websites, and highlight things like accessibility or W3C best practices. Use your power for good!

As for the things people search for, I’ve written about a few niche topics (e.g. libreboot) which seem to get a lot of repeated attention from Google. There are other things I’m curious about - mostly that external hits from Facebook land exclusively on one post, and that someone has pinned the site on their iPhone. That’s… interesting to note.

Finally, I’ve realised the RSS feed needs some work. It seems strict XML parsers have problems with emojis. 😬
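
A quick way to catch these problems is to check the feed for well-formedness before publishing, assuming xmllint is installed (the filename is a placeholder):

xmllint --noout feed.xml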

How has the site grown?

One way of showing this is what I pay AWS in dollars per month.

AWS cost graph

October looks expensive here because it’s the month I tried out Amazon Media Encoder. That was a dollar well spent, although it’s an outlier for the purposes of tracking growth.

I ran the same word count on posts as in 2019. It shows I’ve written 24,539 words overall, an average of 454 words per post. I wrote far less in 2020 than in 2019, both on this blog and as academic output.
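
The word count itself is nothing fancy; a rough equivalent, assuming the posts live as markdown files under a posts/ directory, would be something like:

find posts/ -name '*.md' -exec cat {} + | wc -w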

There was once a garden

we called it the earth
where is this garden
where we could have been born
and lived, nude and carefree?

– by Georges Moustaki ♬