At the end of 2019 and 2020 I did some yearly roundups of site statistics. I skipped the yearly statistics for 2021 as I didn’t have time at the end of the year, and resolved at some point to develop a rolling automatically-updated statistics page.

Until now I had a chain of scripts which ended with a log file on my computer. From time to time I would read through the raw logs, usually after publishing a post.

[Diagram: stats_pipeline_local, the previous pipeline of scripts ending in a log file on my computer]

There’s a lot you can learn by doing this; it’s an approach which works best for a focused inspection of individual visitors. CloudFront also has a built-in dashboard for an overview of broad trends, but it only goes back a maximum of 60 days.

So, I’ve made some progress on generating graphs with a few ✨ serverless ✨ Lambda functions on AWS, and the site now has a new statistics page.

The data pipeline looks a little different.

[Diagram: stats_pipeline, the new serverless pipeline ending in graphs in a bucket]

The graphs end up in a bucket with public access, and I added a new part to the build script to download the latest graphs and include them as inline SVGs in the page. The inline SVGs can be styled with CSS, which I took advantage of to make some tweaks for dark mode; this approach also means they won’t be cached independently of the page.
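As a minimal sketch of that download-and-inline step, assuming a plain HTTP fetch from the public bucket (the bucket URL, graph file names, and output directory below are placeholders, not the real ones):

```python
# Hypothetical sketch of the "download and inline" part of the build script.
import urllib.request
from pathlib import Path

BUCKET_URL = "https://example-stats-bucket.s3.amazonaws.com"  # placeholder
GRAPHS = ["visitors.svg", "hits.svg"]                         # placeholder names
INCLUDE_DIR = Path("_includes/graphs")                        # placeholder Jekyll include dir

INCLUDE_DIR.mkdir(parents=True, exist_ok=True)

for name in GRAPHS:
    # Fetch the latest rendered SVG from the public bucket...
    with urllib.request.urlopen(f"{BUCKET_URL}/{name}") as response:
        svg = response.read().decode("utf-8")
    # ...and write it where the page template can pull it in as inline SVG,
    # so it can be styled by the site's CSS (including dark mode) and isn't
    # cached separately from the page.
    (INCLUDE_DIR / name).write_text(svg)
```

With the files sitting in an includes directory like that, the page template can pull each one in at build time.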

The extra page weight still sits well below the 14kB round trip limit when compressed.

| Compression | Page weight (kB) |
|-------------|------------------|
| None        | 25.3             |
| Gzip        | 4.4              |
| Brotli      | 3.9              |

Graphs

The first graph counts visitor numbers per month. The function used to draw that graph is here.

[Graph: visitor_graph, visitors per month]

The initial query filters for unique IP addresses, because the volume of requests on its own doesn’t give a good picture of individual visitors, which is what this graph is for. It could be better: it doesn’t give a good picture of regular or recurring visitors, but you get an idea of the general audience. After filtering out bots, this site settles at around 1,500 probably-human visitors per month (and climbing steadily).
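For a rough illustration of the kind of query involved, a Lambda could kick off something like the following asynchronously. The table and column names follow the AWS example schema for CloudFront standard logs, and the database and output location are made up, so treat the whole thing as an assumption about the real setup rather than a copy of it.

```python
# Hedged sketch: count distinct IP addresses per month with Athena.
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT date_trunc('month', "date") AS month,
       count(DISTINCT request_ip)  AS visitors
FROM cloudfront_logs
GROUP BY 1
ORDER BY 1
"""

# Start the query asynchronously; another step can poll for the result
# and hand the rows to the graph-drawing function.
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "default"},  # hypothetical database name
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # hypothetical
)
print(execution["QueryExecutionId"])
```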

I have another graph which tracks the response type. The function used to draw it is here.

[Graph: hits_graph, cache hits and misses per month]

The purpose of this graph is just to measure the performance of CloudFront. Unlike the first graph, what we’re interested in here is not the volume of requests, just whether they were hits or misses.
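For illustration, grouping CloudFront’s standard x-edge-result-type values into hits and everything else might look like the snippet below. How the real graph function does the grouping is an assumption on my part, and the numbers are made up.

```python
# Illustrative only: derive a monthly hit rate from result-type counts.
from collections import defaultdict

# (month, result_type, requests) rows as they might come back from a query
rows = [
    ("2022-05", "Hit", 4200),
    ("2022-05", "Miss", 5100),
    ("2022-05", "RefreshHit", 300),
    ("2022-05", "Error", 40),
]

totals = defaultdict(int)
hits = defaultdict(int)
for month, result_type, requests in rows:
    totals[month] += requests
    if result_type in ("Hit", "RefreshHit"):  # treat refresh hits as hits
        hits[month] += requests

for month in sorted(totals):
    print(month, f"hit rate {hits[month] / totals[month]:.1%}")
```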

The hit rate isn’t high, and that’s a consequence of not having a hugely popular website with worldwide reach. To show that the cache does scale up: I did have a short organic traffic spike on one post and saw the hit rate shoot up at the same time. It’s just not the general pattern.

A note about privacy

Every time I tackle this topic I end up diverting into questions about privacy and the ethics of collecting logs. There’s always a contradiction: if I didn’t care who was reading the site, I would not look at the logs. I was partly able to justify this to myself because the logs were being collected automatically anyway, whether I looked at them or not.

That justification is getting less and less tenable, as at this point I’m keeping archives going back two years and counting. For the purposes of GDPR compliance, IP addresses count as personal information, which would make this a legal question for any large organisation.

I thought about a system where you could click a button, which would call a Lambda function, which would remove your IP address from the existing logs. The problem with that imaginary system is that if you connected to the site afterwards, your IP address would be logged again, so I would need to store your IP address in order to filter it out on an ongoing basis… and that just loops back to the original problem.

Because I care a lot about privacy, I made a hidden service available through Tor specifically for people to view the site anonymously. If you would like to opt out of your personal information being collected, please use the hidden service.

Further improvements

Query optimisation

Currently AWS Athena scans through 224 MB of raw (compressed) log files. I’ve already spent a while reading through the Presto docs and tried a few query tricks with no success. Maybe I could write a new Lambda function to filter the log files before querying them.
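A pre-filtering Lambda might look roughly like this. The trigger shape is a standard S3 put event, but the bucket layout, the "filtered/" prefix, and the bot heuristic are all assumptions rather than anything in the real pipeline.

```python
# Hedged sketch: read a gzipped CloudFront standard log from S3, drop lines
# from obvious bots, and write the smaller file back under a separate prefix
# for Athena to scan instead (assuming the trigger only covers the raw logs).
import gzip
import boto3

s3 = boto3.client("s3")
BOT_MARKERS = ("bot", "crawler", "spider")  # crude, illustrative heuristic

def handler(event, context):
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    lines = gzip.decompress(body).decode("utf-8").splitlines()

    kept = [line for line in lines
            if line.startswith("#")  # keep the format header lines
            or not any(m in line.lower() for m in BOT_MARKERS)]

    s3.put_object(
        Bucket=bucket,
        Key=f"filtered/{key}",
        Body=gzip.compress("\n".join(kept).encode("utf-8")),
    )
```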

None of this is significant in terms of cost, and the query doesn’t have to run fast as it’s asynchronous. Optimisation would be a pursuit of efficiency for its own sake, and more about learning how to filter data in Python.

I have a query which collects the eight most popular post URLs from the last month. The query works and returns a comma-separated list, although I’m having difficulty integrating it into Jekyll’s data model.
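One possible route, sketched below, is to split the result and write it into a file under _data/, which Liquid can then read as site.data. The file name and the input format here are hypothetical.

```python
# Hypothetical glue step: turn the comma-separated query result into a
# Jekyll data file that templates can loop over.
from pathlib import Path

result = "/post-one/,/post-two/,/post-three/"  # stand-in for the query output
urls = [u.strip() for u in result.split(",") if u.strip()]

lines = ["# generated by the stats build step"]
lines += [f"- {url}" for url in urls]
Path("_data/popular_posts.yml").write_text("\n".join(lines) + "\n")
```

A Liquid loop over site.data.popular_posts could then render the list on the page.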

Other ideas under this heading are post-related statistics, such as ‘posts per month’ or ‘words per month’. I’ve already looped over ‘posts per year’ on the front page. Again, both of these would involve Liquid filters and variables, and pushing at the limits of what can be done processing things only in Jekyll.

Server timing metrics

There are some new CloudFront headers which measure performance. I haven’t enabled these yet, but I could do so and see whether they show anything useful.

Location metrics

CloudFront can include geolocation headers which guess a visitor’s location, and I could try those out. Another, less invasive way of tracking location is through the IATA airport code associated with the nearest edge cache which served the request. If I wanted to display that, I would need to link those codes up to a map somehow.
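Sketching that less invasive option: the edge location field in the logs starts with a three-letter IATA code (something like "LHR62-C1"), which could be looked up in a small airport-to-coordinates table. The lookup table below is a hypothetical stand-in for whatever dataset I would actually need to source.

```python
# Illustrative mapping from a CloudFront edge location code to rough coordinates.
AIRPORTS = {
    "LHR": (51.47, -0.45),   # London Heathrow
    "FRA": (50.03, 8.57),    # Frankfurt
    "IAD": (38.94, -77.46),  # Washington Dulles
}

def edge_to_coordinates(edge_location: str):
    """Keep the three-letter IATA prefix and look it up, or return None."""
    iata = edge_location[:3].upper()
    return AIRPORTS.get(iata)

print(edge_to_coordinates("LHR62-C1"))  # -> (51.47, -0.45)
```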

Also, tracking location is definitely a privacy issue.

Browser/device metrics

This refers to the user agent string. I don’t have a good idea of how to filter or query that string in a way that is actually useful, and I’m not sure the resulting information is interesting. Is it valuable to know that, at the time of writing, 56.7% of visitors used Chrome or Chrome Mobile?
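If I did query it, the most I can imagine wanting is a coarse grouping into browser families, something like the crude classifier below. This is illustrative only, not a real user-agent parser, and the ordering matters because Chrome user agents also mention Safari.

```python
# Crude, illustrative grouping of user agent strings into browser families.
def browser_family(user_agent: str) -> str:
    ua = user_agent.lower()
    if "edg/" in ua:
        return "Edge"
    if "chrome" in ua:
        return "Chrome"      # also matches Chrome Mobile
    if "firefox" in ua:
        return "Firefox"
    if "safari" in ua:
        return "Safari"
    return "Other"

print(browser_family("Mozilla/5.0 ... Chrome/102.0.0.0 Mobile Safari/537.36"))
```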