I recently noticed Novara uploading their podcasts to YouTube, and rather than just uploading a still image, it’s accompanied by a funky audio visualisation. I know listening to music or podcasts on YouTube is a waste of bandwidth, and the audio quality isn’t great; I prefer to download mp3s once and play them locally. For future reference, Novara also put their shows on the Internet Archive, where you can get the audio files in a compressed ogg format. I sometimes need that when downloading on my phone, if I’m on the train or something.

Anyway, the audio visualisation trick is cool, and it might come in useful. For example you can’t upload mp3s to twitter (or facebook?), so if you want to share some audio, you just have to render it as a video, and jazzy effects make it seem less redundant.

At one point, facebook was so concerned with pushing video content that it forced publishers to create ‘videos’ composed of still image frames in order to reach an audience. In the end they were caught systematically overestimating video viewing metrics, making the entire charade completely pointless. Still, there are probably more people listening to audio-as-video-files than people who just download mp3s. So, let’s make some audio visualisations.

I was pretty sure I’d seen that you could render a waveform effect in FFmpeg. Here’s a whole song, rendered in purple.

waveform

And here’s a visualisation of my favourite Italian ska-punk band performing ‘Luna Rossa.’

Here’s the code behind it.

ffmpeg -i <audio_file>.mp3 -i <background_image>.jpg -ss 00:01:04 -to 00:04:16 -filter_complex \
 "[1:v]scale=1920x1080[image]; \
 [0:a]afade=t=in:st=64:d=15[a]; \
 [0:a]showwaves=mode=cline:s=1920x480:colors=SteelBlue@0.9|DarkOrchid@0.8:scale=cbrt:r=25[waves]; \
 [image][waves]overlay=0:600[bg]; \
 [bg]drawtext=timecode='00\:00\:00\:00':rate=25:fontsize=65:fontcolor='white':boxcolor='black':box=1:x=1600-text_w/2:y=80[out]" \
 -map "[out]" -map "[a]" <final_output_file>.mp4

The key thing I learned here was how to use filter_complex and the process of passing on and compositing different layers. I still don’t completely understand it, but I know more now than I used to. Let’s step through it line by line.

First, we import the audio and the background image, and cut the audio from 1:04 to 4:16.
Next, we scale video input one, the image, to 1920 by 1080 and output that to ‘image’.
Then, we take audio input zero, the audio file, and fade it in over 15 seconds, starting from 64 seconds in, and output that to ‘a’. It’s 64 seconds because we’re cutting the audio at 1:04, which is… 64 seconds. Initially I cut the whole thing at the end, just before outputting, then I moved the cut to the beginning of the command. There are implications for seeking speed, and the order is important; otherwise you get quirks like here, where the fade has to be delayed.
Then, we render the sound waves as a centred vertical line, within dimensions of 1920 by 480, using a blue colour at opacity 0.9 for the first channel and a purple colour at opacity 0.8 for the second, scaling the volume by cube root, rendering at 25 frames per second, and output to ‘waves’.
Then, we overlay ‘waves’ onto ‘image’ at 0 pixels across and 600 pixels down, which puts it somewhere around the bottom half of the screen, and output it to ‘bg’.
Then, just for fun, we open ‘bg’ and draw a timecode on top, with a whole bunch of options, and output the final thing to ‘out’. Curiously this starts at around 58 seconds in, which I can’t explain. The timecode isn’t really necessary either, I was just messing about because I could.
Finally, we map ‘out’ and ‘a’ onto the final file. The video comes out at 1920x1080 resolution, and I’ve then compressed and scaled it down to 640x360 for embedding here.

It’s fairly self-explanatory, but it still took me a couple of hours to figure out. It’s probably not the best way of doing it either; if I come back to it I’d like to go from ‘good enough’ to a proper understanding of the filter_complex process.
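If you just want the waveform with no background image, a stripped-down version of the same idea works. Here’s a sketch using FFmpeg’s sine test source, so there’s no input file to worry about (the tone, dimensions and output path are arbitrary choices for illustration):

```shell
# Generate five seconds of a 440 Hz test tone, render its waveform
# with showwaves, and mux the tone and the video together.
ffmpeg -y -f lavfi -i "sine=frequency=440:duration=5" -filter_complex \
 "[0:a]showwaves=mode=cline:s=640x360:colors=DarkOrchid@0.8:r=25[v]" \
 -map "[v]" -map 0:a /tmp/waves.mp4
```

Swap the lavfi input for `-i <audio_file>.mp3` to do the same with a real track.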

I also tried rendering blank colour backgrounds, and then white noise. Here are some frames from a video where I’ve corrupted the file with white noise.
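One way to generate a blank colour background, if you want to try this yourself, is FFmpeg’s lavfi `color` source, which produces solid frames without any input file. Something like this (the colour, size and duration are arbitrary):

```shell
# Five seconds of solid purple at 320x240 -- scale up as needed.
ffmpeg -y -f lavfi -i "color=c=purple:s=320x240:d=5" /tmp/bg.mp4
```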

corrupted file corrupted file corrupted file corrupted file

Corruption artefacts can be beautiful, especially with the distorted colours.

Another thing I learned, to watch out for: FFmpeg renders SVGs at their given pixel dimensions and no more. If you give it an SVG with dimensions of 100x100 and scale it to 2000x2000, it doesn’t re-render the SVG gracefully for crisp sharpness at the higher resolution; it just blows up that 100x100 original and you get a blurry mess. You could get around this by changing the SVG’s pixel dimensions, which are completely arbitrary in all other cases, but it feels like a kludge.
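A cleaner workaround is to rasterise the SVG at the target resolution before handing it to FFmpeg. rsvg-convert (from librsvg) can do this, if you have it installed; the filenames here are placeholders:

```shell
# Re-render the SVG at 2000x2000 so FFmpeg gets a sharp bitmap.
rsvg-convert -w 2000 -h 2000 <image>.svg -o <image>.png
```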

My previous post included a bunch of screenshots from the film ‘Sorry to Bother You’. Normally I’d do this by going into VLC media player and grabbing a screenshot whenever I see an important scene. The screenshot functionality seems to be broken on my VLC install, though, so this time I tried something a little different.
Here’s the bash script I used.

#!/bin/bash

# Grab two frames per iteration, 60 iterations: 120 screenshots.
for i in {1..60}
do
  # Random minutes and seconds in the range 10-59, so both
  # values are always two digits.
  minutes=$(( ( RANDOM % 50 ) + 10 ))
  seconds=$(( ( RANDOM % 50 ) + 10 ))
  timecode0="00:$minutes:$seconds"
  timecode1="01:$minutes:$seconds"
  ffmpeg -ss "$timecode0" -i "<video_file>.mkv" -vframes 1 "frame_00$minutes$seconds.png"
  ffmpeg -ss "$timecode1" -i "<video_file>.mkv" -vframes 1 "frame_01$minutes$seconds.png"
done

That rendered 120 screenshots at random intervals throughout the film, and I picked the best ones. I could run it again and get a different set of frames. While it’s not particularly user-friendly, the power comes from its programmability. I know Matthieu has done frame capturing similar to this, although (for good reasons) he won’t go into detail about his work.

Let’s step through it; it’s fairly simple. There’s a loop, which fires 60 times. Within the loop, we generate two random numbers between 0 and 49, add 10, and assign these as variables for minutes and seconds. We can’t have a number less than 10 because we need timecodes in 00:00:00 format, and generating 01, 02, 03, etc. seemed complicated. That’s the same reason we run FFmpeg twice, for hour zero and hour one (the film runtime is around 1 hour 50 minutes). Pass each timecode to FFmpeg, grab a single frame, output it as a png. Repeat 60 times. Done.
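For the record, zero-padding turns out to be less complicated than it seems: bash’s printf can format a number to two digits, which would free the script to use the full 0–59 range for minutes and seconds.

```shell
# printf with %02d pads single digits with a leading zero,
# so 7 becomes "07" and 42 stays "42".
minutes=$(printf "%02d" $(( RANDOM % 60 )))
seconds=$(printf "%02d" $(( RANDOM % 60 )))
timecode="00:$minutes:$seconds"
echo "$timecode"
```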

file list

Now that you’ve got a directory full of big, heavy PNGs, you could run trimage to optimise them.

trimage -d "$PWD"

That compresses everything in your directory. It might take a while, even on a beefy desktop processor. Alternatively you can use imagemagick and mogrify to jpg.
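The imagemagick route would look something like this, converting every png in the current directory to jpg (the quality setting is a guess; tune it to taste):

```shell
# Convert all pngs in the current directory to jpgs at quality 85.
# The original pngs are left in place alongside the new jpgs.
mogrify -format jpg -quality 85 *.png
```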