I’ve not written much on this blog lately. The new term has been uneventful and, as sometimes happens, I just don’t have much to talk about. I missed out on a teaching opportunity, and I’m not making progress with my research. With every new step my project gets more and more vague. Everyone is stressed and very busy, yet nothing is really going on and the university campus is still basically empty.
I’m also dealing with a series of broken things:
- More holes in clothes, which need sewing up in time for winter. I repaired my broken mask straps though!
- Gears skipping on my bike, leading to a few wobbly near-crashes.
- I attempted, and failed, to replace a broken light switch, which broke the circuit for the light in the toilet/shower. So now I have to pee in the dark, or use my phone as a torch.
Unlike my PhD, these are all good sorts of problems to have: frustrating, but ultimately solvable with time and effort.
So, I was glad last night to have finally fixed a fairly major issue with my computer. An easy procrastination target with a satisfying result.
To recap: back in May I built my fourth computer. It’s almost identical to the part list I posted in February, minus the fancy Noctua cooler, and with a smaller NVMe boot drive.
Here’s what the system looks like right now:
    pierre@borvo
    ------------
    OS: Ubuntu 20.10 Groovy Gorilla (development branch) x86_64
    Kernel: Linux 5.8.0-20-generic
    Packages: 2097 (dpkg), 13 (snap)
    Shell: bash 5.0.17
    Desktop: GNOME 3.38.0
    Terminal: gnome-terminal
    Motherboard: X570 AORUS ELITE
    CPU: AMD Ryzen 7 3700X (8) @ 3.6GHz [40.7°C]
    GPU: AMD Radeon RX 5500 XT
    Memory: 1525MiB / 16014MiB
    Boot Disk: 219GB (13%)
    Hard Disk: 3.6TB (8%)
    Scratch Disk: 440GB (19%)
Everything is held together in a Corsair 750D case. I had some trouble aligning the preinstalled IO shield with the backplate, and I’m not sure that some of the system panel connectors are in the right place, but otherwise it was a straightforward assembly.
In general it worked fine, except for occasional crashes. At irregular intervals the system would freeze, some processes would stumble on, and inevitably within about 30 seconds it would shut down entirely. The computer was completely usable, so long as you accepted that it could unexpectedly stall at any moment. I got very good at quick saving.
This processor paired with the X570 chipset already showed problems with GNU/Linux last year. These were solved in kernel versions 5 and later, so it’s no longer an issue, although it’s definitely not a good sign for overall reliability.
There were separate reports of memory issues with Corsair RAM. I turned off XMP, running my fancy Corsair sticks at 2133MHz instead of 3000MHz. A lot of the recent AMD performance improvements are reliant on fast memory, and either way slowing down the RAM didn’t seem to make a difference.
I noticed a consistent error in Gnuplot warning that the random number generator might be broken. And the warning was right! Run the code linked in this article to see if you’ve got this bug.
Sure enough, rdrand was returning 0xffffffff every time.
    RDRAND() = 0xffffffff
    RDRAND() = 0xffffffff
    RDRAND() = 0xffffffff
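A stuck generator like this can also be spotted automatically. Here is a minimal Python sketch, assuming the test program prints lines in the `RDRAND() = 0x...` format shown above. The function names `parse_rdrand_output` and `looks_stuck` are my own, made up for illustration, and the script only inspects captured output rather than calling the instruction itself.

```python
import re

# Matches entries such as "RDRAND() = 0xffffffff" (format assumed from the output above).
RDRAND_LINE = re.compile(r"RDRAND\(\)\s*=\s*0x([0-9a-fA-F]+)")


def parse_rdrand_output(text):
    """Extract every RDRAND value from the test program's output as an int."""
    return [int(value, 16) for value in RDRAND_LINE.findall(text)]


def looks_stuck(values):
    """Return True if at least two samples were taken and all are identical.

    Identical 32-bit values from a healthy generator are extremely unlikely,
    so even a few equal samples are a strong hint of the bug.
    """
    return len(values) >= 2 and len(set(values)) == 1


broken = "RDRAND() = 0xffffffff\nRDRAND() = 0xffffffff\nRDRAND() = 0xffffffff"
print(looks_stuck(parse_rdrand_output(broken)))  # True
```

Three samples is a tiny test, of course, but for a failure this blatant it is enough.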
I got the latest BIOS from Gigabyte (F30), which has new CPU microcode.
    RDRAND() = 0xf49bc747
    RDRAND() = 0x7a2ee071
    RDRAND() = 0x7198fa4d
Unfortunately, this didn’t fix the crashing.
I thought a newer kernel might make a difference, so I upgraded early to the Ubuntu 20.10 beta. It moves the kernel from 5.4 to 5.8, and it’ll be released properly in about two weeks, so it’s no big risk to run the beta now.
This error was coming up in the logs on startup:
mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1601903655 SOCKET 0 APIC 5 microcode 8701021
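The fields in that line are less cryptic than they look. Below is a rough Python sketch that decodes them, on the assumption that the number before the colon is the kernel's vendor code (0 for Intel, 2 for AMD), the hex value after it is the CPUID signature, and `TIME` is a Unix timestamp. The helper name `decode_mce` is hypothetical, and the family/model arithmetic follows the standard CPUID layout as I understand it.

```python
import re
from datetime import datetime, timezone

# Vendor codes assumed to follow the Linux kernel's X86_VENDOR_* numbering.
VENDORS = {0: "Intel", 2: "AMD"}

MCE_LINE = re.compile(
    r"PROCESSOR (\d+):([0-9a-f]+) TIME (\d+) SOCKET (\d+) "
    r"APIC ([0-9a-f]+) microcode ([0-9a-f]+)"
)


def decode_mce(line):
    """Split an mce 'PROCESSOR ...' log line into readable fields."""
    match = MCE_LINE.search(line)
    if match is None:
        raise ValueError("not an mce PROCESSOR line")
    vendor, cpuid_hex, time_s, socket, apic, microcode = match.groups()

    cpuid = int(cpuid_hex, 16)
    stepping = cpuid & 0xF
    model = (cpuid >> 4) & 0xF
    family = (cpuid >> 8) & 0xF
    if family == 0xF:
        # Extended fields only apply when the base family is 0xF.
        model |= ((cpuid >> 16) & 0xF) << 4
        family += (cpuid >> 20) & 0xFF

    return {
        "vendor": VENDORS.get(int(vendor), "unknown"),
        "family": family,
        "model": model,
        "stepping": stepping,
        "time": datetime.fromtimestamp(int(time_s), tz=timezone.utc).isoformat(),
        "socket": int(socket),
        "apic": int(apic, 16),
        "microcode": int(microcode, 16),
    }


info = decode_mce(
    "mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1601903655 "
    "SOCKET 0 APIC 5 microcode 8701021"
)
print(info["vendor"], f"family {info['family']:#x}, model {info['model']:#x}")
# AMD family 0x17, model 0x71
print(info["time"])  # 2020-10-05T13:14:15+00:00
```

So the error came from an AMD family 17h part, which fits the 3700X. The line itself doesn't say what went wrong, only which core and microcode revision reported it.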
The main fix was to disable the Cool’n’Quiet setting in the BIOS, as suggested here. If you’ve landed on this post looking for a solution to a similar problem, try this first.
Maybe the processor was stalling once one of the cores hit an idle state, triggering the power saving measures? I don’t know why this works, because I don’t entirely understand the problem in the first place.
For the sake of stability, I disabled simultaneous multithreading, taking me from 16 threads down to 8. Most normal use doesn’t need more than eight threads at the moment anyway. I also disabled the clock boost, so the system is limited to the 3.6GHz base clock. I can always re-enable the turbo frequency up to 4.4GHz when I’m expecting to do heavy workloads.
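To confirm from within Linux that the BIOS changes actually took effect, something like the following Python sketch should work. It assumes the kernel exposes `smt/active` and `cpufreq/boost` under `/sys/devices/system/cpu`, which depends on the kernel version and cpufreq driver, so a missing file is reported as `None` rather than treated as an error. The function names are my own.

```python
from pathlib import Path


def read_sysfs_flag(path):
    """Return the integer stored in a sysfs file, or None if it can't be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def cpu_settings(cpu_root="/sys/devices/system/cpu"):
    """Report whether SMT and clock boost are currently on (1), off (0), or unknown (None)."""
    root = Path(cpu_root)
    return {
        "smt_active": read_sysfs_flag(root / "smt" / "active"),
        # This file is only present with some cpufreq drivers (an assumption here).
        "boost": read_sysfs_flag(root / "cpufreq" / "boost"),
    }


print(cpu_settings())
```

With both features disabled I'd expect both values to come back as 0, though I haven't checked which files every driver provides.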
And now, my computer feels super stable, no crashes at all! I have a few changes to make to the cooling, though for the moment I’m happy with this.