19 November 2009

How to shoot yourself in the foot, part 391

Very early this morning, I noticed that the oom-killer, the Linux kernel function that tries to free memory by killing processes when the system runs out, had killed my database. This wasn't a huge problem, because the database is durable in the face of unexpected termination (or so it says on the packaging), and the data is replicated in several different ways anyway. But it was still annoying, because it meant that I had to manually restart the update process that had been running when it was killed.
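For anyone who wants to check for the same thing, here is a minimal sketch of spotting oom-killer activity in the kernel log. The function name `oom_lines` is made up for illustration, and the wording of the kernel's messages varies between versions, so the pattern is an assumption that may need adjusting.

```shell
# Filter kernel log text on stdin down to lines that look like OOM-killer
# activity. The two patterns are assumptions about the message wording,
# which differs between kernel versions.
oom_lines() {
    grep -i -E 'oom-killer|out of memory'
}

# Typical use (reading the kernel log may need root on some systems):
#   dmesg | oom_lines
```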

Later that morning, the oom-killer woke up and killed the database again. I took a careful look at the system and noticed that it had no swap file! That was a bit odd.
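A quick way to confirm that a machine has no swap is to read the `SwapTotal` line from `/proc/meminfo`. The sketch below assumes a Linux system; the helper name `swap_total_kb` and its optional file argument are hypothetical, added only so the parsing can be tried against a sample file.

```shell
# Print the total configured swap in kB, parsed from a meminfo-style file.
# Defaults to /proc/meminfo (Linux); the optional argument is for testing.
swap_total_kb() {
    awk '/^SwapTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Typical use:
#   swap_total_kb        # prints 0 on a machine with no swap
#   cat /proc/swaps      # lists each active swap file or partition
```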

I googled a bit, and found a blog debate about swap files, the oom-killer, and MySQL. One guy contended that it was better to just let the oom-killer work, because if a computer started thrashing, you might be completely locked out, unable to even ssh in to fix things. Better to keep the system running and rely on InnoDB's inherent safety against crashes.

Bosh, I thought, what a silly idea. On a swapless system, a single memory leak would immediately trigger the oom-killer, instead of letting the machine slow down gracefully as it thrashed.

So I merrily added a swap file to the system. I did not want to shut anything down to repartition and add a swap partition, so I swallowed the performance hit (hey, it's never supposed to swap, anyway, right?) and added a file on the root filesystem.

# dd if=/dev/zero of=/swap.file bs=1M count=2048
# mkswap /swap.file
# echo "/swap.file swap swap defaults 0 0" >> /etc/fstab
# swapon -a
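One thing the commands above leave out is tightening the file's permissions: recent versions of `mkswap` and `swapon` warn about a world-readable swap file and suggest mode 0600. Here is a hedged sketch of the same steps with that added. The helper `swap_file_commands` is hypothetical, and it only prints the commands so they can be reviewed before being run as root; the `/etc/fstab` entry is still needed for the swap file to come back after a reboot.

```shell
# Print (do not run) the commands to create and enable a swap file.
# Arguments: path to the swap file, size in megabytes.
swap_file_commands() {
    path=$1
    size_mb=$2
    echo "dd if=/dev/zero of=$path bs=1M count=$size_mb"
    echo "chmod 600 $path"    # swap can hold sensitive memory pages
    echo "mkswap $path"
    echo "swapon $path"
}

# Show the commands for the 2 GB file used in this post.
swap_file_commands /swap.file 2048
```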

Many hours later, my phone started chirping. Every time something goes wrong, I get a text message. When a lot of things go wrong (for instance, the server is completely inaccessible), I get a lot of text messages. I was getting a lot of text messages.

Of course, the server was completely inaccessible. Ping worked just fine, but ssh hung and went nowhere.


I flipped on the EC2 monitoring process (1.5 cents an hour) to see what was going on. Strangely, while the CPU was pegged at 100%, the disk read and write loads were near-zero, which made me wonder if something else was amiss. I waited for perhaps twenty minutes, then I hit the reboot switch.

Nothing happened.

Or rather, the instance went from "rebooting" back to "running" much more quickly than I would have expected. I still couldn't connect, but I could ping.

I tried another reboot. Then the pinging stopped.

Then it started again. I sshed in -- and I connected! Luckily, everything was as I'd left it -- the EBS volume remained mounted, the database came back online. Even the swapfile came back up. When I refreshed the monitor, I saw this:

[Screenshot: EC2 monitoring graph]

The middle of the y-axis appears to be 1.0 x 10^19 bytes, which I suspect is an error of some sort.

Then suddenly my terminal stopped echoing. Had the system run out of memory already? I checked the system log through the Amazon management console -- not obviously so. There were no new oom-killer messages. Then I realized that this was the second reboot finally taking effect. A few minutes later, I could ssh in again -- to the freshly booted machine. Oops.

I wasted no time in chopping the swap down to 64 megabytes.

UPDATE: I found the cause of the OOM. I left a single expression out of a WHERE clause, turning a simple join into the Cartesian product of two very large tables. Oops.
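To see why one missing condition matters so much, here is a toy illustration in shell rather than SQL. The two "tables" and the helper `count_pairs` are invented for this sketch and have nothing to do with the real schema. With the key comparison in place, each row pairs only with its match; without it, every row pairs with every other row, so the result grows as the product of the two table sizes.

```shell
# Two tiny made-up "tables": each line is "key value".
orders='1 apple
2 pear
3 plum'
customers='1 alice
2 bob
3 carol'

# Count the row pairs a nested-loop join would emit. With mode "keyed" the
# keys must match (the join condition); any other mode drops that condition.
count_pairs() {
    mode=$1
    printf '%s\n' "$orders" | while read -r okey orest; do
        printf '%s\n' "$customers" | while read -r ckey crest; do
            if [ "$mode" != keyed ] || [ "$okey" = "$ckey" ]; then
                echo x
            fi
        done
    done | wc -l | tr -d ' '
}

count_pairs keyed      # prints 3
count_pairs unkeyed    # prints 9
```

Three rows against three rows is harmless. Two very large tables are not: a million rows on each side would mean a trillion row pairs.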


About Me

blog at barillari dot org Older posts at http://barillari.org/blog