10 August 2009

gdb to the rescue

I've written a program that starts a bunch of threads, reads a bunch of data from various computers on the Internet, packs it up, and ships it off to another computer for safekeeping. It runs several times an hour, triggered by a cron job. I also have a watchdog alarm that goes off if that process hasn't run for a while.

Today, the alarm went off. I investigated and realized that one of the several thousand threads it launched was hanging because I'd opened a socket without setting a timeout. The process won't finish until each and every thread finishes, so that single socket was holding up everything. I wrote the program in Python with urllib, and I'd failed to add lines like these to the top:

import socket

SOCKET_TIMEOUT = 20  # seconds
socket.setdefaulttimeout(SOCKET_TIMEOUT)
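
To make the effect of that timeout concrete, here is a minimal sketch in Python 3. It is an illustration, not the original program, which was older urllib code. It sets up a local listener that never sends anything, playing the part of the unresponsive server, and shows a read giving up instead of hanging forever. The helper names start_silent_server and read_or_give_up are made up for this example.

```python
import socket

SOCKET_TIMEOUT = 20  # seconds, the same value as in the snippet above
socket.setdefaulttimeout(SOCKET_TIMEOUT)


def start_silent_server():
    """Hypothetical stand-in for a dead peer: listens but never replies."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    return server


def read_or_give_up(host, port):
    """Return the bytes read, or None if the peer stays silent too long."""
    # No explicit timeout is passed, so the socket inherits the
    # process-wide default set by socket.setdefaulttimeout().
    with socket.create_connection((host, port)) as conn:
        try:
            return conn.recv(1024)
        except socket.timeout:
            return None
```

Without the setdefaulttimeout() call, the recv() in this sketch would block indefinitely against a silent peer, which is the situation the stuck threads were in.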

The problem was that I didn't want to lose the data that was inside these processes by simply killing them -- it's time-sensitive. If I lost those readings, there would be no way to get them back. But I couldn't think of an easy way to close those sockets. I looked into tcpkill, but that only works if packets are moving back and forth. These threads were hanging because the destination server wasn't sending anything.

Fortunately, there was a fairly straightforward solution.

First, I did a quick scan for the stuck processes:
$ lsof |grep python|grep www

python 14139 jdb 12u IPv4 155056253 TCP mybox:49535->failpc:www (ESTABLISHED)
python 24844 jdb 28u IPv4 154951225 TCP mybox:60415->failpc:www (ESTABLISHED)
python 26287 jdb 26u IPv4 153763806 TCP mybox:44628->failpc:www (ESTABLISHED)
python 28168 jdb 9u IPv4 155145417 TCP mybox:37225->failpc:www (ESTABLISHED)

This used lsof to list all sockets held by Python processes that were connected to port 80 (www) on the remote machine. "mybox" is my machine, and "failpc" is the machine that's doing nothing.

The fourth column is the file descriptor for those sockets. The 'u' means that the descriptor is both readable and writable.
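
With many stuck processes, picking the PID and descriptor out by eye gets tedious. The following is a rough sketch of how those two columns could be extracted in Python; parse_lsof_line is a hypothetical helper, and it assumes the whitespace-separated layout shown in the output above.

```python
import re


def parse_lsof_line(line):
    """Return (pid, fd) for one line of lsof output, or None if it has no numeric FD."""
    fields = line.split()
    if len(fields) < 4 or not fields[1].isdigit():
        return None
    # Column 2 is the PID; column 4 is the descriptor number followed by
    # an access-mode letter, e.g. "12u". Entries like "cwd" are skipped.
    match = re.match(r"(\d+)", fields[3])
    if match is None:
        return None
    return int(fields[1]), int(match.group(1))


sample = """\
python 14139 jdb 12u IPv4 155056253 TCP mybox:49535->failpc:www (ESTABLISHED)
python 24844 jdb 28u IPv4 154951225 TCP mybox:60415->failpc:www (ESTABLISHED)
"""

stuck = [parse_lsof_line(line) for line in sample.splitlines()]
print(stuck)  # [(14139, 12), (24844, 28)]
```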

I used gdb to unstick the processes by closing the sockets:

$ gdb /usr/bin/python
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-linux-gnu"...
(gdb) attach 14139
Attaching to program: /usr/bin/python, process 14139
Reading symbols from /lib/i686/cmov/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 0xb7d416c0 (LWP 14139)]
[New Thread 0xb5906b90 (LWP 20872)]
Loaded symbols for /lib/i686/cmov/libpthread.so.0

(a whole bunch of loading-symbols cruft was snipped)

(gdb) print close(12)
[Switching to Thread 0xb7d416c0 (LWP 14139)]
$1 = 0
(gdb) cont
Continuing.
[Thread 0xb5906b90 (LWP 20872) exited]

Program exited normally.


What I did was open gdb, attach to the running python process, and call close() on the socket. The offending thread exited, and the supervisor thread eventually noticed, gathered the data from all the threads, saved it, and exited.
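
The same attach-and-close step can probably be scripted rather than typed interactively. This is a sketch under a few assumptions, not what I actually ran: it relies on gdb's -p, -batch, and -ex options, it casts the call to int because newer gdb versions may refuse to call close() without knowing its return type, and the function names are invented for the example.

```python
import subprocess


def gdb_close_command(pid, fd):
    """Build a gdb command line that attaches to pid and calls close(fd)."""
    return [
        "gdb",
        "-p", str(pid),  # attach to the running process
        "-batch",        # run the -ex commands, then exit
        "-ex", "call (int)close(%d)" % fd,
    ]


def close_fd_in_process(pid, fd):
    """Run gdb for real; this needs permission to trace the target process."""
    return subprocess.run(gdb_close_command(pid, fd), check=True)
```

Calling close_fd_in_process(14139, 12) would correspond to the session above, except that gdb should detach when the batch finishes instead of waiting at a prompt for the process to exit.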

The reality was a bit messier: the threads opened two sockets in sequence, so I had to repeat the above steps for each thread: I ran lsof to find the new file descriptor, hit Control-C in gdb to stop the process again, and closed the new FD, too. Alternatively, you can leave tcpkill running with parameters like this:

tcpkill -i eth0 dst host failpc

...which kills new connections to failpc as soon as they are opened. I used that approach for most of the connections to save time.

You might not be able to reuse a gdb session -- in my case, if I tried to attach to a second process after the first one exited, gdb printed warnings like:

warning: Cannot initialize thread debugging library: versions of libpthread and libthread_db do not match

I found it easier just to quit and start a new session each time.

2 comments:

  1. I have a rather good-looking paramour ^_^

  2. wow, you rock! This was just what I needed this morning to fix a critical system issue.

    We have put together an awesome team of hackers at our company and, from reading your blog, looks like you would fit right in (python, scala, ec2). We are in NYC as well.

    http://www.invitemedia.com/

    Feel free to drop me a line if you are interested in hearing more. scott (at) invitemedia.com

    Thanks again!
    Scott


About Me

blog at barillari dot org Older posts at http://barillari.org/blog