29 October 2009

lxml on RHEL

Screen-scraping again. I opted for lxml instead of BeautifulSoup because lxml is faster and BeautifulSoup appears to be on the way out. (Its author explains: "I no longer enjoy working on Beautiful Soup.")

lxml is easy to install on Debian; you just say "apt-get install python-lxml". Red Hat is another story.

I haven't had root access to a Red Hat box since 2002 or so, when my TP i1482 laptop died* and I installed Debian on its replacement. The screen-scraper needs to run on a Red Hat box.

* When the backlight failed and I couldn't get a replacement to work properly, taking the housing off the screen and shining a bright light through it made it sort-of usable. An external monitor was even better. Then, the power supply or possibly the mobo started to fail and it took many tries to get it to boot. Unfortunately, my too-rapid power cycling blew the hard disk. At that point, I retired it.

The box's yum repositories did not contain an lxml package, so lxml had to be built from source. lxml in turn depends on versions of libxml2 and libxslt more recent than those in the repository, so I had to install those first.

I tried binary RPMs of libxml2 and libxslt, but they wanted a newer glibc (2.7) and a newer rpm (one with FileDigests support) than the box had:

# rpm -i libxml2-2.7.6-1.x86_64.rpm
warning: libxml2-2.7.6-1.x86_64.rpm: Header V3 DSA signature: NOKEY, key ID de95bc1f
error: Failed dependencies:
libc.so.6(GLIBC_2.7)(64bit) is needed by libxml2-2.7.6-1.x86_64
rpmlib(FileDigests) <= 4.6.0-1 is needed by libxml2-2.7.6-1.x86_64
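
The GLIBC_2.7 line means the RPM was built against glibc 2.7 or later. A rough way to check a box before trying an RPM like this is sketched below. It assumes a glibc system where `getconf GNU_LIBC_VERSION` works; `version_ge` is a hypothetical helper name, not something from rpm or glibc.

```shell
# Hypothetical helper: succeeds if dotted version $1 >= dotted version $2.
# Fields are compared numerically, so 2.10 counts as newer than 2.7.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, "."); m = split(b, y, ".")
    max = (n > m) ? n : m
    for (i = 1; i <= max; i++) {
      if ((x[i] + 0) > (y[i] + 0)) exit 0
      if ((x[i] + 0) < (y[i] + 0)) exit 1
    }
    exit 0
  }'
}

# getconf prints something like "glibc 2.5"; keep only the number.
have=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')

if [ -z "$have" ]; then
  echo "could not determine the glibc version"
elif version_ge "$have" 2.7; then
  echo "glibc $have satisfies GLIBC_2.7"
else
  echo "glibc $have is older than 2.7; the binary RPM will not install"
fi
```

This only covers the glibc line. The rpmlib(FileDigests) line is about the rpm tool itself, which this sketch does not check.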

Then I tried the source packages:

# rpmbuild --rebuild libxml2-2.7.6-1.src.rpm
Installing libxml2-2.7.6-1.src.rpm
warning: InstallSourcePackage: Header V3 DSA signature: NOKEY, key ID de95bc1f
warning: user veillard does not exist - using root
error: unpacking of archive failed on file /usr/src/redhat/SOURCES/libxml2-2.7.6.tar.gz;4aea24a1: cpio: MD5 sum mismatch
error: libxml2-2.7.6-1.src.rpm cannot be installed

Googling this error did not reveal anything useful, so I elected to install both packages from source. Fortunately, this wasn't too complicated. Here's a cleaned-up version of what I did.

mkdir -p "$HOME/mylib" # private prefix, so the rpm-installed versions are not clobbered
(cd libxml2-2.7.6 && ./configure --prefix="$HOME/mylib" && make && make install)
(cd libxslt-1.1.26 && ./configure --prefix="$HOME/mylib" --with-libxml-prefix="$HOME/mylib" && make && make install)
(cd lxml-2.2.2 && python2.6 setup.py build --with-xslt-config="$HOME/mylib/bin/xslt-config" --with-xml2-config="$HOME/mylib/bin/xml2-config")

Invoke Python as follows, and lxml should work:

env LD_LIBRARY_PATH=$HOME/mylib/lib/ PYTHONPATH=$HOME/lxml-2.2.2/build/lib.linux-x86_64-2.6/ python2.6
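
Typing that every time gets old. One option is to put the two variables in a small shell function and source it from a script or login profile. This is only a sketch: `use_private_lxml` is a made-up name, and the default paths are the ones used above, so adjust them if yours differ.

```shell
# Hypothetical helper: export the same two variables as the env command above.
# $1 is the private install prefix, $2 is the lxml build directory.
use_private_lxml() {
  _prefix=${1:-$HOME/mylib}
  _build=${2:-$HOME/lxml-2.2.2/build/lib.linux-x86_64-2.6}
  # Prepend rather than overwrite, so any existing entries survive.
  LD_LIBRARY_PATH="$_prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
  PYTHONPATH="$_build${PYTHONPATH:+:$PYTHONPATH}"
  export LD_LIBRARY_PATH PYTHONPATH
}

use_private_lxml
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
echo "PYTHONPATH=$PYTHONPATH"

# With the variables set, this should print the libxml2 version lxml is
# actually running against (left commented out, since it needs the build):
# python2.6 -c 'from lxml import etree; print(etree.LIBXML_VERSION)'
```

Unlike the env one-liner, this changes the environment for the rest of the shell session, so every later command sees the private libxml2 as well.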

1 comment:

  1. Dude this was *extremely* helpful. I have to do this on a Redhat box with an ancient version of libxml which I can't touch, and your instructions worked perfectly. Thanks!

