Crawling LJ
Apr. 24th, 2006 | 07:35 pm
For the final project for my web architecture class, I can choose what I want to do as long as it's sufficiently webby. I have a lot of ideas saved, but I'll probably work alone and the project is due in less than a month; the proposal is due Thursday. I'd love to do some text analysis, but I'm pretty sure it would take more time than I have.
Instead, I'm considering writing a recommendation system. The obvious problem is how to get enough data to produce useful recommendations. I could beg for people to rate things, but I'd rather not wager my grade on it. The obvious solution is to pull data from some existing source. I've decided to tap LiveJournal for lists of subscriptions. Instead of just using /misc/fdata.bml, I'm using the relatively new /tools/opml.bml. This will allow me to recommend communities and syndicated accounts as well as users.
Crawling is straightforward; I just pick out usernames and put them into the queue. Sadly, it took me a whole day to write the spider. I decided to use Ruby for the first time in quite a while, so I had to go look up parts of the language that I'd forgotten, and I had quite a bit of trouble with libraries. sqlite-ruby kept compiling against OS X's bundled copy of SQLite, which is about a year old, instead of the latest version. I finally decided to abandon the gems package manager and install it semi-manually. When I did, I discovered that there's an SQLite 3.x version that I should be using instead.
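The queue logic is simple enough to sketch. This is only an illustration, not the actual spider: `crawl`, `fetch_subscriptions`, and the toy `graph` are made-up names, and the lambda stands in for the real HTTP request plus OPML parsing.

```ruby
require "set"

# Breadth-first crawl: take a username off the queue, fetch its
# subscriptions, and enqueue every name that has not been seen yet.
# `fetch_subscriptions` is a hypothetical callable that returns an
# array of usernames for the given user.
def crawl(seed, fetch_subscriptions, limit: 1_000)
  queue = [seed]
  seen  = Set.new([seed])
  edges = {}

  until queue.empty? || edges.size >= limit
    user = queue.shift
    subscriptions = fetch_subscriptions.call(user)
    edges[user] = subscriptions

    subscriptions.each do |name|
      # Set#add? returns nil when the name was already present.
      queue << name if seen.add?(name)
    end
  end

  edges
end

# Toy in-memory graph used in place of LiveJournal.
graph = {
  "alice" => %w[bob carol],
  "bob"   => %w[alice dave],
  "carol" => [],
  "dave"  => %w[alice]
}

result = crawl("alice", ->(user) { graph.fetch(user, []) })
puts result.keys.inspect
```

The `seen` set is what keeps the crawl from looping forever on mutual subscriptions; the queue itself only ever holds names that have not been fetched.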
The other problem I had was XML parsing. REXML may have a nice API, but I'm worried about its competence (regex-based parsing seems likely to always have problems). The libxml bindings I found were crufty, so I turned to expat bindings. They seem crufty too. Fortunately, libxml has been getting some love under a new name.
The last time I checked, it had crawled through about 16,000 users and the database was 100MB. On the other hand, it has 400,000 to go ... and that number keeps growing. Based on stuff posted to lj_research, I expect it to find four or five million users. I estimate it will take ten to fourteen days to complete.
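For a sense of scale, here is the back-of-the-envelope arithmetic behind those numbers. It assumes "100MB" means 10^8 bytes and that the per-user footprint stays constant, neither of which is guaranteed.

```ruby
# Figures quoted above.
crawled_users = 16_000
db_bytes      = 100 * 1_000_000   # assumption: "100MB" = 10^8 bytes
expected      = [4_000_000, 5_000_000]
days          = [10, 14]

bytes_per_user = db_bytes / crawled_users.to_f

# Projected database size if every user costs about the same.
projected_gb = expected.map { |n| n * bytes_per_user / 1e9 }

# Sustained fetch rate needed for the slowest and fastest estimates.
seconds_per_day = 86_400
slowest = expected.min / (days.max * seconds_per_day).to_f
fastest = expected.max / (days.min * seconds_per_day).to_f

# For comparison: how long the crawl would take at one user per second.
days_at_one_per_second = expected.map { |n| n / seconds_per_day.to_f }

puts format("%.0f bytes per user", bytes_per_user)
puts format("%.2f to %.2f GB projected", *projected_gb)
puts format("%.1f to %.1f users per second needed", slowest, fastest)
puts format("%.0f to %.0f days at one user per second", *days_at_one_per_second)
```

Under those assumptions the database ends up somewhere around 25 to 31 GB, and the ten-to-fourteen-day estimate only holds if the crawler sustains several users per second. A full one-second pause after every single request would stretch the crawl to well over a month.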
I was surprised to have quite a few errors occur. The most common one, by far, is caused by non-UTF8 characters in the titles of some journals. There are a few other areas, like interests, that have this problem on LJ. I'm just as uncertain as ever about what to do. The strings are short enough that character set detection is unlikely to be effective. So far it only affects 0.5% of users crawled, so I'm not terribly worried about it.
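One low-effort option is to skip detection entirely: keep the title, swap the offending bytes for the Unicode replacement character, and flag the record for later. The sketch below assumes a Ruby new enough to have `String#scrub` (2.1 or later); `clean_title` is a made-up helper name.

```ruby
# Returns [text, was_valid]. Bytes that are not valid UTF-8 are replaced
# with U+FFFD so the rest of the title can still be stored.
def clean_title(raw_bytes)
  text = raw_bytes.dup.force_encoding(Encoding::UTF_8)
  return [text, true] if text.valid_encoding?

  [text.scrub("\uFFFD"), false]
end

good = "caf\u00E9 diary"       # proper UTF-8
bad  = "caf\xE9 diary".b       # Latin-1 byte for e-acute, invalid as UTF-8

p clean_title(good)
p clean_title(bad)
```

This loses the original character, but with only about 0.5% of users affected, keeping a flag and the raw bytes elsewhere may be good enough until a better fix comes along.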
I've also had a handful of HTTP errors, but they were more annoying than anything. After adjusting the exception handling it seems to be fine. I've taken care to be as friendly as requested. The bot uses persistent connections and takes regular one-second breaks. The only thing I see to improve is applying the pauses after every request, but I'm not sure how reliable Ruby's Kernel.sleep is. I tried to do signal trapping so I could gracefully stop it, but that doesn't seem to work; I probably need to catch some other signal.
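A rough sketch of what pausing after every request and stopping gracefully could look like. `PoliteLoop` and `fetch` are hypothetical names, not the real bot. It traps both INT (what Ctrl-C sends) and TERM (what a plain `kill` sends), on the guess that the missing signal is one of those two.

```ruby
# Polite crawl loop: pause after every request, stop cleanly on a signal.
# `fetch` is a hypothetical callable standing in for the real HTTP request.
class PoliteLoop
  def initialize(pause: 1.0)
    @pause = pause
    @stop  = false
  end

  # Safe to call from a signal handler: it only flips a flag.
  def stop!
    @stop = true
  end

  def run(queue, fetch)
    done = []
    until @stop || queue.empty?
      user = queue.shift
      done << fetch.call(user)
      sleep(@pause)   # pause after each request, not just now and then
    end
    done
  end
end

crawler = PoliteLoop.new(pause: 0.01)   # tiny pause so the demo runs quickly

# The handlers do no real work; the loop finishes the current request
# and exits at its next check, so the queue is never left half-updated.
%w[INT TERM].each { |sig| Signal.trap(sig) { crawler.stop! } }

pending  = %w[alice bob carol]
finished = crawler.run(pending, ->(user) { user.upcase })
puts "fetched #{finished.inspect}, #{pending.size} left in queue"
```

Keeping the handler down to a flag flip sidesteps most of the trouble with doing real work inside a trap block; whatever is still in the queue can be written out after `run` returns.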
I hope to implement a user-item and an item-item algorithm, but I still need to read some papers and write the proposal :p It just seemed sensible to get the data collection started immediately.