Kevin Vance - King Scraper 2


02:23 pm


Tuesday, April 18th, 2006
So last night lj_filter_finder (possible tagline: The Gentleman's Privacy Invader) got good enough to run against my friends list:

% ./lj_filter_finder -f kvance -d yesterday kvance $PASS
Logging in: ok
Retrieving kvance's friends list: ok
edanya showing 3 / 5 entries.
evan showing 0 / 1 entries.
2 users filtered kvance for friends of kvance on 2006-04-17.

There are still 18 or 19 free styles I haven't written parsers for. A parser can be easy or hard to write, depending on how well structured the style's markup is. Styles like S1 Notepad have no structure at all, so you pretty much have to guess. But in the best case, it looks like this:

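# Assumes the Beautiful Soup of the time, where first() returns the
# library's Null sentinel (not None) when nothing matches.
from BeautifulSoup import Null

class Nebula_DayParser:
    """Parse the Nebula S2 day style."""
    def __init__(self, soup):
        # Walk body > div#mainContainer > div#content, direct children only.
        container = soup.first('body').\
                    first('div',{'id':'mainContainer'},recursive=False).\
                    first('div',{'id':'content'},recursive=False)
        if container == Null:
            raise ValueError("Nebula content container not found")
        self.entries = container.\
                       fetch('div',{'class':'entry'},recursive=False)

    def parse(self):
        # Two of the div.entry blocks are the page header and footer.
        entries = len(self.entries) - 2
        return {"entries": entries}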
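
For anyone who wants to poke at the same counting idea without Beautiful Soup, here is a rough sketch that uses only Python's standard-library html.parser. It is not the code lj_filter_finder runs. The names EntryCounter, count_entries and SAMPLE are made up for illustration, the sample markup is a guess at the shape of a Nebula day page, and it is looser than the parser above: it accepts any div#content rather than insisting on body > div#mainContainer > div#content.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class EntryCounter(HTMLParser):
    """Count div.entry elements that are direct children of div#content."""

    def __init__(self):
        super().__init__()
        self.depth = 0             # current element nesting depth
        self.content_depth = None  # depth of div#content while inside it
        self.found = False         # whether div#content was ever seen
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        if tag != "div":
            return
        attrs = dict(attrs)
        if self.content_depth is None:
            if attrs.get("id") == "content":
                self.content_depth = self.depth
                self.found = True
        elif (self.depth == self.content_depth + 1
              and "entry" in (attrs.get("class") or "").split()):
            self.count += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.content_depth == self.depth:
            self.content_depth = None  # leaving div#content
        self.depth -= 1


def count_entries(html):
    """Return the number of real entries, excluding header and footer."""
    parser = EntryCounter()
    parser.feed(html)
    parser.close()
    if not parser.found:
        raise ValueError("content container not found")
    return parser.count - 2  # header/footer, as in the parser above


# Hypothetical markup, invented for this example.
SAMPLE = """
<body><div id="mainContainer"><div id="content">
  <div class="entry">header</div>
  <div class="entry">first post <div class="entry">nested, ignored</div></div>
  <div class="entry">second post<br></div>
  <div class="entry">footer</div>
</div></div></body>
"""

print(count_entries(SAMPLE))  # 2
```

Tracking depth by hand stands in for recursive=False: only div.entry blocks sitting exactly one level below div#content are counted, so the nested one in the sample is skipped.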

It takes a few minutes to run with non-cached data, and a few seconds if it's all cached. My test suite is good but far from perfect, because you can still customize a lot as a free user. I figure once I get the rest of the parsers written, I'll point it at... say, friends of wigu (not wigu's friends) and verify the results by hand. That will be one hell of a test :P
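
The post doesn't say how the caching works, so the following is only a guess at the general shape: a tiny on-disk page cache that calls the network once per URL and reads from a file after that. The names cached_fetch and fake_fetch, the cache layout, and the example URL are all assumptions made for the sketch.

```python
import hashlib
import os
import tempfile


def cached_fetch(url, fetch, cache_dir):
    """Return the body for url, calling fetch(url) only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the URL so it is always a safe file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    body = fetch(url)
    with open(path, "w", encoding="utf-8") as f:
        f.write(body)
    return body


# Stand-in for a real HTTP request; records each call it receives.
calls = []

def fake_fetch(url):
    calls.append(url)
    return "<html>page for %s</html>" % url


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("http://example.com/day", fake_fetch, tmp)
    second = cached_fetch("http://example.com/day", fake_fetch, tmp)

print(len(calls))  # 1
```

The second call never reaches fake_fetch, which is the minutes-versus-seconds difference in miniature. A real cache would also need some way to expire old pages.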
