02:15 pm

King Scraper

Sunday, April 16th, 2006
More fun with BeautifulSoup: scraping your way into LJ to find out if you got filter'd:

>>> cache = Cache('/tmp/cache', 5)
>>> lj = LJ('', cache)
>>> lj.login('kvance', password)
>>> lj.get_calendar_day('axiem', 2006, 4, 13)['entries']
>>> lj.get_day('axiem', 2006, 4, 13)['entries']

Not to pick on axiem. I just know he has an extensive filtering system.
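
The check itself is just a subtraction: the calendar page reports a total for the day, the day page shows only what you are allowed to read, and the difference is what got filtered. A minimal sketch of that comparison, assuming the calendar count is an integer and the visible entries come back as a list (the function name and both argument shapes are my own illustration, not the real return values of the calls above):

```python
def hidden_entry_count(calendar_count, visible_entries):
    """Return how many entries the calendar reports beyond the ones we can see.

    calendar_count  -- total for the day as shown on the calendar page (assumed int)
    visible_entries -- entries actually returned by the day page (assumed list)
    """
    hidden = calendar_count - len(visible_entries)
    # Clamp at zero: the two pages may have been fetched (or cached) at
    # different times, so the day page can briefly show more than the calendar.
    return max(hidden, 0)
```

If the calendar says 4 and the day page hands back 3 entries, the function returns 1: one post you were not meant to see.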

I made an S2 style that dumps out the required information, bypassing lots of horrible scraping for paid styles. But what about those free users? Aren't they still safe?

Not exactly:

% python
.oO  PARSER TEST  Oo.
Calendar parsers: ['CheaterTitle', 'Cheater', 'S2Dump']
Day parsers: ['S2Dump']
S1 Archive Pages
Clean and Simple.html            : Cheater
Default.html                     : Cheater
Disjointed.html                  : Cheater
Generator.html                   : Cheater
Magazine.html                    : Cheater
Notepad.html                     : Cheater
Punquin Elegant with Sidebar.html: Cheater
Refried Paper.html               : Cheater
Tabular Indent.html              : Cheater

S1 Archive Pages 9/9 passed.

S2 Archive Pages
3 column.html         : Cheater
A Novel Conundrum.html: Cheater
A Sturdy Gesture.html : Cheater
Bloggish.html         : Cheater
Classic.html          : Cheater
Clean and Simple.html : Cheater
Cuteness Attack.html  : Cheater
Dear Diary.html       : Cheater
Digital Multiplex.html: Cheater
Flexible Squares.html : Cheater
Generator.html        : Cheater
Gradient Strip.html   : Cheater
Haven.html            : Cheater
Magazine.html         : Cheater
Nebula.html           : CheaterTitle
Notepad.html          : Cheater
Opal.html             : Cheater
Punquin Elegant.html  : Cheater
Quite Lickable.html   : Cheater
Smooth Sailing.html   : Cheater
Tabular Indent.html   : Cheater
Tranquility II.html   : Cheater
Unearthed.html        : Cheater
Variable Flow.html    : Cheater

S2 Archive Pages 24/24 passed.

As it turns out, all of the free styles except one present the number of entries in the same way. You know what the day page's link should be ahead of time ("cheating!"), so you search for it. The text in the hyperlink is the number of entries. The only exception is Nebula, which places it in the "title" attribute. So scraping every single free calendar style was actually easy!
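
Roughly, the "cheating" trick described above looks like the sketch below. It is not the real parser code: it uses the standard library's html.parser instead of BeautifulSoup so it runs with no extra installs, and the class name, function names, and the sample markup in the usage note are all made up for illustration. It also assumes the href in the page matches the expected day URL exactly. The title attribute is tried first and the link text second, mirroring the order of the calendar parser list in the test output.

```python
import re
from html.parser import HTMLParser


class DayLinkFinder(HTMLParser):
    """Locate the first <a> whose href is the known day-page URL (hypothetical helper)."""

    def __init__(self, day_url):
        super().__init__()
        self.day_url = day_url
        self.found = False     # set once the matching link has been seen
        self.in_link = False   # True while inside the matching <a>...</a>
        self.text_parts = []   # text collected inside the matching link
        self.title = None      # title attribute of the matching link, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and not self.found and attrs.get("href") == self.day_url:
            self.found = True
            self.in_link = True
            self.title = attrs.get("title")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.text_parts.append(data)


def first_int(text):
    """Return the first run of digits in text as an int, or None."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None


def count_entries(html, day_url):
    """Return the entry count advertised for day_url, or None if not found."""
    finder = DayLinkFinder(day_url)
    finder.feed(html)
    # "CheaterTitle" case: the count sits in the link's title attribute.
    count = first_int(finder.title)
    if count is None:
        # "Cheater" case: the count is the link text itself.
        count = first_int("".join(finder.text_parts))
    return count
```

With invented markup like `<a href="/2006/04/13/">4</a>`, calling `count_entries(page, "/2006/04/13/")` gives 4. A Nebula-style link such as `<a href="/2006/04/13/" title="2 entries">13</a>` gives 2 rather than 13, since the title wins when it contains a number.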

I expect there will be a similar deal with the day pages, with <h1> or something. Except that you can put whatever you want in an entry, so I take that back. Day parsers will probably have to do it correctly.

From: ex_md744
2006-04-16 06:54 pm (UTC)
What the hell?
From: thedexter
2006-04-16 07:01 pm (UTC)
I don't understand anything after the bit about filters, myself.
From: casey
2006-04-16 07:36 pm (UTC)
The calendar page lists the full number of entries on a given day, regardless of security settings. If I make 4 posts, regardless of who you are, it will show '4' on the calendar page, but when I actually try to retrieve the posts for a given day, I will only get however many I have access to. In doing so, I can see if people are making posts that I am not allowed to see.

Apologies if you understood that much. The rest is just a bunch of parsing output that he had to do. He wrote an S2 style that he can apply to any paid users that outputs the necessary information to check this, but in the case of free users he's stuck doing some... messy screen-scraping.
From: thedexter
2006-04-16 08:10 pm (UTC)
I do understand that. I didn't get what "cheater" meant, but now I follow.
From: kartos
2006-04-17 12:48 am (UTC)
I actually like this flaw. It has bad points, but it has helped me to see someone's true nature in certain situations. Of course, it is a double-edged sword.
