Mahboob's Journal

Existential Thoughts, Experiential Inferences and Occasional Whacky Connections


Valentine Day Scraping
mahboob

Recently, while I was watching a discussion on TV, the Sachar committee report was mentioned. I wondered where I would get the report if I wanted to read it myself. After some time, I remembered seeing it somewhere and, after scratching my head a few times, realized it was in one of Rupa Subramanya’s blog posts: she had given a link to the report.


I noticed Rupa’s name first, I think, when I got a retweet of her tweet from someone else (probably Nidhi Razdan) whom I was following on Twitter. I read a post announcing that she was starting a blog on Economics. One of the first entries that I read with a lot of interest was on cash-transfer schemes. In fact, after reading that piece, I thought a lot about the whole mechanism, and collected plenty of news items and articles on the subject.


Rupa’s posts hold a trove of links to documents and external web pages, and they are very informative. Not only the posts but the links too are valuable for reference. So I had a thought: why don’t I just document each blog post and the links it has? I didn't want to do it manually. Though I haven’t used it for years now, Perl is ideal for this problem, as it has extra horsepower for pattern matching and string parsing. A programmer has got to do what a programmer has got to do.


So I sat down on Saturday (11-Feb-12) morning and began with Example 20.1 - Fetching a URL from a Perl Script and Example 20.18 - Parsing HTML, both from the Perl Cookbook, 2nd Edition. The code uses the match operator (m) for extraction:

m{
    <a\ href="
    ([^\"]+)   # link to doc or external source
    \">
    ([^<]+)    # link text in the post
    </a>
}gx

After this Perl magic, $1 has the actual URL and $2 has the hyperlink text.
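A minimal sketch of how that match might be driven in a loop. The sample line and the example.com addresses are made up for illustration; the real input would be a line of the fetched post. The first capture group is the href value and the second is the link text.

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the blog.
my $line = 'See the <a href="http://example.com/report.pdf">Sachar report</a> '
         . 'and <a href="http://example.com/data">the data</a>.';

my @links;
# With /g in a while condition, each pass picks up the next link in the line.
while ($line =~ m{
        <a\ href="
        ([^"]+)    # group 1: link to doc or external source
        ">
        ([^<]+)    # group 2: link text in the post
        </a>
    }gx) {
    push @links, { url => $1, text => $2 };
}

print "$_->{text} -> $_->{url}\n" for @links;
```

The pattern only recognises anchors written exactly as `<a href="...">`, so a tag with extra attributes or single quotes would be skipped. That is a limitation of the sketch, and a proper HTML parser would be more robust.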


The blog's starting page, the base url as I named the variable, is http://blogs.wsj.com/indiarealtime/tag/Economics-Journal/ and it shows the most recent posts. In fact, that URL is equivalent to http://blogs.wsj.com/indiarealtime/tag/Economics-Journal/page/1, the previous set of posts is at http://blogs.wsj.com/indiarealtime/tag/Economics-Journal/page/2, and so on.


So we take the base URL, append an integer counter with sprintf and, in a loop, fetch each set of blog posts with the get_html function until we reach a non-existent page. get_html does what its name says: give it a URL and it will fetch the web page’s content. Since we pass the URLs with page/1, page/2, etc. appended, we get a set of posts from every call. For each blog post in the set, we call the function get_links_in_post.
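The paging loop could look roughly like this. It is a sketch, not the original program: get_html here is an offline stand-in that pretends only pages 1 and 2 exist, and the variable names are assumptions. A real version might call get from LWP::Simple, which returns undef when the fetch fails.

```perl
use strict;
use warnings;

# Base URL from the post; sprintf appends "page/N" to it.
my $base_url = 'http://blogs.wsj.com/indiarealtime/tag/Economics-Journal/';

# Stand-in data so the sketch runs without a network connection.
my %fake_site = (
    $base_url . 'page/1' => '<html>most recent posts</html>',
    $base_url . 'page/2' => '<html>older posts</html>',
);

# Hypothetical get_html: returns the page content, or undef if there is none.
sub get_html {
    my ($url) = @_;
    return $fake_site{$url};
}

my @pages;
for (my $n = 1; ; $n++) {
    my $url  = sprintf('%spage/%d', $base_url, $n);
    my $html = get_html($url);
    last unless defined $html;    # non-existent page: stop the loop
    push @pages, $html;
    # Each post found in this page would be handed to get_links_in_post here.
}

printf "fetched %d pages\n", scalar @pages;
```

Stopping on the first undefined result keeps the loop simple, but with a real fetch it would also stop on a temporary network error, so a retry or a page limit may be worth adding.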


Function get_links_in_post is also self-explanatory by its name. The Perl magic is in the following line:

($post) = $html =~ m{By\s+Rupa\s+Subramanya(.*?)Rupa\s+Subramanya}s

What this does is take the content between “By Rupa Subramanya” and the next “Rupa Subramanya”, however much whitespace there may be between the words, even across lines, and assign that content to the variable $post.
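Here is that extraction applied to a small, invented page fragment (the blog's real markup may well differ). The /s modifier lets the dot match newlines, and the non-greedy .*? stops at the first later occurrence of the name.

```perl
use strict;
use warnings;

# Hypothetical fragment: byline on top, author note at the bottom.
my $html = <<'HTML';
<h2>Some title</h2>
<p>By  Rupa
Subramanya</p>
<p>Body with a <a href="http://example.com/x">link</a>.</p>
<p>Rupa Subramanya is an economist.</p>
HTML

# List assignment puts the first capture group into $post (undef if no match).
my ($post) = $html =~ m{By\s+Rupa\s+Subramanya(.*?)Rupa\s+Subramanya}s;

print defined $post ? "$post\n" : "no match\n";
```

If the author's name also appears inside the body of a post, the non-greedy match would cut the post short at that point, so the closing marker is worth checking against real pages.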


Sometimes there are embedded links to previous or related posts. They sit in the 'insetContent' division and need to be removed, which is a simple substitution with “nothing”.

$post =~ s/insetContent(.*?)div>//s;
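A small experiment with that substitution, on an invented post body that has two inset blocks. One thing it suggests: without the /g modifier only the first inset is removed, so adding /g may be needed if a post can contain more than one. Whether real posts do is an assumption here.

```perl
use strict;
use warnings;

# Hypothetical post body; the real markup may differ.
my $post = join "\n",
    '<p>First paragraph.</p>',
    '<div class="insetContent"><a href="/old-post">Earlier post</a></div>',
    '<p>Second paragraph.</p>',
    '<div class="insetContent"><a href="/related">Related</a></div>',
    '<p>Last paragraph.</p>';

# Work on copies so both variants start from the same text.
(my $once = $post) =~ s/insetContent(.*?)div>//s;     # as in the post
(my $all  = $post) =~ s/insetContent(.*?)div>//gs;    # with /g added

# Count how many inset markers survive in each copy.
my $left_once = () = $once =~ /insetContent/g;
my $left_all  = () = $all  =~ /insetContent/g;

print "without g: $left_once left, with g: $left_all left\n";
```

Either way the substitution leaves a dangling `<div class="` behind, which does no harm to the link extraction because it contains no anchor tag.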


After that, I split the post into an array of strings on newlines, then process each line with the match-operator extraction and print the text and links.


I finished the program late on Saturday evening. On Sunday (12-Feb-2012) morning I got up and added HTML tags to the print statements (retaining the plain print statements as commented-out code). The program prints HTML, which I posted on my personal page on googlepages. The page with the blog posts and links is here -> https://sites.google.com/site/mahboobh06/home/rupa-subramanya-blogs-links

The Perl program can be seen at this link ->
https://sites.google.com/site/mahboobh06/home/rupa-blogs-links-perl-program


I am not sure if this kind of work qualifies as screen scraping, but it does sound close to it. As I look at the information, I realize that Rupa had started the blog last year on Valentine’s day. It’s been a steady, consistent and sincere effort by her to make sense of the world from an economist’s perspective. Hope that her writing continues to be prolific and she continues to evolve as an economist and an author.

To all of you: Happy Valentine’s Day.
