Web Scraping: A basic know-how

A web scraper is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for such programs are ants, automatic indexers, bots, web spiders and web robots. The process is termed “web crawling”, and most search engines use it to keep their data up to date: the crawler saves a copy of every page it visits, and the search engine later processes and indexes the downloaded pages.
This helps with:

faster searching
automating maintenance tasks on a website
gathering specific types of information from websites

The bot starts with seeds, a list of URLs to visit. When the scraper fetches one of these URLs, it identifies the hyperlinks on that page and adds them to the “crawl frontier”, the set of URLs that are still to be visited. The URLs in the frontier are then visited according to a predefined set of policies.
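The seed-and-frontier loop described above can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: the `%link_graph` hash is a made-up, in-memory stand-in for real pages (a real scraper would fetch each URL and extract its links), the `crawl` function and the `example.com` URLs are hypothetical, and the only policies applied are "first in, first out" and "never queue the same URL twice".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real scraper would download the page and extract its links instead.
my %link_graph = (
    'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a' => ['http://example.com/b', 'http://example.com/c'],
    'http://example.com/b' => ['http://example.com/'],
    'http://example.com/c' => [],
);

# Crawl from the seeds and return the URLs in the order they were visited.
sub crawl {
    my ($graph, $max_pages, @seeds) = @_;
    my @frontier = @seeds;                     # URLs waiting to be visited
    my %seen     = map { $_ => 1 } @seeds;     # URLs already queued or visited
    my @visited;

    while (@frontier && @visited < $max_pages) {
        my $url = shift @frontier;             # policy: first in, first out
        push @visited, $url;
        for my $link (@{ $graph->{$url} || [] }) {
            next if $seen{$link}++;            # policy: never queue a URL twice
            push @frontier, $link;
        }
    }
    return @visited;
}

my @order = crawl(\%link_graph, 10, 'http://example.com/');
print "$_\n" for @order;
```

Using `shift` gives a breadth-first crawl; taking URLs from the other end of the frontier with `pop` would make it depth-first, which is one simple example of a different visiting policy.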

Web scrapers can be developed in almost any language: Perl, Python, Java, ASP, PHP, etc. Among these, we chose Perl to develop a web scraper. Let's see what happened next.

Why Perl?

Perl is well suited to web scraping because of its powerful regular expressions and the availability of CPAN modules.

In this session, we will deal with:

Mechanize (Perl module)
Process spawning
Anonymous scraping

Mechanize module: WWW::Mechanize is one of the main modules used for stateful, programmatic web browsing and for automating interaction with websites. Mechanize supports performing a sequence of page fetches, including following links and submitting forms. Each fetched page is parsed, and its links and forms are extracted. A link or a form can be selected, form fields can be filled in, and the next page can be fetched. Mech also stores a history of the URLs you’ve visited, which can be queried and revisited. Some useful functions are listed after the sample script.

For more info: http://search.cpan.org/~petdance/WWW-Mechanize-1.62/

Sample Script

[perl]
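#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $url = 'http://chato.cl/research/crawling_thesis';

my $m = WWW::Mechanize->new();   # Create the browser object

$m->get($url);                   # Fetch the page

my $c = $m->content;             # HTML source of the fetched page
print $c;                        # Display the source code of the above link

exit;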

[/perl]

[perl]
# Useful functions of the Mechanize module

my $mech = WWW::Mechanize->new();            # Create a new Mechanize object
$mech->agent_alias('Linux Mozilla');         # Send a browser-like User-Agent string (Mozilla on Linux)
$mech->get('http://www.google.com');         # Download the page at this URL (the http:// scheme is needed)
my $html = $mech->content;                   # Content of the page just fetched
$mech->submit_form(form_number => 1);        # Form submission, here the first form on the page
my $link = $mech->find_link(text => 'Next'); # Find the link with the text 'Next' (this does not follow it)
$mech->follow_link(text => 'Next');          # Follow that link; there are many other options, such as regular expressions, class, etc.
[/perl]

Process spawning:

Most bots have a main process and a number of child processes. The main process creates child processes as required, while the child processes scrape the target locations simultaneously.

Why Process spawning?

Process spawning is used simply to scrape different levels of a website (i.e. different pages, sections, etc.) simultaneously.
It has a number of advantages, such as a large boost in scraping speed and easier management of server load.
Consider a target such as an e-commerce portal with a million sections (review pages, for example), some of which are missing. A child process that hits a missing page or section simply dies without affecting the overall crawl, while the main process continues with a new child and a new section.
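A minimal sketch of the main/child arrangement described above, using Perl's built-in `fork` and `wait`. The names `scrape_section` and `run_sections`, the section names, and the limit of two children at a time are all assumptions made for illustration. `scrape_section` does no real scraping here; it only dies for a section called `missing`, to show that a failing child does not take the main process down with it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

$| = 1;    # unbuffered output, so children do not inherit half-written text

# Hypothetical work unit: stands in for scraping one section of a site.
# It dies on a "missing" section to imitate a page that is not there.
sub scrape_section {
    my ($section) = @_;
    die "section $section is missing\n" if $section eq 'missing';
    return;
}

# Fork one child per section, with at most $max_children running at a time.
# Returns a hash of section => child exit code (0 means success).
sub run_sections {
    my ($max_children, @sections) = @_;
    my (%section_for_pid, %exit_code);

    # Wait for any one child to finish and record its exit code.
    my $reap = sub {
        my $pid = wait();
        if ($pid == -1) { %section_for_pid = (); return; }   # no children left
        my $section = delete $section_for_pid{$pid};
        $exit_code{$section} = $? >> 8 if defined $section;
    };

    for my $section (@sections) {
        $reap->() while keys(%section_for_pid) >= $max_children;
        my $pid = fork();
        die "fork failed: $!\n" unless defined $pid;
        if ($pid == 0) {
            # Child: do the work, then exit instead of returning to the loop.
            my $ok = eval { scrape_section($section); 1 };
            exit($ok ? 0 : 1);
        }
        $section_for_pid{$pid} = $section;    # main process tracks the child
    }
    $reap->() while %section_for_pid;         # wait for the remaining children
    return %exit_code;
}

my %result = run_sections(2, qw(reviews missing products));
print "$_: ", ($result{$_} == 0 ? 'ok' : 'failed'), "\n" for sort keys %result;
```

The main process only starts children and collects their exit codes, so the child that fails on the `missing` section is reported as failed while the other sections are still processed. Capping the number of simultaneous children is one simple way to keep the load on the target server under control.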

Anonymous scraping with Tor

Tor is free software and an open network that helps you defend against a form of network surveillance known as traffic analysis. This surveillance threatens personal freedom, privacy, and confidential business activities and relationships.
Tor is a network of virtual tunnels that allows people and groups to improve their privacy and security on the Internet. It also enables software developers to create new communication tools with built-in privacy features. Tor provides the foundation for a range of applications that allow organizations and individuals to share information over public networks without compromising their privacy.
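One possible way to send Mechanize traffic through Tor is to point the module at a local HTTP proxy that forwards requests into the Tor network. The sketch below assumes such a proxy (Polipo, for instance) is already running and listening on 127.0.0.1:8118; that address, the `$TOR_HTTP_PROXY` variable and the `new_tor_agent` helper are assumptions for illustration, not part of Tor or Mechanize. The `proxy` method itself is inherited from LWP::UserAgent.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Assumption: a local HTTP proxy forwards traffic into Tor and listens here.
# Change the address and port to match your own setup.
my $TOR_HTTP_PROXY = 'http://127.0.0.1:8118/';

# Build a Mechanize object whose HTTP and HTTPS requests use the given proxy.
sub new_tor_agent {
    my ($proxy_url) = @_;
    my $mech = WWW::Mechanize->new(autocheck => 1);   # die on failed requests
    $mech->agent_alias('Linux Mozilla');
    $mech->proxy(['http', 'https'], $proxy_url);      # method from LWP::UserAgent
    return $mech;
}

my $mech = new_tor_agent($TOR_HTTP_PROXY);
print "HTTP requests will use the proxy at ", $mech->proxy('http'), "\n";
```

After this, calls such as `$mech->get($url)` go to the proxy rather than directly to the target site. If the proxy is not running, the request fails instead of quietly falling back to a direct connection.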

For more info, please go through:
http://www.torproject.org/docs/tor-doc-unix.html.en#polipo