Thursday, 30 May 2013

Web scraping tutorial

Web scraping is the act of programmatically harvesting data from a webpage. It consists of finding a way to format the URLs to pages containing useful information, and then parsing the DOM tree to get at the data. It’s a bit finicky, but our experience is that this is easier than it sounds. That’s especially true if you take some of the tips from this web scraping tutorial.

This is more of an intermediate tutorial, as it doesn't feature any code. But if you can bring yourself up to speed on Python and BeautifulSoup, the rest is not hard to implement by trial and error. [Hartley Brody] discusses investigating how the GET requests are formed for your webpage of choice. Once that URL syntax has been figured out, just look through the source code for tags (CSS classes or otherwise) that can be used as hooks to get at your target data.
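
For a sense of what that looks like in practice, here is a minimal Python/BeautifulSoup sketch of the workflow. It is only an illustration: the URL, the page parameter, and the listing/title/price CSS classes are hypothetical stand-ins for whatever hooks you find on your own target page.

import requests
from bs4 import BeautifulSoup

# Hypothetical catalogue that paginates via a ?page= GET parameter
BASE_URL = "http://example.com/catalog"

def scrape_page(page_number):
    # Reproduce the GET request the site itself makes, e.g. /catalog?page=2
    response = requests.get(BASE_URL, params={"page": page_number})
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    items = []
    # CSS classes spotted in the page source act as hooks to the data
    for row in soup.select("div.listing"):
        title = row.select_one("h2.title").get_text(strip=True)
        price = row.select_one("span.price").get_text(strip=True)
        items.append((title, price))
    return items

for title, price in scrape_page(1):
    print(title, price)

Swapping in the real URL pattern and the real selectors you find in the page source is the trial-and-error part.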

So what can this be used for? A lot of things. We'd suggest reading the Reddit comments, as there are several real-world uses discussed there. But one that immediately springs to mind is the picture harvesting [Mark Zuckerberg] used when he created Facemash.


Source: http://hackaday.com/2012/12/10/web-scraping-tutorial/

Monday, 27 May 2013

Scraping a website into Drupal using Perl

 Perl has been at the root of web development since the beginning: even Amazon is built on Perl. Today, Perl gives you access via CPAN to a set of over 18,000 mature modules on just about anything. There is even an Acme:: namespace reserved for joke modules.

Perl has a lot of benefits for a Drupal developer. First, the syntax of PHP has been greatly influenced by Perl, so most PHP programmers should feel comfortable in Perl. It is easy to install extra Perl modules on any Linux distribution from the command line using CPAN, or on shared hosts using the administration interface. And Perl is faster than PHP, which makes it an excellent candidate for the heavy-lifting part of a website.

Let's build a small Perl script to:

    Log into a website
    Parse a page and search for specific content
    Format the content as an RSS feed
    Load the feed into Drupal

This solution would be extremely simple to build using only four Perl CPAN modules. Here is how it goes:

STEP 1: The first line in the Perl script is the shebang, which points to the location of Perl on your system.
#!/usr/bin/perl -w

On shared hosts, you might have to use something like this to tell Perl to look inside your home directory:
#!/ramdisk/bin/perl -w
#
# Hostmonster fix
BEGIN {
    my $homedir = ( getpwuid($>) )[7];
    my @user_include;
    foreach my $path (@INC) {
        if ( -d $homedir . '/perl' . $path ) {
            push @user_include, $homedir . '/perl' . $path;
        }
    }
    unshift @INC, @user_include;
}

STEP 2: Declare the modules you intend to use (these must be installed first):
use CGI::Minimal;
use WWW::Mechanize;
use XML::RSS;
use HTTP::Message;

STEP 3: Define some constants. We'll provide a user agent (here IE8) to make sure the system will not reject us by mistake.
my $login_url        = "https://example.com";
my $login_agent      = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)";
#
my $login_form_name  = "form_login";
my $login_field_user = "login_id";
my $login_field_pass = "passwd";

STEP 4: Use CGI::Minimal to read the login and password coming from POST or GET:
# Gets us access to the HTTP request data
my $cgi = CGI::Minimal->new;
#
# Get the name and value for each parameter:
my $login_user = $cgi->param('user');
my $login_pass = $cgi->param('pass');

STEP 5: WWW::Mechanize is our Swiss Army tool, allowing us to post forms, click on buttons, follow links, etc. With only six lines of code, WWW::Mechanize can read the login page, find the login form on it, enter the user name and password, submit the form, and return the next page.
# The autocheck => 1 tells Mechanize to die if any IO fails,
# so you don't have to check manually.
my $mech = WWW::Mechanize->new( autocheck => 1, agent => $login_agent );
#
# Fetch the login page
$mech->get($login_url);
#
# Find and select the form by name, returning an HTML::Form object
$mech->form_name($login_form_name);
#
# Fill specific fields on the form
$mech->field( $login_field_user, $login_user );
$mech->field( $login_field_pass, $login_pass );
#
# Click the submit button
$mech->click();

STEP 6: We could then navigate the site by following links using WWW::Mechanize, but let's say the content we are interested in is on the next page. We want to extract the following information:
Link to post 123

With the help of WWW::Mechanize we can extract all the links which have class "post":
my @links = $mech->find_all_links( tag => 'a', class => 'post' );

STEP 7: Now build the RSS feed using XML::RSS:
# Syndication feed
my $rss = XML::RSS->new( version => '2.0' );
#
# Create xml content
foreach (@links) {
    $rss->add_item(
        title => $_->text,
        link  => $_->url
    );
}

STEP 8: The final step simply returns the result using HTTP::Message:
# Manage the HTTP response
my $response = HTTP::Message->new;
#
# Create message with xml as text
$response->header( 'Content-Type' => 'application/rss+xml' );
$response->content( $rss->as_string );
#
# Send message to client
print $response->as_string;

STEP 9: Finally, in Drupal, download and install FeedAPI and enable FeedAPI, FeedAPI Node and SimplePie Parser (external library required). Then create a Feed node with the URL pointing to your script:
http://localhost/feed.pl?user=foo&pass=bar

That's it! A very simple and strong foundation to build upon. For example, this can be used to perform a search on a site, or to return the results in XML by replacing XML::RSS with XML::Generator.


Source: http://www.appnovation.com/scraping_website_drupal_using_perl

Friday, 24 May 2013

Scraping

Scraping, or "web scraping," is the process of extracting large amounts of information from a website. This may involve downloading several web pages or the entire site. The downloaded content may include just the text from the pages, the full HTML, or both the HTML and images from each page.

There are many different methods of scraping a website. The most basic is manually downloading web pages. This can be done by either copying and pasting the content from each page into a text editor or using your browser's File → Save As… command to save local copies of individual pages. Scraping can also be done automatically using web scraping software; this is the most common way to download a large number of pages from a website. In some cases, bots can be used to scrape a website at regular intervals.
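
To give a feel for the automated route, here is a minimal, hypothetical Python sketch (the URLs are placeholders) that downloads a couple of pages and saves local copies, much like the browser's Save As… command. Running it on a schedule would give you the bot case.

import requests

# Placeholder URLs; in practice these would come from a site map or a crawl
pages = [
    "http://example.com/articles/page1.html",
    "http://example.com/articles/page2.html",
]

for url in pages:
    response = requests.get(url)
    response.raise_for_status()
    # Save a local copy of the full HTML, much like File -> Save As...
    filename = url.rstrip("/").split("/")[-1]
    with open(filename, "w", encoding="utf-8") as f:
        f.write(response.text)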

Web scraping may be done for several different purposes. For instance, you may want to archive a section of a website for offline access. By downloading several pages to your computer, you can read them at a later time without being connected to the Internet. Web developers sometimes scrape their own websites when testing for broken links and images within each page. Scraping can also be done for unlawful purposes, such as copying a website and republishing it under a different name. This type of scraping is viewed as a copyright violation and can lead to legal prosecution.

NOTE: While scraping a website for the purpose of republishing information is always wrong, scraping a site for other purposes may still violate the website's terms of use. Therefore, you should always read a website's terms of use before downloading content from the site.

Source: http://www.techterms.com/definition/scraping

Friday, 17 May 2013

Creating a Private Database of Proxies – Part 2: Scraping IP Addresses

In this section of our tutorial on creating a database of proxies, we’ll be walking through how we’re going to write our program.

What do I need to do this?

For this part of the tutorial, we'll assume that you've read the tutorial introduction, and you know what we're trying to do, and why. This section will not require you to have any programming knowledge; we're only going to walk through the steps we're going to take to get the IP addresses for the proxy servers. You should have a VERY basic knowledge of HTML and CSS, or at least know what they are. If you don't, hopefully you will by the end of this section. Everything else that you'll need for later sections will be explained then.

This should be easy! Why do I need a tutorial?

Initially, when we set out to do this ourselves, we assumed the same thing. We thought we could set up the whole system in an hour or two and have it running the same evening. It turns out the folks over at HideMyAss.com thought about people doing what we were going to do, and made it a bit harder than we initially expected. Their primary method of stopping people from collecting the information from their site is obfuscating the code that displays the IP address of each server. The page still displays the correct address, but in the HTML it's much harder to tell which numbers belong to it, and thus harder for a computer to find automatically.

How does HideMyAss obfuscate the addresses?

To see how the IP addresses are hidden, we're going to need to look at the HTML for the page. We can do this in a browser like Chrome or Firefox. In this tutorial, we're going to use Chrome. Start by going to HideMyAss.com's list of proxy servers, and open up the HTML for the page. Once there, isolate the <span> element that contains the first listed IP address, and let's look at what it contains. For us, our first <span> looks like this:
   
<span>
    <style>.r6cp{display:none}.Nz73{display:inline}</style>
    <span style="display:none">36</span>
    <div style="display:none">71</div>
    <span style="display:none">94</span>
    <span class="r6cp">94</span>
    <div style="display:none">94</div>
    <span style="display:none">118</span>
    <span></span>
    <span></span>
    <span style="display:none">185</span>
    <div style="display:none">185</div>
    <span style="display:none">194</span>
    <span class="r6cp">194</span>
    <span style="display:none">202</span>
    <span></span>
    203
    <div style="display:none">205</div>
    <span style="display:none">246</span>
    <span></span>
    <span style="display: inline">.</span>
    <span class="r6cp">57</span>
    <div style="display:none">57</div>
    <span class="110">156</span>
    <span class="Nz73">.</span>
    <span style="display:none">107</span>
    <span></span>
    <span class="208">250</span>
    <span class="123">.</span>
    <span class="r6cp">56</span>
    <div style="display:none">56</div>
    <span style="display: inline">101</span>
</span>

Clearly, that’s a lot more than the four numbers that make up an IP address. This is what all those numbers end up looking like in the browser to the user:

[Image: the server's IP address as it appears on the rendered page]

So how do we get from all the numbers in the HTML above to what we see on the screen? There are two main methods being used here. The first is creating <span> elements in the code but not displaying them. We see that used quite a bit in the code, in this line for example:
   
<span style="display:none">36</span>

We can see that a <span> element is created that says "36", with the display property set to display:none. Obviously, this tells the element not to display when the web page is rendered. Let's take a look at the code for our single IP address again, but with all the elements with the display:none property removed. We'll also remove all the empty <span> elements.
   
<span>
    <style>.r6cp{display:none}.Nz73{display:inline}</style>
    <span class="r6cp">94</span>
    <span class="r6cp">194</span>
    203
    <span style="display: inline">.</span>
    <span class="r6cp">57</span>
    <span class="110">156</span>
    <span class="Nz73">.</span>
    <span class="208">250</span>
    <span class="123">.</span>
    <span class="r6cp">56</span>
    <span style="display: inline">101</span>
</span>

That looks better, but there are still more numbers than we need. So what else isn't being displayed? The answer lies in the very first line, between the <style> tags. Two classes are created, called r6cp and Nz73, each setting the display property we saw earlier. This time, however, only the r6cp class has display:none; the Nz73 class has display:inline, meaning an element in that class WILL be displayed. Additionally, any element whose <span> tag sets display:inline in its style attribute will also be displayed. Let's see what the code looks like without the elements in class r6cp:
   
<span>
    <style>.r6cp{display:none}.Nz73{display:inline}</style>
    203
    <span style="display: inline">.</span>
    <span class="110">156</span>
    <span class="Nz73">.</span>
    <span class="208">250</span>
    <span class="123">.</span>
    <span style="display: inline">101</span>
</span>

That looks much more like our displayed page than we started with. In fact, those elements are exactly the ones that are displayed! This technique works for all of the IP addresses listed, and is what we will be writing our program to do in the next section.

How exactly are we going to write this program?

To put it simply, we're going to create our program to do exactly what we just did by hand. First, identify and remove all elements with the display:none property set in the tag. Then, find and get rid of all elements that are part of a class that has that property. Finally, take what's left and put it onto one line, and we should have our address! Luckily, the rest of the information about each server is not obfuscated, so we can just get that normally. Check out Part 3 for instructions on where to go from here!
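
As a rough illustration of those steps (not the actual program we build in Part 3), here is a Python/BeautifulSoup sketch run against an abridged copy of the example <span> from above. Remember that the r6cp and Nz73 class names change on every page load, which is why the script reads them out of the <style> block instead of hard-coding them.

import re
from bs4 import BeautifulSoup

# Abridged copy of the example <span> shown earlier in this post
html = """
<span>
    <style>.r6cp{display:none}.Nz73{display:inline}</style>
    <span style="display:none">36</span>
    <span class="r6cp">94</span>
    203
    <span style="display: inline">.</span>
    <span class="110">156</span>
    <span class="Nz73">.</span>
    <span class="208">250</span>
    <span class="123">.</span>
    <span style="display: inline">101</span>
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. Read the inline <style> block to learn which classes mean display:none
style_text = soup.find("style").get_text()
hidden_classes = set(re.findall(r"\.([\w-]+)\{display:none\}", style_text))
soup.find("style").decompose()

# 2. Remove elements hidden directly by a style="display:none" attribute
for tag in soup.find_all(style=re.compile(r"display\s*:\s*none")):
    tag.decompose()

# 3. Remove elements belonging to one of the hidden classes
for tag in soup.find_all(True):
    if any(cls in hidden_classes for cls in tag.get("class", [])):
        tag.decompose()

# 4. Whatever text is left, squashed onto one line, is the IP address
print("".join(soup.get_text().split()))   # prints 203.156.250.101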

Source: http://blueshellgroup.wordpress.com/2013/04/15/creating-a-private-database-of-proxies-part-2/

Monday, 6 May 2013

How to Scrape Website

More and more advanced Internet users are asking us how to scrape a website. Let's look at what web scraping is, how to scrape a website profitably and faster than ever, and which software or service to use for the purpose.

Web scraping is the process of collecting different kinds of text data and images (usually poorly structured or unstructured data) and storing them in the format you need (an Excel file, or a database such as MySQL, MSSQL, Oracle, etc.). Web scraping is now widely used by millions of people working in e-commerce, retail, real estate, marketing, and other business fields. Even after reading up on web scraping, you may still be wondering how to scrape the website you need. Don't worry: there are lots of software products and services nowadays designed to perform this task.

Let us take a closer look at the WebSundew web data extraction tool. Our technical support staff know exactly how to scrape a website. First you need to select a target website and then decide which data you need (in most cases it is product information such as Name, Make, Model, and Price, or property and business information such as Name, Address, Phone Number, and Geolocation). There is also a recent tendency to extract Google Maps directories for a given business or property (place of interest, hotel), among many other things.

When all the preparatory work is done, you can easily start extracting. It is enough to run the WebSundew web scraping program and set up an Agent, which will perform all the necessary operations for you. It will visit the web pages, extract data from the fields you specified, and store the data in an Excel file or a database. That is why we always say that extraction with WebSundew has never been easier. Do you still have doubts about how to scrape a website? Then just download and install the WebSundew 15-day trial and try it yourself.

Source: http://www.websundew.com/how-to-scrape-website

Wednesday, 1 May 2013

OutWit Hub: Web-scraping made easy

I read a blog post earlier this term on web scraping and decided to check it out. I started with the suggested software, and quickly realized that there are only a few really good web scraping tools available that are supported by Mac OS. So, after reading a few reviews, I landed on OutWit Hub.

OutWit Hub has two versions: Basic and Pro. The difference is in the available tools. In Basic, the "words" tool isn't available; it lets you see the frequency of any word as it occurs on the page you are currently viewing. Several of the scraping tools are unavailable as well. I upgraded to Pro (it's only $60 per year) and I was curious to see what else it could do.

I'm not a computer scientist, by a long shot, but I have a general grasp on coding and how computers operate. For this reason, I really like OutWit Hub. The tutorials on this site are incredible. They walk you through examples and you can interact with the UI while the tutorial is going. Also, a lot of the tools are pretty intuitive to use. If you're not sold on getting the Pro version, I'd encourage you to visit their website and download the free version just to check out the tutorials. They're really great.

I've used the program for several examples, just to test it. I needed to get all of the emails off of an organization's website, so instead of copy/pasting everything and praying for the best, I used the "email" feature in OutWit, and all of the names and emails of every member on the page populated an exportable table. #boom

Then, I wanted to see if it could be harnessed for Twitter and Facebook. So, using the source-code approach to scraping, I was able to extract text from the loaded parts of my Twitter and Facebook feeds. The problems I encountered were not knowing enough about the coding to make the scraper dynamic enough to page through content that hadn't loaded yet, and not knowing how to automate it to build a larger dataset (i.e. continuously run the scraper over a set amount of time by repeatedly reloading the page and harvesting the data; it's possible, I just didn't figure it out).

So, I've videoed a tutorial on how to use OutWit Hub Pro's scraper feature to scrape the loaded part of your Facebook news feed. Below are the written instructions and the video at the bottom gives you the visual.

Essentially, you will:
1.) Launch OutWit Hub (presuming you've downloaded it and upgraded to Pro).
2.) Log in to your profile on Facebook.
3.) Take note of whatever text you want to capture as a reference point when you go to look in the code. This is assuming you don't know how to read HTML. For example, if the first person on your news feed says: "Hey check out this video!", then take note of their statement "Hey check out this video!"
4.) Click the "scrapers" item on the left side of the screen.
5.) In the search window, type in the text "Hey check out this video" and observe the indicators in the code that mark the beginning and end of that text.
6.) In the window below the code, click the "New" button.
7.) Type in a name for the scraper.
8.) Click the checkbox in row 1 of the window.
9.) Enter a title/description for the information you're collecting in the first column. Using the same example: "Stuff friends say on FB" or "Text". It really only matters if you're going to be extracting other data from the same page and want to keep it separate.
10.) Type in the HTML code that you indicated as the beginning of the data that you want to extract under the "Marker Before" column.
11.) Repeat step 10 for the next column using the HTML code that you indicated as the end of the data.
12.) Click "Execute".
13.) Your data is now available for export in several formats - CSV, Excel, SQL, HTML, TXT.

Here is a YouTube video example of me using it to extract and display comments made by my Facebook friends that appeared on my news feed.

Source: http://auburnbigdata.blogspot.in/2013/04/outwit-hub-web-scraping-made-easy.html

Note:

Roze Tailer is an experienced web scraping consultant who writes articles on web data scraping, website data scraping, web scraping services, data scraping services, website scraping, eBay product scraping, Forms Data Entry, etc.