Scrape A Website: 2015

Friday, 3 July 2015

Web Scraping : To Scrape or Not to Scrape?

Web scraping is on the rise and its legality is being debated. The future of big data could hang in the balance.

I decided that I might want to stop writing about all these successful dot-com businesses and get into the act myself. I mean, how hard could it be, aside from the fact that I have no expertise in any particular vertical, no technological knowledge, and no money? That last was a bit of a problem because I was going to need big data, and big data doesn't come cheap.

So, one day I'm talking to a direct-tech wizard and he says, “Why don't you just find a business you want to be in, find the most successful company, go to their website, and scrape some of their data?”

Scrape? My dad was a house painter. I used to help him during summers. The only scraping I knew was done with a putty knife. But that's what God invented the Internet for. Google turned up endless Web scraping services and I went to one called Automation Anywhere.

Its homepage told me not to try scraping on my own, that I could pay them as little as $1,995 for a program that would have me scraping away in minutes with no programming expertise. A video showed me how. Suppose I wanted to locate all the assisted living facilities in Detroit? (Bedpans! There's a wide-open Web business!) Automation Anywhere showed me how I could request a data pattern—name, address, phone, and service area of targets—apply their program to a rehab facility listing, and minutes later be in possession of tidy customer list on a spreadsheet. A box popped up asking if I had any questions for a live account manager. I did.

“I'm interested in this, but is it totally legal?” I asked Shine, the account rep.

Shine was slow to respond and came back disappointingly noncommittal: “You need to install the software at your end. Hence, you will need to check at your end for legal documents for the website.”

Shine had obviously been reading the same European news sites that I had. Last year Irish airline Ryanair filed suit against PR Aviation, a Dutch airfare comparison site, charging it with copyright infringement and breach of contract for scraping flight data from its site. The Court of Justice of the European Union (ECJ) dismissed the suit, saying the scraping amounted to “normal use” of a website. However, the ECJ did potentially leave the door open for businesses with unprotected databases, such as Ryanair, to establish contractual limitations on use of their databases by third parties. That opening, should it be entered into by airlines, might have businesses such as Expedia, Orbitz, and Priceline reimagining their business plans.

And it could have any other businesses that load up third-party data files through scraping activities doing the same. The reality, however, is that that's a big “and.”

“The airlines have been halfway successful taking travel agents to court, but it can take five years and then they lose,” said Gus Cunningham, CEO of ScrapeSentry, one of only “three and a half” companies, in his words, that block scrapers from websites.

Professional scrapers are not only out-front and plentiful—as the Google search demonstrated—they're also nimble and expensive to chase away. Basically, Cunningham said, it's a matter of stopping scrapers at the website door among the airlines, e-coms, real estate sites, and online gambling companies that are scraped the most. Cunningham's company monitors inbound Web traffic and uses an analysis engine to block suspect visitors per parameters set down by clients. ScanSentry has a nine-year-old database that helps it identify bad actors, much like Interpol with its criminal database. Then a human element must enter the process.

Some of Cunningham's clients feed bogus information to competitors identified as scrapers to ferret them out. In many cases, though, they opt to turn their heads. “Airlines have some flights where they just want to get as many butts as they can in the seats, so they won't concentrate their blocking efforts on those. They'll concentrate on the routes that are always jammed,” Cunningham said.

Unlike botnets that steal money by, say, serving bogus websites to siphon off programmatic ad dollars, scraping is not overtly criminal. In the wide sphere of digital commerce, it's probably most common that the scrap-ee is also a scrap-er. How vigilant vulnerable industries become, and how protective courts and law enforcement agencies grow, will depend on how much scraping activity increases. Cunningham said it's growing fast. More than one fifth of visitors to client websites last year were scrapers, according to a ScanSentry study. Among travel companies, meanwhile, scrapers doubled from 15% in 2013 to 33% last year.

“And,” Cunningham noted, “It is stealing.”

Source: http://www.dmnews.com/direct-line-blog/to-scrape-or-not-to-scrape/article/422662/

Thursday, 25 June 2015

Data Scraping - About Hand Scraped Flooring

Hand scraped hardwood flooring is one of the best floors that you can install in your house.

Advantages of Hand Scraped Hardwood Flooring

The product comes with a number of advantages which include:

Antique and modern technology: The floor professionally brings out the best elements of both antique and modern technology. The modern elements are in the quality of the product.

Unique patterns: Who doesn't want to be unique? These floors allow you to create your unique design. If you are going to use a machine, all you need to do is to set the machine such that it creates the pattern that you want. If the floor will be scraped by a craftsman, you should ask the craftsman to craft your desired pattern.

Character: The different depths in the floor provide you with character and color that you can't find in other types of floors. As the sun changes its angle during the day, the nooks and valleys on the board lit differently thus providing your board with an endless rich appearance.

Durability: Experts have been able to show that hand-scraped hardwood retains its look for a long time. If your kid or pet hits the floor, the dent just blends with the rest of the character making it hard for people to tell that there is a dent.

Making the floors shine again

Although, the scraped floors are designed to look worn and aged, they are made from modern wood which needs to be taken care of in order to retain its original look.

To make the floors shine again you need to remove all the dust and dirt that might be causing the wood to look dull.

After doing this you should mix 1 gallon of warm water with ½ teaspoon of dishwashing detergent and use it to clean the surface of the floor. The aim of doing this is to remove any stains that might be on the floor. When you complete doing this you should dampen the piece of cloth with club soda and then use another piece of cloth to buff the wood until it shines.

Conclusion

This is what you need to know about hand scraped hardwood flooring. When cleaning the floors you should avoid using oil based soaps as they dull the surface making your efforts worthless.

If the above method of shining the floor doesn't work, you should mix one part white vinegar and one part of cooking oil and use it to clean the floor.

Source: http://ezinearticles.com/?About-Hand-Scraped-Flooring&id=8990255

Saturday, 20 June 2015

Making data on the web useful: scraping

Introduction

Many times data is not easily accessible – although it does exist. As much as we wish everything was available in CSV or the format of our choice – most data is published in different forms on the web. What if you want to use the data to combine it with other datasets and explore it independently?

Scraping to the rescue!

Scraping describes the method to extract data hidden in documents – such as Web Pages and PDFs and make it useable for further processing. It is among the most useful skills if you set out to investigate data – and most of the time it’s not especially challenging. For the most simple ways of scraping you don’t even need to know how to write code.

This example relies heavily on Google Chrome for the first part. Some things work well with other browsers, however we will be using one specific browser extension only available on Chrome. If you can’t install Chrome, don’t worry the principles remain similar.

Code-free Scraping in 5 minutes using Google Spreadsheets & Google Chrome

Knowing the structure of a website is the first step towards extracting and using the data. Let’s get our data into a spreadsheet – so we can use it further. An easy way to do this is provided by a special formula in Google Spreadsheets.

Save yourselves hours of time in copy-paste agony with the ImportHTML command in Google Spreadsheets. It really is magic!

Recipes

In order to complete the next challenge, take a look in the Handbook at one of the following recipes:

    Extracting data from HTML tables.

    Scraping using the Scraper Extension for Chrome

Both methods are useful for:

    Extracting individual lists or tables from single webpages

The latter can do slightly more complex tasks, such as extracting nested information. Take a look at the recipe for more details.

Neither will work for:

    Extracting data spread across multiple webpages

Challenge

Task: Find a website with a table and scrape the information from it. Share your result on datahub.io (make sure to tag your dataset with schoolofdata.org)

Tip

Once you’ve got your table into the spreadsheet, you may want to move it around, or put it in another sheet. Right click the top left cell and select “paste special” – “paste values only”.

Scraping more than one webpage: Scraperwiki

Note: Before proceeding into full scraping mode, it’s helpful to understand the flesh and bones of what makes up a webpage. Read the Introduction to HTML recipe in the handbook.

Until now we’ve only scraped data from a single webpage. What if there are more? Or you want to scrape complex databases? You’ll need to learn how to program – at least a bit.

It’s beyond the scope of this course to teach how to scrape, our aim here is to help you understand whether it is worth investing your time to learn, and to point you at some useful resources to help you on your way!

Structure of a scraper

Scrapers are comprised of three core parts:

1.    A queue of pages to scrape
2.    An area for structured data to be stored, such as a database
3.    A downloader and parser that adds URLs to the queue and/or structured information to the database.

Fortunately for you there is a good website for programming scrapers: ScraperWiki.com

ScraperWiki has two main functions: You can write scrapers – which are optionally run regularly and the data is available to everyone visiting – or you can request them to write scrapers for you. The latter costs some money – however it helps to contact the Scraperwiki community (Google Group) someone might get excited about your project and help you!.

If you are interested in writing scrapers with Scraperwiki, check out this sample scraper – scraping some data about Parliament. Click View source to see the details. Also check out the Scraperwiki documentation: https://scraperwiki.com/docs/python/

When should I make the investment to learn how to scrape?

A few reasons (non-exhaustive list!):

1.    If you regularly have to extract data where there are numerous tables in one page.

2.    If your information is spread across numerous pages.

3.    If you want to run the scraper regularly (e.g. if information is released every week or month).

4.    If you want things like email alerts if information on a particular webpage changes.

…And you don’t want to pay someone else to do it for you!

Summary:

In this course we’ve covered Web scraping and how to extract data from websites. The main function of scraping is to convert data that is semi-structured into structured data and make it easily useable for further processing. While this is a relatively simple task with a bit of programming – for single webpages it is also feasible without any programming at all. We’ve introduced =importHTML and the Scraper extension for your scraping needs.

Further Reading

1.    Scraping for Journalism: A Guide for Collecting Data: ProPublica Guides

2.    Scraping for Journalists (ebook): Paul Bradshaw

3.    Scrape the Web: Strategies for programming websites that don’t expect it : Talk from PyCon

4.    An Introduction to Compassionate Screen Scraping: Will Larson

Any questions? Got stuck? Ask School of Data!

ScraperWiki has two main functions: You can write scrapers – which are optionally run regularly and the data is available to everyone visiting – or you can request them to write scrapers for you. The latter costs some money – however it helps to contact the Scraperwiki community (Google Group) someone might get excited about your project and help you!.

If you are interested in writing scrapers with Scraperwiki, check out this sample scraper – scraping some data about Parliament. Click View source to see the details. Also check out the Scraperwiki documentation: https://scraperwiki.com/docs/python/

When should I make the investment to learn how to scrape?

A few reasons (non-exhaustive list!):

1.    If you regularly have to extract data where there are numerous tables in one page.

2.    If your information is spread across numerous pages.

3.    If you want to run the scraper regularly (e.g. if information is released every week or month).

4.    If you want things like email alerts if information on a particular webpage changes.

…And you don’t want to pay someone else to do it for you!

Summary:

In this course we’ve covered Web scraping and how to extract data from websites. The main function of scraping is to convert data that is semi-structured into structured data and make it easily useable for further processing. While this is a relatively simple task with a bit of programming – for single webpages it is also feasible without any programming at all. We’ve introduced =importHTML and the Scraper extension for your scraping needs.

Source: http://schoolofdata.org/handbook/courses/scraping/

Monday, 8 June 2015

Scraping Services - Assuring Scraping Success with Proxy Data Scraping

Have you ever heard of "Data Scraping?" Data Scraping is the process of collecting useful data that has been placed in the public domain of the internet (private areas too if conditions are met) and storing it in databases or spreadsheets for later use in various applications. Data Scraping technology is not new and many a successful businessman has made his fortune by taking advantage of data scraping technology.

Sometimes website owners may not derive much pleasure from automated harvesting of their data. Webmasters have learned to disallow web scrapers access to their websites by using tools or methods that block certain ip addresses from retrieving website content. Data scrapers are left with the choice to either target a different website, or to move the harvesting script from computer to computer using a different IP address each time and extract as much data as possible until all of the scraper's computers are eventually blocked.

Thankfully there is a modern solution to this problem. Proxy Data Scraping technology solves the problem by using proxy IP addresses. Every time your data scraping program executes an extraction from a website, the website thinks it is coming from a different IP address. To the website owner, proxy data scraping simply looks like a short period of increased traffic from all around the world. They have very limited and tedious ways of blocking such a script but more importantly -- most of the time, they simply won't know they are being scraped.

You may now be asking yourself, "Where can I get Proxy Data Scraping Technology for my project?" The "do-it-yourself" solution is, rather unfortunately, not simple at all. Setting up a proxy data scraping network takes a lot of time and requires that you either own a bunch of IP addresses and suitable servers to be used as proxies, not to mention the IT guru you need to get everything configured properly. You could consider renting proxy servers from select hosting providers, but that option tends to be quite pricey but arguably better than the alternative: dangerous and unreliable (but free) public proxy servers.

There are literally thousands of free proxy servers located around the globe that are simple enough to use. The trick however is finding them. Many sites list hundreds of servers, but locating one that is working, open, and supports the type of protocols you need can be a lesson in persistence, trial, and error. However if you do succeed in discovering a pool of working public proxies, there are still inherent dangers of using them. First off, you don't know who the server belongs to or what activities are going on elsewhere on the server. Sending sensitive requests or data through a public proxy is a bad idea. It is fairly easy for a proxy server to capture any information you send through it or that it sends back to you. If you choose the public proxy method, make sure you never send any transaction through that might compromise you or anyone else in case disreputable people are made aware of the data.

A less risky scenario for proxy data scraping is to rent a rotating proxy connection that cycles through a large number of private IP addresses. There are several of these companies available that claim to delete all web traffic logs which allows you to anonymously harvest the web with minimal threat of reprisal. Companies such as offer large scale anonymous proxy solutions, but often carry a fairly hefty setup fee to get you going.

The other advantage is that companies who own such networks can often help you design and implementation of a custom proxy data scraping program instead of trying to work with a generic scraping bot. After performing a simple Google search, I quickly found one company (www.ScrapeGoat.com) that provides anonymous proxy server access for data scraping purposes. Or, according to their website, if you want to make your life even easier, ScrapeGoat can extract the data for you and deliver it in a variety of different formats often before you could even finish configuring your off the shelf data scraping program.

Whichever path you choose for your proxy data scraping needs, don't let a few simple tricks thwart you from accessing all the wonderful information stored on the world wide web!

Source: http://ezinearticles.com/?Assuring-Scraping-Success-with-Proxy-Data-Scraping&id=248993

Friday, 29 May 2015

Data Scraping Services - Web Scraping Video Tutorial Collection for All Programming Language

Web scraping is a mechanism in which request made to website URL to get HTML Document text and that text then parsed to extract data from the HTML codes. Website scraping for data is a generalize approach and can be implemented in any programming language like PHP, Java, C#, Python and many other.

There are many Web scraping software available in market using which you can extract data with no coding knowledge. In many case the scraping doesn’t help due to custom crawling flow for data scraping and in that case you have to make your own web scraping application in one of the programming language you know. In this post I have collected scraping video tutorials for all programming language.

I mostly familiar with web scraping using PHP, C# and some other scraping tools and providing web scraping service. If you have any scraping requirement send me your requirements and I will get back with sample data scrape and best price.

Web Scraping Using PHP

You can do web scraping in PHP using CURL library and Simple HTML DOM parsing library. PHP function file_get_content() can also be useful for making web request. One drawback of scraping using PHP is it can’t parse JavaScript so ajax based scraping can’t be possible using PHP.

Web Scraping Using C#

There are many library available in .Net for HTML parsing and data scraping. I have used Web Browser control and HTML Agility Pack for data extraction in .Net using C#

I have didn’t done web scraping in Java, PERL and Python. I had learned web scraping in node.js using Casper.JS and Phantom.JS library. But I thought below tutorial will be helpful for some one who are Java and Python based.

Web Scraping Using Jsoup in Java

Scraping Stock Data Using Python

Develop Web Crawler Using PERL

Web Scraping Using Node.Js

If you find any other good web scraping video tutorial then you can share the link in comment so other readesr get benefit form that.

Source: http://webdata-scraping.com/web-scraping-video-tutorial-collection-programming-language/

Tuesday, 26 May 2015

Web Scraping Services : What are the ethics of web scraping?

Someone recently asked: "Is web scraping an ethical concept?" I believe that web scraping is absolutely an ethical concept. Web scraping (or screen scraping) is a mechanism to have a computer read a website. There is absolutely no technical difference between an automated computer viewing a website and a human-driven computer viewing a website. Furthermore, if done correctly, scraping can provide many benefits to all involved.

There are a bunch of great uses for web scraping. First, services like Instapaper, which allow saving content for reading on the go, use screen scraping to save a copy of the website to your phone. Second, services like Mint.com, an app which tells you where and how you are spending your money, uses screen scraping to access your bank's website (all with your permission). This is useful because banks do not provide many ways for programmers to access your financial data, even if you want them to. By getting access to your data, programmers can provide really interesting visualizations and insight into your spending habits, which can help you save money.

That said, web scraping can veer into unethical territory. This can take the form of reading websites much quicker than a human could, which can cause difficulty for the servers to handle it. This can cause degraded performance in the website. Malicious hackers use this tactic in what’s known as a "Denial of Service" attack.

Another aspect of unethical web scraping comes in what you do with that data. Some people will scrape the contents of a website and post it as their own, in effect stealing this content. This is a big no-no for the same reasons that taking someone else's book and putting your name on it is a bad idea. Intellectual property, copyright and trademark laws still apply on the internet and your legal recourse is much the same. People engaging in web scraping should make every effort to comply with the stated terms of service for a website. Even when in compliance with those terms, you should take special care in ensuring your activity doesn't affect other users of a website.

One of the downsides to screen scraping is it can be a brittle process. Minor changes to the backing website can often leave a scraper completely broken. Herein lies the mechanism for prevention: making changes to the structure of the code of your website can wreak havoc on a screen scraper's ability to extract information. Periodically making changes that are invisible to the user but affect the content of the code being returned is the most effective mechanism to thwart screen scrapers. That said, this is only a set-back. Authors of screen scrapers can always update them and, as there is no technical difference between a computer-backed browser and a human-backed browser, there's no way to 100% prevent access.

Going forward, I expect screen scraping to increase. One of the main reasons for screen scraping is that the underlying website doesn't have a way for programmers to get access to the data they want. As the number of programmers (and the need for programmers) increases over time, so too will the need for data sources. It is unreasonable to expect every company to dedicate the resources to build a programmer-friendly access point. Screen scraping puts the onus of data extraction on the programmer, not the company with the data, which can work out well for all involved.

Source: https://quickleft.com/blog/is-web-scraping-ethical/

Monday, 25 May 2015

Screen Scraping with BeautifulSoup and lxml

Please enjoy this — a free Chapter of the Python network programming book that I revised for Apress in 2010!

I completely rewrote this chapter for the book's second edition, to feature two powerful libraries that have appeared since the book first came out. I show how to screen-scrape a real-life web page using both BeautifulSoup and also the powerful lxml library (their web sites are here and here).

I chose this chapter for release because screen scraping is often the first network task that a novice Python programmer tackles. Because this material is oriented towards beginners, it explains the entire process — from fetching web pages, to understanding HTML, to querying for specific elements in the document.

Program listings are available for this chapter in both Python 2 and also in Python 3. Let me know if you have any questions!

Most web sites are designed first and foremost for human eyes. While well-designed sites offer formal APIs by which you can construct Google maps, upload Flickr photos, or browse YouTube videos, many sites offer nothing but HTML pages formatted for humans. If you need a program to be able to fetch its data, then you will need the ability to dive into densely formatted markup and retrieve the information you need—a process known affectionately as screen scraping.

In one's haste to grab information from a web page sitting open in your browser in front of you, it can be easy for even experienced programmers to forget to check whether an API is provided for data that they need. So try to take a few minutes investigating the site in which you are interested to see if some more formal programming interface is offered to their services. Even an RSS feed can sometimes be easier to parse than a list of items on a full web page.

Also be careful to check for a “terms of service” document on each site. YouTube, for example, offers an API and, in return, disallows programs from trying to parse their web pages. Sites usually do this for very important reasons related to performance and usage patterns, so I recommend always obeying the terms of service and simply going elsewhere for your data if they prove too restrictive.

Regardless of whether terms of service exist, always try to be polite when hitting public web sites. Cache pages or data that you will need for several minutes or hours, rather than hitting their site needlessly over and over again. When developing your screen-scraping algorithm, test against a copy of their web page that you save to disk, instead of doing an HTTP round-trip with every test. And always be aware that excessive use can result in your IP being temporarily or permanently blocked from a site if its owners are sensitive to automated sources of load.

Fetching Web Pages

Before you can parse an HTML-formatted web page, you of course have to acquire some. Chapter 9 provides the kind of thorough introduction to the HTTP protocol that can help you figure out how to fetch information even from sites that require passwords or cookies. But, in brief, here are some options for downloading content.

From the Future

If you need a simple way to fetch web pages before scraping them, try Kenneth Reitz's requests library!

The library was not released until after the book was published, but has already taken the Python world by storm. The simplicity and convenience of its API has made it the tool of choice for making web requests from Python.

 You can use urllib2, or the even lower-level httplib, to construct an HTTP request that will return a web page. For each form that has to be filled out, you will have to build a dictionary representing the field names and data values inside; unlike a real web browser, these libraries will give you no help in submitting forms.

 You can to install mechanize and write a program that fills out and submits web forms much as you would do when sitting in front of a web browser. The downside is that, to benefit from this automation, you will need to download the page containing the form HTML before you can then submit it—possibly doubling the number of web requests you perform!

 If you need to download and parse entire web sites, take a look at the Scrapy project, hosted at scrapy.org, which provides a framework for implementing your own web spiders. With the tools it provides, you can write programs that follow links to every page on a web site, tabulating the data you want extracted from each page.

 When web pages wind up being incomplete because they use dynamic JavaScript to load data that you need, you can use the QtWebKit module of the PyQt4 library to load a page, let the JavaScript run, and then save or parse the resulting complete HTML page.

 Finally, if you really need a browser to load the site, both the Selenium and Windmill test platforms provide a way to drive a standard web browser from inside a Python program. You can start the browser up, direct it to the page of interest, fill out and submit forms, do whatever else is necessary to bring up the data you need, and then pull the resulting information directly from the DOM elements that hold them.

These last two options both require third-party components or Python modules that are built against large libraries, and so we will not cover them here, in favor of techniques that require only pure Python.

For our examples in this chapter, we will use the site of the United States National Weather Service, which lives at www.weather.gov.

Among the better features of the United States government is its having long ago decreed that all publications produced by their agencies are public domain. This means, happily, that I can pull all sorts of data from their web site and not worry about the fact that copies of the data are working their way into this book.

Of course, web sites change, so the online source code for this chapter includes the downloaded web page on which the scripts in this chapter are designed to work. That way, even if their site undergoes a major redesign, you will still be able to try out the code examples in the future. And, anyway—as I recommended previously—you should be kind to web sites by always developing your scraping code against a downloaded copy of a web page to help reduce their load.

Downloading Pages Through Form Submission

The task of grabbing information from a web site usually starts by reading it carefully with a web browser and finding a route to the information you need. Figure 10–1 shows the site of the National Weather Service; for our first example, we will write a program that takes a city and state as arguments and prints out the current conditions, temperature, and humidity. If you will explore the site a bit, you will find that city-specific forecasts can be visited by typing the city name into the small “Local forecast” form in the left margin.

Figure 10–1. The National Weather Service web site

(click to enlarge)

When using the urllib2 module from the Standard Library, you will have to read the web page HTML manually to find the form. You can use the View Source command in your browser, search for the words “Local forecast,” and find the following form in the middle of the sea of HTML:

<form method="post" action="http://forecast.weather.gov/zipcity.php" ...>

...

<input type="text" id="zipcity" name="inputstring" size="9"

 value="City, St" onfocus="this.value='';" />

<input type="submit" name="Go2" value="Go" />

</form>

The only important elements here are the <form> itself and the <input> fields inside; everything else is just decoration intended to help human readers.

This form does a POST to a particular URL with, it appears, just one parameter: an inputstring giving the city name and state. Listing 10–1 shows a simple Python program that uses only the Standard Library to perform this interaction, and saves the result to phoenix.html.

Listing 10–1. Submitting a Form with “urllib2”

#!/usr/bin/env python

# Foundations of Python Network Programming - Chapter 10 - fetch_urllib2.py

# Submitting a form and retrieving a page with urllib2

import urllib, urllib2

data = urllib.urlencode({'inputstring': 'Phoenix, AZ'})

info = urllib2.urlopen('http://forecast.weather.gov/zipcity.php', data)

content = info.read()

open('phoenix.html', 'w').write(content)

On the one hand, urllib2 makes this interaction very convenient; we are able to download a forecast page using only a few lines of code. But, on the other hand, we had to read and understand the form ourselves instead of relying on an actual HTML parser to read it. The approach encouraged by mechanize is quite different: you need only the address of the opening page to get started, and the library itself will take responsibility for exploring the HTML and letting you know what forms are present. Here are the forms that it finds on this particular page:

>>> import mechanize

>>> br = mechanize.Browser()

>>> response = br.open('http://www.weather.gov/')

>>> for form in br.forms():

... print '%r %r %s' % (form.name, form.attrs.get('id'), form.action)

... for control in form.controls:

... print ' ', control.type, control.name, repr(control.value)

None None http://search.usa.gov/search

 hidden v:project 'firstgov'

 text query ''

 radio affiliate ['nws.noaa.gov']

 submit None 'Go'

None None http://forecast.weather.gov/zipcity.php

 text inputstring 'City, St'

 submit Go2 'Go'

'jump' 'jump' http://www.weather.gov/

 select menu ['http://www.weather.gov/alerts-beta/']

 button None None

Here, mechanize has helped us avoid reading any HTML at all. Of course, pages with very obscure form names and fields might make it very difficult to look at a list of forms like this and decide which is the form we see on the page that we want to submit; in those cases, inspecting the HTML ourselves can be helpful, or—if you use Google Chrome, or Firefox with Firebug installed—right-clicking the form and selecting “Inspect Element” to jump right to its element in the document tree.

Once we have determined that we need the zipcity.php form, we can write a program like that shown in Listing 10–2. You can see that at no point does it build a set of form fields manually itself, as was necessary in our previous listing. Instead, it simply loads the front page, sets the one field value that we care about, and then presses the form's submit button. Note that since this HTML form did not specify a name, we had to create our own filter function—the lambda function in the listing—to choose which of the three forms we wanted.

Listing 10–2. Submitting a Form with mechanize

#!/usr/bin/env python

# Foundations of Python Network Programming - Chapter 10 - fetch_mechanize.py

# Submitting a form and retrieving a page with mechanize

import mechanize

br = mechanize.Browser()

br.open('http://www.weather.gov/')

br.select_form(predicate=lambda(form): 'zipcity' in form.action)

br['inputstring'] = 'Phoenix, AZ'

response = br.submit()

content = response.read()

open('phoenix.html', 'w').write(content)

Many mechanize users instead choose to select forms by the order in which they appear in the page—in which case we could have called select_form(nr=1). But I prefer not to rely on the order, since the real identity of a form is inherent in the action that it performs, not its location on a page.

You will see immediately the problem with using mechanize for this kind of simple task: whereas Listing 10–1 was able to fetch the page we wanted with a single HTTP request, Listing 10–2 requires two round-trips to the web site to do the same task. For this reason, I avoid using mechanize for simple form submission. Instead, I keep it in reserve for the task at which it really shines: logging on to web sites like banks, which set cookies when you first arrive at their front page and require those cookies to be present as you log in and browse your accounts. Since these web sessions require a visit to the front page anyway, no extra round-trips are incurred by using mechanize.

The Structure of Web Pages

There is a veritable glut of online guides and published books on the subject of HTML, but a few notes about the format would seem to be appropriate here for users who might be encountering the format for the first time.

The Hypertext Markup Language (HTML) is one of many markup dialects built atop the Standard Generalized Markup Language (SGML), which bequeathed to the world the idea of using thousands of angle brackets to mark up plain text. Inserting bold and italics into a format like HTML is as simple as typing eight angle brackets:

The very strange book Tristram Shandy.

In the terminology of SGML, the strings and are each tags—they are, in fact, an opening and a closing tag—and together they create an element that contains the text very inside it. Elements can contain text as well as other elements, and can define a series of key/value attribute pairs that give more information about the element:

I am reading Hamlet.

There is a whole subfamily of markup languages based on the simpler Extensible Markup Language (XML), which takes SGML and removes most of its special cases and features to produce documents that can be generated and parsed without knowing their structure ahead of time. The problem with SGML languages in this regard—and HTML is one particular example—is that they expect parsers to know the rules about which elements can be nested inside which other elements, and this leads to constructions like this unordered list <ul>, inside which are several list items <li>:

<ul><li>First<li>Second<li>Third<li>Fourth</ul>

At first this might look like a series of <li> elements that are more and more deeply nested, so that the final word here is four list elements deep. But since HTML in fact says that <li> elements cannot nest, an HTML parser will understand the foregoing snippet to be equivalent to this more explicit XML string:

<ul><li>First</li><li>Second</li><li>Third</li><li>Fourth</li></ul>

And beyond this implicit understanding of HTML that a parser must possess are the twin problems that, first, various browsers over the years have varied wildly in how well they can reconstruct the document structure when given very concise or even deeply broken HTML; and, second, most web page authors judge the quality of their HTML by whether their browser of choice renders it correctly. This has resulted not only in a World Wide Web that is full of sites with invalid and broken HTML markup, but also in the fact that the permissiveness built into browsers has encouraged different flavors of broken HTML among their different user groups.

If HTML is a new concept to you, you can find abundant resources online. Here are a few documents that have been longstanding resources in helping programmers learn the format:

 http://www.w3.org/MarkUp/Guide/

 http://www.w3.org/MarkUp/Guide/Advanced.html

 http://www.w3.org/MarkUp/Guide/Style

The brief Bare Bones Guide, and the long and verbose HTML standard itself, are good resources to have when trying to remember an element name or the name of a particular attribute value:

 http://werbach.com/barebones/barebones.html

 http://www.w3.org/TR/REC-html40/

When building your own web pages, try to install a real HTML validator in your editor, IDE, or build process, or test your web site once it is online by submitting it to

 http://validator.w3.org/

You might also want to consider using the tidy tool, which can also be integrated into an editor or build process:

 http://tidy.sourceforge.net/

We will now turn to that weather forecast for Phoenix, Arizona, that we downloaded earlier using our scripts (note that we will avoid creating extra traffic for the NWS by running our experiments against this local file), and we will learn how to extract actual data from HTML.

Three Axes

Parsing HTML with Python requires three choices:

 The parser you will use to digest the HTML, and try to make sense of its tangle of opening and closing tags

 The API by which your Python program will access the tree of concentric elements that the parser built from its analysis of the HTML page

 What kinds of selectors you will be able to write to jump directly to the part of the page that interests you, instead of having to step into the hierarchy one element at a time

The issue of selectors is a very important one, because a well-written selector can unambiguously identify an HTML element that interests you without your having to touch any of the elements above it in the document tree. This can insulate your program from larger design changes that might be made to a web site; as long as the element you are selecting retains the same ID, name, or whatever other property you select it with, your program will still find it even if after the redesign it is several levels deeper in the document.

I should pause for a second to explain terms like “deeper,” and I think the concept will be clearest if we reconsider the unordered list that was quoted in the previous section. An experienced web developer looking at that list rearranges it in her head, so that this is what it looks like:

<ul>

<li>First</li>

<li>Second</li>

<li>Third</li>

<li>Fourth</li>

</ul>

Here the <ul> element is said to be a “parent” element of the individual list items, which “wraps” them and which is one level “above” them in the whole document. The <li> elements are “siblings” of one another; each is a “child” of the <ul> element that “contains” them, and they sit “below” their parent in the larger document tree. This kind of spatial thinking winds up being very important for working your way into a document through an API.

In brief, here are your choices along each of the three axes that were just listed:

 The most powerful, flexible, and fastest parser at the moment appears to be the HTMLParser that comes with lxml; the next most powerful is the longtime favorite BeautifulSoup (I see that its author has, in his words, “abandoned” the new 3.1 version because it is weaker when given broken HTML, and recommends using the 3.0 series until he has time to release 3.2); and coming in dead last are the parsing classes included with the Python Standard Library, which no one seems to use for serious screen scraping.

 The best API for manipulating a tree of HTML elements is ElementTree, which has been brought into the Standard Library for use with the Standard Library parsers, and is also the API supported by lxml; BeautifulSoup supports an API peculiar to itself; and a pair of ancient, ugly, event-based interfaces to HTML still exist in the Python Standard Library.

 The lxml library supports two of the major industry-standard selectors: CSS selectors and XPath query language; BeautifulSoup has a selector system all its own, but one that is very powerful and has powered countless web-scraping programs over the years.

Given the foregoing range of options, I recommend using lxml when doing so is at all possible—installation requires compiling a C extension so that it can accelerate its parsing using libxml2—and using BeautifulSoup if you are on a machine where you can install only pure Python. Note that lxml is available as a pre-compiled package named python-lxml on Ubuntu machines, and that the best approach to installation is often this command line:

STATIC_DEPS=true pip install lxml

And if you consult the lxml documentation, you will find that it can optionally use the BeautifulSoup parser to build its own ElementTree-compliant trees of elements. This leaves very little reason to use BeautifulSoup by itself unless its selectors happen to be a perfect fit for your problem; we will discuss them later in this chapter.

But the state of the art may advance over the years, so be sure to consult its own documentation as well as recent blogs or Stack Overflow questions if you are having problems getting it to compile.

From the Future

The BeautifulSoup project has recovered! While the text below — written in late 2010 — has to carefully avoid the broken 3.2 release in favor of 3.0, BeautifulSoup has now released a rewrite named beautifulsoup4 on the Python Package Index that works with both Python 2 and 3. Once installed, simply import it like this:

from bs4 import BeautifulSoup

I just ran a test, and it reads the malformed phoenix.html page perfectly.

Diving into an HTML Document

The tree of objects that a parser creates from an HTML file is often called a Document Object Model, or DOM, even though this is officially the name of one particular API defined by the standards bodies and implemented by browsers for the use of JavaScript running on a web page.

The task we have set for ourselves, you will recall, is to find the current conditions, temperature, and humidity in the phoenix.html page that we have downloaded. Here is an excerpt: Listing 10–3, which focuses on the pane that we are interested in.

Listing 10–3. Excerpt from the Phoenix Forecast Page

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN"><html><head>

<title>7-Day Forecast for Latitude 33.45°N and Longitude 112.07°W (Elev. 1132 ft)</title><link rel="STYLESHEET" type="text/css" href="fonts/main.css">

...

<table cellspacing="0" cellspacing="0" border="0" width="100%"><tr align="center"><td><table width='100%' border='0'>

<tr>

<td align ='center'>

Phoenix, Phoenix Sky Harbor International Airport 

Last Update on 29 Oct 7:51 MST 

</td>
</tr>
<tr>

<td colspan='2'>

<table cellspacing='0' cellpadding='0' border='0' align='left'>

<tr>

<td class='big' width='120' align='center'>



A Few Clouds 

 71°F (22°C)</td>

<td rowspan='2' width='200'><table cellspacing='0' cellpadding='2' border='0' width='100%'>

<tr bgcolor='#b0c4de'>

<td>Humidity:</td>

<td align='right'>30 %</td>

</tr>

<tr bgcolor='#ffefd5'>

<td>Wind Speed:</td><td align='right'>SE 5 MPH 

</td>
</tr>

<tr bgcolor='#b0c4de'>

<td>Barometer:</td><td align='right' nowrap>30.05 in (1015.90 mb)</td></tr>

<tr bgcolor='#ffefd5'>

<td>Dewpoint:</td><td align='right'>38°F (3°C)</td>

</tr>
</tr>

<tr bgcolor='#ffefd5'>

<td>Visibility:</td><td align='right'>10.00 Miles</td>

</tr>

<tr><td nowrap><a href='http://www.wrh.noaa.gov/total_forecast/other_obs.php?wfo=psr&zone=AZZ023' class='link'>More Local Wx:</a> </td>

<td nowrap align='right'><a href='http://www.wrh.noaa.gov/mesowest/getobext.php?wfo=psr&sid=KPHX&num=72' class='link'>3 Day History:</a> </td></tr>

</table>

...

There are two approaches to narrowing your attention to the specific area of the document in which you are interested. You can either search the HTML for a word or phrase close to the data that you want, or, as we mentioned previously, use Google Chrome or Firefox with Firebug to “Inspect Element” and see the element you want embedded in an attractive diagram of the document tree. Figure 10–2 shows Google Chrome with its Developer Tools pane open following an Inspect Element command: my mouse is poised over the element that was brought up in its document tree, and the element itself is highlighted in blue on the web page itself.

Image

Figure 10–2. Examining Document Elements in the Browser

(click to enlarge)

Note that Google Chrome does have an annoying habit of filling in “conceptual” tags that are not actually present in the source code, like the <tbody> tags that you can see in every one of the tables shown here. For that reason, I look at the actual HTML source before writing my Python code; I mainly use Chrome to help me find the right places in the HTML.

We will want to grab the text “A Few Clouds” as well as the temperature before turning our attention to the table that sits to this element's right, which contains the humidity.

A properly indented version of the HTML page that you are scraping is good to have at your elbow while writing code. I have included phoenix-tidied.html with the source code bundle for this chapter so that you can take a look at how much easier it is to read!

You can see that the element displaying the current conditions in Phoenix sits very deep within the document hierarchy. Deep nesting is a very common feature of complicated page designs, and that is why simply walking a document object model can be a very verbose way to select part of a document—and, of course, a brittle one, because it will be sensitive to changes in any of the target element's parent. This will break your screen-scraping program not only if the target web site does a redesign, but also simply because changes in the time of day or the need for the site to host different kinds of ads can change the layout subtly and ruin your selector logic.

To see how direct document-object manipulation would work in this case, we can load the raw page directly into both the lxml and BeautifulSoup systems.

>>> import lxml.etree
>>> parser = lxml.etree.HTMLParser(encoding='utf-8')
>>> tree = lxml.etree.parse('phoenix.html', parser)

The need for a separate parser object here is because, as you might guess from its name, lxml is natively targeted at XML files.

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup(open('phoenix.html'))

Traceback (most recent call last):
...
HTMLParseError: malformed start tag, at line 96, column 720

What on earth? Well, look, the National Weather Service does not check or tidy its HTML! I might have chosen a different example for this book if I had known, but since this is a good illustration of the way the real world works, let's press on. Jumping to line 96, column 720 of phoenix.html, we see that there does indeed appear to be some broken HTML:

<a href="http://www.weather.gov"www.weather.gov</a>

You can see that the tag starts before a closing angle bracket has been encountered for the <a> tag. But why should BeautifulSoup care? I wonder what version I have installed.

>>> BeautifulSoup.__version__

'3.1.0'

Well, drat. I typed too quickly and was not careful to specify a working version when I ran pip to install BeautifulSoup into my virtual environment. Let's try again:

$ pip install BeautifulSoup==3.0.8.1

And now the broken document parses successfully:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup(open('phoenix.html'))

That is much better!

Now, if we were to take the approach of starting at the top of the document and digging ever deeper until we find the node that we are interested in, we are going to have to generate some very verbose code. Here is the approach we would have to take with lxml:
>>> fonttag = tree.find('body').find('div').findall('table')[3] \

... .findall('tr')[1].find('td').findall('table')[1].find('tr') \

... .findall('td')[1].findall('table')[1].find('tr').find('td') \

... .find('table').findall('tr')[1].find('td').find('table') \

... .find('tr').find('td').find('font')

>>> fonttag.text

'\nA Few Clouds'

An attractive syntactic convention lets BeautifulSoup handle some of these steps more beautifully:

>>> fonttag = soup.body.div('table', recursive=False)[3] \

... ('tr', recursive=False)[1].td('table', recursive=False)[1].tr \

... ('td', recursive=False)[1]('table', recursive=False)[1].tr.td \

... .table('tr', recursive=False)[1].td.table \

... .tr.td.font

>>> fonttag.text

u'A Few Clouds71°F(22°C)'

BeautifulSoup lets you choose the first child element with a given tag by simply selecting the attribute .tagname, and lets you receive a list of child elements with a given tag name by calling an element like a function—you can also explicitly call the method findAll()—with the tag name and a recursive option telling it to pay attention just to the children of an element; by default, this option is set to True, and BeautifulSoup will run off and find all elements with that tag in the entire sub-tree beneath an element!

Anyway, two lessons should be evident from the foregoing exploration.

First, both lxml and BeautifulSoup provide attractive ways to quickly grab a child element based on its tag name and position in the document.

Second, we clearly should not be using such primitive navigation to try descending into a real-world web page! I have no idea how code like the expressions just shown can easily be debugged or maintained; they would probably have to be re-built from the ground up if anything went wrong with them—they are a painful example of write-once code.

And that is why selectors that each screen-scraping library supports are so critically important: they are how you can ignore the many layers of elements that might surround a particular target, and dive right in to the piece of information you need.

Figuring out how HTML elements are grouped, by the way, is much easier if you either view HTML with an editor that prints it as a tree, or if you run it through a tool like HTML tidy from W3C that can indent each tag to show you which ones are inside which other ones:
$ tidy phoenix.html > phoenix-tidied.html

You can also use either of these libraries to try tidying the code, with a call like one of these:

lxml.html.tostring(html)

soup.prettify()

See each library's documentation for more details on using these calls.

Selectors

A selector is a pattern that is crafted to match document elements on which your program wants to operate. There are several popular flavors of selector, and we will look at each of them as possible techniques for finding the current-conditions tag in the National Weather Service page for Phoenix. We will look at three:

• People who are deeply XML-centric prefer XPath expressions, which are a companion technology to XML itself and let you match elements based on their ancestors, their own identity, and textual matches against their attributes and text content. They are very powerful as well as quite general.

• If you are a web developer, then you probably link to CSS selectors as the most natural choice for examining HTML. These are the same patterns used in Cascading Style Sheets documents to describe the set of elements to which each set of styles should be applied.

• Both lxml and BeautifulSoup, as we have seen, provide a smattering of their own methods for finding document elements.

Here are standards and descriptions for each of the selector styles just described— first, XPath:

 http://www.w3.org/TR/xpath/

 http://codespeak.net/lxml/tutorial.html#using-xpath-to-find-text

 http://codespeak.net/lxml/xpathxslt.html

And here are some CSS selector resources:

 http://www.w3.org/TR/CSS2/selector.html

 http://codespeak.net/lxml/cssselect.html

And, finally, here are links to documentation that looks at selector methods peculiar to lxml and BeautifulSoup:

 http://codespeak.net/lxml/tutorial.html#elementpath

http://www.crummy.com/software/BeautifulSoup/documentation.html#Searching the Parse Tree

The National Weather Service has not been kind to us in constructing this web page. The area that contains the current conditions seems to be constructed entirely of generic untagged elements; none of them have id or class values like currentConditions or temperature that might help guide us to them.

Well, what are the features of the elements that contain the current weather conditions in Listing 10–3? The first thing I notice is that the enclosing <td> element has the class "big". Looking at the page visually, I see that nothing else seems to be of exactly that font size; could it be so simple as to search the document for every <td> with this CSS class? Let us try, using a CSS selector to begin with:

>>> from lxml.cssselect import CSSSelector

>>> sel = CSSSelector('td.big')

>>> sel(tree)

[<Element td at b72ec0a4>]

Perfect! It is also easy to grab elements with a particular class attribute using the peculiar syntax of BeautifulSoup:

>>> soup.find('td', 'big')

<td class="big" width="120" align="center">



A Few Clouds 

 71°F (22°C)</td>

Writing an XPath selector that can find CSS classes is a bit difficult since the class="" attribute contains space-separated values and we do not know, in general, whether the class will be listed first, last, or in the middle.

>>> tree.xpath(".//td[contains(concat(' ', normalize-space(@class), ' '), ' big ')]")

[<Element td at a567fcc>]

This is a common trick when using XPath against HTML: by prepending and appending spaces to the class attribute, the selector assures that it can look for the target class name with spaces around it and find a match regardless of where in the list of classes the name falls.

Selectors, then, can make it simple, elegant, and also quite fast to find elements deep within a document that interest us. And if they break because the document is redesigned or because of a corner case we did not anticipate, they tend to break in obvious ways, unlike the tedious and deep procedure of walking the document tree that we attempted first.

Once you have zeroed in on the part of the document that interests you, it is generally a very simple matter to use the ElementTree or the old BeautifulSoup API to get the text or attribute values you need. Compare the following code to the actual tree shown in Listing 10–3:

>>> td = sel(tree)[0]

>>> td.find('font').text

'\nA Few Clouds'

>>> td.find('font').findall('br')[1].tail

u'71°F'

If you are annoyed that the first string did not return as a Unicode object, you will have to blame the ElementTree standard; the glitch has been corrected in Python 3! Note that ElementTree thinks of text strings in an HTML file not as entities of their own, but as either the .text of its parent element or the .tail of the previous element. This can take a bit of getting used to, and works like this:



My favorite play is # the element's .text



 Hamlet # the element's .text



which is not really # the element's .tail



 Danish # the element's .text



but English. # the element's .tail



This can be confusing because you would think of the three words favorite and really and English as being at the same “level” of the document—as all being children of the element somehow—but lxml considers only the first word to be part of the text attached to the element, and considers the other two to belong to the tail texts of the inner and elements. This arrangement can require a bit of contortion if you ever want to move elements without disturbing the text around them, but leads to rather clean code otherwise, if the programmer can keep a clear picture of it in her mind.

BeautifulSoup, by contrast, considers the snippets of text and the elements inside the tag to all be children sitting at the same level of its hierarchy. Strings of text, in other words, are treated as phantom elements. This means that we can simply grab our text snippets by choosing the right child nodes:

>>> td = soup.find('td', 'big')

>>> td.font.contents[0]

u'\nA Few Clouds'

>>> td.font.contents[4]

u'71°F'

Through a similar operation, we can direct either lxml or BeautifulSoup to the humidity datum. Since the word Humidity: will always occur literally in the document next to the numeric value, this search can be driven by a meaningful term rather than by something as random as the big CSS tag. See Listing 10–4 for a complete screen-scraping routine that does the same operation first with lxml and then with BeautifulSoup.

This complete program, which hits the National Weather Service web page for each request, takes the city name on the command line:

$ python weather.py Springfield, IL

Condition:

Traceback (most recent call last):

..

AttributeError: 'NoneType' object has no attribute 'text'

And here you can see, superbly illustrated, why screen scraping is always an approach of last resort and should always be avoided if you can possibly get your hands on the data some other way: because presentation markup is typically designed for one thing—human readability in browsers—and can vary in crazy ways depending on what it is displaying.

What is the problem here? A short investigation suggests that the NWS page includes only a element inside of the <tr> if—and this is just a guess of mine, based on a few examples—the description of the current conditions is several words long and thus happens to contain a space. The conditions in Phoenix as I have written this chapter are “A Few Clouds,” so the foregoing code has worked just fine; but in Springfield, the weather is “Fair” and therefore does not need a wrapper around it, apparently.

Listing 10–4. Completed Weather Scraper

#!/usr/bin/env python

# Foundations of Python Network Programming - Chapter 10 - weather.py

# Fetch the weather forecast from the National Weather Service.

import sys, urllib, urllib2

import lxml.etree

from lxml.cssselect import CSSSelector

from BeautifulSoup import BeautifulSoup

if len(sys.argv) < 2:

 print >>sys.stderr, 'usage: weather.py CITY, STATE'

 exit(2)

data = urllib.urlencode({'inputstring': ' '.join(sys.argv[1:])})

info = urllib2.urlopen('http://forecast.weather.gov/zipcity.php', data)

content = info.read()

# Solution #1

parser = lxml.etree.HTMLParser(encoding='utf-8')

tree = lxml.etree.fromstring(content, parser)

big = CSSSelector('td.big')(tree)[0]

if big.find('font') is not None:

 big = big.find('font')

print 'Condition:', big.text.strip()

print 'Temperature:', big.findall('br')[1].tail

tr = tree.xpath('.//td[b="Humidity"]')[0].getparent()

print 'Humidity:', tr.findall('td')[1].text

print

# Solution #2

soup = BeautifulSoup(content) # doctest: +SKIP

big = soup.find('td', 'big')

if big.font is not None:

 big = big.font

print 'Condition:', big.contents[0].string.strip()

temp = big.contents[3].string or big.contents[4].string # can be either

print 'Temperature:', temp.replace('°', ' ')

tr = soup.find('b', text='Humidity').parent.parent.parent

print 'Humidity:', tr('td')[1].string

print

If you look at the final form of Listing 10–4, you will see a few other tweaks that I made as I noticed changes in format with different cities. It now seems to work against a reasonable selection of locations; again, note that it gives the same report twice, generated once with lxml and once with BeautifulSoup:

$ python weather.py Springfield, IL

Condition: Fair

Temperature: 54 °F

Humidity: 28 %</code>

<code>Condition: Fair

Temperature: 54 F

Humidity: 28 %

$ python weather.py Grand Canyon, AZ

Condition: Fair

Temperature: 67°F

Humidity: 28 %

Condition: Fair

Temperature: 67 F

Humidity: 28 %

You will note that some cities have spaces between the temperature and the F, and others do not. No, I have no idea why. But if you were to parse these values to compare them, you would have to learn every possible variant and your parser would have to take them into account.

I leave it as an exercise to the reader to determine why the web page currently displays the word “NULL”—you can even see it in the browser—for the temperature in Elk City, Oklahoma. Maybe that location is too forlorn to even deserve a reading? In any case, it is yet another special case that you would have to treat sanely if you were actually trying to repackage this HTML page for access from an API:

$ python weather.py Elk City, OK

Condition: Fair and Breezy

Temperature: NULL

Humidity: NA

Condition: Fair and Breezy

Temperature: NULL

Humidity: NA

I also leave as an exercise to the reader the task of parsing the error page that comes up if a city cannot be found, or if the Weather Service finds it ambiguous and prints a list of more specific choices!

Summary

Although the Python Standard Library has several modules related to SGML and, more specifically, to HTML parsing, there are two premier screen-scraping technologies in use today: the fast and powerful lxml library that supports the standard Python “ElementTree” API for accessing trees of elements, and the quirky BeautifulSoup library that has powerful API conventions all its own for querying and traversing a document.

If you use BeautifulSoup before 3.2 comes out, be sure to download the most recent 3.0 version; the 3.1 series, which unfortunately will install by default, is broken and chokes easily on HTML glitches.

Screen scraping is, at bottom, a complete mess. Web pages vary in unpredictable ways even if you are browsing just one kind of object on the site—like cities at the National Weather Service, for example.

To prepare to screen scrape, download a copy of the page, and use HTML tidy, or else your screen-scraping library of choice, to create a copy of the file that your eyes can more easily read. Always run your program against the ugly original copy, however, lest HTML tidy fixes something in the markup that your program will need to repair!

Once you find the data you want in the web page, look around at the nearby elements for tags, classes, and text that are unique to that spot on the screen. Then, construct a Python command using your scraping library that looks for the pattern you have discovered and retrieves the element in question. By looking at its children, parents, or enclosed text, you should be able to pull out the data that you need from the web page intact.

When you have a basic script working, continue testing it; you will probably find many edge cases that have to be handled correctly before it becomes generally useful. Remember: when possible, always use true APIs, and treat screen scraping as a technique of last resort!

Source: http://rhodesmill.org/brandon/chapters/screen-scraping/

Friday, 22 May 2015

5 tips for scraping big websites

Scraping bigger websites can be a challenge if done the wrong way.

Bigger websites would have more data, more security and more pages. We’ve learned a lot from our years of crawling such large complex websites, and these tips could help solve some of your challenges

1. Cache pages visited for scraping

When scraping big websites, its always a good idea to cache the data you have already downloaded. So you don’t have to put load on the website again, in case you have start over again or that page is required again during scraping. Its effortless to cache to a key value store like redis.But Databases and filesystem caches are also good.

2. Don’t flood the website with large number of parallel requests , take it slow

Big websites posses algorithms to detect webscraping, large number of parallel requests from the same ip address will identify you as a Denial Of Service Attack on their website, and blacklist your IPs immediately. A better idea is to time your requests properly one after the other, giving it some human behavior. Oh!.. but scraping like that will take you ages. So balance requests using the average response time of the websites, and play around with the number of parallel requests to the website to get the right number.

3. Store the URLs that you have already fetched

You may want to keep a list of URLs you have already fetched, in a database or a key value store. What would you do if you scraper crashes after scraping 70% of the website. If you need to complete the remaining 30%, with out this list of URLs, you’ll waste a lot of time and bandwidth. Make sure you store this list of URLs

somewhere permanent, till you have all the required data. This could also be combined with the cache. This way, you’ll be able to resume scraping.

4. Split scraping into different phases

Its easier and safer if you split the scraping into multiple smaller phases. For example, you could split scraping a huge site into two. One for gathering links to the pages from which you need to extract data and another for downloading these pages to scrape content.

5. Take only whats required

Don’t grab or follow every link unless its required. You can define a proper navigation scheme to make the scraper visit only the pages required. Its always tempting to grab everything, but its just a waste of bandwidth, time and storage.

Source: http://learn.scrapehero.com/5-tips-for-scraping-big-websites/

Wednesday, 20 May 2015

Hard-Scraped Hardwood Flooring: Restoration of History

Throughout History hardwood flooring has undergone dramatic changes from the meticulous hard-scraped hardwood polished floors of majestic plantations of the Deep South, to modern day technology providing maintenance free wood flooring designed for comfort and appearance. The hand-scraped hardwood floors of the South, depicted charm with old rustic nature and character that was often associated with this time era. To date, hand-scraped hardwood flooring is being revitalized and used in up-scale homes and places of businesses to restore the old country charm that once faded into oblivion.

As the name implies, hand-scraped flooring involves the retexturing the top layer of flooring material by various methods in an attempts to mimic the rustic appearance of flooring in yesteryears. Depending on the degree of texture required, hand scraping hardwood material is often accomplished by highly skilled craftsmen with specialized tools and years of experience perfecting this procedure. When properly done, hand-scraped hardwood floors add texture, richness and uniqueness not offered in any similar hardwood flooring product.

Rooted with history, these types of floors are available in finished or unfinished surfaces. The majority of the individuals selecting hand-scraped hardwood flooring elect a prefinished floor to reduce costs per square foot in installation and finishing labor charges, allowing for budget guidelines to bend, not break. As expected, hand-scraped flooring is expensive and depending on the grade and finish selected, can range from $15-40$ per square foot and beyond for material only. Preparation of the material is labor intensive adding to the overall cost per square foot dramatically. Recommended professional installation can and often does increase the cost per square foot as well, placing this method of hardwood flooring well out of reach of the average hardwood floor purchaser.

With numerous selections of hand-scraped finishes available, each finish is designed to bring out a different appearance making it a one-of-a-kind work of art. These numerous finish selections include:

• Time worn aged, dark coloring stain application bringing out grain characteristics

• Wire brushed, providing a highlighted "grainy" effect with obvious rough texture

• Hand sculpted, smoother distressed uniform appearance

• French Bleed, staining of edges and side joints with a much darker stain to give a bleeding effect to the wood

• Hand Hewn or Rough Sawn, with visible and noticeable saw marks

Regardless of the selection made, scraped flooring cannot be compared to any other available flooring material based on durability, strength and visual appearance. Limited by only the imagination and creativity, several wood species can be used to create unusual floor patterns, highlighting main focal points of personal libraries and art collections.

The precise process utilized in the creation of scraped floors projects a custom look with deep color and subtle warm highlights. With radiant natural light reflecting off this type of floor, the effect of beauty and depth is radiated in a fashion that fills the room with solitude and serenity encompassing all that enter. Hand-scraped hardwood floors speak of the past, a time of decent, a time or war and ambiguity towards other races and the blood- shed so that all men could be treated as equals. More than exquisite flooring, hand-scraped hardwood flooring is the restoration of History.

Source: http://ezinearticles.com/?Hard-Scraped-Hardwood-Flooring:-Restoration-of-History&id=6333218

Sunday, 17 May 2015

Metadata Scraping Service

As mentioned in Robert's last blog post we set up a scraping service which supports users working with citations by extracting automatically references from digital library or publisher websites. We use a very similar service in BibSonomy to support our users while posting a new reference. However, the service is independent from BibSonomy. Our main goal is to make the metadata of other websites easily accessible to every user who needs bibliographic metadata. Therefore we offer the extracted information in BibTeX format. Most tools allow to import BibTeX so it should be very easy for everyone to get the data into his own tool. The service is running under the following URL:

http://scraper.bibsonomy.org/

Currently we support more than 60 different websites (here the full list) and we are working on further extensions. In the near future we will make the source code of our scrapers publicly available under GPL and we hope that other people will find it useful and start to help us by implementing their own scrapers.

How does the service work?

In principle there are two ways to use the service. One uses a so

called bookmarklet and the other is simply based on the URL. If you

have a webpage of a supported site e.g. from ACM digital library the

following page:

Logsonomy - social information retrieval with logdata

then you can copy this URL into the form on the service homepage and the service will return you the extracted BibTeX information. As this is not a very convenient way to access the data we provide a ScrapePublication button. This button is a small piece of JavaScript and can be copied to the toolbar of the browser. By pressing this button while visiting a digital library webpage the URL will be automatically copied and sent to the scraping service and the metadata is extracted.

The service has three options which can be used to customize it and to make it useful for other systems. Obviously one parameter is the URL itself which is used by the bookmarklet, too. The next is the selection parameter which allows to send text to the service and the last parameter allows to change the output format from html to plain BibTeX. This last parameter makes integration with other systems very simple.

If needed we can provide the metadata in other formats as well but currently we support only BibTeX.

Source: http://blog.bibsonomy.org/2008/11/metadata-scraping-service.html

Wednesday, 6 May 2015

Kimono Is A Smarter Web Scraper That Lets You “API-ify” The Web, No Code Required

A new Y Combinator-backed startup called Kimono wants to make it easier to access data from the unstructured web with a point-and-click tool that can extract information from webpages that don’t have an API available. And for non-developers, Kimono plans to eventually allow anyone track data without needing to understand APIs at all.

This sort of smarter “web scraper” idea has been tried before, and has always struggled to find more than a niche audience. Previous attempts with similar services like Dapper or Needlebase, for example, folded. Yahoo Pipes still chugs along, but it’s fair to say that the service has long since been a priority for its parent company.

But Kimono’s founders believe that the issue at hand is largely timing.

“Companies more and more are realizing there’s a lot of value in opening up some of their data sets via APIs to allow developers to build these ecosystems of interesting apps and visualizations that people will share and drive up awareness of the company,” says Kimono co-founder Pratap Ranade. (He also delves into this subject deeper in a Forbes piece here). But often, companies don’t know how to begin in terms of what data to open up, or how. Kimono could inform them.

Plus, adds Ranade, Kimono is materially different from earlier efforts like Dapper or Needlebase, because it’s outputting to APIs and is starting off by focusing on the developer user base, with an expansion to non-technical users planned for the future. (Meanwhile, older competitors were often the other way around).

The company itself is only a month old, and was built by former Columbia grad school companions Ranade and Ryan Rowe. Both left grad school to work elsewhere, with Rowe off to Frog Design and Ranade at McKinsey. But over the nearly half-dozen or so years they continued their careers paths separately, the two stayed in touch and worked on various small projects together.

One of those was Airpapa.com, a website that told you which movies were showing on your flights. This ended up giving them the idea for Kimono, as it turned out. To get the data they needed for the site, they had to scrape data from several publicly available websites.

“The whole process of cleaning that [data] up, extracting it on a schedule…it was kind of a painful process,” explains Rowe. “We spent most of our time doing that, and very little time building the website itself,” he says. At the same time, while Rowe was at Frog, he realized that the company had a lot of non-technical designers who needed access to data to make interesting design decisions, but who weren’t equipped to go out and get the data for themselves.

With Kimono, the end goal is to simplify data extraction so that anyone can manage it. After signing up, you install a bookmarklet in your browser, which, when clicked, puts the website into a special state that allows you to point to the items you want to track. For example, if you were trying to track movie times, you might click on the movie titles and showtimes. Then Kimono’s learning algorithm will build a data model involving the items you’ve selected.

That data can be tracked in real time and extracted in a variety of ways, including to Excel as a .CSV file, to RSS in the form of email alerts, or for developers as a RESTful API that returns JSON. Kimono also offers “Kimonoblocks,” which lets you drop the data as an embed on a webpage, and it offers a simple mobile app builder, which lets you turn the data into a mobile web application.

For developer users, the company is currently working on an API editor, which would allow you to combine multiple APIs into one.

So far, the team says, they’ve been “very pleasantly surprised” by the number of sign-ups, which have reached ten thousand*. And even though only a month old, they’ve seen active users in the thousands.

Initially, they’ve found traction with hardware hackers who have done fun things like making an airhorn blow every time someone funds their Kickstarter campaign, for instance, as well as with those who have used Kimono for visualization purposes, or monitoring the exchange rates of various cryptocurrencies like Bitcoin and dogecoin. Others still are monitoring data that’s later spit back out as a Twitter bot.

Kimono APIs are now making over 100,000 calls every week, and usage is growing by over 50 percent per week. The company also put out an unofficial “Sochi Olympics API” to showcase what the platform can do.

The current business model is freemium based, with pricing that kicks in for higher-frequency usage at scale.

The Mountain View-based company is a team of just the two founders for now, and has initial investment from YC, YC VC and SV Angel.

Source: http://techcrunch.com/2014/02/18/kimono-is-a-smarter-web-scraper-that-lets-you-api-ify-the-web-no-code-required/

Thursday, 30 April 2015

Web Data Scraping - Scrape Business Data in no time

The Internet has evolved as one of the largest repositories of information for your business. You can design intelligent business processes to access a whole host of relevant information sources that will help you strategize, implement and deliver effective business objectives. Leveraging the benefits and usefulness of Web Scraping Tools is one such methodology that most businesses have adopted. Let us take a look at some of the ways it helps you easily scrape data relevant for your business.

Scraping for Business Information

Web Data Scraping is a technique, employed by most organizations. It involves the implementation of tools that help businesses extract unstructured data and convert them into usable business information. The focus of most scraping initiatives revolves around the organization’s need to glean the following information:

•    Competitor analysis to structure and strategist effectively

•    Price comparisons to price their products competitively

•    Customer feedbacks to enhance their product portfolio and provide customers with better brand experience   Market dynamics to help them identify areas of opportunities and threats

Using Scraping Tools

The abundance of information available on the Internet that helps you build up a productive business strategy can be easily extracted and leveraged to benefit your business. Tools have been designed with intuitive interface and intelligent algorithms which help in furthering this end.

Website Data Scraping tools are equipped for compatibility with a wide variety of applications so as to be able to explore a huge range of information sources. These tools are fully automated and display the drag and drop facility ensuring users get to leverage the benefits of speed and convenience.

Data extraction tools are not only adept at extracting data, but are also equally well-equipped to combine relevant statistics from several social media platforms like YouTube, Twitter, and Google Analytics and so on. This helps businesses to analyse trends and plan strategies accordingly.

Challenges of the Data Scraping Process

Just as there is no dearth of data to be collected from the Web, there is also an abundance of web scraping tools to execute the data collection process. However, the capability of the tool to help you collect the appropriate data needs to be assured before you can proceed with its implementation. Some of the challenges faced by most businesses owing to their wrong choice of tools include the following:

•    Run-of-the-mill extraction tools are unable to scale up sufficiently in order to capture large volumes of data

•    Some tools are also unable to establish compatibility with most data sources and therefore do not provide a holistic data collection approach

•    Some tools are also not equipped to conduct an automatic detection of updates made to a data source and therefore end up providing inaccurate data.

In the light of all this it is essential that you identify the right tool for your need and select one that is embedded with an updated technology to help you achieve the following:

•    Ensure that you are able to access the appropriate data that you want

•    Help you structure it in the format you want

•    Provide quick and easy access to all available data sources no matter how complex

•    Run accurately and is a reliable source to help you churn out usable information.

Source: http://scraping-solutions.blogspot.in/2014_07_01_archive.html

Tuesday, 28 April 2015

Web Scraping – Effective Way of Improving Market Presence

Web scraping is a technique that is fast making its presence felt in the world of internet by its sheer weight of being effective. It is a technique that uses software to crawl through the internet and gather up all the relevant and important information that one would need for their products.

The information gathered by the web scraping can be used for various things such as data integration, web mashup, online comparison of price and much more. Web scraping uses sophisticated software that crawls through the internet and gathers up all related information for the entity that you are looking for. The information that is gathered up is an automated, systematic, and very structured way. This allows for easy understanding of the gathered information. Though this is one of the best ways for data extraction there are quite a few things that one must be aware of before getting into web scraping.

Being aware of the following things keep you at a better position not only leverage the best deal, but also to negotiate properly.

• For data mining the first thing that one should be very sure of is the kind of data they want. One has to define properly what kind of data they want and also what would be the purpose of the same. For an instance if you wish to get a closer look at your competitors, it would be a wise to let the data scraping service providers know who your competitors are. This would allow them to gather better information. Similarly if you are looking for getting new customers getting contact data from existing players in the respective industry would be helpful.

• One should also be aware of the structure in which they want the data. A simple data structure has the entity name in the row and the property of the entity is kept in the cells of the rows. However, one can also opt for data structure in chart. Apart from the above, there is just one more thing that one needs to keep in mind while using the data mining services; it is the number of data extraction. At times a onetime data extraction would be sufficient whereas at other times periodic extractions or general reports are required.

If you are aware of all the above points, then you are very much inline of going ahead and taking the help of scrape website data. Knowing the above points would allow you to know what exactly to ask from your vendor and likewise quote. One can make the most of the data extraction services with the help of either the web scraping or web crawling services.

Source: https://3idatascraping.wordpress.com/2014/01/07/web-scraping-effective-way-of-improving-market-presence/

Saturday, 25 April 2015

Data Mining and Market Research

Online market research attributes to success and growth of many businesses. Online market research in simple terms we can say it is the learning of current and the latest market situations which involve surveys, web and data mining modules. To date research by use of the internet it is very important since it depends on data gathered from internet services and then one can recognize that market research keeps the business successful.

A number of managers in small businesses have a mental deem in that online market research is obligatory to big or larger companies. For true you will understand that whether the businesses is medium, large or small actually need online marketing research and this is a reason why the significance of the process and allegation will approve the targeted and potential clients. In this case Data mining progression is employed to streamline on what targeted and potential clientele needs. Areas where data mining is used:

Preferences. In any given service or product, you will learn what a customer looking for and how your product or service is different from other competitors. By use of Data mining you will be able to determine the customer preferences and you will be able to modify your services and products meet the customer choice.

Buying patterns. What is known and created for purchasing patterns from different customers. A situation can be that customers try to spend a lot on certain products and little on others. But through Data mining it is easy to understand such purchasing patterns and finally plan the appropriate techniques to be used in the marketing.

Prices. To find out whether the company is selling its products to the clients or not prices are the key factors to take into account. One should understand on the right selling price of the products. Web scrapping is easier to find the suitable pricing.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/data-mining-market-research/

Wednesday, 22 April 2015

Hand Scraped Versus Machine Scraped Floors - The Distinction

In society today hardwood flooring has become the new must have. The days of carpet are gone, and if you have looked into bringing your home up to date with the styling of today you will have noticed by now that there are many different options. At times this may become very overwhelming, especially if you are not a hardwood specialist like most people are not. That is why this article is here to help you understand the many different options available to you.

The flooring type covered in this article is hand scraped flooring. This flooring type is a custom look flooring that is in very high demand in flooring marketplace, which is understandable because it is probably the most unique flooring there is. You can choose from many different types of wood species such as oak, maple, hickory, and most exotic species. There is computerized hand scraped that is when the manufacturer makes one piece of wood and places it into a computer that will cut thousands of different wood types with that one design. This type of process is also known as machine scraping. Hardwood floors employing this type of technology usually cost less, but most of the pieces look the same because the hand scraping is done by a machine.

Then you have actual hand scraped flooring that is done all by hand and takes more time and effort than machine scraped. This flooring is made custom each individual piece is scraped and notched in different ways, so every piece is unique. If you decide to purchase actual hand scraped flooring it will cost you more than mass produced computerized version but it will definitely be the more unique option. If you are the type of person who wants to have a one of kind floor then an actual hand scraped floor is the way to go.

So in conclusion hand scraped flooring is a great option for a lot of people. It comes in several different wood types, and several different colors. You can find flooring options for every budget and to meet every style. If having a custom floor in your home it may be important or not important on whether it be computer or done by hand. Most consumers cannot tell the difference between actual hand scraped flooring and machine scraped when just looking at a small sample. So when shopping at your local retailer ask the tough questions and find out if the manufacturer uses machine or authentic hand scrapping on their products.

To view your many options on hand scraped flooring please check out our website that covers all hardwood flooring options.

Source: http://ezinearticles.com/?Hand-Scraped-Versus-Machine-Scraped-Floors---The-Distinction&id=4151157

Tuesday, 7 April 2015

The Coal Mining Industry And Investing In It

The History Of Coal Usage

Coal was initially used as a domestic fuel, until the industrial revolution, when coal became an integral part of manufacturing for creating electricity, transportation, heating and molding purposes. The large scale mining aspect of coal was introduced around the 18th century, and Britain was the first nation to successfully use advanced coal mining techniques, which involved underground excavation and mining.

Initially coal was scraped off the surface by different processes like drift and shaft mining. This has been done for centuries, and since the demand was quite low, these mining processes were more than enough to accommodate the demand in the market.

However, when the practical uses of using coal as fuel sparked industrial revolution, the demand for coal rose abruptly, leading to severe shortage of the coal output, gradually paving the way for new ways to extract coal from under the ground.

Coal became a popular fuel for all purposes, even to this day, due to their abundance and their ability to produce more energy per mass than other conventional solid fuels like wood. This was important as far as transportation, creating electricity and manufacturing processes are concerned, which allowed industries to use up less space and increase productivity. The usage of coal started to dwindle once alternate energies such as oil and gas began to be used in almost all processes, however, coal is still a primary fuel source for manufacturing processes to this day.

The Process Of Coal Mining

Extracting coal is a difficult and complex process. Coal is a natural resource, a fossil fuel that is a result of millions of years of decay of plants and living organisms under the ground. Some can be found on the surface, while other coal deposits are found deep underground.

Coal mining or extraction comes broadly in two different processes, surface mining, and deep excavation. The method of excavation depends on a number of different factors, such as the depth of the coal deposit below the ground, geological factors such as soil composition, topography, climate, available local resources, etc.

Surface mining is used to scrape off coal that is available on the surface, or just a few feet underground. This can even include mountains of coal deposit, which is extracted by using explosives and blowing up the mountains, later collecting the fragmented coal and process them.

Deep underground mining makes use of underground tunnels, which is built, or dug through, to reach the center of the coal deposit, from where the coal is dug out and brought to the surface by coal workers. This is perhaps the most dangerous excavation procedure, where the lives of all the miners are constantly at a risk.

Investing In Coal

Investing in coal is a safe bet. There are still large reserves of coal deposits around the world, and due to the popularity, coal will be continued to be used as fuel for manufacturing process. Every piece of investment you make in any sort of industry or a manufacturing process ultimately depends on the amount of output the industry can deliver, which is dependent on the usage of any form of fuel, and in most cases, coal.

One might argue that coal usage leads to pollution and lower standards of hygiene for coal workers. This was arguably true in former years; however, newer coal mining companies are taking steps to assure that the environmental aspects of coal mining and usage are kept minimized, all the while providing better working environment and benefits package for their workers. If you can find a mining company that promises all these, and the one that also works within the law, you can be assured safety for your investments in coal.

Source: http://ezinearticles.com/?The-Coal-Mining-Industry-And-Investing-In-It&id=5871879

Scrape A Website