Wisconsin Web Scraping

Wisconsin Data Scraping, Web Scraping Tennessee, Data Extraction Tennessee, Scraping Web Data, Website Data Scraping, Email Scraping Tennessee, Email Database, Data Scraping Services, Scraping Contact Information, Data Scrubbing

Friday, 28 November 2014

Scraping R-bloggers with Python – Part 2

In my previous post I showed how to write a small simple python script to download the pages of R-bloggers.com. If you followed that post and ran the script, you should have a folder on your hard drive with 2409 .html files labeled post1.html , post2.html and so forth. The next step is to write a small script that extract the information we want from each page, and store that information in a .csv file that is easily read by R. In this post I will show how to extract the post title, author name and date of a given post and store it in a .csv file with a unique id.

To do this open a document in your favorite python editor (I like to use aquamacs) and name it: extraction.py. As in the previous post we start by importing the modules that we will use for the extraction:

from BeautifulSoup import BeautifulSoup

import os
import re

As in the previous post we will be using the BeautifulSoup module to extract the relevant information from the pages. The os module is used to get a list of file from the directory where we have saved the .html files, and finally the re module allows us to use regular expressions to format the titles that include a comma value or a newline value (\n). We need to remove these as they would mess up the formatting of the .csv file.

After having read in the modules, we need to get a list of files that we can iterate over. First we need to specify the path were the files are saved, and then we use the os module to get all the filenames in the specified directory:

path = "/Users/thomasjensen/Documents/RBloggersScrape/download"

listing = os.listdir(path)

It might be that there are other files in the given directory, hence we apply a filter, in shape of a list comprehension, to weed out any file names that do not match our naming scheme:

listing = [name for name in listing if re.search(r"post\d+\.html",name) != None]

Notice that a regular expression was used to determine whether a given name in the list matched our naming scheme. For more on regular expressions have a look at this site.

The final steps in preparing our extraction is to change the working directory to where we have our .html files, and create an empty dictionary:

data = {}

Dictionaries are one of the great features of Python. Essentially a dictionary is a mapping of a key to a specific value, however the fact that dictionaries can be nested within each other, allows us to create data structures similar to R’s data frames.

Now we are ready to begin extracting information from our downloaded pages. Much as in the previous post, we will loop over all the file names, read each file into Python and create a BeautifulSoup object from the file:

for page in listing:
    site = open(page,"rb")
    soup = BeautifulSoup(site)

In order to store the values we extract from a given page, we update the dictionary with a unique key for the page. Since our naming scheme made sure that each file had a unique name, we simply remove the .html part from the page name, and use that as our key:

key = re.sub(".html","",page)


This will create a mapping between our key and an empty dictionary, nested within the data dictionary. Once this is done we can start extract information and store it in our newly created nested dictionary. The content we want is located in the main column, which has the id tag “leftcontent” in the html code. To get at this we use find() function on soup object created above:

content = soup.find("div", id = "leftcontent")

The first “h1” tag in our content object contains the title, so again we will use the find() function on the content object, to find the first “h1” tag:

title = content.findNext("h1").text

To get the text within the “h1” tag the .text had been added to our search with in the content object.

To find the author name, we are lucky that there is a class of “div” tags called “meta” which contain a link with the author name in it. To get the author name we simply find the meta div class and search for a link. Then we pull out the text of the link tag:

author = content.find("div",{"class":"meta"}).findNext("a").text

Getting the date is a simple matter as it is nested within div tag with the class “date”:

date = content.find("div",{"class":"date"}).text

Once we have the three variables we put them in dictionaries that are nested within the nested dictionary we created with the key:

data[key]["title"] = title
data[key]["author"] = author
data[key]["date"] = date

Once we have run the loop and gone through all posts, we need to write them in the right format to a .csv file. To begin with we open a .csv file names output:

output = open("/Users/thomasjensen/Documents/RBloggersScrape/output.csv","wb")

then we create a header that contain the variable names and write it to the output.csv file as the first row:

variables = unicode(",".join(["id","date","author","title"]))
header = variables + "\n"

Next we pull out all the unique keys from our dictionary that represent individual posts:

keys = data.keys()

Now it is a simple matter of looping through all the keys, pull out the information associated with each key, and write that information to the output.csv file:

for key in keys:
    print key
    id = key
    date = re.sub(",","",data[key]["date"])
    author = data[key]["author"]
    title = re.sub(",","",data[key]["title"])
    title = re.sub("\\n","",title)
    linelist = [id,date,author,title]
    linestring = unicode(",".join(linelist))
    linestring = linestring + "\n"

Notice that we first create four variables that contain the id, date, author and title information. With regards to the title we use two regular expressions to remove any commas and “\n” from the title, as these would create new columns or new line breaks in the output.csv file. Finally we put the variables together in a list, and turn the list into a string with the list items separated by a comma. Then a linebreak is added to the end of the string, and the string is written to the output.csv file. As a last step we close the file connection:


And that is it. If you followed the steps you should now have a csv file in your directory with 2409 rows, and four variables – ready to be read into R. Stay tuned for the next post which will show how we can use this data to see how R-bloggers has developed since 2005. The full extraction script is shown below:

from BeautifulSoup import BeautifulSoup

import os
import re

 path = "/Users/thomasjensen/Documents/RBloggersScrape/download"
 listing = os.listdir(path)

listing = [name for name in listing if re.search(r"post\d+\.html",name) != None]
 data = {}
 for page in listing:
site = open(page,"rb")
soup = BeautifulSoup(site)
key = re.sub(".html","",page)
print key
 content = soup.find("div", id = "leftcontent")
title = content.findNext("h1").text
author = content.find("div",{"class":"meta"}).findNext("a").text
date = content.find("div",{"class":"date"}).text
data[key]["title"] = title
data[key]["author"] = author
data[key]["date"] = date

 output = open("/Users/thomasjensen/Documents/RBloggersScrape/output.csv","wb")

 keys = data.keys()
 variables = unicode(",".join(["id","date","author","title"]))
 header = variables + "\n"
 for key in keys:
print key
id = key
date = re.sub(",","",data[key]["date"])
author = data[key]["author"]
title = re.sub(",","",data[key]["title"])
title = re.sub("\\n","",title)
linelist = [id,date,author,title]
linestring = unicode(",".join(linelist))
linestring = linestring + "\n"


Thursday, 27 November 2014

Data Mining and Frequent Datasets

I've been doing some work for my exams in a few days and I'm going through some past papers but unfortunately there are no corresponding answers. I've answered the question and I was wondering if someone could tell me if I am correct.

My question is

    (c) A transactional dataset, T, is given below:
    t1: Milk, Chicken, Beer
    t2: Chicken, Cheese
    t3: Cheese, Boots
    t4: Cheese, Chicken, Beer,
    t5: Chicken, Beer, Clothes, Cheese, Milk
    t6: Clothes, Beer, Milk
    t7: Beer, Milk, Clothes

    Assume that minimum support is 0.5 (minsup = 0.5).

    (i) Find all frequent itemsets.

Here is how I worked it out:

    Item : Amount
    Milk : 4
    Chicken : 4
    Beer : 5
    Cheese : 4
    Boots : 1
    Clothes : 3

Now because the minsup is 0.5 you eliminate boots and clothes and make a combo of the remaining giving:

    {items} : Amount
    {Milk, Chicken} : 2
    {Milk, Beer} : 4
    {Milk, Cheese} : 1
    {Chicken, Beer} : 3
    {Chicken, Cheese} : 3
    {Beer, Cheese} : 2

Which leaves milk and beer as the only frequent item set then as it is the only one above the minsup?

data mining


3 Answers

There are two ways to solve the problem:

    using Apriori algorithm
    Using FP counting

Assuming that you are using Apriori, the answer you got is correct.

The algorithm is simple:

First you count frequent 1-item sets and exclude the item-sets below minimum support.

Then count frequent 2-item sets by combining frequent items from previous iteration and exclude the item-sets below support threshold.

The algorithm can go on until no item-sets are greater than threshold.

In the problem given to you, you only get 1 set of 2 items greater than threshold so you can't move further.

There is a solved example of further steps on Wikipedia here.

You can refer "Data Mining Concepts and Techniques" by Han and Kamber for more examples.


There is more than two algorithms to solve this problem. I will just mention a few of them: Apriori, FPGrowth, Eclat, HMine, DCI, Relim, AIM, etc. –  Phil Mar 5 '13 at 7:18

OK to start, you must first understand, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cuts costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Now, the amount of raw data stored in corporate databases is exploding. From trillions of point-of-sale transactions and credit card purchases to pixel-by-pixel images of galaxies, databases are now measured in gigabytes and terabytes. (One terabyte = one trillion bytes. A terabyte is equivalent to about 2 million books!) For instance, every day, Wal-Mart uploads 20 million point-of-sale transactions to an A&T massively parallel system with 483 processors running a centralized database.

Raw data by itself, however, does not provide much information. In today's fiercely competitive business environment, companies need to rapidly turn these terabytes of raw data into significant insights into their customers and markets to guide their marketing, investment, and management strategies.

Now you must understand that association rule mining is an important model in data mining. Its mining algorithms discover all item associations (or rules) in the data that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints. Minsup controls the minimum number of data cases that a rule must cover. Minconf controls the predictive strength of the rule.

Since only one minsup is used for the whole database, the model implicitly assumes that all items in the data are of the same nature and/or have similar frequencies in the data. This is, however, seldom the case in real- life applications. In many applications, some items appear very frequently in the data, while others rarely appear. If minsup is set too high, those rules that involve rare items will not be found. To find rules that involve both frequent and rare items, minsup has to be set very low.

This may cause combinatorial explosion because those frequent items will be associated with one another in all possible ways. This dilemma is called the rare item problem. This paper proposes a novel technique to solve this problem. The technique allows the user to specify multiple minimum supports to reflect the natures of the items and their varied frequencies in the database. In rule mining, different rules may need to satisfy different minimum supports depending on what items are in the rules.

Given a set of transactions T (the database), the problem of mining association rules is to discover all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf).

I hope that once you understand the very basics of data mining that the answer to this question shall become apparent.


The Apriori algorithm is based on the idea that for a pair o items to be frequent, each individual item should also be frequent. If the hamburguer-ketchup pair is frequent, the hamburger itself must also appear frequently in the baskets. The same can be said about the ketchup.

So for the algorithm, it is established a "threshold X" to define what is or it is not frequent. If an item appears more than X times, it is considered frequent.

The first step of the algorithm is to pass for each item in each basket, and calculate their frequency (count how many time it appears). This can be done with a hash of size N, where the position y of the hash, refers to the frequency of Y.

If item y has a frequency greater than X, it is said to be frequent.

In the second step of the algorithm, we iterate through the items again, computing the frequency of pairs in the baskets. The catch is that we compute only for items that are individually frequent. So if item y and item z are frequent on itselves, we then compute the frequency of the pair. This condition greatly reduces the pairs to compute, and the amount of memory taken.

Once this is calculated, the frequencies greater than the threshold are said frequent itemset.

Source: http://stackoverflow.com/questions/14164853/data-mining-and-frequent-datasets?rq=1

Tuesday, 25 November 2014

4 Data Mining Tips to Scrap Real Estate Data; Innovative Way to Give Realty Business a boost!

Internet has become a huge source of data – in fact; it has turned into a goldmine for the marketers, from where they can easily dig the useful data!

    Web scraping has become a norm in today’s competitive era, where one with maximum and relevant information wins the race!

Real Estate Data Extraction and Scraping Service

It has helped many industries to carve a niche in the market; especially real estate – Scraping real estate data has been of great help for professionals to reach out to a large number of people and gather reliable property data. However, there are some people for whom web scraping is still an alien concept; most probably because most of its advantages are not discussed.

    There are institutions, companies and organizations, entrepreneurs, as well as just normal citizens generating an extraordinary amount of information every day. Property information extraction can be effectively used to get an idea about the customer psyche and even generate valuable lead to further the business.

In addition to this, data mining has also some of following uses making it an indispensable part of marketing.

Gather Properties Details from Different Geographical Locations

You are an estate agent and want to expand your business to the neighboring city or state. But, then you are short of information. You are completely aware of the properties in the vicinity and in your town; however, with data mining services will help you to get an idea about the properties in the other state. You can also approach probable clients and increase your database to offer extensive services.

Online Offers and Discounts are just a Click Away

Now, it is tough to deal with the clients, show them the property of their choice and again act as a mediator between the buyer and seller. In all this, it becomes almost difficult to take a look at some special discounts or offers. With the data mining services, you can get an insight into these amazing offers. Thus, you can plan a move or even provide your client an amazing deal.

What people are talking about – Easy Monitoring of your Online Reputation

Internet has become a melting pot where different people come together. In fact, it provides a huge platform where people discuss about their likes and dislikes. When you dig into such online forums, you can get an idea of reputation that you or your firm holds. You can know what people think about you and where you require to buck up and where you need to slow down.

A Chance to Know your Competitors Better!

Last, but not the least, you can keep an eye on the competitor.  Real Estate is getting more competitive; and therefore, it is important to have knowledge about your competitors to get an upper hand. It will help you to plan your moves and strategize with more ease. Moreover, you also know what is that “something” that your competitor does not have and you have, with can be subtly highlighted.

Property information extraction can prove to be the most fruitful method to get a cutting edge in the industry.

Source: http://www.hitechbposervices.com/blog/4-data-mining-tips-to-scrap-real-estate-data-innovative-way-to-give-realty-business-a-boost/

Friday, 21 November 2014

Web Scraping for Fun & Profit

There’s a number of ways to retrieve data from a backend system within mobile projects. In an ideal world, everything would have a RESTful JSON API – but often, this isn’t the case.Sometimes, SOAP is the language of the backend. Sometimes, it’s some proprietary protocol which might not even be HTTP-based. Then, there’s scraping.

Retrieving information from web sites as a human is easy. The page communicates information using stylistic elements like headings, tables and lists – this is the communication protocol of the web. Machines retrieve information with a focus on structure rather than style, typically using communication protocols like XML or JSON. Web scraping attempts to bridge this human protocol into a machine-readable format like JSON. This is what we try to achieve with web scraping.

As a means of getting to data, it don’t get much worse than web scraping. Scrapers were often built with Regular Expressions to retrieve the data from the page. Difficult to craft, impossible to maintain, this means of retrieval was far from ideal. The risks are many – even the slightest layout change on a web page can upset scraper code, and break the entire integration. It’s a fragile means for building integrations, but sometimes it’s the only way.

Having built a scraper service recently, the most interesting observation for me is how far we’ve come from these “dark days”. Node.js, and the massive ecosystem of community built modules has done much to change how these scraper services are built.

Effectively Scraping Information

Websites are built on the Document Object Model, or DOM. This is a tree structure, which represents the information on a page.By interpreting the source of a website as a DOM, we can retrieve information much more reliably than using methods like regular expression matching. The most popular method of querying the DOM is using jQuery, which enables us to build powerful and maintainable queries for information. The JSDom Node module allows us to use a DOM-like structure in serverside code.

For purpose of Illustration, we’re going to scrape the blog page of FeedHenry’s website. I’ve built a small code snippet that retrieves the contents of the blog, and translates it into a JSON API. To find the queries I need to run, first I need to look at the HTML of the page. To do this, in Chrome, I right-click the element I’m looking to inspect on the page, and click “Inspect Element”.

Screen Shot 2014-09-30 at 10.44.38

Articles on the FeedHenry blog are a series of ‘div’ elements with the ‘.itemContainer’ class

Searching for a pattern in the HTML to query all blog post elements, we construct the `div.itemContainer` query. In jQuery, we can iterate over these using the .each method:

var posts = [];

$('div.itemContainer').each(function(index, item){

  // Make JSON objects of every post in here, pushing to the posts[] array


From there, we pick off the heading, author and post summary using a child selector on the original post, querying the relevant semantic elements:

    Post Title, using jQuery:

    $(item).find('h3').text()trim() // trim, because titles have white space either side

    Post Author, using jQuery:

    $(item).find('.catItemAuthor a').text()

    Post Body, using jQuery:


Adding some JSDom magic to our snippet, and pulling together the above two concept (iterating through posts, and picking off info from each post), we get this snippet:

var request = require('request'),

jsdom = require('jsdom');




  function (errors, window) {

    var $ = window.$, // Alias jQUery

    posts = [];

    $('div.itemContainer').each(function(index, item){

      item = $(item); // make queryable in JQ


        heading : item.find('h3').text().trim(),

        author : item.find('.catItemAuthor a').text(),

        teaser : item.find('p').text()






A note on building CSS Queries

As with styling web sites with CSS, building effective CSS queries is equally as important when building a scraper. It’s important to build queries that are not too specific, or likely to break when the structure of the page changes. Equally important is to pick a query that is not too general, and likely to select extra data from the page you don’t want to retrieve.

A neat trick for generating the relevant selector statement is to use Chrome’s “CSS Path” feature in the inspector. After finding the element in the inspector panel, right click, and select “Copy CSS Path”. This method is good for individual items, but for picking repeating patterns (like blog posts), this doesn’t work though. Often, the path it gives is much too specific, making for a fragile binding. Any changes to the page’s structure will break the query.

Making a Re-usable Scraping Service

Now that we’ve retrieved information from a web page, and made some JSON, let’s build a reusable API from this. We’re going to make a FeedHenry Blog Scraper service in FeedHenry3. For those of you not familiar with service creation, see this video walkthrough.

We’re going to start by creating a “new mBaaS Service”, rather than selecting one of the off-the-shelf services. To do this, we modify the application.js file of our service to include one route, /blog, which includes our code snippet from earlier:

// just boilerplate scraper setup

var mbaasApi = require('fh-mbaas-api'),

express = require('express'),

mbaasExpress = mbaasApi.mbaasExpress(),

cors = require('cors'),

request = require('request'),

jsdom = require('jsdom');

var app = express();


app.use('/sys', mbaasExpress.sys([]));

app.use('/mbaas', mbaasExpress.mbaas);


// Our /blog scraper route

app.get('/blog', function(req, res, next){




    function (errors, window) {

      var $ = window.$, // Alias jQUery

      posts = [];

      $('div.itemContainer').each(function(index, item){

        item = $(item); // make queryable in JQ


          heading : item.find('h3').text().trim(),

          author : item.find('.catItemAuthor a').text(),

          teaser : item.find('p').text()



      return res.json(posts);





var port = process.env.FH_PORT || process.env.VCAP_APP_PORT || 8001;

var server = app.listen(port, function() {});

We’re also going to write some documentation for our service, so we (and other developers) can interact with it using the FeedHenry discovery console. We’re going to modify the README.md file to document what we’ve just done using API Blueprint documentation format:

# FeedHenry Blog Web Scraper

This is a feedhenry blog scraper service. It uses the `JSDom` and `request` modules to retrieve the contents of the FeedHenry developer blog, and parse the content using jQuery.

# Group Scraper API Group

# blog [/blog]

Blog Endpoint

## blog [GET]

Get blog posts endpoint, returns JSON data.

+ Response 200 (application/json)

    + Body

            [{ blog post}, { blog post}, { blog post}]

We can now try out the scraper service in the studio, and see the response:

Scraping – The Ultimate in API Creation?

Now that I’ve described some modern techniques for effectively scraping data from web sites, it’s time for some major caveats. First,  WordPress blogs like ours already have feeds and APIs available to developers - there’s no need to ever scrape any of this content. Web Scraping is not a replacement for an API. It should be used only as a last resort, after every endeavour to discover an API has already been made. Using a web scraper in a commercial setting requires much time set aside to maintain the queries, and an agreement with the source data is being scraped on to alert developers in the event the page changes structure.

With all this in mind, it can be a useful tool to iterate quickly on an integration when waiting for an API, or as a fun hack project.

Source: http://www.feedhenry.com/web-scraping-fun-profit/

Tuesday, 18 November 2014

Get started with screenscraping using Google Chrome’s Scraper extension

How do you get information from a website to a Excel spreadsheet? The answer is screenscraping. There are a number of softwares and plattforms (such as OutWit Hub, Google Docs and Scraper Wiki) that helps you do this, but none of them are – in my opinion – as easy to use as the Google Chrome extension Scraper, which has become one of my absolutely favourite data tools.

What is a screenscraper?

I like to think of a screenscraper as a small robot that reads websites and extracts pieces of information. When you are able to unleash a scraper on hundreads, thousands or even more pages it can be an incredibly powerful tool.

In its most simple form, the one that we will look at in this blog post, it gathers information from one webpage only.

Google Chrome’s Scraper

Scraper is an Google Chrome extension that can be installed for free at Chrome Web Store.


Now if you installed the extension correctly you should be able to see the option “Scrape similar” if you right-click any element on a webpage.

The Task: Scraping the contact details of all Swedish MPs


This is the site we’ll be working with, a list of all Swedish MPs, including their contact details. Start by right-clicking the name of any person and chose Scrape similar. This should open the following window.

Understanding XPaths

At w3schools you’ll find a broader introduction to XPaths.

Before we move on to the actual scrape, let me briefly introduce XPaths. XPath is a language for finding information in an XML structure, for example an HTML file. It is a way to select tags (or rather “nodes”) of interest. In this case we use XPaths to define what parts of the webpage that we want to collect.

A typical XPath might look something like this:


Which in plain English translates to:

    // - Search the whole document...

    div[@id="content"] - ...for the div tag with the id "content".

    table[1] -  Select the first table.

    tr - And in that table, grab all rows.

Over to Scraper then. I’m given the following suggested XPath:


The results look pretty good, but it seems we only get names starting with an A. And we would also like to collect to phone numbers and party names. So let’s go back to the webpage and look at the HTML structure.

Right-click one of the MPs and chose Inspect element. We can see that each alphabetical list is contained in a section tag with the class “grid_6 alpha omega searchresult container clist”.

 And if we open the section tag we find the list of MPs in div tags.

We will do this scrape in two steps. Step one is to select the tags containing all information about the MPs with one XPath. Step two is to pick the specific pieces of data that we are interested in (name, e-mail, phone number, party) and place them in columns.

Writing our XPaths

In step one we want to try to get as deep into the HTML structure as possible without losing any of the elements we are interested in. Hover the tags in the Elements window to see what tags correspond to what elements on the page.

In our case this is the last tag that contains all the data we are looking for:

    //section[@class="grid_6 alpha omega searchresult container clist"]/div/div/div/dl

Click Scrape to test run the XPath. It should give you a list that looks something like this.

Scroll down the list to make sure it has 349 rows. That is the number of MPs in the Swedish parliament. The second step is to split this data into columns. Go back to the webpage and inspect the HTML code.

I have highlighted the parts that we want to extract. Grab them with the following XPaths:

    name: dt/a
    party: dd[1]
    region: dd[2]/span[1]
    seat: dd[2]/span[2]
    phone: dd[3]
    e-mail: dd[4]/span/a

Insert these paths in the Columns field and click Scrape to run the scraper.

Click Export to Google Docs to get the data into a spreadsheet.

Source: http://dataist.wordpress.com/2012/10/12/get-started-with-screenscraping-using-google-chromes-scraper-extension/

Monday, 17 November 2014

Screen-scraping with WWW::Mechanize

Screen-scraping is the process of emulating an interaction with a Web site - not just downloading pages, but filling out forms, navigating around the site, and dealing with the HTML received as a result. As well as for traditional lookups of information - like the example we'll be exploring in this article - we can use screen-scraping to enhance a Web service into doing something the designers hadn't given us the power to do in the first place. Here's an example:

I do my banking online, but get quickly bored with having to go to my bank's site, log in, navigate around to my accounts and check the balance on each of them. One quick Perl module (Finance::Bank::HSBC) later, and now I can loop through each of my accounts and print their balances, all from a shell prompt. Some more code, and I can do something the bank's site doesn't ordinarily let me - I can treat my accounts as a whole instead of individual accounts, and find out how much money I have, could possibly spend, and owe, all in total.

Another step forward would be to schedule a crontab every day to use the HSBC option to download a copy of my transactions in Quicken's QIF format, and use Simon Cozens' Finance::QIF module to interpret the file and run those transactions against a budget, letting me know whether I'm spending too much lately. This takes a simple Web-based system from being merely useful to being automated and bespoke; if you can think of how to write the code, then you can do it. (It's probably wise for me to add the caveat, though, that you should be extremely careful working with banking information programatically, and even more careful if you're storing your login details in a Perl script somewhere.)

Back to screen-scrapers, and introducing WWW::Mechanize, written by Andy Lester and based on Skud's WWW::Automate. Mechanize allows you to go to a URL and explore the site, following links by name, taking cookies, filling in forms and clicking "submit" buttons. We're also going to use HTML::TokeParser to process the HTML we're given back, which is a process I've written about previously.

The site I've chosen to demonstrate on is the BBC's Radio Times site, which allows users to create a "Diary" for their favorite TV programs, and will tell you whenever any of the programs is showing on any channel. Being a London Perl M[ou]nger, I have an obsession with Buffy the Vampire Slayer. If I tell this to the BBC's site, then it'll tell me when the next episode is, and what the episode name is - so I can check whether it's one I've seen before. I'd have to remember to log into their site every few days to check whether there was a new episode coming along, though. Perl to the rescue! Our script will check to see when the next episode is and let us know, along with the name of the episode being shown.

Here's the code:

  #!/usr/bin/perl -w
  use strict;
  use WWW::Mechanize;
  use HTML::TokeParser;

If you're going to run the script yourself, then you should register with the Radio Times site and create a diary, before giving the e-mail address you used to do so below.

  my $email = ";
  die "Must provide an e-mail address" unless $email ne ";

We create a WWW::Mechanize object, and tell it the address of the site we'll be working from. The Radio Times' front page has an image link with an ALT text of "My Diary", so we can use that to get to the right section of the site:

  my $agent = WWW::Mechanize->new();
  $agent->follow("My Diary");

The returned page contains two forms - one to allow you to choose from a list box of program types, and then a login form for the diary function. We tell WWW::Mechanize to use the second form for input. (Something to remember here is that WWW::Mechanize's list of forms, unlike an array in Perl, is indexed starting at 1 rather than 0. Our index is, therefore,'2.')


Now we can fill in our e-mail address for the '<INPUT name="email" type="text">' field, and click the submit button. Nothing too complicated.

  $agent->field("email", $email);

WWW::Mechanize moves us to our diary page. This is the page we need to process to find the date details from. Upon looking at the HTML source for this page, we can see that the HTML we need to work through is something like:

  <tr><td></td><td></td><td class="bluetext">Date of episode</td></tr>
  <td class="bluetext"><b>Time of episode</b></td></tr>
  <a href="page_with_episode_info"></a>

This can be modeled with HTML::TokeParser as below. The important methods to note are get_tag - which will move the stream on to the next opening for the tag given - and get_trimmed_text, which returns the text between the current and given tags. For example, for the HTML code "<b>Bold text here</b>", my $tag = get_trimmed_text("/b") would return "Bold text here" to $tag.

Also note that we're initializing HTML::TokeParser on '\$agent->{content}' - this is an internal variable for WWW::Mechanize, exposing the HTML content of the current page.

  my $stream = HTML::TokeParser->new(\$agent->{content});
  my $date;
    # <input>
  # <tr><td></td></tr><tr>
  $stream->get_tag("tr"); $stream->get_tag("tr");
  # <td></td><td></td>
  $stream->get_tag("td"); $stream->get_tag("td");
  # <td class="bluetext">Date of episode</td></tr>
  my $tag = $stream->get_tag("td");
  if ($tag->[1]{class} and $tag->[1]{class} eq "bluetext") {
      $date = $stream->get_trimmed_text("/td");
      # The date contains '&nbsp;', which we'll translate to a space.
      $date =~ s/\xa0/ /g;
   # <td></td><td></td>
  # <td class="bluetext"><b>Time of episode</b> 
  $tag = $stream->get_tag("td");
  if ($tag->[1]{class} eq "bluetext") {
      # This concatenates the time of the showing to the date.
      $date .= ", from " . $stream->get_trimmed_text("/b");
  # </td></tr><a href="page_with_episode_info"></a>
  $tag = $stream->get_tag("a");
  # Match the URL to find the page giving episode information.
  $tag->[1]{href} =~ m!src=(http://.*?)'!;

We have a scalar, $date, containing a string that looks something like "Thursday 23 January, from 6:45pm to 7:30pm.", and we have an URL, in $1, that will tell us more about that episode. We tell WWW::Mechanize to go to the URL:


The navigation we want to perform on this page is far less complex than on the last page, so we can avoid using a TokeParser for it - a regular expression should suffice. The HTML we want to parse looks something like this:

  <br><b>Episode</b><br>  The Episode Title<br>

We use a regex delimited with '!' in order to avoid having to escape the slashes present in the HTML, and store any number of alphanumeric characters after some whitespace, all between <br> tags after the Episode header:

  $agent->{content} =~ m!<br><b>Episode</b><br>\s+?(\w+?)<br>!;

$1 now contains our episode, and all that's left to do is print out what we've found:

  my $episode = $1;
  print "The next Buffy episode ($episode) is on $date.\n";

And we're all set. We can run our script from the shell:

  $ perl radiotimes.pl

  The next Buffy episode (Gone) is Thursday Jan. 23, from 6:45 to 7:30 p.m.
I hope this gives a light-hearted introduction to the usefulness of the modules involved. As a note for your own experiments, WWW::Mechanize supports cookies - in that the requestor is a normal LWP::UserAgent object - but they aren't enabled by default. If you need to support cookies, then your script should call "use HTTP::Cookies; $agent->cookie_jar(HTTP::Cookies->new);" on your agent object in order to enable session-volatile cookies for your own code.
Happy screen-scraping, and may you never miss a Buffy episode again.

Source: http://www.perl.com/pub/2003/01/22/mechanize.html

Friday, 14 November 2014

Big Data Democratization via Web Scraping

Big Data Democratization via Web Scraping

If  we had to put democratization of data inline with the classroom definition of democracy, it would read- Data by the people, for the people, of the people. Makes a lot of sense, doesn’t it? It resonates with the generic feeling we have these days with respect to easy access to data for our daily tasks. Thanks to the internet revolution, and now the social media.


Big Data web Crawling

By the people- most of the public data on the web is a user group’s sentiments, analyses and other information.

Of the people- Although the “of” here does not literally mean that the data is owned, all such data on the internet either relates to the user group itself or its views on things.

For the people- Most of this data is presented via channels (either social media, news, etc.) for public benefit be it travel tips, daily news feeds, product price comparisons, etc.

Essentially, data democratization has come to mean that by leveraging cloud computing, data that’s mostly user-generated on the internet has become accessible by all industries- big or small for their own internal use (commercial or not). This democratization has been put to use for unearthing hidden patterns from big blobs of datasets. Use cases have evolved with the consumer internet landscape and Big Data is now being used for various other means quite unanticipated.

With respect to the democratization, we’ve also heard enough about how data analytics is paving way beyond data analysts within companies and becoming available to even the non-tech-savvies. But did anyone mention DaaS providers who aid in the very first phase of data acquisition? Data scraping or web crawling (whatever your lingo is) has come to become an indivisible part of data democratization, especially when talking large-scale. The first step into bringing the public data to use is acquiring it which is where setting up web crawlers internally or partnering with DaaS providers comes to play. This blog guides towards making a choice. Its not always all the data that companies crunch or should crunch from the web. There’s obviously certain channels that are of more interest to the community than the rest and there lies the barrier- to identify sources of higher ROI and acquire data in a machine-readable format.

DaaS providers usually come to help with the entire data acquisition pipeline- starting from picking the right sources through crawl, extraction, dedup as well as data normalization based on specific requirements. Once the data has been acquired, its most likely published on another channel. Such network effect bolsters the democracy.

Steps in Data Acquisition Pipeline


Note- PromptCloud only delivers structured data as per the schema provided.

So while democratization may refer to easy access of computing resources in order to draw patterns from Big Data, it could also be analogous to ensuring right data in the right format at right intervals. In fact, DaaS providers have themselves used this democracy to empower it further.


Thursday, 13 November 2014

Why Businesses Need Data Scraping Service?

With the ever-increasing popularity of internet technology there is an abundance of knowledge processing information that can be used as gold if used in a structured format. We all know the importance of information. It has indeed become a valuable commodity and most sought after product for businesses. With widespread competition in businesses there is always a need to strive for better performances.

Taking this into consideration web data scraping service has become an inevitable component of businesses as it is highly useful in getting relevant information which is accurate. In the initial periods data scraping process included copying and pasting data information which was not relevant because it required intensive labor and was very costly. But now with the help of new data scraping tools like Mozenda, it is possible to extract data from websites easily. You can also take the help of data scrapers and data mining experts that scrape the data and automatically keep record of it.

How Professional Data Scraping Companies and Data Mining Experts Device a Solution?

Data Scraping Plan and Solutions


Why Data Scraping is Highly Essential for Businesses?

Data scraping is highly essential for every industry especially Hospitality, eCommerce, Research and Development, Healthcare, Financial and data scraping can be useful in marketing industry, real estate industry by scraping properties, agents, sites etc., travel and tourism industry etc. The reason for that is it is one of those industries where there is cut-throat competition and with the help of data scraping tools it is possible to extract useful information pertaining to preferences of customers, their preferred location, strategies of your competitors etc.

It is very important in today’s dynamic business world to understand the requirements of your customers and their preferences. This is because customers are the king of the market they determine the demand. Web data scraping process will help you in getting this vital information. It will help you in making crucial decisions which are highly critical for the success of business. With the help of data scraping tools you can automate the data scraping process which can result in increased productivity and accuracy.

Reasons Why Businesses Opt. For Website Data Scraping Solutions:

Website Scraping
Demand For New Data:

There is an overflowing demand for new data for businesses across the globe. This is due to increase in competition. The more information you have about your products, competitors, market etc. the better are your chances of expanding and persisting in competitive business environment. The manner in which data extraction process is followed is also very important; as mere data collection is useless. Today there is a need for a process through which you can utilize the information for the betterment of the business. This is where data scraping process and data scraping tools come into picture.

Capitalize On Hot Updates:

Today simple data collection is not enough to sustain in the business world. There is a need for getting up to date information. There are times when you will have the information pertaining to the trends in the market for your business but they would not be updated. During such times you will lose out on critical information. Hence; today in businesses it is a must to have recent information at your disposal.

The more recent update you have pertaining to the services of your business the better it is for your growth and sustenance. We are already seeing lot of innovation happening in the field of businesses hence; it is very important to be on your toes and collect relevant information with the help of data scrapers. With the help of data scrapping tools you can stay abreast with the latest developments in your business albeit; by spending extra money but it is necessary tradeoff in order to grow in your business or be left behind like a laggard.

Analyzing Future Demands:

Foreknowledge about the various major and minor issues of your industry will help you in assessing the future demand of your product / service. With the help of data scraping process; data scrapers can gather information pertaining to possibilities in business or venture you are involved in. You can also remain alert for changes, adjustments, and analysis of all aspects of your products and services.

Appraising Business:

It is very important to regularly analyze and evaluate your businesses. For that you need to evaluate whether the business goals have been met or not. It is important for businesses to know about your own performance. For example; for your businesses if the world market decides to lower the prices in order to grow their customer base you need to be prepared whether you can remain in the industry despite lowering the price. This can be done only with the help of data scraping process and data scraping tools.


Tuesday, 11 November 2014

How to scrape Amazon with WebDriver in Java

Here is a real-world example of using Selenium WebDriver for scraping.
This short program is written in Java and scrapes book title and author from the Amazon webstore.
This code scrapes only one page, but you can easily make it scraping all the pages by adding a couple of lines.

You can download the souce here.

import java.io.*;
import java.util.*;
import java.util.regex.*;

import org.openqa.selenium.*;
import org.openqa.selenium.firefox.FirefoxDriver;

public class FetchAllBooks {

    public static void main(String[] args) throws IOException {

        WebDriver driver = new FirefoxDriver();


        List<WebElement> allAuthors =  driver.findElements(By.className("tgProductAuthor"));
        List<WebElement> allTitles =  driver.findElements(By.className("tgProductTitleText"));
        int i=0;
        String fileText = "";

        for (WebElement author : allAuthors){
            String authorName = author.getText();
            String Url = (String)((JavascriptExecutor)driver).executeScript("return

arguments[0].innerHTML;", allTitles.get(i++));
            final Pattern pattern = Pattern.compile("title=(.+?)>");
            final Matcher matcher = pattern.matcher(Url);
            String title = matcher.group(1);
            fileText = fileText+authorName+","+title+"\n";

        Writer writer = new BufferedWriter(new OutputStreamWriter(new

FileOutputStream("books.csv"), "utf-8"));


Source: http://scraping.pro/scraping-amazon-webdriver-java/

Monday, 10 November 2014

Web Scraping Enters Politics

Web scraping is becoming an essential tool in gaining an edge over everything about just anything. This is proven by international news on US political campaigns, specifically by identifying wealthy donors. As is commonly known, election campaigns should follow a rule regarding the use of a certain limited amount of money for the expenses of each candidate. Being so, much of the campaign activities must be paid by supporters and sponsors.

It is not a surprise then that even politics is lured to make use of the dynamic and ever growing data mining processes. Once again, web mining has proven to be an essential component of almost all levels of human existence, the society, and the world as a whole. It proves its extraordinary capacity to dig precious information to reach the much aspired for goals of every individual.

Mining for personal information

The CBC News online very recently disclosed that the US Republican presidential candidate Mitt Romney has used data mining in order to identify rich donors. It is reported that the act of getting personal information such as the buying history and church attendance were vital in this incident. Through this information, the party was able to identify prospective rich donors and indeed tap them. As a businessman himself, Romney knows exactly how to fish and where the fat fish are. Moreover, what is unique about the identified donors is that they have never been donating before.


Friday, 7 November 2014

Web Scraping: The Invaluable Decision Making Tool

Business decisions are mandatory in any company. They reflect and directly influence about the future of the company. It is important to realize that decisions must be made in any business situation. The generation of new ideas calls for new actions. This in turn calls for decisions. Decisions can only be made when there is adequate information or data regarding the problem and the cause of action to be taken. Web scraping offers the best opportunity in getting the required information that will enable the management make a wise and sound decision.

Therefore web scraping is an important part in generation of the practical interpretations for the business decision making process. Since businesses take many courses of actions the following areas call for adequate web scraping in order to make outstanding decisions.

1. Suppliers. Whether you are running an offline business there is need to get information regarding your suppliers. In this case there are two situations. The first situation is about your current suppliers and the second situation is about the possibility of acquiring new suppliers. By web scraping you has the opportunity to gather about your suppliers. You need to know other business they are supplying to and the kind of discounts and prices they offer to them. Another important aspect about consumers is to determine the periods when they have surplus and therefore be able to determine the purchasing prices.

Web scraping can provide new information concerning new suppliers. This will make a cutting edge in the purchasing sector. You can get new suppliers that have reasonable prices. This will go a long way in ensuring a profitable business. Therefore web scraping is an integral process that should be taken first before making a vital decision concerning suppliers.