Wisconsin Web Scraping


Tuesday 23 December 2014

Scrape Web data using R

Plenty of people have been scraping data from the web using R for a while now, but I just completed my first project and wanted to share the code with you. It was a little hard to work through some of the “issues”, but I had some great help from @DataJunkie on Twitter.

As an aside, if you are learning R and coming from another package like SPSS or SAS, I highly recommend following the hashtag #rstats on Twitter to be amazed by the kinds of data analysis going on right now.

One note: when I read in my table, it contained a weird set of characters. I suspect it is some sort of encoding issue, but luckily I was able to work around it by recoding the data from character factors to numbers using the stringr package and some basic regular expressions.

Bring on fantasy football!

################################################################
## Help from the following sources:
## @DataJunkie on twitter
## http://www.regular-expressions.info/reference.html
## http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package
## http://stackoverflow.com/questions/2443127/how-can-i-use-r-rcurl-xml-packages-to-scrape-this-webpage
################################################################

library(XML)
library(stringr)

# build the URL
url <- paste("http://sports.yahoo.com/nfl/stats/byposition?pos=QB",
        "&conference=NFL&year=season_2009",
        "&timeframe=Week1", sep="")

# read the tables and select the one that has the most rows
tables <- readHTMLTable(url)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
tables[[which.max(n.rows)]]
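If the readHTMLTable(url) call ever fails on your end (for example, because the page moves to https, which the XML package cannot fetch directly), one workaround is to download the page source first and parse the text. This is just a sketch, assuming the RCurl package is installed:

library(RCurl)
raw.html <- getURL(url)                   # fetch the page source as a string
doc <- htmlParse(raw.html, asText=TRUE)   # parse the HTML text with the XML package
tables <- readHTMLTable(doc)              # then extract the tables as before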

# select the table we need (index 7, identified above) - read as a data frame
my.table <- tables[[7]]

# delete extra columns and keep only the data rows
View(head(my.table, n=20))
my.table <- my.table[3:nrow(my.table), c(1:3, 5:12, 14:18, 20:21, 23:24)]

# rename every column
c.names <- c("Name", "Team", "G", "QBRat", "P_Comp", "P_Att", "P_Yds", "P_YpA", "P_Lng", "P_Int", "P_TD", "R_Att",
        "R_Yds", "R_YpA", "R_Lng", "R_TD", "S_Sack", "S_SackYa", "F_Fum", "F_FumL")
names(my.table) <- c.names

# the data get read in with weird symbols that need to be removed;
# the values are initially stored as character factors
# for the loops, I am manually telling the code which fix to apply,
# which assumes consistent behavior depending on where the weird
# characters sit -- is this an encoding issue?

front <- c(1)
back <- c(4:ncol(my.table))

for(f in front) {
    test.front <- as.character(my.table[, f])
    tt.front <- str_sub(test.front, start=3)   # keep everything from the third character on
    my.table[, f] <- tt.front
}
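To make the first loop concrete: str_sub(..., start=3) simply drops the first two characters of each value. A quick check on a made-up garbled name (the "??" prefix stands in for whatever junk characters show up):

str_sub("??Aaron Rodgers", start=3)
# [1] "Aaron Rodgers"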

for(b in back) {
    test <- as.character(my.table[, b])
    # pull the first (possibly negative, possibly decimal) number out of each cell
    tt.back <- as.numeric(str_match(test, "\\-*\\d{1,3}[\\.]*[0-9]*"))
    my.table[, b] <- tt.back
}
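Likewise, the pattern in the second loop grabs the first number (with an optional sign and decimal part) from each cell, and as.numeric() does the conversion. Testing it on a few made-up garbled values:

junk <- c("??12.5", "??-3", "??271")
as.numeric(str_match(junk, "\\-*\\d{1,3}[\\.]*[0-9]*"))
# [1]  12.5  -3.0 271.0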

str(my.table)
View(my.table)

# clear memory and quit R without saving the workspace
rm(list=ls())
q(save="no")

Source: http://www.r-bloggers.com/scrape-web-data-using-r/
