Stepwise Web Scraping

Web Scraping of Tom Brady Match Information

Background Information about Tom Brady

Tom Brady, in full Thomas Edward Patrick Brady, Jr., (born August 3, 1977, San Mateo, California, U.S.), American gridiron football quarterback who led his teams to a record seven Super Bowl victories (2002, 2004, 2005, 2015, 2017, 2019, and 2021) and was named the game’s Most Valuable Player (MVP) five times (2002, 2004, 2015, 2017, and 2021).

Brady, who did not start until his junior year, led Michigan to victory in the 1999 Orange Bowl and gained a reputation as a determined and intelligent player but one who lacked any exceptional physical skills. In 2000 he was chosen in the sixth round of the NFL draft by the New England Patriots, and he worked diligently during his first season to bulk up physically and improve his strength and technique.

For more information you about Tom Brady Click here

Web Scraping of both Regular and Playoff Tables

In this exercise we are interested in obtaining the information contained in the two tables the Regular table as well as the Play offs Table. To achieve this in R , the rvest and tidyverse packages would be handy for both web-scraping and data wrangling, while flextable packages is to aid in creation of table.

Loading the required packages.

library(rvest)     # Web scrape
library(tidyverse) # Data wrangling
library(DT) # For Tables

Initializing Webscraping process

To initialize the Web scraping process, there is need to obtain the URL of the website of interest in this case the pro-football reference website.

Therefore, the url for the pro-football reference are loaded into the R software and creating the web page that would enable the extraction of the elements of the page

url="https://www.pro-football-reference.com/players/B/BradTo00/gamelog/"
page<-url %>% 
  read_html()   #creating the page

Getting the Stats regular season table

Getting the statistics table “stats” of the regular season table then follows. Having created the web-page in the previous junk of code, I am going ahead to obtain the Stats table from the pro football page.

Hence Viewing the first 6 rows of the scraped dataset

stats=page %>% 
  html_element(css="#stats") %>% 
  html_table()   # Scraping the regular stats table 


head(stats) %>% datatable()   # Viewing the heading and the first two lines

The Stats regular table have some missing column names in the first lines since the Stats Regular table have both headers and sub headers.

Getting the Stats Playoffs season table

In a similar way we replicate the same code to to get the stats play off table and viewing the outcome using the glimpse function

stats_playoffs=page %>% 
  html_element(css="#stats_playoffs") %>% 
  html_table()


head(stats_playoffs) %>% datatable()

Similarly, the Stats Playoffs table also have some missing column names in the first lines since the Stats Playoffs table have both headers and sub headers.

Checking the number of columns of the two tables

ncol(stats)

## [1] 65

ncol(stats_playoffs)

## [1] 52

So far there are two issues making our data set untidy. Even though the two data sets seems to be having similar structures, they bear different columns with different column names.

Data Manipulation

1. Data Cleaning

Main Objective

At the end of the data cleaning we should be able to combine the two data sets

To realize the above objective, the following activities shall be done:

Combining the Column names first row to form the column name of the the two data sets by first cleaning the Stats Regular Data set.
Removing all the empty columns from the Stats Regular Table.
Removing all the column names in the stats table beginning “_“.
Removing unnecessary Rows and Columns: removing rows beginning with alphabets in Date Column, removing column that are empty
Checking on the duplicated column names and assigning then “2” - the second instance to make them unique.
Replacing both the “*” in the GS column with the started the match
Replacing the instances of empty in the GS column with ‘Played but not started’
Assign the columns after GS to with the string values present in GS to Zeros
Using the Cleaning procedure applied in Stats Regular Table as a function of cleaning the Stats Playoffs Table

The Data Clining Function

Since our data sets from the regular and play off tables have similar structure. A code is written to clean the first data set the regular or stats table then used as a formula to to speed up the cleaning process of the stats play off table.

This is a beneficial as it would aid the cleaning not only the two tables but all untidy data sets/tables with similar structure.

In this code, a function is developed based on the stats table.

# 1. Combining the column names and first row to form the data set column names

data_cleaning<-function(stats){
  colnames(stats)=paste(colnames(stats), stats[1,], sep="_" )
  
# 2. Removing all the column names in the stats table beginning "_"
  
  colnames(stats)=str_remove_all(colnames(stats), "^[_]")
  
# 3. Removing unnecessary Rows and Columns
# 3.1 Removing Rows beginning with alphabets in Date Column 
  
  stats=stats[!grepl("^.*[A-Z].*", stats$Date),]
  
# 3.2 Removing the column names with empty strings 
  
  stats=stats[,!grepl("^$", colnames(stats))]    

# 4. Checking on the duplicated column names  
  dub_index<-grep(TRUE, duplicated(colnames(stats))) 
  
# 5. Assigning the Duplicates the second instance  
  colnames(stats)[dub_index]=paste(colnames(stats)[dub_index], "2", sep="_") 
  
# 6. Assessing the Column Names of the Stats Regular Table  
  colnames(stats)

# 7. Replacing the instances of "*" with the started the match   
  stats$GS=stats$GS %>% 
    str_replace_all("[*]","Started") %>% 
    
# 8. Replacing the instances of empty with played but not started  
    str_replace_all("^$", "Played but not started") 
  
# 9. Viewing the first six rows of the Data set  
  head(stats)
  
# 10. Determining the index number 
  
  GS_index<-grep("^GS$", colnames(stats))
  
# 11. Creating a loop to assign the columns after GS to with the string values present in GS to Zeros
  
  GS_index<-GS_index+1
  for(i in GS_index:ncol(stats)){ 
    stats[[i]][stats[[i]] %in% unique(stats$GS)]<- NA   
  }
  
# 12. providing the outcome of the data cleaning process
  
  return(stats)
}

Stats Regular Table

Using the built data cleaning function, both the tables can now be cleaned.

Calling the data cleaning function to clean the Stats Regular data set
Making the Table dynamic by calling the DT package functionality

#Cleaning the stats regular data table

Stats_regular<- data_cleaning(stats)

datatable(Stats_regular)

Stats Playoffs Table

Calling the data cleaning function to clean the Stats Playoffs Regular data set
Making the Table dynamic by calling the DT package functionality

#cleaning the stats playoffs table
Stats_playoffs<- data_cleaning(stats_playoffs) 

datatable(Stats_playoffs)

Determining the unique column names that are present in both datasets

# Combining the unique column names

all_column_names<-c(colnames(Stats_regular), colnames(Stats_playoffs)) %>% 
  unique()

all_column_names

##  [1] "Rk"              "Year"            "Date"            "G#"             
##  [5] "Week"            "Age"             "Tm"              "Opp"            
##  [9] "Result"          "GS"              "Passing_Cmp"     "Passing_Att"    
## [13] "Passing_Cmp%"    "Passing_Yds"     "Passing_TD"      "Passing_Int"    
## [17] "Passing_Rate"    "Passing_Sk"      "Passing_Yds_2"   "Passing_Y/A"    
## [21] "Passing_AY/A"    "Rushing_Att"     "Rushing_Yds"     "Rushing_Y/A"    
## [25] "Rushing_TD"      "Receiving_Tgt"   "Receiving_Rec"   "Receiving_Yds"  
## [29] "Receiving_Y/R"   "Receiving_TD"    "Receiving_Ctch%" "Receiving_Y/Tgt"
## [33] "Scoring_TD"      "Scoring_Pts"     "Sk"              "Tackles_Solo"   
## [37] "Tackles_Ast"     "Tackles_Comb"    "Tackles_TFL"     "Tackles_QBHits" 
## [41] "Fumbles_Fmb"     "Fumbles_FL"      "Fumbles_FF"      "Fumbles_FR"     
## [45] "Fumbles_Yds"     "Fumbles_TD"      "Punting_Pnt"     "Punting_Yds"    
## [49] "Punting_Y/P"     "Punting_RetYds"  "Punting_Net"     "Punting_NY/P"   
## [53] "Punting_TB"      "Punting_TB%"     "Punting_In20"    "Punting_In20%"  
## [57] "Punting_Blck"    "Off. Snaps_Num"  "Off. Snaps_Pct"  "Def. Snaps_Num" 
## [61] "Def. Snaps_Pct"  "ST Snaps_Num"    "ST Snaps_Pct"    "Status"

Determining the missing column names that are present in Stats Regular table and absent in the Stats Playoffs table

`%notin%`=Negate(`%in%`)

not_in_headers<-all_column_names %notin% colnames(Stats_playoffs)
missing_columns<-all_column_names[not_in_headers]
missing_columns

##  [1] "Receiving_Tgt"   "Receiving_Rec"   "Receiving_Yds"   "Receiving_Y/R"  
##  [5] "Receiving_TD"    "Receiving_Ctch%" "Receiving_Y/Tgt" "Sk"             
##  [9] "Tackles_Solo"    "Tackles_Ast"     "Tackles_Comb"    "Tackles_TFL"    
## [13] "Tackles_QBHits"

Stats_playoffs[,missing_columns]<-0

Ordering the the column names so as to combine the two datasets

Stats_regular=Stats_regular[,all_column_names]
Stats_playoffs=Stats_playoffs[,all_column_names]

Combining the two datasets to form one table for both Regular and Playoff Statistics

Stats_Combined<-rbind(Stats_regular, Stats_playoffs) %>% 
  arrange(Year)
DT::datatable(Stats_Combined)