Main Objective
At the end of the data cleaning we should be able to combine the
two data sets
To realize the above objective, the following activities shall be
done:
Combining the Column names first row to form the column name of
the the two data sets by first cleaning the Stats Regular Data
set.
Removing all the empty columns from the Stats Regular
Table.
Removing all the column names in the stats table beginning
“_“.
Removing unnecessary Rows and Columns: removing rows
beginning with alphabets in Date Column, removing column that
are empty
Checking on the duplicated column names and assigning then “2” -
the second instance to make them unique.
Replacing both the “*” in the GS column with the started the
match
Replacing the instances of empty in the GS column with ‘Played
but not started’
Assign the columns after GS to with the string values present in
GS to Zeros
Using the Cleaning procedure applied in Stats Regular Table as a
function of cleaning the Stats Playoffs Table
The Data Clining Function
Since our data sets from the regular and play off tables have similar
structure. A code is written to clean the first data set the regular or
stats table then used as a formula to to speed up the cleaning process
of the stats play off table.
This is a beneficial as it would aid the cleaning not only the two
tables but all untidy data sets/tables with similar structure.
In this code, a function is developed based on the stats table.
# 1. Combining the column names and first row to form the data set column names
data_cleaning<-function(stats){
colnames(stats)=paste(colnames(stats), stats[1,], sep="_" )
# 2. Removing all the column names in the stats table beginning "_"
colnames(stats)=str_remove_all(colnames(stats), "^[_]")
# 3. Removing unnecessary Rows and Columns
# 3.1 Removing Rows beginning with alphabets in Date Column
stats=stats[!grepl("^.*[A-Z].*", stats$Date),]
# 3.2 Removing the column names with empty strings
stats=stats[,!grepl("^$", colnames(stats))]
# 4. Checking on the duplicated column names
dub_index<-grep(TRUE, duplicated(colnames(stats)))
# 5. Assigning the Duplicates the second instance
colnames(stats)[dub_index]=paste(colnames(stats)[dub_index], "2", sep="_")
# 6. Assessing the Column Names of the Stats Regular Table
colnames(stats)
# 7. Replacing the instances of "*" with the started the match
stats$GS=stats$GS %>%
str_replace_all("[*]","Started") %>%
# 8. Replacing the instances of empty with played but not started
str_replace_all("^$", "Played but not started")
# 9. Viewing the first six rows of the Data set
head(stats)
# 10. Determining the index number
GS_index<-grep("^GS$", colnames(stats))
# 11. Creating a loop to assign the columns after GS to with the string values present in GS to Zeros
GS_index<-GS_index+1
for(i in GS_index:ncol(stats)){
stats[[i]][stats[[i]] %in% unique(stats$GS)]<- NA
}
# 12. providing the outcome of the data cleaning process
return(stats)
}