I happen to be working on a huge
data with two separate data sets. The initial pre-processing work was to ensure
that I merge those columns with a match, both total and partial matching R. This
is unique case so I had to dig deep to find anyone who had posted the
similar thing before. To my surprise, partial merging is a common problem to
many analyst except a few who are well conversant with function writing in
their respective languages. Fortunately I came across a blog by
Tony Hirst Merging Data Set s based on partially matched data set elements. This an exemplary work
from Tony given that his work is based on Levenshtein Distance. In this
article, I will be explaining Tony’s work to easier its
understanding and even to customize it in order to achieve your desires.
#The data I want to merge are in two CSV files, Book1.csv (Methods) should be merged with #techstreetcom(Item), with partial match and duplicate
#It it intends to converte all the elements into lovwer case, removing stop words this is done by the
#First is setting the work directory
setwd(" /")
Function1=function(x){
func=paste(sort(unlist(strsplit(tolower(x)," "))),collapse='')
return(func)
}
# the following function does all that is required for the matching
partialMatch=function(x,y,levDist=0.1){
#asigning the variables to data frame,
xx=data.frame(func=sapply(x, Function1),row.names=NULL)
yy=data.frame(func=sapply(y, Function1),row.names=NULL)
xx$raw=x
yy$raw=y
xx=subset(xx,subset=(func!=''))
xy=merge(xx,yy,by='func',all=T)
matched=subset(xy,subset=(!(is.na(raw.x)) & !(is.na(raw.y))))
matched$pass="Duplicate"
todo=subset(xy,subset=(is.na(raw.y)),select=c(func,raw.x))
colnames(todo)=c('func','raw')
todo$partials= as.character(sapply(todo$func, agrep, yy$func,max.distance = levDist,value=T))
todo=merge(todo,yy,by.x='partials',by.y='func')
partial.matched=subset(todo,subset=(!(is.na(raw.x)) & !(is.na(raw.y))),select=c("func","raw.x","raw.y"))
partial.matched$pass="Partial"
matched=rbind(matched,partial.matched)
un.matched=subset(todo,subset=(is.na(raw.x)),select=c("func","raw.x","raw.y"))
if (nrow(un.matched)>0){
un.matched$pass="Unmatched"
matched=rbind(matched,un.matched)
}
matched=subset(matched,select=c("raw.x","raw.y","pass"))
return(matched)
}
#A rogue character in @coneee's data file borked things for me, so let's do a character code conversion first
#calling the function and matching the elements partilaly and totally
If you are not able to get the everything, you can still use this function. In the line bellow, just change the column and the data to meet your objective
matches=partialMatch(techstreetcom$Item,Book1$Methods)
#Writing the matched data into working directory as csv file
write.csv(matches,"matches.csv")
No comments:
Post a Comment