Friday, 12 August 2016

Data Scraping web application with R shiny

From time to time, we tweet about topics of our own interest and appeal. Sometimes these tweets are liked, re-tweeted or even disliked, depending on other people's views. However, when the tweets are about interesting topics such as politics, business, sports or even religion, they become trending memes and can be analysed to retrieve sentiments. A sentiment is simply a view of, or an attitude towards, a situation or event.

Prior to any sentiment analysis, one must scrape the tweets and import the data into the analytical tool of their choice. I am an R programmer, so I will show how this is done in R Shiny. I will be explaining how to set up an R Shiny web app like the one below.
In order to grasp more of this article, you can view the web app here: https://oscardoe.shinyapps.io/TweetFisher/



Main topics of the web app

User interface (ui.r)
Loading the required packages.
This work requires specific packages, which must be installed and then loaded before you start to code or copy-paste the code below.
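If any of these packages are missing, they can be installed first; a one-off sketch covering the packages loaded below is:

install.packages(c("shiny", "twitteR", "shinythemes", "wordcloud",
                   "tm", "stringr", "rsconnect", "shinydashboard"))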
library(shiny)
library(twitteR)
library(shinythemes)
library(wordcloud)
library(tm)
library(stringr)
library(rsconnect)
library(shinydashboard)

Specifying the tweet search terms.
This program uses the searchTwitter function, so specific search terms can be specified. The search term is the term whose tweets you want to find; it can be a hashtag, a user or just any name. Alongside the term, the app lets you set (see the sketch after this list):
  • the location of the tweets,
  • the number of tweets, and
  • the date range of the hashtag.
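Outside Shiny, a single call with these parameters might look like the following minimal sketch (the term, dates and geocode are example values matching the defaults used in the app):

tweets <- searchTwitter("#Kenya", n = 100, lang = "en",
                        since = "2016-07-24", until = "2016-07-25",
                        geocode = "-1.2920659,36.82196,45mi")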
shinyUI(dashboardPage(
  # creating the title of the web app
  dashboardHeader(title = "Tweets Fisher"),
  # developing the sidebar of the interface
  dashboardSidebar(
    # requesting a term to be searched
    textInput("term", "Enter a term", "Kenya"),
    # inputting the number of tweets you want scraped
    sliderInput("cant", "Select a number of tweets", min = 5, max = 1500, value = 50),
    # selecting the language of the tweets
    radioButtons("lang", "Select the language", c(
      "English" = "en",
      "Spanish" = "es")),
    # selecting the location of the tweets
    selectizeInput("location", "Enter your search area", choices = list(
      kenya = "-1.2920659,36.82196,45mi",
      kuwait = "29.3454657,47.9969453,80mi")),
    textInput("date1", "Enter start date", "2016-07-24"),
    textInput("date2", "Enter end date", "2016-07-25"),
    submitButton(text = "Run")),
  # developing the body of the dashboard
  dashboardBody(
    h4("Last 15 tweets on your entered term"),
    tableOutput("table"),
    # word cloud of the scraped tweets, rendered in server.R
    plotOutput("wordcl"),
    h5("Follow me http://doenyamanga.blogspot.co.ke/"))
))


Before the app can call searchTwitter, it must authenticate with the Twitter API. Replace the placeholder strings below with your own credentials from apps.twitter.com.

apiKey <- "YOUR_API_KEY"
apiSecret <- "YOUR_API_SECRET"
access_token <- "YOUR_ACCESS_TOKEN"
access_token_secret <- "YOUR_ACCESS_TOKEN_SECRET"
my_oauth <- setup_twitter_oauth(apiKey, apiSecret, access_token, access_token_secret)

Server side (server.r)

library(shiny)
library(twitteR)
library(tm)
library(stringr)
library(wordcloud)

shinyServer(function(input, output){
  # scraping the tweets that match the user's inputs
  rawData <- reactive({
    tweets <- searchTwitter(input$term, n = input$cant, lang = input$lang,
                            since = input$date1, until = input$date2,
                            geocode = input$location)
    twListToDF(tweets)
  })
  # displaying the first 15 tweets in a table
  output$table <- renderTable({
    head(rawData()[1], n = 15)
  })
  # removing twitter handles from the tweet text
  rawDataa <- reactive({ str_replace_all(rawData()$text, "@\\w+", "") })
  # removing emojis and other non-ASCII characters
  tweets1 <- reactive({ iconv(rawDataa(), 'UTF-8', 'ASCII', sub = "") })
  # building a corpus and a term-document matrix for the word cloud
  wordCorpus <- reactive({ Corpus(VectorSource(tweets1())) })
  tdm <- reactive({ as.matrix(TermDocumentMatrix(wordCorpus())) })
  output$wordcl <- renderPlot({
    freqs <- sort(rowSums(tdm()), decreasing = TRUE)
    wordcloud(names(freqs), freqs, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
  })
})
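With the two files in place, here is a minimal sketch for running the app locally and, optionally, deploying it to shinyapps.io (the folder name TweetFisher and a configured rsconnect account are assumptions):

# assuming ui.R and server.R are saved together in a folder called "TweetFisher"
library(shiny)
runApp("TweetFisher") # run locally
# rsconnect::deployApp("TweetFisher") # deploy to shinyapps.io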

Friday, 1 July 2016

Partial Merging of Data sets in R

I happened to be working on a huge project with two separate data sets. The initial pre-processing work was to merge those columns with a match, using both total and partial matching in R. This is a unique case, so I had to dig deep to find anyone who had posted something similar before. To my surprise, partial merging is a common problem for many analysts, except a few who are well conversant with writing functions in their respective languages. Fortunately, I came across a blog post by Tony Hirst, "Merging Datasets Based on Partially Matched Data Elements". This is exemplary work from Tony, given that it is based on the Levenshtein distance. In this article, I will explain Tony's work to make it easier to understand, and show how to customize it to achieve your own goals.
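For intuition, the Levenshtein distance between two strings is the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other; base R can compute it with adist:

adist("kitten", "sitting") # returns 3: substitute k->s, substitute e->i, insert g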

# The data I want to merge are in two CSV files: Book1.csv (column Methods) should be merged with
# techstreetcom.csv (column Item), allowing both partial matches and duplicates.
# Function1 normalises each element: it converts the string to lower case, splits it into words,
# sorts the words and pastes them back together, so that re-ordered strings still match exactly.
# First, set the working directory (replace with your own path)
setwd("path/to/your/data")
# reading the two files (the techstreetcom.csv file name is assumed from the description above)
Book1 <- read.csv("Book1.csv", stringsAsFactors = FALSE)
techstreetcom <- read.csv("techstreetcom.csv", stringsAsFactors = FALSE)
Function1 = function(x){
  func = paste(sort(unlist(strsplit(tolower(x), " "))), collapse = '')
  return(func)
}
# the following function does all that is required for the matching
partialMatch = function(x, y, levDist = 0.1){
  # assigning the normalised and raw values to data frames
  xx = data.frame(func = sapply(x, Function1), row.names = NULL)
  yy = data.frame(func = sapply(y, Function1), row.names = NULL)
  xx$raw = x
  yy$raw = y
  xx = subset(xx, subset = (func != ''))
  xy = merge(xx, yy, by = 'func', all = T)
  # rows where both sides matched exactly on the normalised value
  matched = subset(xy, subset = (!(is.na(raw.x)) & !(is.na(raw.y))))
  matched$pass = "Duplicate"
  # rows still to do: no exact match, so try a fuzzy (agrep) match
  todo = subset(xy, subset = (is.na(raw.y)), select = c(func, raw.x))
  colnames(todo) = c('func', 'raw')
  todo$partials = as.character(sapply(todo$func, agrep, yy$func, max.distance = levDist, value = T))
  todo = merge(todo, yy, by.x = 'partials', by.y = 'func')
  partial.matched = subset(todo, subset = (!(is.na(raw.x)) & !(is.na(raw.y))), select = c("func", "raw.x", "raw.y"))
  partial.matched$pass = "Partial"
  matched = rbind(matched, partial.matched)
  un.matched = subset(todo, subset = (is.na(raw.x)), select = c("func", "raw.x", "raw.y"))
  if (nrow(un.matched) > 0){
    un.matched$pass = "Unmatched"
    matched = rbind(matched, un.matched)
  }
  matched = subset(matched, select = c("raw.x", "raw.y", "pass"))
  return(matched)
}

# a rogue character in a data file can bork things, so a character-code conversion first may help
# calling the function and matching the elements partially and totally
If you are not able to match everything, you can still use this function. In the line below, just change the columns and the data frames to meet your objective.
matches=partialMatch(techstreetcom$Item,Book1$Methods)
# writing the matched data to the working directory as a CSV file
write.csv(matches, "matches.csv")
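As a quick illustration, here is a minimal sketch of the function on two toy vectors (hypothetical names, not the CSV files above); a looser levDist is used so that these short strings can still fuzzy-match:

a <- c("John Smith", "Mary Jane")
b <- c("smith john", "Mary Janes")
partialMatch(a, b, levDist = 0.3)
# "John Smith"/"smith john" match exactly after normalisation (pass = "Duplicate"),
# while "Mary Jane"/"Mary Janes" only match fuzzily (pass = "Partial")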



Sunday, 19 June 2016

Resolving the common error JAVA_HOME cannot be determined from the Registry

When trying to load R packages that depend on rJava, you might face errors such as the one below.

library(XLConnect)
library(rJava)
library(XLConnectJars)
library(xlsx)

Error : .onLoad failed in loadNamespace() for ‘rJava’, details:
call: fun(libname, pkgname)
error: JAVA_HOME cannot be determined from the Registry

The affected packages are the rJava package itself, plus the XLConnect, XLConnectJars and xlsx packages.
This error arises when Java is missing from your computer or when its installation entry is corrupted.
The error message effectively says that there is no Java entry in the Windows Registry.


To resolve this error, download Java here. Kindly ensure that the downloaded Java matches your computer's specifications (for example, a 64-bit R requires a 64-bit Java).
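If Java is installed but R still cannot find it, one possible workaround is to point R at the Java directory manually before loading rJava. This is only a sketch; the path below is an example, so use your actual Java installation directory:

Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jre1.8.0_91") # example path, adjust to your machine
library(rJava)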

Saturday, 18 June 2016

Market basket Analysis with R

Perhaps you have heard about R and its unlimited capability, but you have not yet experienced R applied to market basket analysis. First and foremost, I would like to elaborate on market basket analysis. MBA, as it is frequently abbreviated, identifies the combinations of products that frequently co-occur in transactions. For instance, people who buy bread and eggs also tend to buy butter, as many of them are planning to make an omelette.
A marketing team should therefore target customers who buy bread and eggs with offers on butter, to encourage them to spend more on their shopping basket.
Market basket analysis, or association rules as it is also referred to, is a well-documented area of data mining. In R, the analysis is done with the arules package, and the visualization with the arulesViz package. The majority of blogs I have seen use the Groceries data set from the arules package. However, in this tutorial I intend to explain how to use a different data set (a CSV file) to fulfil your analytics desires.

Basic definition in association rules

Items (Products)

An item is a single product in the basket; each line of the data is called a transaction, and each column in a row represents an item (see the sample below).
The most logical representation of a set of items is         I = {i1, i2, …, in}
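For illustration, here is what a small basket-format CSV might look like (hypothetical data; one transaction per line, one item per column):

bread,eggs,butter
milk,bread
eggs,ketchup,bread,butter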

Support

The support of a product or set of products is the fraction of transactions in our data set that contain that product or set of products:
              Support(X) = (number of transactions containing X) / (total number of transactions)
Rules with high support deserve more consideration than those with low support.

Confidence

Confidence is the conditional probability that a customer buying an item A will also buy an item B:
              Confidence(A => B) = Support(A ∪ B) / Support(A)

Lift

Lift is the factor by which buying product im increases the likelihood of also buying product in; we look at rules with a lift of more than one.
              Lift(im => in) = Support(im ∪ in) / (Support(im) × Support(in))
For instance, if Support(bread ∪ butter) = 0.2, Support(bread) = 0.5 and Support(butter) = 0.25, then Lift = 0.2 / (0.5 × 0.25) = 1.6.

Basket Analysis with R

As opposed to my previous post, customer data analysis in Kenya, in which I illustrated the importance of customer data analysis to supermarkets/retailers, in this post I intend to show how basket analysis can be done in R. For this analysis you can download the dataset here. The data set is in CSV format and must be imported into R as a transactions class; use the code below for the import. First, remember to download the latest version of R for this analysis to avoid syntax errors.

rm(list = ls()) # clear the memory
install.packages("arules") # install the required arules package
install.packages("arulesViz") # install the required arulesViz package
library(arules) # load the libraries
library(arulesViz)
# importing the CSV file into R as a transactions class;
# rm.duplicates removes any duplicate items within a transaction
basket <- read.transactions("D:/directory/data.csv", format = "basket", sep = ",", rm.duplicates = TRUE)

Upon importing the data set into R, check its form for comparison purposes: each row represents a transaction and each entry in a row represents a purchased item.

size(basket) # the size function displays the number of items in each imported transaction
The output gives one count per transaction, confirming that every transaction has been read in and each product recognised as an item.
LIST(basket[1:3]) # to confirm the transactions, use the LIST function to display the first three of them

summary(basket) # the summary function helps in understanding the data (density, most frequent items, transaction sizes)


# calculating support for frequent item sets
frequentItems <- eclat(basket, parameter = list(supp = 0.07, maxlen = 15))
itemFrequencyPlot(basket, topN = 20, type = "absolute", col = "green") # plot the 20 most frequent items


Let's apply the apriori function from the arules package. A low support threshold combined with a high confidence threshold helps to extract strong relationships even for items with relatively few overall co-occurrences in the data.


rules <- apriori(basket, parameter = list(supp = 0.01, conf = 0.5, maxlen = 1000))
# show the support, confidence and lift for all the rules

options(digits = 5) # display the quality measures with only 5 digits
inspect(rules[1:8]) # show the first 8 rules


To sort the rules with respect to confidence, from the highest to the lowest, set decreasing to TRUE:

rules1<-sort (rules, by="confidence",decreasing=TRUE) # 'high-confidence' rules.

To Remove Redundant Rules

The code below removes the redundancy in the rules, keeping only the non-redundant ones.

redundant<-which(colSums(is.subset(rules1,rules1))>1) # get redundant rules in vector
rulesnow<-rules1[-redundant] # remove redundant rules

To find out what customers had purchased before buying 'Ketchup':

Ketchupprior <- apriori(data = basket, parameter = list(supp = 0.01, conf = 0.08),
               appearance = list(default = "lhs", rhs = "Ketchup"),
               control = list(verbose = F)) # get the rules that lead to buying 'Ketchup'
Ketchuppriorrules <- sort(Ketchupprior, decreasing = TRUE, by = "confidence")
inspect(Ketchuppriorrules[1:5])


# Interactive Plot
plot(Ketchupprior[1:25],method="graph",interactive=TRUE,shading="confidence",main="Products before Ketchup") # feel free to expand and move around the objects in this plot


plot(Ketchupprior, measure=c("support", "lift"), shading="confidence",main="Products before Ketchup")



Thursday, 26 May 2016

Customer data analysis in Kenya

With the intense competition among retailers in Kenya, individual retailers are looking for the best ways to win customers over to their shelves. Some spend a lot of time on social media, while others hire the services of various advertising companies just to help them increase customer loyalty. Over time, most of these retailers, if not all, have introduced customer loyalty cards and given shopping vouchers in the form of points whenever one makes a purchase. Consequently, most customers have registered for loyalty cards with different supermarkets and therefore only shop at those specific stores. Payment systems have also revolutionized the shopping experience. Currently, shoppers in Kenya have various payment methods, including Lipa na M-Pesa, Airtel Money, Visa cards, MasterCard and the old-fashioned cash system. These systems contain all the information we need to know about our customers; it is big data.
Basket Analysis
In the years since I graduated, I have studied supermarket/retailer data, and I am willing to introduce this analysis to Kenyan supermarkets. I call it Basket Analysis. This service is meant to help your supermarket know its customers' purchase patterns, thereby maintaining customer loyalty and hence increasing sales. With the current payment systems and your loyalty card (bonus card), your database holds a lot of data that, when analyzed properly, can make you the market leader.
Basket analysis can reveal the following about your customers.
  • Who are your customers and where are they located?
  • What age group are they?
  • What is their preference?
  • How much do they spend per shopping and
  • How frequent do they shop?
  • What will they buy next and how much will they spend?
  • What is the best way to arrange the shelves?
  • What products can be cross promoted together?
These are but a few questions that the Basket Analysis can answer.
The application of basket analysis will greatly increase sales, hence propelling your growth. Imagine if your supermarket could predict when a customer will shop, how much she will spend and the most likely combination of products she will buy.