Saturday, 18 June 2016

Market Basket Analysis with R

Perhaps you have heard about R and its wide-ranging capabilities but have not yet tried market basket analysis with it. First, let me explain market basket analysis. MBA, as it is frequently abbreviated, identifies the combinations of products that frequently co-occur in transactions. For instance, people who buy bread and eggs also tend to buy butter, as many of them are planning to make an omelet.
The marketing team should target customers who buy bread and eggs with offers on butter, to encourage them to spend more on their shopping basket.
Market basket analysis, also known as association rule mining, is a well-documented area of data mining. In R, the analysis is done with the arules package and the visualization with the arulesViz package. The majority of blogs I have seen use the Groceries data set that ships with arules. In this tutorial, however, I intend to explain how to use a different data set (a CSV file) to fulfill your analytics desire.

Basic definitions in association rules

Items (Products)

An item is a single product in the basket. In the data file, each line is a transaction and each column in a row holds one item.
The set of all items is usually written as         I = {i1, i2, …, in}

Support

The support of a product or set of products is the fraction of transactions in our data set that contain that product or set of products. Itemsets with high support deserve more attention than those with low support, because a pattern seen in only a handful of transactions may be due to chance.
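As a minimal sketch of this definition, the base-R snippet below computes support by hand. The five toy transactions and the `support()` helper are made up for illustration; they are not part of arules or of the data set used in this post.

```r
# Toy transactions (hypothetical data, not the CSV used in this post)
transactions <- list(
  c("bread", "eggs", "butter"),
  c("bread", "eggs", "butter"),
  c("bread", "eggs"),
  c("bread", "milk"),
  c("milk", "butter")
)

# Fraction of transactions that contain every item in `itemset`
support <- function(itemset, transactions) {
  contains <- vapply(transactions,
                     function(t) all(itemset %in% t),
                     logical(1))
  mean(contains)
}

supp_bread      <- support("bread", transactions)            # 4 of 5 transactions
supp_bread_eggs <- support(c("bread", "eggs"), transactions) # 3 of 5 transactions
```

Adding items to an itemset can only keep its support the same or lower it, which is why the pair has lower support than bread alone.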

Confidence

Confidence is the conditional probability that a customer who buys item A will also buy item B, that is, Confidence(A => B) = Support(A ∪ B) / Support(A).

Lift

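Lift measures how much more often products A and B are bought together than we would expect if they were bought independently of each other. A lift of 1 means no association, so we look at rules with a lift of more than one.
              Lift(im => in) = Support(im ∪ in) / (Support(im) * Support(in))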
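To make confidence and lift concrete, here is a small base-R sketch for the rule {bread, eggs} => {butter}. It uses the standard definitions, Confidence(A => B) = Support(A ∪ B) / Support(A) and Lift(A => B) = Support(A ∪ B) / (Support(A) * Support(B)). The toy transactions and the `support()` helper are hypothetical.

```r
# Toy transactions (hypothetical data, not the CSV used in this post)
transactions <- list(
  c("bread", "eggs", "butter"),
  c("bread", "eggs", "butter"),
  c("bread", "eggs"),
  c("bread", "milk"),
  c("milk", "butter")
)

# Fraction of transactions that contain every item in `itemset`
support <- function(itemset) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

lhs <- c("bread", "eggs")  # antecedent A
rhs <- "butter"            # consequent B

supp_rule       <- support(c(lhs, rhs))                        # A and B together: 2 of 5
rule_confidence <- supp_rule / support(lhs)                    # (2/5) / (3/5) = 2/3
rule_lift       <- supp_rule / (support(lhs) * support(rhs))   # (2/5) / (3/5 * 3/5) = 10/9
```

The lift here is slightly above 1, so in this toy data buying bread and eggs makes butter a little more likely than it is overall.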

Basket Analysis with R

In my previous post, Customer data analysis in Kenya, I illustrated the importance of customer data analysis to supermarkets/retailers. In this post I intend to show how basket analysis can be done in R. For this analysis you can download the dataset here.  The data set is in CSV format and must be imported into R as an object of the transactions class. Use the code below for the import. Remember to install the latest version of R first, to avoid errors with this analysis.

rm(list = ls()) # clear the workspace
install.packages("arules") # install the arules package (only needed once)
install.packages("arulesViz") # install the arulesViz package (only needed once)
library(arules) # load arules
library(arulesViz) # load arulesViz
# import the csv file into R as an object of class transactions
# format = "basket": one transaction per line, items separated by sep
# rm.duplicates = TRUE removes duplicate items within a transaction
basket<-read.transactions("D:directory/data.csv", format = "basket",sep = ",",rm.duplicates=TRUE)
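If the "basket" format is unfamiliar, the following base-R sketch shows what such a file looks like and what reading it amounts to. It does not use arules. The three-line file is made up for illustration, and `unique()` only mimics the role that `rm.duplicates` plays during the real import.

```r
# Hypothetical three-line basket file, written to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("bread,eggs,butter",
             "bread,bread,milk",
             "eggs"),
           csv_path)

# One line per transaction, items separated by commas
lines   <- readLines(csv_path)
baskets <- strsplit(lines, ",", fixed = TRUE)

# Drop repeated items inside each transaction
baskets <- lapply(baskets, unique)

# Number of distinct items in each transaction
basket_sizes <- lengths(baskets)
```

Note that the rows have different lengths, which is why a basket file is not read as a regular table with `read.csv()`.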

After importing the data set into R, check its form for comparison with the source file. The data is in CSV format: each row represents a transaction and each field in a row represents an item purchased.

size(basket) # size() returns the number of items in each imported transaction
It should give the following output, with one number per transaction. Internally, arules has assigned every product an item ID.



LIST(basket[1:3]) # To confirm the import, use the LIST function to display the first three transactions





summary(basket) #summary function helps in understanding more of the data


# find frequent itemsets (support of at least 0.07, at most 15 items each)
frequentItems<-eclat(basket,parameter=list(supp =0.07,maxlen = 15))
itemFrequencyPlot(basket,topN=20,type="absolute",col="green") # plot the 20 most frequent items
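For intuition about what the frequency plot counts, here is a base-R sketch that tallies single items by hand. The toy transactions and the 0.5 threshold are hypothetical (the code above uses 0.07 on the real data), and the sketch assumes no item is repeated within a transaction.

```r
# Toy transactions (hypothetical data, not the CSV used in this post)
transactions <- list(
  c("bread", "eggs", "butter"),
  c("bread", "eggs", "butter"),
  c("bread", "eggs"),
  c("bread", "milk"),
  c("milk", "butter")
)

# Absolute frequency: in how many transactions each item appears
item_counts <- sort(table(unlist(transactions)), decreasing = TRUE)

# Relative frequency, i.e. the support of each single item
item_support <- item_counts / length(transactions)

# Items whose support reaches the (hypothetical) threshold
frequent <- names(item_support)[item_support >= 0.5]

# A plain bar chart of the counts, similar in spirit to type = "absolute":
# barplot(item_counts, col = "green")
```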


Let's apply the apriori function from the arules package. A low support threshold combined with a high confidence threshold helps to extract strong relationships even between items that co-occur in relatively few transactions.


rules<-apriori(basket,parameter=list(supp=0.01,conf=0.5,maxlen=1000))
# show the support, confidence and lift of the mined rules

options (digits=5) # print numbers with 5 significant digits
inspect (rules[1:8]) # show the first 8 rules
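To see what the support and confidence thresholds do, the base-R sketch below enumerates every one-item-to-one-item rule on toy data and keeps those that pass both thresholds. This is plain brute force, not the Apriori algorithm, and it ignores longer rules. The data, the `support()` helper and the thresholds (0.3 and 0.7) are hypothetical.

```r
# Toy transactions (hypothetical data, not the CSV used in this post)
transactions <- list(
  c("bread", "eggs", "butter"),
  c("bread", "eggs", "butter"),
  c("bread", "eggs"),
  c("bread", "milk"),
  c("milk", "butter")
)

# Fraction of transactions that contain every item in `itemset`
support <- function(itemset) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# All ordered pairs of distinct items: candidate rules lhs => rhs
items <- sort(unique(unlist(transactions)))
cand  <- expand.grid(lhs = items, rhs = items, stringsAsFactors = FALSE)
cand  <- cand[cand$lhs != cand$rhs, ]

# Quality measures for each candidate rule
cand$support    <- mapply(function(a, b) support(c(a, b)),
                          cand$lhs, cand$rhs, USE.NAMES = FALSE)
cand$confidence <- cand$support /
  vapply(cand$lhs, support, numeric(1), USE.NAMES = FALSE)
cand$lift       <- cand$confidence /
  vapply(cand$rhs, support, numeric(1), USE.NAMES = FALSE)

# Keep rules that pass both thresholds, highest confidence first
toy_rules <- cand[cand$support >= 0.3 & cand$confidence >= 0.7, ]
toy_rules <- toy_rules[order(toy_rules$confidence, decreasing = TRUE), ]
toy_rules
```

Only eggs => bread and bread => eggs survive here. Raising the confidence threshold above 0.75 would drop the second of them.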


Next, sort the rules by confidence. With decreasing set to TRUE, the code orders the rules from the highest confidence to the lowest.

rules1<-sort (rules, by="confidence",decreasing=TRUE) # 'high-confidence' rules.

To Remove Redundant Rules

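The code below removes redundancy from the rules. A rule is treated as redundant when another rule in the set is a subset of it, and only the non-redundant rules are kept.

redundant<-colSums(is.subset(rules1,rules1))>1 # TRUE for each rule that contains another rule in the set
rulesnow<-rules1[!redundant] # keep the non-redundant rules (also safe when nothing is redundant)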

To find out what customers had purchased before buying 'Ketchup'

Ketchupprior<-apriori(data=basket, parameter=list(supp=0.01,conf = 0.08),
               appearance = list(default="lhs",rhs="Ketchup"),
               control=list(verbose=FALSE)) # get rules that lead to buying 'Ketchup'
Ketchuppriorrules<-sort(Ketchupprior, decreasing=TRUE,by="confidence")
inspect(Ketchuppriorrules[1:5]) # show the 5 rules with the highest confidence


# Interactive Plot
plot(Ketchupprior[1:25],method="graph",interactive=TRUE,shading="confidence",main="Products before Ketchup") # feel free to expand and move around the objects in this plot


plot(Ketchupprior, measure=c("support", "lift"), shading="confidence",main="Products before Ketchup")


