Monday, 28 September 2015

Twitter Text mining/scraping with R

As a data analyst, one should be able to analyse current trends in social media. This is useful for decision making and for gaining insight into people's views. Currently there is a teachers' strike in Kenya that has lasted four weeks. Schools are closed and the strike is still on as of the time of updating this blog. I won't go into the politics of the strike, but I can use my skills to determine people's views and the correlation between the strike and other topics. Take, for example, the president's name, the opposition leaders, the teachers and the children: all of them appear in the analysis. Check out the ggplot output below to gain more insight.

The data set is just tweets about the teachers' strike. The bar graph gives more insight: users mention different words in their tweets, and the frequency of those words is what I have used to plot the graph. Also check out the word cloud below.
From the word cloud it is clear that "strike" was the dominant word in the tweets, followed by words like "teachers", the impeachment threat from the opposition, UKenyatta (the current president, who is facing impeachment threats), Raila (the opposition leader), Ruto (the deputy president), and Mutahi Ngunyi (a well-known political analyst and staunch supporter of President Kenyatta). There are many more words in the tweets, but the word cloud only shows words above the minimum frequency I had set.
For more analysis of this kind, follow me on Twitter @oscar_doe.

Twitter text mining is a very well documented area of data mining, so I will explain only a little of it here to keep this article self-contained. For Twitter text mining I will be using the twitteR and tm packages in R. The main function is searchTwitter(), because it can mine tweets with specific filters such as location, dates, language and many more.

# Load packages
library(tm)
library(twitteR)
library(wordcloud)
# Authorise the Twitter API (copy these values from your Twitter app)
apiKey <- "refer from the twitter app"
apiSecret <- "refer from the twitter app"
access_token <- "refer from the twitter app"
access_token_secret <- "refer from the twitter app"
setup_twitter_oauth(apiKey, apiSecret, access_token, access_token_secret)
# Scrape Twitter using the searchTwitter function
# This picks tweets containing "Brexit"
# lang = NULL accommodates all languages
# since = "2016-06-23" restricts results to tweets from 23 June 2016 onwards
# geocode = NULL means the search is worldwide
tweets <- searchTwitter("Brexit", n = 1000, lang = NULL, since = "2016-06-23",
                        until = NULL, locale = NULL, geocode = NULL,
                        sinceID = NULL, maxID = NULL)
# Extract the text of each tweet
tweets_text <- sapply(tweets, function(x) x$getText())
Now we have the tweets, but they contain things we do not need: emojis, stop words such as "the" and "why", and so on. So we now proceed to remove these unwanted elements from our data.
# Strip emojis and other non-ASCII characters
# (sub = "" drops unconvertible characters instead of returning NA)
tweets1 <- iconv(tweets_text, 'UTF-8', 'ASCII', sub = "")
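Tweets also carry URLs, @mentions and retweet markers that iconv does not touch. Below is a minimal base-R sketch of that extra cleanup; the clean_tweet helper and its regular expressions are my own choices, not part of the twitteR or tm packages:

```r
# Hypothetical helper: strip URLs, @mentions and leading "RT" markers
# before building the corpus
clean_tweet <- function(x) {
  x <- gsub("http\\S+", "", x)   # remove links
  x <- gsub("@\\w+", "", x)      # remove @mentions
  x <- gsub("^RT\\b", "", x)     # remove a leading retweet marker
  trimws(x)                      # trim leftover whitespace
}

tweets1 <- clean_tweet(tweets1)
```

You could fold this into the pipeline right after the iconv step, so the term-document matrix never sees links or user handles.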
Build a corpus and save the term frequencies as a data frame:
tweets_corpus <- Corpus(VectorSource(tweets1))
tdm <- TermDocumentMatrix(
  tweets_corpus,
  control = list(
    removePunctuation = TRUE,
    stopwords = c("England", "Wales", "Scotland", stopwords("english")),
    removeNumbers = TRUE,
    tolower = TRUE
  )
)
m <- as.matrix(tdm)
# Get word counts in decreasing order
word_freqs <- sort(rowSums(m), decreasing = TRUE)
# Create a data frame with words and their frequencies
dm <- data.frame(word = names(word_freqs), freq = word_freqs)
wordcloud(dm$word, dm$freq, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
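The bar graph mentioned earlier can be produced from the same frequency table. Here is a quick sketch with base R graphics (ggplot2 would work equally well), assuming dm is the data frame built above; the colour and the choice of ten words are arbitrary:

```r
# Plot the 10 most frequent words as a bar graph
top_words <- head(dm, 10)
barplot(top_words$freq,
        names.arg = top_words$word,
        las = 2,                 # rotate the word labels
        col = "steelblue",
        main = "Top 10 words in the tweets")
```

Because dm is already sorted by frequency, head() gives the most common words directly.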
Finally, return the number of tweets scraped:

length(tweets)
