Monday, 28 September 2015

Twitter Text mining/scraping with R

As a data analyst one should be able to analyse current trends in social media. This is useful for decision making and for insight into people's views. There is currently a teachers' strike in Kenya that has lasted four weeks; schools are closed and the strike is still on as at the time of updating this blog. I won't go into the politics of the strike, but I can use my skills to determine people's views and the correlation between the strike and other topics. Take for example the president's name, the opposition leaders, the teachers and the children: all of them feature in the analysis. Check out the ggplot output below to gain more insight.

The data set is simply tweets about the teachers' strike. The bar graph gives more insight: users mention different words in their tweets, and the frequency of those words is what I have used to plot the graph. Also check out the word cloud below.
From the word cloud it is clear that "strike" was the dominant word in the tweets, followed by words like "teachers", the impeachment threat from the opposition, UKenyatta (the current president, who is facing impeachment threats), Raila (the opposition leader), Ruto (the deputy president) and Mutahi Ngunyi (a well-known political analyst and staunch supporter of President Kenyatta). There are many more words in the tweets, but the word cloud only shows words above the minimum frequency I set.
For more of this kind of analysis, follow me on Twitter: @oscar_doe.

Twitter text mining is a very well documented area of data mining, so I will explain only a little of it here to keep this article self-contained. For Twitter text mining I will be using the twitteR and tm packages in R. The main function will be searchTwitter() because of its ability to mine data with specific filters such as location, dates, language and more.

# Load packages
library(tm)
library(twitteR)
library(wordcloud)
# Authorize the Twitter API
apiKey <- "<API key from the Twitter app>"
apiSecret <- "<API secret from the Twitter app>"
access_token <- "<access token from the Twitter app>"
access_token_secret <- "<access token secret from the Twitter app>"
setup_twitter_oauth(apiKey, apiSecret, access_token, access_token_secret)
# Scrape Twitter using the searchTwitter function
# This picks tweets containing "Brexit"
# lang is NULL to accommodate all languages
# since restricts results to tweets from 23 June 2016 onwards
# no geocode is given, so the search is effectively worldwide
tweets <- searchTwitter("Brexit", n = 1000, lang = NULL, since = "2016-06-23")
tweets_text <- sapply(tweets, function(x) x$getText())
Now we have the tweets, but they contain things we do not need: emojis, stop words such as "the" and "why", and more. So we now proceed to remove these unwanted elements from our data.
# Remove emojis and other non-ASCII characters
# (without sub = "", any string containing them would become NA)
tweets1 <- iconv(tweets_text, from = "UTF-8", to = "ASCII", sub = "")
Build a corpus from the cleaned tweets and construct a term-document matrix:
tweets_corpus <- Corpus(VectorSource(tweets1))
tdm <- TermDocumentMatrix(
  tweets_corpus,
  control = list(
    removePunctuation = TRUE,
    # custom stopwords are lower-cased so they still match after tolower runs
    stopwords = c("england", "wales", "scotland", stopwords("english")),
    removeNumbers = TRUE,
    tolower = TRUE
  )
)
m <- as.matrix(tdm)
# Get word counts in decreasing order
word_freqs <- sort(rowSums(m), decreasing = TRUE)
# Create a data frame with words and their frequencies
dm <- data.frame(word = names(word_freqs), freq = word_freqs)
wordcloud(dm$word, dm$freq, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
Returning the number of tweets scraped:

length(tweets)

Friday, 11 September 2015

Market Research
Market research is the process of gathering, analyzing and interpreting information about a specific market, or about a product or service to be offered for sale in that market. It also explores past, present and potential customers for the product or service, and researches the characteristics, spending habits, location and needs of your business's market, the industry and the competitors.
We believe that accurate and thorough information is the foundation of all successful business ventures, because it provides a wealth of knowledge about prospective and existing customers, the competition, and the industry.
At Unitary Analytics we treat market research as a key factor in staying ahead of potential competitors. Our experienced experts will conduct market research for you, providing the information needed to identify and analyze market need, market size and competition. This allows prospective and current business owners to determine the feasibility of a business before committing substantial resources to the venture, and provides relevant data to help solve the challenges the business will face.


Monday, 17 August 2015

Presenting a forecasting model using R shiny

Presenting analytical results is always a headache for any data analyst. I have been working on a business forecasting model this past month; I am halfway through, but my client keeps asking about progress, and showing him the long code would only confuse him. Of course I am a lover of RStudio, so I certainly chose it over Excel. It is easy to present your workings in Excel to any layman, but R output will be useless to your client unless you find a good way to present it.

The power of the R shiny package.
Developed by Joe Cheng and the RStudio team, the shiny package is the solution to our problem. I did some research during my project and I am now even more in love with its capabilities.
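As a minimal sketch of what such a presentation can look like (a hypothetical single-file app with made-up data and names, not the client project), a shiny app needs only a ui and a server:

```r
# Hypothetical example: forecast a series with a simple linear trend
# and let the client pick the forecast horizon with a slider.
library(shiny)

ui <- fluidPage(
  titlePanel("Forecast demo"),
  sliderInput("h", "Months ahead:", min = 1, max = 12, value = 6),
  plotOutput("fcPlot")
)

server <- function(input, output) {
  t <- 1:24
  sales <- 100 + 2 * t + rnorm(24, sd = 5)   # made-up monthly sales
  fit <- lm(sales ~ t)
  output$fcPlot <- renderPlot({
    newt <- data.frame(t = 24 + seq_len(input$h))
    plot(c(t, newt$t), c(sales, predict(fit, newdata = newt)),
         type = "l", xlab = "Month", ylab = "Sales")
    abline(v = 24, lty = 2)   # forecast starts here
  })
}

app <- shinyApp(ui, server)   # launch with runApp(app)
```

The client only ever sees the slider and the plot; all the modeling code stays hidden in the server function.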


That is part of my work; I am not allowed to show the full project, but I will display more of it in my next post.

Check the ui.R part.
Follow this blog for updates; I will share the web app source code as well as host it online.

Saturday, 20 June 2015

Data Pre-Processing before predictive modeling with R


Every data analyst needs to pre-process data before commencing any analysis task. On my desktop there is an Excel sheet named "modelling"; I chose the name because of my intended purpose for this article. I am going to use this data set to do some modeling, but prior to that I have to prepare the data for the main task. The preparation is simple but very important for any serious data analyst who wants to gain insight into his or her data.

Transformation for a Single Predictor


Centering and scaling the predictor variables


To center a predictor variable, the average predictor value is subtracted from all the values. As a result of centering, the predictor has a mean of zero.



To scale the data, each value of the predictor variable is divided by its standard deviation. Scaling coerces the values to have a common standard deviation of one.



These manipulations are generally used to improve the numerical stability of some calculations.
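As a quick illustration (using the built-in mtcars data rather than my modelling sheet), base R's scale() does both operations at once:

```r
# Center and scale every predictor in mtcars:
# subtract each column's mean, then divide by its standard deviation.
centered_scaled <- scale(mtcars, center = TRUE, scale = TRUE)

# After the transformation each column has mean 0 and sd 1.
round(colMeans(centered_scaled), 10)   # all (near) zero
apply(centered_scaled, 2, sd)          # all one
```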



Transformations to Resolve Skewness




Another form of pre-processing is transformation to remove skewness (lack of symmetry). A right-skewed distribution has its hump on the left and a long tail to the right, and vice versa. A common rule of thumb for deciding whether a data set is skewed is to compute the ratio of the highest to the lowest value; if the ratio is greater than 20, there is significant skewness.



Replacing the data with its square root, logarithm or inverse may help remove the skewness.
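As an illustration with simulated right-skewed data (made-up numbers, not my modelling sheet), here is the rule-of-thumb check and a log transform, with a small moment-based skewness helper:

```r
set.seed(42)
x <- exp(rnorm(500, mean = 2, sd = 1))   # lognormal: strongly right-skewed

# Rule-of-thumb check: ratio of highest to lowest value (> 20 suggests skew).
max(x) / min(x)

# Moment-based sample skewness (e1071::skewness implements the same idea).
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
skew(x)        # large and positive: right-skewed
skew(log(x))   # near zero: the log transform removed the skewness
```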



Transformations for Multiple Predictors



Transformations to Resolve Outliers



Most people understand skewness, but when judging outliers one has to remember the following:

With small sample sizes, apparent outliers might be the result of a skewed distribution where there are not yet enough data to reveal the skewness. Or the data may represent a special part of the population under study that was only just starting to be sampled.



There is a lot more to this that I cannot spell out here, or else this would not be a blog post but a book.

Other data processes include

  • Principal Component Analysis
  • Dealing with Missing Values

  • Removing Predictors

  • Adding Predictors

  • Binning Predictors

  • Spatial Sign





R code for pre-processing data




I will start by conducting principal component analysis. I assume that as a data analyst you are conversant with this type of transformation, so I will not explain the whole thing in detail here. In case you are a newbie: PCA, as it is commonly known, is an analysis used to find a small set of components that explain most of the variation in the predictors, as opposed to modeling with all of the original variables.



PCA

The aim of this transformation is to determine those variables that contribute most to the pregnancy outcome in my data set


# Import my data set into R for the analysis

data <- read.csv("C:/Users/doe/Desktop/modelling.csv", header = TRUE)

#load all the package you need here

library(caret)

library(corrplot)

library(e1071)

library(lattice)

# I will not explain what the packages do for now, pardon me please
# The simple code below will center, scale and perform PCA
# (note that prcomp's scaling argument is spelled `scale.`)
pcaObject <- prcomp(data, center = TRUE, scale. = TRUE)

# The data set is now transformed, that simple!

# rotation stores the variable loadings: rows correspond to predictor
# variables and columns are associated with the components:
pcaObject$rotation
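Since I cannot share the modelling sheet, here is the same call on the built-in mtcars data to show what prcomp returns:

```r
# PCA on mtcars with centering and scaling (the argument is `scale.`).
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Loadings: rows are predictors, columns are components.
head(pca$rotation[, 1:2])

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:3], 3)
```

The first component alone typically captures the bulk of the variation in a correlated data set like this, which is exactly why PCA is useful for reducing the number of predictors.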




That marks the end for today. Visit the website for more information, or contact the Unitary Analytics (Intelligent Data Analytics) offices in Nairobi, Kenya.

Predictive Modeling in R


Predictive modeling, or predictive analytics, covers a variety of statistical techniques from modeling, machine learning and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events. In business, predictive models exploit patterns found in historical and transactional data to identify risks and opportunities. Models capture relationships among many factors, allowing assessment of the risk or potential associated with a particular set of conditions and guiding decision making for candidate transactions. The defining functional effect of these approaches is that predictive analytics provides a predictive score for each individual (customer, employee, healthcare patient, product, vehicle, component, machine, or other organizational unit) in order to inform or influence organizational processes that span large numbers of individuals, such as marketing, credit risk assessment, fraud detection, manufacturing, healthcare, and government operations including law enforcement.

Business Applications of Predictive Analytics
Organizations of all sizes apply predictive analytics to make operational decisions, both online and offline, across marketing, sales and beyond. Which business application of predictive analytics is best for you is a key question, and depends on which kind of decision you want to drive and how predictive scores will best serve decision making inside your firm. At Unitary Analytics we apply predictive models to do the following:

Credit Scoring
Credit scoring is used throughout financial services. Scoring models process a customer's credit history, loan application, customer data and so on, in order to rank loan applicants by their likelihood of making future credit payments on time.
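A toy sketch of such a scoring model in R, using logistic regression on simulated applicants (all variable names and numbers here are made up for illustration; real scorecards are far more involved):

```r
set.seed(1)
n <- 200
applicants <- data.frame(
  income    = rnorm(n, mean = 50000, sd = 15000),  # annual income
  late_pays = rpois(n, lambda = 1)                 # past late payments
)
# Simulate repayment: higher income and fewer late payments help.
p <- plogis(-1 + applicants$income / 50000 - 0.8 * applicants$late_pays)
applicants$repaid <- rbinom(n, 1, p)

# Logistic regression scoring model.
fit <- glm(repaid ~ income + late_pays, data = applicants, family = binomial)

# Rank applicants by predicted probability of paying on time.
applicants$score <- predict(fit, type = "response")
head(applicants[order(-applicants$score), ])
```

Each applicant ends up with a score between 0 and 1, and the lender can set a cut-off below which applications are declined or referred for review.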

Client Predictions Drive Operational Decisions
Predictive scores are the golden eggs produced by predictive analytics: one score for each customer or prospect. Each customer's score, in turn, informs what action to take with that customer. Business intelligence simply doesn't get more actionable than this kind of decision automation. Predictive analytics is applied in numerous ways to help organizations overcome a plethora of challenges. The core difference between one mode of use and another is in what is being predicted: predicting customer response, click, or churn are very different things, and each delivers business value in a different way. This is just an overview of what we will be discussing in the main article; after all, you need the theory first before we embark on real R programming.



Tuesday, 9 June 2015

Welcome to Data Analytics with R

If you want to derive meaning out of your data then you will almost certainly wind up with R or Python. Today we will be discussing R.