Saturday, 20 June 2015

Data Pre-Processing before predictive modeling with R


Every data analyst needs to pre-process data before commencing any analysis task. On my desktop there is an Excel sheet named modeling; I chose the name because of my intended purpose for this article. I am going to use the data set to do some modeling, but before that I have to prepare my data for the main task. The preparation is simple but very important for any serious data analyst who wants to gain insight into his or her data.

Transformation for a Single Predictor


Centering and scaling the predictor variables


To center a predictor variable, the average predictor value is subtracted from all the values. As a result of centering, the predictor has a mean of zero.



To scale the data, each value of the predictor variable is divided by its standard deviation. Scaling coerces the values to have a common standard deviation of one.



These manipulations are generally used to improve the numerical stability of some calculations.
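As a minimal sketch of centering and scaling in R, the example below works on a small made-up vector (the values of x are hypothetical and are not taken from the modeling data set). It does the two steps by hand and then compares the result with base R's scale() function.

```r
# Hypothetical predictor values, made up for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Center: subtract the mean from every value
x_centered <- x - mean(x)

# Scale: divide the centered values by the standard deviation
x_scaled <- x_centered / sd(x)

# scale() performs both steps in one call and returns a one-column matrix,
# so it is converted back to a plain numeric vector here
x_auto <- as.numeric(scale(x, center = TRUE, scale = TRUE))

mean(x_scaled)  # zero, up to floating-point error
sd(x_scaled)    # one
```

Both routes should give the same numbers; doing it by hand simply makes the two steps visible.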



Transformations to Resolve Skewness




Another form of pre-processing is transformation to remove skewness, that is, a lack of symmetry in the distribution. A right-skewed distribution has its hump on the left and a long tail to the right, and vice versa for a left-skewed one. A common rule of thumb for spotting skewness is to calculate the ratio of the highest value to the lowest value; if the ratio is greater than 20, the data have significant skewness.



Replacing the data with its square root, log or inverse may help in removing the skewness.
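The sketch below illustrates the ratio check and a log transform on simulated data. The vector x and the helper sample_skewness() are hypothetical names introduced only for this example; the e1071 package also provides a skewness() function that could be used instead of the hand-written helper.

```r
# Hypothetical right-skewed predictor, simulated for illustration only
set.seed(1)
x <- rlnorm(500, meanlog = 0, sdlog = 1)

# Sample skewness: third central moment divided by the variance to the power 3/2
sample_skewness <- function(v) {
  n <- length(v)
  m <- mean(v)
  s2 <- sum((v - m)^2) / (n - 1)
  sum((v - m)^3) / ((n - 1) * s2^(3 / 2))
}

# Rule-of-thumb check: a highest-to-lowest ratio above 20 suggests skewness
ratio <- max(x) / min(x)
skew_before <- sample_skewness(x)

# Log transform; this requires strictly positive values
x_log <- log(x)
skew_after <- sample_skewness(x_log)

c(ratio = ratio, before = skew_before, after = skew_after)
```

A skewness value near zero after the transform suggests the distribution is now roughly symmetric. The square root or the inverse can be tried in the same way by swapping out the log() call.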



Transformation for multiple predictors



Transformations to Resolve Outliers



Most people understand skewness, but when looking at apparent outliers in a data set one has to remember the following:

With small sample sizes, the outliers might be the result of a skewed distribution where there are not yet enough data to see the skewness. Alternatively, the outlying data may indicate a special part of the population under study that was only just starting to be sampled for the survey.



There is a lot more to this than I can spell out here, or else this would not be a blog post but a book.

Other data pre-processing steps include:

  • Principal Component Analysis
  • Dealing with Missing Values
  • Removing Predictors
  • Adding Predictors
  • Binning Predictors
  • Spatial Sign





R code for pre-processing data




I will start by conducting principal component analysis. I assume that as a data analyst you are conversant with this type of transformation, so I will not explain the whole thing in detail here. In case you are a newbie, PCA, as it is commonly known, is an analysis that finds the combinations of predictors that explain most of the variation in the data, so that you can model with a few components instead of using all the predictors.



PCA

The aim of this transformation is to determine which variables contribute most to Pregnancy.


# Import my data set into R for the analysis
data <- read.csv("C:/Users/doe/Desktop/modelling.csv", header = TRUE)

# Load all the packages you need here
library(caret)
library(corrplot)
library(e1071)
library(lattice)

# I will not explain what the packages do for now, pardon me please

# The simple code below will center, scale and perform PCA
# (prcomp expects every column of data to be numeric)
pcaObject <- prcomp(data, center = TRUE, scale. = TRUE)

# I am a hundred percent sure the data set is transformed, that simple!

# rotation stores the variable loadings, where rows correspond to predictor
# variables and columns are associated with the components:
head(pcaObject$rotation)
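To see what a prcomp fit contains without needing the modelling.csv file, here is a small sketch that uses R's built-in iris measurements as a stand-in (an assumption made purely for illustration; the names predictors, pca, percent_variance, loadings and scores are not from the original analysis). It shows how much variance each component explains, along with the loadings and the transformed values.

```r
# Built-in iris measurements stand in for the author's data (assumption)
predictors <- iris[, 1:4]

# Center, scale and run PCA in one call
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Percentage of the total variance explained by each component
percent_variance <- pca$sdev^2 / sum(pca$sdev^2) * 100

# Loadings: rows are predictors, columns are components
loadings <- pca$rotation

# Transformed values (scores): one row per observation, one column per component
scores <- pca$x

round(percent_variance, 1)
head(loadings)
head(scores)
```

The components are ordered by the variance they explain, so the first few entries of percent_variance are usually enough to judge how many components are worth keeping.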




That marks the end of the beginning for today. Visit the website for more information, or contact the Unitary Analytics (Intelligent Data Analytics) offices in Nairobi, Kenya.
