Every data analyst needs to pre-process data before starting any analysis. On my desktop there is an Excel sheet named modeling; I chose the name to match the purpose of this article. I am going to use the data set for some modeling, but first I have to prepare the data for the main task. The preparation is simple, but it matters to any serious data analyst who wants to gain insight into his or her data.
Transformation for a Single Predictor
Centering and scaling the predictor variables
To center a predictor variable, the average predictor value is subtracted from all the values. As a result of centering, the predictor has a zero mean.
To scale the data, each value of the predictor variable is divided by its standard deviation. Scaling coerces the values to have a common standard deviation of one.
These manipulations are generally used to improve the numerical stability of some calculations.
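As a minimal sketch of centering and scaling, the snippet below works on a made-up vector `x` (the values are hypothetical, not taken from the modeling sheet). It does the two steps by hand and then compares them with base R's `scale()`.

```r
# Hypothetical toy predictor; these values are made up for illustration.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Center: subtract the mean so the result has mean zero.
x_centered <- x - mean(x)

# Scale: divide by the sample standard deviation so the result has sd one.
x_scaled <- x_centered / sd(x)

# Base R's scale() does both steps in one call and returns a one-column matrix.
x_scaled2 <- as.numeric(scale(x, center = TRUE, scale = TRUE))

mean(x_scaled)                 # zero, up to floating-point rounding
sd(x_scaled)                   # one
all.equal(x_scaled, x_scaled2) # the manual steps and scale() agree
```

On a data frame, `scale()` applies the same two steps to every numeric column at once.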
Transformations to Resolve Skewness
Another form of preprocessing is a transformation to remove skewness, that is, a lack of symmetry in the distribution. A right-skewed distribution has its hump on the left and a long tail to the right, and vice versa for a left-skewed one. A common rule of thumb for spotting skewness is to calculate the ratio of the highest value to the lowest value; if the ratio is greater than 20, the data have significant skewness.
Replacing the data with its square root, log or inverse may help remove the skewness.
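The rule of thumb and the transformations above can be sketched as follows. The data are simulated log-normal draws (an assumption, chosen only because they are right-skewed), and `skew()` is a small hypothetical helper written for this example; the e1071 package loaded later in this article offers its own `skewness()` function if you prefer a packaged version.

```r
set.seed(1)
# Hypothetical right-skewed predictor (log-normal draws), made up for illustration.
x <- rlnorm(500, meanlog = 0, sdlog = 1)

# Rule of thumb from the text: ratio of the highest to the lowest value.
ratio <- max(x) / min(x)
ratio > 20

# Simple moment-based sample skewness (hypothetical helper, not from a package).
skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew(x)        # clearly positive for right-skewed data
skew(sqrt(x))  # the square root is a milder correction
skew(log(x))   # much closer to zero after the log transform
```

The log and square root only work on positive values, so a predictor with zeros or negatives would need shifting or a different transformation first.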
Transformations for Multiple Predictors
Transformations to Resolve Outliers
Most people understand skewness, but when a data set contains apparent outliers one has to remember the following:
With small sample sizes, apparent outliers might be the result of a skewed distribution where there are not yet enough data to see the skewness. Or they may indicate a special part of the population under study that was only just starting to be sampled for the survey.
There is a lot more to this than I can spell out here, or else this would not be a blog post but a book.
Other data pre-processing steps include:
- Principal Component Analysis
- Dealing with Missing Values
- Removing Predictors
- Adding Predictors
- Binning Predictors
- Spatial Sign
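Of the items above, the spatial sign is the one aimed at outliers: each observation is divided by its distance from the center, so extreme rows are pulled onto the same unit circle as the rest. Here is a rough base R sketch on a made-up matrix `X` (hypothetical values). The caret package also provides a `spatialSign()` function for this.

```r
# Hypothetical two-predictor data with one extreme row; values are made up.
X <- cbind(a = c(1, 2, 3, 4, 50),
           b = c(2, 1, 4, 3, 60))

# Center and scale each column first, so no predictor dominates the norm.
Z <- scale(X)

# Divide every row by its Euclidean length, projecting it onto the unit circle.
row_norm <- sqrt(rowSums(Z^2))
S <- Z / row_norm

rowSums(S^2)  # every row now has squared length one
```

Because the rows are rescaled jointly, the predictors should be centered and scaled before this step, as done with `scale()` here.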
R code for pre-processing data
I will start by conducting principal component analysis. I assume that as a data analyst you are conversant with this type of transformation, so I will not explain it in detail here. In case you are a newbie: PCA, as it is commonly known, finds a small number of combinations of the predictors (the components) that capture most of the variation in the data, so you can work with those instead of all the original predictors.
PCA
The aim of this transformation is to determine those variables that greatly contribute to Pregnancy.
# import my data set into R for the analysis
data <- read.csv("C:/Users/doe/Desktop/modelling.csv", header = TRUE)
# load all the packages you need here
library(caret)
library(corrplot)
library(e1071)
library(lattice)
# I will not explain what the packages do for now, pardon me please
# the simple code below will center, scale and perform PCA
# (prcomp() needs numeric columns only)
pcaObject <- prcomp(data, center = TRUE, scale. = TRUE)
# I am a hundred percent sure the data set is transformed, that simple, aha!
# rotation stores the variable loadings, where rows correspond to predictor
# variables and columns are associated with the components:
pcaObject$rotation
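If you do not have the file at hand, the same steps can be tried on a data set that ships with R. The sketch below uses `iris` purely as a stand-in (it is not the data used in this article) and also shows how to read off how much variance each component explains.

```r
# iris ships with base R and stands in for the article's file, which is not available here.
df <- iris

# prcomp() needs numeric input, so keep numeric columns and drop incomplete rows.
num <- na.omit(df[sapply(df, is.numeric)])

# Center, scale and run PCA in one call.
pca <- prcomp(num, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each component, and the running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(var_explained)

# Loadings: rows are the original predictors, columns are the components.
pca$rotation

# Scores: the observations expressed in the new component coordinates.
head(pca$x)
```

A common next step is to keep only as many components as are needed to reach a chosen share of the cumulative variance.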
That marks the end of the beginning for today. Visit the website for more information, or contact the Unitary Analytics (Intelligent Data Analytics) offices in Nairobi, Kenya.