Variables Used to Build a Predictive Model in R


In this section we will learn the basic commands used in predictive modeling.

From this section onwards, we’ll dive deep into various stages of predictive modeling. Hence, make sure you understand every aspect of this section. In case you find anything difficult to understand, ask me in the comments section below.

Data exploration is a crucial stage of predictive modeling. This stage forms a concrete foundation for data manipulation (the very next stage). Let’s understand it in R.

In this tutorial, I’ve taken the data set from “R library”. Before we start, you must get familiar with these terms:

Response Variable (or Dependent Variable): In a data set, the response variable (y) is the one whose value we want to predict.

Predictor Variable (or Independent Variable): In a data set, predictor variables (X) are the ones used to predict the response variable.

 

Train Data: The predictive model is always built on the train data set. An intuitive way to identify the train data is that it always has the ‘response variable’ included.

Test Data: Once the model is built, its accuracy is ‘tested’ on the test data. This data usually contains fewer observations than the train data set. In data sets like the one used here, it also does not include the ‘response variable’.
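To make the response/predictor distinction concrete, here is a minimal sketch. It assumes the built-in `mtcars` data set as a stand-in (it is not the data used in this tutorial) and treats `mpg` as the response purely for illustration.

```r
# mtcars ships with base R; it only stands in for a real data set (assumption).
data(mtcars)

# Response variable (y): the column we want to predict.
y <- mtcars$mpg

# Predictor variables (X): every column except the response.
X <- mtcars[, setdiff(names(mtcars), "mpg")]

length(y)   # number of observations
dim(X)      # observations x number of predictors
```

A train file would hold both `y` and `X`; a test file of the kind described above would hold only the `X` columns.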

Right now, you should download the data set. Take a good look at train and test data. Cross check the information shared above and then proceed.

Let’s now begin with importing and exploring data.

First, check your working directory and set it to the folder that contains the data files.

#get the path of the current working directory by using this command

> getwd()

#set the working directory; path is the folder location, given as a string
> setwd(path)

Alternatively, you can set the directory from the menu:

Go to [File] -> [Change directory]. This opens a small pop-up window where you can choose the project directory.
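The commands above can be tried safely with the sketch below. It uses `tempdir()` only as a stand-in for your own project folder (an assumption for the example) and restores the original directory at the end.

```r
# Remember the current working directory so it can be restored later.
old_dir <- getwd()

# tempdir() is only a placeholder for your own project folder (assumption).
setwd(tempdir())
getwd()

# list.files() shows which files R can see in the working directory.
list.files()

# Go back to where we started.
setwd(old_dir)
```

Running `list.files()` right after `setwd()` is a quick way to confirm that the .csv files are where R expects them.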

If you are a beginner, I’d advise you to keep the train and test files in your working directory to avoid unnecessary directory troubles. Once the directory is set, we can easily import the .csv files using the commands below.

#Load data sets

If the train and test “.csv” files are already in our working directory, we can load them directly into R using these commands:

> train <- read.csv("filename.csv")

#train is the variable where we save the training data in R; replace "filename.csv" with the name of your training file

> test <- read.csv("filename.csv")

#test is the variable where we save the testing data in R; replace "filename.csv" with the name of your testing file
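Here is a self-contained sketch of the same idea. Since the real files are not available here, it first writes a tiny made-up CSV to a temporary folder and then reads it back; the file name `example_train.csv` and its columns are hypothetical.

```r
# Create a tiny CSV file so the example does not depend on any download.
csv_path <- file.path(tempdir(), "example_train.csv")   # hypothetical file name
example <- data.frame(id = 1:3, price = c(9.5, 12.0, 7.25))
write.csv(example, csv_path, row.names = FALSE)

# file.exists() is a quick check before reading.
file.exists(csv_path)

# Read the file back into a data frame, exactly as with a real train file.
example_train <- read.csv(csv_path)
head(example_train)
```

If `file.exists()` returns FALSE for your own file, the working directory or the file name is usually the culprit.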

If we don’t have separate training and testing files, we can create them from the original data like this:

#Suppose the data has 392 observations (rows) in total

#Take the first 300 observations for the training set

> train <- pidlm[1:300, ]

#The remaining observations go into the testing set

> test <- pidlm[301:392, ]

#pidlm is the data frame that holds the data read from the original data sheet.

We can check the dimensions (number of rows and columns) of each data set:
> dim(train)

> dim(test)
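The split and the dimension check can be sketched without hard-coding row numbers. This example assumes the built-in `mtcars` data as a stand-in and a 75/25 ratio chosen only for illustration. It also shows a random split with `sample()`, which is a different technique from the sequential split above and is often preferred when the rows are sorted in some meaningful order.

```r
# mtcars stands in for the original data (assumption).
full <- mtcars
n <- nrow(full)

# Sequential split: first 75% of rows for training, the rest for testing.
n_train <- floor(0.75 * n)            # the 75% ratio is an arbitrary choice
train_seq <- full[1:n_train, ]
test_seq  <- full[(n_train + 1):n, ]

# Random split: pick the training rows at random instead.
set.seed(42)                          # makes the split reproducible
train_idx <- sample(n, n_train)       # row numbers drawn without replacement
train_rnd <- full[train_idx, ]
test_rnd  <- full[-train_idx, ]

# Check rows and columns of each piece.
dim(train_seq); dim(test_seq)
dim(train_rnd); dim(test_rnd)
```

In both cases the train and test row counts should add up to the row count of the original data.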

Let’s dig deeper into the train data set now.

#check the variables and their types in train
> str(train)

It will give something like this:

'data.frame': 8523 obs. of 12 variables:
$ Item_Identifier : Factor w/ 1559 levels "DRA12","DRA24",..: 157 9 663 1122 1298 759 697 739 441 991 ...
$ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
$ Item_Fat_Content : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
$ Item_Visibility : num 0.016 0.0193 0.0168 0 0 ...
$ Item_Type : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
$ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
$ Outlet_Identifier : Factor w/ 10 levels "OUT010","OUT013",..: 10 4 10 1 2 4 2 6 8 3 ...
$ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
$ Outlet_Size : Factor w/ 4 levels "","High","Medium",..: 3 3 3 1 2 3 2 3 1 1 ...
$ Outlet_Location_Type : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
$ Outlet_Type : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
$ Item_Outlet_Sales : num 3735 443 2097 732 995 ...
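One thing to keep in mind when reading `str()` output: from R 4.0.0 onwards, `read.csv()` no longer converts text columns to factors by default, so you may see `chr` where the output above shows `Factor`. The sketch below uses a small made-up data frame (the values are for illustration only) to show both forms.

```r
# A tiny made-up data frame; stringsAsFactors = FALSE keeps text as character.
toy <- data.frame(
  item   = c("DRA12", "DRA24", "DRA12"),
  weight = c(9.3, 5.92, 17.5),
  stringsAsFactors = FALSE
)
str(toy)                       # item is reported as chr

# Convert the text column to a factor explicitly.
toy$item <- factor(toy$item)
str(toy)                       # item is now reported as a Factor
levels(toy$item)               # the distinct categories
```

To get factors directly when loading a file, `read.csv()` accepts `stringsAsFactors = TRUE`.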

Let’s do some quick data exploration.

To begin with, I’ll check whether this data has missing values. This can be done by using:

> table(is.na(train))
#This counts the missing (TRUE) and non-missing (FALSE) cells in the whole data set

> colSums(is.na(train))
#This gives the number of missing values in each column

> summary(train)

This command gives a summary of each variable present in our data set.
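The three commands can be tried on a small made-up data frame (the values below are invented for illustration). The last lines cover a related caveat: an empty string such as the "" level of Outlet_Size in the `str()` output is not counted by `is.na()`, so blanks need a separate check.

```r
# A tiny made-up data frame with some missing values.
toy_na <- data.frame(
  weight = c(9.3, NA, 17.5, 19.2),
  size   = c("Medium", "High", NA, NA)
)

table(is.na(toy_na))       # FALSE = cells with a value, TRUE = missing cells
colSums(is.na(toy_na))     # number of missing values per column
summary(toy_na)            # numeric columns also report their NA count

# Empty strings are not NA, so count them separately.
blank <- c("", "High", "Medium")
sum(is.na(blank))          # is.na() does not flag the empty string
sum(blank == "")           # an explicit comparison does
```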

 
