Hello friends. In my previous articles I wrote about the basics of Linear Regression. In this article I will show you one of its several uses: how to do classification using Linear Regression in R.
Pre-requisites for this tutorial:
- Introduction to Linear Regression model (optional).
- Linear Regression Analysis using R (you must know how to run a regression in R).
Linear regression is one of the most widely used techniques in predictive analytics. In this article we will use it for classification.
First of all open R or RStudio and set your working directory.
Step 1: First install the "mlbench" package.
> install.packages("mlbench") ↵ (press enter)
After you press enter, R will ask for a CRAN mirror (just select the first entry, 0-Cloud).
Step 2: Load the installed package using the library command.
> library(mlbench) ↵ (press enter)
Step 3: Load your desired data. Here I am loading PimaIndiansDiabetes2, a data set that ships with the package.
The data describes women of Pima Indian heritage who were tested for diabetes. After the incomplete records are removed, 392 women remain.
Our goal in this article is to classify each woman's diabetes diagnosis as positive or negative.
Let's look at the data.
Here is the structure of the data:
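A minimal sketch for loading the data set and printing its structure yourself. It assumes the mlbench package from Step 1 is installed; the `requireNamespace()` guard is only there so the snippet does nothing if it is not.

```r
# Assumes mlbench is installed (see Step 1); skipped silently otherwise
if (requireNamespace("mlbench", quietly = TRUE)) {
  # Load the data set from the package into the workspace
  data("PimaIndiansDiabetes2", package = "mlbench")

  # Show each column with its type and first few values
  str(PimaIndiansDiabetes2)

  # Count the missing values (NA) in each column
  print(colSums(is.na(PimaIndiansDiabetes2)))
}
```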
Let's get to know the variables of the data:
pregnant : Number of times she has been pregnant.
glucose : Plasma glucose concentration (glucose tolerance test).
pressure : Diastolic blood pressure (mm Hg).
triceps : Triceps skin fold thickness (mm).
insulin : 2-hour serum insulin (μU/ml).
mass : Body mass index (BMI), weight divided by height squared (kg/m²).
pedigree : Diabetes pedigree function.
age : How old she is (years).
diabetes : Whether she tested positive (pos) or negative (neg).
These are the variables of our data.
In the structure of the data you may have noticed that the insulin variable has some missing values (NA). Let's look at the complete data to check for missing values. They can be handled by removing the incomplete rows.
Step 4: Check the complete data
> PimaIndiansDiabetes2 ↵ (press enter)
There are 768 observations in total.
Only the last few rows of the output are needed to see the number of observations.
> pidna <- na.omit(PimaIndiansDiabetes2)
After omitting the rows with NA values, only 392 valid observations remain.
'pidna' is the new variable in which we have saved the cleaned "PimaIndiansDiabetes2" data.
> pidlm <- pidna
A copy of pidna is now stored in pidlm.
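To see what `na.omit()` does on a small scale, here is a sketch on a toy data frame. The values are made up for illustration and are not taken from the Pima data.

```r
# Toy data frame (made-up values) with two incomplete rows
toy <- data.frame(
  glucose = c(148, 85, NA, 89, 137),
  insulin = c(NA, 94, 168, 88, 543)
)

# na.omit() drops every row that contains at least one NA
toy_complete <- na.omit(toy)

nrow(toy)           # 5 rows before
nrow(toy_complete)  # 3 rows after (rows 1 and 3 are removed)
```

Note that a row is dropped even if only one of its columns is missing, which is why the Pima data shrinks from 768 to 392 rows.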
You may have noticed that all the variables are numeric except the diabetes column, which is categorical (a factor). Linear regression needs a numeric response, so we should change the categorical values to numeric values. Let's see how to do it.
Step 5: Change the categorical values into numeric values
> pidlm$diabetes <- as.numeric(pidna$diabetes)-1
When we convert the factor to numeric, R by default gives 1 for negative and 2 for positive.
By subtracting 1 we get 1 for positive and 0 for negative.
Now, let's look at the corrected data again.
NOTE: Don't get confused by the response variable (diabetes) having the values 0 or 1. These are not a special binary type, they are just ordinary numbers. If we wrote the command as pidlm$diabetes <- as.numeric(pidna$diabetes)-2, the response values would be -1 and 0 instead.
> head(pidlm) ↵ (press enter)
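The conversion can be checked on a toy factor. This is a small sketch with made-up labels; only the level names neg and pos match the diabetes column.

```r
# Toy factor with the same two levels as the diabetes column
status <- factor(c("neg", "pos", "pos", "neg"), levels = c("neg", "pos"))

codes <- as.numeric(status)  # 1 2 2 1  (level codes start at 1)
y01   <- codes - 1           # 0 1 1 0  (neg = 0, pos = 1)
y_alt <- codes - 2           # -1 0 0 -1 (what subtracting 2 would give)

print(y01)
```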
Now we can use this data set for linear regression. From it we will make two different data sets: a training set and a testing set.
Step 6 : We will take the first 300 records as the training set and the remaining 92 as the testing set
> train <- pidlm[(1:300),]
> test <- pidlm[(301:392),]
NOTE: Keep the ',' inside the brackets. It tells R to select rows and keep all the columns. Without it, R tries to select columns 1 to 300 and stops with an "undefined columns selected" error.
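Here is a small sketch of the same row-based split on a toy data frame (made-up values), with a dimension check that confirms no rows were lost.

```r
# Toy data frame with 10 rows and 2 columns (made-up values)
toy <- data.frame(x = 1:10, y = (1:10)^2)

# The comma after the row range means "these rows, all columns"
train_toy <- toy[1:7, ]
test_toy  <- toy[8:10, ]

dim(train_toy)  # 7 rows, 2 columns
dim(test_toy)   # 3 rows, 2 columns
```

Taking the first rows for training is the simplest split. It works as long as the rows are not sorted in a way that makes the two sets differ systematically.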
Step 7 : Now let's make a linear regression model
> lm_reg <- lm(diabetes~., data=train)
Our regression model is now ready.
Step 8 : Let's have a look at the summary of lm_reg
> summary(lm_reg) ↵ (press enter)
The coefficients in this summary tell us how each variable is related to the diabetes value and which variables are statistically significant.
Step 9: Let's use the model to predict the test set
> predicted <- predict(lm_reg, newdata = test)
> TAB <- table(test$diabetes, predicted > 0.5)
We assume that if the predicted value is greater than 0.5, we consider it a positive case. Otherwise it is a negative case.
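The predict-then-threshold idea can be tried on a tiny example. This is only a sketch: the data frames `train_toy` and `test_toy` and their values are made up, and the single predictor `x` stands in for the eight variables of the real data.

```r
# Made-up training and test data with a 0/1 response
train_toy <- data.frame(x = c(1, 2, 3, 4, 5, 6), y = c(0, 0, 0, 1, 1, 1))
test_toy  <- data.frame(x = c(1.5, 5.5),         y = c(0, 1))

# Fit a linear regression on the 0/1 response
fit <- lm(y ~ x, data = train_toy)

# The argument must be spelled 'newdata'; otherwise predict() ignores it
# and returns the fitted values of the training rows instead
pred <- predict(fit, newdata = test_toy)

# Turn the numeric predictions into classes with the 0.5 cut-off
toy_tab <- table(actual = test_toy$y, predicted_positive = pred > 0.5)
print(toy_tab)
```

The predicted values are plain numbers and can fall below 0 or above 1, which is why a cut-off is needed to turn them into classes.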
From the table we can say that there are 62 women without diabetes and 30 women with diabetes in the test set.
57+5 women have the diabetes value '0'.
57 of them are predicted FALSE by the rule predicted > 0.5, so they are correctly predicted as negative. The other 5 are wrongly predicted as positive.
9+21 women have the diabetes value '1'.
21 of them are predicted TRUE by the rule predicted > 0.5, so they are correctly predicted as positive.
For the other 9 women the predicted value is below 0.5, giving the prediction '0' even though the original value was '1'.
The number of successful predictions is 57+21 = 78, while the total number of test cases is 92.
From this table we can also calculate the misclassification rate.
Step 10: Calculating the misclassification rate
> mcrate <- 1-sum(diag(TAB))/sum(TAB)
sum(diag(TAB)) = total number of correct predictions.
sum(TAB) = total number of predictions.
In words: mcrate = 1 - (number of correct predictions / total number of predictions).
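As a worked check, the counts reported above for the 0.5 cut-off can be typed in by hand. The matrix name `TAB_article` is just an illustrative name; the four numbers are the ones from the table in this article.

```r
# Counts from the article's table for predicted > 0.5
# (rows = actual diabetes value, columns = predicted > 0.5)
TAB_article <- matrix(
  c(57, 9,    # column FALSE: actual 0, actual 1
    5, 21),   # column TRUE:  actual 0, actual 1
  nrow = 2,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

correct <- sum(diag(TAB_article))  # 57 + 21 = 78
total   <- sum(TAB_article)        # 92
mcrate  <- 1 - correct / total     # 14 / 92, about 0.152

print(mcrate)
```

So about 15% of the test cases are misclassified with this cut-off.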
We are not limited to predicted > 0.5. We can change the cut-off value and see how the results change.
Step 11: Let's check the cut-off predicted > 0.7
> tab_high <- table(test$diabetes, predicted > 0.7) ↵ press enter
From the table we can again say that there are 62 women without diabetes and 30 women with diabetes.
60+2 women have the diabetes value '0'.
60 of them are predicted FALSE by the rule predicted > 0.7, so they are correctly predicted as negative. The other 2 are wrongly predicted as positive.
10+20 women have the diabetes value '1'.
20 of them are predicted TRUE by the rule predicted > 0.7, so they are correctly predicted as positive.
For the other 10 women the predicted value is below 0.7, giving the prediction '0' even though the original value was '1'.
Likewise, if you lower the cut-off below 0.5, the results will be different again.
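To make the effect of the higher cut-off concrete, here is a sketch that reads the two kinds of errors off the 0.7 table. The matrix name `tab_07` is illustrative; the counts are the ones reported above.

```r
# Counts from the article's table for predicted > 0.7
tab_07 <- matrix(
  c(60, 10,   # column FALSE: actual 0, actual 1
    2, 20),   # column TRUE:  actual 0, actual 1
  nrow = 2,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

false_pos <- tab_07["0", "TRUE"]    # healthy women flagged as diabetic: 2
false_neg <- tab_07["1", "FALSE"]   # diabetic women missed: 10

# Misclassification rate = all errors / all test cases = 12 / 92, about 0.130
rate_07 <- (false_pos + false_neg) / sum(tab_07)

print(rate_07)
```

Compared with the 0.5 cut-off, the false positives drop from 5 to 2 while the missed diabetic cases rise from 9 to 10, so the choice of cut-off is a trade-off between the two kinds of error.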
EDIT: If you have any doubts or find any mistakes, please post them in the comments and we will look into them.