How can Linear Regression also be used for classification?

Hello friends! In my previous articles I wrote about the basics of linear regression. In this article I will show you one of its several uses: how to do classification using linear regression in R.

Let’s Begin

Pre-requisites for this tutorial:

As we know, linear regression is most widely used in predictive analytics. Here in this article, we will use it for classification instead.

First of all, open R or RStudio and set your working directory.

Step 1: First, install the “mlbench” package.

> install.packages("mlbench")  (press Enter)

After pressing Enter, R will ask you to choose a CRAN mirror. [Just select “0-Cloud”.]

Step 2: Load the installed package using the library command.

> library(mlbench)  (press Enter)

Step 3: Load your desired data. [Here I am loading PimaIndiansDiabetes2, a dataset built into the package.]

> data(PimaIndiansDiabetes2)

The data contains records of 768 women of Pima Indian heritage who were tested for diabetes (as we will see below, only 392 of the records are complete).

Our goal in this article is to classify each woman’s diagnosis of diabetes.

Let’s look into the data:

> head(PimaIndiansDiabetes2)

Here is the structure of the data:

[screenshot: the first six rows of PimaIndiansDiabetes2]

Let’s get to know the variables in the data:

pregnant : number of times pregnant.

glucose : plasma glucose concentration (glucose tolerance test).

pressure : diastolic blood pressure (mm Hg).

triceps : triceps skin fold thickness (mm).

insulin : 2-hour serum insulin (mu U/ml).

mass : body mass index, i.e. weight/height² (kg/m²).

pedigree : diabetes pedigree function.

age : age in years.

diabetes : whether the test was positive or negative.

These are the variables of our data.
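If you want to confirm these types in your own session, str() prints each column’s class and a few sample values (standard base R, nothing extra assumed):

> str(PimaIndiansDiabetes2)

All eight predictors are numeric, and diabetes is a factor with the levels “neg” and “pos”.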

In the output above you may have noticed some missing values (NA) in the insulin variable. Let’s look into the complete data to check the missing values; they can be fixed by dropping the incomplete rows.

Step 4: Check the complete data.

> PimaIndiansDiabetes2  (press Enter)

There are 768 total observations.

[screenshot: the last few rows of PimaIndiansDiabetes2, confirming the observation count of 768]

> pidna <- na.omit(PimaIndiansDiabetes2)

After omitting the NA values from the dataset, only 392 valid observations remain.

[screenshot: the cleaned dataset pidna with 392 rows]

Here ‘pidna’ is the new variable where we have saved the cleaned version of PimaIndiansDiabetes2.
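A quick sanity check (plain base R) shows how many values are missing in each column and confirms that 392 complete rows remain:

> colSums(is.na(PimaIndiansDiabetes2))   # NA count per column
> nrow(pidna)                            # should print 392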

> pidlm <- pidna

This copies pidna into pidlm, which is the copy we will modify for the regression.

If you have noticed, all the variables have numeric values, but the diabetes column has categorical values. We should change the categorical values to numeric values. How do we do it? Let’s have a look.

Step 5: Change the categorical values into numeric values.

> pidlm$diabetes <- as.numeric(pidna$diabetes) - 1

When as.numeric() converts the categorical values, the default coding is 1 for negative and 2 for positive. By subtracting 1, we have set it up so that the result is 1 for positive and 0 for negative.

Now, let’s see the corrected data again.

NOTE: Don’t get confused by the response variable (diabetes) having values 0 or 1. These are not in binary/factor form; they are just numeric values. If we instead set the command as pidlm$diabetes <- as.numeric(pidna$diabetes) - 2, the response values would be -1 and 0.
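A small sanity check (base R) that the recoding worked as intended:

> levels(pidna$diabetes)   # "neg" "pos", so as.numeric() gives 1 and 2
> table(pidlm$diabetes)    # counts of 0 (negative) and 1 (positive)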

> head(pidlm)  (press Enter)

[screenshot: head(pidlm) with diabetes recoded to 0/1]

Now we can use this dataset for linear regression. From it we will make two different datasets: one is the training set and the other is the testing set.

Step 6: We will take the first 300 records for the training set and the remaining 92 for the testing set.

> train <- pidlm[1:300, ]

> test <- pidlm[301:392, ]

NOTE: If you get a dimension error with this command, check the comma inside the brackets: pidlm[1:300, ] selects rows, while pidlm[1:300] would select columns.
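Taking the first 300 rows is the simplest split, but it keeps whatever ordering the data came in. If you prefer, here is a sketch of a random split in base R (the seed value 42 is my arbitrary choice):

> set.seed(42)                      # arbitrary seed, for reproducibility
> idx <- sample(nrow(pidlm), 300)   # 300 random row numbers
> train <- pidlm[idx, ]
> test <- pidlm[-idx, ]

(If you use the random split, your confusion-matrix numbers below will differ from the ones shown.)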

Step 7: Now let’s make a linear regression model.

> lm_reg <- lm(diabetes ~ ., data = train)

Our regression model is now ready.

Step 8: Let’s have a look at the summary of lm_reg.

> summary(lm_reg)

[screenshot: summary(lm_reg) output with coefficient estimates and p-values]

From this summary we can see which variables drive the diagnosis: the coefficients tell us how strongly each predictor is associated with a positive result.
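If you want the coefficient table as a plain matrix, for example to sort the predictors by p-value, you can pull it out of the summary object (standard base R):

> coefs <- summary(lm_reg)$coefficients
> coefs[order(coefs[, "Pr(>|t|)"]), ]   # most significant variables first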

Step 9: Let’s make predictions with the linear regression model.

> predicted <- predict(lm_reg, newdata = test)

> predicted

> TAB <- table(test$diabetes, predicted > 0.5)

> TAB

[screenshot: the predicted values and the confusion table TAB]

We assume that if the predicted value is greater than 0.5, we consider it a positive case; otherwise it is a negative case.
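The same rule written out as an explicit 0/1 label (a small sketch; pred_class is just a name introduced here for illustration):

> pred_class <- ifelse(predicted > 0.5, 1, 0)
> head(pred_class)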

From the table we can see that 62 of the 92 test cases are negative and 30 are positive.

57 + 5 = 62 women have the diabetes value ‘0’.

57 of them satisfy predicted ≤ 0.5 (FALSE), i.e. they are correctly classified as negative.

9 + 21 = 30 women have the diabetes value ‘1’.

21 of them satisfy predicted > 0.5 (TRUE), i.e. they are successfully classified as positive.

In the remaining 9 cases the predicted value is below 0.5, giving a predicted class of ‘0’ even though the original value was ‘1’.

The number of successful predictions is 57 + 21 = 78, while the total number of test cases is 92.

From here we can also calculate the misclassification rate.

Step 10: Calculating the misclassification rate

Don’t worry, we will calculate the misclassification rate from the table:

> mcrate <- 1 - sum(diag(TAB)) / sum(TAB)

sum(diag(TAB)) = the total number of correctly predicted values.

sum(TAB) = the total number of predicted values.

In words: misclassification rate = 1 − (correct predictions / total predictions).
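Plugging in the numbers from our table (57 + 21 = 78 correct out of 92 total):

> 1 - 78/92
[1] 0.1521739

So roughly 15% of the test cases are misclassified.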

We are not limited to the cutoff predicted > 0.5; we can change the threshold and see how the results change.
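A quick way to compare several cutoffs at once (a sketch using base R’s sapply; the particular thresholds are my own choice):

> cutoffs <- c(0.3, 0.4, 0.5, 0.6, 0.7)
> sapply(cutoffs, function(cut) mean((predicted > cut) != test$diabetes))   # misclassification rate at each cutoff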

Step 11: Let’s check the results for the cutoff predicted > 0.7.

> tab_high <- table(test$diabetes, predicted > 0.7)  (press Enter)

[screenshot: the confusion table tab_high]

Again from the table, 62 of the 92 test cases are negative and 30 are positive.

60 + 2 = 62 women have the diabetes value ‘0’.

60 of them satisfy predicted ≤ 0.7 (FALSE), i.e. they are correctly classified as negative.

10 + 20 = 30 women have the diabetes value ‘1’.

20 of them satisfy predicted > 0.7 (TRUE), i.e. they are successfully classified as positive.

In the remaining 10 cases the predicted value is below 0.7, giving a predicted class of ‘0’ even though the original value was ‘1’.

Similarly, if you lower the cutoff below 0.5, the results will change in the other direction.

EDIT: If you have any doubts or find any mistakes, please post them in the comments and we will look into them.


7 thoughts on “How can Linear Regression also be used for classification?”

  1. I am not sure lm is suitable for this data, because the dependent variable is categorical. It violates the assumption that the errors are i.i.d. normal.

    Would logistic regression be a better choice?

    Please advise.


    1. spondars, you are right that the data has a categorical dependent variable. See Step 5: I changed the categorical values into numeric ones (negative is coded 1 and positive 2 before subtracting 1).
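EDIT: For readers who want to try this suggestion, the logistic-regression version is a one-line change using glm() with the binomial family; this sketch reuses the train and test objects from the article:

> log_reg <- glm(diabetes ~ ., data = train, family = binomial)
> prob <- predict(log_reg, newdata = test, type = "response")   # probabilities in [0, 1]
> table(test$diabetes, prob > 0.5)   # confusion table, same 0.5 cutoff as before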


    1. @alberto, in this example we are predicting diabetes cases, and in this dataset the observations were tested against several variables that can cause diabetes. If the R² is low, it means the variables are not very significant in explaining diabetes.


  2. A logit model (or arctan as well) should be more suitable for predicting a dichotomous DV, since the ordinary lm prediction can exceed the allowed range of [0, 1].

    For the lm method you introduced: a subsequent ROC analysis could give some useful information on the diagnostic quality of the model, and provides the opportunity to determine the cutoff value of minimal decision error in the training data (Youden index).


      1. @ben, you are right that linear regression can’t be used to do classification directly, but please go through the article carefully: in Step 5 I changed the categorical response to numeric form. Once the response is numeric, you can use the lm model, as far as I know. If I am wrong, please shoot an email to irrfankhann29@gmail.com.
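EDIT: Following up on the ROC suggestion above, here is a hedged sketch using the pROC package (you would need to install it first; roc() and coords() with best.method = "youden" are the relevant calls). As the commenter suggests, the cutoff is determined on the training data:

> install.packages("pROC")
> library(pROC)
> roc_obj <- roc(train$diabetes, fitted(lm_reg))    # ROC curve on the training data
> coords(roc_obj, "best", best.method = "youden")   # Youden-optimal cutoff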

