Hello friends. In my previous articles I covered the basics of linear regression. In this article I will show you one of its several uses: how to do classification using linear regression in R.

Let’s Begin

**Prerequisites for this tutorial:**

*Introduction to Linear Regression model (optional).*

*Linear Regression Analysis using R (you must know how to run a regression in R).*

Linear regression is widely used in predictive analytics. In this article we will use it for classification.

First of all, open **R** or **RStudio** and set your working directory.

**Step 1:** First install the “mlbench” package.

> **install.packages("mlbench")** **↵ (press enter).**

After you press Enter, R may ask for a CRAN mirror. [Just select 0-Cloud.]

**Step 2:** Load the installed package using the library command.

> **library(mlbench)** **↵** **(press enter )**

**Step 3:** Load your desired data. [Here I am loading PimaIndiansDiabetes2, a data set built into the package.]

> **data(PimaIndiansDiabetes2)**

The data set contains records of women of Pima Indian heritage who were tested for diabetes. Of these, 392 records have no missing values.

Our goal in this article is to classify the diabetes diagnosis.

Let's look into the data.

> **head(PimaIndiansDiabetes2)**

Here is the structure of the data:

Let's get to know the variables in the data (the column names are lowercase in R):

**pregnant** : Number of times she has been pregnant.

**glucose** : Plasma glucose concentration from a glucose tolerance test.

**pressure** : Diastolic blood pressure (mm Hg).

**triceps** : Triceps skin fold thickness (mm).

**insulin** : 2-hour serum insulin (mu U/ml).

**mass** : Body mass index (weight in kg divided by height in m, squared) [kg/m²].

**pedigree** : Diabetes pedigree function.

**age** : Her age (years).

**diabetes** : Whether she tested positive (pos) or negative (neg).

These are the variables of our data.

In the output above you can see that there are some missing values (NA) in the insulin variable. Let's look at the complete data to check the missing values. They can be handled by removing the affected rows.

**Step 4:** check the complete data

> **PimaIndiansDiabetes2 ↵ press enter**

There are 768 observations in total.

(Only the last few observations are shown, to confirm the number of observations.)

>**pidna <- na.omit(PimaIndiansDiabetes2)**

After omitting the NA values from the data set, only 392 valid observations remain,

and **‘pidna’** is the new variable in which we have saved the cleaned **“PimaIndiansDiabetes2”** data.
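As a small illustration of what the NA handling does, here is a sketch on a toy data frame. The values are hypothetical and are not taken from PimaIndiansDiabetes2; only `colSums`, `is.na` and `na.omit` from base R are used.

```r
# Toy data frame (hypothetical values, not taken from PimaIndiansDiabetes2)
toy <- data.frame(
  glucose = c(148, 85, NA, 89),
  insulin = c(NA, NA, 94, 168),
  mass    = c(33.6, 26.6, 28.1, 43.1)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Keep only the rows that contain no NA at all
toy_complete <- na.omit(toy)

print(na_per_column)
print(nrow(toy_complete))
```

Running the same two calls on the real data should show which columns are responsible for the drop from 768 to 392 rows.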

> **pidlm <- pidna**

The value of **pidna** is stored in **pidlm**

You may have noticed that all variables are numeric except the diabetes column, which is categorical. We need to convert it to numeric values. Let's see how.

**Step 5:** Change the categorical value into a numeric value

> **pidlm$diabetes <- as.numeric(pidna$diabetes)-1**

When we convert the categorical values to numeric, the default coding is **1-negative** and **2-positive**.

Because we subtract 1, the values come out as **1-positive** and **0-negative.**
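The coding can be checked on a tiny factor. This is only a sketch: `status` is a hypothetical stand-in for the diabetes column, with the same two levels in the same order.

```r
# Hypothetical labels standing in for the diabetes column
status <- factor(c("neg", "pos", "pos", "neg"), levels = c("neg", "pos"))

codes_default <- as.numeric(status)      # internal level codes: 1 = neg, 2 = pos
codes_01      <- as.numeric(status) - 1  # shifted codes:        0 = neg, 1 = pos

# Alternative that does not depend on the order of the levels
codes_explicit <- as.numeric(status == "pos")

print(codes_default)
print(codes_01)
print(codes_explicit)
```

The last form is a little more defensive, since it still gives 1 for "pos" if the factor levels are ever reordered.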

Now, let's look at the corrected data again.

**NOTE: Don’t be confused by the response variable (diabetes) having values 0 or 1. These are ordinary numeric values, not a special binary type. If we wrote the command as pidlm$diabetes <- as.numeric(pidna$diabetes)-2 instead, the response values would be -1 and 0.**

> **head(pidlm) ↵ (press enter)**

Now we can use this data set for linear regression. From it we will make two data sets: a **training set** and a **testing set.**

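**Step 6 :** We will take the first 300 records for the training set and the remaining 92 for the testing set

> **train <- pidlm[(1:300),]**

> **test <- pidlm[(301:392),]**

**NOTE: Keep the ‘,’ inside the brackets. It selects rows and keeps every column; without it R treats the numbers as column indices and stops with an "undefined columns selected" error.**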
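The row indexing can be tried on a toy data frame first. In this sketch the data and the names `train_seq`, `test_seq`, `train_rand` and `test_rand` are hypothetical. The random split at the end is an alternative, not the method used in this article, and is worth considering when the rows might be ordered in some way.

```r
# Toy data frame with 10 rows (hypothetical values)
toy <- data.frame(id = 1:10, x = seq(0.5, 5, by = 0.5))

# Sequential split, as in the article: rows before the comma, all columns after it
train_seq <- toy[1:7, ]
test_seq  <- toy[8:10, ]

# Alternative (not the article's method): a reproducible random split
set.seed(42)
train_idx  <- sample(nrow(toy), size = 7)
train_rand <- toy[train_idx, ]
test_rand  <- toy[-train_idx, ]

print(dim(train_seq))
print(dim(test_seq))
```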

**Step 7 :** Now let's build a linear regression model

> **lm_reg <- lm(diabetes~., data=train)**

Our regression model is now ready.

**Step 8 :** Let's look at the summary of **lm_reg**

> **summary(lm_reg)**

The coefficient table in this summary tells us which variables are significantly associated with diabetes.
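To show how the coefficient table can be read programmatically, here is a sketch that runs without extra packages. It uses the built-in `mtcars` data as a stand-in, with the 0/1 column `am` playing the role of `diabetes`. The 0.05 threshold is a common convention, not something fixed by this article.

```r
# Stand-in data: mtcars ships with R; am (0/1) plays the role of diabetes
fit <- lm(am ~ wt + hp, data = mtcars)

# Coefficient table: estimate, standard error, t value and p-value per term
coef_table <- coef(summary(fit))
print(round(coef_table, 4))

# Predictors whose p-value is below 0.05 (the intercept row is dropped)
p_values    <- coef_table[-1, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]
print(significant)
```

The same two lines should work on `lm_reg` to list the predictors that stand out in the diabetes model.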

**Step 9:** Let's predict on the testing set with the linear regression model

> **predicted <- predict(lm_reg, newdata = test)**

> **predicted**

> **TAB <- table(test$diabetes, predicted > 0.5)**

> **TAB**

We assume that if the predicted value is greater than 0.5, the case is positive; otherwise it is negative.
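The predict-then-threshold step can be sketched end to end on built-in data. As an assumption for illustration, `mtcars` stands in for the diabetes data and `am` (0/1) for the response; the split sizes and the names `train_cars` and `test_cars` are hypothetical.

```r
# Stand-in data: mtcars; am (0/1) plays the role of diabetes
train_cars <- mtcars[1:22, ]
test_cars  <- mtcars[23:32, ]

fit <- lm(am ~ wt + hp, data = train_cars)

# newdata is one word; without it predict() returns the fitted values
# for the training rows instead of predictions for the test rows
pred <- predict(fit, newdata = test_cars)

# Fixing both factor levels keeps the table 2 x 2 even if one class is never predicted
tab <- table(
  actual    = factor(test_cars$am, levels = c(0, 1)),
  predicted = factor(as.numeric(pred > 0.5), levels = c(0, 1))
)
print(tab)
```

Note that the predictions of a linear model are not restricted to the range 0 to 1, so the 0.5 cutoff is a working rule rather than a probability threshold.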

From the table we can say that 62 of the 92 test cases do not have diabetes.

*57+5 women have a diabetes value of ‘0’.*

*57 of them are predicted FALSE by the rule predicted > 0.5, which is correct; the other 5 are wrongly predicted TRUE.*

*9+21 women have a diabetes value of ‘1’.*

*21 of them are predicted TRUE by the rule predicted > 0.5, so they are successfully predicted.*

*9 women have a predicted value below 0.5, so they are predicted ‘0’ even though the actual value is ‘1’.*

The number of successful predictions is 78 (57+21), out of 92 test cases in total.

From here we can also calculate the misclassification rate.

**Step 10:** Calculating the misclassification rate

**Don’t worry, we will calculate the misclassification rate from the table:**

> **mcrate <- 1-sum(diag(TAB))/sum(TAB)**

**sum(diag(TAB)) = total number of correct predictions.**

**sum(TAB) = total number of predictions.**

**mcrate = 1 - (number of correct predictions)/(total number of predictions)**

We are not limited to predicted > 0.5; we can change the cutoff and see how the results change.
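Before changing the cutoff, the Step 10 formula can be checked by hand against the counts reported above for the 0.5 cutoff. The sketch below rebuilds that table as a plain matrix; the layout (actual values in rows, predictions in columns) is an assumption that matches the order of arguments in the `table()` call.

```r
# Confusion table for the 0.5 cutoff, using the counts reported in the article
# (rows: actual diabetes value, columns: predicted > 0.5)
TAB <- matrix(
  c(57, 9,    # predicted FALSE: 57 with actual 0, 9 with actual 1
    5, 21),   # predicted TRUE:   5 with actual 0, 21 with actual 1
  nrow = 2,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

correct <- sum(diag(TAB))          # 57 + 21 = 78 correct predictions
mcrate  <- 1 - correct / sum(TAB)  # 1 - 78/92

print(TAB)
print(round(mcrate, 4))            # 0.1522
```

So about 15% of the 92 test cases are misclassified at this cutoff.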

**Step 11:** Let's check the cutoff predicted > 0.7

> **tab_high <- table(test$diabetes, predicted > 0.7)** **↵ press enter**

From the table we can again say that 62 of the 92 test cases do not have diabetes.

*60+2 women have a diabetes value of ‘0’.*

*60 of them are predicted FALSE by the rule predicted > 0.7, which is correct.*

*10+20 women have a diabetes value of ‘1’.*

*20 of them are predicted TRUE by the rule predicted > 0.7, so they are successfully predicted.*

*10 women have a predicted value below 0.7, so they are predicted ‘0’ even though the actual value is ‘1’.*

If you lower the cutoff below 0.5, the results will change again.
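Comparing several cutoffs at once can be sketched with a small helper. The labels and scores below are hypothetical, not the article's predictions, and `mc_rate` is a name introduced only for this example. It computes the rate directly from the vectors, which avoids problems when a cutoff leaves one column of the table empty.

```r
# Hypothetical test labels and predicted scores (not from the article's data)
actual <- c(0, 0, 0, 0, 1, 1, 1, 1)
scores <- c(0.10, 0.35, 0.55, 0.20, 0.45, 0.60, 0.65, 0.90)

# Misclassification rate for one cutoff: share of cases where the
# thresholded prediction disagrees with the actual label
mc_rate <- function(cutoff) mean((scores > cutoff) != actual)

cutoffs <- c(0.3, 0.5, 0.7)
rates   <- sapply(cutoffs, mc_rate)

print(data.frame(cutoff = cutoffs, mc_rate = rates))
```

With `actual` replaced by `test$diabetes` and `scores` by `predicted`, the same helper should reproduce the comparison between the 0.5 and 0.7 cutoffs.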

**EDIT:** **If you have any doubts or find any mistakes, please post them in the comments and we will look into them.**

I am not sure linear regression (lm) is suitable for this data, because the dependent variable is categorical. It violates the assumption that the errors are iid normal.

Would logistic regression be a better choice?

Please advise.

spondars, you are right that the data has a categorical dependent variable. See step 5: I have changed the categorical values into numeric ones, 0 for negative and 1 for positive.

Overfitting problem and R2 is very small.

Ciao Alberto

@alberto, in this example we are identifying diabetes cases, and in this data set the observations were tested against variables that can cause diabetes. If the R^2 is low, this means that not all the variables are significant causes of diabetes.

A logit model (or arctan as well) should be more suitable to predict a dichotomous DV, since the ordinary lm prediction exceeds the allowed range of [0;1].

For the lm-method you introduced: A subsequent ROC analysis could give some useful information on the diagnostic quality of the model and provides the opportunity to determine the cutoff value of minimal decision error in the training data (youden index).

You are right @ben, I just gave an idea of classification using LM method.

@ben you are right, linear regression can’t be used to do classification directly, but please go through the article carefully: I have changed the categorical variable to numeric form. If your data is categorical then you can use the LM model, as far as I know. If I am wrong please shoot an email @irrfankhann29@gmail.com.
