One of the most frequently used techniques in statistics is linear regression, where we investigate the potential relationship between a variable of interest (often called the response variable, but there are many other names in use) and a set of one or more variables (known as the independent variables or some other term). Unsurprisingly, there are flexible facilities in R for fitting a range of linear models, from the simple case of a single variable to more complex relationships.
In this post we will consider the case of simple linear regression with one response variable and a single independent variable. For this example we will use some data for a pizza franchise.
The purpose of using this data is to determine whether there is a relationship, described by a simple linear regression model, between the two variables.
First, here are the steps:
- Open an R script.
- Set the working directory to the folder where you keep your data files.
- Load the dataset into R (.txt, .xlsx, or .csv).
If the extension of your file is '.xlsx', load the xlsx package and use this command:
library(xlsx)
var1 <- read.xlsx("<filename with proper extension>", sheetIndex = <number>)
# sheetIndex is the position of the Excel sheet where your data is stored; to pick the sheet by its title, use sheetName = "<name>" instead.
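For .csv and .txt files, base R is usually enough and no extra package is needed. The snippet below is only a minimal sketch: the file paths are temporary files and the values in `demo` are made up for illustration, so replace them with your own file name.

```r
# Hypothetical example: write a tiny data set to temporary files so the
# sketch is self-contained, then read it back with base R.
demo <- data.frame(X = c(1, 2, 3), Y = c(10, 20, 30))  # made-up values

# --- .csv file ---
csv_path <- tempfile(fileext = ".csv")
write.csv(demo, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)          # comma-separated, header row expected

# --- .txt file (tab-delimited) ---
txt_path <- tempfile(fileext = ".txt")
write.table(demo, txt_path, row.names = FALSE, sep = "\t", quote = FALSE)
from_txt <- read.table(txt_path, header = TRUE, sep = "\t")

# Inspect what was loaded: column names, types, and the first values.
str(from_csv)
str(from_txt)
```

With your own data you would skip the `write.*` lines and pass your file name straight to `read.csv()` or `read.table()`.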
Now we are ready to build a linear regression model in R.
My R script
Now load the data
You can see in the image that I first checked my working directory and then changed it to another directory. My data files are stored in a different location, so I changed the working directory to make them easier to load.
I can set the working directory by two methods: 1) File -> Change directory, or 2) setwd("<path>")
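Checking and changing the working directory from a script might look like the following. This is a sketch only: `tempdir()` stands in for your own data folder, and the variable names `old_dir`, `data_dir` and `new_dir` are mine, not from the original script.

```r
old_dir <- getwd()       # check the current working directory

data_dir <- tempdir()    # hypothetical stand-in for your data folder
setwd(data_dir)          # change to the folder that holds the data files
new_dir <- getwd()       # confirm that the change took effect

list.files()             # see which files are available in this folder

setwd(old_dir)           # go back to where we started
```

Restoring the original directory at the end is optional, but it keeps the rest of your session unaffected.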
The following data are for a pizza franchise:
X = annual franchise fee ($1000)
Y = start up cost ($1000)
You can see the details in the image below.
Here in this image you can see that the R² value is low, which means that the linear relationship between the two variables is weak.
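For readers who cannot see the image, a fit of this kind can be sketched with `lm()` and `summary()`. The data below are simulated stand-ins, not the pizza franchise data used in this post, so the numbers it prints will differ from the ones discussed here; the data frame name `franchise` is also my own choice.

```r
set.seed(1)  # make the simulated values reproducible

# Simulated stand-in data, NOT the pizza franchise data from the post.
n <- 30
X <- runif(n, min = 100, max = 200)      # hypothetical annual fee ($1000)
Y <- 50 + 0.5 * X + rnorm(n, sd = 40)    # hypothetical start up cost ($1000)
franchise <- data.frame(X = X, Y = Y)

# Fit the simple linear regression of Y on X.
fit <- lm(Y ~ X, data = franchise)

# The summary shows the coefficients, residual standard error and R-squared.
fit_summary <- summary(fit)
print(fit_summary)

coef(fit)                # intercept and slope
fit_summary$r.squared    # R-squared on its own
```

With real data you would replace the simulated `franchise` data frame with the one you loaded from your file.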
How good is your Regression model?
- The R² value tells us how well the model fits the data.
- The difference between an observed value and the value predicted by the model (the part not explained by the model) is the error term, or residual.
- In the above regression model R² is approximately 0.22, so about 22% of the variance of the dependent variable is explained by the model, and the remaining 78% is unexplained and attributed to the error term (residuals).
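To make the link between residuals and R² concrete, the sketch below computes R² by hand as 1 minus the residual sum of squares divided by the total sum of squares, and compares it with the value R reports. The data are simulated for illustration only (not the pizza franchise data), and the names `x`, `y`, `fit2`, `ss_res` and `ss_tot` are my own.

```r
set.seed(2)  # reproducible simulated data, for illustration only

x <- 1:25
y <- 3 + 0.4 * x + rnorm(25, sd = 5)

fit2 <- lm(y ~ x)

# Residuals: observed values minus the values predicted by the model.
res <- residuals(fit2)

ss_res <- sum(res^2)               # variation NOT explained by the model
ss_tot <- sum((y - mean(y))^2)     # total variation in the response

r2_manual <- 1 - ss_res / ss_tot   # share of variation explained
r2_lm <- summary(fit2)$r.squared   # the value reported by summary()

# The two values agree (up to floating-point rounding).
all.equal(r2_manual, r2_lm)

# Share of variation left unexplained, i.e. attributed to the residuals.
1 - r2_manual
```

An R² of 0.22 therefore means that `ss_res` is about 78% of `ss_tot`.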