Essentials of R Programming


R has five basic or ‘atomic’ classes of objects. But first, what is an object? Everything you see or create in R is an object: a vector, a matrix, a data frame, even a variable. R treats it that way. The five basic classes are:

  • Character
  • Numeric (Real Numbers)
  • Integer (Whole Numbers)
  • Complex
  • Logical (TRUE / FALSE)

Since these classes are self-explanatory by name, I won’t elaborate on them. Objects also have attributes. Think of an attribute as an ‘identifier’: a name or number which aptly describes the object. An object can have the following attributes:

  • Names, Dimension Names
  • Dimension
  • Class
  • Length

The attributes of an object can be accessed using the attributes() function. More on this in the following section.
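To make the attribute list above a little more concrete, here is a minimal sketch. The vector `scores` and its values are made up purely for illustration.

```r
# 'scores' is a hypothetical named numeric vector used only for illustration
scores <- c(ash = 67, jane = 56, paul = 87)

attributes(scores)   # lists the attributes that are set (here only $names)
names(scores)        # "ash" "jane" "paul"
length(scores)       # 3
class(scores)        # "numeric"
```

Note that attributes() only lists attributes that have actually been set, so a plain vector without names returns NULL.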

Let’s understand the concept of objects and attributes in practice. The most basic object in R is the vector. You can create an empty vector using vector(). Remember, a vector contains objects of the same class.

For example, let’s create vectors of different classes. We can also create a vector using the c() (concatenate) function.

> a <- c(3.5, 5)     #numeric
> b <- c(3 + 4i)     #complex
> d <- c(54L, 14L)   #integer (the L suffix is required; c(54, 14) would be numeric)
> e <- vector("logical", length = 5)   #logical

Similarly, you can create vectors of other classes.

 

Data Types in R

R has various ‘data types’, which include vectors (numeric, integer etc.), matrices, data frames and lists. Let’s understand them one by one.

Vector: As mentioned above, a vector contains objects of the same class. But you can try to mix objects of different classes too. When objects of different classes are mixed in a vector, coercion occurs: R ‘converts’ all of them into one common class. For example:

> kh <- c("Time", 24, "October", TRUE, 3.33)  #character
> we <- c(TRUE, 24) #numeric
> df <- c(2.5, "May") #character

To check the class of any object, use the class() function with the name of the object.

> class(kh)
[1] "character"

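To convert the class of a vector, you can use the as.*() family of functions (as.numeric(), as.character() and so on). These functions return a converted copy, so assign the result back if you want to change the vector itself.

> bar <- 0:5
> class(bar)
[1] "integer"
> bar <- as.numeric(bar)
> class(bar)
[1] "numeric"
> bar <- as.character(bar)
> class(bar)
[1] "character"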

Similarly, you can change the class of any vector. But you should pay attention here. If you try to convert a “character” vector to “numeric”, any element that doesn’t look like a number becomes NA (with a warning). Hence, you should be careful when using these functions.
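Here is a small sketch of that behaviour. The vector `mixed` is made up for this example: two of its entries look like numbers and one does not.

```r
# 'mixed' is a hypothetical character vector used only for illustration
mixed <- c("10", "3.5", "May")

# suppressWarnings() hides the "NAs introduced by coercion" warning
converted <- suppressWarnings(as.numeric(mixed))

converted          # 10.0  3.5   NA
is.na(converted)   # FALSE FALSE TRUE
```

Checking is.na() on the result is a quick way to spot which elements failed to convert.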

 

List: A list is a special type of vector which can contain elements of different data types.

For example:

> my_list <- list(22, "ab", TRUE, 1 + 2i)
> my_list

[[1]]
[1] 22

[[2]]
[1] "ab"

[[3]]
[1] TRUE

[[4]]
[1] 1+2i

(The quotes around "ab" mean that the element is a character string.)

As you can see, the output of a list is different from that of a vector. This is because the elements are of different types. The double bracket [[1]] shows the index of the first element, and so on. Hence, you can easily extract an element of a list by its index. Like this:

> my_list[[3]]
[1] TRUE

You can use the single bracket [] too. But that returns a list of length one containing the element, instead of the element itself. Like this:

> my_list[3]
[[1]]
[1] TRUE
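One way to see the difference between the two kinds of brackets is to check the class of what each one returns. This is only a sketch; the names `element` and `sublist` are my own.

```r
my_list <- list(22, "ab", TRUE, 1 + 2i)

element <- my_list[[3]]   # the element itself
sublist <- my_list[3]     # a list of length 1 that holds the element

class(element)    # "logical"
class(sublist)    # "list"
length(sublist)   # 1
```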

 

Matrices: When a vector is given rows and columns, i.e. a dimension attribute, it becomes a matrix. A matrix is a 2-dimensional data structure represented by a set of rows and columns. Every element of a matrix must have the same class. Let’s create a matrix of 3 rows and 2 columns:

> variable <- matrix(1:6, nrow=3, ncol=2)
> variable
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

In this output, [,2] labels the second column and [1,] labels the first row.

Here the matrix() function arranges the elements 1 to 6 into 3 rows and 2 columns, filling the matrix column by column.

> dim(variable)
[1] 3 2

> attributes(variable)
$dim
[1] 3 2

As you can see, the dimensions of a matrix can be obtained using either the dim() or the attributes() function. To extract a particular row or column from a matrix, simply use the index shown above. For example (try this at your end):

> variable[,2]   #extracts second column
> variable[,1]   #extracts first column
> variable[2,]   #extracts second row
> variable[1,]   #extracts first row

As an interesting fact, you can also create a matrix from a vector. All you need to do is assign a dimension with dim() afterwards. Like this:

> age <- c(23, 44, 15, 12, 31, 16)
> age
[1] 23 44 15 12 31 16

> dim(age) <- c(2,3)
> age
     [,1] [,2] [,3]
[1,]   23   15   31
[2,]   44   12   16

> class(age)
[1] "matrix" "array"

(In R versions before 4.0.0, class(age) returns only "matrix".)

You can also join two vectors using the cbind() and rbind() functions. But make sure that both vectors have the same number of elements. If they don’t, R does not insert NA values; it recycles the shorter vector to fill the gap (with a warning when the longer length is not a multiple of the shorter one).

> x <- c(1, 2, 3, 4, 5, 6)
> y <- c(20, 30, 40, 50, 60, 70)
> cbind(x, y)
     x  y
[1,] 1 20
[2,] 2 30
[3,] 3 40
[4,] 4 50
[5,] 5 60
[6,] 6 70

> class(cbind(x, y))
[1] "matrix" "array"
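rbind() works the same way as cbind(), except that each vector becomes a row. A minimal sketch with two made-up vectors of equal length:

```r
# Two hypothetical vectors of equal length
x <- c(1, 2, 3)
y <- c(20, 30, 40)

by_rows <- rbind(x, y)   # each vector becomes a row of the matrix

dim(by_rows)        # 2 3
rownames(by_rows)   # "x" "y"
by_rows[2, 3]       # 40
```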

 

Data Frame: This is the most commonly used member of the data types family. It is used to store tabular data. It is different from a matrix: in a matrix, every element must have the same class, but in a data frame each column can have a different class. Internally, a data frame is a list of equal-length vectors, one per column. When you read tabular data into R (for example with read.csv()), it is stored as a data frame. Hence, it is important to understand the commonly used commands on data frames:

#stringsAsFactors = TRUE stores name as a factor (since R 4.0.0 the default keeps it as character)
> dataframe <- data.frame(name = c("ash","jane","paul","mark"), score = c(67,56,87,91), stringsAsFactors = TRUE)
> dataframe
  name score
1  ash    67
2 jane    56
3 paul    87
4 mark    91

> dim(dataframe)    #returns the dimensions of the data frame
[1] 4 2

> str(dataframe)    #returns the structure of the data frame
'data.frame': 4 obs. of 2 variables:
$ name : Factor w/ 4 levels "ash","jane","mark",..: 1 2 4 3
$ score: num 67 56 87 91

> nrow(dataframe)   #returns the number of rows
[1] 4

> ncol(dataframe)   #returns the number of columns
[1] 2

Here you see “name” is a factor variable and “score” is numeric. In data science, a variable can be categorized into two types: continuous and categorical.

Continuous variables are those which can take any numeric value, such as 1, 2, 3.5, 4.66 etc.

Categorical variables are those which take only a limited set of distinct values (categories), such as the names ‘ash’, ‘jane’, ‘mark’ and ‘paul’ above.

In R, categorical variables are represented by factors. Factors are treated specially in a data set.
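To get a feel for factors, here is a short sketch. The vector `gender` is made up for illustration and is not part of the data frame above.

```r
# 'gender' is a hypothetical character vector used only for illustration
gender <- c("male", "female", "female", "male", "female")

gender_factor <- factor(gender)

levels(gender_factor)       # "female" "male" (distinct categories, sorted)
table(gender_factor)        # female: 3, male: 2
as.integer(gender_factor)   # 2 1 1 2 1 (integer codes stored underneath)
```

Each value is stored as an integer code that points to one of the levels, which is why str() prints numbers next to the level names.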

Let’s now understand the concept of missing values in R. This is one of the most painful yet crucial parts of predictive modeling. You must be aware of the techniques to deal with them. The complete explanation on such techniques is provided here.

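Missing values in R are represented by NA. A related value, NaN (‘not a number’), results from undefined calculations such as 0/0. The is.na() function returns TRUE for both.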
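A quick sketch of how the two values behave, using a made-up vector `values`:

```r
# 'values' is a hypothetical vector; 0 / 0 produces NaN
values <- c(1, NA, 0 / 0)

is.na(values)    # FALSE  TRUE  TRUE  (NaN also counts as missing)
is.nan(values)   # FALSE FALSE  TRUE  (NA is not NaN)
```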

Now we’ll check if a data set has missing values (using the same data frame). First, let’s introduce some by setting the score (2nd column) of the 1st and 2nd rows to NA:

> dataframe[1:2,2] <- NA

> dataframe
  name score
1  ash    NA
2 jane    NA
3 paul    87
4 mark    91

> is.na(dataframe)   #checks the entire data set for NAs and returns logical output
      name score
[1,] FALSE  TRUE
[2,] FALSE  TRUE
[3,] FALSE FALSE
[4,] FALSE FALSE

> table(is.na(dataframe))   #returns a table of the logical output

FALSE  TRUE
    6     2

> dataframe[!complete.cases(dataframe),]   #returns the rows having missing values
  name score
1  ash    NA
2 jane    NA

Missing values hinder normal calculations in a data set. For example, let’s say we want to compute the mean of score. Since there are two missing values, it can’t be done directly. Let’s see:

> mean(dataframe$score)
[1] NA
> mean(dataframe$score, na.rm = TRUE)
[1] 89

The na.rm = TRUE argument tells R to ignore the NAs. To remove rows with NA values from a data frame, you can use na.omit():

> new_dataframe <- na.omit(dataframe)
> new_dataframe
   name score
3 paul   87
4 mark   91

 

Control Structures in R

As the name suggests, a control structure ‘controls’ the flow of code / commands, typically inside a function. A function is a set of commands written to automate a repetitive coding task.

For example: You have 10 data sets. You want to find the mean of the ‘Age’ column present in every data set. This can be done in 2 ways: either you write the code to compute the mean 10 times, or you simply create a function and pass each data set to it.
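The second approach could look something like the sketch below. The function name `mean_age` and the two tiny data sets are my own inventions, standing in for the ten data sets mentioned above.

```r
# Hypothetical helper: returns the mean of the 'Age' column of a data frame,
# ignoring missing values
mean_age <- function(data) {
  mean(data$Age, na.rm = TRUE)
}

# Two made-up data sets standing in for the ten mentioned above
data_sets <- list(
  data.frame(Age = c(21, 34, 45)),
  data.frame(Age = c(18, NA, 30))
)

# Apply the function to every data set in one go
sapply(data_sets, mean_age)
```

The function is written once and reused for every data set, so a change (say, switching from mean to median) only has to be made in one place.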

Let’s understand the control structures in R with simple examples:

if, else – This structure is used to test a condition. Below is the syntax:

if (<condition>){
         ##do something
} else {
         ##do something
}

Example

#initialize a variable
N <- 10

#check if this variable * 5 is > 40
if (N * 5 > 40){
       print("This is easy!")
} else {
       print ("It's not easy!")
}
[1] "This is easy!"

 

for – This structure is used when a loop is to be executed a fixed number of times. It is commonly used for iterating over the elements of an object (list, vector). Below is the syntax:

for (<variable> in <sequence>){
          #do something
}

Example

#initialize a vector
y <- c(99,45,34,65,76,23)

#print the first 4 numbers of this vector
for(i in 1:4){
     print (y[i])
}
[1] 99
[1] 45
[1] 34
[1] 65
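A small variation, sketched under the assumption that you want to visit every element instead of only the first four: you can loop over the values directly, or over the positions returned by seq_along(). The variable `total` is my own addition.

```r
y <- c(99, 45, 34, 65, 76, 23)

# Iterate over the values directly and add them up
total <- 0
for (value in y) {
  total <- total + value
}
total   # 342

# Iterate over positions; seq_along(y) is 1, 2, ..., length(y)
for (i in seq_along(y)) {
  print(y[i])
}
```

Using seq_along(y) instead of 1:length(y) is a little safer, because it does nothing when the vector is empty.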

 

while – It begins by testing a condition, and executes the loop body only if the condition is found to be true. Once the body has executed, the condition is tested again. Hence, it’s necessary to change something inside the loop so that the condition eventually becomes false; otherwise the loop runs forever. Below is an example:

#initialize a variable
Age <- 12

#loop while Age is less than 17
while(Age < 17){
         print(Age)
         Age <- Age + 1 #increment Age so that the condition eventually becomes FALSE and the loop ends
}
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16

There are other control structures as well, but they are used less frequently than the ones explained above. Those structures are:

  1. repeat – It executes a loop indefinitely, until it is stopped with break
  2. break – It breaks the execution of a loop
  3. next – It skips the rest of the current iteration of a loop
  4. return – It exits a function
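Here is a rough sketch that puts all four together. The names `count`, `printed` and `first_negative` are made up for illustration.

```r
count <- 0
printed <- c()

repeat {
  count <- count + 1
  if (count == 3) next    # skip the rest of this iteration
  if (count > 5) break    # leave the loop
  printed <- c(printed, count)
}
printed   # 1 2 4 5

# Hypothetical helper: returns the first negative number in a vector
first_negative <- function(v) {
  for (x in v) {
    if (x < 0) return(x)  # exit the function as soon as a negative is found
  }
  NULL                    # reached only if no negative number exists
}
first_negative(c(4, 7, -2, 9, -5))   # -2
```

Without the break, the repeat loop would never stop, so a repeat block should always contain an exit condition.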

Note: If you find the section ‘control structures’ difficult to understand, not to worry. R is supported by various packages to complement the work done by control structures.

 

Useful R Packages

Out of the ~7800 packages listed on CRAN at the time of writing, I’ve listed some of the most powerful and commonly used packages in predictive modeling in this article. Since I’ve already explained the method of installing packages, you can go ahead and install them now. Sooner or later you’ll need them.

Importing Data: R offers a wide range of packages for importing data available in any format such as .txt, .csv, .json, .sql etc. To import large files of data quickly, it is advisable to install and use data.table, readr, RMySQL, sqldf, jsonlite.

Data Visualization: R has built-in plotting commands as well. They are good for creating simple graphs, but become complex when it comes to creating advanced graphics. Hence, you should install ggplot2.

Data Manipulation: R has a fantastic collection of packages for data manipulation. These packages allow you to do basic & advanced computations quickly. These packages are dplyr, plyr, tidyr, lubridate, stringr.

Modeling / Machine Learning: For modeling, the caret package in R is powerful enough to cater to most needs when creating a machine learning model. However, you can also install algorithm-specific packages such as randomForest, rpart, gbm etc.

By now, you have become familiar with the basic work style in R and its associated components. From the next section, we’ll begin with predictive modeling. But before you proceed, I want you to practice what you’ve learnt so far.

Practice Assignment: As a part of this assignment, install the ‘swirl’ package in R. Then type library(swirl) to load the package, and complete this interactive R tutorial. If you have followed this article thoroughly, this assignment should be an easy task for you.

Thanks to analyticsvidhya
