How to do cox(Proportional Hazard) regression modelling using R?

Definition: Cox regression (or proportional hazards regression) is a method for investigating the effects of several variable upon the time a specified event takes to happen. In the context of an outcome such as death this is known as Cox regression for survival analysis. The Cox model hazard function calculates the hazard at time t of a subject, adjusted for possible explanatory variables. Cox regression does not assume any particular “survival model” but it is not truly non-parametric.

Why it is not truly non-parametric?

Because cox regression assume that the effects of the predictor variables upon survival are constant over time and are additive in one scale.

The formula is expressed as the product of the baseline hazard function of time and an exponential function of covariates. The baseline hazard is an unspecified form of the Cox model and the distribution of the outcome (survival
time) is unknown. This makes the Cox (PH) regression a semi-parametric model. The semi-parametric property of the Cox (PH) model makes it a robust model which can closely approximate parametric models

If the below assumptions of Cox regression are met, this function will provide better estimates of survival probabilities and cumulative hazard than those provided by the Kaplan-Meier function.

Assumptions of cox regression model:

  • Hazard and Hazard-ratios.
  • Time-dependent and covariates.
  • Model analysis and deviance.
  • Survival and commulative hazard rates.
Hazard- ratios(HR):

The Cox PH assumption states that the Hazard-ratios(HR) for any two individuals in the same study is constant over time.

In other words, the hazard for a subject is proportional to the hazard for another subject in the same study where the proportionality constant, say b is independent of time.

When the HR vary with time, for example where hazards cross or when time varying confounding variables are present the PH assumption maybe violated, making it inappropriate to use the Cox PH model. Where the Cox PH assumption is not met, variations of the Cox model can be used, for example the extended Cox regression or the stratified Cox regression depending on the context.

Time-dependent and Fixed covariates:

When individuals are followed over time, the values of covariates may change with time.

Covariates can thus be divided into fixed and time-dependent.

A covariate is time dependent if the difference between its values for two different subjects changes with time(e.g. blood pressure,cholestrol,smoking).
A covariate is fixed if its values can not change with time(e.g. gender or ethnicity).

Model Analysis and deviance:

What do you mean by “model analysis”? Yes, you guess it right, its the total analysis of your model; e.g. statistical significance( it covers all the hypothesis testing like t-test, chi-square, fisher, at all). Here the likelihood of chi-square model is calculated by comparing the deviance.

Deviance is minus twice the log of the likelihood ratio for models fitted by maximumlikelihood (-2*log(likelihood)).

The value of adding a parameter to a Cox model is tested by subtracting the deviance of the model with the new parameter from the deviance of the model without the new parameter, the difference is then tested against a chi-square distribution with degrees of freedom equal to the difference between the degrees of freedom of the old and new models.

Survival and cumulative hazards rate:

The survival function and the cumulative hazard function are calculated relative to the baseline (lowest value of covariates) at each time point.

If you have binary/dichotomous predictors in your model you are given the option to calculate 
survival and cumulative hazards for each variable separately.

Lets discuss only the assumptions of Kaplan-Meier functions:

  • Censored(*) individuals have the same prospect of survival as those who continue to be followed. This can not be tested for and can lead to a bias(**) that artificially reduces S.
  • Survival prospects are the same for early as for late recruits to the study (can be tested for).
  • The event studied (e.g. death) happens at the specified time. Late recording of the event studied will cause artificial inflation of S.


*  Censorship in survival-time (time-to-event, failure-time) studies refers to incomplete data.

** Bias is a systematic error that leads to an incorrect estimate of effect or association.

I will explain the code which is a inbuilt in R

R-code for cox(ph) regression model:

first install or import package “mgcv”. Get the description of mgcv package.

library(survival) ## to load the data

mgcv‘ is an package, which has gam(Generalised Additive Model), This is similar to glm. For more details on ‘mgcv’ package, please follow the steps

open R -> type ??mgcv

This, the above command will open help page from R documents.

We are using the survival data from ‘survival‘ package, which is an inbuilt in R. Now Move on

col1 <- colon[colon$etype==1,] ## concentrate on single event
col1$differ <- as.factor(col1$differ)
col1$sex <- as.factor(col1$sex)

Changing the characteristics of differ and sex variable for our smoothness.

model <- gam(time~s(age,by=sex)+sex+s(nodes)+perfor+rx+obstruct+adhere,,data=col1,weights=status)

Here(above) is the model formula for gam


Summary of the above model looks like this

Let’s take a look at the plot effects of model


Here is the plot effect of the given data

Now, till now you got the idea about data’s behaviour or what kind of data is what we have.

Now lets’ work for survival plot for particular object here in this data set object is patient and we are calculating the survival plot for patient ‘j‘:

Below is the function for survival analysis.

numP <- 300;j <- 6
newd <- data.frame(time=seq(0,3000,length=numP))
dname <- names(col1)
for (n in dname) newd[[n]] <- rep(col1[[n]][j],numP)
newd$time <- seq(0,3000,length=numP)
fv <- predict(model,newdata=newd,type="response",se=TRUE)

Crude plot of baseline survival

 lines(model$family$data$tr,exp(-model$family$data$h + 2*model$family$data$q^.5),col=2)
 lines(model$family$data$tr,exp(-model$family$data$h - 2*model$family$data$q^.5),col=2)

Here is the plot for above code


Please give your feedback in comments or shoot me a mail




2 thoughts on “How to do cox(Proportional Hazard) regression modelling using R?

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s