Using categorical variables in regression modeling

Like Stata, SAS, SPSS and other statistical platforms, R needs to distinguish categorical variables from continuous ones when fitting regression models.
In R, categorical variables are called factors. A categorical variable that is coded with numbers must be converted to a factor with as.factor(), either before fitting or inside the model formula, so that R does not treat it as continuous.
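The two ways of converting (before fitting, or inside the formula) can be sketched on a small made-up data frame. The names toy, y, group and group_f are hypothetical and are not part of the retinol data.

```r
# Toy data (hypothetical, not the retinol dataset)
toy <- data.frame(
  y     = c(10, 12, 11, 20, 22, 21, 30, 33, 31),
  group = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)

# Option 1: convert inside the model formula; the data frame is unchanged
fit_inline <- lm(y ~ as.factor(group), data = toy)

# Option 2: convert into a new column first, then fit
toy$group_f <- as.factor(toy$group)
fit_before  <- lm(y ~ group_f, data = toy)

is.factor(toy$group)     # FALSE: the original column is still numeric
is.factor(toy$group_f)   # TRUE
levels(toy$group_f)

# Both fits estimate an intercept plus one coefficient per non-reference level
coef(fit_inline)
coef(fit_before)
```

Converting the column once is usually more convenient when the same variable is used in several models.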

Let us illustrate the use of factor variables with the retinol data:

>retinol=read.csv2("C:/filepath/TPretinol.csv")



#running a linear model with the variable tabac not treated as a factor
>lm=lm(retplasma~tabac, data=retinol)

>summary (lm)


Call:
lm(formula = retplasma ~ tabac, data = retinol)

Residuals:
    Min      1Q  Median      3Q     Max 
-435.38 -134.63  -38.36  115.13 1121.13 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  588.848     29.635  19.870   <2e-16 ***
tabac          8.511     16.599   0.513    0.608    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 209.1 on 313 degrees of freedom
Multiple R-squared:  0.0008393, Adjusted R-squared:  -0.002353
F-statistic: 0.2629 on 1 and 313 DF,  p-value: 0.6085


#However, tabac is a categorical variable with 3 levels (1=Never smoker, 2=Ex-smoker, 3=Current smoker). Treating it as continuous is therefore wrong: the single slope forces the change in retplasma from level 1 to 2 to equal the change from level 2 to 3.
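The difference between the two codings can be seen in the design matrix that R builds. This is a minimal sketch on a made-up vector x that uses the same 1/2/3 codes as tabac; it is not the retinol data.

```r
# Made-up codes on the same 1/2/3 scheme as tabac
x <- c(1, 2, 3, 1, 2, 3)

# Numeric: a single column, so one slope per one-unit step in the code
mm_num <- model.matrix(~ x)
mm_num

# Factor: one 0/1 indicator column for each non-reference level
mm_fac <- model.matrix(~ factor(x))
mm_fac
```

With the numeric coding the model has two parameters (intercept and slope); with the factor coding it has three (intercept and two indicators).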

#transforming the variable tabac into a factor
>retinol$tabac=as.factor(retinol$tabac)

>lm1=lm(retplasma~tabac, data=retinol)

>summary (lm1)

Call:
lm(formula = retplasma ~ tabac, data = retinol)

Residuals:
    Min      1Q  Median      3Q     Max 
-428.24 -144.57  -26.31  112.19 1082.76 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   583.31      16.53  35.297   <2e-16 ***
tabac2         60.94      25.41   2.398   0.0171 *  
tabac3        -20.24      35.64  -0.568   0.5706    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 207.1 on 312 degrees of freedom
Multiple R-squared:  0.02372, Adjusted R-squared:  0.01747
F-statistic: 3.791 on 2 and 312 DF,  p-value: 0.02363

By default, R treats the first level as the reference category, so tabac2 and tabac3 are the estimated differences in mean retplasma relative to level 1 (never smokers). It is worth noting that Stata uses the same default when processing a categorical variable.
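The default coding can be inspected directly with contrasts(). The sketch below uses a made-up vector of codes; the level labels follow the description of tabac given earlier, and using factor() with labels is only one optional way to make the output easier to read.

```r
# Made-up smoking codes using the same 1/2/3 scheme as tabac
codes <- c(1, 2, 3, 1, 2, 1)

# factor() with labels gives readable level names
smoke <- factor(codes,
                levels = c(1, 2, 3),
                labels = c("Never", "Ex-smoker", "Current"))

levels(smoke)      # the first level listed is the reference
contrasts(smoke)   # the reference level is the row of zeros
```

With labelled levels, the coefficients in a model summary would be named after the labels instead of the numeric codes.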


#To change the reference category, use the function relevel() with the argument ref="2", which sets level 2 as the new reference. The function contrasts() can then be used to verify that "2" is indeed the new reference (the row of zeros).


>retinol$tabac<-relevel(retinol$tabac, ref="2")

>contrasts (retinol$tabac)

  1 3
2 0 0
1 1 0
3 0 1


#suppose we want level 3 to be the reference category:

>retinol$tabac<-relevel(retinol$tabac, ref="3")

>contrasts (retinol$tabac)


  2 1
3 0 0
2 1 0
1 0 1
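Changing the reference level changes what each coefficient is compared against, but not the fitted model. A minimal sketch on made-up data (toy, y and group are hypothetical names, not the retinol data):

```r
# Toy data (hypothetical): three groups with clearly different means
toy <- data.frame(
  y     = c(10, 12, 11, 20, 22, 21, 30, 33, 31),
  group = factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3))
)

# Default: level "1" is the reference
fit_ref1 <- lm(y ~ group, data = toy)

# relevel() moves the chosen level to the front; the others keep their order
toy$group <- relevel(toy$group, ref = "3")
levels(toy$group)   # "3" "1" "2"
fit_ref3 <- lm(y ~ group, data = toy)

coef(fit_ref1)      # differences relative to group 1
coef(fit_ref3)      # differences relative to group 3

# Same model, different parameterisation: fitted values agree
all.equal(fitted(fit_ref1), fitted(fit_ref3))
```

The intercept of each fit is the mean of y in its reference group, which is why it changes after relevel().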


