In R, categorical variables are referred to as.factor and it is essential to transform a specific numerical variable into a factor before or when specifying the regression model.
Using the retinol data let us illustrate the use of factor variables:
>retinol=read.csv2("C:/filepath/TPretinol.csv")
#running a linear model with the variable tabac not treated as.facor
>lm=lm(retplasma~tabac, data=retinol)
>summary (lm)
Call:
lm(formula = retplasma ~ tabac, data = retinol)
Residuals:
Min 1Q Median 3Q Max
-435.38 -134.63 -38.36 115.13 1121.13
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 588.848 29.635 19.870 <2e-16 ***
tabac 8.511 16.599 0.513 0.608
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 209.1 on 313 degrees of freedom
Multiple R-squared: 0.0008393, Adjusted R-squared: -0.002353
F-statistic: 0.2629 on 1 and 313 DF, p-value: 0.6085
#however, the variable tabac should be considered as a categorical variable with 3 levels (1=Never smoker, 2=Ex-smoker, 3=Current). It is therefore erroneous to treat it as a continuous variable.
#transforming the variable tabac into factor
>retinol$tabac=as.factor(retinol$tabac)
>lm1=lm(retplasma~tabac, data=retinol)
>summary (lm1)
Call:
lm(formula = retplasma ~ tabac, data = retinol)
Residuals:
Min 1Q Median 3Q Max
-428.24 -144.57 -26.31 112.19 1082.76
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 583.31 16.53 35.297 <2e-16 ***
tabac2 60.94 25.41 2.398 0.0171 *
tabac3 -20.24 35.64 -0.568 0.5706
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 207.1 on 312 degrees of freedom
Multiple R-squared: 0.02372, Adjusted R-squared: 0.01747
F-statistic: 3.791 on 2 and 312 DF, p-value: 0.02363
By default R, treated the first category as the reference category; it is worth noting that Stata uses the same default action when processing a categorical variable.
# in order to change the reference category. The function relevel() is used with the instruction ref="2" which fixes the new level of reference, which in this case is set to 2. The function constrasts() can be used again to verify that “2” is indeed the new reference.
>retinol$tabac<-relevel(retinol$tabac, ref="2")
>contrasts (retinol$tabac)
1 3
2 0 0
1 1 0
3 0 1
#suppose we want level 3 to be the reference category:
>retinol$tabac<-relevel(retinol$tabac, ref="3")
>contrasts (retinol$tabac)
2 1
3 0 0
2 1 0
1 0 1