Using splines to check linearity between predictor X and outcome Y {mgcv}

In linear regression analysis, If the predictor X is the numerical variable "calories", then an increase of X from 10 to 30 will have the same impact on Y as an increase from 60 to 80. Similar logic could be applied in a logistic regression or a Poisson regression, where the linearity between X and the Log[prob(y=1)/1-prob(y=1)] (aka logit[prob(Y=1)]) needs to be assessed.

One of the very first steps in using GLM for statistical modeling is to verify the assumption of linearity between all numerical predictors and the outcome. This could be done using a regression spline. A spline is the best smoother that can summarize a relationship between two variables. If it is close to a straight line, the assumption of linearity is verified. If it is monotonic, with only a slight curvature, the assumption can be accepted but the statistical power of the test of association between Y and Xi will be depreciated.

Using the retinol data we are interested to test whether there is a linear relationship between calories and retdiet (amount of retinol consumed):

using the package {mgcv}
>mod<-gam(retdiet~s(calories), data=retinol)
>plot (mod)

It can be observed that the relationship between amount of calories and amount of retinol consumed is not strictly linear.

now let us test the relationship between Y (concentration of plasmatic retinol) and X (amount of retinol consumed)

>mod<-gam(retplasma~s(retdiet), data=retinol)
>plot (mod)


The relationship is clearly linear


*for logit or poisson models, you need to add the option family=binomial or family=poisson to the gam function.

#example
>mod<-gam(binaryvariable~s(age), data=dataset, family=binomial)


ref: Bruno Falissard (The analysis of questionnaire data using R)




Using categorical variables in regression modeling

Similar to Stata, SAS, SPSS and other statistical platforms, R needs to differentiate between a categorical variable and a continuous when running regression models.
In R, categorical variables are referred to as.factor and it is essential to transform a specific numerical variable into a factor before or when specifying the regression model. 

Using the retinol data let us illustrate the use of factor variables:

>retinol=read.csv2("C:/filepath/TPretinol.csv")



#running a linear model with the variable tabac not treated as.facor
>lm=lm(retplasma~tabac, data=retinol)

>summary (lm)


Call:
lm(formula = retplasma ~ tabac, data = retinol)

Residuals:
    Min      1Q  Median      3Q     Max 
-435.38 -134.63  -38.36  115.13 1121.13 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  588.848     29.635  19.870   <2e-16 ***
tabac          8.511     16.599   0.513    0.608    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 209.1 on 313 degrees of freedom
Multiple R-squared:  0.0008393, Adjusted R-squared:  -0.002353 

F-statistic: 0.2629 on 1 and 313 DF,  p-value: 0.6085


#however, the variable tabac should be considered as a categorical variable with 3 levels (1=Never smoker, 2=Ex-smoker, 3=Current). It is therefore erroneous to treat it as a continuous variable. 

#transforming the variable tabac into factor
>retinol$tabac=as.factor(retinol$tabac)

>lm1=lm(retplasma~tabac, data=retinol)

>summary (lm1)

Call:
lm(formula = retplasma ~ tabac, data = retinol)

Residuals:
    Min      1Q  Median      3Q     Max 
-428.24 -144.57  -26.31  112.19 1082.76 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   583.31      16.53  35.297   <2e-16 ***
tabac2         60.94      25.41   2.398   0.0171 *  
tabac3        -20.24      35.64  -0.568   0.5706    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 207.1 on 312 degrees of freedom
Multiple R-squared:  0.02372, Adjusted R-squared:  0.01747 

F-statistic: 3.791 on 2 and 312 DF,  p-value: 0.02363

By default R, treated the first category as the reference category; it is worth noting that Stata uses the same default action when processing a categorical variable.


# in order to change the reference category. The function relevel() is used with the instruction ref="2" which fixes the new level of reference, which in this case is set to 2. The function constrasts() can be used again to verify that “2” is indeed the new reference.


>retinol$tabac<-relevel(retinol$tabac, ref="2")

>contrasts (retinol$tabac)

  1 3
2 0 0
1 1 0

3 0 1


#suppose we want level 3 to be the reference category:

>retinol$tabac<-relevel(retinol$tabac, ref="3")

>contrasts (retinol$tabac)


  2 1
3 0 0
2 1 0
1 0 1



Introduction to the Analysis of Survival Data in the Presence of Competing Risks

 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4741409/