R codes and Programming: July 2020

How to generate simulated data in R

Package {simstudy}

The example below uses the simstudy package to generate a data set of 10 observations on 3 variables: nr, x1 and y1. The function defData lets us define the name, the formula and the distribution of each variable.
Check the CRAN website for additional documentation on the simstudy package.

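```
library(simstudy)

# nr: constant 7; x1: uniform on [10, 20]; y1: normal (the default dist) with mean nr + x1 * 2 and variance 8
def <- defData(varname = "nr", dist = "nonrandom", formula = 7, id = "idnum")
def <- defData(def, varname = "x1", dist = "uniform", formula = "10;20")
def <- defData(def, varname = "y1", formula = "nr + x1 * 2", variance = 8)

dt <- genData(10, def)  # generate 10 observations from the definitions

dt
```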
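To see what these definitions amount to, the same kind of data can be sketched in base R. This is only an illustration: it assumes the default normal distribution for y1, and the seed and the object name dt_base are not part of the original example.

```r
set.seed(123)                        # arbitrary seed, only for reproducibility
n  <- 10
nr <- rep(7, n)                      # "nonrandom": the constant 7
x1 <- runif(n, min = 10, max = 20)   # "uniform" with formula "10;20"
y1 <- rnorm(n, mean = nr + x1 * 2, sd = sqrt(8))  # normal with variance 8

# dt_base is a hypothetical name for the base-R version of the data
dt_base <- data.frame(idnum = seq_len(n), nr, x1, y1)
dt_base
```

The draws will not match those of genData, since the two approaches consume random numbers differently; only the structure is the same.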

How to convert all character columns of a data frame to numeric

```
char_columns <- sapply(merged_data, is.character)  # identify character columns
data_char_num <- merged_data                       # work on a copy of the data
# lapply keeps the data frame structure even when only one column is character
data_char_num[char_columns] <- lapply(data_char_num[char_columns], as.numeric)  # recode character as numeric
sapply(data_char_num, class)                       # print the class of every column
```
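A small self-contained run of the same idea may help. The data frame below is a made-up stand-in for merged_data; note that entries which are not numbers (here "n/a") become NA, and as.numeric would normally warn about this.

```r
# toy stand-in for merged_data (hypothetical values)
merged_data <- data.frame(
  id     = 1:3,
  weight = c("70.5", "82", "n/a"),
  height = c("1.70", "1.82", "1.65"),
  stringsAsFactors = FALSE
)

char_columns  <- sapply(merged_data, is.character)
data_char_num <- merged_data

# suppressWarnings hides the "NAs introduced by coercion" warning for "n/a"
data_char_num[char_columns] <- lapply(
  data_char_num[char_columns],
  function(x) suppressWarnings(as.numeric(x))
)

sapply(data_char_num, class)
data_char_num$weight  # the "n/a" entry is now a missing value
```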

How to make predictions from a linear model

Note: enter numeric values without quotes; for factor variables, enter the level label in quotes.

```
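# one new observation; factor labels must match the levels used when the model was fitted
df <- data.frame(res = "ModHigh", age = 21, sex = "female", train = "clinical", bdi = 4)
pred <- predict(adj.model6, newdata = df)  # predict() only needs newdata, not the original data
print(pred)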
```
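Since adj.model6 and the resilience data are not shown here, the following sketch repeats the idea on the built-in iris data. The model and the new observation are illustrative assumptions, not the original analysis.

```r
# linear model with one numeric and one factor predictor
fit <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# numeric value without quotes, factor level as a quoted label
new_obs <- data.frame(Petal.Length = 4.5, Species = "versicolor")

pred <- predict(fit, newdata = new_obs)
pred

# the same prediction with a 95% confidence interval
predict(fit, newdata = new_obs, interval = "confidence")
```

A label that is not one of the factor's levels in the fitted model makes predict stop with an error, which is a useful check against typos.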

Apply Function in R | R Tutorial 1.15 | MarinStatsLectures

Different uses of tapply

Using tapply to run a function within groups: in this case, calculating the variance of the variable met within each of the categories of the variable coffee consumption (3 categories)

```
tapply(coffee.exercise$met, coffee.exercise$coffee.consumption, var)

```
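A few more uses of tapply, sketched on the built-in mtcars data so they can be run as is (the object names are only for illustration):

```r
# mean and variance of mpg within each number of cylinders
mean_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
var_by_cyl  <- tapply(mtcars$mpg, mtcars$cyl, var)

# two grouping variables give a two-way table of group means
mean_by_cyl_am <- tapply(mtcars$mpg, list(cyl = mtcars$cyl, am = mtcars$am), mean)

# extra arguments are passed on to the function
q90_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, quantile, probs = 0.9)

mean_by_cyl
mean_by_cyl_am
```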
rowMeans, rowSums, colMeans and colSums do the same job as apply with mean or sum. The row versions are pretty similar to Stata's egen rowmean and rowtotal functions.

Below are some examples:
```
colSums(ham, na.rm = TRUE)
colMeans(ham, na.rm = TRUE)
```
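The row versions work the same way. Here is a minimal sketch on a made-up set of item scores (the data frame scores and its values are hypothetical):

```r
# three respondents, three items, with some missing answers
scores <- data.frame(
  q1 = c(1, 2, NA),
  q2 = c(3, NA, NA),
  q3 = c(5, 6, 9)
)

# row mean over the non-missing items (like Stata's egen rowmean)
row_mean <- rowMeans(scores, na.rm = TRUE)

# row total over the non-missing items (like Stata's egen rowtotal)
row_total <- rowSums(scores, na.rm = TRUE)

cbind(scores, row_mean, row_total)
```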

About subsetting data in R

Unlike in Stata and other statistical packages, running a cross tabulation on a subset of the data in R is not very straightforward.
Assume the following scenario: I want to cross tabulate Sex (M, F) by Tobacco (1 - Current, 2 - Ex, 3 - Never), excluding the Never smoking category. In Stata a simple if Tobacco!=3 would suffice. In R, however, we need to subset the data before tabulating it:

```{r}
#subsetting the data
retinol1<-subset(retinol, tabac!=3)
table(retinol1$Sex, retinol1$tabac)
```

However, subsetting can be embedded directly if we are doing univariate analysis:

``` {r}
table(retinol$tabac[retinol$tabac!=3])

```

Update! It turns out that the function xtabs has a subset argument!

```

xtabs(~ Sex + tabac, retinol, subset = tabac != 3)

```
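The two approaches can be compared on a small made-up data set. The data frame retinol_toy below is a hypothetical stand-in for retinol, with tabac stored as a number.

```r
# hypothetical stand-in for the retinol data
retinol_toy <- data.frame(
  Sex   = c("M", "F", "F", "M", "F", "M"),
  tabac = c(1, 2, 3, 3, 1, 2)
)

# subset inside xtabs
tab_xtabs <- xtabs(~ Sex + tabac, data = retinol_toy, subset = tabac != 3)

# subset first, then tabulate
tab_table <- with(subset(retinol_toy, tabac != 3), table(Sex, tabac))

tab_xtabs
tab_table
```

If tabac were stored as a factor, the excluded level would still show up as a column of zeros; xtabs has a drop.unused.levels argument for that case.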

How to conduct LOCF imputation in R; library(zoo)

Let us assume that weight has been measured on 624 patients at 4 distinct time points: M0, M1, M3 and M6.

``` {r}
library(zoo)

# start by binding the variables we want to use for the imputation into one matrix
WeightImpute <- cbind(MetSData$POIDS_M0, MetSData$POIDS_M1, MetSData$POIDS_M3, MetSData$POIDS_M6)

# then rename the columns
colnames(WeightImpute) <- c("w0", "w1", "w3", "w6")

# create a copy of the matrix that will receive the imputed values
WeightImputeF <- WeightImpute

# number of rows in the matrix (624)
n <- nrow(WeightImpute)

# index of the rows with a non-missing baseline weight (w0); only these rows are imputed.
# na.locf drops leading NAs by default, so a row with a missing w0 would come back too short
index <- which(!is.na(WeightImpute[, 1]))

# for loop using the na.locf function from library(zoo) to carry out the LOCF.
# ATTENTION: `i` is in the row position, so the imputation runs along each row
# (one patient at a time, across the time points)
for (i in index) {
  WeightImputeF[i, ] <- na.locf(WeightImpute[i, ])
}

WeightImputeF
```
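To make the row-wise logic visible without the MetSData file, here is a base-R sketch of LOCF on a made-up weight matrix. It does not use zoo: locf_row is a hypothetical helper written for this illustration, and unlike the loop above it also fills rows whose baseline value is missing (the leading NA simply stays NA).

```r
# hypothetical helper: carry the last observed value forward within one vector
locf_row <- function(x) {
  x <- unname(x)
  # position of the most recent non-missing value (0 if none seen yet)
  last_seen <- cummax(seq_along(x) * !is.na(x))
  # shift by one so that position 0 maps to NA
  c(NA, x)[last_seen + 1]
}

# made-up weights for three patients at M0, M1, M3 and M6
weights <- rbind(
  c(70, NA, 68, NA),
  c(82, 81, NA, NA),
  c(NA, 65, NA, 64)
)
colnames(weights) <- c("w0", "w1", "w3", "w6")

# apply over rows returns the result transposed, hence t()
weights_locf <- t(apply(weights, 1, locf_row))
dimnames(weights_locf) <- dimnames(weights)

weights_locf
```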

Labeling points on scatter plot

In the example below I show how to assign a label (in this case Casenr) to each data point on a scatter plot.

```
plot(BMI~Age, data=prevend.sample)
text(BMI~Age, labels=Casenr, data=prevend.sample)
```

Recode using the ifelse command

The code below creates a new variable "hdrs1" which takes the values of HAMD17tot_M0 if visit=0, the values of HAMD17tot_M1 if visit=1 and the values of HAMD17tot_M3 if visit=3.

I start by creating a vector that contains the values of HAMD17tot_M0, then I modify the vector using the ifelse command: if visit=1 then replace the values by those of HAMD17tot_M1, else keep them as they are. The same step is then repeated for visit=3.

```
romain1$hdrs1<-romain1$HAMD17tot_M0
romain1$hdrs1<-ifelse(romain1$visit==1,romain1$HAMD17tot_M1, romain1$hdrs1)
romain1$hdrs1<-ifelse(romain1$visit==3,romain1$HAMD17tot_M3, romain1$hdrs1)
```
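The same recode can also be written as one nested ifelse. The sketch below uses a made-up data frame (visits, with invented scores) so that the result can be checked by eye.

```r
# hypothetical data: one row per record, with the score at each visit
visits <- data.frame(
  visit        = c(0, 1, 3, 1),
  HAMD17tot_M0 = c(10, 12, 14, 16),
  HAMD17tot_M1 = c(20, 22, 24, 26),
  HAMD17tot_M3 = c(30, 32, 34, 36)
)

# visit 1 -> M1 score, visit 3 -> M3 score, anything else -> M0 score
visits$hdrs1 <- ifelse(visits$visit == 1, visits$HAMD17tot_M1,
                ifelse(visits$visit == 3, visits$HAMD17tot_M3,
                       visits$HAMD17tot_M0))

visits
```

The step-by-step version is easier to extend with more visits; the nested version keeps the whole rule in one statement.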

Running a command on a subset of data (similar to the if condition in Stata)

```
hist(coffee.exercise$met[coffee.exercise$coffee.consumption=="A"])
hist(coffee.exercise$met[coffee.exercise$coffee.consumption=="B"])
hist(coffee.exercise$met[coffee.exercise$coffee.consumption=="C"])
hist(coffee.exercise$met[coffee.exercise$coffee.consumption=="D"])
```

How to remove NA as a level

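Here the missing values are stored as the text "NA", which R treats as a regular factor level. By matching "NA" in quotes and replacing it with NA (without quotes), we transform it into a real missing value in the R factor vector; re-creating the factor then drops the "NA" level.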

```
HIV_coded$bisexual<- factor(replace(HIV_coded$bisexual, HIV_coded$bisexual == "NA", NA))
```
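A quick check on a made-up vector shows the effect (the values below are invented; only the pattern matches the HIV_coded example):

```r
# "NA" typed as text becomes an ordinary level
bisexual <- factor(c("yes", "NA", "no", "NA", "yes"))
levels(bisexual)        # includes the level "NA"
sum(is.na(bisexual))    # no real missing values yet

# replace the text "NA" by a real NA and rebuild the factor
bisexual_fixed <- factor(replace(bisexual, bisexual == "NA", NA))
levels(bisexual_fixed)
sum(is.na(bisexual_fixed))
```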

Writing loops in R to loop over functions

The below vector is used to loop over 3 columns that I want to cross tabulate against sexual identity:

```{r}
l=c("nationality", "sex_at_birth", "education")
for (i in l){
mytables<-table(HIV_coded$sexual_identity, HIV_coded[,i])
print(mytables)
}
```

The below loop is used to calculate the summary statistics for a series of variables:
```
l=c("age", "age_1")
for (i in l) {
mymeans<-summary(HIV_coded[,i])
print(mymeans)
}
```
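As an alternative to explicit for loops, the same tasks can be sketched with sapply and lapply. The example uses the built-in mtcars data instead of HIV_coded, so the variable names are stand-ins.

```r
# summary statistics for several numeric columns at once
vars <- c("mpg", "wt")
summaries <- sapply(mtcars[vars], summary)
summaries

# one cross tabulation per column, collected in a named list
by_vars <- c("cyl", "gear")
tables <- lapply(by_vars, function(v) table(mtcars$am, mtcars[[v]]))
names(tables) <- by_vars
tables
```

Unlike the loops above, these versions keep the results in objects (summaries, tables) that can be reused later instead of only printing them.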
