How To Format Data for Analysis

How To Format Your Data for Treatment Comparison with Model Estimated Controls

Introduction

To use the psc package you need to have two things:

  • A predictive statistical model in the form of a flexible parametric model, which can be used to predict the outcomes that patients would have had if they had received the control treatment.

  • A data cohort of patients who were treated with the experimental treatment.

It is important that your data is formatted correctly before using the psc package.

Data

Let’s load in our data and view it…

load("espac4gemcap.R")
head(espac4_gemcap)
   ResecM LymphN Diff_Status  treat       time cen PostOpCA199
2       0      1           2 GEMCAP   8.707019   1    1.383791
5       0      0           0 GEMCAP  49.277267   1    0.000000
8       0      1           1 GEMCAP   6.735929   1    1.824549
9       1      0           1 GEMCAP  21.911958   1    1.131402
10      1      1           0 GEMCAP  23.554534   1    5.855931
12      1      0           0 GEMCAP 118.628176   0    2.595255

We should also load in the model and view the data that was used to build the model…

load("flsm.R")
head(flsm$data$m)
   Surv(time, cen) LymphN ResecM Diff_Status PostOpCA199 (weights)
18       11.300920      1      0           0    6.222576         1
19       14.816032      1      0           1    4.394449         1
28       17.181340      1      0           2    9.314430         1
40        9.789750      1      0           0    4.077537         1
47        5.256242      1      1           0    2.772589         1
52       23.028909      1      0           1    8.398860         1

Note that the time and event variables are named ‘time’ and ‘cen’ in both the model and the data cohort. This matching is essential when we come to make our comparison.
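If the time and event columns in your cohort carry different names, they can be renamed with base R. The following is a minimal sketch on a made-up data frame; the column names 'os_months' and 'status' are hypothetical and are not taken from the ESPAC-4 data.

```r
# Toy cohort whose time and event columns use hypothetical names
cohort <- data.frame(
  os_months = c(8.7, 49.3, 6.7),
  status    = c(1L, 1L, 0L),
  LymphN    = factor(c(1, 0, 1))
)

# Rename them to the names used by the model: 'time' and 'cen'
names(cohort)[names(cohort) == "os_months"] <- "time"
names(cohort)[names(cohort) == "status"]    <- "cen"

names(cohort)
```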

We should also check that, in the data cohort, the categorical variables have class factor and are dummy encoded, and that the continuous variables have class numeric.

str(espac4_gemcap)
'data.frame':   362 obs. of  7 variables:
 $ ResecM     : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 1 2 2 ...
 $ LymphN     : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 2 1 1 1 ...
 $ Diff_Status: Factor w/ 3 levels "0","1","2": 3 1 2 2 1 1 1 2 3 2 ...
 $ treat      : chr  "GEMCAP" "GEMCAP" "GEMCAP" "GEMCAP" ...
 $ time       : num  8.71 49.28 6.74 21.91 23.55 ...
 $ cen        : int  1 1 1 1 1 0 0 1 0 0 ...
 $ PostOpCA199: num  1.38 0 1.82 1.13 5.86 ...
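The same check can be made programmatically rather than by reading the str() output. The sketch below uses a small made-up data frame (the values are illustrative only) and collects the first class of each column into a named vector.

```r
# Toy data frame with one factor, two numeric and one integer column
cohort <- data.frame(
  ResecM      = factor(c(0, 1, 1)),
  PostOpCA199 = c(1.38, 0, 1.82),
  time        = c(8.7, 49.3, 6.7),
  cen         = c(1L, 1L, 0L)
)

# First class of each column, returned as a named character vector
col_class <- vapply(cohort, function(x) class(x)[1], character(1))
col_class

# Flag columns that are neither factor nor numeric (integers count as numeric)
bad_cols <- names(cohort)[!vapply(cohort, function(x) is.factor(x) || is.numeric(x), logical(1))]
bad_cols
```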

If a variable does not have the correct class, it can be converted:

espac4_gemcap$LymphN <- as.factor(espac4_gemcap$LymphN)
espac4_gemcap$PostOpCA199 <- as.numeric(espac4_gemcap$PostOpCA199)
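One caveat with this kind of conversion: if a continuous variable has been stored as a factor, as.numeric() returns the internal level codes rather than the recorded values. A small illustration on a made-up vector (not taken from the data above):

```r
# Toy vector: a continuous measurement accidentally stored as a factor
x <- factor(c("1.38", "0", "1.82"))

# Returns the internal level codes, not the recorded values
codes <- as.numeric(x)

# Converting via character first recovers the recorded values
values <- as.numeric(as.character(x))

codes
values
```

Converting through as.character() first avoids this problem.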

It is also important to ensure that the variables present in the model are also present in the comparison dataset, and that their names match exactly:

names(flsm$data$m[,2:5]) %in% names(espac4_gemcap)
[1] TRUE TRUE TRUE TRUE
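When one of these values is FALSE, it helps to see which variable is missing. A minimal sketch using setdiff(), with a hand-written vector of covariate names and a made-up cohort standing in for the real objects:

```r
# Toy stand-ins: covariate names required by the model, and a cohort missing one
model_vars <- c("LymphN", "ResecM", "Diff_Status", "PostOpCA199")
cohort <- data.frame(
  LymphN      = factor(c(1, 0)),
  ResecM      = factor(c(0, 1)),
  PostOpCA199 = c(1.38, 0)
)

# Names required by the model but absent from the cohort
missing_vars <- setdiff(model_vars, names(cohort))
missing_vars
```

An empty result means every model variable is present in the cohort.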

Now that we have checked the correct variables are in our dataset and that they are encoded correctly, we are ready to use the psc package to make a comparison of two treatments!

Take homes

  • Ensure you have a statistical model and a data cohort.

  • Make sure variables are correctly encoded.

  • Variable names should be consistent in the model and the data cohort.