**Welcome to the PCERC Biostatistics resources page.**

**Here we provide information related to five major biostatistics topics in Comparative Effectiveness research.**

- Who would you like to study?
- Who are you trying to sample?
- Who are you able to sample?
- Who will be approached for inclusion in the study?
- Who will enroll?

- How will the outcome be measured?
- What type of variable is the outcome?

- How will the predictor be measured?
- What type of variable is the predictor?

- A confounder is a shared common cause of both the explanatory variable and the outcome variable

- When a confounder is present, the basic association between the explanatory variable and the outcome variable is biased

- What are potential confounders?
- How will these be measured?

- An effect modifier is a variable that changes the association between the explanatory variable and the outcome variable

- Association between predictor and outcome in entire sample might not show true effect of predictor

- What are potential effect modifiers?
- How will these be measured?

**General:** Our study will determine if ________ (explanatory variable) is associated with __________ (outcome variable) in _____________ (population)

**Example:** Our study will determine if vitamin D supplementation is associated with the time to next relapse in relapsing-remitting MS patients

**General:** Our study will determine if ________ (explanatory variable) is associated with __________ (outcome variable) accounting for __________ (confounders) in _____________ (population)

**Example:** Our study will determine if vitamin D supplementation is associated with time to next relapse accounting for age, gender and treatment in relapsing-remitting MS patients

**General:** We will assess if the association between ________ (explanatory variable) and __________ (outcome variable) changes based on __________ (effect modifiers) in _____________ (population)

**Example:** We will assess if the association between vitamin D supplementation and time to next relapse changes based on HLA DRB1501 status in relapsing-remitting MS patients

In order to perform an analysis, we must collect data and prepare the dataset for analysis. In this section, we will discuss creating a dataset in Excel that will be easy to import into a statistical package like R or Stata. Then, we will describe approaches for reading data into R and Stata from a comma delimited file or an Excel spreadsheet.

A common scenario in clinical research is to create the initial dataset using Microsoft Excel or a database package like Microsoft Access. In either case, here are four rules for setting up your dataset in Excel or your database program (Access, Oracle) to ensure that you can easily transfer your data into a statistical software package.

- Use only the first row for variable names. Make each of the variable names unique, descriptive, short (less than 32 characters), and without any special symbols. Do not use quotes, commas, apostrophes, or any of the symbols associated with the numbers on the keyboard (!, @, #, $, %, ^, &, *). Also, do not start any of the variable names with a number.
- Do not use quotes, commas, apostrophes, or any of the symbols associated with the numbers on the keyboard (!, @, #, $, %, ^, &, *) in any of the entries in your dataset. If a patient’s name is Daniel O’Brien, type this as Daniel OBrien.
- Do not have open space in your dataset. This means that there should be no empty columns or rows.
- Do not perform calculations within your analysis dataset. Often researchers will calculate the mean of a variable at the bottom of a column; such summary rows should be left out of the analysis dataset.

As an example of how to set up a dataset that can be imported easily into Stata, please open the Practice_data.xlsx file. The information on the “Original data” spreadsheet would not be easily imported into a statistical software package because several of the rules from above are violated.

- The first row is blank and the dataset starts several rows down. This blank space makes the Excel spreadsheet easier to look at, but it is hard to read this format into a statistical software package.
- The treated and untreated groups are separated by several rows and there is no variable indicating which patients were in each group.
- The variable names for several variables require more than one row.
- The data for the first time point and the second time point are colored, which will not be imported into a statistical software package.
- The means of variables are calculated within the dataset.

To create a spreadsheet that can easily be imported into a statistical software package, we have created an additional spreadsheet called “Data for analysis”. In this dataset, several important things have changed.

- The name and social security number of the participants have been removed. Since this information is not likely needed for analysis, we do not need it in the analysis spreadsheet. If it is required for analysis, we can include it.
- The first row contains the variable names for each column. Note that the variable names had to be extended so that we know the second column is Time1Factor1 rather than having two rows for variable names. Each variable name is unique, descriptive and as short as possible.
- We have placed the data for each group together with no extra space. In order to indicate which patients were treated and which patients were untreated, we have added a new variable called Treatment (=0 for untreated, =1 for treated).
- We have removed the mean of the variables. This information is easily calculated in a statistical software package.
- Note that there are no special characters in our dataset.

In addition to the dataset for analysis, it is often advisable to have a separate sheet that serves as a codebook and provides labels for the variables. This sheet also explains how dichotomous or categorical variables are coded and provides units for continuous variables. Please check the “Code” sheet.
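The variable-name rules above can be checked automatically before a dataset is handed to a statistical package. The following is a minimal Python sketch, not part of the original materials. The function name `check_variable_names` and the example header are hypothetical, and the sketch assumes that only letters, digits and underscores are acceptable in a name, which is stricter than the rules as written.

```python
import re


def check_variable_names(names):
    """Return a list of rule violations found in a header row of variable names."""
    problems = []
    seen = set()
    for name in names:
        if not name:
            problems.append("empty variable name")
            continue
        if name in seen:
            problems.append(f"{name!r}: duplicate name")
        seen.add(name)
        if len(name) >= 32:
            problems.append(f"{name!r}: not shorter than 32 characters")
        if name[0].isdigit():
            problems.append(f"{name!r}: starts with a number")
        # Assumption: anything other than letters, digits and underscores is
        # treated as a special symbol (this also flags spaces).
        if not re.fullmatch(r"[A-Za-z0-9_]+", name):
            problems.append(f"{name!r}: contains a space or special symbol")
    return problems


# Hypothetical header row; only Treatment and Time1Factor1 come from the text.
header = ["ID", "Treatment", "Time1Factor1", "2ndVisit", "Score#1", "Treatment"]
for problem in check_variable_names(header):
    print(problem)
```

A header that passes the check returns an empty list, so the same function can be used as a quick gate before exporting the spreadsheet.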

*Importing data from .csv files*

To import data from a .csv file into Stata, we can use the drop-down menus by going to *File/Import/Text data (delimited, *.csv, …)*. Note that Stata can import data in other formats, but we will focus on this command initially. The following window will open. First, if we know the location of the dataset, we can just type the path in the “File to import” box (blue arrow), but often we must click “Browse” to find the dataset we would like to import (red arrow).

After we click “Browse”, the following window opens. At the top of the menu, we must select the location of the file we would like to import (blue arrow). Then, we must choose the file type of the file we would like to import (red arrow). In this case, we will choose “Text files”, of which .csv is one of the options. Next, we must choose the file we would like to import, and the file name will appear in the “File name” box (green arrow). Finally, we click Open (orange arrow).

After we click Open, the location and file name have been placed in the “File to import” box (blue arrow). Then, we choose the delimiter, which is how the columns are separated in the dataset. Since we have a comma delimited file, we choose the “Comma” option (red arrow). Note that Stata can try to figure out the delimiter automatically, but it is best to specify the delimiter when possible. We should also state that the first row is variable names (green arrow). Finally, note that Stata provides a preview of the dataset so that we can determine if the data are being imported correctly (orange arrow). Then, we click OK.

We can tell that the dataset has been imported into Stata because the variables now appear in the Variables window. Please open the Data Editor as well to check that the dataset has been imported. In the Results and Review windows, we can see the Stata command used to import the data was the *import* command:

. import delimited "C:\Users\Brian\Desktop\Lecture notes\Practicum folder\2012_2013\Practice_data.csv", delimiter(comma) varnames(1)

(8 vars, 12 obs)

After the *import* command (blue arrow), the location of the file is listed (red arrow) and the delimiter is shown using the *comma* option (green arrow). Stata also tells us that 8 variables and 12 observations were imported. The variables correspond to the columns and the observations correspond to the rows. If this information does not match what you expected, your *import* command may not have worked properly.
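The same check of variables and observations can be made outside Stata. As an illustration only, here is a small Python sketch that reads comma-delimited text with the standard `csv` module and reports the counts in the same form as the Stata message. The three-column dataset is a hypothetical stand-in, not the contents of Practice_data.csv; a real analysis would open the file itself.

```python
import csv
import io

# Hypothetical stand-in for a comma-delimited file. With a real file we would
# use open(path, newline="") instead of io.StringIO.
csv_text = """ID,Treatment,Time1Factor1
1,0,4.2
2,0,3.9
3,1,5.1
"""

rows = list(csv.reader(io.StringIO(csv_text), delimiter=","))
variable_names = rows[0]   # the first row holds the variable names
observations = rows[1:]    # every later row is one observation

n_vars = len(variable_names)
n_obs = len(observations)

# Rows whose length differs from the header often signal a stray comma
# or an empty cell at the end of a line.
ragged_rows = [
    line_number
    for line_number, row in enumerate(observations, start=2)
    if len(row) != n_vars
]

print(f"({n_vars} vars, {n_obs} obs)")
print("rows with an unexpected number of columns:", ragged_rows)
```

If the counts differ from what the spreadsheet shows, the delimiter or the header row is the first thing to check.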

*Importing data from .xls files*

Please type *clear* so that we can import another dataset. Rather than saving data as a comma delimited file, we can import data directly from Excel starting with Stata version 12. In Stata version 13 or later, we can use the drop-down menus by going to *File/Import/Excel spreadsheet (*.xls, *.xlsx)*. The following window will open. First, if we know the location of the dataset, we can just type the path in the “Excel file” box (blue arrow), but often we must click “Browse” to find the dataset we would like to import (red arrow).

After we click “Browse”, the following window opens. At the top of the menu, we must select the location of the file we would like to import (blue arrow). Then, we must choose the file type of the file we would like to import (red arrow). In this case, we will choose “Excel Files (*.xls, *.xlsx)”. Next, we must choose the file we would like to import, and the file name will appear in the “File name” box (green arrow). Finally, we click Open (orange arrow).

After we click Open, the location and file name have been placed in the “Excel file” box (blue arrow). Then, we choose the worksheet that we would like to import (red arrow). In this case, we would like to import the “Data for analysis” worksheet. We must also state that the first row is variable names (green arrow). Finally, note that Stata provides a preview of the dataset so that we can determine if the data are being imported correctly (orange arrow). Then, we click OK.

We can tell that the dataset has been imported into Stata because the variables now appear in the Variables window. Please open the Data Editor as well to check that the dataset has been imported. In the Results and Review windows, we can see the Stata command used to import the data was the *import* command:

. import excel "C:\Users\Brian\Desktop\Lecture notes\Practicum folder\2012_2013\Practice_data.xlsx", sheet("Data for analysis") firstrow

(8 vars, 12 obs)

An important step in a comparative effectiveness study is to calculate the sample size required to complete the study. In general, there are five components of the sample size calculation:

- Type I error rate (alpha level)
- Standard deviation of the outcome
- Null and alternative hypothesis or difference between the means
- Sample size
- Power

The *type I error rate (alpha level)* is the probability of rejecting the null hypothesis given that the null hypothesis is true. The type I error rate is almost always chosen to be 0.05 based on a two-sided test, but other values can be used. The *power* is the probability of rejecting the null hypothesis given that the alternative hypothesis is true. The power is usually set at 0.8 or 0.9 when we are performing a sample size calculation.

We need to know four of these five pieces of information to calculate the final piece. As an example, if we would like to calculate the sample size for a study, we need to specify the type I error rate, standard deviation of the outcome, difference between the means, and power.

When we change any of the four features and keep the other three the same, the sample size will change in a specific way.

| Change | Effect on the required sample size |
| --- | --- |
| When the difference between the groups increases | The required sample size will decrease |
| When the type I error rate (alpha level) increases | The required sample size will decrease |
| When the power increases | The required sample size will increase |
| When the standard deviation increases | The required sample size will increase |
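To make these directions concrete, the sketch below uses the standard normal-approximation formula for comparing two means with equal group sizes, $n = 2(z_{1-\alpha/2} + z_{\text{power}})^2 \sigma^2 / \Delta^2$ per group. This is one common textbook formula and not necessarily the one used by any particular calculator or software package; the function name and the input values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(delta, sd, alpha=0.05, power=0.80):
    """Subjects per group for a two-sided comparison of two means
    (normal approximation, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


# Hypothetical baseline: difference of 0.5, standard deviation of 1.
baseline = sample_size_per_group(delta=0.5, sd=1.0)
print("baseline:", baseline)
print("larger difference:", sample_size_per_group(delta=0.75, sd=1.0))
print("larger alpha:", sample_size_per_group(delta=0.5, sd=1.0, alpha=0.10))
print("larger power:", sample_size_per_group(delta=0.5, sd=1.0, power=0.90))
print("larger SD:", sample_size_per_group(delta=0.5, sd=1.5))
```

Changing one input at a time while holding the others fixed reproduces each direction listed above.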

Similarly, we can calculate the power by specifying the other four features. When we change any of the four features and keep the other three the same, the power will change in a specific way.

| Change | Effect on the power |
| --- | --- |
| When the difference between the groups increases | The power will increase |
| When the type I error rate (alpha level) increases | The power will increase |
| When the sample size increases | The power will increase |
| When the standard deviation increases | The power will decrease |
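The power calculation can be sketched in the same way. The function below is a hedged illustration that again assumes a two-sided comparison of two means under the normal approximation with equal group sizes, and it ignores the negligible chance of rejecting in the wrong direction. The function name and the inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(n_per_group, delta, sd, alpha=0.05):
    """Approximate power of a two-sided comparison of two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sd * sqrt(2 / n_per_group)
    return NormalDist().cdf(abs(delta) / standard_error - z_alpha)


# Hypothetical baseline: 63 subjects per group, difference 0.5, SD 1.
base_power = power_two_means(n_per_group=63, delta=0.5, sd=1.0)
print(f"baseline power: {base_power:.3f}")
print(f"larger difference: {power_two_means(63, 0.75, 1.0):.3f}")
print(f"larger alpha: {power_two_means(63, 0.5, 1.0, alpha=0.10):.3f}")
print(f"larger sample: {power_two_means(100, 0.5, 1.0):.3f}")
print(f"larger SD: {power_two_means(63, 0.5, 1.5):.3f}")
```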

Every statistical software package has commands to perform sample size or power calculations. There are also several on-line calculators to perform these calculations including:

MGH Biostatistics Center: http://hedwig.mgh.harvard.edu/sample_size/size.html

Power and Sample size website: http://powerandsamplesize.com/

In many instances, we are interested in performing sample size calculations for non-inferiority trials, and these websites provide sample size calculations for these designs.

Binary/dichotomous outcome: https://www.sealedenvelope.com/power/binary-noninferior/

Continuous outcome: https://www.sealedenvelope.com/power/continuous-noninferior/

In comparative effectiveness research, the goal is commonly to estimate the association between some explanatory variable (often a treatment or some other intervention) and the outcome variable. The first step in many analyses is to identify what type of data the main explanatory variable and main outcome variable are because this can lead the analyst to choose a reasonable measure of the association and statistical test. Below, we provide a table showing different types of variables for the outcome and explanatory variable that are common in comparative effectiveness research with a commonly used measure of association and test.

| Outcome variable | Explanatory variable | Measure | Test |
| --- | --- | --- | --- |
| Continuous | None | Mean of single sample | One sample t-test |
| Continuous | Dichotomous | Difference in means between groups | Two sample t-test |
| Continuous | Categorical | Difference in means between groups | One-way ANOVA |
| Continuous | Continuous | Correlation or linear regression coefficient | Test for correlation coefficient or linear regression |
| Dichotomous | None | Proportion of single sample | One sample test for proportion |
| Dichotomous | Dichotomous | Difference in proportions, relative risk, or odds ratio | Chi-squared test |
| Dichotomous | Categorical | Difference in proportions | Chi-squared test |
| Dichotomous | Continuous | Logistic regression coefficient or odds ratio | Logistic regression |
| Time to event | None | Kaplan-Meier curve | |
| Time to event | Dichotomous | Hazard ratio | Log-rank test or Cox proportional hazards model |
| Time to event | Categorical | Hazard ratio | Log-rank test or Cox proportional hazards model |
| Time to event | Continuous | Hazard ratio | Cox proportional hazards model |
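As one worked instance from the table, the case of a dichotomous outcome with a dichotomous explanatory variable can be sketched in a few lines of Python. The counts are hypothetical, the function name is ours, and the test shown is the Pearson chi-squared test without continuity correction, whose p-value with 1 degree of freedom can be written with the complementary error function.

```python
from math import erfc, sqrt


def two_by_two_summary(a, b, c, d):
    """Measures of association for a 2x2 table.

    a, b: subjects with / without the outcome in the exposed group
    c, d: subjects with / without the outcome in the unexposed group
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    n = a + b + c + d
    # Pearson chi-squared statistic (no continuity correction).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return {
        "risk_difference": risk_exposed - risk_unexposed,
        "relative_risk": risk_exposed / risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
        "chi2": chi2,
        # Upper tail of a chi-squared distribution with 1 degree of freedom.
        "p_value": erfc(sqrt(chi2 / 2)),
    }


# Hypothetical counts: 30 of 100 exposed and 15 of 100 unexposed have the outcome.
summary = two_by_two_summary(30, 70, 15, 85)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```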

For information about how to perform these analyses using several statistical software packages, please see this website:

After we choose the appropriate measure of association and statistical test, we will use a statistical software package to provide:

- Point estimate of the measure of association
- 95% confidence interval for the estimate
- P-value for a hypothesis test.

The point estimate is the estimate of the measure of association from the available data. When we have a dichotomous explanatory variable and a continuous outcome, the point estimate will be the estimated difference in the means between the two groups.

The 95% confidence interval and p-value for a hypothesis test provide complementary information for performing statistical inference. When we perform a study, we are usually taking a sample of subjects from a larger population, and we hope to make a statement about a specific characteristic or parameter from the larger population using our sample. The 95% confidence interval and p-value provide us with two ways to make statistical inferences.

A confidence interval provides a range of plausible values for the population characteristic or parameter based on our data. The confidence interval is constructed so that if we took many repeated samples from the population and calculated the 95% confidence interval for each of the repeated samples, 95% of the intervals would cover the true population parameter. For our example with a dichotomous explanatory variable and continuous outcome, the 95% confidence interval will provide the range of plausible values for the difference in the means.
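The repeated-sampling interpretation can be illustrated with a small simulation. The sketch below is an assumption-laden example rather than part of the original material: it draws many samples from a normal population with a known mean, builds a normal-approximation 95% confidence interval for a single mean from each sample, and counts how often the interval covers the true value. The population values, sample size and number of repetitions are all hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def simulate_coverage(reps=2000, n=100, true_mean=10.0, true_sd=2.0, seed=2024):
    """Fraction of normal-approximation 95% intervals that cover the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)
    covered = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        estimate = mean(sample)
        half_width = z * stdev(sample) / n ** 0.5
        if estimate - half_width <= true_mean <= estimate + half_width:
            covered += 1
    return covered / reps


coverage = simulate_coverage()
print(f"proportion of intervals covering the true mean: {coverage:.3f}")
```

The proportion should land close to 0.95, with some simulation noise and a small shortfall from using the normal rather than the t distribution.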

A confidence interval is centered at the point estimate and the width of the confidence interval is related to the variability in the data, the sample size and desired level of confidence. When a confidence interval is narrow, the range of plausible values for the population parameter is small. We can narrow the width of the confidence interval by:

- Decreasing the variability
- Increasing the sample size
- Decreasing our level of confidence

When we have a specific null hypothesis in mind that we are interested in testing, we can calculate a p-value. The p-value is the probability of the observed result or something more extreme given that the null hypothesis is true. For our example with a dichotomous explanatory variable and continuous outcome, the null hypothesis could be that the mean in the two groups was the same. Therefore, the p-value would be the probability of the observed difference in the means or something more extreme given that the groups had the same mean.
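The three quantities described above can be computed together for the running example of a dichotomous explanatory variable and a continuous outcome. The sketch below works from summary statistics and uses a large-sample normal approximation in place of the two sample t-test, so its interval and p-value will differ slightly from t-based software output in small samples. The function name and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def compare_means(mean1, sd1, n1, mean0, sd0, n0, level=0.95):
    """Point estimate, confidence interval and two-sided p-value for a
    difference in means (large-sample normal approximation)."""
    estimate = mean1 - mean0
    se = sqrt(sd1 ** 2 / n1 + sd0 ** 2 / n0)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    ci = (estimate - z_crit * se, estimate + z_crit * se)
    # Null hypothesis: the two groups have the same mean.
    p_value = 2 * (1 - NormalDist().cdf(abs(estimate) / se))
    return estimate, se, ci, p_value


# Hypothetical summary statistics for the treated and untreated groups.
estimate, se, ci, p_value = compare_means(5.0, 2.0, 50, 4.0, 2.0, 50)
print(f"point estimate: {estimate:.2f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"p-value: {p_value:.4f}")
```

In this hypothetical case the interval excludes zero and the p-value is below 0.05, which shows how the two forms of inference agree.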

In comparative effectiveness research, the goal is often to assess the effect of a dichotomous predictor or exposure on an outcome of interest. The gold standard in this type of research is the randomized clinical trial. In this study design, we randomize patients to receive either the exposure of interest or a placebo/standard of care, and we can compare the two groups in terms of an outcome. Since the patients are randomized to treatment, we know that the treatment groups are similar in terms of all factors including the measured and unmeasured confounders. Another term for this is that the two treatment groups are **exchangeable** because the two groups could be exchanged for one another since the only difference is the assigned treatment. As a consequence of the randomization, the statistical analysis that we use to compare two treatment groups can be a two sample t-test or chi-squared test described in the “Hypothesis testing/Estimation” section because the only factor that is leading to a difference between the groups is the treatment.

To provide a more formal description, we can introduce a new concept called potential outcomes or counterfactuals. The potential outcomes or counterfactual framework assesses the effect of a treatment by comparing the value of the outcome when the patient was treated ($Y_{T=1}$) to the value of the outcome when the patient was untreated ($Y_{T=0}$). The quantity of interest is the average of this difference across patients:

$$E[Y_{T=1} - Y_{T=0}] \tag{1}$$

Unfortunately, we cannot observe the same patient both on treatment and off treatment because each patient can only receive one treatment. (Note it is possible for a patient to be observed on and off treatment if we use a pre/post design, but this design has problems associated with carry-over effects and natural changes in the disease course so we will not consider these designs here.) Since we can only observe each patient as either treated or untreated, the measurement in the other condition is called a potential outcome (because we could have potentially observed this) or a counterfactual (because this is what we would have observed if contrary to fact they were given the other treatment).

Since we cannot directly calculate the quantity of interest (Equation 1) because we cannot observe the value for a subject both on and off the treatment, we must use alternative estimators. To develop these estimators we can start by noting that Equation 1 can be written as shown in Equation 2.

$$E[Y_{T=1} - Y_{T=0}] = E[Y_{T=1}] - E[Y_{T=0}] \tag{2}$$

The first term is the mean of the outcome if all subjects had been treated, and the second term is the mean of the outcome if all subjects had been untreated. These two quantities are called the two potential outcome means.

When we have completed a randomized trial, the subjects who received the treatment were a random sample of the entire group. Therefore, we can assume that the mean of the outcome in the subjects who were treated can be used to estimate the mean of the outcome had all subjects been treated. In other words, the potential outcome mean under treatment is equal to the observed mean among those who were treated, and the same holds for the untreated. From a mathematical notation perspective, we can show the relationship here.

$$E[Y_{T=1}] = E[Y \mid T=1] \quad \text{and} \quad E[Y_{T=0}] = E[Y \mid T=0] \tag{3}$$

We should note here that traditional linear regression directly estimates the difference between the two measures on the right hand side of each equation. In particular, the linear regression model that we would fit for a randomized clinical trial would be:

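$$E[Y \mid T] = b_0 + b_1 T \tag{4}$$

Using this model, we can calculate the mean value in the treatment group and the placebo group using these equations:

Placebo: $E[Y \mid T=0] = b_0$

Treatment: $E[Y \mid T=1] = b_0 + b_1$

Therefore, the treatment effect can be calculated using this equation:

Treatment effect: $E[Y \mid T=1] - E[Y \mid T=0] = (b_0 + b_1) - b_0 = b_1$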

It is important to note that the estimated regression coefficient here will equal the estimated difference in the means for a two-sample t-test.
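This equality is easy to verify numerically. The sketch below fits a simple linear regression of the outcome on a 0/1 treatment indicator using the closed-form least-squares formulas and compares the slope with the difference in group means. The eight data values are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical trial data: treatment indicator (0 = placebo, 1 = treatment)
# and a continuous outcome for eight subjects.
treatment = [0, 0, 0, 0, 1, 1, 1, 1]
outcome = [4.1, 3.8, 5.0, 4.4, 6.2, 5.9, 6.8, 6.1]

# Closed-form least-squares estimates for a single predictor.
t_bar = mean(treatment)
y_bar = mean(outcome)
slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(treatment, outcome)) / sum(
    (t - t_bar) ** 2 for t in treatment
)
intercept = y_bar - slope * t_bar

mean_placebo = mean([y for t, y in zip(treatment, outcome) if t == 0])
mean_treated = mean([y for t, y in zip(treatment, outcome) if t == 1])

print(f"intercept: {intercept:.3f}   placebo mean: {mean_placebo:.3f}")
print(f"slope: {slope:.3f}   difference in means: {mean_treated - mean_placebo:.3f}")
```

The intercept matches the placebo mean and the slope matches the difference in means, which is the relationship between the regression coefficient and the two-sample comparison noted above.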

For more information on potential outcomes, please see …

For more information on interpretation of simple linear regression models, please see …

For the comparison of a two sample t-test and linear regression, please see …

Although randomized trials have been used to address many clinical questions, many researchers aim to estimate the effect of a treatment using observational studies. A potential problem in observational studies is that the subjects who were observed to take the treatment and the subjects who were observed to take the standard of care might be different. If there are variables that are common causes of the treatment choice and the outcome, these variables are called confounders. In the presence of confounding, the estimated difference between the treatment and standard of care groups is not equal to the effect of the treatment because the confounders change the estimate. Therefore, we need approaches to allow us to estimate the effect of the treatment in the presence of confounding.

To learn more about confounding, please see …

To see how researchers can use causal diagrams to identify confounding, please see …

In order to estimate the effect of a treatment in the presence of confounding, there are two commonly used approaches: group comparisons adjusting for other variables using regression and group comparisons accounting for the propensity score. We describe each of these approaches below.

Regression is one of the most commonly used approaches for the analysis of data in comparative effectiveness research. In the randomized trial section, we showed a simple linear regression equation to estimate the treatment effect.

Regression adjustment handles confounding by modeling the difference between the treatment groups holding other variables constant. Since we are holding the other variables constant, we are comparing treated and untreated subjects who are the same on the confounders, and this can allow us to estimate the treatment effect. Let’s assume that we have a list of three confounders (C1, C2, C3). One potential regression model to estimate the treatment effect is:

$$E[Y \mid T, C_1, C_2, C_3] = b_0 + b_1 T + b_2 C_1 + b_3 C_2 + b_4 C_3$$

Based on this model, we can look at the result of this model in the placebo and treatment groups as we did in the randomized trial section.

Placebo: $E[Y \mid T=0, C_1, C_2, C_3] = b_0 + b_2 C_1 + b_3 C_2 + b_4 C_3$

Treatment: $E[Y \mid T=1, C_1, C_2, C_3] = b_0 + b_1 + b_2 C_1 + b_3 C_2 + b_4 C_3$

Although these equations look complex, the key concept is that if the values of the confounders are equal, they will cancel out of the equation by subtraction. Therefore, the treatment effect can be calculated using this equation:

Treatment effect: $E[Y \mid T=1, C_1, C_2, C_3] - E[Y \mid T=0, C_1, C_2, C_3] = b_1$

Thus, multiple regression allows us to estimate the difference between the treatment group and placebo group holding the other variables constant. If we have the correct model (and this is a big caveat), we can interpret the $b_1$ coefficient as the effect of the treatment.

The main assumption of this analysis is that we have the correct regression model, which means that we have included all confounders and we have included the correct functional form for each of the predictors.
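A small simulation can show what regression adjustment buys us when the model is correct. The sketch below is hypothetical throughout: it simulates a single binary confounder (rather than three) that raises both the chance of treatment and the outcome, sets the true treatment effect to 2, and compares the crude difference in means with the treatment coefficient from a least-squares fit that also includes the confounder. The `ols` helper is our own plain-Python solver of the normal equations, not a library routine.

```python
import random
from statistics import mean


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


rng = random.Random(11)
confounder, treatment, outcome = [], [], []
for _ in range(5000):
    c = 1 if rng.random() < 0.5 else 0
    t = 1 if rng.random() < (0.8 if c else 0.2) else 0  # C affects treatment
    y = 1.0 + 2.0 * t + 3.0 * c + rng.gauss(0, 1)       # C affects outcome
    confounder.append(c)
    treatment.append(t)
    outcome.append(y)

crude = mean([y for t, y in zip(treatment, outcome) if t == 1]) - mean(
    [y for t, y in zip(treatment, outcome) if t == 0]
)
design = [[1.0, t, c] for t, c in zip(treatment, confounder)]
b0, b1, b2 = ols(design, outcome)

print(f"crude difference in means: {crude:.2f}")
print(f"adjusted treatment coefficient b1: {b1:.2f}")
```

In this setup the crude difference overstates the effect because treated subjects are more likely to have the confounder, while the adjusted coefficient should sit close to the simulated value of 2.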

An alternative approach to estimate the treatment effect in the presence of confounding is to use the propensity score. The propensity score is the probability of taking the treatment given a set of covariates. In many instances, we estimate the propensity score using a logistic regression model. If we have the same three confounders as in the Regression Adjustment section (C1, C2, C3), a propensity score model could be

$$\operatorname{logit} P(T=1 \mid C_1, C_2, C_3) = a_0 + a_1 C_1 + a_2 C_2 + a_3 C_3$$

We can use this logistic regression model and calculate the propensity score for each subject by using the following formula.

$$P(T=1 \mid C_1, C_2, C_3) = \frac{\exp(a_0 + a_1 C_1 + a_2 C_2 + a_3 C_3)}{1 + \exp(a_0 + a_1 C_1 + a_2 C_2 + a_3 C_3)}$$

Using the available data, we estimate the coefficients using logistic regression, and then we can estimate the propensity score using the formula provided above. Most statistical software packages can calculate the predicted probability from a logistic regression model, and the predicted probability is the estimated propensity score.
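To show the conversion from logistic regression coefficients to a propensity score, here is a short Python sketch. The coefficient values and the two subjects are hypothetical; in practice the coefficients would come from fitting the logistic regression model in a statistical package, and the names `a0` to `a3` are our own labels for the intercept and the three confounder coefficients.

```python
from math import exp


def propensity_score(c1, c2, c3, coefficients):
    """Predicted probability of treatment from a logistic regression model."""
    a0, a1, a2, a3 = coefficients
    linear_predictor = a0 + a1 * c1 + a2 * c2 + a3 * c3
    # Inverse logit: exp(lp) / (1 + exp(lp)), written in an equivalent form.
    return 1 / (1 + exp(-linear_predictor))


# Hypothetical fitted coefficients (intercept, C1, C2, C3).
fitted = (-1.0, 0.03, 0.8, -0.5)

# Two hypothetical subjects with different confounder values.
subject_a = propensity_score(c1=40, c2=1, c3=0, coefficients=fitted)
subject_b = propensity_score(c1=25, c2=0, c3=1, coefficients=fitted)
print(f"subject A: {subject_a:.3f}")
print(f"subject B: {subject_b:.3f}")
```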

There are multiple approaches for using the propensity score. First, we can add the propensity score to a regression model and control for the propensity score (or a function of the propensity score) rather than adjusting for all the confounders directly. Second, we can break the sample into groups based on the deciles (or quintiles) of the propensity score, and perform an analysis stratified on the deciles. To do this, we can use a regression model and control for decile as a categorical variable.

The third approach is called propensity score matching. For this approach, we find matches for each of the treated subjects using the propensity score and then perform the analysis in the matched groups. In many instances, we perform one-to-one matching, and then we can perform a paired analysis to estimate the treatment effect. When we perform propensity score matching, we are trying to find a group of subjects who took the standard of care that match the subjects who took the treatment. In this case, the matched sample might be different than the overall group, and we are estimating the average treatment effect in the treated, which is different than the other analyses.

The final approach is called inverse probability of treatment weighting. This approach calculates the probability of receiving the treatment (treatment or standard of care) that the patient actually received. For the treatment group, this is equal to the propensity score. For the standard of care group, this is equal to 1 minus the propensity score because:

$$P(T=0 \mid C_1, C_2, C_3) = 1 - P(T=1 \mid C_1, C_2, C_3)$$

Then, we weight each subject based on the inverse of the probability of receiving the treatment. We can perform the weighted analysis using weighted regression approaches, which can be completed in statistical software packages.
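The weighting step can be sketched with simulated data. The example below is hypothetical: it uses one binary confounder and a true treatment effect of 2, and it estimates the propensity score as the proportion treated within each level of the confounder, which is what a logistic regression with that single binary covariate would predict. A weighted difference in means stands in for the weighted regression.

```python
import random

rng = random.Random(7)
rows = []
for _ in range(5000):
    c = 1 if rng.random() < 0.5 else 0
    t = 1 if rng.random() < (0.8 if c else 0.2) else 0
    y = 1.0 + 2.0 * t + 3.0 * c + rng.gauss(0, 1)
    rows.append((c, t, y))

# Estimated propensity score: proportion treated within each level of C.
ps_by_c = {}
for level in (0, 1):
    stratum = [t for c, t, _ in rows if c == level]
    ps_by_c[level] = sum(stratum) / len(stratum)


def weight(c, t):
    """Inverse of the probability of the treatment actually received."""
    ps = ps_by_c[c]
    return 1 / ps if t == 1 else 1 / (1 - ps)


def weighted_mean(group):
    numerator = sum(weight(c, t) * y for c, t, y in rows if t == group)
    denominator = sum(weight(c, t) for c, t, y in rows if t == group)
    return numerator / denominator


def plain_mean(group):
    values = [y for c, t, y in rows if t == group]
    return sum(values) / len(values)


crude_effect = plain_mean(1) - plain_mean(0)
iptw_effect = weighted_mean(1) - weighted_mean(0)
print(f"crude difference: {crude_effect:.2f}")
print(f"weighted difference: {iptw_effect:.2f}")
```

The weighted difference should be close to the simulated effect of 2, while the crude difference is pulled upward by the confounder.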

When we use propensity score methods, a key intermediate step is to ensure that we have achieved balance between the treatment and standard of care group in either the matched groups, within each decile of the propensity score or in the weighted sample. This check ensures that there is no residual confounding due to the measured variables in the propensity score analyses.
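One common way to check balance is the standardized difference of each covariate between the groups, before and after applying the propensity score. The sketch below is an illustration under stated assumptions: one simulated binary confounder, a propensity score estimated as the proportion treated within each level of that confounder, and inverse probability of treatment weights. The standardized difference formula for a binary covariate is standard, but the data and names are hypothetical.

```python
import random
from math import sqrt

rng = random.Random(3)
data = []
for _ in range(2000):
    c = 1 if rng.random() < 0.5 else 0
    t = 1 if rng.random() < (0.8 if c else 0.2) else 0
    data.append((c, t))

# Propensity score estimated as the proportion treated within each level of C.
ps = {}
for level in (0, 1):
    stratum = [t for c, t in data if c == level]
    ps[level] = sum(stratum) / len(stratum)

unweighted = [1.0 for _ in data]
iptw = [1 / ps[c] if t == 1 else 1 / (1 - ps[c]) for c, t in data]


def weighted_prevalence(group, weights):
    """Weighted proportion with C = 1 in one treatment group."""
    numerator = sum(w * c for (c, t), w in zip(data, weights) if t == group)
    denominator = sum(w for (c, t), w in zip(data, weights) if t == group)
    return numerator / denominator


def standardized_difference(weights):
    p1 = weighted_prevalence(1, weights)
    p0 = weighted_prevalence(0, weights)
    pooled_sd = sqrt((p1 * (1 - p1) + p0 * (1 - p0)) / 2)
    return (p1 - p0) / pooled_sd


smd_before = standardized_difference(unweighted)
smd_after = standardized_difference(iptw)
print(f"standardized difference before weighting: {smd_before:.3f}")
print(f"standardized difference after weighting: {smd_after:.3f}")
```

With a single binary covariate the weights balance it essentially exactly; with several covariates or a misspecified propensity model the standardized differences after weighting would generally not be zero and each covariate should be inspected.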

To learn more about these approaches, Harvard Catalyst Post Graduate Education provides a Comparative Effectiveness Research on-line course describing all these approaches including how to perform them in statistical software: https://catalyst.harvard.edu/services/cer/

For an introduction to propensity score analysis, please see https://www.ncbi.nlm.nih.gov/pubmed/27413128.

For additional reading about the propensity score, please see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3144483/. This article also provides a comparison of the propensity score analyses and regression adjustment.

For additional reading about reporting the results of a study using the propensity score, please see https://www.ncbi.nlm.nih.gov/pubmed/25433444.