Global Statements | Details
%MACRO name(a,b) | Defines a new macro with two macro variables
&a | Used to represent macro variable in a macro
%MEND name | Ends a macro
%name(2,5) | Executes a macro with values of macro variables as given
It is widely known that smoking cigarettes is a risk factor for lung cancer. The relative risk is about 4, meaning that the probability of a smoker developing lung cancer is 4 times as large as the probability of a non-smoker developing lung cancer. This risk ratio can range up to 30 for heavy smokers. These risk ratios are determined by studying the rate of lung cancer in human populations. In practice, the risk ratios can rarely be directly assessed. Instead, an "odds ratio" is used as an approximation to the risk ratio. The problem we address is how close the "odds ratio" is to the risk ratio.
One way of studying the relative risk of lung cancer due to cigarette smoking is to select a simple random sample of subjects from a population, and then identify each subject's risk factor status (smoking status). We follow the subjects over a period of time (say 1 year), and determine the proportion of persons who develop lung cancer in the smoking and in the non-smoking groups. Such a study is called a Cohort Study. Currently, about 28% of adults in the US smoke cigarettes. If the smokers and non-smokers were identified by selecting a simple random sample from the population, we would expect about 28% of sample subjects to be smokers. For simplicity, we will assume the smoking rate is 30% in this illustration.
Suppose we adopt this approach to investigate the risk ratio for smoking. We want to have a good idea of the risk ratio, so we select a large sample, one with n=10,000. In this sample, let us assume that 3,000 subjects smoke, while the rest are non-smokers.
Description | Sample
Population | 10000
Smokers | 3000
Non-Smokers | 7000
The second step in a cohort study is to follow up the study subjects over time. Suppose we follow these subjects over a 1 year period, and count the number of subjects who develop lung cancer. The annual incidence of lung cancer (the probability of a person being diagnosed with lung cancer in a year) is 0.000821 (see SEER report). This incidence rate is usually reported as a rate per 100,000 people, or 82.1.
In our sample of 10,000, we would expect about 8 people (10,000 x 0.000821 = 8.21) to develop lung cancer in a year. Suppose that we find that 5 smokers develop lung cancer, while 3 non-smokers develop lung cancer. These data can be summarized in the 2 by 2 table given in Table 4.1.
Table 4.1. Hypothetical results of Cohort Study on Smoking and Lung Cancer
Exposure Status (Smoker?) | Lung Cancer: Yes | Lung Cancer: No | Total | Incidence
Yes | 5 | 2995 | 3000 | 0.001666
No | 3 | 6997 | 7000 | 0.000428
The risk ratio is estimated from this study as: RR = (5/3000)/(3/7000) = 3.889
There are problems with this approach of studying disease. One problem is the amount of work (following 10,000 people for one year), for the relatively small number of lung cancer cases. The second problem is that a small difference in the distribution of lung cancer cases can make a large difference in the estimated risk ratio. For example, if there were 6 lung cancer cases among the smokers, and 2 among the non-smokers, the estimated RR = 7. In contrast, if there were 4 lung cancer cases in each group, the estimated RR = 2.33. Thus, even with all this work, we would not be very sure that the RR of 3.889 would apply to another group of 10,000 people. For these reasons, a different study design is usually used.
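The sensitivity of the estimate to small shifts in the case counts can be checked directly. The following is a minimal Python sketch (Python is used here only for illustration, and the function name `risk_ratio` is an assumption, not part of the course programs) that recomputes the risk ratio for the three splits of 8 cases discussed above.

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Incidence among the exposed divided by incidence among the unexposed."""
    return (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)

# Cohort of 10,000 subjects: 3,000 smokers and 7,000 non-smokers.
# Each pair is (cases among smokers, cases among non-smokers).
for smoker_cases, nonsmoker_cases in [(5, 3), (6, 2), (4, 4)]:
    rr = risk_ratio(smoker_cases, 3000, nonsmoker_cases, 7000)
    print(f"{smoker_cases} vs {nonsmoker_cases} cases: RR = {rr:.3f}")
```

Moving a single case from one group to the other changes the estimate from 3.889 to 7, which is the instability the text describes.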
A case/control study attempts to conserve resources and to obtain a better estimate of the relative risk of disease. To do so, a sample of subjects (patients) with lung cancer and a sample of subjects without lung cancer are selected. The smoking status is then evaluated for each subject. To see how a case/control study works, suppose that a population consists of 1,000,000 subjects, where 300,000 smoke, and the rest do not. Also, let us assume the incidence rate of lung cancer is 82.1 per 100,000. A case/control study begins by selecting a sample of lung cancer "cases", and a similar sample of "controls". Suppose we select a sample of 100 cases and 100 controls. For each case and control, we establish whether or not the subject was a smoker. The results can be summarized in a 2 by 2 table, as in a cohort study. The numbers in Table 4.2 were constructed so that the true risk ratio in the population is 4.
Table 4.2. Expected distribution of Smoking and Lung Cancer from a Case/Control study assuming a Risk Ratio of 4 [assuming Prob(Smoke)=0.3, Prob(Lung Cancer)= 0.000821]
Exposure Status (Smoker?) | Lung Cancer: Yes | Lung Cancer: No | Total | Proportion with Lung Cancer
Yes | 63 | 30 | 93 | 0.6774
No | 37 | 70 | 107 | 0.3458
Total | 100 | 100 | |
Suppose we use the same method of calculating the risk ratio in a case/control study as we used in a cohort study. Dividing the proportion of smoking subjects with lung cancer by the proportion of non-smoking subjects with lung cancer, we find a ratio of 1.959. This ratio is very different from the population risk ratio of 4.
However, if we divide the odds of a lung cancer patient smoking (63/37) by the odds of a subject without lung cancer smoking (30/70), we find the relative odds ratio is equal to 3.973, very close to the value of 4 that is the true risk ratio for the population. The similarity of the relative odds ratio to the risk ratio is the justification for use of the odds ratio to approximate the risk ratio in case/control studies.
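The two calculations above can be reproduced from the four cell counts of Table 4.2. This is a minimal Python sketch for illustration only; the variable names are assumptions and do not come from the course programs.

```python
# Cell counts from Table 4.2 (100 cases, 100 controls).
cases_smoke, cases_nonsmoke = 63, 37
controls_smoke, controls_nonsmoke = 30, 70

# Cohort-style calculation applied (inappropriately) to case/control data.
prop_smokers = cases_smoke / (cases_smoke + controls_smoke)            # 63/93
prop_nonsmokers = cases_nonsmoke / (cases_nonsmoke + controls_nonsmoke)  # 37/107
naive_ratio = prop_smokers / prop_nonsmokers

# Odds of smoking among cases divided by odds of smoking among controls.
odds_ratio = (cases_smoke / cases_nonsmoke) / (controls_smoke / controls_nonsmoke)

print(f"ratio of proportions = {naive_ratio:.3f}")
print(f"relative odds ratio  = {odds_ratio:.3f}")
```

The ratio of proportions depends on how many controls were sampled, whereas the odds ratio recovers a value close to the population risk ratio of 4.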
Case/control studies have the advantages of being economical, and relatively easy to conduct. They do not require a prospective time period, and hence can produce results in a more timely manner. Case/control studies have the disadvantage of having to assess exposure retrospectively (and hence perhaps not quite as accurately). It is also possible in case/control studies that the exposure occurred after the disease. However, when such an occurrence is unlikely, and when this loss in accuracy is small, case/control studies have a considerable advantage over cohort studies.
Since the objective of a case/control study is to estimate the relative risk of an exposure, the validity of the approximation to the relative risk by the relative odds ratio is important. The approximation is said to be "good" for "rare diseases". We will examine using a simulation what is meant by "good" and "rare diseases". Questions of interest include
Prior to using the computer to simulate settings that enable us to place some meaning on the "good approximation assuming the disease is rare" assumption, it is helpful to construct a simple example. We assume for simplicity that risk factor status and disease status can be measured without error in either study design. Case/control studies and cohort studies are both designed to answer a question in a defined population. We start our example by defining a population, and specifying how estimates would be constructed for a case/control and a cohort study.
Suppose a population is given as in Table 4.3. A cohort or case/control study design could be used with this population to estimate the relative risk of lung cancer for smokers. In the cohort design, the relative risk would estimate the quantity (52/30000)/(30/70000) = 4.04 (or 4.03 if computed from the rounded incidences shown in the table).
Table 4.3. Hypothetical population
Exposure Status (Smoker?) | Lung Cancer: Yes | Lung Cancer: No | Total | Incidence
Yes | 52 | 29948 | 30000 | 0.00173
No | 30 | 69970 | 70000 | 0.000429
Total | 82 | 99918 | 100000 | 0.00082
Next we consider a case/control study for this same population. The first step in designing a case/control study is a decision as to the number of cases, and the number of controls to select. We could choose to select equal numbers of cases and controls, twice as many controls as cases, or even 12194 times as many controls as cases. To specify a case/control study, the ratio of controls per case needs first to be determined. Is this choice important when comparing the relative odds ratio with the relative risk?
We postpone this question for the moment, and assume that we design our case/control study with equal numbers of cases and controls. Let us assume the study includes all cases and an equal number of controls, as in Table 4.4a.
Table 4.4a. Case/Control Sample with Equal Numbers of Cases and Controls
Exposure Status (Smoker?) | Lung Cancer: Yes | Lung Cancer: No | Total
Yes | 52 | 25 | 77
No | 30 | 57 | 87
Total | 82 | 82 | 164
The relative odds ratio =(52)(57)/[(30)(25)]=3.952
Suppose we double the number of controls, which is equivalent to doubling the frequencies in the control (lung cancer "No") column of Table 4.4a, to produce Table 4.4b.
Table 4.4b. Case/Control Sample with Twice as Many Controls as Cases
Exposure Status (Smoker?) | Lung Cancer: Yes | Lung Cancer: No | Total
Yes | 52 | 50 | 102
No | 30 | 114 | 144
Total | 82 | 164 | 246
Then the relative odds ratio is still given as =(52)(2*57)/[(30)(2*25)]=3.952
Thus, changing the ratio of cases to controls in a case/control study does not affect the odds ratio. Without loss of generality, we can set the number of cases equal to the number of controls when comparing the risk ratios and odds ratios in a simulation.
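The invariance can be confirmed for any multiple of controls, since the multiplier appears once in the numerator and once in the denominator. The sketch below is illustrative Python; the function names `odds_ratio` and `scale_controls` are assumptions, and the cell layout follows the a, b, c, d labeling used later in Table 4.5.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio ad/(bc) with a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls."""
    return (a * d) / (b * c)

def scale_controls(a, b, c, d, k):
    """Multiply both control cells by k, leaving the cases unchanged."""
    return a, k * b, c, k * d

base = (52, 25, 30, 57)  # cell counts taken from Table 4.4a
for k in (1, 2, 10):
    scaled = scale_controls(*base, k)
    print(f"{k} control(s) per case: OR = {odds_ratio(*scaled):.3f}")
```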
The example in Section 5 makes clear that the key to designing a simulation is specifying the hypothetical population in the 2 by 2 table. The elements in the four cells of the 2 by 2 table depend on three quantities: the proportion of the population that is exposed (E), the incidence of disease (I), and the risk ratio (R).
These values, plus the number of subjects in the population (N) completely specify the values in the cells of a 2 by 2 table. We can see how different populations may be constructed via Table 4.5.
Table 4.5. Hypothetical Population with Exposure and Disease
Exposure? | Disease: Yes | Disease: No | Total
Yes | a | b=(NE-a) | NE
No | c=(NI-a) | d | N(1-E)
Total | NI | N(1-I) | N
Then R = [a/(NE)] / [(NI-a)/(N(1-E))], or a = N[ (ERI)/(1-E+RE) ]
For given values of E, R, N, and I, we can evaluate "a". Once "a" is determined, the values of "b", "c", and "d" are also determined. For example, with a population of N=100,000, the 2 by 2 table for lung cancer and smoking with R=4 is given in Table 4.3. Note that the formulas defining the risk ratio and relative odds ratio are given in terms of Table 4.5 as
Relative Risk Ratio = R = [a/(a+b)] / [c/(c+d)], whereas Relative Odds Ratio = [ad]/[bc].
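The construction of the population from E, I, R, and N can be sketched as follows. This is illustrative Python rather than the SAS code used in the course; the function name `build_population` is an assumption, and I=0.00082 is taken from the Total row of Table 4.3.

```python
def build_population(N, E, I, R):
    """Return the cells (a, b, c, d) of the 2 by 2 table in Table 4.5."""
    a = N * E * R * I / (1 - E + R * E)  # exposed and diseased
    b = N * E - a                        # exposed, not diseased
    c = N * I - a                        # unexposed and diseased
    d = N * (1 - E) - c                  # unexposed, not diseased
    return a, b, c, d

a, b, c, d = build_population(N=100_000, E=0.3, I=0.00082, R=4)
print([round(x) for x in (a, b, c, d)])  # rounded cells, compare with Table 4.3

risk_ratio = (a / (a + b)) / (c / (c + d))
odds_ratio = (a * d) / (b * c)
print(f"risk ratio = {risk_ratio:.3f}, odds ratio = {odds_ratio:.3f}")
```

Working with the unrounded cells returns the specified risk ratio exactly (up to floating point error); rounding the cells to whole subjects, as in Table 4.3, shifts the ratio slightly.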
The steps involved in the simulation are as follows:
We begin by developing a program (lec26p1.sas) to generate a population, evaluate the risk ratio and odds ratio, and the difference
diff= (odds-ratio)- (risk ratio)
In addition, we define the relative percent difference by dividing this difference by the risk ratio, and expressing it as a percent. The results are given in lec26p1t1.lst.
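The calculation performed for each simulated population can be sketched as below. This is an illustrative Python version, not a transcription of lec26p1.sas; the grid of risk ratios (2, 4, 10, 30) is an assumption chosen to span the range mentioned in the introduction, while E=0.3 and I=0.000821 are the values used earlier in the text.

```python
def compare_ratios(N, E, I, R):
    """Build the population, then return (odds ratio, diff, relative percent diff)."""
    a = N * E * R * I / (1 - E + R * E)
    b = N * E - a
    c = N * I - a
    d = N * (1 - E) - c
    odds_ratio = (a * d) / (b * c)
    diff = odds_ratio - R            # (odds ratio) - (risk ratio)
    pct_diff = 100 * diff / R        # difference relative to the risk ratio
    return odds_ratio, diff, pct_diff

results = {}
for R in (2, 4, 10, 30):  # hypothetical grid of risk ratios
    results[R] = compare_ratios(N=100_000, E=0.3, I=0.000821, R=R)
    odds_ratio, diff, pct_diff = results[R]
    print(f"R={R:>2}  OR={odds_ratio:.4f}  diff={diff:.4f}  pct={pct_diff:.3f}%")
```

With an incidence this low, the odds ratio exceeds the risk ratio by well under 1 percent at every risk ratio in the grid.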
These variables are referred to via the array "ARSK", where the "rth" element in the array is specified as "ARSK{r}". We add additional ARRAY variables in LEC26P4.SAS to generate different values of the incidence rate as well as the risk ratios, with results shown in Table 3 (lec26p4t1.lst).
Notice the negative values for entries in Table 3 when the exposure rate and the risk ratios are both large. Inspection of the value of "a" for these entries shows that more "diseased exposed" subjects were expected assuming I, R, and E than existed in the entire population. In other words, this combination of I, R, and E parameters cannot exist. This problem with the simulation can be corrected by adding a check to make sure that all cell sizes are non-negative.
This check, plus two descriptive plots, are included in LEC26P5.SAS. The first set of plots allows the SAS program to determine the scale for the axes. The second set of plots sets the axis scale explicitly (see output).
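One way such a check might look is sketched below in Python (illustrative only; the names `build_population` and `is_valid` are assumptions, and the second parameter set, with an incidence of 0.5, is a made-up example chosen to trigger the problem rather than a value from the course output).

```python
def build_population(N, E, I, R):
    """Return the cells (a, b, c, d) of the 2 by 2 table in Table 4.5."""
    a = N * E * R * I / (1 - E + R * E)
    b = N * E - a
    c = N * I - a
    d = N * (1 - E) - c
    return a, b, c, d

def is_valid(cells):
    """A parameter combination is feasible only if every cell is non-negative."""
    return all(x >= 0 for x in cells)

rare = build_population(N=100_000, E=0.3, I=0.000821, R=4)
common = build_population(N=100_000, E=0.3, I=0.5, R=4)  # hypothetical, very common disease

print("rare disease valid:  ", is_valid(rare))
print("common disease valid:", is_valid(common))
```

In the second case the implied number of diseased exposed subjects is larger than the number of exposed subjects, so the cell "b" is negative and the combination should be skipped.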
The graphs that are produced by PROC PLOT are examples of low resolution graphics. Better graphs and plots can be produced by PROC GPLOT. Additional statements that produce such graphs are also included in the program LEC26p6.SAS, and are listed below. The resulting graph can be copied directly from SAS using the Editor to the Clipboard, and pasted into a report (as in lec26g1.doc).
It is of interest to repeat the simulation for different levels of exposure of the risk factor in the population. We can simply change the exposure levels, and rerun the program. However, it is helpful to have a way of automatically repeating a set of commands. Such repetition makes use of MACROS. SAS MACROS are groups of DATA and PROC steps that function together. The groups are given a MACRO name, and marked. Variables can be defined that are MACRO variables, such that they can be changed when the set of DATA and PROC steps is executed.
MACRO commands make programs shorter but, if overused, do so at the expense of clarity. In long simulations, however, the shorter set of programming commands may improve clarity. We first consider a simple program illustrating the definition and use of a MACRO. Next, we use a macro to perform our simulated comparison of risk ratios and odds ratios at different levels of exposure.
A macro module is a set of SAS statements that can be repeatedly executed with a simple macro command. The statements are grouped into a macro that is given a name. There may be variables that are used by the SAS statements in the MACRO. These variables are MACRO variables. A simple example is given by lec26p7.sas, with the output given in lec26p7t1.lst.
Macros begin with a %MACRO command, and end with a %MEND command. Macro variables are indicated by a preceding "&" sign in the program. A macro is executed by preceding the macro name with "%".
Program lec27p8.sas provides an example of Risk Ratio/Odds Ratio Simulation with Exposure at 0.1, 0.3, and 0.5.
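The idea of rerunning the same steps with a different exposure level can be illustrated outside SAS as well. In the Python sketch below, the function `run_simulation` plays the role that a macro with one macro variable would play; it is not a translation of the SAS program, and the function names, the risk ratio grid, and the population size are all assumptions made for illustration.

```python
def odds_vs_risk(E, I, R, N=100_000):
    """Return (odds ratio, relative percent difference), or None if a cell is negative."""
    a = N * E * R * I / (1 - E + R * E)
    b = N * E - a
    c = N * I - a
    d = N * (1 - E) - c
    if min(a, b, c, d) < 0:
        return None  # infeasible combination of E, I, and R
    odds_ratio = (a * d) / (b * c)
    return odds_ratio, 100 * (odds_ratio - R) / R

def run_simulation(E, I=0.000821, risk_ratios=(2, 4, 10, 30)):
    """Repeat the comparison over a grid of risk ratios for one exposure level."""
    rows = []
    for R in risk_ratios:
        result = odds_vs_risk(E, I, R)
        if result is not None:
            rows.append((E, R, result[0], result[1]))
    return rows

# Analogous to invoking the same macro three times with different exposure values.
for exposure in (0.1, 0.3, 0.5):
    for E, R, odds_ratio, pct_diff in run_simulation(exposure):
        print(f"E={E}  R={R:>2}  OR={odds_ratio:.4f}  pct diff={pct_diff:.3f}%")
```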
Produced and maintained by the Dept of BioEpi at UMASS