A common request is to select a simple random sample from a population. The population may consist of paper lists or computer files. Two types of simple random sample (SRS) can be selected: with replacement and without replacement. The simplest type is an SRS with replacement. Conceptually, to select such a sample, we follow several steps:
Using SRS with replacement, the same "subject" may be selected more than once, since the slip of paper is returned to the "hat" after each selection. Many books describe how such samples can be selected using random number tables. This is a very tedious process, and impractical in many real settings. Instead, the sample should be selected via computer.
We will develop programs for selecting an SRS with replacement following the outline below.
Outline:
1. Determine the sample size=n, and population size=N
2. Assume that the population has been numbered from 1 to N
3. Use a Uniform Random Number Generator to select the sample
   a. expand the range from (0,1) to (0,N) by multiplication
   b. add 1 to the result to have numbers go from 1 to (N+1)
   c. use the integer part of the random number
   d. repeat the process "n" times, and OUTPUT the results.
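The outline above can be sketched as a short DATA step. This is only an illustration and is not the course program LEC25P1.SAS. The data set name srs_wr, the variable names, the population size of 20, the sample size of 5, and the seed 12345 are all assumed for the example.

```sas
/* Sketch only: SRS with replacement from a population numbered 1..N. */
/* npop, nsamp and the seed 12345 are assumed values.                 */
data srs_wr;
   npop  = 20;                    /* population size N */
   nsamp = 5;                     /* sample size n */
   do draw = 1 to nsamp;          /* step d: repeat the process n times */
      u  = ranuni(12345);         /* uniform random number on (0,1) */
      id = int(u * npop + 1);     /* steps a-c: multiply by N, add 1, keep integer part */
      output;                     /* write one selected subject number */
   end;
   keep draw id;
run;

proc print data=srs_wr;
run;
```

Because each draw is made independently, the same value of id may appear more than once in the listing, which is what sampling with replacement allows.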
We illustrate this process with the program LEC25P1.SAS, resulting in the sample given in Table 1.
The results are illustrated in the following tables.
Table 1. List of Sample of n=5 Subjects
Table 2. Listing of the Population
Table 3. Listing of the Population with Assigned Numbers
Table 4. Listing of the Identified Sample
An SRS without replacement can be selected in a similar manner. Conceptually, to select such a sample, we follow steps similar to those for selecting a simple random sample with replacement. The steps are:
1. Number subjects in the population from 1 to N
2. Write each subject's number on a slip of paper (of equal size)
3. Place all the slips of paper in a "hat"
4. After being blindfolded,
   a. Reach into the "hat" and pick one slip of paper.
   b. Record the number selected.
5. Repeat step (4) until "n" subjects have been selected.
Following these steps using a random number table is tedious. It is particularly tedious because each new random number must be checked against the numbers already selected before it can be included in the sample.
However, to select the second subject, we can no longer use the random number generator in an identical manner. We could re-number the remaining subjects, and follow the same plan, but the continual re-numbering of subjects will be confusing.
1. Generate a uniform random number. If the number is less than n/N, select the 1st subject in the list. If the number is greater than n/N, the first subject is not in the sample.
2a. If the result of (1) was to include the first subject in the sample, we now need to select (n-1) subjects out of the remaining (N-1). Generate a second uniform random number, and select the second subject if the number is less than (n-1)/(N-1). Otherwise, exclude the subject from the sample.
2b. If the result of (1) was to exclude the first subject from the sample, we now need to select "n" subjects out of the remaining (N-1). Generate a second uniform random number, and select the second subject if the number is less than n/(N-1). Otherwise, exclude the subject from the sample.
3. Continue with this process until "n" subjects have been selected.
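The sequential selection described in these steps might be sketched as follows. This is a minimal illustration, not the course program; the data set names, a population of 9 subjects, a sample of 4, and the seed 2468 are assumed. The variables npop and nsamp hold the number of subjects still to be read and the number still to be selected.

```sas
/* Sketch only: SRS without replacement by sequential selection. */
/* Assumed values: N=9 subjects, n=4 to select, seed 2468.       */
data population;
   do id = 1 to 9;
      output;
   end;
run;

data srs_wor;
   set population;
   retain npop 9 nsamp 4;       /* carried from one record to the next */
   prob = nsamp / npop;         /* chance of selecting the current subject */
   u = ranuni(2468);            /* uniform random number on (0,1) */
   npop = npop - 1;             /* one fewer subject remains to be read */
   if u < prob then do;
      nsamp = nsamp - 1;        /* one fewer subject remains to be selected */
      output;                   /* only selected subjects are written */
   end;
run;

proc print data=srs_wor;
run;
```

Once nsamp reaches 0 the probability is 0 and no further subjects are selected; if the number still needed equals the number of subjects remaining, the probability is 1 and every remaining subject is selected. The sample therefore always contains exactly 4 subjects.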
We illustrate selection of a Simple Random Sample Without Replacement with the program LEC25P3.sas. Note that in the program, there are statements such as:
nsamp=nsamp-1;
Clearly, the value of "nsamp" cannot equal itself minus "1". This statement can be understood by thinking of the value on the left hand side as the "new" value, and the value on the right hand side as the "initial" value. Thus, the statement will take an initial value of "nsamp"=4, and give it a "new" value of "nsamp"=3.
The resulting sample is indicated in Table 2. Notice in Table 2 that the probability of a subject being included in the sample varies as the data are read, ranging from a maximum of 0.55 to a minimum of 0. These probabilities give the impression that there is a different chance for subjects to be included in the sample. While for a particular sequence of random numbers, this is true, when one averages over all possible random numbers, one can show that each distinct sample has the same chance of occurring.
We illustrate that this sampling strategy does give equal probability to each subject by simulating the sampling. We develop the simulation for 2 populations in LEC25P4.SAS, where Table 4.1 lists the populations, and Table 4.2 lists the selected samples.
Note that the behavior of the OUTPUT statement may appear somewhat unexpected, since subjects not selected in the sample do not appear in the dataset. When a DATA Step contains no OUTPUT statement, SAS automatically acts as if one were placed at the end of the step. As a result, each time the end of the instructions is reached for a "line" of data, the results are written to the SAS dataset. In contrast, when an OUTPUT statement is included anywhere in a DATA Step, no "automatic" OUTPUT is performed at the end of the DATA Step. For this reason, subjects not selected in the sample are not written out to the dataset.
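A small example may make the difference between the automatic and the explicit OUTPUT clearer. The data set names and the four input values below are assumed purely for illustration.

```sas
/* No OUTPUT statement: every record is written automatically. */
data all_records;
   input id @@;
   datalines;
1 2 3 4
;
run;

/* Explicit OUTPUT: only records that reach the statement are written. */
data even_only;
   set all_records;
   if mod(id, 2) = 0 then output;   /* records with odd id are never written */
run;

proc print data=even_only;
run;
```

The first data set contains all four records, while the second contains only the records with id equal to 2 and 4.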
We conduct the simulation with LEC25P5.SAS, resulting in the following tabulation of the frequency of subject selections (given in Table 5.2).
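A simulation of this kind can be sketched in a single DATA step. The version below is not LEC25P5.SAS; it assumes a population of 6 subjects, a sample of 2, 10000 replications, and the seed 1357. Each replication restarts the counters and applies the sequential selection rule, and PROC FREQ then counts how often each subject was selected.

```sas
/* Sketch only: repeat the sequential selection many times and  */
/* tabulate how often each subject is chosen (assumed N=6, n=2). */
data sim;
   do rep = 1 to 10000;                 /* assumed number of replications */
      npop  = 6;                        /* reset counters for each replication */
      nsamp = 2;
      do id = 1 to 6;
         if ranuni(1357) < nsamp / npop then do;
            nsamp = nsamp - 1;
            output;                     /* subject id selected in this replication */
         end;
         npop = npop - 1;
      end;
   end;
   keep rep id;
run;

proc freq data=sim;
   tables id;
run;
```

If the method gives each subject the same chance of selection, the frequencies for the six values of id should be roughly equal, each close to one third of the number of replications.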
We can examine how the program works in more detail by having intermediate results written as the program is processing. We do so by including PUT statements. A PUT statement can list whatever text you would like (in quotation marks) or the value of variables. The program LEC25P6.sas includes some PUT statements to provide details of the sample selection (see output). Using PUT statements can help to understand details of the processing of the data when programs are complicated.
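As a brief illustration of the two common forms of the PUT statement (quoted text mixed with variables, and the "variable=" form), consider the sketch below. It is not taken from LEC25P6.sas; the seed and variable names are assumed. The lines are written to the SAS log.

```sas
/* Sketch only: write intermediate values to the log with PUT. */
data _null_;
   do id = 1 to 3;
      u = ranuni(97531);                        /* assumed seed */
      put 'Subject ' id ' uniform draw = ' u 6.4;   /* text plus formatted value */
      put id= u=;                               /* named form: prints id=... u=... */
   end;
run;
```

Using DATA _NULL_ runs the statements without creating a data set, which is convenient when only the log messages are of interest.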
The program LEC25P7.SAS illustrates in Table 7.1 that if initial values are simply fixed, even with a RETAIN statement, the function of the DATA Step is substantially altered. Using assignment statements to fix the values does not allow them to change as consecutive records are read in the data.
With this understanding of the function of the initial values in the RETAIN statement, we can consider the function of the RETAIN statement itself. We do so by comparing the results of the previous program with LEC25P9.SAS, where the RETAIN statement was deleted (see output). Note in the output that NPOP and NSAMP have missing values (represented in SAS by ".") for all records except the first. These values are missing since after the first record, there is nothing in the program for subsequent records that defines their values. Only if the value from a previous record is known can these variables assume values.
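The three situations discussed above (initial values in a RETAIN statement, values fixed by an assignment statement, and no RETAIN at all) can be compared side by side in a small sketch. The data set names and the starting value of 4 are assumed, and the programs are not the course programs.

```sas
/* Sketch only: three ways of handling a counter across records. */
data pop;
   do id = 1 to 4;
      output;
   end;
run;

data with_retain;          /* initial value set once, then carried forward */
   set pop;
   retain npop 4;
   npop = npop - 1;        /* npop = 3, 2, 1, 0 */
run;

data fixed_each_record;    /* assignment resets the value on every record */
   set pop;
   retain npop;
   npop = 4;
   npop = npop - 1;        /* npop = 3 on every record */
run;

data without_retain;       /* value is reset to missing before each record */
   set pop;
   if _n_ = 1 then npop = 4;
   npop = npop - 1;        /* npop = 3 on record 1, missing afterwards */
run;
```

Printing the three data sets shows the counter decreasing only in the first, staying fixed in the second, and becoming missing after the first record in the third.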
Selection of a Stratified Sample from Anaconda
A study was recently (1996) conducted in Anaconda, Montana to evaluate soil ingestion among pre-school age children. The town had a copper smelting plant which contaminated large areas with arsenic and lead. The purpose of the study was to assess the exposure to these elements among children due to ingestion of soil. The town consisted of a long road, with the smelter on one end of town. Due to the proximity to the smelter and the wind direction, different parts of the town received different amounts of contamination. For this reason, the town was divided into strata corresponding to contamination gradients.
The company responsible for clean up of the lead and arsenic is ARCO. One study conducted by ARCO was a study of arsenic levels in urine among children. As part of this study, a census of all children between 12 and 48 months of age was obtained from researchers studying urine arsenic among all children in Anaconda. The list included all children in a geographically defined study area that participated in the urine study. Children not included in the sample frame were children from families that refused to participate in that study, children who had moved, and families who had not yet completed the initial urine study (approximately 12 families). A total of 258 families with children between 12 and 48 months of age were identified in that list. Data on these families are contained in the SAS data set anc2v1.sd2.
Families were grouped in geographic areas which were defined from west to east and were relatively homogeneous in terms of age of housing. These areas were collapsed into six areas that were relatively homogeneous for age of housing and geographically contiguous. The six collapsed areas that formed the strata for sample selection, and the number of families from the original study areas, are given in Table 1.
Table 1. Geographic Areas Used to form Strata for Anaconda Soil Ingestion Study
As a first step, we examine the ANACONDA population data (anc2v1.sd2) using the program LEC25P10.SAS. Sample lists of these data are given in Table 10.1 and Table 10.2, with a cross classification of strata and age in Table 10.3. We also tabulate the stratum totals for households where children's age is known (Table 10.4).
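The kind of listing and tabulation described here might look like the sketch below. It does not use anc2v1.sd2, whose variable names are not given in these notes; instead it builds a small toy data set, and the names toy, famid, stratum, and age, along with all data values, are hypothetical.

```sas
/* Sketch only: toy data with hypothetical variable names. */
data toy;
   input famid stratum age @@;      /* age in months; "." = unknown */
   datalines;
1 1 14  2 1 30  3 2 22  4 2 .  5 3 40  6 3 18
;
run;

proc print data=toy (obs=3);        /* sample listing of the first records */
run;

proc freq data=toy;                 /* cross classification of stratum and age */
   tables stratum*age / norow nocol nopercent;
run;

proc freq data=toy;                 /* stratum totals where age is known */
   where age ne .;
   tables stratum;
run;
```

The WHERE statement restricts the last tabulation to records with a known age, so the family with a missing age does not contribute to its stratum total.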
Produced and maintained by the Dept of BioEpi at UMASS