Practical Data Management and Statistical Computing (BioEp691F)



Lecture 22


1. Selecting a Simple Random Sample with Replacement from a Population

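A common request is to select a simple random sample from a population. The population may consist of paper lists or computer files. Two types of simple random samples (SRS) can be selected: with replacement and without replacement. The simplest type is a SRS with replacement. Conceptually, to select such a sample, we follow several steps:

1. Number subjects in the population from 1 to N
2. Write each subject's number on a slip of paper (of equal size)
3. Place all the slips of paper in a "hat"
4. After being blindfolded,
a. Reach into the "hat" and pick one slip of paper.
b. Record the number selected.
c. Return the slip of paper to the "hat".

5. Repeat step (4) until "n" subjects have been selected.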

Using SRS with replacement, the same "subject" may be selected more than once, since the slip of paper is returned to the "hat" after each selection. Many books describe how such samples can be selected using random numbers tables. This is a very tedious process, and impractical in many real settings. Instead, the sample should be selected via computer.


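There are two principal steps in selecting a SRS with replacement: first, select "n" random numbers between 1 and N; second, combine the selected numbers with the population data to identify the sampled subjects.

We will develop programs for selecting a SRS with replacement following these two steps.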


Selecting the Sample

Outline:

1. Determine the sample size=n, and population size=N
2. Assume that the population has been numbered from 1 to N
3. Use a Uniform Random Number Generator to select the sample
a. expand the range from (0,1) to (0,N) by multiplying the random number by N
b. add 1 to the result so that the numbers range from 1 to (N+1)
c. use the integer part of the result, which is a whole number from 1 to N
d. repeat the process "n" times, and OUTPUT the results.
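
The outline above can be sketched as a short DATA Step. This is a minimal illustration, not the course program: the dataset name "srswr", the variable names, the population size of 20, and the seed are all assumptions made for the example. The names "nsamp" and "npop" are used instead of "n" and "N" because SAS does not distinguish upper and lower case in variable names.

```sas
/* Sketch only; values and names below are illustrative assumptions. */
data srswr;
  call streaminit(691);          /* fix the seed so the run can be repeated   */
  nsamp = 5;                     /* sample size n                             */
  npop  = 20;                    /* population size N (assumed)               */
  do draw = 1 to nsamp;          /* step d: repeat n times                    */
    u  = rand('uniform');        /* uniform random number in (0,1)            */
    id = int(u * npop + 1);      /* steps a-c: scale, add 1, integer part     */
    output;                      /* write one record per selection            */
  end;
run;

proc print data=srswr;
run;
```

Because each draw is made independently, the same value of "id" can appear more than once, which is what sampling with replacement allows.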

We illustrate this process with the program LEC25P1.SAS, resulting in the sample given in Table 1.


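Once a set of numbers has been selected from the list 1,...,N of population numbers, it has to be combined with the population data to obtain the sample. We do this in LEC25P2.SAS. The steps in the process are to list the population, assign the numbers 1,...,N to its records, and match the selected numbers against the numbered population.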

The results are illustrated in the following tables:

Table 1. List of Sample of n=5 Subjects

Table 2. Listing of the Population

Table 3. Listing of the Population with Assigned Numbers

Table 4. Listing of the Identified Sample
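
The matching of selected numbers to population records can be sketched with a match-merge. This is only an illustration under stated assumptions: the population of 8 names, the dataset names "pop", "sample" and "identified", and the seed are hypothetical and are not taken from LEC25P2.SAS.

```sas
/* Hypothetical population of N=8 subjects (names are made up). */
data pop;
  input name $;
  id = _n_;                       /* assign the numbers 1,...,N in list order */
  datalines;
Ann
Bob
Cal
Dee
Eve
Fay
Gus
Hal
;
run;

/* Select n=3 numbers between 1 and 8, with replacement. */
data sample;
  call streaminit(691);
  do draw = 1 to 3;
    id = int(rand('uniform') * 8 + 1);
    output;
  end;
run;

proc sort data=sample;            /* MERGE with BY needs sorted input         */
  by id;
run;

/* Keep only population records whose number was drawn. */
data identified;
  merge sample(in=insamp) pop;
  by id;
  if insamp;                      /* drop subjects that were not selected     */
run;

proc print data=identified;
run;
```

If a number is drawn twice, it appears twice in "sample", and the merge repeats that subject's population record for each draw.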


2. Selecting a Simple Random Sample Without Replacement

SRS without replacement can be selected in a similar manner. Conceptually, to select such a sample, we follow similar steps as when selecting a simple random sample with replacement. The steps are:

1. Number subjects in the population from 1 to N
2. Write each subject's number on a slip of paper (of equal size)
3. Place all the slips of paper in a "hat"
4. After being blindfolded,
a. Reach into the "hat" and pick one slip of paper.
b. Record the number selected.

5. Repeat step (4) until "n" subjects have been selected.

 

Following these steps using a random numbers table is tedious. It is particularly tedious because each time a new random number is selected, one has to check that the number has not already been included before adding it to the sample.


Using a computer program to select a simple random sample without replacement is also not as straightforward as selecting a sample with replacement. If we adapt the "with replacement" method to "without replacement", we might start with the first selection in the same manner.

However, to select the second subject, we can no longer use the random number generator in an identical manner. We could re-number the remaining subjects, and follow the same plan, but the continual re-numbering of subjects will be confusing.


A different strategy is used to select a simple random sample without replacement. To select the sample, we do the following:

1. Generate a uniform random number. If the number is less than n/N, select the 1st subject in the list. If the number is greater than n/N, the first subject is not in the sample.

2a. If the result of (1) was to include the first subject in the sample, we now need to select (n-1) subjects out of the remaining (N-1). Generate a second uniform random number, and select the second subject if the number is less than (n-1)/(N-1). Otherwise, exclude the subject from the sample.

2b. If the result of (1) was to exclude the first subject from the sample, we now need to select "n" subjects out of the remaining (N-1). Generate a second uniform random number, and select the second subject if the number is less than n/(N-1). Otherwise, exclude the subject from the sample.

3. Continue with this process until "n" subjects have been selected.
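
The strategy in steps 1 to 3 can be sketched as follows. This is a hedged illustration rather than the course program: the population of 20 numbered records, the sample size of 5, the dataset names, and the seed are assumptions; the names "nsamp" and "npop" follow the variables discussed later in this lecture.

```sas
/* Hypothetical population: 20 subjects numbered 1 to 20. */
data pop;
  do id = 1 to 20;
    output;
  end;
run;

data srswor;
  set pop;
  retain nsamp 5 npop 20;         /* still to select, still to examine        */
  if _n_ = 1 then call streaminit(691);
  u    = rand('uniform');         /* uniform random number in (0,1)           */
  prob = nsamp / npop;            /* n/N, then (n-1)/(N-1) or n/(N-1), ...    */
  if u < prob then do;
    nsamp = nsamp - 1;            /* one fewer subject left to select         */
    output;                       /* only selected subjects are written       */
  end;
  npop = npop - 1;                /* one fewer subject left in the list       */
run;

proc print data=srswor;
run;
```

Once "nsamp" reaches 0 the selection probability is 0, and whenever "nsamp" equals "npop" it is 1, so exactly 5 subjects are written.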

 

We illustrate selection of a Simple Random Sample Without Replacement with the program LEC25P3.sas. Note that in the program, there are statements such as:

nsamp=nsamp-1;

Clearly, the value of "nsamp" cannot equal itself minus "1". This statement can be understood by thinking of the value on the left hand side as the "new" value, and the value on the right hand side as the "initial" value. Thus, the statement will take an initial value of "nsamp"=4, and give it a "new" value of "nsamp"=3.

The resulting sample is indicated in Table 2. Notice in Table 2 that the probability of a subject being included in the sample varies as the data are read, ranging from a maximum of 0.55 to a minimum of 0. These probabilities give the impression that subjects have different chances of being included in the sample. For a particular sequence of random numbers, this is true. However, when one averages over all possible sequences of random numbers, one can show that each distinct sample has the same chance of occurring.
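
As a partial check of this claim, the chance that the second subject in the list is selected can be worked out from the probabilities in steps 1, 2a and 2b. The first term covers the case where the first subject was selected, and the second term the case where it was not:

```latex
\begin{aligned}
P(\text{subject 2 selected})
  &= \frac{n}{N}\cdot\frac{n-1}{N-1}
     + \left(1-\frac{n}{N}\right)\frac{n}{N-1} \\
  &= \frac{n(n-1) + (N-n)\,n}{N(N-1)}
   = \frac{n(N-1)}{N(N-1)}
   = \frac{n}{N}.
\end{aligned}
```

This is the same probability, n/N, with which the first subject is selected.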

We illustrate that this sampling strategy does give equal probability to each subject by simulating the sampling. We develop the simulation for 2 populations in LEC25P4.SAS, where Table 4.1 lists the populations, and Table 4.2 lists the selected samples.

Note that the effect of the OUTPUT statement may appear somewhat unexpected, since subjects not selected in the sample do not appear in the dataset. When a DATA Step contains no OUTPUT statement, SAS automatically acts as if there were one at the end of the step. As a result, each time the end of the DATA Step is reached (once for each record read), the results are written to the SAS dataset. In contrast, when an OUTPUT statement is included in a DATA Step, no "automatic" OUTPUT statement is placed at the end of the DATA Step, and a record is written only when an OUTPUT statement is executed. For this reason, subjects not selected in the sample are not written out to the dataset.
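
The difference between the automatic and the explicit OUTPUT can be seen in a small example. The datasets "nums", "implicit" and "explicit" are hypothetical and are used only for illustration.

```sas
/* Six records, numbered 1 to 6. */
data nums;
  do id = 1 to 6;
    output;
  end;
run;

/* No OUTPUT statement: SAS writes a record at the end of every
   iteration, so "implicit" has 6 records.                        */
data implicit;
  set nums;
  even = (mod(id, 2) = 0);        /* 1 if id is even, 0 otherwise */
run;

/* An OUTPUT statement is present: nothing is written automatically,
   so "explicit" has only the 3 records with an even id.           */
data explicit;
  set nums;
  even = (mod(id, 2) = 0);
  if even then output;
run;
```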

We conduct the simulation with LEC25P5.SAS, resulting in the following tabulation of the frequency of subject selections (given in Table 5.2).
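
A simulation of this kind can be sketched as below. This is not LEC25P5.SAS; it assumes a single hypothetical population of N=5 subjects, samples of n=2, 10000 replications, and an arbitrary seed.

```sas
/* Repeat the without-replacement selection many times. */
data sim;
  call streaminit(691);
  do rep = 1 to 10000;            /* one simulated sample per replication     */
    nsamp = 2;                    /* reset n for each new sample              */
    npop  = 5;                    /* reset N for each new sample              */
    do id = 1 to 5;
      if rand('uniform') < nsamp / npop then do;
        nsamp = nsamp - 1;
        output;                   /* record the selected subject              */
      end;
      npop = npop - 1;
    end;
  end;
  keep rep id;
run;

/* Tabulate how often each subject was selected. */
proc freq data=sim;
  tables id;
run;
```

If the method gives every subject the same chance of selection, each of the five values of "id" should account for roughly 20 percent of the selections.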


Examining How the Sampling Program Works: Use of PUT statements.

We can examine how the program works in more detail by having intermediate results written as the program is processing. We do so by including PUT statements. A PUT statement can list whatever text you would like (in quotation marks) or the value of variables. The program LEC25P6.sas includes some PUT statements to provide details of the sample selection (see output). Using PUT statements can help to understand details of the processing of the data when programs are complicated.
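
A minimal sketch of this use of PUT statements follows. It is not LEC25P6.sas: the population of 6 records, the sample size of 2, and the message text are assumptions. By default, PUT writes its output to the SAS log.

```sas
data pop;
  do id = 1 to 6;
    output;
  end;
run;

data srswor;
  set pop;
  retain nsamp 2 npop 6;
  if _n_ = 1 then call streaminit(691);
  u    = rand('uniform');
  prob = nsamp / npop;
  /* Quoted text is printed as is; "name=" prints the name and its value. */
  put 'Reading record: ' id= nsamp= npop= prob= u=;
  if u < prob then do;
    nsamp = nsamp - 1;
    put '   Selected. Still to select: ' nsamp=;
    output;
  end;
  npop = npop - 1;
run;
```

Each record read produces one line in the log, and each selected record produces a second line, so the changing values of "nsamp", "npop" and "prob" can be followed record by record.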


A key ability of the program used to select SRS without replacement is the capacity to remember the results of the previous record's processing when processing the next record. This capacity is required since the probability of selecting a subject depends on the number of subjects remaining to be selected, and the size of the remaining population from which they will be selected. The RETAIN statement is used to remember these values. As part of the RETAIN statement, initial values for "nsamp" and "npop" are specified. Specifying them in the RETAIN statement does not mean that these values "stay the same" as the DATA Step continues to process. By including the initial values in the RETAIN statement, the variables take on these values only for the first iteration of the DATA Step. After the first iteration, the value "remembered" for each of these variables is the value that it was last given in the DATA Step.

The program LEC25P7.SAS illustrates in Table 7.1 that if initial values are simply fixed, even with a RETAIN statement, the function of the DATA Step is substantially altered. Using assignment statements to fix the values does not allow them to change as consecutive records are read in the data.
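
The contrast can be sketched with a hypothetical population of 4 records (this is an illustration, not LEC25P7.SAS). In the first DATA Step the initial value is given in the RETAIN statement; in the second it is fixed by an assignment statement.

```sas
data pop;
  do id = 1 to 4;
    output;
  end;
run;

/* Initial value in RETAIN: applied once, then the last value carries over. */
data retained;
  set pop;
  retain npop 4;
  npop = npop - 1;                /* npop is 3, 2, 1, 0 on records 1 to 4     */
run;

/* Assignment statement: executed again for every record read. */
data fixed;
  set pop;
  retain npop;
  npop = 4;                       /* resets npop on every record              */
  npop = npop - 1;                /* npop is 3 on every record                */
run;
```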


The actual function of the initial values in a RETAIN statement can be duplicated by using the SAS automatic variable "_N_". This variable can be used in DATA Steps, but is not included in the resulting dataset unless its value is assigned to another variable. "_N_" counts the iterations of the DATA Step, so when one record is read per iteration its value is 1, 2, ..., up to N, where N is the number of records in the dataset. As the next record is read, the value of "_N_" is increased by one. The initial values in the RETAIN statement are applied only for the first record. This can be seen by examining the results of LEC25P8.SAS.
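
The equivalence can be sketched as follows, again with a hypothetical population of 4 records and illustrative dataset names (this is not LEC25P8.SAS). Both DATA Steps produce the same values of "nsamp" and "npop".

```sas
data pop;
  do id = 1 to 4;
    output;
  end;
run;

/* Version 1: initial values given in the RETAIN statement. */
data viaretain;
  set pop;
  retain nsamp 2 npop 4;
  npop = npop - 1;
run;

/* Version 2: RETAIN without initial values; the values are
   set only when the first record is processed (_N_ = 1).   */
data vian;
  set pop;
  retain nsamp npop;
  if _n_ = 1 then do;
    nsamp = 2;
    npop  = 4;
  end;
  recno = _n_;                    /* copy _N_ so that it is kept in the data  */
  npop  = npop - 1;
run;

proc print data=viaretain;
run;

proc print data=vian;
run;
```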

With this understanding of the function of the initial values in the RETAIN statement, we can consider the function of the RETAIN statement itself. We do so by comparing the results of the previous program with LEC25P9.SAS, where the RETAIN statement was deleted (see output). Note in the output that NPOP and NSAMP have missing values (represented in SAS by ".") for all records except the first. These values are missing because, after the first record, there is nothing in the program that defines their values for subsequent records. Only if the value from a previous record is remembered can these variables assume values.
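
The effect of deleting the RETAIN statement can be sketched with the same hypothetical 4-record population (an illustration, not LEC25P9.SAS). Variables created in the DATA Step are reset to missing at the start of each iteration unless they are retained.

```sas
data pop;
  do id = 1 to 4;
    output;
  end;
run;

data noretain;
  set pop;
  if _n_ = 1 then do;             /* values are defined on record 1 only      */
    nsamp = 2;
    npop  = 4;
  end;
  npop = npop - 1;                /* missing minus 1 is missing afterwards    */
run;

proc print data=noretain;
run;
```

On the first record "nsamp" is 2 and "npop" is 3; on records 2 to 4 both are missing, and the SAS log notes that missing values were generated.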


3. Selecting a Stratified Simple Random Sample Without Replacement

 Selection of a Stratified Sample from Anaconda

A study was recently (1996) conducted in Anaconda, Montana to evaluate soil ingestion among pre-school age children. The town had a copper smelting plant which contaminated large areas with arsenic and lead. The purpose of the study was to assess the exposure to these elements among children due to ingestion of soil. The town consisted of a long road, with the smelter on one end of town. Due to the proximity to the smelter and the wind direction, different parts of the town received different amounts of contamination. For this reason, the town was divided into strata corresponding to contamination gradients.

The company responsible for clean up of the lead and arsenic is ARCO. One study conducted by ARCO was a study of arsenic levels in urine among children. As part of this study, a census of all children between the ages of 12 and 48 months was obtained from researchers studying urine arsenic among all children in Anaconda. The list included all children in a geographically defined study area who participated in the urine study. Children not included in the sample frame were children from families that refused to participate in that study, children who had moved, and children from families who had not yet completed the initial urine study (approximately 12 families). A total of 258 families with children between the ages of 12 and 48 months were identified in that list. Data on these families are contained in a SAS data set anc2v1.sd2.

Families were grouped in geographic areas which were defined from west to east and were relatively homogeneous in terms of age of housing. These areas were collapsed into six areas that were relatively homogeneous for age of housing and geographically contiguous. The six collapsed areas that formed the strata for sample selection, and the number of families from the original study areas, are given in Table 1.

Table 1. Geographic Areas Used to Form Strata for Anaconda Soil Ingestion Study

Original areas          Collapsed areas (strata)
Area   # Families       Areas    Stratum   # Families
A      32               A        1         32
B      35               B,C      2         62
C      27               D,I,J    3         33
D      14               E        4         34
E      34               F        5         83
F      83               G,K      6         14
G      12
I       8
J      11
K       2

The study design called for a stratified sample of 64 families from this population, using the Area groupings as strata (i.e., stratum), with the sample allocated to the strata in proportion to stratum size. Since the age of the child was important in the study, only families with children whose age (in years) was known were considered eligible to be included.
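
Reading "proportional to the strata size" as proportional allocation, the number of families to select in each stratum can be sketched as below. This is an illustration under a strong assumption: it uses the Table 1 counts for all 258 listed families, whereas the study restricted eligibility to families with a child of known age, so the actual stratum totals may differ. The dataset and variable names are hypothetical.

```sas
/* Stratum sizes taken from Table 1 (all 258 listed families). */
data strata;
  input stratum nfam;
  datalines;
1 32
2 62
3 33
4 34
5 83
6 14
;
run;

data alloc;
  set strata;
  exact   = 64 * nfam / 258;      /* proportional share of the 64 families    */
  nsamp_h = round(exact);         /* nearest whole number of families         */
run;

proc print data=alloc;
  sum nsamp_h;                    /* total of the rounded allocations         */
run;
```

With these counts the rounded allocations add up to 63 rather than 64, which shows that simple rounding may need a final adjustment to reach the target sample size.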


Initial Examination of Anaconda Population Data

As a first step, we examine the ANACONDA population data (anc2v1.sd2) using the program LEC25P10.SAS. Sample lists of these data are given in Table 10.1 and Table 10.2, with a cross classification of strata and age in Table 10.3. We also tabulate the stratum totals for households where children's age is known (Table 10.4).


The final step is to select simple random samples from each stratum without replacement. We do this in LEC25P11.sas, with the resulting sample listed in Table 11.1, and summarized in Table 11.2.
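
Selection within strata can be sketched by restarting the without-replacement method at the first record of each stratum. This is not LEC25P11.sas: the three small strata, the stratum sample sizes, the dataset names, and the seed are hypothetical values chosen for illustration.

```sas
/* Hypothetical population: 3 strata with 6, 4 and 5 families. */
data pop;
  input stratum nfam;
  do fam = 1 to nfam;
    output;
  end;
  drop nfam;
  datalines;
1 6
2 4
3 5
;
run;

/* Hypothetical stratum sizes (npop_h) and sample sizes (nsamp_h). */
data alloc;
  input stratum npop_h nsamp_h;
  datalines;
1 6 2
2 4 1
3 5 2
;
run;

data sample;
  merge pop alloc;                /* attach the stratum totals to each family */
  by stratum;
  retain nsamp npop;
  if _n_ = 1 then call streaminit(691);
  if first.stratum then do;       /* restart the counts in each stratum       */
    nsamp = nsamp_h;
    npop  = npop_h;
  end;
  if rand('uniform') < nsamp / npop then do;
    nsamp = nsamp - 1;
    output;                       /* write selected families only             */
  end;
  npop = npop - 1;
  drop npop_h nsamp_h;
run;

proc print data=sample;
run;
```

The result contains exactly 2, 1 and 2 families from strata 1, 2 and 3.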

Produced and maintained by the Dept of BioEpi at UMASS
Send comments or questions about this web site to Ed Stanek
Email: stanek@schoolph.umass.edu
Last Update: 11/29/99