Sampling Exercises

Simple Random Sampling With Replacement of a Population

This assignment explores samples and populations, and the basic connection between the two that results from probability sampling. We use computer-selected samples to understand the ideas. The exercises require running example SAS programs.

Introduction

Populations are rarely studied in their entirety. Instead, a sample of subjects is selected, and the results on the sample are used to estimate a parameter in the population. We describe how the results from a sample relate to the population by repeating the sampling many times, and comparing the estimates from the sample with the population parameter. The sample estimate is rarely equal to the population parameter. We consider two characteristics of sample estimates to describe how they relate to a population parameter.

• First, is the average of the estimates equal to the population parameter? If it is, the estimator is called unbiased.
• Second, how widely spread out are the sample estimates? We use the variance to measure the spread of the estimates.

We can draw many samples in the same manner using the computer. This exercise uses the computer to help understand what the distribution of estimates looks like when we sample repeatedly from a population. The topics are:

• Selecting Simple Random Samples with replacement.
• Selecting Simple Random Samples without replacement.
• The shape of the distribution formed by repeated evaluations of a sample mean.

Selecting Simple Random Samples with Replacement

We use a small example to illustrate ideas. Suppose a population of N=6 fragile diabetics in a clinic has been identified. We wish to know the average number of hypoglycemic episodes reported by the patients in the past month. We assume that the actual number of hypoglycemic episodes recorded in each patient's record for the past month is as follows:

Subject ID    Number of Hypoglycemic Episodes
1             7
2             6
3             1
4             2
5             16
6             4

The first program will do the following:

• Read in the population data.
• Calculate population parameters corresponding to the mean and variance.
• Create a Percent Frequency Histogram of population data.
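The three steps above can also be sketched outside SAS. The following is a minimal Python sketch, not the course program a54p9.sas; the names `population_parameters` and `percent_frequency` are hypothetical. Because the handout does not say which divisor the SAS program uses for the population variance, the sketch reports both common versions (divisor N and divisor N-1).

```python
from collections import Counter
from statistics import fmean

# Population from the table above: subject ID -> hypoglycemic episodes.
population = {1: 7, 2: 6, 3: 1, 4: 2, 5: 16, 6: 4}


def population_parameters(values):
    """Return the mean and both common versions of the variance."""
    size = len(values)
    mean = fmean(values)
    ss = sum((y - mean) ** 2 for y in values)  # sum of squared deviations
    return {
        "mean": mean,
        "var_div_N": ss / size,                # divisor N
        "var_div_N_minus_1": ss / (size - 1),  # divisor N - 1
    }


def percent_frequency(values):
    """Percent of subjects at each distinct value (histogram bar heights)."""
    counts = Counter(values)
    return {v: 100 * c / len(values) for v, c in sorted(counts.items())}


params = population_parameters(list(population.values()))
percents = percent_frequency(list(population.values()))
print(params)
print(percents)
```

Comparing the two variance values with the SAS output is one way to check which formula the program applied.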

1. Do the following:

• Start the SAS program
• Copy the program a54p9.sas into the Program Window in SAS.
• Run the program (click on the Running Person)
• Move to the top (use Pg Up) of the Output Window.
• Page up and down through the SAS output. The first page contains the data for the population. The second page tabulates population parameters.
• Print a copy of the table that contains the population parameters.
• Print a copy of the Percent Frequency Histogram for the population.

Use the SAS output to answer the following questions.

• a. What is the average number of hypoglycemic episodes in the population?
• b. What is the population variance? Write out the formula that is used to calculate the population variance. Later on, using your calculator, check that the computer program used this formula to calculate the variance.

Rather than reviewing all records, suppose we wish to save time by selecting a sample of patient records (with replacement) for n=3 patients and use the sample to estimate the number of hypoglycemic episodes.

The next program will:

• Select a simple random sample of size n=3 with replacement from the population.
• Calculate sample statistics based on values for subjects selected in the sample.
• Repeat the sampling for 40 samples, and list the results in a table.
• Create a Percent Frequency Histogram of the sample means for the 40 samples.
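The sampling steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the course program a54p10.sas: the function name `draw_samples` and the seed are hypothetical, and the sample variance uses the divisor n-1 (the behavior of `statistics.variance`), which may or may not match the SAS program.

```python
import random
from statistics import fmean, variance

# Population from the table above: subject ID -> hypoglycemic episodes.
population = {1: 7, 2: 6, 3: 1, 4: 2, 5: 16, 6: 4}


def draw_samples(pop, n, reps, seed):
    """Draw `reps` simple random samples of size n with replacement."""
    rng = random.Random(seed)  # fixed seed so the listing is reproducible
    ids = list(pop)
    rows = []
    for _ in range(reps):
        chosen = rng.choices(ids, k=n)  # with replacement: IDs may repeat
        values = [pop[i] for i in chosen]
        rows.append({
            "ids": chosen,
            "mean": fmean(values),
            "var": variance(values),  # divisor n - 1 (an assumption)
        })
    return rows


samples = draw_samples(population, n=3, reps=40, seed=2024)
for row in samples[:5]:
    print(row["ids"], round(row["mean"], 2), round(row["var"], 2))
```

Changing the seed gives a different set of 40 samples, just as rerunning the SAS program would.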

2. Do the following:

• Copy the program a54p10.sas into the Program Window in SAS. The program selects 40 samples of size 3 (with replacement). Run the program by clicking on the Running Person Icon, and review the output.
• Print a copy of the 40 samples selected.

• a. What formula is used to estimate the variance in the sample?
• b. Later on, check the results of the calculation of the sample variance for the first sample using your hand calculator.
• c. Do any samples occur where the same patient was selected on each of the selections? Which ones? What is the chance (probability) that this will happen? Would you expect it to happen as often as it did (or more often) if only 40 samples were selected?
• d. Print a histogram of the sample mean for the 40 samples. Consider the heights of the bars on the histogram as weights. If the x-axis were the top of a scale, does it appear likely that the histogram could be balanced at a value equal to the population mean?
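The balance idea in question d can be made concrete: if each bar's height is treated as a weight placed at its x-value, the balance point is the weighted average of the x-values. The sketch below shows that calculation in Python. The function name `balance_point` and the bar heights are hypothetical and are not taken from the SAS output.

```python
def balance_point(bars):
    """Centre of mass of a histogram given {x_value: bar_height}."""
    total_weight = sum(bars.values())
    return sum(x * height for x, height in bars.items()) / total_weight


# Hypothetical bar heights (percent of samples), for illustration only.
example_bars = {2: 10.0, 4: 30.0, 6: 40.0, 10: 20.0}
print(balance_point(example_bars))
```

Entering the bar heights read from a printed histogram gives the balance point of that histogram, which can then be compared with a guess made by eye.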

We can use the computer program to select many samples, and evaluate the results of the sample selections. We use this technique to understand the properties of statistics calculated from samples. The properties we develop concern the average value of the statistic (expected value), and the variance of the statistic.

The next program will:

• Select a simple random sample of size n=3 with replacement from the population.
• Calculate sample statistics based on values for subjects selected in the sample.
• Repeat the sampling for 4320 samples.
• Print a table containing the frequency of occurrence of each possible sample.
• Create a Percent Frequency Histogram of the sample means for the 4320 samples.
• Create a Percent Frequency Histogram of the sample variances for the 4320 samples.
• Create a Percent Frequency Histogram of the sample standard deviations for the 4320 samples.
• Create a Percent Frequency Histogram of the sample minimum for the 4320 samples.
• Create a Percent Frequency Histogram of the sample maximum for the 4320 samples.
• Create a Percent Frequency Histogram of the sample range for the 4320 samples.
• Evaluate the average, standard deviation, variance, minimum, and maximum of the sample statistics, and print them in Table 2.

3. Do the following:

• Copy the program a54p11.sas into the Program Window in SAS. The program selects 4320 samples of size 3 (with replacement). Run the program, and review the output.
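For readers who want to see the mechanics, the following Python sketch mimics what this program is described as doing: it tallies each ordered sample of IDs and collects six statistics per sample. It is a stand-in under stated assumptions, not a54p11.sas itself; the function name `simulate` and the seed are hypothetical, and the sample variance and standard deviation use the divisor n-1.

```python
import random
from collections import Counter
from statistics import fmean, stdev, variance

# Population from the table above: subject ID -> hypoglycemic episodes.
population = {1: 7, 2: 6, 3: 1, 4: 2, 5: 16, 6: 4}


def simulate(pop, n=3, reps=4320, seed=11):
    """Tally ordered ID samples and collect six statistics per sample."""
    rng = random.Random(seed)
    ids = list(pop)
    tally = Counter()
    stats = {k: [] for k in ("mean", "var", "sd", "min", "max", "range")}
    for _ in range(reps):
        chosen = tuple(rng.choices(ids, k=n))  # order of selection is kept
        tally[chosen] += 1
        y = [pop[i] for i in chosen]
        stats["mean"].append(fmean(y))
        stats["var"].append(variance(y))  # divisor n - 1 (an assumption)
        stats["sd"].append(stdev(y))
        stats["min"].append(min(y))
        stats["max"].append(max(y))
        stats["range"].append(max(y) - min(y))
    return tally, stats


tally, stats = simulate(population)
# Average of each statistic over all simulated samples.
averages = {name: fmean(vals) for name, vals in stats.items()}
print(len(tally), "distinct ordered samples observed")
print(tally[(3, 3, 3)], "occurrences of the sample (3, 3, 3)")
print({name: round(avg, 3) for name, avg in averages.items()})
```

The `tally` counter plays the role of the frequency table of samples, and `averages` corresponds to one column of the summary table; the exact numbers will differ from the SAS output because the random draws differ.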

• a. If we count samples as different if either the subjects included are different, or if the order of selection of subjects is different, how many different possible samples could we select?
• b. How many times would you expect that each sample would occur if 4320 samples were selected?
• c. Consider the sample that selects ID=3 three times. Compare the number of times this sample occurs among the 4320 samples with the number of times you would expect it to occur.
• d. Review the frequency table of different possible selections of sample IDs.
• e. How many different samples are in the frequency table?
• f. Table 1 is a frequency table of 216 different possible samples of size 3 based on simple random sampling with replacement. How many of the 216 samples are there that include the subjects ID=1, ID=2, and ID=3, regardless of the order of inclusion?
• g. In how many different orders can the same three subjects be selected? (This is the number of permutations.)
• h. If 216,000 samples were selected, how many times do you expect each distinct sample would be selected?

In the program, the sample mean is calculated for each sample.

• i. Print a copy of the histogram of the sample mean for the 4320 samples.
• j. Using the histogram, guess the value (number of hypoglycemic episodes) at which the histogram would balance. (The balance point is the expected value, which equals the mean of the distribution.)

Histograms are included in Figures 2-6 for the percent frequency distribution of estimates (based on the sample) of the variance, the standard deviation, the minimum, the maximum, and the range. Inspect each of these distributions. Distributions that have a long tail to the right are called right skewed, while those with long tails to the left are called left skewed.
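Skewness can also be checked numerically rather than by eye. The sketch below is one possible check, not part of the SAS programs: it computes the moment coefficient of skewness, which is positive for a long right tail and negative for a long left tail. The names `skewness` and `skew_direction`, the cutoff `tolerance=0.1`, and the example data are hypothetical.

```python
from statistics import fmean


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5.

    Undefined (division by zero) when all values are equal.
    """
    mean = fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def skew_direction(values, tolerance=0.1):
    """Label a distribution; the tolerance is an arbitrary cutoff."""
    g = skewness(values)
    if g > tolerance:
        return "right skewed"
    if g < -tolerance:
        return "left skewed"
    return "roughly symmetric"


# Hypothetical data, for illustration only.
long_right_tail = [1, 1, 2, 2, 3, 10]
long_left_tail = [-v for v in long_right_tail]
print(skew_direction(long_right_tail), "/", skew_direction(long_left_tail))
```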

• k. Which distributions are skewed to the right?
• l. Which distributions are skewed to the left?
• m. Based on the histograms, guess where the expected value (the mean) would be for each sample statistic. Write down your guesses.
• n. See how good your guesses are by comparing them with the results in Table 2 at the end of the output.

When the expected value of a statistic is equal to the value in the population, the statistic is called "unbiased".
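Because every ordered sample is equally likely under simple random sampling with replacement, the expected value of a statistic can be computed exactly by averaging it over all possible ordered samples, instead of estimating it by simulation. The Python sketch below shows this idea; the names `expected_value` and `bias` are hypothetical. As a simplifying assumption, the population value is taken to be the same function applied to the whole population, which suits the mean, minimum, maximum, and range but needs care for the variance, where the divisor matters.

```python
from itertools import product
from statistics import fmean

# Episode counts for the six subjects in the table above.
population = [7, 6, 1, 2, 16, 4]


def expected_value(statistic, pop, n):
    """Average of `statistic` over all len(pop)**n ordered samples."""
    return fmean(statistic(sample) for sample in product(pop, repeat=n))


def bias(statistic, pop, n):
    """Expected value minus the same function applied to the population."""
    return expected_value(statistic, pop, n) - statistic(pop)


def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


for name, stat in [("mean", fmean), ("max", max), ("range", value_range)]:
    print(name, round(bias(stat, population, n=3), 4))
```

A bias of zero (up to rounding) indicates an unbiased statistic in the sense defined above; the simulated averages from 4320 samples should land close to these exact values.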

• Compare the population values for the mean, variance, standard deviation, minimum, maximum, and range with the expected values from Table 2. Which statistics do you think are "unbiased"?

Computer programs can generate many simple random samples effortlessly. We can use the programs to study the properties of the sample mean as a function of the number of subjects selected in the sample. Here you will study how the mean and variance of the sample mean vary with the sample size.

The next program will:

• Select a simple random sample of size n=2 with replacement from the population.
• Calculate the sample mean using values for subjects selected in the sample.
• Repeat the sampling for 4320 samples.
• Create a Percent Frequency Histogram of the sample means for the 4320 samples.
• Calculate the mean and the variance of the sample means.

4. Do the following:

• Copy the program a54p12.sas into the Program Window in SAS. The program selects 4320 samples of size 2 (with replacement).
• Run the program, and review the output.

Complete the following:

• a. Print a histogram of the Percent frequency distribution of sample means.
• b. Record the mean and variance of the sample means using results from Table 1.
• c. Using the histogram, guess the value at which the distribution of sample means will balance (what is the expected value of the sample mean?).
• d. Is the distribution of sample means skewed? Which way?

Modify the program a54p12.sas to select simple random samples with replacement of size n=4 by changing the %LET statement that sets tsamp to:

%LET tsamp=4;

Rerun the program.

• e. Print a histogram of sample means.
• f. Record the mean and variance of the sample means using results from Table 1.
• g. Using the histogram, guess the value at which the distribution of sample means will balance (what is the expected value of the sample mean?).
• h. Is the distribution of sample means skewed? Which way?
• i. Repeat this process for samples of size n=5, n=6, and n=10.
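The rerun-with-a-new-size procedure above can be mirrored in a short Python loop. This is a sketch, not a54p12.sas: the function name `sample_mean_summary` and the seed are hypothetical, the parameter `n` plays the role of `tsamp`, and the variance of the sample means is computed with the divisor equal to the number of samples (`statistics.pvariance`), which may differ from the SAS program's choice.

```python
import random
from statistics import fmean, pvariance

# Episode counts for the six subjects in the table above.
population = [7, 6, 1, 2, 16, 4]


def sample_mean_summary(pop, n, reps=4320, seed=1):
    """Mean and variance of `reps` sample means, sampling with replacement."""
    rng = random.Random(seed)
    means = [fmean(rng.choices(pop, k=n)) for _ in range(reps)]
    return fmean(means), pvariance(means)


# Sample sizes used in this exercise (n plays the role of tsamp).
for n in (2, 4, 5, 6, 10):
    avg, var = sample_mean_summary(population, n)
    print(f"n={n:2d}  mean of means={avg:.3f}  variance of means={var:.3f}")
```

Sampling with replacement is what allows n=10 even though the population has only N=6 subjects.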

Use your results to summarize the relationship between the sample size and the mean and variance of the sample mean. Please do the following:

• j. By hand, make a bar chart that has as the x-axis (the abscissa) the sample size, and as the y-axis (the ordinate) the expected value of the sample mean.
• k. By hand, make a line plot that has as the x-axis (the abscissa) the sample size, and as the y-axis (the ordinate) the variance of the sample means.
• l. Use this plot to deduce a simple relationship between the expected value of the variance of the sample means, the population variance, and the sample size.