The Central Limit Theorem is an important theorem in statistics. It states that when simple random samples are selected repeatedly from a population, and the average of the observations is calculated for each sample, the distribution of these sample averages will be approximately "normal" (as long as the sample size is sufficiently large). The mean of this normal distribution is the mean of the original population, and its variance is σ²/n, where σ² is the variance of the original population and "n" is the sample size. The theorem is important because the same type of distribution results for the averages, regardless of the distribution that one starts with.
The difficulty with the Central Limit Theorem is that the meaning of "sufficiently large" depends on the particular setting. With experience, it is possible to "judge" how close the distribution of the sample mean will be to a "normal" distribution. In some populations, a very large sample size is needed before the distribution of the sample averages is approximately "normal". In other settings, a relatively small sample size (say n=5 or 10) is adequate. It is possible to illustrate the Central Limit Theorem by using a simple SAS program. We consider such an example where the population is binomially distributed.
For example, suppose that the probability that a person is absent from school (due to illness) is given by "P". Let us define a random variable "y" that indicates whether or not the person is absent from school on a given day:
y = 1 if the person is absent on a day
y = 0 if the person is present on a day
where P = Prob(person is absent).
We can think of "y" as the result of a flip of a coin, where one side of the coin is marked with a "1" and the other side is marked with a "0", and the probability (the long-run chance) of getting a "1" is given by "P".
If we keep track of attendance over a week (as if the coin is flipped n=7 times), then the total number of absent days is simply the sum of the y values for the week. We denote this sum by "x". The proportion of days absent, x/n, is the sample average of the y values and serves as an estimate of Prob(person is absent). For different weeks, the estimate will differ, since the person may be absent for different numbers of days in different weeks. Over a year, we might construct 52 estimates of Prob(person is absent), with each estimate based on a sample of n=7 days. The distribution of these estimates is the subject of the Central Limit Theorem.
In order to simulate the setting above on the computer, we need to be able to "flip" a coin and generate the result of the flip on the computer. We do this using a Random Number Generator. There are a number of random number generators in SAS. The simplest is the Uniform Random Number Generator, which picks a number at random in the interval (0, 1). All numbers in this interval have the same chance of being selected. To use the Uniform Random Number Generator to simulate "flipping a coin", we let the result be a "1" if the number is less than 0.20, and "0" otherwise. This corresponds to a coin with P = 0.20.
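The thresholding rule above can be sketched in a few lines. This is a minimal Python illustration, not the SAS program discussed in these notes; the names `flip` and `P_ABSENT` are made up for the example, and Python's generator will not reproduce the numbers that SAS produces for the same seed.

```python
import random

P_ABSENT = 0.20  # threshold from the text: the flip is a 1 when the uniform draw is below 0.20

def flip(rng, p=P_ABSENT):
    """Return 1 ("absent") if a uniform draw on [0, 1) falls below p, else 0."""
    return 1 if rng.random() < p else 0

# Seed value borrowed from the text; Python's stream differs from the SAS stream.
rng = random.Random(334433)
flips = [flip(rng) for _ in range(10000)]
share_of_ones = sum(flips) / len(flips)
print(share_of_ones)  # in the long run this share should be close to 0.20
```

Because every number in the interval is equally likely, the chance that a draw falls below 0.20 is 0.20, which is why the share of 1s settles near P.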
We illustrate this in lec24p1.sas, which makes use of the UNIFORM(seed) random number generator. The value of the seed (=334433) initiates a sequence of numbers that in the long run follow a uniform distribution. By specifying the seed, the same sequence of numbers appears every time we run the program. Setting the seed to 0 will cause the UNIFORM function to start the random numbers based on the internal computer clock. The result is that different random numbers will be generated for different executions of the program. A list of the resulting data is given in Table 1, with the proportion sick summarized in Figure 1.
The seed is used in random number generators to initiate the random number sequence. Only the first occurrence of the seed in a DATA step is used. Once initiated, the same sequence of numbers is produced, as illustrated in lec24p2.sas. This sequence is illustrated in Table 2.1 for three random uniform variables. Note that the value of the seed remains constant. Also notice in Table 2.2 the correspondence between the random numbers generated with the same seed.
In addition, notice that the random numbers generated for X, Y, and Z on OBS=1 correspond to the first three random numbers of A1. Similarly, the values of X, Y, and Z on OBS=2 correspond to OBS 4-6 for the variable A1. Using these patterns, we can see that the value of A1 on OBS=11 will be 0.77629. In a similar manner, we can conclude that the value of B2 on OBS=11 will be 0.32410.
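The pattern described above (one seeded stream shared by several variables) can be mimicked in Python. This is only a sketch of the idea: it is not lec24p2.sas, the variable names are chosen to echo the text, and the actual numbers will differ from those in Tables 2.1 and 2.2 because Python and SAS use different generators.

```python
import random

SEED = 334433  # seed value borrowed from the text

# One stream, one variable per observation (plays the role of A1).
stream_a = random.Random(SEED)
a1 = [stream_a.random() for _ in range(9)]

# Same seed, three variables per observation (plays the role of X, Y, Z).
stream_xyz = random.Random(SEED)
xyz = [(stream_xyz.random(), stream_xyz.random(), stream_xyz.random())
       for _ in range(3)]

# Observation k of (X, Y, Z) consumes draws 3k-2, 3k-1 and 3k of the shared sequence.
flattened = [value for row in xyz for value in row]
print(flattened == a1)  # True
```

Reading the three-variable rows left to right gives back the single-variable sequence, which is the correspondence the tables show.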
We will illustrate the Central Limit Theorem in the context of a "coin flipping" experiment, or Binomial Trial. A function in SAS will generate the results of "n" such flips, where the probability of "being sick" is given by "P". The Binomial Random Number Generator (RANBIN) produces a number equal to the number of "sick" episodes in the trial. We can use this to estimate "P", the probability of being "sick", by dividing by "n". Let the estimate be denoted by "p".
The Central Limit Theorem states that if the sample size is large enough, the distribution of the sample mean "p" will be approximately normal, with mean equal to the true mean, "P", and variance given by
var(p) = P(1-P)/n.
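The mean and variance quoted above do not depend on the normal approximation; they hold exactly for a binomial proportion. As a check, the following Python sketch sums over the exact binomial probabilities for n=7 and P=0.2 (the values used later in these notes). The function name `proportion_moments` is made up for this illustration.

```python
from math import comb

def proportion_moments(n, p):
    """Exact mean and variance of x/n when x is binomial(n, p)."""
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    mean = sum((k / n) * pmf[k] for k in range(n + 1))
    var = sum((k / n - mean) ** 2 * pmf[k] for k in range(n + 1))
    return mean, var

mean_p, var_p = proportion_moments(7, 0.2)
# The last two printed values should agree up to rounding error.
print(mean_p, var_p, 0.2 * 0.8 / 7)
```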
To generate the distribution of the sample estimates of "P", we can repeat our trial of n=7 days many times, computing the estimate "p" in each trial. The distribution of these estimates should be approximately "normal" if the sample size is large enough.
To see if the Central Limit Theorem "holds", we can compare the distribution of the estimates with a normal distribution with mean µ=P=0.2 and variance var(p)=P(1-P)/n. To do so, we use another SAS function, RANNOR. RANNOR generates random variables that are normally distributed with µ=0 and var=1. To change these into random variables with mean µ and variance P(1-P)/n, we multiply each one by the square root of P(1-P)/n, and then add µ.
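The shift-and-scale step can be sketched as follows. This is a Python illustration using `random.gauss` in place of RANNOR, so it shows the transformation rather than the SAS program itself; the seed and the number of draws (20000) are arbitrary choices made for the example.

```python
import random
from math import sqrt
from statistics import fmean, pvariance

P, N = 0.2, 7
MU = P                      # target mean
SD = sqrt(P * (1 - P) / N)  # target standard deviation, the square root of P(1-P)/n

rng = random.Random(3333)
z = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # standard normal draws (mean 0, var 1)
scaled = [MU + SD * value for value in z]        # multiply by the SD, then add the mean

# The sample mean should be near 0.2 and the sample variance near 0.0229.
print(fmean(scaled), pvariance(scaled))
```

Multiplying by the standard deviation (not the variance) is what gives the transformed values a variance of P(1-P)/n.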
The program LEC24P4.SAS uses these random number generators to simulate 50 binomial samples. The function RANBIN takes three arguments, corresponding to the SEED, n, and P (see the statement: x=RANBIN(3333,7,.2);). The result is the sample total, which, when divided by n, gives the sample proportion (Table 3.1). Table 3.2 lists the 50 sample results, while Table 3.3 summarizes the distribution in a chart.
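For readers without SAS, the same kind of experiment can be sketched in Python. This is not LEC24P4.SAS: the binomial draw is built here by summing n uniform-threshold flips rather than by calling RANBIN, the helper name `binomial_draw` is made up, and the results will not match Tables 3.1 to 3.3 even though the same seed value is used.

```python
import random

def binomial_draw(rng, n, p):
    """Number of "sick" days among n independent days, each sick with probability p."""
    return sum(1 for _ in range(n) if rng.random() < p)

N_DAYS, P, N_SAMPLES = 7, 0.2, 50  # n, P and the number of samples from the text

rng = random.Random(3333)
totals = [binomial_draw(rng, N_DAYS, P) for _ in range(N_SAMPLES)]  # sample totals x
proportions = [x / N_DAYS for x in totals]                         # sample proportions p = x/n

# Show the first few samples: total and proportion.
for x, p_hat in list(zip(totals, proportions))[:5]:
    print(x, round(p_hat, 4))
```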
We evaluate the Central Limit Theorem by repeating the sampling of size 7, and comparing the relative frequency distribution of the sample proportions with that of a normal distribution. To have smoother distributions, we select samples of size 7 many times (200 times). In order to compare the distributions, we need to plot the charts using the same scale. The "normal" distribution has the property that most observations (99.7%) will occur within 3 standard deviations of the mean. Here the standard deviation of the sample proportion is the square root of (0.2)(0.8)/7, or about 0.151. We use this fact to set the scale for the charts, having the axes range from
0.2 - 3 * (0.151) up to 0.2 +3 * (0.151)
or between -0.25 and 0.65. We choose increments of 0.05 to generate the bars. We evaluate the central limit theorem with the program LEC24P5.SAS. The results of the sample proportions are summarized in Figure 4.1, with the corresponding distribution (if the data were normally distributed) given in Figure 4.2.
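The chart scale above follows from a short calculation, sketched here in Python. Treating the 0.05 increments as bar midpoints is an assumption made for this illustration; the notes do not say how LEC24P5.SAS defines the bars.

```python
from math import sqrt

P, N = 0.2, 7
se = sqrt(P * (1 - P) / N)          # standard deviation of the sample proportion
low, high = P - 3 * se, P + 3 * se  # three standard deviations either side of P
print(round(se, 3), round(low, 3), round(high, 3))  # 0.151 -0.254 0.654

# Assumed bar midpoints from -0.25 to 0.65 in steps of 0.05,
# rounded to two decimals to avoid floating-point drift.
midpoints = [round(-0.25 + 0.05 * i, 2) for i in range(19)]
print(midpoints[0], midpoints[-1], len(midpoints))  # -0.25 0.65 19
```

Note that the lower end of the scale is negative even though a proportion cannot be; the scale is set by the normal distribution being compared against.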
Notice the inclusion of MACRO variables in the program. A macro variable is defined by the statement
%LET x=2000;
This variable can be referred to anywhere in the SAS program, where a reference is indicated by &x.
Note that in this setting, the Central Limit Theorem does not provide a very good approximation to the distribution of sample means for samples of size n=7.
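One way to see why the approximation is poor for n=7 is to compare the exact binomial probabilities with the normal probabilities directly, without simulation. The Python sketch below does this; it is an illustration added for this purpose, not part of the SAS programs. For each possible value k/7 it compares the exact probability with the normal probability of the interval of width 1/7 centred on that value.

```python
from math import comb, erf, sqrt

def normal_cdf(x, mu, sd):
    """Cumulative probability of a normal distribution with mean mu and std dev sd."""
    return 0.5 * (1.0 + erf((x - mu) / (sd * sqrt(2.0))))

P, N = 0.2, 7
sd = sqrt(P * (1 - P) / N)
half_step = 0.5 / N  # half the spacing between possible values of x/n

rows = []
for k in range(N + 1):
    exact = comb(N, k) * P**k * (1 - P)**(N - k)  # exact Prob(x = k)
    centre = k / N
    approx = (normal_cdf(centre + half_step, P, sd)
              - normal_cdf(centre - half_step, P, sd))
    rows.append((k, exact, approx))
    print(k, round(exact, 4), round(approx, 4))

# Probability the normal distribution assigns to impossible negative proportions.
mass_below_zero = normal_cdf(0.0, P, sd)
print(round(mass_below_zero, 3))
```

With only eight possible values of the proportion, a skewed distribution, and a noticeable share of the normal curve falling below zero, the two distributions cannot match closely.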
As a second example, suppose we estimated the probability of a person being sick by observing the person for a month (30 days). Assuming the days are independent, the number of days that the person is sick will be binomially distributed. Now n=30 and P=0.2, so var(p)=(0.2)(0.8)/30 = 0.0053333 and SE(p)=0.07303.
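The arithmetic above can be reproduced with a short Python sketch, shown side by side with the n=7 case for comparison. The function name `se_of_proportion` is made up for this illustration.

```python
from math import sqrt

def se_of_proportion(p, n):
    """Standard error of a sample proportion: the square root of p(1-p)/n."""
    return sqrt(p * (1 - p) / n)

P = 0.2
for n in (7, 30):
    var_p = P * (1 - P) / n
    print(n, round(var_p, 7), round(se_of_proportion(P, n), 5))
# The n=30 line should read: 30 0.0053333 0.07303
```

Moving from 7 to 30 observations roughly halves the standard error, since it shrinks with the square root of n.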
Taking about two standard errors on either side of P (0.2 - 2(0.073) up to 0.2 + 2(0.073)), we set the range for our charts to go from 0.05 up to 0.35. Some simple changes to the program provide a comparison of the results using LEC24P6.SAS, with the frequency distribution of the sample proportions given in Figure 5.1, and the corresponding normal distribution given in Table 5.2.
Produced and maintained by the Dept of BioEpi at UMASS