On Pooling Error Terms in Repeated Measures Designs

Richard S. Bogartz

University of Massachusetts, Amherst

Correspondence concerning this article should be addressed to Richard S. Bogartz, Department of Psychology, University of Massachusetts, Amherst, MA 01003 (email: bogartz@psych.umass.edu).

Abstract

Pooling error terms in repeated measures designs can result in extreme inflation of the Type I error rate or extreme loss of power. Using Bartlett's test as a preliminary procedure to protect against the error terms being too different does not result in protection from such biases. It is shown here that in a repeated measures design with two repeated measures factors, B and C, even when Bartlett's test is nonsignificant at the .20 level, pooling can still produce an inflated Type I error rate. The results settle the case for avoiding pooling with repeated measures and using correct error terms.

On Pooling Error Terms in Repeated Measures Designs

The purpose of the present paper is to show that pooling error terms in repeated measures designs entails the risk of too many or too few Type I errors and that the suggestions in the literature for testing the homogeneity of these error terms do not adequately cope with the problem.

In research on infant cognition and perception, infants may be hard to obtain and a high percentage of the infants may be unusable due to fussiness, short attention span, or change of state. Limited availability of research participants also occurs in the study of some kinds of impairment, e.g., deafness or Williams syndrome. Investigators shift to repeated measures designs to cope with scarcity of participants, but even then they often have designs with a small number of degrees of freedom for error.

Consider a design with two repeated measures variables, B and C. There are three error terms: one for B, one for C, and one for the BC interaction. With few degrees of freedom for error, the tests of these effects will lack power. This problem has tempted some investigators to pool within-participant error terms. In other cases, computer programs have resorted to pooling error terms when data points are missing; for example, Baillargeon (1987), using SAS with missing data, reports a pooled error term with 126 df when the correct, unpooled error term would have only 22. Pooling is done by adding the sums of squares for two or more error terms and dividing by the sum of their respective degrees of freedom. Such pooling is extremely dangerous.
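The pooling rule just described amounts to a single line of arithmetic. The following is a minimal sketch in Python; the function name pooled_mean_square and the sums of squares in the usage lines are hypothetical and serve only to illustrate the computation.

```python
def pooled_mean_square(terms):
    """Pool error terms given as (sum_of_squares, degrees_of_freedom) pairs.

    Returns the pooled mean square and its degrees of freedom:
    the summed sums of squares divided by the summed degrees of freedom.
    """
    total_ss = sum(ss for ss, _ in terms)
    total_df = sum(df for _, df in terms)
    return total_ss / total_df, total_df


# Hypothetical sums of squares for two error terms with 28 and 84 df.
ms, df = pooled_mean_square([(56.0, 28), (84.0, 84)])
print(ms, df)  # 1.25 112
```

Note that the two separate mean squares in this illustration are 2.0 and 1.0; the pooled value of 1.25 lies between them and is weighted toward the term with more degrees of freedom.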

Simulation 1

Table 1 shows the expected mean squares for a design with two repeated measures variables and one between-participants variable. Each of the two repeated measures variables, B and C, and their interaction, BC, has a different error term, namely the interaction of that source with participants. Pooling, say, the sum of squares for AB with the sum of squares for BS to provide a better test for B is one thing and may well be appropriate (see below). This is the type of pooling that is most often referred to in statistics texts. Pooling the three different error terms BS, CS, and BCS is quite another matter. It has been suggested (Winer, Brown, & Michels, 1991, pp. 541-542) that pooling these latter three terms would be appropriate if the variances for all three of those interactions were equal. The simulation results show that this is not the case.

Procedure. To assess bias in the Type I error rate, simulations were conducted using the design in Table 1 assuming that the null hypothesis of no B effects was true, and all of the other effects were zero except the BS and BCS interactions and error. The error variance was set to one and the variances of the two interaction effects were varied systematically from one to ten. There were 2 levels of A, 2 levels of B, 4 levels of C, and 15 participants in each level of A. For each of the 100 combinations of the BS and BCS interaction variances, 200 replications of the design were created, using independent random sampling of the effects from normal distributions having the indicated variances and zero means.
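A simulation of this kind can be sketched as follows. This is not the program used for the reported results, and no attempt is made to reproduce the tabled values. It is a minimal standard-library Python sketch under stated assumptions: all names are hypothetical, the pooled error term combines only the BS and BCS terms, and the two .05 critical values are approximate table values for F(1, 28) and F(1, 112) that are hard-coded rather than computed.

```python
import random

I, J, K, M = 2, 2, 4, 15   # levels of A, B, C; participants per level of A
F_CRIT_CORRECT = 4.20      # assumed approximate F(.05; 1, 28)
F_CRIT_POOLED = 3.93       # assumed approximate F(.05; 1, 112)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def one_replication(var_e, var_bs, var_bcs, rng):
    """Return (correct F, pooled F) for the B effect when the null is true."""
    sd_e, sd_bs, sd_bcs = var_e ** 0.5, var_bs ** 0.5, var_bcs ** 0.5
    # y[a][s][j][k]: score of participant s in group a at level j of B, k of C.
    y = [[[[0.0] * K for _ in range(J)] for _ in range(M)] for _ in range(I)]
    for a in range(I):
        for s in range(M):
            for j in range(J):
                bs = rng.gauss(0.0, sd_bs)  # BS effect, constant over C
                for k in range(K):
                    y[a][s][j][k] = (bs + rng.gauss(0.0, sd_bcs)
                                     + rng.gauss(0.0, sd_e))
    RA, RS, RJ, RK = range(I), range(M), range(J), range(K)
    grand = mean(y[a][s][j][k] for a in RA for s in RS for j in RJ for k in RK)
    b_mean = [mean(y[a][s][j][k] for a in RA for s in RS for k in RK)
              for j in RJ]
    ss_b = I * M * K * sum((b_mean[j] - grand) ** 2 for j in RJ)
    ss_bs = ss_bcs = 0.0
    for a in RA:
        g = y[a]
        a_mean = mean(g[s][j][k] for s in RS for j in RJ for k in RK)
        aj = [mean(g[s][j][k] for s in RS for k in RK) for j in RJ]
        ak = [mean(g[s][j][k] for s in RS for j in RJ) for k in RK]
        ajk = [[mean(g[s][j][k] for s in RS) for k in RK] for j in RJ]
        for s in RS:
            s_mean = mean(g[s][j][k] for j in RJ for k in RK)
            sj = [mean(g[s][j]) for j in RJ]
            sk = [mean(g[s][j][k] for j in RJ) for k in RK]
            for j in RJ:
                # B x participants interaction within this level of A
                ss_bs += K * (sj[j] - s_mean - aj[j] + a_mean) ** 2
                for k in RK:
                    # B x C x participants interaction within this level of A
                    resid = (g[s][j][k] - sj[j] - sk[k] - ajk[j][k]
                             + s_mean + aj[j] + ak[k] - a_mean)
                    ss_bcs += resid ** 2
    df_bs = I * (M - 1) * (J - 1)
    df_bcs = I * (M - 1) * (J - 1) * (K - 1)
    ms_b = ss_b / (J - 1)
    f_correct = ms_b / (ss_bs / df_bs)
    f_pooled = ms_b / ((ss_bs + ss_bcs) / (df_bs + df_bcs))
    return f_correct, f_pooled


def type_one_rates(var_e, var_bs, var_bcs, replications=200, seed=0):
    """Proportion of significant correct and pooled F-ratios under the null."""
    rng = random.Random(seed)
    hits_correct = hits_pooled = 0
    for _ in range(replications):
        f_correct, f_pooled = one_replication(var_e, var_bs, var_bcs, rng)
        hits_correct += f_correct > F_CRIT_CORRECT
        hits_pooled += f_pooled > F_CRIT_POOLED
    return hits_correct / replications, hits_pooled / replications


print(type_one_rates(var_e=1.0, var_bs=6.0, var_bcs=1.0))
```

Calling type_one_rates once for each pair of interaction variances from 1 through 10 would give a grid of the same shape as the one described, although the exact proportions depend on the seed and on the assumptions listed above.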

Adequacy of the simulation procedure was assessed by using the correct F-test and comparing the obtained relative frequency of Type I errors with the nominal significance level of .05. The results using the correct F-test indicate that the simulation runs were adequate for the present purpose. For each replication the correct F-test of MSB versus MSBS and the F-test ("pooled F-test") of MSB versus MSpooled were obtained, using a .05 significance level. The obtained average value for the 100 different proportions of significant correct F-ratios was .04945. With 200 replications, the standard error of each of these 100 proportions is $\sqrt{(.05)(.95)/200} = .0154$ and the estimated standard error of that proportion using the 100 obtained proportions was .0146.
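The theoretical standard error quoted above is the usual binomial standard error of a proportion and can be checked directly. The function name below is hypothetical.

```python
import math


def se_proportion(p, n):
    """Binomial standard error of a sample proportion with true value p."""
    return math.sqrt(p * (1.0 - p) / n)


print(round(se_proportion(0.05, 200), 4))  # 0.0154
```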

Results. Table 2 shows the results of the pooled F-test. Consider the value of .42 in the 6th row, 1st column. This shows an inflation of the Type I error rate for the test of the B effect of more than eight times the nominal rate of .05, when the test was conducted using the pooled error term. Over 60 of the 100 Type I error rates are greater than or equal to .10.

On the other hand, the value of .02 in the 2nd row, 5th column shows the opposite result which would produce a loss of power. Almost 30 of the entries are .02 or less, with 17 of these equal to zero to two decimal places.

The general pattern in Table 2 is that with the pooled F-test both positive and negative biasing of the Type I error rate can occur. For a given value of the BS interaction variance, the greater the BCS interaction variance, the smaller the proportion of significant F-ratios. This means the pooled error term tends to be too large and greater loss of power occurs. For a given value of the BCS interaction variance, the greater the BS interaction variance, the greater the proportion of significant F-ratios, and the consequent greater risk of a Type I error.

The reason for the pattern in Table 2 can be seen by comparing the expected value of the numerator of the pooled F with the expected value of the denominator. Using I = 2, J = 2, and K = 4 in Table 1, we can see that the expected value of the numerator in the test of the B effect is $\sigma^2_e + 4\sigma^2_{BS}$ when the null hypothesis is true, and the expected value of the pooled error term is

$$\frac{[I(J-1)(M-1)][\sigma^2_e + K\sigma^2_{BS}] + [I(J-1)(K-1)(M-1)][\sigma^2_e + \sigma^2_{BCS}]}{I(J-1)(M-1) + I(J-1)(K-1)(M-1)}$$

$$= \frac{2[\sigma^2_e + 4\sigma^2_{BS}] + [2 \cdot 3][\sigma^2_e + \sigma^2_{BCS}]}{2 + 6} = \sigma^2_e + \sigma^2_{BS} + .75\sigma^2_{BCS}.$$

When $\sigma^2_{BS}$ is large relative to $\sigma^2_{BCS}$, the denominator will be small relative to the numerator and result in excessive Type I errors. When $\sigma^2_{BCS}$ is sufficiently large relative to $\sigma^2_{BS}$, the denominator will be large relative to the numerator and result in loss of power. When $\sigma^2_{BCS}$ is equal to $\sigma^2_{BS}$, the denominator will still be small relative to the numerator and still result in excessive Type I errors.
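The three cases can be checked numerically from the expected mean squares. The sketch below assumes the design sizes used in Simulation 1 (I = 2, J = 2, K = 4, M = 15) and pools the BS and BCS terms as in the expression above; the function names and the particular variance values are illustrative only.

```python
def expected_ms_b(var_e, var_bs, K=4):
    """Expected mean square for B when the null hypothesis is true."""
    return var_e + K * var_bs


def expected_ms_pooled(var_e, var_bs, var_bcs, I=2, J=2, K=4, M=15):
    """Expected value of the pooled BS + BCS error term."""
    df_bs = I * (J - 1) * (M - 1)
    df_bcs = I * (J - 1) * (K - 1) * (M - 1)
    weighted = (df_bs * (var_e + K * var_bs)
                + df_bcs * (var_e + var_bcs))
    return weighted / (df_bs + df_bcs)


# (BS variance, BCS variance) with the error variance fixed at 1
for var_bs, var_bcs in [(6.0, 1.0), (1.0, 10.0), (1.0, 1.0)]:
    numerator = expected_ms_b(1.0, var_bs)
    denominator = expected_ms_pooled(1.0, var_bs, var_bcs)
    print(var_bs, var_bcs, numerator, denominator)
# 6.0 1.0 25.0 7.75
# 1.0 10.0 5.0 9.5
# 1.0 1.0 5.0 2.75
```

The factor M - 1 appears in every degrees-of-freedom term, so it cancels and the result does not depend on the number of participants. In the first and third rows the expected numerator exceeds the expected denominator, which corresponds to too many Type I errors; in the second row the relation is reversed.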

The present results show that mere equality of the underlying interaction variances will not avoid the dangers of error rate inflation resulting from pooling. The main diagonal of Table 2 confirms that even when the BS and BCS interaction variances are equal, there is a positive bias. Also, this positive bias is not simply due to the zero value of the CS interaction variance. Setting the BS, BCS, and CS interaction variances all equal to the same value still results in positive bias of the Type I error rate for the pooled F-test. Setting them all equal to 3 gave a Type I error rate estimate of .13; setting them all equal to 6 gave an estimate of .16; and setting them all equal to 9 gave an estimate of .13.

Simulation 2

It has been suggested that preliminary tests of one source against another may be an adequate precaution so far as pooling error terms is concerned (Kirk, 1982). More recently, Winer, Brown, and Michels (1991, p. 541) prescribe using Bartlett's test to test the homogeneity of the error terms and pooling the error terms if the Bartlett's test is nonsignificant, using a .20 level of significance. The purpose of the second simulation was to determine if that Bartlett's test does indeed provide protection against inflated Type I error rates.
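For readers unfamiliar with the preliminary test, the following sketch computes Bartlett's statistic in its usual textbook form from a set of mean squares and their degrees of freedom. How the statistic was computed in the simulation is not stated here, so this is an assumption, as are the function names and the mean squares in the usage lines. The closed-form p-value applies only when three error terms are compared, so that the chi-square has 2 degrees of freedom.

```python
import math


def bartlett_statistic(terms):
    """Bartlett's chi-square for (mean_square, df) pairs; returns (stat, df)."""
    k = len(terms)
    total_df = sum(df for _, df in terms)
    pooled = sum(ms * df for ms, df in terms) / total_df
    numerator = (total_df * math.log(pooled)
                 - sum(df * math.log(ms) for ms, df in terms))
    correction = 1.0 + (sum(1.0 / df for _, df in terms)
                        - 1.0 / total_df) / (3.0 * (k - 1))
    return numerator / correction, k - 1


def p_value_two_df(chi_square):
    """Upper-tail probability of a chi-square with exactly 2 df."""
    return math.exp(-chi_square / 2.0)


# Hypothetical mean squares for three error terms with 28, 84, and 84 df.
stat, df = bartlett_statistic([(2.4, 28), (1.9, 84), (2.1, 84)])
pool = p_value_two_df(stat) > 0.20  # pool only if nonsignificant at .20
print(df, pool)
```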

Method. The method was basically the same as in Simulation 1 except for minor changes. In order to obtain a sufficiently large number of occurrences of nonsignificant Bartlett's tests, 400 replications were used instead of 200 and the error variance was permitted to range from 4 through 10. The variance of both the BS and the BCS effects ranged from 1 through 10.

In order to estimate the conditional probability of a significant pooled-error-term F-ratio (for the test of the B effect) given a nonsignificant Bartlett's test using the .20 significance level, the frequency of the joint occurrence of a significant pooled-error-term F-ratio and a nonsignificant Bartlett's test was divided by the frequency of occurrence of a nonsignificant Bartlett's test. This was done at each combination of values of error variance, the variance of the BS interaction effects and the variance of the BCS interaction effects.

Results. The Bartlett's test does not protect against inflation of the Type I error rate resulting from pooling of the error terms. If it did, the entries in Table 3 would hover around .05; in fact, many are larger and none smaller. Clearly there is a positive bias. This bias is unacceptably large in many cases. Forty-nine of the 100 entries in Table 3 are greater than or equal to .08. Each of the proportions of Type I errors conditional upon the occurrence of a nonsignificant Bartlett's test is based on a denominator of at least 124.

The results in Table 3 are for the error variance equal to 10. The results for the other values of error variance selected were comparable and sometimes had Type I error rates much larger. The values for error variance equal to 10 had the largest frequency of nonsignificant Bartlett's tests so the tabled results were considered the most reliable.

Because estimating the conditional probability of rejecting the null hypothesis given a nonsignificant Bartlett's test involves the use of a denominator that is itself a random variable, the risk of a positive bias was considered. Simulations were performed in which the probability of event A (rejection of the null hypothesis) given event B (nonsignificant Bartlett's test) was set at either .20 or .05, and event B occurred at random with probability 315/400 or 120/400. These two values were used because in Simulation 2 the number of nonsignificant Bartlett's tests was always between 120 and 315. The number of replications in which event B could occur was varied parametrically. With given values of the probabilities of A given B and of B, for each fixed number of replications, the estimation procedure was simulated 5000 times and the average value of the 5000 estimates of the probability of A given B was compared to the known value of that probability. It was found that with 400 replications, the number used in Simulation 2, the long run value of the estimate of the probability of a Type I error (A) given a nonsignificant Bartlett's test (B) was correct to three decimal places, indicating that the possibility of bias due to a variable denominator can be ignored in considering the results of Simulation 2.
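A check of this kind is easy to sketch. The code below is an illustration under stated assumptions rather than the original program: the function names are hypothetical, events are generated with Python's standard random module, and a run in which event B never occurs is simply skipped.

```python
import random


def estimate_conditional(p_a_given_b, p_b, replications, rng):
    """One estimate of P(A | B): joint frequency over frequency of B."""
    n_b = n_ab = 0
    for _ in range(replications):
        if rng.random() < p_b:              # event B occurs
            n_b += 1
            if rng.random() < p_a_given_b:  # event A occurs, given B
                n_ab += 1
    return n_ab / n_b if n_b else None


def mean_estimate(p_a_given_b, p_b, replications=400, runs=5000, seed=0):
    """Average the estimate over many simulated runs."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(runs):
        value = estimate_conditional(p_a_given_b, p_b, replications, rng)
        if value is not None:
            estimates.append(value)
    return sum(estimates) / len(estimates)


for p_a_given_b in (0.20, 0.05):
    for p_b in (315 / 400, 120 / 400):
        print(p_a_given_b, p_b, mean_estimate(p_a_given_b, p_b))
```

If the variable denominator introduced a bias, the averages would depart systematically from .20 and .05.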

Properly poolable terms

Finally, as mentioned before, there is a second kind of pooling that can occur in repeated measures designs. That is the pooling of the error term for a repeated measures source with the interaction of that repeated measures source with a between-participants source. If that interaction is zero, it is properly poolable with the error term against which it would be tested. In Table 1, an instance of this would be pooling the AB interaction sum of squares with that for the BS interaction. Using a liberal significance level, e.g., .25, the AB interaction could be tested against the BS interaction. If the F-ratio was not significant, pooling could occur. Under the worst case, where the AB interaction was not zero, there would be some decrease in power but never an increase in the Type I error rate. Caution is needed here because the test will lack power just when it is most needed.
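The decision rule for this second kind of pooling can be written as a short function. This is a minimal sketch: the function name is hypothetical, the sums of squares are invented for illustration, and the critical value is passed in by the caller because it must be looked up for the chosen significance level and degrees of freedom (the value 1.4 below is a placeholder, not a tabled value).

```python
def error_term_for_b(ss_ab, df_ab, ss_bs, df_bs, f_crit):
    """Return (mean square, df) of the error term to use in testing B.

    The AB interaction is first tested against BS. If that F-ratio is
    below f_crit, the two sums of squares are pooled; otherwise the
    unpooled BS error term is returned.
    """
    ms_ab = ss_ab / df_ab
    ms_bs = ss_bs / df_bs
    if ms_ab / ms_bs < f_crit:
        return (ss_ab + ss_bs) / (df_ab + df_bs), df_ab + df_bs
    return ms_bs, df_bs


# Hypothetical values: AB with 1 df, BS with 28 df.
print(error_term_for_b(1.0, 1, 56.0, 28, f_crit=1.4))  # pooled, 29 df
print(error_term_for_b(9.0, 1, 56.0, 28, f_crit=1.4))  # (2.0, 28), not pooled
```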

Conclusions

The present investigation shows that pooling error terms in repeated measures designs can result in extremely large positive or negative bias of an unknown amount. Testing the mean squares for these error terms against one another as a preliminary procedure does not necessarily result in protection from biasing of the Type I error rate because bias can occur even when the error terms are homogeneous. Because proper pooling of the error terms would require that the interactions of the factors of interest with participants each be zero, it is hard to imagine any real situation in which such pooling would be justified. The results indicate investigators should avoid pooling error terms for repeated measures.

References

Baillargeon, R. (1987). Object permanence in 3.5- and 4.5-month-old infants. Developmental Psychology, 23, 655-664.

Kirk, R. E. (1982). Experimental design (2nd ed.). Belmont, CA: Brooks/Cole.

Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles in experimental design (3rd ed.). New York: McGraw-Hill.