UBC - BIOLOGY 300

10. SINGLE FACTOR ANALYSIS OF VARIANCE (ANOVA)

Last week we looked at 2 sample t tests. We used the t test to compare 2 population means, but we often need to test for differences among several population means. Using t tests to make multiple comparisons among the same samples inflates our alpha value and produces incorrect, misleading results. Instead we can carry out an ANalysis Of VAriance (ANOVA). ANOVA compares different estimates, or components, of the total variance for all of the data points. Each of these variance components has a specific cause or source of variation. The ANOVA test compares the component of the total variation that is due to differences in sample means (the among or treatment component) to the average variance within samples (the within, error, or pooled component). If the variance due to differences in sample means is much bigger than the variance due to random variation in sample values, we will reject Ho (all means are the same) and conclude that there must be a significant difference among the means.

As a simple example, assume that we have 3 samples with the values below:

Sample A - 3, 6, 8, 12, 9, 5

Sample B - 48, 51, 43, 47, 50, 52

Sample C - 2387, 2391, 2388, 2385, 2390, 2386

Do these samples have different means? Almost anyone will be able to say that they are different, simply by looking at them. The reason that most people will suggest a difference is that each sample is a very tight grouping (the variance within samples is small), while the differences in magnitude among the samples are very large (C is much bigger than B, which is bigger than A). We have intuitively carried out an ANOVA: we have compared the variation among the samples to the variation within the samples, and concluded that there is much more variation among the samples, so the means must be different. ANOVA is just a mathematical approach to this basic idea.

Mathematically, we carry out an ANOVA by partitioning the total variance of our data set into components. The total variance (often called the combined sample variance) can be calculated by treating the data points from all of our samples as if they came from a single combined sample, and then calculating the variance of that sample using the typical equations for variance.

This total variance is made up of two components or contributors. One component of the total variance is due to random variation within samples. This component, the average variation within samples, is often called the error variance, since it is due to uncontrollable differences in data points resulting from random sampling. The second component of the total variation is the spread among the samples. This is the component caused by differences in means. In experiments this set of differences between the means should be due to the different treatments to which we subject our samples or groups, so this component is often called the treatment variance (or the group variance).
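In symbols (a sketch of the standard one-way ANOVA bookkeeping, for k samples with $n_i$ data points in sample i and N data points in total, where $\bar{x}_i$ is the mean of sample i and $\bar{x}$ is the grand mean):

$$SS_{\text{total}} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}\right)^2 = \underbrace{\sum_{i=1}^{k} n_i\left(\bar{x}_i-\bar{x}\right)^2}_{SS_{\text{among}}} + \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_i\right)^2}_{SS_{\text{within}}}$$

Dividing each sum of squares by its degrees of freedom (k - 1 among samples, N - k within samples) gives the mean squares that are compared in the next paragraph.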

We can compare these variances with an F test. If there is no difference among the means for our treatments, then the treatment variance, which includes the differences among means, should on average equal the error variance, which is unaffected by any differences in means. If the treatment means are different, then the treatment variance should be bigger than the error variance. Due to the inaccuracies of sampling, it is unlikely that the treatment variance will ever be exactly equal to the error variance, even when there is no difference among means. The ratio of variances (often called mean squares in ANOVA) is therefore compared to the F distribution to see if the treatment variance is not only bigger but significantly bigger.
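To make this concrete, here is a minimal sketch of the calculation in Python with NumPy and SciPy (not the lab software; JMP does all of this for you), using the three samples from the example above:

    import numpy as np
    from scipy import stats

    a = np.array([3, 6, 8, 12, 9, 5])
    b = np.array([48, 51, 43, 47, 50, 52])
    c = np.array([2387, 2391, 2388, 2385, 2390, 2386])

    groups = [a, b, c]
    grand = np.concatenate(groups).mean()      # grand mean of all data points
    k = len(groups)                            # number of treatments
    N = sum(len(g) for g in groups)            # total number of data points

    ss_among = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    ms_among = ss_among / (k - 1)              # treatment mean square
    ms_within = ss_within / (N - k)            # error (pooled) mean square

    F = ms_among / ms_within
    p = stats.f.sf(F, k - 1, N - k)            # upper-tail (one-tailed) probability
    print(F, p)

    print(stats.f_oneway(a, b, c))             # the same F and p in a single call

As expected from the intuitive argument, the treatment mean square for these samples dwarfs the error mean square, so F is enormous and Ho is rejected.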

Since we are always hypothesizing that treatment variance is bigger than error variance, our F test is a one-tailed test of these variances. Our hypotheses about the means, though, are always two-tailed, since our test can only detect a difference in means, not whether means are bigger or smaller than each other. It is also very important to realize that rejection of the null hypothesis in an ANOVA means only that there is some difference among sample means. It is not able to indicate which means are different. This requires a different statistical test, such as the Tukey test.

ANOVA, just like the 2 sample t test, assumes that we have randomly selected data values from normally distributed populations and that the population variances, represented by our samples, are similar. This final assumption is necessary since our error variance is the pooled, weighted average of the sample variances. The pooled variance is only an accurate representation of the population variance if the samples have similar variances. These assumptions should always be tested when you have the facilities to carry out such a test. ANOVA is a robust test, meaning it can tolerate a fair amount of deviation from these assumptions, but it is less robust when sample sizes are different.

There are two main types of single factor ANOVA: the Fixed Effects Model (Model I) and the Random Effects Model (Model II). In a Model I ANOVA the treatments are specifically chosen, they are repeatable, and we care about specific differences between treatments. Any inferences made in a fixed effects ANOVA apply only to those populations represented by the sample data. In a Model II ANOVA the treatments are randomly selected representatives of all possible treatments (so our specific experiment is not repeatable and we won't care about differences between specific treatments) and inferences are made about the entire population of possible treatments from which the samples are drawn.

Multiple Comparisons

When we reject the null hypothesis that all means are equal, and when we have a fixed effects model, we may wish to know which specific treatments are different. ANOVA is unable to answer this question. We can't carry out t tests for all comparisons or our alpha levels will be strongly inflated. Instead, we can use an a posteriori test (one carried out after getting a specific result from another test) such as the Tukey test.

The Tukey test avoids the inflation of our alpha levels by using conservative estimates of the standard error (it uses the error mean square from our ANOVA instead of the standard error of the mean difference appropriate to each comparison) and by using a critical value called q which gets bigger as the number of comparisons increases. These two differences from a typical 2-sample t test result in a very conservative test that is quite weak, but which does not inflate our presumed alpha.
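Outside JMP, the same kind of output can be sketched with SciPy's tukey_hsd function (available in SciPy 1.8 and later; the sample values below are invented for illustration):

    from scipy import stats

    # Hypothetical samples standing in for three treatments
    t1 = [12.1, 13.4, 11.9, 12.8, 13.0]
    t2 = [14.2, 15.1, 14.8, 13.9, 15.3]
    t3 = [12.5, 12.9, 13.3, 12.2, 13.1]

    res = stats.tukey_hsd(t1, t2, t3)
    print(res)           # pairwise mean differences with adjusted p-values
    print(res.pvalue)    # matrix of p-values; entries below alpha flag significant pairs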

We don't usually do Tukey tests for random effects ANOVAs, for the simple reason that in this model our treatments are randomly chosen and unrepeatable, and we are not interested in differences between specific treatments.

Transformations

If the assumptions of an ANOVA are not met with the raw data (normality at the population level, roughly equal variances for each sample and random sampling), it may be possible to improve normality and variance equality if the data are transformed. Many transformations are possible, but for this course we will deal primarily with 3 types of transformations: square root, arcsin square root and log. All transforms are designed so that they don't change the relationships between our samples and data points but just change the scale on which we measure our variables.

Square root - Useful for count-type data, as it converts a Poisson distribution to an approximately normal distribution. It is usually better to add a small constant such as 0.5 to each value in the data set, as theory and practice show that transformations of data sets with values very close to zero will not be as effective.

Arcsin square root - Useful for proportion-type data (or percentages, if you divide all of your values by 100 before transforming them). The arcsin (or inverse sine) square root transform converts a binomial distribution into an approximately normal distribution.

Log - Useful for morphometric data (body characteristics such as wing length, weight, etc., where there is often a power relationship). Also useful for data sets with outliers, for improving unequal variances, or for improving the linearity of a relationship between two variables. As the log of zero is undefined, it is best to add a constant to each value in the data set to avoid this problem.
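As a rough sketch of what these three transforms look like outside the JMP calculator (NumPy for illustration; all the data values are invented):

    import numpy as np

    counts  = np.array([0, 3, 7, 12, 5])             # count data (hypothetical)
    lengths = np.array([14.2, 18.9, 25.7, 33.1])     # morphometric data (hypothetical)
    props   = np.array([0.10, 0.45, 0.80, 0.95])     # proportions, already between 0 and 1

    sqrt_t   = np.sqrt(counts + 0.5)       # square root, with 0.5 added for values near zero
    log_t    = np.log(lengths + 0.5)       # natural log; np.log10 works equally well
    arcsin_t = np.arcsin(np.sqrt(props))   # arcsin square root for proportions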

Using The Program

ANOVA Testing - To carry out an ANOVA, our data file must be formatted with all of the data points in a single column of continuous data, while a second column of categorical information tells us from which sample or treatment each data point was taken. We can Fit Y by X for these columns to carry out an ANOVA. Use the categorical list (the nominal variable showing from which sample the values come) as your X variable, and the data variable as your Y. This will force you into the one-way ANOVA options and will produce a display showing the samples side by side. Just like last lab, click on the red arrow above this display to open a sub-menu that includes the option for comparing means via an ANOVA. Choose the Means, ANOVA option to display the results.
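The same layout works outside JMP as well; a minimal pandas/SciPy sketch (with invented values) shows how the categorical column splits the continuous column into samples:

    import pandas as pd
    from scipy import stats

    # One continuous column plus one categorical column, as in the JMP data file
    df = pd.DataFrame({
        "trap":  ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
        "count": [7, 9, 5, 8, 15, 13, 17, 14, 6, 8, 7, 9],
    })

    samples = [grp["count"].to_numpy() for _, grp in df.groupby("trap")]
    print(stats.f_oneway(*samples))    # one-way ANOVA across the trap types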

Testing Assumptions - To test for normality in the populations that our samples represent, you should analyse the distribution of Y, using the appropriate subset of your data (just as we did last week: choose Analyze -> Distribution, entering the nominal variable from your data file as the By variable). Checking the histograms, boxplots and descriptive statistics is useful, although you should also carry out a goodness of fit test to see if the distribution is normal.

To test if the variances for each treatment are similar, we choose Unequal Variances from the list of analysis types. The computer will produce a table which includes O'Brien's, Brown-Forsythe, Levene's and Bartlett's tests, which are all powerful procedures for testing the equality of variances. If these tests indicate a problem, we can carry out a Welch's ANOVA, which doesn't use a pooled variance but uses a highly conservative estimate of the degrees of freedom.
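SciPy offers equivalents for several of these checks, for anyone working outside JMP (the samples below are invented; O'Brien's test has no direct SciPy equivalent, and Welch's ANOVA would have to come from another package such as statsmodels):

    from scipy import stats

    t1 = [12.1, 13.4, 11.9, 12.8, 13.0]    # hypothetical samples
    t2 = [14.2, 15.1, 14.8, 13.9, 15.3]
    t3 = [12.5, 12.9, 13.3, 12.2, 13.1]

    # Normality within each sample (Shapiro-Wilk goodness of fit)
    for sample in (t1, t2, t3):
        print(stats.shapiro(sample))

    # Equality of variances
    print(stats.bartlett(t1, t2, t3))                   # Bartlett's test
    print(stats.levene(t1, t2, t3, center="mean"))      # Levene's test
    print(stats.levene(t1, t2, t3, center="median"))    # Brown-Forsythe variant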

Non-Parametric Tests - The computer allows us to use non-parametric tests as well, by clicking on Nonparametric in the analysis sub-menu. The Kruskal-Wallis test, van der Waerden's test, and the Median test are the available choices when there are more than 2 samples. Like last lab, the Kruskal-Wallis test is the most powerful non-parametric option for most data sets.
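The Kruskal-Wallis option has a direct SciPy equivalent as well (same invented samples as above):

    from scipy import stats

    t1 = [12.1, 13.4, 11.9, 12.8, 13.0]    # hypothetical samples
    t2 = [14.2, 15.1, 14.8, 13.9, 15.3]
    t3 = [12.5, 12.9, 13.3, 12.2, 13.1]

    print(stats.kruskal(t1, t2, t3))    # H statistic and its p-value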

Multiple Comparisons - We can carry out a Tukey test by choosing Compare Means -> All Pairs, Tukey HSD from the list of analysis types. This will produce a diagram where each sample is represented by a circle. If the circles don't overlap, or barely overlap, the treatments are significantly different. If the circles overlap strongly or are nested, the treatments are not significantly different. Judging the amount of overlap is subjective, however, which is exactly what we are trying to avoid by doing statistical analysis. Luckily, the computer also produces a report of the results. In this report the computer shows us the differences in the means, the critical value of q used for comparison to the calculated q values, and a table comparing each pair of mean differences to the least significant difference for that pair. In this final table, positive values show that the means for the pairs of treatments are significantly different.

If you are using recent versions of JMP such as the version in the lab, the Tukey results are a bit easier to interpret. The approaches mentioned above still work, but you can also use the circle report interactively to determine significance. Clicking on a circle will cause it to turn red, and treatments which are NOT significantly different from it will also turn red; significantly different treatments will stay black. Another interpretive approach has also been added, which is very similar to the underlining approach we use in class. This is called a connecting letters report: JMP places letters beside each treatment, and treatments that share the same letter are NOT significantly different.

Transformations - If we do not meet the assumptions of ANOVA but would still like to use ANOVA because it has greater power than Welch's ANOVA or the non-parametric tests, we can try transforming our data to see if a different scale of measurement makes our samples more normal or makes our variances more similar.

To transform data, we need to create a new column. Choose column info and label this column to identify it as transformed data. Format the column so that it is based on a formula, then in the calculator window set up the equation for the appropriate transform for your data type.

For count data, the formula should be:

√(raw + 0.5),

where raw is the column of data that you wish to transform.

For morphometric (body measurement) data, the formula should be:

ln (raw + 0.5) or log (raw + 0.5)

Log (base 10 logarithms) and ln (natural or base e logarithms) are found under transcendental items in the functions list of the calculator window.

For proportion data (numbers between 0 and 1), the formula should be:

arcsin (√raw)

Arcsin is found in the list of trigonometric functions as inverse sine. If your data are percentages, you will have to first divide the values by 100 to convert them to proportions, as arcsin can only be calculated for values between 0 and 1.

Problems

1. A researcher was interested in the effectiveness of different insect traps. Three types of traps were tested, with thirty traps of each type. The number of insects in each trap was counted after 24 hours. The data are stored in a file named INSECT.

a) What are the assumptions of ANOVA? Examine the data and judge whether or not these assumptions might be met.

 

 

 

 

b) How can you decide whether or not to transform the data? If a transformation was required for this data set, which transform would be appropriate?

 

 

 

 

c) Are the same number of insects captured in all trap types tested? Show all steps taken in testing the null hypothesis.

 

 

 

 

 

 

 

 

 

 

 

d) If there is a difference among trap types, summarize the similarities and differences (carry out a Tukey test).

 

 

 

 

 

 

e) Why is it invalid to test multisample hypotheses by applying two-sample tests to all possible pairs of samples? Compensating for this problem is the reason Tukey tests sometimes produce overlapping similarities (Type II errors).

 

 

 

f) Due to this problem, an ANOVA may result in rejection of the null hypothesis while the subsequent Tukey test fails to detect any differences among means. How can experiments be designed to avoid this problem?

 

 

 

g) What are the assumptions of the Tukey test and why is it considered an "a posteriori" test?

 

 

 

h) Is this example a Model I (Fixed effects) or Model II (Random effects) ANOVA?

 

 

 

 

2. The proportion moisture content was determined in ten samples from each of four soil types. The data are stored in a file named SOIL.

a) Visually inspect the data, as usual, and decide which test might be most appropriate for making comparisons among soil types.

 

 

 

 

 

 

b) Is this example a Model I or Model II ANOVA?

 

 

 

c) Is there a difference in the average moisture content of the four soils tested? Show all steps taken in testing the null hypothesis. If there is a difference, summarize the similarities and differences.

 

 

 

 

 

 

 

 

 

 

 

 

d) Suppose the assumptions of ANOVA can't be met for these data. Which transformation would be appropriate? Transform the data using this method and carry out a new ANOVA. Do you obtain similar F values, probabilities and results? Why or why not?

 

 

 

 

 

 

 

 

 

 

 

 

e) Suppose the assumptions of ANOVA can't be met even after data transformation. What test would you use if the data are not normal? Try it (show all steps taken in testing the null hypothesis). Do you reach the same conclusion as before? If not, which conclusion would you agree with? Why?

 

 

 

 

 

 

 

 

 

 

 

3. Medical researchers wanted to compare the length of stay in hospital of patients following a certain minor surgical procedure performed by all physicians. The length of stay (days) of eight patients was determined from the records of each of three randomly chosen physicians. The data are stored in a file named STAY.

a) Visually inspect the data, as usual, and decide which test might be most appropriate for making comparisons among physicians.

 

 

 

 

b) Is there a difference in the average length of stay of patients among physicians? Show all steps taken in testing the null hypothesis.

 

 

 

 

 

 

 

 

 

 

 

 

c) Is this example a Model I or Model II design?

 

 

 

 

d) Why would a Tukey test be inappropriate for these data? Why would it be inappropriate even if the data were normally distributed?

 

 

 

 
