UBC - BIOLOGY 300

8. TWO-SAMPLE INFERENCE FOR PAIRED NUMERIC DATA

We can ask questions about two samples almost as easily as for a single sample. We can analyse samples that are either paired (dependent) or unrelated (independent). The approach for the two situations is quite different, although both types of tests rely on the t distribution for parametric data. This week we will look at paired data and formally introduce non-parametric testing; we will return to testing two independent samples in next week's lab.

Data that are on interval or ratio measurement scales are usually examined through parametric approaches. Parametric testing and estimation (i.e. hypothesis tests and confidence intervals) such as we will do this week, use probabilities derived from the t distribution. They are only valid if our data are randomly sampled from normally distributed populations. These two assumptions must be met and, whenever possible, they should be tested before any parametric analyses are carried out.
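The lab itself uses JMP, but the normality check described above can be sketched in Python as well. The sample below is hypothetical, and the Shapiro-Wilk test is one common choice for testing the normality assumption:

```python
# Sketch: checking the normality assumption before a parametric test,
# using scipy's Shapiro-Wilk test on a hypothetical sample.
from scipy import stats

sample = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7, 5.4, 5.0]

w, p = stats.shapiro(sample)
# A small p value (e.g. < 0.05) is evidence against normality;
# a large p value means we cannot reject the normality assumption.
print(f"W = {w:.3f}, p = {p:.3f}")
```

Note that failing to reject normality is not proof of normality, especially at small sample sizes; graphical checks (histograms, quantile plots) remain important.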

Data that are ordinal can be analysed using tests based on a binomial approach to ranked data. These non-parametric tests make no assumptions about the distribution that our samples come from and are often known as distribution-free tests. Even for interval or ratio data, we are occasionally forced to throw away some of our information and revert to ordinal testing when we cannot meet the assumptions of parametric testing.

Parametric Hypothesis Tests

Paired comparisons are appropriate when observations are sampled in pairs. If data are measured on a ratio or interval scale, the paired sample t test is often used to make these comparisons. The paired sample t test is an extension of the one-sample t test, but involves a comparison of the mean difference between pairs of observations to a hypothesised difference.

Each pair may be composed of the same individual measured twice or of two individuals with common characteristics. For example, the same individuals may be measured before and after receiving some treatment; the blood pressure of ten students might be measured before they begin their midterm exam and then measured again at the end of the exam. Alternatively, each unit may be equally divided in such a way that half is treated one way and the other is treated a second way; to test the strength of two antigens, we might inject one into each arm of a single individual. Pairs may also be formed on the basis of some common characteristic: twins or siblings may be randomly assigned to two different treatments, or individuals may be matched on the basis of body size or age and treated differently.

In general, a characteristic that makes the two sets of data dependent on one another allows us to carry out paired sample testing. One way to look for pairing in data that you have not personally generated is to ask if the first data point in sample A must be compared to the first data point in sample B. If there is no reason to tie those specific points together (i.e. it makes just as much sense to compare point 1 in sample A to point 6 or point 23 in sample B), then the data are probably not paired. Only with this direct link between pairs of data points can we carry out paired sample testing.

But what is the purpose of paired sampling? The objective of this experimental design is to eliminate a maximum number of sources of extraneous variation, by making pairs similar for as many characteristics as possible. The removal of this external variation, the noise in the experimental design, makes paired testing one of the most powerful forms of statistical analysis.

If we hypothesise no difference between pairs of observations, then the null hypothesis is Ho: μd = 0 (the mean of the distribution of differences is hypothesised to be 0). We can also test against a specific difference (e.g. Ho: μd = 6.3) or carry out one-tailed tests (e.g. Ho: μd ≤ 5). In all cases, the test statistic for the paired-sample t test is:

t = (d̄ − μd) / sd̄

In this equation, d̄ is the mean of the differences between paired observations, μd is the hypothesised mean difference between paired observations (usually 0), and sd̄ is the standard error of d̄. The paired-sample t statistic has n − 1 degrees of freedom, where n is the number of pairs of observations. Therefore, instead of performing the analysis with the original measurements, we use the differences between individual pairs of observations as the single variable of interest. Consequently, the paired-sample t test is essentially a one-sample t test. Since we have reduced the problem to an analysis of a single set of measurements, only one variance is involved. We need only assume that the population of differences between individual pairs of observations is normally distributed and that our data were randomly sampled from this distribution.
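The equivalence between the paired-sample t test and a one-sample t test on the differences can be verified directly. This sketch uses hypothetical before/after data and scipy; the two computations give identical statistics:

```python
# Sketch: the paired-sample t test reduces to a one-sample t test
# on the differences. Data below are hypothetical before/after pairs.
from scipy import stats

before = [120, 115, 130, 140, 125, 118, 132, 128, 135, 122]
after  = [128, 117, 137, 145, 124, 126, 138, 130, 142, 125]

diffs = [a - b for a, b in zip(after, before)]

# One-sample t test of the differences against mu_d = 0 ...
t1, p1 = stats.ttest_1samp(diffs, popmean=0)
# ... gives exactly the same result as scipy's paired test,
# with n - 1 = 9 degrees of freedom in both cases.
t2, p2 = stats.ttest_rel(after, before)
print(t1, p1)
```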

Non-parametric Hypothesis Tests

If the population of differences is not normal and the central limit theorem cannot be used due to small sample sizes, there are several non-parametric alternatives. The sign test is a simple but fairly weak non-parametric test. To conduct this test, we merely calculate how many values in the difference sample are negative and how many are positive. If there is no difference between our paired samples, there should be equal numbers of positive and negative values. We can calculate the probability of getting an outcome like our difference sample by using a binomial test with p = 0.5. This test is useful as an introduction to non-parametric testing but is seldom used in the real world because of its poor ability to detect a difference between samples.
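The sign test described above amounts to counting signs and applying an exact binomial test. A minimal sketch with a hypothetical difference sample:

```python
# Sketch: a sign test on a hypothetical difference sample, using the
# exact binomial distribution with p = 0.5.
from scipy import stats

diffs = [2.1, -0.4, 1.3, 0.8, -1.1, 0.6, 1.9, 0.2, 1.5, -0.7, 0.9, 1.2]

n_pos = sum(d > 0 for d in diffs)
n_neg = sum(d < 0 for d in diffs)
n = n_pos + n_neg          # zero differences are dropped from the test

# Two-tailed binomial test: under H0, the signs are Binomial(n, 0.5).
result = stats.binomtest(n_pos, n, p=0.5)
print(n_pos, n_neg, result.pvalue)
```

Notice that the test uses only the signs of the differences, ignoring their magnitudes entirely, which is why it has so little power.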

The most powerful non-parametric test for this type of data is the Wilcoxon paired sample test, which has about 95% of the power of a paired t-test under optimal conditions. This test provides an index based on how many data points in the difference sample are negative or positive. It also considers how many of the negative differences are large differences and how many are small. It does this by ranking all differences from smallest to largest regardless of the sign of the difference, and then summing the ranks of all the negative differences and all the positive differences. These rank sums serve as the test statistic, which can be subjected to a binomial test to determine the probability of our type of outcome. The null hypothesis again is that there is no difference between the samples and therefore that the sum of the negative ranks should equal the sum of the positive ranks.
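In software, the ranking and summing steps of the Wilcoxon paired sample test are handled internally. A sketch with hypothetical before/after counts (scipy's `wilcoxon` reports the smaller of the two rank sums as its statistic for a two-tailed test):

```python
# Sketch: Wilcoxon paired-sample (signed-rank) test on hypothetical
# before/after data; scipy ranks the absolute differences and sums
# the ranks of the positive and negative differences internally.
from scipy import stats

before = [14, 17, 12, 20, 9, 15, 11, 18, 13, 16]
after  = [11, 16, 10, 14, 10, 12, 9, 15, 11, 13]

stat, p = stats.wilcoxon(before, after)
# 'stat' is the smaller of the positive and negative rank sums;
# a small p value suggests a systematic difference between the pairs.
print(stat, p)
```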

For large sample sizes, binomial calculations can be difficult, and both tests may be approximated by a normal distribution. In the case of the Wilcoxon test, n should be at least 100 to use a normal approximation to critical values, while n should be at least 30 before using the normal approximation for the sign test. For hand calculation these approaches are used infrequently, since large sample sizes usually allow us to invoke the central limit theorem and use parametric testing. Computers, however, will often provide the values for these normal approximations.
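The normal approximation mentioned above rests on the fact that, under the null hypothesis, the Wilcoxon rank sum T has mean n(n+1)/4 and variance n(n+1)(2n+1)/24 (ignoring tie corrections). A sketch with a hypothetical rank sum:

```python
# Sketch: the large-sample normal approximation to the Wilcoxon
# signed-rank statistic, with hypothetical values of n and T.
import math
from scipy import stats

n = 100          # number of non-zero differences (hypothetical)
T = 1950         # hypothetical sum of positive ranks

mean_T = n * (n + 1) / 4                 # E[T] under H0
var_T = n * (n + 1) * (2 * n + 1) / 24   # Var[T] under H0
z = (T - mean_T) / math.sqrt(var_T)

p = 2 * stats.norm.sf(abs(z))            # two-tailed p value
print(f"z = {z:.2f}, p = {p:.3f}")
```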

Confidence Intervals

Confidence intervals for the mean difference between two paired samples are essentially no different from those for single samples. We simply treat our created sample of differences as if it were the only sample. Thus, for paired data the 95% confidence interval for the difference between the individual sample means is just:

d̄ ± t0.05(2),ν × sd̄

where μd is the parameter for the mean difference between the samples, sd̄ is the standard error for the difference sample, ν is the degrees of freedom (n − 1), and d̄ is the mean of the difference sample. As usual, this provides us with a measure of the reliability of the mean sample difference as an estimate of μd.
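This interval can be computed by hand from the difference sample. A sketch with hypothetical differences, showing each piece of the calculation explicitly:

```python
# Sketch: 95% confidence interval for the mean paired difference,
# computed as a one-sample interval on a hypothetical difference sample.
import math
from scipy import stats

diffs = [1.2, -0.3, 0.8, 1.5, 0.4, 0.9, -0.1, 1.1, 0.6, 0.7]
n = len(diffs)

d_bar = sum(diffs) / n                                    # mean difference
s = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (n - 1))
se = s / math.sqrt(n)                                     # standard error

t_crit = stats.t.ppf(0.975, df=n - 1)                     # two-tailed, 95%
lower, upper = d_bar - t_crit * se, d_bar + t_crit * se
print(f"{d_bar:.3f} +/- {t_crit * se:.3f} -> ({lower:.3f}, {upper:.3f})")
```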

Using the Program

To use the t-distribution to test a null hypothesis based on paired observations, we can take two different approaches. The first approach is to enter our data as two separate columns and then create a third column of differences. We can analyse the distribution for the column of differences to carry out a paired t test (here the paired t test on our two related samples is conducted as a one sample t test on the sample of differences). Create the difference column with a formula that is simply column 1 - column 2 (or whatever you have named the columns for your pairs of data). The only warning for this approach is that the pairs of observations must occur on the same line.

When we analyse the distribution of our sample of differences, the boxplots will display the confidence interval for the mean difference. As was the case in last week's lab, the moments table will show us the upper and lower limits for this 95% confidence interval.

The Wilcoxon test is accessible through this same approach if we check the box in the Test Mean... sub-menu. JMP does not carry out sign tests.

The second approach to carrying out a paired t test is to enter the data in two columns (again making sure that the pairs of observations occur on the same line), then carry out an analysis using Matched Pairs. Select both of your original data columns and assign them to Y, paired response. This second approach does not provide the option of a non-parametric test.

Problems

1. Before proceeding with further research into the mechanisms regulating erythrocyte pH in toads (Bufo marinus), scientists compared two methods of measuring intracellular pH to determine whether or not the methods give the same results. Arterial blood (0.8 ml) was collected from 37 toads. Each sample was equally divided and erythrocyte pH in each aliquot was determined either by a freeze-thaw (FT) method or a method involving 14C-labelled 5,5-dimethyl-2,4-oxazolidinedione (DMO). The data are stored in a file named toads in the shared directory.

a) State the null and alternate hypotheses in a paired-sample t test of these data.

 

 

 

b) What assumptions are required to test the null hypothesis stated above? Test the assumption you can check. Which group of the data do you need to analyse to test this assumption? What are the implications for parametric testing if we violate this assumption?

 

 

 

 

 

 

 

 

 

 

 

c) Do the two methods of measuring intracellular pH give the same results? Show all steps taken in testing the null hypothesis.

 

 

 

 

 

 

 

 

 

 

d) Calculate the 95% confidence interval for the difference between means.

 

 

 

 

 

 

e) Suppose the researchers had reason to believe that the DMO method would provide higher values for intracellular pH than the FT method. What would the hypotheses have been in this case? What result would they have obtained for this statistical analysis? Is it legitimate to base your hypotheses on the results you obtain and any obvious trends in the data?

 

 

 

 

 

 

 

f) Now suppose the researchers had used 74 toads and measured intracellular pH in 37 of these by the FT method and in the remaining 37 by the DMO method. Assuming they could not match toads according to some common characteristic, could they test for a paired difference between the two methods? (You have heard or will hear about unpaired, independent approaches to analysis in lecture. The next lab deals with independent samples.) Is this independent test a better experimental design? Is there more or less uncertainty in our conclusions from a two-independent-sample test? Why?

 

 

 

g) Use the second testing procedure for a paired sample t test, whichever of the two approaches (matched pairs, or the column of differences) you did not use earlier, and compare your results to your earlier statistics. Do the two approaches provide the same values?

 

 

 

 

 

 

 

 

2. Scientists studying the effect of slash burning examined the diversity of spiders in clear-cut areas of coastal forests. The number of species of spiders was measured at 27 sites of equal size (1.4 ha). The sites were burned and species numbers at each site were measured again 4 years later. The data are stored in a file named spider in the shared directory.

a) Examine the data to determine if the assumptions of parametric testing are met.

 

 

 

 

 

 

b) In what ways do these data deviate from normality? Use assorted tools such as histograms, boxplots, quantile plots and descriptive statistics such as quantiles and moments.

 

 

 

 

 

 

 

 

c) Is there a difference in the number of species of spiders in clear-cut areas between the beginning of the experiment and 4 years later? Show all steps taken in testing the null hypothesis.

 

 

 

 

 

 

 

 

 

 

 

 

 

d) Suggest a better experimental design to determine the effects of burning on diversity of spiders in clear-cuts.

 

 

 

 

 

 

e) What is the 95% confidence interval for the mean difference in species numbers?

 

 

 

 

 

 

f) Diversity data often do not have a normal distribution. Suppose it were not possible to assume a normal distribution of differences in diversity of spiders before and after burning. What test would you use? Try it. Do you reach the same conclusion as before? If not, which conclusion would you agree with? Why?

 

 

 

 

 

 

 

g) How does the computer's p value approach to hypothesis testing compare to the table method used in hand calculations? In what ways are they similar and how do they differ? Which one is more approximate?

 

 

 

 

 

 

 
