UBC - BIOLOGY 300

11. RELATIONSHIPS BETWEEN TWO VARIABLES

NOTE: This week's exercise contains an embedded figure. Make sure you click on it and print it out along with the rest of the exercise!

Prior to this exercise, you have seen only one statistical analysis that works on two variables: the contingency test. This test has the advantage that it will work with very low-level, categorical data. Unfortunately, because it requires little information and has few assumptions, it can only provide very basic answers. The contingency test will merely show whether or not two variables are independent.

This week we will look at two more powerful analyses: simple linear regression and simple linear correlation. Both of these are parametric tests and have some fairly rigorous assumptions that must be met. If we can meet these assumptions, correlation and regression can provide us with detailed information about the relationship between two variables. Of the two, regression is used to describe a much more powerful relationship, one in which changes in an independent variable (X) cause changes in a dependent variable (Y). Regression allows us to predict Y if we know X. On the other hand, correlation is a way to describe two variables that are associated with each other, that vary together. It has no predictive ability and does not imply causation.

Warning: carrying out a regression analysis and finding a significant relationship does not necessarily mean that you have found a cause and effect relationship. Demonstrating a cause and effect relationship also requires proper experimental design, with appropriate controls to rule out other possibilities.

Simple Linear Regression

Linear regression is a useful tool for describing a causal or predictive relationship between two variables using a straight line. The line takes the form Y = a + bX, where X is the independently measured variable used to predict the dependent variable Y, a is the Y intercept (the mean value of Y when X is zero), and b is the slope of the line (the amount that Y changes per unit change in X). Note that this differs from the notation for the equation of a line that you learned in first year physics, so forget about that form for now.
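
JMP estimates these parameters for you, but it can help to see the arithmetic once outside the program. Below is a minimal sketch in Python (the data values and variable names are made up for illustration) that fits a straight line by least squares and reports a and b.

    import numpy as np
    from scipy import stats

    # Hypothetical data: x is the predictor (X), y is the response (Y).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

    fit = stats.linregress(x, y)
    a = fit.intercept   # Y intercept: the mean value of Y when X is zero
    b = fit.slope       # slope: the change in Y per unit change in X
    print(f"Y = {a:.2f} + {b:.2f} X")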

In lecture, you have learned, in order, how to estimate these parameters, how to test hypotheses about them, and finally, how to evaluate whether data satisfy the assumptions of regression and how to remedy problems. These steps are done in reverse order when you use a computer to analyse data, since calculations are carried out so quickly. The best way to proceed is to:

(1) Evaluate the assumptions and take remedial action when necessary.

(2) Test hypotheses about parameters only after the assumptions are satisfied.

(3) Concern yourself with the equation of the line and its biological significance only if there is a statistically significant relationship.

In this lab, we emphasise graphical tools that help you evaluate the assumptions that underlie regression analysis. These methods rely on visual and statistical inspection of data, and conclusions that you draw while using them will be subjective. Your goal is to try to make the data fit the assumptions as closely as possible, and then decide whether the fit is close enough to proceed with the analysis. Be prepared to try several remedies and to choose the best among them.

Assumptions of Regression

Linear regression rests on five assumptions. (1) The true relationship between the two variables is actually a straight line. (2) At any value of X, there is a distribution of possible Y values whose mean falls on the line. The variance of these Y values is assumed to be the same at all values of X. (3) The distribution of possible Y values at each X is normally distributed. (4) Observations are independent of one another. The assumption of independence can be met only by proper experimental design. (5) There are no errors in measurements of X. Again, we can only rely on experimental design for this one. This assumption is commonly violated, but fortunately such violations usually have minimal impact on the validity of the test.

The information below suggests a number of transforms that may work in particular situations. In general, though, log or ln transforms are the ones most likely to correct a problem with one of the assumptions of regression.

1. Linearity - Scatterplots of Y against X are useful for determining whether the relationship between two variables is linear. By adding the regression line to a scatterplot, you provide a straight reference line against which deviations from linearity can be compared. You might also see how well an alternative curve that is not constrained to be linear fits the same data. Spline fitting is a technique that plots the trend in Y against X without assuming that the relationship is a straight line. If the spline fit closely mimics the linear regression, then the assumption of linearity is a safe bet. If not, then something must be done to the data before regression can be used. When the computer fits a spline to a set of data, it generates a series of polynomial regressions (curved lines) using subsets of the data, and then smoothes these curves together to demonstrate the general trends in the data set. If the spline fit tracks roughly alongside a straight line, and doesn't show a systematic and regular deviation from the straight line, then it is likely that the relationship between X and Y is linear. When you carry out a spline fit, you must select a value of lambda, which specifies how straight the spline will be. A lambda between 1 and 100 will usually allow the spline fit to wander enough to determine whether the relationship appears to be linear. Be prepared to experiment with different spline values.
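
If you want to see the idea behind this check outside JMP, here is a rough sketch in Python (hypothetical data; scipy's smoothing parameter s is only loosely analogous to JMP's lambda). The point is simply to compare a free-wandering curve against the straight-line fit.

    import numpy as np
    from scipy import stats
    from scipy.interpolate import UnivariateSpline

    # Hypothetical data, already sorted by X (the spline routine requires this).
    x = np.arange(1.0, 11.0)
    y = np.array([1.2, 2.1, 2.8, 3.9, 5.2, 5.8, 7.1, 7.9, 9.2, 9.8])

    line = stats.linregress(x, y)
    straight = line.intercept + line.slope * x

    # Larger s gives a stiffer curve, much as larger lambda does in JMP.
    spline = UnivariateSpline(x, y, s=5.0)
    curved = spline(x)

    # Small, unpatterned differences suggest the relationship is close to linear;
    # a systematic bow in one direction suggests it is not.
    print(np.round(curved - straight, 2))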

Suppose that the relationship between Y and X is not linear. Transformations offer no hope if the real curve contains either a distinct peak or a distinct valley (for instance a bimodal distribution). However, they might work if the curve increases uniformly or decreases uniformly from left to right. Log and square root transformations can be used to linearise relationships. Try both transformations and use the one that works best. You also have a choice of transforming either X or Y. Your choice of variable depends upon the shape of the curve. Transform Y if the data might be linearised by pulling down extremely large values of Y. Transform X to pull down large values of X (see figure below). In some cases, the relationship between X and Y will be clearly nonlinear, but transformation pulls down the large values of a variable so far that the curve is bent in the opposite direction from the original curve. For example, the curve in the upper left panel of the figure below is bent into the curve in the upper right panel. This result often occurs when the true relationship between two variables is a power function (e.g. Y = 3X^1.4), and it can be remedied by transforming both X and Y.
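
As a rough illustration (in Python, with made-up data that follow approximately Y = 3X^1.4), the sketch below compares how straight the relationship looks under each choice of transformation. A scatterplot or spline fit is still the real diagnostic; the r2 values here are just a quick numerical summary of straightness.

    import numpy as np
    from scipy import stats

    # Hypothetical data following a power-type curve, with a little noise.
    x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
    y = 3 * x**1.4 * np.array([1.05, 0.97, 1.02, 0.95, 1.03, 0.99])

    r2_raw  = stats.linregress(x, y).rvalue ** 2                    # no transform
    r2_logx = stats.linregress(np.log(x), y).rvalue ** 2            # transform X only
    r2_logy = stats.linregress(x, np.log(y)).rvalue ** 2            # transform Y only
    r2_both = stats.linregress(np.log(x), np.log(y)).rvalue ** 2    # transform both

    # For a power function, transforming both X and Y gives the straightest line.
    print(r2_raw, r2_logx, r2_logy, r2_both)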

Click here to show 4 basic graphs and how to transform them. You should set your printer to landscape mode for this figure if you can.

2. Equality of variance - It is difficult to test this assumption statistically, because we generally only have one or two values of Y for any X, preventing reasonable variance calculations. Instead, we usually rely on graphical analysis by viewing the spread of data points around the regression line. If the variance is constant, the spread should remain roughly constant along the line. If the variance changes, the spread will change along the line, narrowing where the variance decreases and widening where it increases. A simple scatterplot of Y against X together with the fitted regression line is useful for this purpose, but it has the disadvantage that the viewer must look at the spread around a sloping line, which is not as easy as viewing the spread above and below a horizontal line. Because our goal is to observe the spread in Y relative to the predicted values of the line, the effect of the slope can be removed by first calculating the difference between Y and the predicted value Ŷ. This difference (Y - Ŷ) is known as the residual. A scatterplot of residuals against X should yield a horizontal band of data points with constant height if variance is indeed constant.
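
The residual calculation itself is simple. The sketch below (Python, made-up data) computes the residuals from a fitted line; plotted against X, they should form the horizontal band described above if the variance is constant.

    import numpy as np
    from scipy import stats

    # Hypothetical data.
    x = np.arange(1.0, 9.0)
    y = np.array([2.0, 4.2, 5.8, 8.5, 9.4, 12.8, 13.1, 16.9])

    fit = stats.linregress(x, y)
    y_hat = fit.intercept + fit.slope * x   # predicted values
    residuals = y - y_hat                   # observed minus predicted

    # A residual plot is just these values graphed against X (or against y_hat).
    for xi, ri in zip(x, residuals):
        print(f"X = {xi:4.1f}   residual = {ri:+.2f}")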

Transformations to equalize variance must be applied to Y. Both the log and square root transformations are available. Try both to see which does a better job at equalizing variance. A third transformation is available when analyzing proportions. Proportions represent a special case because variance is expected to rise as p increases from 0 to 0.5 and then fall as p increases from 0.5 to 1. The arcsine-square root transformation is appropriate under these circumstances.
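
If you are curious what these transforms look like as formulas, here is a brief sketch in Python (the values are made up; in JMP you would build the same formulas in new columns).

    import numpy as np

    y = np.array([0.2, 1.5, 3.0, 7.2, 12.5])        # hypothetical measurements
    p = np.array([0.05, 0.20, 0.50, 0.80, 0.95])    # hypothetical proportions

    log_y  = np.log(y + 0.5)         # natural log transform (offset guards against zeros)
    sqrt_y = np.sqrt(y)              # square root transform
    asin_p = np.arcsin(np.sqrt(p))   # arcsine-square root transform for proportions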

3. Normality - The assumption is that Y is normally distributed about the predicted value of Y at any value of X. This assumption is the least important of the assumptions of regression analysis, and we will not discuss it further. Note that most transformations of Y that linearise data or control variance also tend to normalize data. Use large sample sizes, and don't worry about this assumption if you've got the others down.

Testing the Significance of a Regression Relationship

Generally, regression analysis is only done with parametric data. There are some non-parametric approaches, but they are not widely used. Since we are trying to demonstrate a very rigorous relationship (causal), we need rigorous experimental designs and analyses. We need parametric data.

There are two main ways to test for the significance of a regression relationship. The simplest (at least by hand) is the F test, which just tells us if the slope of the line is significantly different from zero. A zero slope or flat line means that no matter how X changes, Y is unaffected. Therefore, a significant slope means that X is causing changes in Y. To do an F test we will carry out an ANOVA in which we act as if each value of X represents a different sample or treatment for Y. If the mean Y values for each X are the same, then the line is flat and there is no relationship. The calculations of the intermediate components of the regression ANOVA are slightly different than for a typical ANOVA, simply because we usually have only one or two values of Y for any X. The theory behind the analysis is identical, however. As usual, the ANOVA is a two-tailed test, but we calculate a one-tailed F, since we assume that the regression variance is bigger than the residual variance.
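
JMP's ANOVA table reports these quantities automatically, but the sketch below (Python, hypothetical data) shows where the F statistic comes from: the variation explained by the line divided by the residual variation, each scaled by its degrees of freedom.

    import numpy as np
    from scipy import stats

    # Hypothetical data.
    x = np.arange(1.0, 9.0)
    y = np.array([2.3, 3.1, 5.2, 5.9, 8.1, 9.0, 10.8, 12.1])

    n = len(y)
    fit = stats.linregress(x, y)
    y_hat = fit.intercept + fit.slope * x

    ss_regression = np.sum((y_hat - y.mean()) ** 2)   # explained by the line, 1 df
    ss_residual   = np.sum((y - y_hat) ** 2)          # left over, n - 2 df

    F = (ss_regression / 1) / (ss_residual / (n - 2))
    p = stats.f.sf(F, 1, n - 2)   # one-tailed probability from the F distribution
    print(F, p)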

If we want to test for a positive slope or a negative slope or a non-zero slope, we must use a different approach. A t test of the significance of b allows us to carry out one-tailed, two-tailed or non-zero tests.
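
The sketch below (Python, same hypothetical data as above) shows the form of that t test: the difference between the estimated slope b and the hypothesized slope, divided by the standard error of b, with n - 2 degrees of freedom.

    import numpy as np
    from scipy import stats

    x = np.arange(1.0, 9.0)                                   # hypothetical data
    y = np.array([2.3, 3.1, 5.2, 5.9, 8.1, 9.0, 10.8, 12.1])

    fit = stats.linregress(x, y)
    beta_0 = 0.0                             # hypothesized slope
    t = (fit.slope - beta_0) / fit.stderr    # stderr is the standard error of b
    df = len(x) - 2

    p_two_tailed = 2 * stats.t.sf(abs(t), df)
    p_one_tailed = stats.t.sf(t, df)         # for the alternative: slope > beta_0
    print(t, p_two_tailed, p_one_tailed)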

Describing the Regression Relationship

If we find that there is a significant relationship, then we usually would like to describe the relationship. The equation of the line is our best description of the relationship between X and Y. We can use this equation to predict values of Y from X, and for fixed effects regression models we have a limited ability to predict X from Y (inverse prediction). We must be careful not to extrapolate beyond the range of our data, however. Prediction from regression is only valid within the range of our data, since this is the only area where we have tested our assumptions (e.g. linearity). If we have transformed one or both of our variables we must include that transform when we report the equation of the line. If we had transformed Y to ln (Y + 0.5) to solve a variance problem we would report the equation as ln (Y + 0.5) = a + bX.
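
As a small worked sketch (Python, with made-up estimates a and b from a fit of ln(Y + 0.5) on X), prediction means evaluating the line on the transformed scale and then undoing the transform, and only for X values inside the range of the original data.

    import numpy as np

    a, b = 0.42, 0.31   # hypothetical estimates from fitting ln(Y + 0.5) = a + bX
    x0 = 5.0            # a new X value, assumed to lie within the range of the data

    ln_y_pred = a + b * x0              # prediction on the transformed scale
    y_pred = np.exp(ln_y_pred) - 0.5    # back-transform to the original Y scale
    print(y_pred)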

Another useful tool for describing the relationship is r2, the coefficient of determination. This value tells us the strength of the relationship, or what proportion of the change in Y can be predicted from changes in X. r2 varies between 0 (no predictive power) and 1 (the equation of the line explains all of the changes in Y).

Finally, we can produce confidence intervals for our regression line. These intervals give us an indication of the predictive accuracy of our line. The intervals are tightest at mean X and mean Y since our equations force the line to go through this point. As we move further away from the mean pair of values, our uncertainty about the position of the line increases and our intervals become larger. These intervals are produced by summing our uncertainty about the intercept and our uncertainty about the slope.
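
The sketch below (Python, hypothetical data) uses the usual textbook formula for the confidence interval of the mean of Y at a chosen X. The (x0 - mean X)^2 term is what makes the band widen as you move away from the mean pair of values.

    import numpy as np
    from scipy import stats

    x = np.arange(1.0, 9.0)                                   # hypothetical data
    y = np.array([2.3, 3.1, 5.2, 5.9, 8.1, 9.0, 10.8, 12.1])

    n = len(x)
    fit = stats.linregress(x, y)
    y_hat = fit.intercept + fit.slope * x

    s2 = np.sum((y - y_hat) ** 2) / (n - 2)     # residual variance
    ss_x = np.sum((x - x.mean()) ** 2)

    x0 = 2.5                                    # X at which to evaluate the band
    y0 = fit.intercept + fit.slope * x0
    se = np.sqrt(s2 * (1 / n + (x0 - x.mean()) ** 2 / ss_x))

    t_crit = stats.t.ppf(0.975, n - 2)          # for a 95% interval
    print(y0 - t_crit * se, y0 + t_crit * se)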

Simple Linear Correlation

In correlation analysis, values of X are not preselected by the investigator, but occur at random. Here we are not trying to demonstrate a causal or predictive relationship. We are merely trying to see if two variables are associated or vary together. Thus, correlation does not designate one variable as dependent and the other as independent.

An unfortunate consequence of this symmetry is that correlation has many assumptions:

1. For each value of Y there is a normally distributed population of X values.

2. For each value of X there is a normally distributed population of Y values.

3. For each value of Y there is equal variance in the population of X values.

4. For each value of X there is equal variance in the population of Y values.

5. Both Y and X values were selected randomly and independently.

6. The relationship between X and Y is linear.

It is hard to meet these assumptions, and it can be even more difficult to demonstrate that we have met them. We are usually satisfied if the trend in our data set is a symmetric, oval cloud of data points whose slope angles upwards or downwards.

In a correlation analysis, the strength of the relationship between two variables is indicated by r, the correlation coefficient. This value is mathematically equivalent to the square root of r2 in regression analysis. It should not be interpreted in the same way, however, because of the theoretical differences underlying correlation and regression. The correlation coefficient may take on values from -1 (perfect negative linear correlation) to +1 (perfect positive linear correlation). A value of 0 indicates no linear correlation. We can also produce confidence intervals on r. If r is near -1 or 1, however, the intervals will not be symmetric and a normal distribution is not adequate to describe the data.
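
The lab does not specify how those intervals are built, but a common textbook approach uses Fisher's z transformation, which respects the -1 to +1 limits and produces the asymmetric intervals mentioned above. A sketch in Python with made-up data:

    import numpy as np
    from scipy import stats

    x = np.array([3.1, 4.0, 5.2, 6.1, 7.3, 8.0, 9.4, 10.2])   # hypothetical data
    y = np.array([2.8, 3.5, 5.0, 5.9, 7.1, 7.6, 9.0, 10.5])

    r, p = stats.pearsonr(x, y)
    n = len(x)

    z = np.arctanh(r)                 # Fisher's z transformation of r
    se = 1 / np.sqrt(n - 3)
    lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)   # approximate 95% CI
    print(r, lo, hi)                  # interval is not symmetric around r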

The significance of a correlation is tested via a t test of r. As usual, the t statistic is an estimate minus its hypothesized value, divided by the standard error of that estimate. In this case, t equals the difference between our calculated r and any hypothetical value of rho (the parameter for the relationship between X and Y), divided by the standard error of r.
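
For the usual null hypothesis that rho = 0, the standard error of r works out to sqrt((1 - r2) / (n - 2)), so the test can be written as below (Python sketch; r and n are made-up values).

    import numpy as np
    from scipy import stats

    r, n = 0.72, 15                          # hypothetical r and sample size

    se_r = np.sqrt((1 - r**2) / (n - 2))     # standard error of r when rho = 0
    t = (r - 0) / se_r
    p_two_tailed = 2 * stats.t.sf(abs(t), n - 2)
    print(t, p_two_tailed)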

Non-parametric correlation is fairly common since the assumptions of correlation are sweeping and difficult to test. Generally, we can examine the scatterplots to find evidence of a non-linear relationship or violations of the assumption of equal variances among the samples. Spearman's or Kendall's rank correlation are the most common non-parametric equivalents to correlation. They have similar power and assumptions and one of the two should be used if the assumptions of parametric correlation do not seem to be met, and sample sizes are small.
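
For reference, both rank correlations are also available outside JMP; a minimal sketch in Python with made-up data:

    import numpy as np
    from scipy import stats

    x = np.array([1.2, 2.5, 3.1, 4.8, 5.0, 6.7, 7.9, 9.3])   # hypothetical data
    y = np.array([0.9, 2.2, 2.7, 5.1, 4.6, 6.9, 8.2, 9.0])

    rho, p_rho = stats.spearmanr(x, y)    # Spearman's rank correlation
    tau, p_tau = stats.kendalltau(x, y)   # Kendall's rank correlation
    print(rho, p_rho, tau, p_tau)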

Using the Program

Beginning a regression analysis - To carry out either regression or correlation analysis we need two continuous variables. To carry out a regression, choose fit Y by X and designate the causative or predictive variable as X. Your Y variable should be the variable that is the effect, the variable that is being predicted. The computer will display a scatterplot of the data. It may be possible to see trends in the data just by examining this plot. If you click on the Red Arrow beside Bivariate Fit, you will be presented with a number of options. First, choose Fit Line. This will carry out a simple linear regression. We won't worry about the summary tables for this line just yet. For now, we want the regression line for its use in determining if this association is linear. Once we have a fitted line displayed, clicking on the Red Arrow for the particular line will allow us to plot confidence intervals or do other analyses for that line.

Checking linearity - To test if the data are linear, carry out a spline fit from the same fitting dialog box, with a lambda of about 1 or 10. Go higher or lower if these values show too little or too much wobble, and remember that we are just trying to see if the general trend of the line is straight. This curved line should track more or less alongside the linear fit. We will worry only if the spline shows a strong, systematic trend away from the straight line. In this case, your eyeball estimate is as good as it gets. The diagrams above show us how we can correct some of the more common problems in linearity.

Checking variance - To test if the variances in Y are equal, we can examine a residual plot. If the residuals form a fairly symmetric band across the graph, then our variances are roughly equal. If the scatter of points around the residual line looks like the spray from the end of a hose (variance in Y increases as X increases), we need to transform Y. In fact, any shape other than an equal band suggests we should transform Y. To call up a residual plot, again click on the Red Arrow for our linear fit and choose Plot Residuals. The current version of JMP produces several different residual plots. The one we are interested in is the residual by predicted plot.

Transformations - Generally, log (common or natural) transforms are the safest bet for data sets which violate the assumptions of regression. The best method of transforming your data is to create new columns of data by applying a transform formula to the raw data. This is the same method we used in last week's ANOVA lab. This method will produce new scales for the axes of our scatterplots and residual plots, letting us judge whether the transform worked.

Regression Statistics - When we carry out a linear fit, the program produces summary tables showing the ANOVA results, the equation of the line, the parameter estimates (a and b), the standard errors for our parameters, t tests for both the intercept and the slope, and the value for r2.

In the table of parameter estimates, our estimate of the intercept is a, and our estimate for the coefficient of X is b (listed as the estimate of X). The t statistic and p value for b provide a t test of the slope. The probability shown is a two-tailed probability, so you will have to adjust the p value if you want to carry out a one-tailed test. The appropriate p value will either be one half of the two-tailed value or 1 minus one half of the two-tailed value. You must work out which is appropriate, based on your hypotheses. For example, if the two-tailed p value is 0.08 and the estimated slope lies in the direction predicted by your alternate hypothesis, the one-tailed p value is 0.04; if the slope lies in the opposite direction, the one-tailed p value is 0.96.

Correlation - To carry out a correlation, we can fit Y by X, designating either variable as X or Y (they are interchangeable here since we are not designating a dependent and an independent variable). Click on the fitting button from the resulting scatterplot and choose density ellipses. The probability we choose for the density ellipse specifies a region expected to contain that proportion of paired values drawn from a bivariate normal population (one that is normally distributed in both X and Y). The probability that we choose will not affect the statistics for the correlation. The summary statistics appear in the bivariate table. The values displayed here include r, and the probability of obtaining r (the p value from a t test of the significance of r).

A second way to carry out a correlation is to choose Analyze, then Multivariate Methods, then Multivariate (for older versions omit the middle step as there is only one available option). Add the two variables you want to check for a correlation, leaving the estimation method at default. This will produce a report including r, the correlation coefficient. Clicking on the Red Arrow for Multivariate lets us do pairwise correlations, whose summary table includes r and a significance test of r. These values are identical to those produced by the density ellipse approach. Clicking on the Red Arrow again, we can call up the Non-Parametric Correlations that will conduct the Kendall's, Spearman's and Hoeffding's tests. Kendall's and Spearman's are usually the more powerful options.

Problems

1. The original description of regression effects was from a study of parent heights and child heights conducted by Sir Francis Galton in 1877. The data from this study are stored in the file Galton.jmp in the shared directory. In this study, parent height (X) is the predictor for child height (Y).

a) Is a straight line adequate to describe the relationship between parent height and child height?

 

 

 

 

 

b) Does variance in child height remain constant across all values of parent height?

 

 

c) Test the hypothesis that parent height can be used to predict child height.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

d) Why do the confidence bands for the regression curve out at their ends?

 

 

 

 

2. The nitrogen content of leaves (mmol/m2) is thought to be a predictor of photosynthetic rate (mmol/m2/s) and productivity of agricultural crops. Both of these variables were measured and stored in a file named NITRO.

a) Examine the data and decide if the assumptions of linearity and equal variance are met for this data set.

 

 

 

 

 

 

b) Suppose there is a problem with the assumption of equal variance. Transform the appropriate variable (by the method of adding a new column) and compare the residual plots to see if there is any improvement.

 

 

 

 

 

 

 

c) Is it possible to predict photosynthetic rate based on the nitrogen content of leaves? If so, describe the relationship algebraically.

 

 

 

 

 

 

 

 

 

 

 

 

 

d) How well does the relationship predict the photosynthetic rate?

 

 

 

 

 

3. Crows often steal food from birds of other species. To test whether potential victims are sensitive to the presence of potential robbers, an experiment was conducted in which a model crow was placed near the nests of pigeon guillemots. Observers recorded the amount of time that arriving guillemots sat and waited before delivering food to their hungry nestlings. A negative relationship between time waited and distance of the model from the nest would suggest that would-be victims respond to the risk of robbery. Data from the experiment are stored in the file ROBBER. Variable #1 is distance of the model crow from guillemot nests (m), and variable #2 is amount of time that guillemots paused (min) before returning to their nests.

a) Is a transformation required to linearize the data?

 

 

 

b) Is the assumption of equality of variances met in this example?

 

 

 

c) Test the hypothesis that pigeon guillemots are reluctant to approach their nests when a potential robber is close at hand.

 

 

 

 

 

4. Brain weight (g) and testes weights (mg) were measured in 8 male chipmunks (Tamias spp.). The data are stored in a file named CHIPMUNK.

a) Examine and comment on a scatter plot of the data.

 

 

 

b) Is there a relationship between brain weight and testes weight in these chipmunks? Show all steps taken in testing the null hypothesis.

 

 

 

 

 

 

 

 

 

 

 

 

c) Try testing the significance of the correlation using whichever method you didn't use in part b. Are the results the same?

 

 

d) Suppose a significant relationship was found between the variables measured in this study. Is it possible to predict brain weight from measurements of testes weight in these chipmunks?

 

 

 

 

e) What is the value of the correlation coefficient in this analysis. Do you feel that the correlation coefficient indicates a strong relationship between the variables measured in this study?

 

 

 

f) Suppose the bivariate normality assumption of correlation analysis is not met in this study with its small sample size. What test would you use to determine the strength of the relationship between the variables measured? Try it (show all steps taken in testing the null hypothesis). Do you reach the same conclusion as before? If not, which conclusion would you agree with?

 

 

 

 

Answers for this Lab Assignment
