Generalized linear models

In this workshop we will fit general linear models to data, implemented in the R command glm. See the "fit model" submenu of the Rtips pages for help.

A generalized linear model is useful when the response variable is not normally distributed and when a transformation is undesirable or impossible. Example situations include
  • when the response measurement is binary, such as 1 or 0, dead or alive
  • when the response variable is a count or frequency, such as number of offspring, leaves, or tattoos
  • when the response variable is a categorical variable having two or more categories. Examples might include food choice, color morph, or genotype

Natural selection in song sparrows

The song sparrow population on the island of Mandarte has been studied for many years by Jamie Smith, Peter Arcese, and other collaborators. The birds have been measured and banded and their fates on the island have been recorded. Here we will look for evidence of natural selection using the relationship between phenotypes and survival. The data file located here gives survival of young-of-the-year females over their first winter (1=survived, 0=died). The file includes measurements of beak and body dimensions: body mass (g), wing length, tarsus length, beak length, beak depth, beak width (all in mm), year of birth, and survival. These data were analyzed previously in D. Schluter and J.N.M Smith (1986, Evolution 40: 221-231).

Read and examine the data

  1. Read the data from the file.
  2. View the first few lines of data to make sure it was read correctly.
  3. Use the str command to summarize the variable types. This is a good time to change any variable types you want changed. For example, "year" is read as a numeric variable by default, but it is typically used as a grouping variable (named category) instead. Changing it to a character or factor will achieve this.
  4. Plot survival against tarsus length of female sparrows. There will be a lot of overlap of points because the response variable is discrete. Use "jitter" to reduce overlap and see the distribution more clearly. 
  5. Examine the plot. Can you visualize a trend? Use "lowess" (or "loess") to smooth the data and see if any trend is present. Notice that the predicted values for "lowess" are not constrained to lie between 0 and 1. (Optional: try fitting a cubic spline instead to examine the trend. See the "fit model" submenu help page for instruction).

Fit a generalized linear model

In the following analysis, we are going to ignore the fact that the data are from multiple years. If time permits we can add "year" to the model later to see how much difference it makes.
  1. Fit a logistic regression to the data. Add the fitted regression curve to your plot. What is the relationship between survival and tarsus length?
  2. The glm approach fits the response data on the logit scale rather than on the original (survival) scale. To get a sense of what the fit looks like on that scale, plot the predicted values from the model against tarsus length. If you've done this correctly you should see a series of dots all in a straight line. In other words, the generalized linear model is actually a linear model on the transformed scale. (Note: the "predict" command yields the predicted values on the logit scale, whereas "fitted" yields the predicted values on the original scale).
  3. Use the summary command to extract the estimated regression coefficients. What do these coefficients mean? They are the slope and intercept of the predicted line visualized in (2). Use these coefficients to calculate the predicted value for a song sparrow having tarsus length 20.5 mm*. Then, using the inverse of the logit transformation, convert this prediction to a probability of survival on the original scale*.
  4. Return to the plot of the original survival data on tarsus length, and redraw the fitted logistic regression curve. Does predicted survival at 20.5 mm match that calculated in (3)?
  5. What is the relationship between the coefficients you extracted in step (3) and the fitted logistic regression curve? The sign of the slope coefficient tells us whether the curve is increasing or decreasing. The value of -(Intercept)/slope gives the point at which probability of survival is changing most rapidly. In toxicology this point is known as the LD50.** Calculate this value and compare it visually with the fitted curve. Finally, the slope of the curve at any value x for the explanatory variable is
         slope * p(x) * ( 1 - p(x) ),
    where p(x) is the predicted probability of survival at that x.
  6. Add the 95% confidence bands for the fitted line to the plot. Note how the curves are asymmetric on the original scale. This occurs because error variance is less when the predicted probability is close to 0 or to 1, other things being equal.
  7. Calculate the likelihood-based 95% confidence interval for the logistic regression coefficients.
  8. Examine the residual plot and other diagnostic plots for the fitted model (i.e., using plot(z), where z is the glm object). I did warn you that these would look strange and can be difficult to interpret!  However, the diagnostic plots can sometimes be helpful in identifying major outliers and points with high leverage. 
  9. Use the log-likelihood ratio test to test the null hypothesis of zero slope.
  * -1.148577; 0.2407491
**   19.58683


Crab satellites

Males of the horseshoe crab, Limulus polyphemus, have two alternative reproductive morphs. Some males attach to females with a special appendage. The females bring these males with them when they crawl onto beaches to dig a nest and lay eggs, which the male then fertilizes. Other males are satellites, which are unattached to females but crowd around nesting pairs and steal fertilizations. What attributes of a female horseshoe crab determine the number of satellite males she attracts on the beaches?

The data here provide measurements of 173 female horseshoe crabs and record the number of satellites she attracted. The data were gathered by Brockman (1996. Satellite male groups in horseshoe crabs, Limulus polyphemus. Ethology 102:1-21) and were published by Agresti (2002, Categorical data analysis, 2nd ed. Wiley). The variables are female color, spine condition, carapace width (cm), mass (kg), and number of satellite males. 

Read and examine the data

  1. Read the data from the file.
  2. View the first few lines of data to make sure it was read correctly. Use the "str" command to see the variable types. This is when any changes to variable types should be made.
  3. Plot the number of satellites against the width of the carapace, a measure of female body size. Fit a lowess curve to examine the trend. (Optional, fit a cubic spline to see the trend.)

Fit a generalized linear model

  1. Fit a generalized linear model appropriate for count data to estimate the relationship between number of satellite males and carapace width. Assume for now that the dispersion parameter =1 (more details below). Add the fitted curve to your plot. What is the relationship? 
  2. Examine the residuals and other diagnostic plots. See that one of the data points has fairly high leverage compared to the others, although its Cook's distance is not large (generally, a Cook's distance less than about 1 is not worth worrying about). The appropriate step to take at this point would be to see if the fitted curve and parameter estimates change much when this point is deleted, but let's not worry about that right now.
  3. Extract the estimated regression coefficients. What do these coefficients mean? Because we've used the log link function, the resulting curve on the original scale is an exponential function of the form
          y = c * d^x
    where c is exp(Intercept) and d is exp(slope) from the log-linear regression. In other words, we've fitted an exponential relationship to the data without having had to log-transform the data (which is good, because we wouldn't have been able to log-transform the value nsatellites = 0).
    With an exponential function, a 1-unit increase in the explanatory variable increases the predicted response by the multiple exp(slope). Calculate exp(slope) to do determine this multiple. 
  4. Calculate the likelihood-based 95% confidence interval for the log-linear regression coefficients. The most useful estimate is that for the slope, since exp(slope) represents the multiple to the response variable accompanying a 1-unit change in the explanatory variable. Take the exp of the confidence limits for slope to obtain the confidence interval for the multiple.
  5. Add the 95% confidence bands for the fitted line to the plot. Note how the difference between the upper and lower band widens at larger values of the explanatory variable. This occurs because the error variance of a count increases with increasing x, other things being equal. (This is a worry for ordinary linear model fitting but not for glm, because the increase in variance of residuals with increasing x is modeled directly).
  6. Test the relationship between number of satellite males and female carapace width using analysis of deviance. Notice how small the P-value is for the null hypothesis test for the slope. I'm afraid that this is a little optimistic. Why? Read on.
  7. The Poisson distribution assumes that the variance of the residuals for the response variable at any x equals the mean response at x (in glm jargon, this is the assumption that the dispersion parameter = 1). However, this is rarely the case in the real world because other variables affecting the response variable are not included in the fitted model, leading to a higher than expected variance in the residuals ("overdispersion", i.e., a dispersion parameter > 1). The glm method has a built-in extension of the Poisson distribution based on quasi-likelihood that better accounts for overdispersion. The procedure also yields an estimate of the dispersion parameter.
    Using this extension, fit a generalized linear model to estimate the relationship between number of satellite males and carapace width in the horseshoe crabs. Save the results in a new glm object so that you can compare your results with the previous fit. 
  8. Estimate and examine the coefficients of this new glm fit. Have they changed? Have the standard errors changed?
  9. Examine the estimated dispersion parameter of this new glm fit. Is it greater than 1? On this basis, which of the two glm fits to the same data would you regard as the more reliable?
  10. Add the new fitted curve to your plot. Are the predicted values changed? Recalculate the 95% confidence bands and add them to your plot. Are they changed?