Model selection

Selecting among candidate models requires a criterion for evaluating and comparing models, and a strategy for searching the possibilities. In this workshop we will explore some of the tools available in R for model selection. If you are working from your own computer you may need to download and install the leaps package from the CRAN website to carry out all the exercises.

For help with these exercises, see the "Fit models" link on the Rtips web page.

Scaling of basal metabolic rate in mammals

Savage et al. (2004, Functional Ecology 18: 257–282) used data to reevaluate competing claims for the value of the allometric scaling parameter β relating whole-organism metabolic rate to body mass in endotherms:
                                                       BMR = αM^β
In this formula BMR is basal metabolic rate, M is body mass, and α is a constant. On a log scale this can be written as
                                              log(BMR) = log(α) + β log(M),
where β is now a slope parameter of an ordinary linear regression. Recent theory predicts that the exponent should be β = 3/4, whereas we would expect β = 2/3 if metabolic rate scales with heat dissipation and therefore body surface area. These alternative scaling relationships represent clear and distinct biophysical hypotheses. We will use them as candidate models and apply model selection procedures to compare their fits to data.

Savage et al. compiled data from 626 species of mammals. To simplify, and reduce possible effects of non-independence of species data points, they took the average of log(BMR) among species in small intervals of log(M). The resulting values of basal metabolic rate and mass can be downloaded here. Body mass is in grams, whereas basal metabolic rate is in watts.
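The first few steps below can be sketched as follows. This is a minimal illustration in which simulated data stand in for the downloaded file; the variable names logm and logbmr are the same placeholders used in the code for step 5.

```r
# Sketch of steps 1-4, with simulated data standing in for the real
# file (replace the simulation with, e.g., read.csv() on the download).
set.seed(1)
mass <- exp(runif(50, 0, 12))                             # body mass (g), simulated
bmr  <- exp(-4 + 0.71 * log(mass) + rnorm(50, sd = 0.2))  # BMR (W), simulated

logm   <- log(mass)
logbmr <- log(bmr)

plot(logbmr ~ logm)        # step 1: is the relationship linear on a log scale?
z <- lm(logbmr ~ logm)     # step 2: ordinary linear regression
coef(z)                    # the second coefficient is the estimated slope
confint(z)                 # step 3: 95% confidence interval for the slope
abline(z)                  # step 4: add the best-fit line to the plot
```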
  1. Plot the data. Is the relationship between mass and metabolic rate linear on a log scale?
  2. Fit a linear model to the log-transformed data. What is the estimate of slope? 
  3. Produce a 95% confidence interval for the estimate of slope. Does the interval include either of the candidate values for the scaling parameter β?
  4. Add the best-fit regression line to the plot in (1).
  5. Now compare the fits of the two candidate models to the data. To accomplish this you need to force a regression line through the (log-transformed) data having a specified slope (3/4 or 2/3). The easiest way to do this, for example in the case of β = 3/4, is to rewrite the model
                                   log(BMR)  =  log(α) + (3/4) log(M)
    as
          log(BMR) - (3/4) log(M)  =  log(α).

    To fit this model, you can use

      z1 <- lm((logbmr - (3/4)*logm) ~ 1)

    where the variable names above are self-explanatory, or you can accomplish the same in two steps

      y <- logbmr - (3/4)*logm

      z1 <- lm(y ~ 1)


    The coefficient of this model will be the estimate of the constant, log(α), in the best-fit model constrained to have slope 3/4.

    Repeat this procedure for the other candidate value of slope.
  6. Replot the data indicating the relationship between log(M) and log(BMR). Add to this plot the best-fit line having slope 3/4. (The command abline is an easy way to do this. Type ?abline for more info)  Repeat this for the slope 2/3. By eye, which line appears to fit the data best?
  7. Compare the residual (error) mean square (or residual standard error) of the two models you fit in (5). Which has the smaller residual variance?
  8. Calculate the log-likelihood of the parameters for each model fitted in (5). Which has the higher value?
  9. Calculate AIC for the two models, and the AIC difference*. Which model has the lower AIC value? How big is the AIC difference for the poorer model? (Remember: the smaller the AIC the better.)
  10. Calculate the Akaike weights of the two models**. Which has the higher weight of evidence in its favor? By how much is it larger? These weights would be needed for model averaging, which we won't go into at this point. The weights should sum to 1. (They are sometimes interpreted as the probability that the given model is the best one, given that one of the models in the set is the best, but we will avoid that here.)
  11. Summarize the overall findings. Do both models have some support, or does one of the two models have essentially no support?
*  23.73591
** 9.999930e-01 7.011488e-06
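The comparisons in steps 5 and 8–10 can be sketched as follows, again with simulated data in place of the real file. The weight calculation assumes the standard definition w_i = exp(-Δ_i/2) / Σ exp(-Δ_j/2).

```r
# Sketch of steps 5 and 8-10: fixed-slope fits, log-likelihoods, AIC,
# and Akaike weights. Simulated data stand in for the real file.
set.seed(1)
logm   <- seq(0, 12, length.out = 50)
logbmr <- -4 + 0.71 * logm + rnorm(50, sd = 0.2)

z1 <- lm((logbmr - (3/4) * logm) ~ 1)    # model with slope fixed at 3/4
z2 <- lm((logbmr - (2/3) * logm) ~ 1)    # model with slope fixed at 2/3

logLik(z1)                               # step 8: log-likelihoods
logLik(z2)

aic   <- c(AIC(z1), AIC(z2))             # step 9: AIC for each model
delta <- aic - min(aic)                  # AIC differences (0 for the best model)

w <- exp(-delta / 2) / sum(exp(-delta / 2))  # step 10: Akaike weights
w                                        # the weights sum to 1
```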

Bird abundance in forest fragments

In the following example we do no pre-selecting of candidate models. Instead, we have measurements of a set of possible explanatory variables, and we want to determine which of all possible models is the best. We also (sensibly) wish to identify those models that are near-best and should be kept under consideration (e.g., for use in planning, or subsequent model averaging).

The data are measurements of the abundance of forest birds in 56 forest fragments in southeastern Australia by Loyn (1987, cited in Quinn and Keough [2002] and analyzed in their Box 6.2). Abundance is measured as the number of birds encountered in a timed survey (the units aren't explained). Six predictor variables were measured in each fragment:
  • area: fragment area (ha)
  • dist: distance to the nearest fragment (km)
  • ldist: distance to the nearest larger fragment (km)
  • graze: grazing pressure (1 to 5, indicating light to heavy)
  • alt: altitude (m)
  • yr.isol: number of years since isolation.
The data can be downloaded here.
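The exploratory steps 1–3 might look like the sketch below; the simulated values and the names for the transformed variables are assumptions standing in for the downloaded Loyn data.

```r
# Sketch of steps 1-3 with simulated stand-in data (column names taken
# from the variable list above; values are invented).
set.seed(1)
birds <- data.frame(abund   = rpois(56, 20),
                    area    = exp(rnorm(56, 2)),   # skewed, like the real data
                    dist    = exp(rnorm(56, 0)),
                    ldist   = exp(rnorm(56, 1)),
                    graze   = sample(1:5, 56, replace = TRUE),
                    alt     = rnorm(56, 150, 50),
                    yr.isol = sample(10:80, 56, replace = TRUE))

tapply(birds$abund, birds$graze, mean)  # step 1: mean abundance by grazing level
pairs(birds)                            # step 2: pairwise scatter plots
birds$larea  <- log(birds$area)         # log-transform the skewed variables
birds$ldist2 <- log(birds$dist)         # (hypothetical names for the new columns)
cor(birds)                              # step 3: correlations among variables
```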
  1. Using only a single command, calculate the mean bird abundance for each category of grazing pressure.
  2. Using histograms, scatter plots, or the pairs command (see "Graphs, Multiple Variables" on the "display" tab of the Rtips pages), explore the frequency distributions of the variables. Several of the variables are highly skewed, which will lead to outliers with excessive leverage. Transform the highly skewed variables to solve this problem. 
  3. Use the cor command to estimate the correlation between pairs of explanatory variables. Which are the most highly correlated variables?
  4. Using the model selection tool, all subsets regression, determine which linear model best predicts bird abundance. Ignore interactions and quadratic terms, etc. (Note: make sure you include the names = option in your leaps command, because a later command needs it.)
  5. Plot Mallows' Cp against p (the number of parameters in each model, including the intercept). How many predictors does the best model have? Are there other acceptable models (i.e., with Cp < p)?
  6. Use lm to fit the best model to the data. Produce a summary of the results. Which variables are included, and which are not?
  7. Run leaps2aic (See the Model selection help pages) to calculate AIC and other quantities of interest. Save the results in an object (the function will create a data frame). 
  8. Using the results from (7), which model minimized AICc?
  9. In a new data frame, keep only those cases for which the difference in AICc is less than 10. How many models were retained?
  10. Using the AICc values, calculate the Akaike weights of all the models retained. How much weight is given to the best model? Are there common features shared among the models having the highest weights?
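Steps 4–5 can be sketched with the leaps package as below (simulated data again; the log-transformed predictor names are assumptions). leaps2aic, from the Rtips pages, would then be run on the saved object for steps 7–10.

```r
# Sketch of all-subsets regression (steps 4-5) using the leaps package,
# with simulated stand-in data and assumed variable names.
library(leaps)
set.seed(1)
birds <- data.frame(larea   = rnorm(56),
                    ldist   = rnorm(56),
                    lldist  = rnorm(56),
                    graze   = sample(1:5, 56, replace = TRUE),
                    alt     = rnorm(56),
                    yr.isol = rnorm(56))
birds$abund <- 10 + 3 * birds$larea - 2 * birds$graze + rnorm(56)

x <- as.matrix(birds[, c("larea", "ldist", "lldist", "graze", "alt", "yr.isol")])
z <- leaps(x, birds$abund, names = colnames(x))   # keep names= for later use

plot(z$Cp ~ z$size, xlab = "p", ylab = "Cp")      # Mallows' Cp against p
abline(0, 1)                                      # acceptable models: Cp < p
z$which[which.min(z$Cp), ]                        # predictors in the best model
```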
Let's try analyzing the data using stepAIC, which would also allow us to include interaction terms if we wished.
  1. Create a new data frame (or return to the original data frame) that includes the single response variable and all the explanatory variables you wish to fit. Leave out any variables that you don't want to include in the model. Leave out any interactions for now.
  2. Run stepAIC using the additive model (no interaction terms). Review the results printed out on the screen. See the Rtips pages on model selection (under fit model) for help interpreting the steps.
  3. Summarize the lm results for the best-fitting model from (2). Is the best model the same or different from the one picked out by leaps? Would this have been the model that resulted had you used stepwise multiple regression with null hypothesis significance testing instead? How can you tell?
  4. Calculate AIC for the best model analyzed in (3). Write this number down somewhere because we will compare it with another model fitted below. (Note: the AIC value that you compute will differ from that printed out by stepAIC for this model. Not to worry, stepAIC uses the command extractAIC instead of AIC. The computations differ only by a constant, so AIC differences are unaffected).
  5. Run stepAIC again, but this time use a model that includes all two-way interaction terms. This is already pushing the data to the limit, because there are only 56 data points. Scan the printed output on the screen to see the sequence of steps that stepAIC takes to find the best model.
  6. Summarize the lm results for the best-fitting model from (5). Is the best model the same or different from the one picked out in (2) from the analysis of the additive model? 
  7. Calculate AIC for the best model analyzed in (6). How does it compare to the AIC value computed in (4) for the best additive model? Considering the difference in AIC between the two models, which has more support? Do the two models have roughly equivalent support, or does one have "essentially no support", as defined in lecture?
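A sketch of the stepAIC workflow (MASS package), with simulated data and only a few predictors so that the two-way-interaction model stays small; all variable names are placeholders.

```r
# Sketch of stepAIC on the additive model and on the model with all
# two-way interactions (simulated stand-in data).
library(MASS)
set.seed(1)
birds <- data.frame(larea   = rnorm(56),
                    graze   = sample(1:5, 56, replace = TRUE),
                    yr.isol = rnorm(56))
birds$abund <- 10 + 3 * birds$larea - 2 * birds$graze + rnorm(56)

z.add <- lm(abund ~ ., data = birds)        # additive model, all predictors
z1 <- stepAIC(z.add, direction = "both")    # steps are printed to the screen
AIC(z1)                                     # use AIC(), not extractAIC(), here

z.int <- lm(abund ~ .^2, data = birds)      # adds all two-way interactions
z2 <- stepAIC(z.int, direction = "both")
AIC(z2)                                     # compare with AIC(z1)
```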