Likelihood

In this workshop we will use likelihood methods to estimate parameters and test hypotheses from data. Likelihood methods are especially useful when data conform to a non-normal probability distribution (e.g., binomial, exponential, etc). Refer to the "probability" tab of the Rtips web pages for instruction.

Maximum likelihood estimate

The likelihood method takes the data as given and varies the parameter of interest to find that parameter value for which the probability of obtaining the data is highest. This value is the maximum likelihood estimate of the parameter. By plotting the log-likelihood against different values for the parameter we also generate the likelihood function, which can be used to find the maximum and to obtain a likelihood-based confidence interval for the parameter.

The likelihood-based confidence interval is an approximation, and may be inaccurate for small sample size. It is based on the fact that as sample size gets larger and larger, twice the difference between the maximum log-likelihood and the log-likelihood at the calculated limits of the confidence interval approaches the corresponding critical value of the χ² distribution. When estimating only a single parameter, the approximate 95% confidence interval is obtained as the two values of the parameter corresponding to

max(loglike)-1.92

Log-likelihood ratio test

The log-likelihood ratio test can be used to compare the fits of two models to the same data. The "full" model fits the data using the maximum likelihood estimates for the parameter(s) of interest (for example, a proportion p). The "reduced" model constrains the parameter values to represent a null hypothesis (for example, that p = 0.5, or that p is the same between two treatments). The G statistic is calculated as twice the difference between the log-likelihoods of the two models ("full" minus "reduced").

G = 2 *(loglikefull - loglikereduced)

G is also known as the deviance. Under the null hypothesis, G has an approximate χ² distribution with degrees of freedom equal to the difference between the "full" and "reduced" models in the number of parameters estimated from data. We'll work through an example below.

Warmup

We'll start by getting familiar with the commands in R to calculate probabilities.

The probability of heads in a coin toss is 0.5. If you flip a coin 10 times, what is the probability of obtaining exactly 5 heads and 5 tails?
The fraction of human babies born who are boys is about 0.512. If 20 newborn babies are randomly sampled, what is the probability that exactly 10 are boys?
Plot the entire probability distribution for the number of boys in families having six children. Assume that the probability that any one child is a boy is 0.512.
If mortality is independent of age, then the probability of dying X years after birth will follow a geometric distribution, where X is any integer from 0 to infinity. If the probability of dying in any given year is 0.1, what fraction of individuals are expected to die at year 10?*
Refer to the previous question. If the probability of death in any give year is 0.1, what fraction of individuals die before they reach their 6th birthday?**
In an environment where prey are randomly distributed, the search time between discovered prey items will follow an exponential distribution. Imagine an environment in which the mean search time between prey items is 0.5 hours. What is the probability density corresponding to a search time of 2 hours?***
Refer to the previous problem. Create a line plot of the exponential curve over the majority of the range of possible values for search time between items.

*0.03487
**0.46856
***0.03663

Illegal tender

Jenkins (2001, Forensic Sci Int 121:189-93) collected 50 US 1-dollar bills and tested them for drug contamination. Forty six of the bills were found to contain trace amounts of cocaine (subsequent studies have corroborated this astonishing finding -- can you think of a reason?). For the purposes of this exercise, assume that the sample of bills was a random sample (in fact, 10 bills were sampled from each of 5 cities).

Generate a vector of possible values for the population proportion p, ranging from 0.01 to 0.99 in increments of 0.01.
Calculate the log-likelihood of each possible value for p, given the data.
Create a line plot of the log-likelihood against values for p.
Repeat steps (1) to (3) using a narrower range of values for p.
Use your results to determine the maximum likelihood estimate of the proportion of US 1-dollar bills contaminated with cocaine.
Provide a likelihood-based 95% confidence interval for the population proportion.*

*0.823, 0.975

Left-handed flowers

Individuals of most plant species are hermaphrodites (with both male and female sexual organs) and are therefore prone to inbreeding of the worst sort: having sex with themselves. The mud plantain, Heteranthera multiflora, has a simple mechanism to avoid such "selfing." The style deflects to the left in some individuals and to the right in others. The anther is on the opposite side. Bees visiting a left-handed plant are dusted with pollen on their right side, which then is deposited on the styles of only right-handed plants visited later. To investigate the genetics of this variation, Jesson and Barrett (2002, Proc. Roy. Soc. Lond., Ser. B, Biol. Sci. 269: 1835–1839) crossed pure strains of left- and right-handed flowers, yielding only right-handed F1 offspring, which were then crossed with one another. The expectation under a simple model of inheritance would be that their F2 offspring should consist of left- and right-handed individuals in a 1:3 ratio (i.e., 1/4 of the plants should be left-handed). The data from this cross can be downloaded here.

Calculate the log-likelihood surface and the maximum-likelihood estimate of the proportion of left-handed flowers (to the nearest hundredth). Use the data in the data frame, rather than summaries of the frequencies of left- and right-handed flowers, to calculate the likelihoods (practice with this approach will help you later below).
Use the same approach to calculate the likelihood-based 95% confidence interval of the population proportion.
We can compare the fits of two models to these same data, to test the null hypothesis that the proportion of left-handed flowers in the cross is 1/4 (i.e., the proportion predicted by the simplest genetic model for flower handedness). To begin, obtain the log-likelihood corresponding to the maximum likelihood estimate of the proportion of left-handed flowers. This represents the fit of the "full" model to the data. This model estimated one parameter from the data (p, estimated using maximum likelihood).
Now obtain the log-likelihood of the value for p specified by the null hypothesis. This represents the fit of the "reduced" model to the data. This reduced model estimated zero parameters from the data (instead, p was specified by the null hypothesis).
Calculate the G statistic for the log-likelihood ratio test*. To obtain a P-value for the test, calculate the tail probability from the χ² distribution as follows,

1 - pchisq(G, df)

where df is the degrees of freedom, calculated as the difference between the two models in the number of parameters estimated from the data.
How similar is the result from your log-likelihood ratio test to that from an ordinary χ² goodness of fit test? Analyze the same data using the chisq.test command in R and comment on the outcome.

* 0.1140369

Voyaging voles

The movement or dispersal distance of organisms is often modeled using the geometric distribution, assuming steps are discrete rather than continuous. For example, M. Sandell, J. Agrell, S. Erlinge, and J. Nelson (1991. Oecologia 86: 153-158) measured the distance separating the locations where individual voles, Microtus agrestis, were first trapped and the locations they were caught in a subsequent trapping period, in units of the number of home ranges. The data for 145 male and female voles are here. The variable "dispersal" indicates the distances moved. If the geometric model is adequate, then
Pr[X = 0 home ranges moved] = p,
Pr[X = 1 home range moved] = (1-p) p,
Pr[X = 2 home range moved] = (1-p)²p,
and so on.

Using the appropriate commands in R, calculate the maximum-likelihood estimate of p, the parameter of the geometric distribution. Use the data from both sexes combined.
Use the appropriate R command and the maximum likelihood estimate of p to calculate the predicted fraction of voles dispersing 0, 1, 2, 3, 4, and 5 home ranges.
Use the result in (2) to calculate the expected number of voles (out of 193) dispersing 0 - 5 home ranges, assuming a geometric distribution. How does this compare with the observed frequencies?*

* 165.59 23.51 3.34 0.47 0.07 0.01

Life of bees

Life spans of individuals in a population are often approximated by an exponential distribution. To estimate the mortality rate of foraging honey bees, P. K. Visscher and R. Dukas (1997. Insectes Sociaux 44: 1-5) recorded the entire foraging life span of 33 individual worker bees in a local bee population in a natural setting. The 33 life spans (in hours) are here.

Plot the frequency distribution of lifespans of the 33 bees. Choose the option to display probability density instead of raw frequency. Does the distribution of lifespans resemble an exponential distribution?
Use the exponential approximation and maximum likelihood to estimate the hourly mortality rate of bees.*
Using the maximum likelihood estimate, calculate the probability density for the exponential distribution across a range of values for lifespan. Draw this relationship between probability and lifespan on top of the frequency distribution you plotted in (1). Comment on the fit between the data and the exponential distribution you fitted. Is the exponential distribution a good fit to these data?
Assume (for the purposes of this exercise) that an exponential probability model is reasonable. Calculate the likelihood-based 95% confidence interval for the mortality rate.
Calculate the maximum likelihood estimate for the mean lifespan, with approximate 95% confidence interval.**

* 0.036
** 0.025, 0.050

Counting elephants

Counting elephants is more challenging than you might think, at least when they live in dense forest and feed at night. L. S. Eggert, J. A . Eggert, and D. S. Woodruff (2003. Molecular Ecology 12: 1389–1402) used mark-recapture methods to estimate the total number of forest elephants inhabiting Kakum National Park in Ghana without having to see a single one. They spent about two weeks in the park collecting elephant dung, from which they extracted elephant DNA. Using five genetic markers, they generated a unique DNA fingerprint for every elephant encountered in this way. Over the first seven days of collecting they identified 27 elephant individuals. Refer to these 27 elephants as marked. Over the next eight days they sampled 74 individuals, of which 15 had been previously marked. Refer to these 15 elephants as recaptured. We would like to use these numbers to estimate the total number of elephants in the park.

If we can make the following assumptions,

There were no births, deaths, immigrants, or emigrants while the study was being carried out, and
The dung of every elephant, marked or unmarked, regardless of its output, had an equal and independent chance of being sampled.

then the number of recaptured (i.e., previously marked) individuals X in the second sample should have a hypergeometric distribution with parameters k (the size of the second sample of individuals), m (total number of marked individuals in the population when the second sample was taken), and n (total number of unmarked individuals in the population when the second sample was taken).

Using the appropriate command in R for the hypergeometric distribution, calculate the maximum likelihood estimate for the total number of elephants in the park. Note that the total number is n + m, where n is the unknown parameter to be estimated. Note also that only integer values for n are allowed, and that n cannot be smaller than k - X, the observed number of unmarked individuals in the second sample.*
Calculate a likelihood-based 95% confidence interval for the total number of elephants.**

* 133
** 104, 193