UBC - BIOLOGY 300

5. DISCRETE PROBABILITY DISTRIBUTIONS

Probability distributions may be discrete or continuous. This week we will examine two common discrete distributions: the binomial and Poisson. We will use JMP to generate random samples from these distributions and explore their characteristics.

Binomial

The binomial distribution is one of the most commonly encountered discrete probability distributions in biology. It is based on nominal scale data that come from a population with only two categories. One of the two categories is arbitrarily referred to as a "success" and the other as a "failure" (based on which you are looking for), and these categories are mutually exclusive (e.g. male and female, black and white, left and right). The process of selecting an individual at random from the population is called a trial. The probability of a success remains constant from trial to trial. Furthermore, the outcome of any particular trial is not affected by the outcome of any other trial (i.e. trials are independent). The computer calculates the terms of the binomial distribution automatically using the equation:

$$P(X) = \frac{n!}{X!\,(n-X)!}\; p^{X} q^{\,n-X}$$

In this equation P(X) is the probability of seeing X successes, n is the number of trials, p is the proportion of total occurrences that are successes, and q is the proportion that are failures (since success and failure are mutually exclusive and are the only two possible outcomes, q = 1 - p). Another term commonly encountered in binomial calculations is N, the number of observations.
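The binomial probabilities described above can also be checked outside JMP. The following is a minimal Python sketch, not part of the JMP exercise; the function name binomial_pmf and the litter size of 4 are illustrative assumptions chosen to keep the output short.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    q = 1.0 - p  # probability of a failure on any one trial
    # comb(n, x) is n! / (x! (n - x)!), the number of ways to arrange x successes
    return comb(n, x) * p**x * q**(n - x)


# Illustrative example (assumed values): number of males in a litter of 4
# when males and females are equally likely.
n, p = 4, 0.5
probabilities = [binomial_pmf(x, n, p) for x in range(n + 1)]
for x, prob in enumerate(probabilities):
    print(f"P({x}) = {prob:.4f}")
```

Because the outcomes 0 through n cover every possibility, the listed probabilities should sum to 1, which is a quick way to catch a mistake in a hand calculation.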

The binomial distribution has several convenient mathematical features. First, the variance (σ²) = n × p × q (the standard deviation, σ, is the square root of this). Second, the mean (μ) = n × p. Third, for large numbers of trials (perhaps n > 25), the shape of the binomial distribution is very similar to a normal distribution, allowing us to estimate probabilities without repeatedly using the cumbersome formula given above.
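To see that these shortcut formulas agree with the distribution itself, the mean and variance can be computed the long way, by weighting each outcome by its probability. This is only a sketch; the values n = 20 and p = 0.3 are arbitrary assumptions and do not come from the lab problems.

```python
from math import comb, sqrt


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1.0 - p)**(n - x)


n, p = 20, 0.3  # illustrative values only
q = 1.0 - p

# Mean and variance computed directly from the probabilities of each outcome
mean_from_pmf = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
var_from_pmf = sum(
    (x - mean_from_pmf) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1)
)

# The same quantities from the shortcut formulas
mean_formula = n * p
var_formula = n * p * q
sd_formula = sqrt(var_formula)

print(f"mean: {mean_from_pmf:.4f} (direct) vs {mean_formula:.4f} (n x p)")
print(f"variance: {var_from_pmf:.4f} (direct) vs {var_formula:.4f} (n x p x q)")
print(f"standard deviation: {sd_formula:.4f}")
```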

Poisson

Another discrete probability distribution commonly encountered in biology is the Poisson distribution. This distribution is important in describing random occurrences of objects in space or events in time. The occurrence of an object or event is assumed to have no effect on the probability of a second occurrence of the same object or event (i.e. objects or events are independent). The formula for calculating the probability of X occurrences is:

$$P(X) = \frac{e^{-\mu}\,\mu^{X}}{X!}$$

In this equation, P(X) is the probability of seeing X occurrences, μ is the mean number of occurrences, and e is a constant (2.718..., the base for natural logarithms). An interesting property of the Poisson distribution is that the mean and variance of the number of events per interval are equal. Thus for a Poisson distribution, the variance to mean ratio, often called the coefficient of dispersion (σ²/μ), equals 1, which is an indication that events are randomly spaced. If this is not the case, the distribution is not a Poisson. A uniform distribution, for instance, has a variance/mean ratio less than 1, while a clumped distribution produces a ratio greater than 1.
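As a rough illustration of the two ideas in this paragraph, the sketch below computes a Poisson probability and the variance to mean ratio of a set of counts. The function names and the two sets of quadrat counts are hypothetical, invented only to show a ratio below 1 and a ratio above 1.

```python
from math import exp, factorial
from statistics import mean, variance


def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the mean number is mu."""
    return exp(-mu) * mu**x / factorial(x)


def coefficient_of_dispersion(counts):
    """Sample variance divided by sample mean."""
    return variance(counts) / mean(counts)


# Hypothetical counts of organisms in 8 quadrats (made-up data)
even_counts = [2, 2, 1, 2, 2, 3, 2, 2]     # evenly (uniformly) spaced
clumped_counts = [0, 0, 9, 0, 0, 7, 0, 0]  # clumped

print(f"P(0) when the mean is 2: {poisson_pmf(0, 2.0):.4f}")
print(f"even counts ratio: {coefficient_of_dispersion(even_counts):.3f}")
print(f"clumped counts ratio: {coefficient_of_dispersion(clumped_counts):.3f}")
```

Both made-up data sets have the same mean (2 per quadrat), so the difference in the ratio comes entirely from how variable the counts are.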

When there are a large number of trials, and p is very small, the Poisson distribution is very similar in shape to the binomial distribution. When the mean is large the normal distribution approximates the Poisson distribution.
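The similarity between the two distributions can be checked numerically. The sketch below compares the binomial and Poisson probabilities for an assumed case of n = 1000 trials with p = 0.002 (so the mean is 2); these values are illustrative and are not used elsewhere in the lab.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1.0 - p)**(n - x)


def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the mean number is mu."""
    return exp(-mu) * mu**x / factorial(x)


n, p = 1000, 0.002  # many trials, very small probability of success
mu = n * p          # Poisson mean matched to the binomial mean

# Compare the two distributions outcome by outcome
differences = []
for x in range(21):
    b = binomial_pmf(x, n, p)
    po = poisson_pmf(x, mu)
    differences.append(abs(b - po))
    if x <= 5:
        print(f"X = {x}: binomial {b:.5f}, Poisson {po:.5f}")

max_difference = max(differences)
print(f"largest difference for X = 0 to 20: {max_difference:.6f}")
```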

Using the Program

To experiment with these distributions you will need to use the calculator functions of JMP. Start by opening a new file (double click on the blank page icon), then set up four columns. To format a column to illustrate a probability distribution, select the column, then choose Formula from the pull-down menu for columns. The program will open a new window, the calculator. This platform allows you to create complex formulas to produce the data for a variable. For today’s exercise we will concentrate on the random number generating functions of this calculator and on the probability functions.

In the calculator window, click on random from the set of choices to the right of the keypad. This will produce a set of terms in the far right window that will allow us to generate samples from a wide variety of probability distributions.

Problems

1. Format column 1 as a binomial distribution by choosing binomial after you have selected the random number generator (it’s probably worth changing the column name to binomial just for clarity’s sake). As described above, two parameters, the number of trials and the probability of success, characterize the binomial. The calculator requires these values to be set before it can generate a binomial distribution. Let’s assume that we wish to examine the distribution of male and female offspring in rats with litter sizes of 10. Click to highlight the first empty square in the formula window. This is where you enter the number of trials. For this example choose 10 (the number of offspring in each litter). Then select the second empty box, the one for probability. For this example, where males and females are equally likely, set the probability of success to 0.5, then click OK. This will close the calculator window. If you now add 10 rows to your column you should obtain a random sample of 10 values from this particular binomial distribution, each value being the number of successes in one litter of 10.
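For readers who want to see what the random binomial function is doing, here is a minimal Python sketch of the same idea. It is not JMP's own algorithm; it simply counts successes over 10 simulated trials per litter. The function name random_binomial and the seed value are arbitrary assumptions.

```python
import random


def random_binomial(n, p, rng):
    """Count the successes in n independent trials, each with probability p."""
    return sum(1 for _ in range(n) if rng.random() < p)


# A fixed seed (arbitrary choice) makes this run repeatable.
rng = random.Random(300)

# 10 rows: each value is the number of successes in one litter of 10
litters = [random_binomial(10, 0.5, rng) for _ in range(10)]
print(litters)
```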

a) Plot a histogram and boxplots of the data in your binomial column and describe the distribution using the terminology from last week (skewed, bimodal, uniform, normal etc.). Does this sample appear to be symmetric? Does it have any outliers? Note the mean and standard deviation of the distribution.

 

 

 

 

b) Delete the 10 rows you have in this data set and immediately re-add them. Are the numbers the same as before? Why or why not? How much have the mean and standard deviation changed?

 

 

 

 

c) Choose formula again to access the calculator window and alter the probability of success. For now let’s assume that males are more common in this strain of rats and change the probability of success to 0.75 from its current value. Plot new histograms and boxplots and describe the changes to the distribution of the sample from this new binomial population. How are the mean and variance of the distribution changed? Write down these values.

 

 

 

 

d) Add 9990 rows to the column. What effect does this have on the distribution shape? On the sample mean and variance? Use the formulas from the introduction to this lab exercise to calculate the true mean and standard deviation for a population with these characteristics (10 trials or offspring = n, p: the probability of success = 0.75, and q: the probability of failure = 0.25). How much do your calculated parameters differ from the sample statistics for this set of 10000 values?

 

 

 

e) Compare the differences in parameter (true value) and statistics (sample estimate of a parameter) for your sample of 10000 values to those from your previous sample with only 10 values. Which sample size provides a more reliable estimate of population parameters?
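The comparison in parts d and e can also be tried outside JMP. The sketch below draws one sample of 10 values and one of 10,000 values and prints each sample's statistics beside the parameters from the formulas in the introduction. The sampler is a simple stand-in for JMP's random binomial function, and the seed is an arbitrary assumption, so your numbers in JMP will differ.

```python
import random
from math import sqrt
from statistics import mean, stdev


def random_binomial(n, p, rng):
    """Count the successes in n independent trials, each with probability p."""
    return sum(1 for _ in range(n) if rng.random() < p)


n, p = 10, 0.75
q = 1.0 - p
true_mean = n * p          # parameter: n x p
true_sd = sqrt(n * p * q)  # parameter: square root of n x p x q

rng = random.Random(2024)  # arbitrary seed so the run is repeatable
results = {}
for size in (10, 10_000):
    sample = [random_binomial(n, p, rng) for _ in range(size)]
    results[size] = (mean(sample), stdev(sample))
    sample_mean, sample_sd = results[size]
    print(
        f"N = {size}: mean differs by {abs(sample_mean - true_mean):.4f}, "
        f"standard deviation differs by {abs(sample_sd - true_sd):.4f}"
    )
```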

 

 

 

 

f) Under column info from the columns pull-down menu, edit the modeling type to convert the column into nominal data (we’ve actually been cheating by leaving the data listed as continuous; this allowed us to see values for mean and standard deviation). Change the probability of success back to 0.5 to simplify hand calculation. The histogram window should now provide a table of the estimated probabilities (relative frequencies) for each number of successes from 0 to 10. Compare the estimated probability of obtaining 10 males in a litter of 10 with the theoretical value you calculate by hand using the binomial formula from the introduction to this lab. You can also obtain the theoretical value by choosing probability -> binomial probability from the formula window (in some versions of the program you may need to choose discrete probabilities, then binomial probabilities). This formula requires you to enter the probability of success (p), the number of trials (n), and the number of successes (for this question we want to know the probability of getting 10 successes).

 

 

 

 

2. Set up a new column and label it Poisson. As you did in the previous question, call up the calculator window and format this column as a Poisson distribution by choosing Poisson after you have selected the random function. In this case, we could be examining the number of Asian Gypsy moths found in insect traps throughout the lower mainland. For this distribution you must set one characteristic, the mean (equivalent to lambda in this instance). Set this value to 0.5 for now, meaning that on average we found one insect in every second trap.

a) Produce a histogram and boxplots for the distribution and describe their shape. Why are there nothing but integer values for this distribution? Note the mean and standard deviation for this set of values.

 

 

 

 

 

 

 

b) Temporarily reduce the number of rows to 10. A quick way to do this is to select row 11, scroll to the bottom of the column using the arrows on the side of the window, shift-click the last row to select everything from the 11th row to the end of the file, and then delete the selected rows. What effect (if any) does this have on the shape of the distribution? Again, note the mean and standard deviation for this distribution (do this for parts c and d as well).

 

 

 

 

c) Edit the formula for your Poisson distribution, and change the mean to 0.01 (and increase the number of rows back to 10000), then produce a new histogram and boxplots and describe how the shape of the distribution has changed.

 

 

 

 

d) Increase the mean number of insects found per trap to 1.5 and describe the shape of the curve. Increase it to 5 then 20 then 50 and describe what happens to the shape of the distribution. What distribution is this starting to resemble?

 

 

 

 

 

e) Refer back to the means and standard deviations from parts a through d. What is the approximate relationship between these values (remember that we are just sampling from a Poisson, so these are only estimates of the true mean and standard deviation and will vary randomly)? What is the relationship between mean and variance for these distributions?

 

 

 

 

f) As with the binomial, the random functions can’t directly calculate the true probabilities for a specific number of successes. With 10,000 values in our sample, however, our estimates should be quite accurate (if you haven’t increased the number of rows back to 10,000, do so now). Convert the modeling type to nominal so that we will get a list of estimated probabilities for each number of successes. Set the mean to 1 for your Poisson distribution and use the equation from the introduction to this exercise to calculate the probability of finding 1 gypsy moth in a trap. Compare this to the estimated probability from your sample. As with the binomial, you can use the calculator window to generate the theoretical value for the Poisson. From Formula choose Probability -> Poisson probability and enter 1 for your lambda, since lambda is equivalent to the mean, and 1 for your k value, the number of successes for which we want the probability. Repeat this for the probability of finding no gypsy moths in a trap.
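The same comparison between an estimated and a theoretical probability can be sketched in Python. To avoid giving away the answer, the example uses an assumed mean of 2 and k = 3 rather than the values in the question. The sampler uses Knuth's multiplication method, a standard textbook technique for small means; it is not necessarily what JMP does internally, and the function names and seed are assumptions.

```python
import random
from collections import Counter
from math import exp, factorial


def poisson_pmf(k, lam):
    """Theoretical probability of exactly k occurrences when the mean is lam."""
    return exp(-lam) * lam**k / factorial(k)


def random_poisson(lam, rng):
    """Draw one Poisson value using Knuth's multiplication method."""
    threshold = exp(-lam)
    k = 0
    product = rng.random()
    # Keep multiplying uniform random numbers until the product drops
    # to the threshold or below; the number of extra draws is the value.
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


lam, k = 2.0, 3            # illustrative values, not those in the question
rng = random.Random(5)     # arbitrary seed so the run is repeatable
sample = [random_poisson(lam, rng) for _ in range(10_000)]

# Relative frequency of each outcome, like the nominal table in JMP
frequencies = Counter(sample)
estimated = frequencies[k] / len(sample)
theoretical = poisson_pmf(k, lam)
print(f"P({k}) estimated from sample: {estimated:.4f}")
print(f"P({k}) from the formula: {theoretical:.4f}")
```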

 

 

 

