UBC - BIOLOGY 300

6. THE NORMAL DISTRIBUTION

This week we will examine the normal distribution and explore some more issues in samples and populations. First, we will use JMP to generate random samples from this distribution and explore its characteristics. Second, we will explore tests of normality and ways to estimate probabilities (areas under the curve) associated with the normal distribution.

The normal distribution, as you have already learned, is one of the most important continuous distributions in statistical analyses. Many continuous variables are approximately normally distributed, particularly those that result from the sum of many small, independent effects. We use the normal distribution so widely in statistical analysis because we understand it and have developed a wide range of statistical tests that use it.

Unfortunately, the equation for the normal distribution is fairly complex. Because X is a continuous variable, we refer to the probability density at a value of X rather than to its probability. Probability corresponds to an area under the distribution curve over a range of X values, and the area under the curve at any single point is zero, since a single point has no width. The probability density is the height of the curve at a particular value of X. The equation for the normal curve is:

$$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X-\mu)^2}{2\sigma^2}}$$

where μ is the population mean and σ is the population standard deviation.

Note: Although the equation is ugly, you won’t be required to hand calculate it during this course. For the purposes of this week’s lab exercise, in fact, you will be using built-in functions of the program.
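
For readers working outside JMP, the density can be evaluated in a few lines. The following is a minimal sketch in Python (not part of the lab software); the function name normal_density is hypothetical, and the result is checked against the standard library's statistics.NormalDist.

```python
import math
from statistics import NormalDist

def normal_density(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Compare the hand-written formula with the standard library's version.
for x in (-2.0, 0.0, 1.5):
    by_hand = normal_density(x)
    built_in = NormalDist(mu=0.0, sigma=1.0).pdf(x)
    print(f"X = {x:5.2f}  formula = {by_hand:.6f}  built-in = {built_in:.6f}")
```

The two columns should agree, which is the point of the note above: the built-in function does the arithmetic for you.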

The normal distribution is characterized by two parameters (fixed characteristics of a distribution): the population mean and the population standard deviation. Because every pair of values defines a different curve, there are in fact an infinite number of normal curves. All normal distributions are symmetric, with tails that extend to positive and negative infinity.

The existence of an infinite number of normal distributions could present problems for analysis, since each one has a different width and a different mode. We solve this problem by using the standard normal distribution for many of our calculations. This distribution has a mean of zero and a standard deviation of one. Any normal distribution can be converted to the standard normal by subtracting the mean and dividing by the standard deviation. The formula for this is:

$$Z = \frac{X - \mu}{\sigma}$$

where μ and σ are the mean and standard deviation of the distribution that X comes from.

Z is a measure along the X axis and has units of standard deviations. The areas under the curve for the standard normal distribution have been calculated for almost all values of Z and are available in the statistical tables at the back of Zar and from JMP.
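
As an illustration of the conversion, here is a short Python sketch (again, not part of JMP). The mean of 170, standard deviation of 10 and observation of 185 are made-up values chosen only for the example, and z_score is a hypothetical helper name.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Convert a value from a normal distribution into standard deviation units."""
    return (x - mu) / sigma

# Made-up example values, used only for illustration.
mu, sigma = 170.0, 10.0
x = 185.0

z = z_score(x, mu, sigma)
area_below = NormalDist().cdf(z)          # area to the left of Z on the standard normal
same_area = NormalDist(mu, sigma).cdf(x)  # same area on the original scale

print(f"Z = {z:.2f}")
print(f"Area below Z on the standard normal: {area_below:.4f}")
print(f"Area below X on the original curve:  {same_area:.4f}")
```

The two areas match, which is why a single table of standard normal areas is enough for every normal distribution.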

Using the Program

Sampling from a standard normal distribution: To experiment with the normal distribution you will again need to use the calculator functions of JMP. Set up a new column, then choose Formula from the pull-down menu for columns. In the calculator window, click on random from the set of choices to the right of the keypad, then choose normal. As in our earlier lab exercise, this will generate samples from a standard normal distribution (one with a mean of zero and a standard deviation of one).
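
The same kind of sample can be drawn outside JMP. This Python sketch is an assumption-laden stand-in for the JMP formula, not a description of how JMP generates its numbers; the seed value is arbitrary and only makes the run repeatable.

```python
import random
from statistics import mean, stdev

rng = random.Random(300)  # arbitrary seed so the example is repeatable

# 1000 draws from a normal distribution with mean 0 and standard deviation 1.
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]

print(f"n = {len(sample)}")
print(f"sample mean = {mean(sample):.3f}")
print(f"sample standard deviation = {stdev(sample):.3f}")
```

The sample mean and standard deviation will be close to, but not exactly, the parameters 0 and 1.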

Testing normality: Data sets can be tested for normality using a goodness of fit test. Since potentially normal data are continuous, the Chi-squared and G tests are awkward to use. Goodness of fit tests better suited to continuous data sets include the Kolmogorov-Smirnov test, the Kolmogorov-Smirnov-Lilliefors test and the Shapiro-Wilk test. The latter two are especially powerful and are available using JMP. Once you have produced a histogram of your continuous data, click on the Red Arrow beside the variable name in the top-left corner of the window. This will open the sub-menu that lists various analyses, including a test that the distribution is normal. Choose Continuous Fit -> Normal, then at the right hand side of the window click on the Red Arrow beside Fitted Normal and choose Goodness of Fit. For samples under 2000 values the computer conducts a Shapiro-Wilk test, while larger samples are analysed using a Kolmogorov-Smirnov-Lilliefors test. Both tests compare your observed values to expected values generated from a normal distribution with the same mean and standard deviation as your sample data. If the probability given beside the test statistic is less than or equal to alpha, we reject the null hypothesis that the sample is from a normal distribution.
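
To make the idea of comparing observed and expected values concrete, the sketch below computes the Kolmogorov-Smirnov-Lilliefors statistic D in Python: the largest gap between the sample's cumulative distribution and that of a normal curve fitted with the sample mean and standard deviation. It does not reproduce JMP's procedure or its P values. The cut-off 0.886 / sqrt(n) is the commonly tabulated approximate 5% critical value for samples larger than about 30 and is used here as an assumption; the function names and the two artificial data sets are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def lilliefors_d(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    d = 0.0
    for i, x in enumerate(data):
        expected = fitted.cdf(x)
        # The empirical CDF steps from i/n up to (i + 1)/n at x; check both sides.
        d = max(d, (i + 1) / n - expected, expected - i / n)
    return d

def approx_critical_value(n):
    """Assumed approximate 5% critical value, intended for n above about 30."""
    return 0.886 / math.sqrt(n)

# Artificial data: evenly spaced values (uniform shape) and normal quantiles.
uniform_like = [(i + 0.5) / 2100 for i in range(2100)]
normal_like = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]

for name, data in (("uniform-like", uniform_like), ("normal-like", normal_like)):
    d = lilliefors_d(data)
    cutoff = approx_critical_value(len(data))
    verdict = "reject normality" if d > cutoff else "do not reject normality"
    print(f"{name}: D = {d:.4f}, cut-off = {cutoff:.4f} -> {verdict}")
```

A large D relative to the cut-off corresponds to a small probability in the JMP output.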

Generating values from a standard normal distribution: JMP can also generate probabilities for any Z value or Z values for any probability. To use this ability you need to open the calculator window and choose the probability option from the function browser. There are 3 probability functions related to the normal distribution. Normal distribution returns the probability of obtaining a value less than or equal to any Z value that you enter. Normal quantile returns the Z value that corresponds to a given probability (area) in the lower tail of the standard normal distribution; it is the inverse of the Normal distribution function. Normal density provides the height of the curve for any Z value that is entered, although this final option won’t be used today.
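
The three functions have close counterparts in Python's standard library, shown in the sketch below. The mapping to the JMP function names is my own reading of the descriptions above, and the Z value of 0.5 and probability of 0.90 are arbitrary illustration values.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

z = 0.5   # arbitrary Z value for illustration
p = 0.90  # arbitrary lower-tail probability for illustration

lower_area = standard_normal.cdf(z)      # counterpart of Normal distribution
upper_area = 1.0 - lower_area            # area above Z is the complement
z_for_p = standard_normal.inv_cdf(p)     # counterpart of Normal quantile
height = standard_normal.pdf(z)          # counterpart of Normal density

print(f"P(Z <= {z}) = {lower_area:.4f}")
print(f"P(Z >  {z}) = {upper_area:.4f}")
print(f"Z with {p} of the area below it = {z_for_p:.4f}")
print(f"Height of the curve at Z = {z}: {height:.4f}")
```

Because the quantile function is the inverse of the distribution function, feeding lower_area back into inv_cdf returns the original Z.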

Normal Quantile Plots: These graphs are available from the variable sub-menu for the distribution platform. A normal quantile plot is a cumulative frequency plot that has a Y axis scaled so that a normal distribution will appear as a straight, diagonal line. These curves are useful in assessing how and where a non-normal distribution differs from the ideal. Exponential relationships will appear as curves, while skews will cause the quantile plot to fall above or below the ideal straight line.
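
The coordinates behind such a plot can be sketched in Python as follows. Each sorted observation is paired with the Z value expected at its rank; the plotting position (i + 0.5) / n is an assumption made for this example and is not necessarily the one JMP uses. The function name and seed are hypothetical.

```python
import random
from statistics import NormalDist

def normal_quantile_points(values):
    """Pair each sorted observation with the Z value expected at its rank."""
    data = sorted(values)
    n = len(data)
    standard_normal = NormalDist()
    points = []
    for i, x in enumerate(data):
        # Assumed plotting position: midpoint of the i-th of n equal slices.
        p = (i + 0.5) / n
        points.append((standard_normal.inv_cdf(p), x))
    return points

rng = random.Random(300)  # arbitrary seed so the example is repeatable
sample = [rng.gauss(0.0, 1.0) * 3 + 5 for _ in range(200)]
points = normal_quantile_points(sample)

# Print every 40th point; plotting observed against expected Z gives the graph.
for expected_z, observed in points[::40]:
    print(f"expected Z = {expected_z:6.3f}   observed value = {observed:6.3f}")
```

Plotting the observed values against the expected Z values gives the quantile plot; departures from a straight line show where the sample differs from normal.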

Problems

1. Follow the instructions above to produce a random sample from a standard normal distribution. Begin with 1000 rows.

a) Plot a histogram and boxplots of the data in your column and describe the distribution. Does this sample appear to be symmetric? Does it have any outliers? Note the mean and standard deviation of the distribution. Click on the button marked more moments at the top of the moments table to see values for skewness and kurtosis. Is the sample skewed? Is the sample normal, platykurtic (negative kurtosis) or leptokurtic (positive kurtosis)?

 

 

 

 

 

 

 

b) Add a new column which is also a random sample from a standard normal distribution, but edit the equation for this column to multiply the values by 3. The equation in your calculator window should read:
Random normal ( ) * 3

Display the histogram for this data set alongside that for your first column (You can display multiple analyses by adding more than one column to the list of variables to be analysed). Use the Red Arrow menu to ensure that axes for each data set are uniform to improve our ability to contrast the distributions. How do the graphs differ?

 

 

 

 

 

c) Add a third column of random samples from a standard normal but this time add 5 to all the values. Your equation should look like:
Random normal ( ) + 5

Display the histogram for this data set alongside that for your first two columns. How does this graph differ from the previous ones?

 

 

 

 

 

 

d) Finally, add a column in which the standard normal samples are multiplied by 3 and then increased by 5. Your equation should look like:
Random normal ( ) * 3 + 5

Display the histograms and again compare the curves. Compare the means and standard deviations for these sample distributions. Your equations contain the parameters (true population values) for the mean and standard deviations of each distribution. How do these compare to the sample values?

 

 

 

 

 

 

 

 

2. Open a new file and column and open the calculator window for that column. Use the normal distribution function to calculate the probability of obtaining a value less than or equal to a specific Z value, or the normal quantile function to calculate a Z value from a specific probability. It is easiest to use these functions if you align the data window and equation window so you can see the answer in the data table after you hit the evaluate button in the equation window.

a) What is the probability of obtaining a Z value less than or equal to -1.00?

 

 

 

 

b) What is the probability of obtaining a Z value less than or equal to -1.96?

 

 

 

 

c) What is the probability of obtaining a Z value greater than or equal to 2.50? What is the probability of obtaining a Z value greater than 2.50?

 

 

 

 

d) What is the probability of obtaining a Z value greater than -0.65?

 

 

 

 

e) What is the probability of obtaining a Z value between -2.3 and 0.7?

 

 

 

 

f) What is the probability of obtaining a Z value less than -1.2 or greater than 0.2?

 

 

 

 

g) What is the probability of obtaining a Z value less than -1.2 and greater than 0.2?

 

 

 

 

h) Using the normal quantile function, for what value of Z does the lower portion of the standard normal distribution equal 0.05?

 

 

 

 

i) What value of Z excludes 0.01 of the standard normal in the upper tail?

 

 

 

 

j) What values of Z exclude 0.25 of the normal distribution spread evenly in both tails?

 

 

 

3. Open the bigclass file in the shared directory. This is the same data you used in a previous lab.

a) Test whether the height variable is normally distributed using a test for normality as described in the introduction to this lab. Follow the usual procedures for testing a hypothesis: Formulate your null and alternate hypotheses, set up an alpha level, compare the probability matching your test statistic to alpha, and reach a conclusion.

 

 

 

 

 

 

 

 

b) Carry out a goodness of fit to a normal distribution on the weight variable. Call up a normal curve and a smoothed curve on the histogram for the weight variable (these are available as options under fit distribution for the histogram). Does the visual appearance of the data, especially in contrast to the normal curve, suggest that the data are from a normal distribution? How does this compare to your conclusion from the statistical test?

 

 

 

 

 

 

 

c) Add a column which is a random sample from a uniform distribution, then increase the number of rows to 2100. Carry out a test that the distribution is normal. Use the Red Arrow sub-menus to produce a normal quantile plot. If the data are non-normal, where does the sample differ from normality?

 

 

 

 

 

 

 

4. Open the file cntrlmt.jmp. This file has a set of 5 columns, each of which uses random numbers between zero and one that have been raised to the fourth power. The first column consists of random samples of single values (each value is a random number from 0 to 1 that has been raised to the fourth power). The second column takes a group of five of these numbers and reports the mean for the five values. The third column reports the mean of 10 numbers. The fourth and fifth column take means of samples of 50 and 100 values repeatedly. In each case, the underlying distribution is a random number raised to the fourth power.
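
The structure of the five columns can be imitated in Python as a rough sketch. This is not the formula stored in cntrlmt.jmp, only my reading of the description above; the helper name, the seed and the choice of 500 rows are assumptions.

```python
import random
from statistics import mean, stdev

def mean_of_fourth_powers(k, rng):
    """Mean of k random numbers between 0 and 1, each raised to the fourth power."""
    return mean([rng.random() ** 4 for _ in range(k)])

rng = random.Random(300)  # arbitrary seed so the example is repeatable
group_sizes = (1, 5, 10, 50, 100)
rows = 500

# One list per column: each entry is the mean of a fresh group of k values.
columns = {
    k: [mean_of_fourth_powers(k, rng) for _ in range(rows)]
    for k in group_sizes
}

for k, values in columns.items():
    print(f"group size {k:3d}: mean = {mean(values):.3f}, "
          f"standard deviation = {stdev(values):.3f}")
```

Each list can then be examined in the same way as the JMP columns in the questions that follow.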

a) Add 500 rows to each of the 5 columns. The analysis could take a bit of time on some computers. Display histograms for each of the 5 columns and compare their general shapes. Which ones appear not to be normal?

 

 

 

 

 

 

b) Compare the skew and kurtosis values for each of the 5 columns. What happens to these values as sample size increases?

 

 

 

 

 

c) Carry out tests for normality on each of the columns. Which ones are significantly non-normal according to a test for normality?

 

 

 

 

 

 

 

d) What causes successive columns to become more and more normal even though the underlying distribution is not normal?

 

 

 

 

 

e) What is useful about the central limit theorem for statistical analysis?

 

 

 

 

