This page provides guidance on carrying out a few basic multivariate analyses in R. See the “Display” tab on the R tips web pages for ideas on how to visualize multivariate data.
Principal components analysis finds the most variable direction in the data, the first principal component. The next most variable direction that is uncorrelated with (“orthogonal to”) the first is the second principal component. And so on.
When carried out using the covariance matrix between variables (scale. = FALSE, the default), the procedure does not alter distances between pairs of points. However, retaining only the first few principal components will alter these distances because some information is lost. Using the correlation matrix instead (scale. = TRUE) standardizes the variables before carrying out the principal components analysis. This can be useful when the variables are in different units or have different variances.
The analysis yields the most sensible results if all the variables are on a common scale, which means they have similar units with similar variances (the variances will not be identical, but they will have the same order of magnitude). This can usually be accomplished with a log transformation when the data are morphological traits measured in similar units. If the traits are a mixture of linear, surface area, and volume or mass measurements, then after taking logs it is advisable to divide the log surface area measurements by 2 and the log volume or mass measurements by 3 (equivalent to taking the square and cube roots before log-transforming).
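For example, here is a minimal sketch of this rescaling, assuming hypothetical trait columns named length (linear), area (surface area), and mass (volume) in mydata.
mydata$loglength <- log(mydata$length)  # linear measurement: plain log
mydata$logarea <- log(mydata$area) / 2  # surface area: same as log(sqrt(area))
mydata$logmass <- log(mydata$mass) / 3  # volume or mass: same as log(mass^(1/3))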
You can check the variances as follows. If mydata is a data frame of numeric variables only, the covariance matrix v is
v <- cov(mydata) # the covariance matrix -- variances are along the diagonal
v <- var(mydata) # this works too
diag(v) # extracts the diagonal, i.e., the variances.
You can still do a principal components analysis if the traits cannot be put on the same scale (e.g., because they are in very different units). In this case you’ll want all the traits standardized to have a variance of 1 before analysis, which is done by using the correlation matrix instead of the covariance matrix when carrying out the principal components analysis.
If your data include missing values, create a new data frame that eliminates all rows containing missing values. This will make it easier later when you want to save the principal component axes and have them correspond to the data used to calculate them.
Most of the functions to be used can handle missing values, but it can often be convenient to remove them before you start, as follows. This is optional.
mydata <- na.omit(myoriginaldata)
Use the formula method of the prcomp() command. This has better handling of missing values than other methods.
z <- prcomp( ~ x1 + x2 + x3, data = mydata, na.action = na.exclude)
# shortcut formula with the specific columns to use in the analysis
z <- prcomp( ~ ., data = mydata[ ,1:3], na.action = na.exclude)
If you really must have the variables standardized before analysis (i.e., use the correlation matrix instead of the covariance matrix), use the scale. = TRUE option.
z <- prcomp( ~ ., data = mydata[ ,1:3], scale. = TRUE, na.action = na.exclude)
How many cases were used in your analysis? A quick way to determine this is to see how many non-missing values for PC1 were computed.
table(!is.na(z$x[, "PC1"]))
A scree plot illustrates the eigenvalues, which are the variances of the principal components.
screeplot(z, type="lines") # "scree" plot of eigenvalues
The biplot() command in base R makes a scatter plot of the data points along the first two principal components and overlays arrows to indicate the contributions of each trait to the principal components. The graphs can get messy if there are too many variables. Reduce the value of cex to prevent a confusion of overlapping point labels.
biplot(z, cex = 0.5, las = 1)
# Biplot for PC3 and PC4
biplot(z$x[,3:4], z$rotation[,3:4], cex = 0.5, las = 1)
Use the following commands to extract results from the prcomp object (here called z). If you included k variables in the analysis, then there will be k principal components.
z$rotation[, 1:5] # eigenvectors (loadings) for first 5 components
z$sdev^2 # eigenvalues (variances)
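The proportion of the total variance explained by each component follows directly from the eigenvalues; summary() reports the same quantities, including cumulative proportions.
z$sdev^2 / sum(z$sdev^2) # proportion of total variance explained by each component
summary(z)               # standard deviations and (cumulative) proportions of variance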
Use predict to extract the principal component scores, the measurements of every individual on the principal component axes. The output will be a matrix object (see the Data tab on the R tips web page for help with matrices), which you can convert to a data frame for easier handling.
x <- predict(z) # Scores for PC1, PC2, ...
x <- as.data.frame(x) # Convert matrix object to data frame
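Provided the PCA was fitted with na.action = na.exclude, the scores are padded with NA for rows having missing values, so they line up row-for-row with the original data frame and the two can be combined. A sketch:
mydata2 <- cbind(mydata, x) # attach the PC scores to the original variables
head(mydata2)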
An optional ggbiplot function can also be used to draw a biplot. To use it, you must first install and load it, as follows.
# First install the devtools package if you don't already have it
install.packages("devtools", dependencies=TRUE, type="binary")
library(devtools)
# Then install the ggbiplot package
install_github("vqv/ggbiplot")
library(ggbiplot)
However, ggbiplot() is currently intolerant of missing values, so you’ll need to rerun your PCA with na.action = na.omit before using it.
ggbiplot(z, choices = c(1,2))
ggbiplot(z, ellipse = TRUE, choices = c(2,3), groups = mydata$groupvar)
Linear discriminant function analysis is used to find axes (linear combinations of variables) that best separate predefined groups. The axes maximize variation between groups relative to variation within groups. In contrast, principal components analysis pays no attention to groupings in the data and finds axes that maximize total variation.
The method is also used to classify individuals into groups. Often this is carried out by dividing the data randomly into halves. The discriminant analysis is carried out on the first half, referred to as the “training” data set. The discriminant function is then used to classify individuals in the second half of the data into groups. Comparing the resulting classification with the original groupings gives an idea of the misclassification rate.
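As an illustration, here is a minimal sketch of this procedure using the lda() function described below; it assumes mydata contains a factor named group and only numeric variables otherwise.
library(MASS)                                      # provides lda()
n <- nrow(mydata)
train <- sample(n, floor(n/2))                     # random half of the rows
z <- lda(group ~ ., data = mydata, subset = train) # fit on the training half
pred <- predict(z, mydata[-train, ])               # classify the other half
table(mydata$group[-train], pred$class)            # classification table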
The necessary commands are in the MASS library. So to begin,
library(MASS)
The method expects a grouping variable (i.e., a factor) and one or more numerical variables to be used in calculating the discriminant function. For example, suppose mydata is a data frame having a grouping variable named “group” and 3 numerical variables, x1, x2, and x3. In this case the following commands are equivalent,
z <- lda(group ~ x1 + x2 + x3, data = mydata) # formula method
z <- lda(group ~ ., data = mydata) # shortcut
By default, the method will use the frequencies in each group as the prior probabilities for classification. To change the default, provide a vector of prior probabilities ordered in exactly the same way as the group factor levels. For example, if the grouping factor has three levels, the following command will specify equal prior probabilities.
z <- lda(group ~ x1 + x2 + x3, data = mydata, prior = c(1/3, 1/3, 1/3))
Results are extracted from the lda object (here called z). If there are k groups in the analysis, then there should be k - 1 discriminant axes.
plot(z) # scatter plot of new discriminant functions
print(z) # trait loadings and other statistics
z$scaling # trait loadings
Use predict to obtain the discriminant function scores, the measurements of every individual on the discriminant function axes. The results are in matrix format rather than in a data frame. (See the “Data” page at the R tips web site for help with matrices.)
predict(z)$x # yields the individual values (scores) for df1, df2,...
The predict command can also be used for classification. For example, assume that the same variables have been measured on a separate set of individuals in the data frame newdata. The following command will classify the new set according to the discriminant function,
z1 <- predict(z, newdata)
If left unspecified, the prior probabilities will be the same as those used in the preceding lda command. See above for information on how to specify a different prior. The results of the classification are stored as separate items in a list (here named z1). The groups are stored as a factor. The posterior probabilities and predicted scores are matrices.
z1$class # The groups into which the newdata were classified
z1$posterior # posterior probabilities of group classifications
z1$x # yields scores of df1, df2,... for newdata
If newdata is not provided, then predict will classify the individuals in the original training data set instead. The misclassification rate estimated this way will tend to be too low, because the data being classified are the same data used to generate the discriminant function.
z1 <- predict(z)
table(mydata$group, z1$class) # accuracy of classification
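To obtain a less optimistic estimate without splitting the data, lda() can carry out leave-one-out cross-validation with the CV = TRUE argument, in which case the returned object directly contains the cross-validated classes.
z2 <- lda(group ~ x1 + x2 + x3, data = mydata, CV = TRUE) # leave-one-out cross-validation
table(mydata$group, z2$class) # cross-validated classification table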
Correspondence analysis is used to ordinate species whose presence or absence (or abundance) is recorded at multiple sites. The data are arranged in a contingency table with each species as a different column and each site as a different row. The method finds uncorrelated axes that maximize the correspondence between species scores and site scores. Species and sites are treated equivalently and both may be plotted on the same axes.
The method converts the presence/absence data (indicated by 1’s and 0’s in the table) or abundances (integer counts) to associations that quantify whether a species’ observed count at a given site is greater than (>0), less than (<0), or similar to (~0) that expected under the assumption of independence of species and sites. This association matrix is then used to find the axes of maximum correspondence between species and sites.
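To see the idea, the expected counts under independence, and the resulting associations, can be computed directly from the contingency table. A minimal sketch, assuming mydata is the site-by-species table:
m <- as.matrix(mydata)
expected <- outer(rowSums(m), colSums(m)) / sum(m) # expected counts under independence
(m - expected) / sqrt(expected) # Pearson residuals: positive means more than expected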
The method assumes that each species has a unimodal abundance distribution along an unknown, underlying environmental gradient. The ordination can be distorted when assemblages at the ends of gradients have few species in common. The method may place sites closer to one another in the plot than their species compositions warrant. This leads to an “arch” shaped ordination of species and sites. One reason this occurs is that the measure of distance between assemblages, which is based on a χ2 statistic, is not linear. As a result, correspondence analysis may be problematic when beta diversity is high.
The commands are in the MASS library. So to begin,
library(MASS)
To analyze, put the species as separate columns of a data frame and the sites as separate rows. The correspondence analysis is carried out as follows. nf refers to the number of axes to be extracted, usually just 1 or 2. Results are stored in the correspondence analysis object, here called z.
z <- corresp(mydata, nf = 2)
If nf = 2, the following command creates a scatter plot of both sites and species, with their scales arranged to correspond.
plot(z)
Species that are close to one another in the plot tend to occur at the same sites. Likewise, nearby sites in the plot will tend to have similar species compositions (or species abundances, if abundance data are analyzed). Species located next to sites in the graph occur predominantly there, whereas species falling between sites tend to occur in two or more sites.
The eigenvalues from the analysis are the same for the ordinations of both species and sites. They are the correlations between the species scores and site scores along each of the axes.
z$cor
The column and row scores indicate the contributions of individual species and sites to the correspondence axes.
z$rscore # site (row) scores in vector or matrix format
z$cscore # species (column) scores in vector or matrix format
The results of the correspondence analysis would be easier to compare with the original data if the rows and columns of the data frame of frequencies were reordered to correspond to the positions of the sites and species along the first principal axis resulting from the analysis. This can be accomplished as follows. The positions of the sites and species along all the axes are given in the columns of the matrices in z$rscore and z$cscore of your correspondence analysis object (here named z). To reorder according to the first axis, use the first column of each of these matrices.
mydata[order(z$rscore[,1]), order(z$cscore[,1])]
Classical multidimensional scaling, also known as principal coordinates analysis, is a method to represent distances between samples or populations so that their relationships may be visualized in a small number of dimensions. It is commonly used to analyze genetic distances between populations or dissimilarity of species composition of ecological assemblages. In the resulting plot, distances between points will be preserved as much as possible given that the data have been reduced to a small number of dimensions.
To begin, obtain your symmetric distance matrix, here called d. This could be based on percent sequence divergence, Euclidean distance, community dissimilarity, etc. (this step will be situation specific). “Symmetric” means that d[i,j] = d[j,i] for all rows and columns i and j of the matrix.
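For example, Euclidean distances between the rows of a numeric data frame can be obtained with dist(); dissimilarity measures designed for community data are available elsewhere (e.g., vegdist() in the vegan package).
d <- dist(mydata) # Euclidean distance matrix from a numeric data frame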
To reduce the distances to k = 2 dimensions, enter the following command.
z <- cmdscale(d, k = 2, eig = TRUE)
The object z contains the following results.
z$points # measurements along the first two coordinate axes
z$eig # eigenvalues
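To visualize the configuration, plot the points along the first two coordinate axes; the relative sizes of the eigenvalues suggest how much of the distance structure each axis preserves.
plot(z$points[, 1], z$points[, 2], xlab = "Coordinate 1", ylab = "Coordinate 2")
z$eig[1:2] / sum(z$eig) # proportion captured by the first two axes (meaningful when eigenvalues are non-negative)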
Nonmetric multidimensional scaling preserves only the rank order of distances between points, as much as possible given that the data have been reduced to a small number of dimensions. The command is in the MASS library, so load it to begin.
library(MASS)
To reduce the distances to k = 2 dimensions, enter the following command.
z <- isoMDS(d, k = 2) # the starting configuration defaults to the cmdscale() solution
The object z contains the following results.
z$points # measurements along the first two coordinate axes
z$stress # final stress achieved, in percent; smaller values indicate a better fit
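One way to check the fit is a Shepard diagram, which plots the ordination distances against the original distances; the Shepard() function in the MASS package computes the required quantities.
sh <- Shepard(d, z$points)
plot(sh, pch = ".", xlab = "Original distance", ylab = "Ordination distance")
lines(sh$x, sh$yf, type = "S") # monotone regression fit used by isoMDS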
© 2009-2024 Dolph Schluter