Multivariate methods
This page provides some guidance to carrying out a few basic multivariate analyses in R.See the display tab on the R tips web pages for ideas on how to visualize multivariate data. See the data tab for information on using the apply family of commands to carry out operations on whole arrays of data at once.
Principal components analysis
Principal components analysis is the most straightforward method to visualize and ordinate multivariate data. It rotates the data to find the most variable direction in the data, the first principal component. The next most variable direction that is uncorrelated with ("orthogonal to") the first is the second principal component. And so on. Because it only carries out a rotation of the data, the procedure does not alter distances between pairs of points. However, these distances ARE altered if only the first few principal components are retained and the later ones are dropped from the data set.Preparing variables
The analysis will yields the most sensible results if all the variables are on a common scale. This can usually be accomplished with a log transformation when the data are morphological traits measured in similar units. Log-transforming will make variances more homogeneous between traits that have different means, and will usually render relationships more linear between traits. If the traits are a mixture of linear, surface area, and volume or mass measurements, then after taking logs it is advisable to divide surface area measurements by 2 and volume or mass measurements by 3 (equivalent to taking the square and cube roots before log-transforming).If the traits are a mixed bag having such different units that cannot be put on the same scale, then it may be necessary to use the correlation matrix instead of the covariance matrix when carrying out principal components analysis. This step standardizes all the traits to have a variance of 1 before the principal components analysis.
If your data include missing values, create a new data frame that eliminates all rows containing missing values. This will make it easier to save the principal component axes in the same data frame as the traits.
x <- na.omit(mydata)
Covariance and correlation matrix
The covariance matrix is a square array (k x k, where k is the number of variables) containing the variances of the traits down the diagonal, and the covariances between pairs of traits off-diagonal. If x is a data frame of numeric variables, the covariance matrix isv <- cov(x)
The variances can be extracted from the covariance matrix,
diag(v)
The correlation matrix is
r <- cor(x)
Although the prcomp command applied to the data is recommended for the analysis, as explained below, eigenvalues and eigenvectors can be calculated directly from the covariance (or correlation) matrix in R,
z <- eigen(v)
Principal components analysis
Use the command prcomp instead of princomp. If the data frame mydata has just 3 numerical variables, x1, x2, and x3, then the following are equivalent,z <- prcomp(mydata)
z <- prcomp( ~x1+x2+x3, data = mydata) # formula method
z <- prcomp( ~., data = mydata) # shortcut
If you really must use the correlation matrix instead of the covariance matrix,
z <- prcomp(mydata, scale=TRUE)
A biplot displays the data points along two principal components (the first and second components, by default) and adds arrows to indicate the contributions of each trait to these principal components. The graphs can get messy if there are too many variables. Use the cex option as well as the xlim and ylim options to help fit the labels onto the graph.
biplot(z, cex=0.7)
To visualize other components, set the choices option. For example, to plot the 3rd and 4th principal components, use
biplot(z, cex=0.7, choices=c(3,4))
Results are extracted from the prcomp object (here called z). . If you included k variables in the analysis, then there will be k principal components. To see eigenvalues and eigenvectors,
summary(z) # square root of eigenvalues, variance
proportions
print(z) # square root of eigenvalues;
eigenvectors too
plot(z,type="lines") # "scree" plot of eigenvalues
screeplot(z, type="lines") # same
z$rotation # eigenvectors (with trait
loadings)
z$sdev^2 # eigenvalues (variances)
Use predict to obtain the principal component scores, the measurements of every individual on the principal component axes. Note that this function will yield a matrix rather than a data frame. (See the start tab on the R tips web page for help with matrices.)
predict(z) # yields the measurements for pc1, pc2, ...
The sign of the vector of loadings for a principal component is arbitrary. For example, the following two eigenvectors point in opposite directions but they are otherwise the same. They represent the same principal axis of variation.
z$rotation[,1]
z$rotation[,1] * (-1)
Discriminant function analysis
Linear discriminant function analysis is used to find axes (linear combinations of variables) that best separate predefined groups. The axes maximize variation between groups relative to variation within groups. In contrast, principal components analysis pays no attention to groupings in the data and finds axes that maximize total variation.The method is also used to classify individuals into groups. Often this is carried out by dividing the data randomly into halves. The discriminant analysis is carried out on the first half, referred to as the "training" data set. The discriminat function is then used to classify individuals in the second half of the data into groups. Compare the resulting classification with the original groupings gives an idea of the misclassification rate.
The necessary commands are in the MASS library. So to begin,
library(MASS)
Discriminant function analysis
The method expects a grouping variable (preferably a factor, or it will convert) and one or more numerical variables to be used in calculating the discriminant function. For example, "mydata" is a data frame having a grouping variable named "group" and 3 numerical variables, x1, x2, and x3. In this case the following commands are equivalent,z <- lda(group ~ x1+x2+x3, data = mydata) # formula method
z <- lda(group ~., data = mydata) # shortcut
By default, the method will use the observed proportions of individuals in each group as the prior probabilities for classification. To change the default, provide a vector of prior probabilities ordered in exactly the same way as the group factor levels. For example, if the grouping factor has three levels, the following command will specify equal prior probabilities.
z <- lda(group ~ x1+x2+x3, data = mydata, prior=c(1/3,1/3,1/3))
Results are extracted from the lda object (here called z). If there are k groups in the analysis, then there should be k - 1 discriminant axes.
plot(z)
# scatter plot of new discriminant functions
print(z) # trait loadings and other
statistics
z$scaling # trait loadings
predict(z)$x # yields the scores for df1, df2,...
Classification
The predict command can also be used for classification. For example, assume that the same variables have been measured on a separate set of individuals in the data frame newdata. The following command will classify the new set according to the discriminant function,z1 <- predict(z, newdata)
If left unspecified, the prior probabilities will be the same as those used in the preceding lda command. See above for information on how to specify a different prior.The results of the classification are stored as separate items in a list (here named z1). The groups are stored as a factor. The posterior probabilities and predicted scores are matrices.
z1$class # The groups into which the newdata were classified
z1$posterior # posterior probabilities of group classifications
z1$x # yields scores of df1, df2,... for newdata
If the newdata are not provided then predict will classify the individuals in the original training data set instead. The misclassification rate is expected to be on the low side when the data to be classified are the same as the data used to generate the discriminant function.
z1 <- predict(z)
table(mydata$group, z1$class) # accuracy of classification
Simple correspondence analysis
Correspondence analysis is used to ordinate species whose presence or absence (or abundance) is recorded at multiple sites. The data are arranged in a contingency table with each species as a different column and each site as a different row. The method finds uncorrelated axes that maximize the correspondence between species scores and site scores. Species and sites are treated equivalently and both may be plotted on the same axes.The method converts the presence/absence data (indicated by 1's and 0's in the table) or abundances (integer counts) to associations that quantify whether a species' observed count at a give site is greater than (>0), less than (<0) or similar to (~0) that expected under the assumption of independence of species and sites. This association matrix is then used to find the axes of maximum correspondence between species and sites.
The method assumes that each species has a unimodal abundance distribution along an unknown, underlying environmental gradient. The ordination can be distorted when assemblages at the ends of gradients have few species in common. The method may place sites closer to one another in the plot than their species compositions warrant. This leads to an "arch" shaped ordination of species and sites. One reason this occurs is that the measure of distance between assemblages, which is based on a χ2 statistic, is not linear. As a result, correspondence analysis may be problematic when beta diversity is high.
The commands are in the MASS library. So to begin,
library(MASS)
Correspondence analysis
To analyze, put the species as separate columns of a data frame and the sites as separate rows. The correspondence analysis is carried out as follows. nf refers to the number of axes to be extracted, usually just 1 or 2. Results are stored in the correspondence analysis object, here called z.z<-corresp(mydata, nf=2)
If nf = 2, the following command creates a scatter plot of both sites and species, with their scales arranged to correspond.
plot(z)
Species that are close to one another in the plot tend to occur at the same sites. Likewise, nearby sites in the plot will tend to have similar species compositions (or species abundances, if abundance data are analyzed). Species located next to sites in the graph occur predominantly there, whereas species falling between sites tend to occur in two or more sites.The eigenvalues from the analysis are the same for the ordinations of both species and sites. They are the correlations between species scores and sample scores along each of the axes.
z$cor
The column and row scores indicate the contributions of individual species and sites to the correspondence axes.
z$rscore # site (row) scores in vector or matrix format
z$cscore # species (column) scores in vector or matrix format
Multidimensional scaling
This is a method to represent distances between samples or populations so that their relationships may be visualized in a small number of dimensions. It is commonly used to analyze genetic distances between populations or dissimilarity of species composition of ecological assemblages. In the resulting plot, distances between points will be preserved as much as possible given that the data have been reduced to a small number of dimensions.Classical (metric) multidimensional scaling
This method is also known as principal coordinates analysis. To begin, obtain your symmetric distance matrix, here called d. This could be based on percent sequence divergence, Euclidean distance, community dissimilarity, etc (this step will be situation specific). "Symmetric" means that d[i,j] = d[j,i] for all rows and columns i and j of the matrix.To reduce the distances to k=2 dimensions, enter the following command.
z <- cmdscale(d,k=2,eig=TRUE)
The object z contains the following results.
z$points # measurements along the first two coordinate axes
z$eig # eigenvalues
Non-metric multidimensional scaling
This method preserves only the rank order of distances between points, as much as possible given that the data have been reduced to a small number of dimensions. The command is in the MASS library, so load it to beginlibrary(MASS)
To reduce the distances to k=2 dimensions, enter the following command.
z <- isoMDS(d, y = cmdscale(d, k), k = 2)
The object z contains the following results.