Graphs and tables
This page provides tips and recommendations for making graphs and tables in R. Updated and revised frequently (click the reload button on your browser to make sure you are seeing the most recent version).
Frequency tables
These commands generate tables of frequencies or summary statistics. In the examples below, "x" is a single numeric variable. "A" is a categorical variable (factor or character variable) identifying different groups; "B" is a second such variable.
Frequency table
Frequency table for a categorical variable A,
table(A)
Contingency table
The following command generates a frequency table for two categorical variables, A and B. The command can be extended to three or more variables.
table(A, B)
Tables of descriptive statistics
The command "tapply" creates tables of results. The function argument can be any statistical function that can be applied to a single numeric variable (e.g., mean, standard deviation, median).
When creating tables for display purposes, such as in a manuscript, you can round the results to a fixed number of decimal places. For example, to round a table of means "z" to two decimal places, use
round(z,2)
Table display for one variable
For example, here is how to generate a one-way table of group means. "A" is the factor or character variable identifying the groups.
tapply(x, INDEX = A, FUN = mean, na.rm=TRUE)
The "na.rm" option removes missing values before computing (otherwise the mean returns "NA" if there are missing values). "na.rm" is not a tapply option -- rather it is an option for the function "mean". In general you can pass optional arguments to FUN by including them immediately afterward.
The following shortcut works if the arguments are listed strictly in the order shown.
tapply(x, A, mean, na.rm=TRUE)
tapply(x, A, sd, na.rm = TRUE)
Table display for two variables
The following example produces a two-way table of group medians. "x" is a numeric variable, whereas A and B are categorical variables (factors or character variables).
tapply(x, INDEX = list(A,B), FUN = median)
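The table commands above can be tried on a small made-up data set. The following is only a sketch: the values of "x", "A" and "B" are invented for illustration and are not from any real study.

```r
# Hypothetical example data (invented for illustration)
A <- rep(c("a", "b", "c"), each = 4)
B <- rep(c("lo", "hi"), times = 6)
x <- c(2.1, 3.4, NA, 4.0, 5.2, 5.9, 6.1, 4.8, 7.3, 8.0, 6.6, 7.1)

counts <- table(A, B)                                 # contingency table of counts
z <- tapply(x, INDEX = A, FUN = mean, na.rm = TRUE)   # one-way table of means
round(z, 2)                                           # rounded for display
med <- tapply(x, INDEX = list(A, B), FUN = median, na.rm = TRUE)  # two-way medians
med
```

Because group "a" contains a missing value, dropping "na.rm = TRUE" from the calls would return NA for the cells that include it.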
General graphics options
It is important to examine data visually before analyzing the numbers. Your eye is a great pattern detector and can easily pick out outliers, errors, and associations in data.
There are many different types of graphs. Here we've grouped them according to whether their purpose is to display frequencies of a single variable or associations between variables.
(Trellis graphics require special treatment. These tools in R generate panels of plots and are especially good for repeating the same type of graph on multiple different groups or subsets of the data. Trellis graphics are produced by a separate package with its own commands and options and so will be dealt with separately at the end.)
Generic command options
Many (but not all -- try them) of the basic plotting commands will accept the same options to control axis limits and labels, add a title, change the plotting symbol, change the size of the plotting symbols and text, and change the line types. Here are some of the most frequently modified options. Remember, these are not commands, but options, so use them inside the parentheses of a plotting command. If you are not sure whether a given option applies in your case, try it -- the worst that could happen is you get an error message, or R ignores it.
main="Eureka" # add a title above the graph
pch=16 # change plot symbol to a filled circle
col="red" # change the fill color
xlim=c(-10,10) # change limits of the x-axis (horizontal axis)
ylim=c(0,100) # change limits of the y-axis (vertical axis)
lty=2 # change line type to dashed
cex=1.5 # magnify the plotting symbols 1.5-fold
xlab="Body size" # label for the x-axis
ylab="Frequency" # label for the y-axis
To find out all the options (there are many), enter
?par
Plot more than one graph in the same window
To plot more than one graph in the same graph window, precede your plotting command with one of the following commands. Reissue the command if you want to restart at the top of the window.
par(mfrow=c(2,2)) # sets up a window for 4 plots, 2 by 2
par(mfrow=c(3,1)) # sets up a single column of 3 plots
par(mfrow=c(1,1)) # returns to a 1 plot per window layout
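As a sketch of how this fits into a session, the value returned by "par" can be saved and used later to restore the previous layout. The data here are simulated purely for illustration.

```r
set.seed(42)
x <- rnorm(100)                # hypothetical data
old <- par(mfrow = c(1, 2))    # 1 row, 2 columns; "old" keeps the previous setting
hist(x, main = "Histogram")    # drawn in the left panel
qqnorm(x); qqline(x)           # drawn in the right panel
par(old)                       # restore the layout that was in effect before
```

Restoring from the saved value avoids having to remember what the layout was beforehand.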
Graphs to display a single variable
These methods display the frequency distribution of a single variable.
Histogram - display frequency distribution for a numeric variable
The basic command,
hist(x, col="navy")
To control the width and number of histogram bins, use the "breaks" option. For example, the following command creates a series of bins 1 unit wide between the limits 0 and 6 (make sure all the data fall between these limits),
hist(x,breaks=seq(0,6,by=1))
To make sure that data tied with a break point are counted in the next higher interval (recommended), instead of the lower interval,
hist(x,breaks=seq(0,6,by=1),right=FALSE)
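To see what "right=FALSE" changes, the bin counts can be inspected without drawing anything by adding "plot=FALSE". This is a small sketch with invented values, several of which sit exactly on a break point.

```r
x <- c(0.5, 1, 1, 2, 3.5, 5)   # hypothetical data; 1, 2 and 5 are tied with breaks
b <- seq(0, 6, by = 1)
upper_closed <- hist(x, breaks = b, plot = FALSE)$counts                 # default
lower_closed <- hist(x, breaks = b, right = FALSE, plot = FALSE)$counts  # ties move up
rbind(upper_closed, lower_closed)   # compare the two sets of counts bin by bin
```

With the default, the two values equal to 1 are counted in the bin from 0 to 1; with "right=FALSE" they are counted in the bin from 1 to 2.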
To display probability density instead of raw frequencies
hist(x, freq=FALSE)
Superimpose a normal density curve
[to add]
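The entry above is a placeholder. One common approach, offered here only as a sketch and not necessarily the method intended for this page, is to draw the histogram on the density scale and then add a normal curve with the sample mean and standard deviation using "curve" and "dnorm". The data are simulated for illustration.

```r
set.seed(1)
x <- rnorm(200, mean = 3, sd = 1)   # hypothetical data
hist(x, freq = FALSE)               # density scale, so the curve is comparable
m <- mean(x)                        # computed beforehand because curve()
s <- sd(x)                          # reuses the name "x" inside its expression
curve(dnorm(x, mean = m, sd = s), add = TRUE, lty = 2)
```

The mean and standard deviation are stored first because inside "curve" the symbol "x" refers to the plotting grid, not to the data.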
Bar graph - display frequency distribution for a categorical (grouping) variable
In the following examples, "A" is a categorical variable, with each element corresponding to the category for a different individual or replicate. (A can be a factor or a character variable.) The basic command is below. The table command tallies up the frequency of individuals in each category and then barplot plots it.
barplot(table(A)) # makes a barplot
R will arrange the categories in alphabetical order. If you want to fix a specific order, specify this in a factor command. For example, if the variable A has three groups "a", "b" and "c", fix a preferred order as follows
A <- factor(A,levels=c("c","a","b"))
To plot the bars in order of decreasing frequency, (a good idea for bar graphs)
barplot(sort(table(A),decreasing=TRUE))
If you already have the frequencies for each category, then modify as follows. Let "A" be a variable that lists each named category exactly once. Let "y" be the variable containing the corresponding frequency of observations in each category.
barplot(y,names.arg=A)
Some other options to include in your barplot command to control appearance (enter "?barplot" for more options)
col="green" # fill color
space=0.1 # adjust space between bars (a number, not a quoted string)
cex.names=0.8 # shrink labels
names.arg=c("Hi","Ho","He") # set new names under bars
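Putting several of these pieces together, here is a minimal sketch of a sorted bar graph with a few appearance options. The vector "A" is invented for illustration.

```r
A <- c("b", "a", "c", "a", "b", "a")          # hypothetical categories
counts <- sort(table(A), decreasing = TRUE)   # tallies, most frequent first
barplot(counts, col = "green", space = 0.1, cex.names = 0.8)
```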
Normal quantile plot - compare data to the normal distribution
"x" is a numeric variable.
qqnorm(x)
qqline(x) # adds line through first and third quartiles
Graphs to display association between two variables
The appropriate graph depends on which variables are numeric or categorical.
Mosaic plot - association between two categorical variables
A and B can be factors or character variables.
mosaicplot(table(A,B),col=TRUE,las=2,cex.axis=0.8)
The "las" option makes the labels perpendicular to the axes so that they don't overlap. The "cex.axis" is optional and is used to adjust the size of the labels (provide a number that is a multiple of the default size).
Grouped bar graph - association between two categorical variables
A and B can be factors or character variables.
barplot(table(A,B),beside=TRUE)
The "space" option controls spacing between bars. Two values are needed: the first controls spacing between bars within each group, and the second controls the space between groups. With table(A,B), each group of bars corresponds to a category of B and contains one bar per category of A. The first number should be smaller than the second number.
barplot(table(A,B),beside=TRUE,space=c(.1,.3))
Other options:
barplot(table(A,B),beside=TRUE,cex.names=0.8) # adj label size
barplot(table(A,B),beside=TRUE,legend.text=TRUE) # add legend
Box plots - association between a categorical and a numeric variable
Generates a box plot for the numeric variable y separately for every group identified in the categorical variable A (A can be a factor or character variable).
boxplot(split(y,A)) # box plot
boxplot(y ~ A) # formula method: response ~ explanatory
Options to include in the boxplot command to control appearance (enter ?boxplot to see more)
cex.axis=0.8 # reduce labels to make room
boxwex=.5 # adjust width of boxes
varwidth=TRUE # widths proportional to sqrt(n)
Dot plot or "stripchart" - association between a categorical and a numeric variable
A dot plot can be used instead of a boxplot when the number of data points is not large.
stripchart(split(y,A), vertical=TRUE) # split-into-list method
stripchart(y ~ A, vertical=TRUE) # formula method
You can use "jitter" to reduce overlap of points. Changing the value of "jitter" adjusts how much noise is added. Change label size with cex.axis, and change symbol with pch.
stripchart(y ~ A, vertical=TRUE, method="jitter", jitter=0.2)
stripchart(y ~ A, vertical=TRUE, cex.axis=0.8, pch=1)
stripchart(y ~ A, vertical=TRUE, pch="-")
Option "add=TRUE" adds points to a preexisting stripchart. For example, to use different symbols according to unique values of category variable B, try the following. The first of the three commands below sets up the plot but withholds the points. The following two commands add the points for two groups of data points. Repeat "vertical=TRUE" in each command so that the added points use the same orientation as the original plot.
stripchart(y ~ A, vertical=TRUE, method="jitter", pch="")
stripchart(y[B=="b1"] ~ A[B=="b1"], vertical=TRUE, method="jitter",
    pch=1, add=TRUE)
stripchart(y[B=="b2"] ~ A[B=="b2"], vertical=TRUE, method="jitter",
    pch=16, add=TRUE)
You can add points or lines to a stripchart by taking advantage of the fact that the categorical variable plotted along the x-axis also has a numerical interpretation: the groups sit at positions 1, 2, 3, and so on. For example, to plot a single point at the mean of y for each of the four categories of a variable A, use
stripchart(y ~ A, vertical=TRUE, method="jitter", pch=1)
points( c(1,2,3,4), tapply(y, A, mean), pch=16)
You can offset the positions of the points by tweaking the values of x,
points( c(1,2,3,4) + 0.2, tapply(y, A, mean), pch=16)
Adding lines to a plot is similar. This is a simple way to add error bars to a stripchart. For example, either of the following commands adds a vertical line from y=5 to y=10 at the position of the first group in a stripchart,
lines( c(1,1), c(5,10)) # vertical line
arrows(1, 5, 1, 10, angle=90, code=3) # same with line ends
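The same idea extends to error bars for every group at once, because "points" and "arrows" accept vectors. The sketch below uses simulated data and draws the mean plus and minus one standard error; the choice of standard error is an assumption made for illustration, and other intervals would work the same way.

```r
set.seed(2)
A <- factor(rep(c("a", "b", "c", "d"), each = 10))     # hypothetical groups
y <- rnorm(40, mean = rep(c(5, 6, 7, 8), each = 10))   # hypothetical measurements
m   <- tapply(y, A, mean)                                # group means
se  <- tapply(y, A, function(v) sd(v) / sqrt(length(v))) # standard errors
pos <- seq_along(m) + 0.2          # group positions 1..4, offset to the right
stripchart(y ~ A, vertical = TRUE, method = "jitter", pch = 1)
points(pos, m, pch = 16)                                 # means as filled dots
arrows(pos, m - se, pos, m + se, angle = 90, code = 3, length = 0.05)
```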
Scatter plot - association between two numeric variables
Here's how to produce a scatter plot for two numeric variables, x and y. The variable listed first in the command is plotted along the horizontal (x) axis.
plot(x,y) # a scatter plot if x and y are numeric
plot(x,y,pch=16,col=2) # change symbol to a colored dot
plot(y ~ x) # formula method
Add a smooth curve through the data to estimate the shape of the relationship between y and x. The "lowess" command uses locally weighted regression to accomplish this. "Local" means that y is predicted for each x using only data in the vicinity of that x, rather than all the data. The size of the vicinity is controlled by the option "f", which is a proportion between 0 (very narrow vicinity) and 1 (uses all the data). Try different values of "f" to best capture the relationship. The default is f=2/3.
plot(x,y)
x1<-x[order(x)]
y1<-y[order(x)]
lines(lowess(x1,y1,f=0.5))
The "order" command sorts the values of x and y by x. This ensures that the segments of the lowess curve are drawn in order, sequentially from left to right.
Add a line to the scatter plot
plot(x,y)
abline(a = myintercept, b = myslope) # set values for a and b
abline(lm(y~x)) # least squares line
Plot multiple groups with different symbols. You can use "pch" to vary the symbol, or "col" to vary the color, or vary both at the same time. "x" and "y" are numeric variables, whereas "A" is a categorical variable identifying groups. If "A" is already a factor (check with "is.factor(A)") you can use just "A" instead of "factor(A)" in the commands below.
plot(x,y,pch=as.numeric(factor(A))) # or
plot(x,y,col=as.numeric(factor(A)))
You'll want to add a legend too so that you can identify the groups. Issue the following command and then click on the plot (inside the plot region) with your cursor to place the legend. You can add "col=" as a further option but the "pch=" option must be present. If "A" is already a factor (check with "is.factor(A)") you can use just "A" instead of "factor(A)" in the commands below.
plot(x,y,pch=as.numeric(factor(A)))
legend( locator(1), as.character(levels(factor(A))),
pch=1:length(levels(factor(A))) )
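Because "locator(1)" waits for a mouse click, it only works in an interactive graphics window. In a script, one alternative is to give "legend" a position keyword such as "topleft" instead. This is a sketch with simulated data; the group names are invented.

```r
set.seed(3)
A <- rep(c("g1", "g2", "g3"), each = 10)   # hypothetical groups
x <- rnorm(30)
y <- x + rnorm(30)
g <- factor(A)
k <- seq_along(levels(g))                  # one symbol/color code per group
plot(x, y, pch = as.numeric(g), col = as.numeric(g))
legend("topleft", legend = levels(g), pch = k, col = k)
```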
Here are two ways to identify individual points on a scatter plot. The first just redraws the scatter plot and then adds the row number next to each point (you can use another variable instead). This can get noisy if there are a lot of data points.
plot(x,y)
text(x,y,labels=seq_along(x),pos=1,cex=0.5)
The second method uses the cursor to click those few points on the plot you want identified. This version prints the row number when you click a point. You can change that by setting the "labels" option to a character variable that identifies them too.
plot(x,y)
identify(x,y,labels = seq_along(x))
Line plot - display a sequence of measurements
Here's how to plot a sequence of x,y points and connect the dots with lines. This is especially useful when the x-variable represents a series of points in time or across a spatial gradient. The plot command below is the same as in the case of a scatter plot. The "lines" command adds lines to the same plot and connects the dots. The "order(x)" bit is to connect the dots from left to right, in case the x-values are unsorted.
plot(x,y,pch=16)
lines(x[order(x)],y[order(x)])
To draw a line plot without any dots,
plot(x[order(x)],y[order(x)],type="l")
You can change line type, thickness, and color using the lty, lwd, and col options. Enter ?par to learn more about these line options.
lines(x[order(x)],y[order(x)],lty=3, lwd=2, col="red")
Graphs to display multiple variables
Interaction plots
Interaction plots are useful for displaying how the mean of a response variable y changes between the levels of two categorical variables, A and B. The graph is especially useful for determining whether an interaction is present between two factors A and B in a factorial experiment, or between a factor A and a blocking variable B. If the lines are parallel then there is no interaction.
interaction.plot(A, B, y)
The levels of the variable listed first (here, A) will be displayed along the x-axis of the plot. The y-axis will then display the mean of y separately for each category of the second variable, B. Variations on this command include
interaction.plot(B,A,y) # Put B along x-axis instead
interaction.plot(A,B,y,fun=median) # median of y
interaction.plot(A,B,y,col=1:length(unique(B))) # color the lines
interaction.plot(A,B,y,las=2) # more room for A's labels
A disadvantage of this plot is that it does not show the data. Use stripchart instead for this purpose.
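As a small worked sketch, the cell means that the interaction plot draws can also be tabulated with "tapply", which makes it easier to read off the values behind the lines. The data are simulated, with an extra effect deliberately added to one cell so that the lines are not parallel.

```r
set.seed(5)
A <- factor(rep(c("a1", "a2", "a3"), times = 8))   # hypothetical factor A
B <- factor(rep(c("b1", "b2"), each = 12))         # hypothetical factor B
# response: depends on A, plus an extra effect only when A is a3 and B is b2
y <- rnorm(24, mean = as.numeric(A) + 2 * (B == "b2") * (A == "a3"))
cell_means <- tapply(y, list(A, B), mean)          # the means plotted below
cell_means
interaction.plot(A, B, y)
```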
Overlay scatter plots for multiple groups
To draw a scatter plot between two numeric variables x and y separately for each category of a third variable A,
plot(x,y,pch=as.numeric(factor(A)))
To add a legend that identifies the groups, issue the following command and then click the cursor inside the plot region to place the legend.
legend(locator(1), as.character(levels(factor(A))),
    pch=1:length(levels(factor(A))))
Pairwise scatter plots for multiple variables
The following command creates a single graph with scatter plots between all pairs of numeric variables in a data frame, "mydata". The option "gap" adjusts the spacing between separate plots,
pairs(mydata,gap=0.5)
Use the formula method to plot only the three numeric variables x1, x2, and x3 in the data frame mydata.
pairs(~x1+x2+x3, data=mydata)
Trellis graphics
Trellis graphs are displays having multiple panels on the same page, arranged in a grid or trellis. They are a great tool for displaying the same type of graph separately for multiple groups or subsets of the data all on the same scale. Many options are available to control the features of the graphs.
The lattice package in R has a variety of tools for making trellis plots. It is included with the basic installation but you need to load the library,
library(lattice)
The graph types available in the lattice package include the standard ones found also in R's basic graphics package, such as box plots, histograms, and so on. The table below lists just a few of the basic graph types and the relevant command.
For example, to draw a histogram of a numeric variable "x" separately for four groups all identified by the variable "B" (assume the variables are in the data frame "mydata"), use
histogram(~x | B, data=mydata, layout=c(1,4))
The "layout" option is special to lattice and draws the 4 panels in a grid with 1 column and 4 rows, so that the histograms are stacked and most easily compared visually.
To draw a bar graph showing the frequency distribution of a categorical variable "A" separately for each group identified in the variable "B",
barchart( ~table(A) | B, data=mydata)
This produces horizontal bar graphs, which leaves room for the category labels. To draw the bars vertically instead, while tilting the group labels on the x-axis by 45 degrees so that they fit,
barchart(table(A) ~ names(table(A)) | B, data=mydata,
    scales=list(x=list(rot=45)))
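If the commands above do not behave as expected with your data (for example, when the tabulated counts and the grouping variable end up with different lengths), one alternative is to tabulate A by B first and plot the resulting frequencies. This is only a sketch: the data frame below is invented, and "Freq" is the column name that "as.data.frame" gives to the counts of a table.

```r
library(lattice)
# Hypothetical data frame with two categorical variables
mydata <- data.frame(A = c("a", "b", "a", "c", "b", "a", "c", "c"),
                     B = c("g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2"))
# One row per combination of A and B, with the count in column Freq
freq <- as.data.frame(table(A = mydata$A, B = mydata$B))
p <- barchart(Freq ~ A | B, data = freq, scales = list(x = list(rot = 45)))
print(p)   # lattice plots are drawn when printed
```

Swapping the formula to "A ~ Freq | B" would give horizontal bars instead.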
As a third example, draw a scatter plot to show the relationship between the numeric variables "y" and "x" separately for each group in the variable "B". The "pch" option in this example replaces the default plot symbol with a filled dot, and the "aspect" option sets the relative lengths of the vertical and horizontal axes.
xyplot(y ~ x | B, data=mydata, pch=16, aspect=0.7)
This doesn't even begin to describe what's possible using the lattice package. Crawley has a few more examples of trellis graphics in The R Book (p 836). To go even further, see Kuhnert and Venables' An Introduction to R: Software for Statistical Modelling & Computing (see the "Get help" section above).
Table showing a few of the commonly used plotting commands in the lattice package. Each command requires a formula describing the relationship between the plotted variables and the variable defining different groups. Formula structures for the basic graph types are shown in the right column. "x" and "y" are numeric variables, whereas "A" is a categorical variable (character or factor). "B" is a factor or character variable that will define the groups or subsets of the data frame to be plotted in separate panels. A separate plot in the graphics window will be made for each of the groups defined by the variable "B".
command      graph type                 command formula structure
barchart     bar graph                  barchart(~table(A) | B, data=mydata)
bwplot       box plot                   bwplot(x ~ A | B, data=mydata)
histogram    histogram                  histogram(~x | B, data=mydata)
stripplot    dot plot or "stripchart"   stripplot(x ~ A | B, data=mydata, jitter=TRUE)
xyplot       scatterplot                xyplot(y ~ x | B, data=mydata)