Graphs and tables

This page provides tips and recommendations for making graphs and tables in R. It is updated and revised frequently (click the reload button on your browser to make sure you are seeing the most recent version).

Frequency tables

These commands generate tables of frequencies or summary statistics. In the examples below, "x" is a single numeric variable. "A" is a categorical variable (factor or character variable) identifying different groups; "B" is a second such variable.

Frequency table

Frequency table for a categorical variable A,

  table(A)


Contingency table

The following command generates a frequency table for two categorical variables, A and B. The command can be extended to three or more variables.

  table(A, B)
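
As a small illustration, the sketch below tabulates two made-up categorical variables. The values of "A" and "B", and the names "freqA" and "freqAB", are invented for this example.

```r
# Toy data, invented for illustration only
A <- c("a", "b", "a", "c", "b", "a")
B <- c("u", "u", "v", "v", "u", "v")

freqA  <- table(A)      # one-way frequency table
freqAB <- table(A, B)   # two-way contingency table (rows = A, columns = B)

print(freqA)
print(freqAB)
```

Rows of the two-way table correspond to the first variable listed and columns to the second.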



Tables of descriptive statistics

The command "tapply" creates tables of results. The function argument can be any statistical function that can be applied to a single numeric variable (e.g., mean, standard deviation, median, etc).

When creating tables for display purposes, such as in a manuscript, you can round the results to a fixed number of decimal places. For example, to round a table of means "z" to two decimal places, use

  round(z,2)


Table display for one variable

For example, here is how to generate a one-way table of group means. "A" is the factor or character variable identifying the groups. 

  tapply(x, INDEX = A, FUN = mean, na.rm=TRUE)


The "na.rm" option removes missing values before computing (otherwise the mean returns "NA" if there are missing values). "na.rm" is not a tapply option -- rather it is an option for the function "mean". In general you can pass optional arguments to FUN by including them immediately afterward.

The following shortcut works if the arguments are listed strictly in the order shown.

  tapply(x, A, mean, na.rm=TRUE)


The command for a one-way table of group standard deviations is similar.

  tapply(x, A, sd, na.rm = TRUE)


Table display for two variables

The following example produces a two-way table of group medians. "x" is a numeric variable, whereas A and B are categorical variables (factors or character variables).

    tapply(x, INDEX = list(A,B), FUN = median)
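
To see the pieces working together, here is a minimal sketch that builds a two-way table of means and rounds it for display. The data and the names "z" and "z2" are made up for this example.

```r
# Toy data, invented for illustration only (note the missing value)
x <- c(1.234, 2.341, 3.456, 4.567, NA, 6.789)
A <- c("a", "a", "b", "b", "a", "b")
B <- c("u", "v", "u", "v", "u", "v")

# Two-way table of group means; na.rm=TRUE is passed on to mean()
z <- tapply(x, INDEX = list(A, B), FUN = mean, na.rm = TRUE)

# Round for display only; keep z itself at full precision
z2 <- round(z, 2)
print(z2)
```

Rounding a copy rather than the original keeps the unrounded values available for later calculations.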



General graphics options

It is important to examine data visually before analyzing the numbers. Your eye is a great pattern detector and can easily pick out outliers, errors, and associations in data.

There are many different types of graphs. Here we've grouped them according to whether their purpose is to display frequencies of a single variable, or associations between variables.

(Trellis graphics require special treatment. These tools in R generate panels of plots and are especially good for repeating the same type of graph on multiple different groups or subsets of the data. Trellis graphics are produced by a separate package with its own commands and options and so will be dealt with separately at the end.)

Generic command options

Many (but not all -- try them) of the basic plotting commands accept the same options to set axis limits and labels, add a title, change the plotting symbol, change the size of the plotting symbols and text, and change the line type. Here are some of the most frequently modified options. Remember, these are options, not commands, so use them inside the parentheses of a plotting command. If you are not sure whether a given option applies in your case, try it -- the worst that can happen is that you get an error message or a warning, or R ignores the option.

  main="Eureka"    # add a title above the graph

  pch=16           # change plot symbol to a filled circle

  col="red"        # change the fill color

  xlim=c(-10,10)   # change limits of the x-axis (horizontal axis)

  ylim=c(0,100)    # change limits of the y-axis (vertical axis)

  lty=2            # change line type to dashed

  cex=1.5          # magnify the plotting symbols 1.5-fold

  xlab="Body size" # label for the x-axis

  ylab="Frequency" # label for the y-axis

 
To find out all the options (there are many), enter

  ?par


Plot more than one graph in the same window

To plot more than one graph in the same graph window, precede your plotting command with one of the following commands. Reissue the command if you want to restart at the top of the window.

  par(mfrow=c(2,2)) # sets up a window for 4 plots, 2 by 2.

  par(mfrow=c(3,1)) # sets up a single column of 3 plots.
  par(mfrow=c(1,1)) # returns to a 1 plot per window layout
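
A common pattern, sketched here with made-up data, is to save the previous settings when changing the layout and restore them afterward. This relies on par() returning the old values of whatever it changes; the name "op" is just a convention chosen for this example.

```r
set.seed(1)
x <- rnorm(50)               # made-up data for illustration

op <- par(mfrow = c(1, 2))   # par() returns the previous settings
hist(x)                      # left panel
qqnorm(x)                    # right panel
par(op)                      # restore the earlier layout
```

Restoring with par(op) puts back whatever layout was in effect before, which is not necessarily 1 plot per window.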



Graphs to display a single variable

These methods display the frequency distribution of a single variable.

Histogram - display frequency distribution for a numeric variable

The basic command, 

  hist(x, col="navy")


To control the width and number of histogram bins, use the "breaks" option. For example, the following command creates a series of bins 1 unit wide between the limits 0 and 6 (make sure all the data fall between these limits),

  hist(x,breaks=seq(0,6,by=1))


To make sure that data tied with a break point are counted in the next higher interval (recommended), instead of the lower interval,

  hist(x,breaks=seq(0,6,by=1),right=FALSE)
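
To check how tied values are binned without drawing anything, you can ask hist for the counts only. This is a small sketch with invented data; "brk", "right_closed" and "left_closed" are names made up for the example.

```r
# Toy data with values that fall exactly on break points
x <- c(1, 1, 2, 2.5, 3)
brk <- seq(0, 6, by = 1)

# plot=FALSE returns the binning results instead of drawing the histogram
right_closed <- hist(x, breaks = brk, plot = FALSE)$counts
left_closed  <- hist(x, breaks = brk, right = FALSE, plot = FALSE)$counts

print(right_closed)   # default: ties go to the lower interval
print(left_closed)    # right=FALSE: ties go to the higher interval
```

Here the two values equal to 1 are counted in the first bin by default and in the second bin when right=FALSE.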


To display probability density instead of raw frequencies

  hist(x, freq=FALSE)


Superimpose a normal density curve
[to add]
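
One way this could be done, offered as a sketch rather than this page's own recipe: draw the histogram on the density scale and add a normal curve with the same mean and standard deviation as the data. The data and the names "m" and "s" are made up for this example.

```r
set.seed(1)
x <- rnorm(200, mean = 3, sd = 1)   # made-up data for illustration

# Compute these first: inside curve(), "x" refers to the plotting variable
m <- mean(x)
s <- sd(x)

hist(x, freq = FALSE)                          # density scale, not counts
curve(dnorm(x, mean = m, sd = s), add = TRUE)  # overlay the normal density
```

The histogram must use freq=FALSE, otherwise the bars (counts) and the curve (density) are on different vertical scales.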

Bar graph - display frequency distribution for a categorical (grouping) variable

In the following examples, "A" is a categorical variable, with each element corresponding to the category for a different individual or replicate. (A can be a factor or a character variable.)  The basic command is below. The table command tallies up the frequency of individuals in each category and then barplot plots it.

  barplot(table(A))   # makes a barplot


R will arrange the categories in alphabetical order. If you want to fix a specific order, specify this in a factor command. For example, if the variable A has three groups "a", "b" and "c", fix a preferred order as follows

  A <- factor(A,levels=c("c","a","b"))


To plot the bars in order of decreasing frequency (a good idea for bar graphs),

  barplot(sort(table(A),decreasing=TRUE))


If you already have the frequencies for each category, then modify as follows. Let "A" be a variable that lists each named category exactly once. Let "y" be the variable containing the corresponding frequency of observations in each category.

  barplot(y,names.arg=A)


Some other options to include in your barplot command to control appearance (enter "?barplot" for more options)

  col="green"                    # fill color

  space=0.1                      # adjust space between bars

  cex.names=0.8                  # shrink labels

  names.arg=c("Hi","Ho","He")    # set new names under bars
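
Putting several of these options into one command might look like the following sketch. The data are invented, and "counts" is a name chosen for this example.

```r
# Toy data, invented for illustration only
A <- c("b", "a", "b", "c", "b", "a")

# Tally the categories and sort so the tallest bar comes first
counts <- sort(table(A), decreasing = TRUE)

# "space" takes a number (a fraction of the bar width), not a quoted string
barplot(counts, col = "green", space = 0.1, cex.names = 0.8)
```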


Normal quantile plot - compare data to the normal distribution

"x" is a numeric variable.

  qqnorm(x)

  qqline(x)  # adds line through first and third quartiles




Graphs to display association between two variables

The appropriate graph depends on which variables are numeric or categorical.

Mosaic plot - association between two categorical variables

A and B can be factors or character variables. 

  mosaicplot(table(A,B),col=TRUE,las=2,cex.axis=0.8)


The "las" option makes the labels perpendicular to the axes so that they don't overlap. The "cex.axis" is optional and is used to adjust the size of the labels (provide a number that is a multiple of the default size).

Grouped bar graph - association between two categorical variables

A and B can be factors or character variables. 

  barplot(table(A,B),beside=TRUE)


The "space" option controls spacing between bars. Two values are needed: the first controls spacing between bars within each group, and the second controls the space between groups. With table(A,B), each group of bars corresponds to one category of B (a column of the table), and the bars within a group correspond to the categories of A. The first number should be smaller than the second number.

  barplot(table(A,B),beside=TRUE,space=c(.1,.3))


Other options:

  barplot(table(A,B),beside=TRUE,cex.names=0.8)  # adj label size

  barplot(table(A,B),beside=TRUE,legend.text=TRUE) # add legend



Box plots - association between a categorical and a numeric variable

Generates a box plot for the numeric variable y separately for every group identified in the categorical variable A (A can be a factor or character variable).

  boxplot(split(y,A)) # box plot

  boxplot(y ~ A)      # formula method: response ~ explanatory


Options to include in the boxplot command to control appearance (enter ?boxplot to see more)

  cex.axis=0.8       # reduce labels to make room

  boxwex=.5          # adjust width of boxes

  varwidth=TRUE      # widths proportional to sqrt(n)



Dot plot or "stripchart" - association between a categorical and a numeric variable

A dot plot can be used instead of a boxplot when the number of data points is not large.

  stripchart(split(y,A), vertical=TRUE) # split-into-list method

  stripchart(y ~ A, vertical=TRUE)      # formula method


You can use "jitter" to reduce overlap of points. Changing the value of "jitter" adjusts how much noise is added. Change label size with cex.axis, and change symbol with pch.

  stripchart(y ~ A, vertical=TRUE, method="jitter", jitter=0.2)

  stripchart(y ~ A, vertical=TRUE, cex.axis=0.8, pch=1)

  stripchart(y ~ A, vertical=TRUE, pch="-")


Option "add=TRUE" adds points to a preexisting stripchart. For example, to use different symbols according to unique values of category variable B, try the following. The first of the three commands below sets up the plot but withholds the points. The following two commands add the points for two groups of data points.

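  stripchart(y ~ A, vertical=TRUE, method="jitter", pch="")

  # repeat vertical=TRUE so the added points match the orientation of the plot
  # (A should be a factor so that group positions stay fixed after subsetting)
  stripchart(y[B=="b1"] ~ A[B=="b1"], vertical=TRUE, method="jitter",
                pch=1, add=TRUE)

  stripchart(y[B=="b2"] ~ A[B=="b2"], vertical=TRUE, method="jitter",
                pch=16, add=TRUE)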

You can add points or lines to a stripchart by taking advantage of the fact that the category variable plotted along the x-axis also has a numerical interpretation: the groups are positioned at x = 1, 2, 3, and so on. For example, to plot a single point at the mean of y for each of the four categories of a variable A, use

  stripchart(y ~ A, vertical=TRUE, method="jitter", pch=1)

  points( c(1,2,3,4), tapply(y, A, mean), pch=16)


You can offset the positions of the points by tweaking the values of x,

  points( c(1,2,3,4) + 0.2, tapply(y, A, mean), pch=16)


Adding lines to a plot is similar. This is a simple way to add error bars to a stripchart. For example, either of the following commands adds a vertical line from y=5 to y=10 at the position of the first group in a stripchart,

  lines( c(1,1), c(5,10))                 # vertical line
  arrows(1, 5, 1, 10, angle=90, code=3)   # same with line ends
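
The same idea extends to all groups at once. The sketch below adds group means with standard error bars to a stripchart; the data are invented, and the choice of one standard error for the bar length is an assumption made for this example (use whatever interval suits your purpose).

```r
# Toy data, invented for illustration only
y <- c(4, 5, 6, 7, 9, 8, 10, 12, 11)
A <- factor(rep(c("a", "b", "c"), each = 3))

m   <- tapply(y, A, mean)                            # group means
se  <- tapply(y, A, sd) / sqrt(tapply(y, A, length)) # standard errors
pos <- seq_along(m) + 0.2                            # offset from the data

stripchart(y ~ A, vertical = TRUE, method = "jitter", pch = 1)
points(pos, m, pch = 16)
# arrows() is vectorized, so one call draws a bar for every group
arrows(pos, m - se, pos, m + se, angle = 90, code = 3, length = 0.05)
```

The "length" option sets the width of the flat line ends, in inches.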


Scatter plot - association between two numeric variables

Here's how to produce a scatter plot for two numeric variables, x and y. The variable listed first in the command is plotted along the horizontal (x) axis.

  plot(x,y)               # a scatter plot if x and y are numeric

  plot(x,y,pch=16,col=2)  # change symbol to a colored dot

  plot(y ~ x)             # formula method


Add a smooth curve through the data to estimate the shape of the relationship between y and x. The "lowess" command uses locally weighted regression to accomplish this. "Local" means that y is predicted for each x using only data in the vicinity of that x, rather than all the data. The size of the vicinity is controlled by the option "f", which is a proportion between 0 (very narrow vicinity) and 1 (uses all the data). Try different values of "f" to best capture the relationship. The default is f=2/3.

  plot(x,y)

  x1<-x[order(x)]

  y1<-y[order(x)]

  lines(lowess(x1,y1,f=0.5))

The "order" command sorts the values of x and y by x. This ensures that the segments of the lowess curve are drawn in order, sequentially from left to right.
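
To get a feel for the effect of "f", it can help to overlay curves for two different values on the same plot. This is a sketch with simulated data; the names "ord", "fit_narrow" and "fit_wide" and the values 0.2 and 0.8 are arbitrary choices for this example.

```r
set.seed(1)
x <- runif(100, 0, 10)               # made-up data for illustration
y <- sin(x) + rnorm(100, sd = 0.3)

ord <- order(x)                      # sort both variables by x
fit_narrow <- lowess(x[ord], y[ord], f = 0.2)  # small vicinity: wiggly curve
fit_wide   <- lowess(x[ord], y[ord], f = 0.8)  # large vicinity: smooth curve

plot(x, y)
lines(fit_narrow, lty = 1)           # solid line
lines(fit_wide, lty = 2)             # dashed line
```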

Add a line to the scatter plot

  plot(x,y)

  abline(a = myintercept, b = myslope) # set values for a and b

  abline(lm(y~x))                      # least squares line


Plot multiple groups with different symbols. You can use "pch" to vary the symbol, or "col" to vary the color, or vary both at the same time. "x" and "y" are numeric variables, whereas "A" is a categorical variable identifying groups. If "A" is already a factor (check with "is.factor(A)")  you can use just "A" instead of "factor(A)" in the commands below.

  plot(x,y,pch=as.numeric(factor(A)))  # or

  plot(x,y,col=as.numeric(factor(A)))


You'll want to add a legend too so that you can identify the groups. Issue the following command and then click on the plot (inside the plot region) with your cursor to place the legend. You can add "col=" as a further option but the "pch=" option must be present. If "A" is already a factor (check with "is.factor(A)")  you can use just "A" instead of "factor(A)" in the commands below.

  plot(x,y,pch=as.numeric(factor(A)))

  legend( locator(1), as.character(levels(factor(A))),
         pch=1:length(levels(factor(A))) )
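
If you are running a script rather than clicking on the plot, legend also accepts a keyword position such as "topleft" in place of locator(1). The sketch below uses invented data; "grp" is a name made up for this example.

```r
# Toy data, invented for illustration only
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 4, 3, 6, 5, 8)
A <- c("ctl", "ctl", "trt", "trt", "ctl", "trt")

grp <- factor(A)                     # convert once and reuse
plot(x, y, pch = as.numeric(grp))    # symbol 1 for "ctl", symbol 2 for "trt"

# Symbols in the legend follow the same order as the factor levels
legend("topleft", legend = levels(grp), pch = seq_along(levels(grp)))
```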


Here are two ways to identify individual points on a scatter plot. The first draws the scatter plot and then adds the row number next to each point (you can use another variable instead). This can get noisy if there are a lot of data points.

  plot(x,y)

  text(x,y,labels=seq_along(x),pos=1,cex=0.5)

The second method uses the cursor to click those few points on the plot you want identified. This version prints the row number when you click a point. To print something else, set the "labels" option to a character variable that identifies the points.

  plot(x,y)

  identify(x,y,labels = seq_along(x))


Line plot - display a sequence of measurements

Here's how to plot a sequence of x,y points and connect the dots with lines. This is especially useful when the x-variable represents a series of points in time or across a spatial gradient. The plot command below is the same as in the case of a scatter plot. The "lines" command adds lines to the same plot and connects the dots. The "order(x)" bit is to connect the dots from left to right, in case the x-values are unsorted.

  plot(x,y,pch=16)

  lines(x[order(x)],y[order(x)])


To draw a line plot without any dots,

  plot(x[order(x)],y[order(x)],type="l")


You can change line type, thickness, and color using the lty, lwd, and col options. Enter ?par to learn more about these line options.

  lines(x[order(x)],y[order(x)],lty=3, lwd=2, col="red")



Graphs to display multiple variables

Interaction plots

Interaction plots are useful for displaying how the mean of a response variable y changes between the levels of two categorical variables, A and B. The graph is especially useful for determining whether an interaction is present between two factors A and B in a factorial experiment, or between a factor A and a blocking variable B. If the lines are parallel (or nearly so), there is little or no sign of an interaction.

  interaction.plot(A, B, y)


The levels of the variable listed first (here, A) will be displayed along the x-axis of the plot. The y-axis will then display the mean of y separately for each category of the second variable, B.  Variations on this command include

  interaction.plot(B,A,y)            # Put B along x-axis instead

  interaction.plot(A,B,y,fun=median) # median of y
  interaction.plot(A,B,y,col=1:length(unique(B)))
                                     # color the lines

  interaction.plot(A,B,y,las=2)      # more room for A's labels
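
The points plotted by interaction.plot are just the cell means, so it can be useful to print them alongside the graph. This sketch uses invented data, and "cellmeans" is a name made up for the example.

```r
# Toy data, invented for illustration only
A <- factor(rep(c("lo", "hi"), each = 4))
B <- factor(rep(c("b1", "b2"), times = 4))
y <- c(1, 2, 3, 4, 5, 8, 7, 10)

# Table of means of y for every combination of A (rows) and B (columns)
cellmeans <- tapply(y, list(A, B), mean)
print(cellmeans)

# One line per category of B, with the categories of A along the x-axis
interaction.plot(A, B, y)
```

In this made-up example the difference between b1 and b2 is larger at "hi" than at "lo", so the lines are not parallel.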


A disadvantage of this plot is that it does not show the data. Use stripchart instead for this purpose.

Overlay scatter plots for multiple groups

To draw a scatter plot between two numeric variables x and y separately for each category of a third variable A

  plot(x,y,pch=as.numeric(factor(A)))


To add a legend that identifies the groups, issue the following command and then click the cursor inside the plot region to place the legend.

  legend(locator(1), as.character(levels(factor(A))),
         pch=1:length(levels(factor(A))) )


Pairwise scatter plots for multiple variables

The following command creates a single graph with scatter plots between all pairs of numeric variables in a data frame, "mydata". The option "gap" adjusts the spacing between separate plots,

  pairs(mydata,gap=0.5)


Use the formula method to plot only the three numeric variables x1, x2, and x3 in the data frame mydata.

  pairs(~x1+x2+x3, data=mydata)
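
If the data frame mixes numeric and categorical columns, one possible approach is to select the numeric columns first so that the others are left out of the plot. The data frame below is invented, and "isnum" is a name made up for this example.

```r
# Toy data frame, invented for illustration only
mydata <- data.frame(
  x1  = c(1.2, 2.3, 3.1, 4.8, 5.0),
  x2  = c(2.0, 1.5, 3.5, 3.0, 4.5),
  grp = c("a", "b", "a", "b", "a"),
  x3  = c(9, 7, 8, 5, 4)
)

isnum <- sapply(mydata, is.numeric)   # TRUE for each numeric column
pairs(mydata[, isnum], gap = 0.5)     # plot only the numeric columns
```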




Trellis graphics

Trellis graphs are displays having multiple panels on the same page, arranged in a grid or trellis. They are a great tool for displaying the same type of graph separately for multiple groups or subsets of the data all on the same scale. Many options are available to control the features of the graphs.

The lattice package in R has a variety of tools for making trellis plots. It is included with the basic installation but you need to load the library,

  library(lattice)

The graph types available in the lattice package include the standard ones found also in R's basic graphics package, such as box plots, histograms, and so on. The table below lists just a few of the basic graph types and the relevant command.

For example, to draw a histogram of a numeric variable "x" separately for four groups all identified by the variable "B" (assume the variables are in the data frame "mydata"), use

  histogram(~x | B, data=mydata, layout=c(1,4))

The "layout" option is special to lattice and draws the 4 panels in a grid with 1 column with 4 rows, so that the histograms are stacked and most easily compared visually.

To draw a bar graph showing the frequency distribution of a categorical variable "A" separately for each group identified in the variable "B",

  barchart( ~table(A) | B, data=mydata)

This produces horizontal bar graphs, which leaves room for the category labels. To draw the bars vertically instead, while tilting the group labels on the x-axis by 45 degrees so that they fit,

  barchart(table(A) ~ names(table(A)) | B, data=mydata,

            scales=list(x=list(rot=45)))
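
If the table-in-formula approach does not behave as expected with your data, an alternative sketch is to tabulate the frequencies first and then plot the resulting counts. This is not this page's own method: the data frame, the name "counts", and the use of xtabs and origin=0 are assumptions made for the example.

```r
library(lattice)

# Toy data frame, invented for illustration only
mydata <- data.frame(
  A = c("a", "b", "a", "c", "b", "a", "c", "c"),
  B = c("g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2")
)

# Tabulate A within each group of B; the result has columns A, B and Freq
counts <- as.data.frame(xtabs(~ A + B, data = mydata))

# Vertical bars, one panel per group of B, with tilted category labels
p <- barchart(Freq ~ A | B, data = counts, origin = 0,
              scales = list(x = list(rot = 45)))
print(p)   # lattice plots need print() when not typed at the console
```

Swapping the two sides of the formula (A ~ Freq | B) draws the bars horizontally instead.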


As a third example, draw a scatter plot to show the relationship between the numeric variables "y" and "x" separately for each group in the variable "B". The "pch" option in this example replaces the default plot symbol with a filled dot, and the "aspect" option sets the relative lengths of the vertical and horizontal axes.

  xyplot(y ~ x | B, data=mydata, pch=16, aspect=0.7)


This doesn't even begin to describe what's possible using the lattice package. Crawley has a few more examples of trellis graphics in The R Book (p 836). To go even further, see Kuhnert and Venables' An Introduction to R: Software for Statistical Modelling & Computing (see the "Get help" section above).

Table showing a few of the commonly used plotting commands in the lattice package. Each command requires a formula describing the relationship between the plotted variables and the variable defining different groups. Formula structures for the basic graph types are shown in the right column. "x" and "y" are numeric variables, whereas "A" is a categorical variable (character or factor). "B" is a factor or character variable that will define the groups or subsets of the data frame to be plotted in separate panels. A separate plot in the graphics window will be made for each of the groups defined by the variable "B". 

  command     graph type                 command formula structure

  barchart    bar graph                  barchart(~table(A) | B, data=mydata)
  bwplot      box plot                   bwplot(x ~ A | B, data=mydata)
  histogram   histogram                  histogram(~x | B, data=mydata)
  stripplot   dot plot or "stripchart"   stripplot(x ~ A | B, data=mydata, jitter=TRUE)
  xyplot      scatterplot                xyplot(y ~ x | B, data=mydata)