The goal of this first workshop is to get you started working in R, and to introduce the most commonly-used data types, operations, and functions. Help is on the R tips page. The most useful pages for this exercise are Calculate, Data, and Graphs & Tables for producing frequency tables.
The command line in the R console is where you interact with R.
At its most basic, the command line is a bloated calculator. The basic operations are
+ - * /
for add, subtract, multiply, divide. Familiar calculator functions also work on the command line. For example, to take the natural log of 2, enter the following.
log(2)
## [1] 0.6931472
Try the calculator out to get a feel for this basic application and the style of the output. Try log10() for log base 10, and sqrt() and abs() for square root and absolute value.
R allows you to store or assign numbers and character strings to named variables called vectors, which are a type of “object” in R. For example, to assign the number “3” to a variable x, use
x <- 3
The assign symbol is a “less than” symbol followed by a dash, <-, with no space between. Try assigning a single number to a named variable.
R can also assign character strings (enter using double quotes) to named variables. Try entering
z <- "Wake up Neo"   # single or double quotes needed
print(z)
At any time, enter ls() to see the names of all the objects in the working R environment.
Assign a single number to the variable x and another number to the variable y. Then watch what happens when you type an operation, such as multiplication:
x * y
Finally, you can also store the result in a third variable.
z <- x * y
To print the contents of z, just enter the name on the command line, or enter print(z).
The calculator will also give a TRUE or FALSE response to a logical operation. Try one or more variations of the following examples on the command line to see the outcome.
2 + 2 == 4     # Note double "==" for logical "is equal to"
3 <= 2         # less than or equal to
"A" > "a"      # greater than
"Hi" != "hi"   # not equal to (i.e., R is case sensitive)
Vectors in R are used to represent variables. R can assign sets of numbers or character strings to named variables using the c() command, for concatenate. (R treats a single number or character string as a vector too, having just one element.)
x <- c(1, 2, 333, 65, 45, -88)
Assign a set of 10 numbers to a variable x. Make sure it includes some positive and some negative numbers. To see the contents afterward, enter x on the command line, or enter print(x). Is it really a vector? Enter is.vector(x) to confirm.
Use integers in square brackets to indicate subsets of vector x.
x[5] # fifth element
Try this out. See also what happens when you enter vectors of indices,
x[1:3]         # 1:3 is a shortcut for c(1,2,3)
x[c(2, 4, 9)]  # need c() if more than one index
Print the 3rd and 6th elements of x with a single command.
Some functions of vectors yield integer results and so can be used as indices too. For example, enter the function
length(x)
Since the result is an integer, it is ok to use as follows
x[length(x)]
The beauty of this construction is that it will always give the last element of a vector x no matter how many elements x contains.
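As a small sketch of the same idea, the lines below apply the construction to two made-up vectors of different lengths. The names short_vec and long_vec and their values are hypothetical, chosen only for illustration.

```r
# Hypothetical vectors of different lengths (not workshop data)
short_vec <- c(4, 8, 15)
long_vec  <- c(16, 23, 42, 7, -1)

# length() returns the number of elements, so it indexes the final one
last_short <- short_vec[length(short_vec)]
last_long  <- long_vec[length(long_vec)]

print(last_short)  # 15
print(last_long)   # -1
```

The same command works unchanged on both vectors, which is the point of indexing with length().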
Logical operations can also be used to generate indices. First, enter the following command and compare with the contents of x,
x < 0
Now enter
x[x < 0]
Try this yourself: print all elements of x that are non-negative. The which() command will identify the elements corresponding to TRUE. For example, try the following and compare with your vector x.
which(x < 0)
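To see how the three pieces fit together, here is a minimal sketch on a made-up vector. The names v and is_big and the threshold of 5 are hypothetical; the comparison is deliberately different from the one in the exercise.

```r
v <- c(12, -3, 7, 0, -8)   # hypothetical vector

is_big <- v > 5            # logical vector: TRUE FALSE TRUE FALSE FALSE
kept   <- v[is_big]        # keeps only the elements where is_big is TRUE
where  <- which(is_big)    # positions of the TRUE values

print(is_big)
print(kept)    # 12 7
print(where)   # 1 3
```

A logical vector inside the square brackets selects values, whereas which() reports their positions.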
Indices can be used to change individual elements of the vector x. For example, to change the fifth element of x to 0,
x[5] <- 0
Try this yourself. Change the last value of your x vector to a different number. Change the 2nd, 6th, and 10th values of x to three new numbers with a single command.
Missing values in R are indicated by NA. Try changing the 2nd value of x to a missing value. Print x to see the result.
R can be used as a calculator for whole vectors of numbers too. To see this, create a second numerical vector y of the same length as x. Now try out a few ordinary mathematical operations on the whole vectors of numbers,
z <- x * y
print(z)
z <- y - 2 * x
print(z)
Examine the results to see how R behaves. It executes the operation on the first elements of x and y, then on the corresponding second elements, and so on. Each result is stored in the corresponding element of z. Logical operations are the same,
z <- x >= y               # greater than or equal to
print(z)
z <- x[abs(x) < abs(y)]   # absolute values
print(z)
What does R do if the two vectors are not the same length? The answer is that the elements in the shorter vector are “recycled”, starting from the beginning. This is basically what R does when you multiply a vector by a single number. The single number is recycled, and so is applied to each of the elements of x in turn. Beyond this simple case recycling is hard to control, however, so it is best to operate on vectors of the same length.
z <- 2 * x
print(z)
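Recycling is easier to see with two vectors of unequal length. The sketch below uses made-up vectors (long_vec and short_vec are hypothetical names) whose lengths are 6 and 2.

```r
long_vec  <- c(1, 2, 3, 4, 5, 6)   # hypothetical values
short_vec <- c(10, 100)

# short_vec is reused as 10, 100, 10, 100, 10, 100
recycled <- long_vec * short_vec
print(recycled)   # 10 200 30 400 50 600

# A single number is a vector of length 1, recycled for every element
doubled <- 2 * long_vec
print(doubled)    # 2 4 6 8 10 12
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is one more reason to keep vectors the same length.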
Use an operation to determine which elements of x are smaller than the corresponding element in y. Optional: Can you make a new vector that contains only the smaller of the two numbers in x and y, element by element?
Make a data frame called mydata from the two vectors, x and y. Consult the R tips Calculate and Data sets pages for help on how to do this. Print mydata on the screen to view the result. If all looks good, delete the vectors x and y from the R environment. They are now stored only in the data frame. Type names(mydata) to see the names of the stored variables.
Vector functions applied to data frames may give unexpected results – data frames are not vectors. For example, length(mydata) won’t give you the same answer as length(x) or length(y). But you can still access each of the original vectors using mydata$x and mydata$y. Try printing one of them. All the usual vector functions and operations can be used on the variables in the data frame. We’ll do more with data frames below.
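The difference between a data frame and its vectors can be sketched with a small made-up example. The names a, b and df and their values are hypothetical, and data.frame() is assumed here as the way to combine the vectors.

```r
# Hypothetical vectors standing in for x and y
a <- c(1.5, 2.0, 3.5)
b <- c(10, 20, 30)
df <- data.frame(a, b)   # columns take the names of the vectors

n_columns <- length(df)     # number of variables (columns), here 2
n_rows    <- nrow(df)       # number of rows, here 3
n_values  <- length(df$a)   # length of one stored vector, here 3
mean_b    <- mean(df$b)     # ordinary vector functions work on a column

print(c(n_columns, n_rows, n_values, mean_b))
```

Applied to a data frame, length() counts variables rather than observations, which is why it disagrees with the length of the original vectors.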
Photo by J. Socha, https://www.sciencenews.org/article/how-flying-snakes-stay-aloft
Paradise tree snakes (Chrysopelea paradisi) leap into the air from trees, and by generating lift they glide downward and away rather than plummet. An airborne snake flattens its body everywhere except for the heart region. It forms a horizontal “S” shape and undulates from side to side. By orienting the head and anterior part of the body, a snake can change direction, reach a preferred landing site, and even chase aerial prey. To better understand lift and stability of glides, Socha (2002, Nature 418: 603-604) videotaped eight snakes leaping from a 10-m tower (video here). One measurement taken was the rate of side-to-side undulation. Undulation rates of the eight snakes, measured in Hertz (cycles per second), were as follows:
0.9 1.4 1.2 1.2 1.3 2.0 1.4 1.6
We’ll store these data in a vector (variable) and try out some useful vector functions in R (review the common vector functions section on the Calculate page of the R tips web pages).
Put the glide undulation data above into a named vector. Afterward, check the number of observations stored in the vector.
Apply the hist() command to the vector and observe the result (a histogram). Examine the histogram and you will see that it counts two observations between 1.0 and 1.2. Are there any measurements in the data between these two numbers? What is going on? The default in R is to use right-closed, left-open intervals. To change to left-closed right-open, modify an option in the hist() command as follows,
hist(myvector, right = FALSE)
We’ll be doing more on graphs next week.
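The effect of the right option can also be checked without drawing anything, by asking hist() for the bin counts. This is only a sketch: the break points in my_breaks are an assumption (bins of width 0.2 starting at 0.8), and the breaks R chooses by default may differ.

```r
# Undulation rates (Hz) listed above
myvector <- c(0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6)

# Assumed break points of width 0.2; not necessarily R's default choice
my_breaks <- c(0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0)

# plot = FALSE returns the histogram object instead of drawing it
counts_right <- hist(myvector, breaks = my_breaks, plot = FALSE)$counts
counts_left  <- hist(myvector, breaks = my_breaks, right = FALSE,
                     plot = FALSE)$counts

print(counts_right)  # right-closed: the two 1.2 values fall in (1.0, 1.2]
print(counts_left)   # left-closed: the two 1.2 values fall in [1.2, 1.4)
```

With these breaks the second bin holds two observations under the default and none when right = FALSE, because the values equal to 1.2 move to the next bin.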
Hertz units measure undulations in cycles per second. The standard international unit of angular velocity, however, is radians per second. 1 Hertz is 2π radians per second. Transform the snake data so that it is in units of radians per second (note: pi is a programmed constant in R).
Using the transformed data hereafter, calculate the sample mean undulation rate WITHOUT using the function mean() (i.e., use other functions instead)*.
Ok, try the function mean() and compare your answer.
Calculate the sample standard deviation in undulation rate WITHOUT using the functions sd() or var().
Then calculate using sd() to compare your answer**.
Sort the observations using the sort() command.
Calculate the median undulation rate. When there is an even number of observations (as in the present case), the population median is most simply estimated as the average of the two middle measurements in the sample.
Calculate the standard error of the mean undulation rate. Remember, the standard error of the mean is calculated as the standard deviation of the data divided by the square root of sample size.
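The standard error formula described above can be sketched on a small made-up vector. The name v and its values are hypothetical (they are not the snake data), and the vector is assumed to contain no missing values.

```r
v <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical measurements, no NA values

n  <- length(v)         # sample size
se <- sd(v) / sqrt(n)   # standard deviation divided by square root of n

print(se)
```

Note that length() counts every element of the vector, so this version is only appropriate when nothing is missing.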
* 8.63938.
** 2.035985
Missing data in R are indicated by NA. Many functions for vectors, such as sum() and mean(), will return a value of NA if the data vector you used contained at least one missing value. Overcoming this usually involves modifying a function option to instruct R to ignore the offending points before doing the calculation. See the Calculate page of the R tips pages for help on how to do this.
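As a minimal sketch of this behaviour, the lines below use a made-up vector with one missing value (the name v is hypothetical). The na.rm argument of sum() and mean() is the option that tells R to drop missing values first.

```r
v <- c(3, 8, NA, 5)   # hypothetical vector with one missing value

sum_default  <- sum(v)                  # NA, because one element is missing
mean_default <- mean(v)                 # NA for the same reason
sum_clean    <- sum(v, na.rm = TRUE)    # 16, missing value dropped first
mean_clean   <- mean(v, na.rm = TRUE)   # mean of the three remaining values

print(c(sum_default, mean_default, sum_clean, mean_clean))
```

Not every function takes na.rm, so it is worth checking the help page of each function you use.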
Use the c() function to add a single new measurement to the snake vector created in the previous section (i.e., increase its length by one) but have the new observation be missing, as though the undulation rate measurement on a 9th snake was lost.
Check the length of this revised vector, according to R. The length should be 9, even though one of the elements of the vector is NA.
What is the sample mean of the measurements in the new vector, according to R? Use a method that does not involve you directly removing the offending point from the vector.
Recalculate the standard error of the mean, again leaving in the missing value in the vector. Did you get the same answer as in the previous section? If not, what do you think went wrong? Take great care when there are missing values.
Calculate the standard error properly, using the vector of 9 elements (i.e., the vector containing one NA). Use a method that will work on any vector containing missing values.
Here we will read data on several variables from a comma-delimited (.csv) text file into a data frame, which is the usual way to bring data into R. The data are all the known species of Anolis lizards on Caribbean islands, the named clades to which they belong, and the islands on which they occur. A subset of the species is also classified into “ecomorph” clusters according to their morphology and perching habitat. Each ecomorph is a phylogenetically heterogeneous group of species having high ecological and morphological similarity. The list was compiled by Jonathan Losos from various sources and is provided in the Afterword of his wonderful book (Losos 2009. Lizards in an evolutionary tree. University of California Press).
Download the file anolis.csv (click file name to initiate download) and save in a convenient place.
Open a new script file to write and submit your commands (or cut and paste to the command window) for the remainder of this section.
Read the data from the file into a data frame (e.g., call it mydata) using the read.csv() command. See the R tips Data tab for further help on this step. For this first attempt, include no additional arguments or options for the read.csv() command other than the file name so we can explore R’s behavior.
Use the str() command to obtain a compact summary of the contents of the data frame. Every variable shown should be a character variable, which is the default for character data. Some functions will convert character data to factors, which are a special kind of character variable whose categories also have a numerical interpretation (useful when a specific order of categories is desired in a table or plot).
Another useful command is class(), which will tell you what data type your object is. Try it out on a vector in your data frame, and again on the whole data frame.
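The distinction between character data and factors can be sketched with a made-up example. The names size, df and size_f and the three categories are hypothetical; stringsAsFactors = FALSE is written out so the result does not depend on the R version.

```r
# Hypothetical character vector and data frame
size <- c("small", "large", "medium", "small")
df <- data.frame(size, stringsAsFactors = FALSE)

class_column <- class(df$size)   # "character"
class_frame  <- class(df)        # "data.frame"

# A factor stores the categories as integer codes in a chosen order
size_f <- factor(size, levels = c("small", "medium", "large"))
codes  <- as.integer(size_f)     # 1 3 2 1

print(class_column)
print(class_frame)
print(levels(size_f))
print(codes)
```

The order given in levels is the order that tables and plots of the factor will follow.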
Use the head() command to inspect the variable names and the first few rows of the data frame. Every variable in this data set contains character strings. (Why, yes, there’s a tail() command too!)
Let’s focus on the variable Ecomorph, since it has a manageable number of categories. Use the table() function on the Ecomorph vector to see the frequency (number of rows, namely number of species) belonging to each named group.
Notice anything odd about the table? One ecomorph category is blank and has 47 species (rows). Another ecomorph, the Trunk-Crown, seems to be present twice with different numbers of species. Same for Trunk-ground. Do you notice the cause of the problem?
The answer is that one species belongs to the “Trunk-Crown ” (trailing space) category rather than to the “Trunk-Crown” (no spaces) category, caused by a typo in the data file. Use the which() command to identify the row with the typo.
Using assignment (<-), fix the single typo. Use the table() function afterward to check the effect of your change. Note that other Ecomorph categories still have the same problem.
Re-read the data from the file into R. This time, use options of the read.csv() function in base R to strip leading and trailing spaces from character string entries and treat empty fields as missing rather than as words with no letters.
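One possible set of options is sketched below on a tiny made-up data set rather than the real anolis.csv. The contents of csv_lines are hypothetical, and the text argument is used only so the example runs without a file. For your own file you would give the file name instead.

```r
# Hypothetical miniature file contents; not the real anolis.csv
csv_lines <- c("Species,Ecomorph",
               "sp1,Trunk-Crown",
               "sp2,Trunk-Crown ",   # trailing space typo
               "sp3,",               # empty field
               "sp4,Twig")

# Default read: the typo and the blank each become their own category
raw <- read.csv(text = csv_lines)
print(table(raw$Ecomorph))

# strip.white trims spaces; na.strings lists the entries to treat as NA
clean <- read.csv(text = csv_lines, strip.white = TRUE,
                  na.strings = c("NA", ""))
print(table(clean$Ecomorph))
```

In the cleaned version the two spellings of Trunk-Crown are merged and the empty field is stored as NA, so it no longer appears as a blank category.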
How will you ever remember such a list of options in future, when it comes time to read your own data into R? The answer is: you don’t have to. I couldn’t possibly remember it myself. If you keep a script file when you analyze the data you can always go back and consult it, and copy it the next time you need it. Type ?read.csv in the command line at any time to get a complete list of all the read options and their effects. Try it now. Those options can be handy when you need them.
Use table() once more to tally up the numbers of species in each Ecomorph category. Is there an improvement from the previous attempts? Which is the commonest Ecomorph and which is the rarest?
What happened to the missing values? Use table() but include the useNA = "ifany" option in your command to include a count of NA values in your output table (see the R tips Display page on frequency tables to see examples of the use of this argument). In this data set, NA refers to lizard species that do not belong to a standard ecomorph category, so it is worthwhile to include them in the table. Perhaps they should be given their own named group (“none”), which is less ambiguous than NA.
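The effect of useNA, and one way to rename the missing group, can be sketched on a made-up vector. The name eco and its values are hypothetical and are not taken from the Anolis data.

```r
# Hypothetical vector of categories with two missing values
eco <- c("Twig", NA, "Trunk", "Twig", NA)

counts_default <- table(eco)                    # NA values are left out
counts_with_na <- table(eco, useNA = "ifany")   # adds a count for NA
print(counts_default)
print(counts_with_na)

# One way to give the missing group its own name
eco[is.na(eco)] <- "none"
counts_named <- table(eco)
print(counts_named)
```

Replacing NA with a label changes the data, so it is safest to do this on a copy of the variable rather than on the original.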
How many Anolis species inhabit Jamaica exclusively?*
What is the total number of Anolis species on Cuba?** This is not the same as the number occurring exclusively on Cuba — a few species live there and also on other islands. Figure out a way in R to count the number of species that occur on Cuba. Bonus points for the briefest command! [Hint: check the vector functions for character data on the R tips Calculate page.]
What is the tally of species belonging to each ecomorph on the four largest Caribbean islands: Jamaica, Hispaniola, Puerto Rico and Cuba?
What is the most frequent ecomorph for species that do not occur on the four largest islands?***
* 6
** 63
*** Trunk-crown
© 2009-2025 Dolph Schluter