Learning R: vapply and tapply
In this lesson, you'll learn how to use vapply() and tapply(), each of which serves a very specific purpose within the Split-Apply-Combine methodology. For consistency, we'll use the same dataset we used in the 'lapply and sapply' lesson.
Some code-chunk outputs are not displayed in this article. Please run these chunks in the swirl package to view the outputs in their correct format. Below I have also attached my full R script.
library(swirl)
Hi! Type swirl() when you are ready to begin.
swirl()
Welcome to swirl! Please sign in. If you've been here before, use the same name as you did then. If you are new, call yourself something unique.
What shall I call you? Lorcán
Thanks, Lorcán. Let's cover a couple of quick housekeeping items before we begin our first lesson. First of all, you should know that when you see '...', that means you should press Enter when you are done reading and ready to continue.
... <-- That's your cue to press Enter to continue
Also, when you see 'ANSWER:', the R prompt (>), or when you are asked to select from a list, that means it's your turn to enter a response, then press Enter to continue.
Select 1, 2, or 3 and press Enter
1: Continue.
2: Proceed.
3: Let's get going!
Selection: 1
You can exit swirl and return to the R prompt (>) at any time by pressing the Esc key. If you are already at the prompt, type bye() to exit and save your progress. When you exit properly, you'll see a short message letting you know you've done so.
When you are at the R prompt (>):
Typing skip() allows you to skip the current question.
Typing play() lets you experiment with R on your own; swirl will ignore what you do... UNTIL you type nxt(), which will regain swirl's attention.
Typing bye() causes swirl to exit. Your progress will be saved.
Typing main() returns you to swirl's main menu.
Typing info() displays these options again.
Let's get started!
...
To begin, you must install a course. I can install a course for you from the internet, or I can send you to a web page (https://github.com/swirldev/swirl_courses) which will provide course options and directions for installing courses yourself. (If you are not connected to the internet, type 0 to exit.)
1: R Programming: The basics of programming in R
2: Regression Models: The basics of regression modeling in R
3: Statistical Inference: The basics of statistical inference in R
4: Exploratory Data Analysis: The basics of exploring data in R
5: Don't install anything for me. I'll do it myself.
Selection: 1
=============================================================================================================================================100%
Course installed successfully!
Please choose a course, or type 0 to exit swirl.
1: R Programming
2: Take me to the swirl course repository!
Selection: 1
Please choose a lesson, or type 0 to return to course menu.
1: Basic Building Blocks; 2: Workspace and Files; 3: Sequences of Numbers
4: Vectors; 5: Missing Values; 6: Subsetting Vectors
7: Matrices and Data Frames; 8: Logic; 9: Functions
10: lapply and sapply; 11: vapply and tapply; 12: Looking at Data
13: Simulation; 14: Dates and Times; 15: Base Graphics
Selection: 11
0%
In this lesson, you'll learn how to use vapply() and tapply(), each of which serves a very specific purpose within the Split-Apply-Combine methodology. For consistency, we'll use the same dataset we used in the 'lapply and sapply' lesson.
...
============ 8%
The Flags dataset from the UCI Machine Learning Repository contains details of various nations and their flags. More information may be found here: http://archive.ics.uci.edu/ml/datasets/Flags
...
================= 12%
I've stored the data in a variable called flags. If it's been a while since you completed the 'lapply and sapply' lesson, you may want to reacquaint yourself with the data by using functions like dim(), head(), str(), and summary() when you return to the prompt (>). You can also type viewinfo() at the prompt to bring up some documentation for the dataset. Let's get started!
...
======================= 16%
As you saw in the last lesson, the unique() function returns a vector of the unique values contained in the object passed to it. Therefore, sapply(flags, unique) returns a list containing one vector of unique values for each column of the flags dataset. Try it again now.
sapply(flags, unique)
The output is a list containing one vector of unique values for each column of the flags dataset (name, landmass, zone, area, population, language, religion, bars, stripes, colours, red, green, blue, gold, white, black, orange, mainhue, circles, crosses, saltires, quarters, sunstars, crescent, triangle, icon, animate, text, topleft, and botright).
Excellent job!
============================= 20%
What if you had forgotten how unique() works and mistakenly thought it returns the number of unique values contained in the object passed to it? Then you might have incorrectly expected sapply(flags, unique) to return a numeric vector, since each element of the list returned would contain a single number and sapply() could then simplify the result to a vector.
...
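To make the difference concrete, here is a minimal sketch on a small made-up data frame. The name `toy` and its values are hypothetical stand-ins, not the real flags data. It contrasts what unique() actually returns with the count you might have mistakenly expected.

```r
# Hypothetical toy data standing in for a few columns of the flags dataset
toy <- data.frame(
  landmass = c(1, 1, 2, 3, 3),
  red      = c(1, 0, 1, 1, 0),
  mainhue  = c("red", "blue", "red", "green", "blue")
)

# unique() returns the unique values themselves. The columns have different
# numbers of unique values, so sapply() cannot simplify and returns a list.
uniq_values <- sapply(toy, unique)
is.list(uniq_values)

# Counting the unique values yields one number per column,
# which sapply() simplifies to a named vector.
uniq_counts <- sapply(toy, function(col) length(unique(col)))
uniq_counts
```

Only the second call produces the vector described in the mistaken expectation above.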
=================================== 24%
When working interactively (at the prompt), this is not much of a problem, since you see the result immediately and will quickly recognize your mistake. However, when working non-interactively (e.g. writing your own functions), a misunderstanding may go undetected and cause incorrect results later on. Therefore, you may wish to be more careful and that's where vapply() is useful.
...
========================================= 28%
Whereas sapply() tries to 'guess' the correct format of the result, vapply() allows you to specify it explicitly. If the result doesn't match the format you specify, vapply() will throw an error, causing the operation to stop. This can prevent significant problems in your code that might be caused by getting unexpected return values from sapply().
...
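The behaviour described above can be sketched outside swirl with a small made-up data frame (the name `toy` and its values are hypothetical). The tryCatch() wrapper is only there so the example runs to completion instead of halting at the error.

```r
# Hypothetical toy data, not the real flags dataset
toy <- data.frame(
  landmass = c(1, 1, 2, 3),
  mainhue  = c("red", "blue", "red", "green")
)

# Each column has more than one unique value, so the numeric(1) template
# (one number per column) does not match and vapply() raises an error.
result <- tryCatch(
  vapply(toy, unique, numeric(1)),
  error = function(e) "vapply stopped with an error"
)
result

# When the function really does return a single integer per column,
# the integer(1) template matches and vapply() succeeds.
counts <- vapply(toy, function(col) length(unique(col)), integer(1))
counts
```

In non-interactive code, the error from the first call is the point: the mismatch surfaces immediately rather than propagating a wrongly shaped result.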
============================================== 32%
Try vapply(flags, unique, numeric(1)), which says that you expect each element of the result to be a numeric vector of length 1. Since this is NOT actually the case, YOU WILL GET AN ERROR. Once you get the error, type ok() to continue to the next question.
ok()
That's correct!
==================================================== 36%
Recall from the previous lesson that sapply(flags, class) will return a character vector containing the class of each column in the dataset. Try that again now to see the result.
sapply(flags, class)
The output is a named character vector containing the class of each column in the dataset.
All that practice is paying off!
========================================================== 40%
If we wish to be explicit about the format of the result we expect, we can use vapply(flags, class, character(1)). The character(1) argument tells R that we expect the class function to return a character vector of length 1 when applied to EACH column of the flags dataset. Try it now.
vapply(flags, class, character(1))
That's correct!
================================================================ 44%
Note that since our expectation was correct (i.e. character(1)), the vapply() result is identical to the sapply() result – a character vector of column classes.
...
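As a quick check of that claim, the sketch below compares the two results with identical() on a made-up data frame (the name `toy` and its contents are hypothetical, not the flags data).

```r
# Hypothetical toy data with one character column and one numeric column
toy <- data.frame(
  name = c("A", "B"),
  area = c(10, 20),
  stringsAsFactors = FALSE
)

by_sapply <- sapply(toy, class)                 # R guesses the output format
by_vapply <- vapply(toy, class, character(1))   # format stated explicitly

# Both are named character vectors with the same contents
same_result <- identical(by_sapply, by_vapply)
same_result
```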
====================================================================== 48%
You might think of vapply() as being 'safer' than sapply(), since it requires you to specify the format of the output in advance, instead of just allowing R to 'guess' what you wanted. In addition, vapply() may perform faster than sapply() for large datasets. However, when doing data analysis interactively (at the prompt), sapply() saves you some typing and will often be good enough.
...
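One situation where the 'safer' label tends to matter is empty input, which is easy to hit inside your own functions. This case is not part of the swirl lesson; it is offered here only as an illustration of sapply() guessing differently from what you might expect.

```r
# An empty list, e.g. the result of a filter that matched nothing
nothing <- list()

# sapply() has no results to inspect, so it falls back to an empty list
empty_sapply <- sapply(nothing, length)
class(empty_sapply)

# vapply() still returns the type named in its template: an empty integer vector
empty_vapply <- vapply(nothing, length, integer(1))
class(empty_vapply)
```

Code that later expects an integer vector keeps working with the vapply() result, whereas the list returned by sapply() may cause surprises downstream.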
=========================================================================== 52%
As a data analyst, you'll often wish to split your data up into groups based on the value of some variable, then apply a function to the members of each group. The next function we'll look at, tapply(), does exactly that.
...
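To show what 'split, then apply' means in practice, here is a minimal sketch with made-up vectors (`values` and `groups` are hypothetical names). It compares tapply() with doing the split and apply steps by hand using split() and sapply().

```r
# Hypothetical measurements and the group each one belongs to
values <- c(2, 10, 4, 20, 6)
groups <- c("a", "b", "a", "b", "a")

# One step: split values by group and apply mean() to each piece
by_tapply <- tapply(values, groups, mean)
by_tapply

# The same idea spelled out: split first, then apply and combine
pieces  <- split(values, groups)   # list with elements "a" and "b"
by_hand <- sapply(pieces, mean)
by_hand
```

Group "a" holds 2, 4, and 6 (mean 4) and group "b" holds 10 and 20 (mean 15), so both approaches report the same group means.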
================================================================================= 56%
Use ?tapply to pull up the documentation.
?tapply
That's correct!
======================================================================================= 60%
The 'landmass' variable in our dataset takes on integer values between 1 and 6, each of which represents a different part of the world. Use table(flags$landmass) to see how many flags/countries fall into each group.
table(flags$landmass)
The output is a one-row table with six columns, one per landmass code (1 to 6), giving the number of flags/countries in each landmass.
You got it right!
============================================================================================= 64%
The 'animate' variable in our dataset takes the value 1 if a country's flag contains an animate image (e.g. an eagle, a tree, a human hand) and 0 otherwise. Use table(flags$animate) to see how many flags contain an animate image.
table(flags$animate)
The output is a one-row table with two columns (0: does not contain an animate image; 1: contains an animate image), giving the number of flags/countries in each category.
Perseverance, that's the answer.
=================================================================================================== 68%
This tells us that 39 flags contain an animate object (animate = 1) and 155 do not (animate = 0).
...
======================================================================================================== 72%
If you take the arithmetic mean of a bunch of 0s and 1s, you get the proportion of 1s. Use tapply(flags$animate, flags$landmass, mean) to apply the mean function to the 'animate' variable separately for each of the six landmass groups, thus giving us the proportion of flags containing an animate image WITHIN each landmass group.
tapply(flags$animate, flags$landmass, mean)
You nailed it! Good job!
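The 'mean of 0s and 1s is a proportion' idea can be checked on a tiny made-up example. The vectors below are hypothetical and only borrow the column names from the lesson; they are not the flags data.

```r
# Hypothetical indicator (1 = animate image present) and group labels
animate  <- c(1, 0, 1, 1, 1, 0)
landmass <- c(1, 1, 2, 2, 2, 2)

# Overall: the mean equals the share of 1s (4 out of 6)
overall <- mean(animate)

# Per group: landmass 1 has one 1 out of 2, landmass 2 has three 1s out of 4
prop <- tapply(animate, landmass, mean)
prop

# Label of the group with the highest proportion
top_group <- names(prop)[which.max(prop)]
top_group
```

The last step mirrors how you would read off the landmass with the highest proportion from the real output.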
============================================================================================================== 76%
The first landmass group (landmass = 1) corresponds to North America and contains the highest proportion of flags with an animate image (0.4194).
...
==================================================================================================================== 80%
Similarly, we can look at a 'summary' of population values (in round millions) for countries with and without the color red on their flag with tapply(flags$population, flags$red, summary).
tapply(flags$population, flags$red, summary)
The output gives the Minimum, 1st Quartile, Median, Mean, 3rd Quartile, and Maximum of population for each group (red = 0 and red = 1).
Great job!
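When the applied function returns several values, as summary() does, tapply() keeps one result per group rather than a single number. The sketch below uses made-up vectors (hypothetical values that only reuse the column names) to show one way of pulling a single statistic out of such a result.

```r
# Hypothetical populations (in millions) and a red/no-red indicator
population <- c(5, 10, 15, 100, 200)
red        <- c(0, 0, 0, 1, 1)

# One six-number summary per group, stored in a list-like array
pop_summary <- tapply(population, red, summary)
pop_summary

# Extract the summary for the red = 0 group, then its median
median_no_red <- pop_summary[["0"]][["Median"]]
median_no_red

# If only the median is needed, applying median() directly is simpler
medians <- tapply(population, red, median)
medians
```

Here the red = 0 group holds 5, 10, and 15, so its median is 10 by either route.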
========================================================================================================================== 84%
What is the median population (in millions) for countries without the color red on their flag?
1: 9.0
2: 27.6
3: 0.0
4: 4.0
5: 22.1
6: 3.0
Selection: 6
You're the best!
================================================================================================================================ 88%
Lastly, use the same approach to look at a summary of population values for each of the six landmasses.
tapply(flags$population, flags$landmass, summary)
Perseverance, that's the answer.
===================================================================================================================================== 92%
What is the maximum population (in millions) for the fourth landmass group (Africa)?
1: 1010.0
2: 157.00
3: 119.0
4: 56.00
5: 5.00
Selection: 4
Great job!
=========================================================================================================================================== 96%
In this lesson, you learned how to use vapply() as a safer alternative to sapply(), which is most helpful when writing your own functions. You also learned how to use tapply() to split your data into groups based on the value of some variable, then apply a function to each group. These functions will come in handy on your quest to become a better data analyst.
...
=================================================================================================================================================100%
Would you like to receive credit for completing this course on Coursera.org?
1: Yes
2: No
Selection: 2
That's a job well done!
You've reached the end of this lesson! Returning to the main menu...
Reference
Kross S, Carchedi N, Bauer B, Grdina G (2024). swirl: Learn R, in R.
R package version 2.4.5, commit 82b50ced7149796fd9a78f5112f137f298930b1a,
https://github.com/swirldev/swirl.