Learning R: Looking at Data

Whenever you're working with a new dataset, the first thing you should do is look at it! This lesson covers how to check a dataset's format, dimensions, variable names, storage types, and missing values using R's built-in functions.

Jun 20, 2024

Some code chunk outputs are not displayed in this article. Please run the chunks yourself in the swirl package to view the outputs in their correct format. Below I have also attached my full R script.

library(swirl)

Hi! Type swirl() when you are ready to begin.

swirl()

Welcome to swirl! Please sign in. If you've been here before, use the same name as you did then. If you are new, call yourself something unique.

What shall I call you? Lorcán

Thanks, Lorcán. Let's cover a couple of quick housekeeping items before we begin our first lesson. First of all, you should know that when you see '...', that means you should press Enter when you are done reading and ready to continue.

... <-- That's your cue to press Enter to continue

Also, when you see 'ANSWER:', the R prompt (>), or when you are asked to select from a list, that means it's your turn to enter a response, then press Enter to continue.

Select 1, 2, or 3 and press Enter

1: Continue.
2: Proceed.
3: Let's get going!

Selection: 1

You can exit swirl and return to the R prompt (>) at any time by pressing the Esc key. If you are already at the prompt, type bye() to exit and save your progress. When you exit properly, you'll see a short message letting you know you've done so.

When you are at the R prompt (>):

Typing skip() allows you to skip the current question.
Typing play() lets you experiment with R on your own; swirl will ignore what you do...
UNTIL you type nxt() which will regain swirl's attention.
Typing bye() causes swirl to exit. Your progress will be saved.
Typing main() returns you to swirl's main menu.
Typing info() displays these options again.

Let's get started!

...

To begin, you must install a course. I can install a course for you from the internet, or I can send you to a web page (https://github.com/swirldev/swirl_courses) which will provide course options and directions for installing courses yourself. (If you are not connected to the internet, type 0 to exit.)

1: R Programming: The basics of programming in R
2: Regression Models: The basics of regression modeling in R
3: Statistical Inference: The basics of statistical inference in R
4: Exploratory Data Analysis: The basics of exploring data in R
5: Don't install anything for me. I'll do it myself.

Selection: 1
=============================================================================================================================================100%
Course installed successfully!

Please choose a course, or type 0 to exit swirl.

1: R Programming
2: Take me to the swirl course repository!

Selection: 1

Please choose a lesson, or type 0 to return to course menu.

1: Basic Building Blocks; 2: Workspace and Files; 3: Sequences of Numbers
4: Vectors; 5: Missing Values; 6: Subsetting Vectors
7: Matrices and Data Frames; 8: Logic; 9: Functions
10: lapply and sapply; 11: vapply and tapply; 12: Looking at Data
13: Simulation; 14: Dates and Times; 15: Base Graphics

Selection: 12

Whenever you're working with a new dataset, the first thing you should do is look at it! What is the format of the data? What are the dimensions? What are the variable names? How are the variables stored? Are there missing data? Are there any flaws in the data?

...

=== 4%
This lesson will teach you how to answer these questions and more using R's built-in functions. We'll be using a dataset constructed from the United States Department of Agriculture's PLANTS Database (http://plants.usda.gov/adv_search.html).

...

====== 8%
I've stored the data for you in a variable called plants. Type ls() to list the variables in your workspace, among which should be plants.

ls()

"plants"

Keep up the great work!

======== 12%
Let's begin by checking the class of the plants variable with class(plants). This will give us a clue as to the overall structure of the data.

class(plants)

"data.frame"

You are quite good my friend!

=========== 16%
It's very common for data to be stored in a data frame. It is the default class for data read into R using functions like read.csv() and read.table(), which you'll learn about in another lesson.

...
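
As a side note, here is a minimal sketch of how read.csv() produces a data frame. The CSV text and the name toy are made up for illustration; they are not part of the swirl lesson or the PLANTS data.

```r
# Hypothetical three-row CSV, supplied inline through the `text` argument
csv_text <- "Scientific_Name,Duration,pH_Min
Abelmoschus,Annual,5.5
Abies balsamea,Perennial,4.0
Abronia,,6.1"

# na.strings = "" tells read.csv() to treat empty fields as missing values
toy <- read.csv(text = csv_text, na.strings = "")

class(toy)         # "data.frame"
class(toy$pH_Min)  # "numeric"
is.na(toy$Duration)  # FALSE FALSE TRUE
```

Whether a text column such as Duration comes back as character or factor depends on the stringsAsFactors setting, which defaults to FALSE from R 4.0.0 onward.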

============== 20%
Since the dataset is stored in a data frame, we know it is rectangular. In other words, it has two dimensions (rows and columns) and fits neatly into a table or spreadsheet. Use dim(plants) to see exactly how many rows and columns we're dealing with.

dim(plants)

5166 10

Keep working like that and you'll get there!

================= 24%
The first number you see (5166) is the number of rows (observations) and the second number (10) is the number of columns (variables).

...

==================== 28%
You can also use nrow(plants) to see only the number of rows. Try it out.

nrow(plants)

5166

You're the best!

====================== 32%
... And ncol(plants) to see only the number of columns.

ncol(plants)

10

All that hard work is paying off!
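
The relationship between dim(), nrow(), and ncol() can be sketched on a small data frame. The toy data below is hypothetical and only stands in for plants.

```r
# Hypothetical data frame with 4 rows and 3 columns
toy <- data.frame(
  name   = c("a", "b", "c", "d"),
  ph_min = c(5.5, 4.0, 6.1, NA),
  ph_max = c(7.0, 6.0, 7.5, NA)
)

dim(toy)   # 4 3  (rows first, then columns)
nrow(toy)  # 4, the same as dim(toy)[1]
ncol(toy)  # 3, the same as dim(toy)[2]
```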

========================= 36%
If you are curious as to how much space the dataset is occupying in memory, you can use object.size(plants).

object.size(plants)

745944 bytes

You are quite good my friend!
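
Raw byte counts are hard to read for larger objects. As a small sketch, the units argument of print() and format() for object.size() results gives friendlier output. The vector x is a made-up example; exact sizes can vary slightly by platform, so no output is shown.

```r
# Hypothetical object: a numeric vector of 100,000 zeros
x <- numeric(1e5)
size <- object.size(x)

print(size)                   # size in bytes
print(size, units = "Kb")     # the same size in kilobytes
format(size, units = "auto")  # picks a readable unit automatically
```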

============================ 40%
Now that we have a sense of the shape and size of the dataset, let's get a feel for what's inside. names(plants) will return a character vector of column (i.e. variable) names. Give it a shot.

names(plants)

"Scientific_Name" "Duration" "Active_Growth_Period" "Foliage_Color" "pH_Min" "pH_Max" "Precip_Min" "Precip_Max" "Shade_Tolerance" "Temp_Min_F"

You're the best!
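
Because names() returns an ordinary character vector, it can be used to check that the columns you expect are present. A minimal sketch with a hypothetical two-column data frame:

```r
# Hypothetical data frame that lacks a pH_Max column
toy <- data.frame(
  Scientific_Name = c("Abies", "Acer"),
  pH_Min = c(4.0, 5.0)
)

names(toy)  # "Scientific_Name" "pH_Min"

# Columns we expect but that are not in the data
expected <- c("Scientific_Name", "pH_Min", "pH_Max")
setdiff(expected, names(toy))  # "pH_Max"
```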

=============================== 44%
We've applied fairly descriptive variable names to this dataset, but that won't always be the case. A logical next step is to peek at the actual data. However, our dataset contains over 5000 observations (rows), so it's impractical to view the whole thing all at once.

...

================================== 48%
The head() function allows you to preview the top of the dataset. Give it a try with only one argument.

head(plants)

The output displays the first 6 rows of the dataset, across all 10 columns.

Excellent work!

==================================== 52%
Take a minute to look through and understand the output above. Each row is labeled with the observation number and each column with the variable name. Your screen is probably not wide enough to view all 10 columns side-by-side, in which case R displays as many columns as it can on each line before continuing on the next.

...

======================================= 56%
By default, head() shows you the first six rows of the data. You can alter this behavior by passing as a second argument the number of rows you'd like to view. Use head() to preview the first 10 rows of plants.

head(plants, 10)

The output displays the first 10 rows of the dataset, across all 10 columns.

Your dedication is inspiring!

========================================== 60%
The same applies for using tail() to preview the end of the dataset. Use tail() to view the last 15 rows.

tail(plants, 15)

The output displays the last 15 rows of the dataset, across all 10 columns.

All that practice is paying off!
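
The second argument of head() and tail() can be sketched on a small example. The data frame toy is hypothetical. A negative count, which the lesson does not cover, means "everything except that many rows".

```r
# Hypothetical data frame with 20 rows
toy <- data.frame(id = 1:20, value = (1:20)^2)

head(toy, 3)    # rows 1 to 3
tail(toy, 3)    # rows 18 to 20
head(toy, -15)  # all but the last 15 rows, i.e. rows 1 to 5
```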

============================================= 64%
After previewing the top and bottom of the data, you probably noticed lots of NAs, which are R's placeholders for missing values. Use summary(plants) to get a better feel for how each variable is distributed and how much of the dataset is missing.

summary(plants)

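For numeric columns, the output shows the minimum, 1st quartile, median, mean, 3rd quartile, and maximum, plus a count of NAs if there are any. Columns stored as character vectors show only their length, class, and mode, while columns stored as factors show a count for each level.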

All that hard work is paying off!
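
If you only want the number of missing values per column, one possible shortcut (not part of the lesson) is to combine is.na() with colSums(). The data frame toy is hypothetical.

```r
# Hypothetical data frame with missing values in both columns
toy <- data.frame(
  duration = c("Annual", NA, "Perennial", NA),
  ph_min   = c(5.5, 4.0, NA, 6.1)
)

# is.na() returns a logical matrix; column sums count the TRUE entries
colSums(is.na(toy))   # duration 2, ph_min 1

# Column means give the proportion missing instead
colMeans(is.na(toy))  # duration 0.50, ph_min 0.25
```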

================================================ 68%
summary() provides different output for each variable, depending on its class. For numeric data such as Precip_Min, summary() displays the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. These values help us understand how the data are distributed.

...

================================================== 72%
For categorical variables (called 'factor' variables in R), summary() displays the number of times each value (or 'level') occurs in the data. For example, each value of Scientific_Name only appears once, since it is unique to a specific plant. In contrast, the summary for Duration (also a factor variable) tells us that our dataset contains 3031 Perennial plants, 682 Annual plants, etc.

...
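
How summary() changes with the class of its input can be sketched on three small vectors. All names and values below are made up for illustration. Note that the per-level counts described above only appear when a column is stored as a factor.

```r
# Hypothetical vectors holding the same kind of information in different classes
precip       <- c(4, 10, 16, NA, 30)
duration_chr <- c("Annual", "Perennial", "Perennial", NA, "Biennial")
duration_fct <- factor(duration_chr)

summary(precip)        # Min., 1st Qu., Median, Mean, 3rd Qu., Max., NA's
summary(duration_chr)  # Length, Class, Mode
summary(duration_fct)  # Annual 1, Biennial 1, Perennial 2, NA's 1
```

If a text column shows only length, class, and mode, wrapping it in factor() or passing it to table() recovers the counts.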

===================================================== 76%
You can see that R truncated the summary for Active_Growth_Period by including a catch-all category called 'Other'. Since it is a categorical/factor variable, we can see how many times each value actually occurs in the data with table(plants$Active_Growth_Period).

table(plants$Active_Growth_Period)

The output shows how many times each value of Active_Growth_Period occurs in the data.

Keep up the great work!
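
By default, table() leaves NAs out of its counts. A minimal sketch, using a made-up vector growth rather than the real Active_Growth_Period column, shows how the useNA argument includes them and how sort() orders the result.

```r
# Hypothetical values, including one missing entry
growth <- c("Spring", "Summer", "Spring", NA, "Year Round", "Spring")

table(growth)                           # NA is not counted
table(growth, useNA = "ifany")          # adds an <NA> column when NAs exist
sort(table(growth), decreasing = TRUE)  # most frequent value first
```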

======================================================== 80%
Each of the functions we've introduced so far has its place in helping you to better understand the structure of your data. However, we've left the best for last....

...

=========================================================== 84%
Perhaps the most useful and concise function for understanding the structure of your data is str(). Give it a try now.

str(plants)

The output is a concise summary of the dataset's structure: its class, its dimensions, and the name, class, and first few values of each variable.

Excellent job!

============================================================== 88%
The beauty of str() is that it combines many of the features of the other functions you've already seen, all in a concise and readable format. At the very top, it tells us that the class of plants is 'data.frame' and that it has 5166 observations and 10 variables. It then gives us the name and class of each variable, as well as a preview of its contents.

...

================================================================ 92%
str() is actually a very general function that you can use on most objects in R. Any time you want to understand the structure of something (a dataset, function, etc.), str() is a good place to start.

...
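
To illustrate how general str() is, here is a short sketch that calls it on a few different kinds of objects. The objects are arbitrary examples and not part of the lesson.

```r
str(1:5)                       # an integer vector
str(list(a = 1, b = "x"))      # a list with two named elements
str(mean)                      # a function and its arguments
str(data.frame(x = 1:3,
               y = c("a", "b", "c")))  # a small data frame
```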

=================================================================== 96%
In this lesson, you learned how to get a feel for the structure and contents of a new dataset using a collection of simple and useful functions. Taking the time to do this upfront can save you time and frustration later on in your analysis.

...

======================================================================100%
Would you like to receive credit for completing this course on Coursera.org?

1: Yes
2: No

Selection: 2

That's a job well done!

You've reached the end of this lesson! Returning to the main menu...

Reference

Kross S, Carchedi N, Bauer B, Grdina G (2024). swirl: Learn R, in R.
R package version 2.4.5, commit 82b50ced7149796fd9a78f5112f137f298930b1a,
https://github.com/swirldev/swirl.
