# this function requires quotes around the package name!
install.packages("tidyverse")
Day 1: What are R and RStudio?
CSSS Math Camp 2024
Scope of Math Camp content this week
We do not assume any prior computing knowledge, and we will take a practical hands-on approach. We will cover an introduction to the following topics:
- What is R, and why learn it? What can you use it for?
- Intro to R and to computing in general
- Intro to RStudio
- Reading in data
- Tidy data
- Exploring, manipulating and plotting data in base R and using tidyverse
- Basic statistical tools, simulating from a distribution
- Writing and running scripts
- Matrices in R
- Writing your own functions
- For-loops, vectorization, and efficient coding
- Benchmarking
- Version control (with git)
- Making R packages (why, when, and how?)
We will also get an introduction to Quarto for making documents, slides, and other products that mix text and code.
What are R and RStudio?
R is a programming language.
RStudio is an application for writing code, analyzing data and building packages with R.
- a graphical user interface or GUI
- an integrated development environment or IDE
We will also learn some Quarto, which is a tool we can use within RStudio (and other applications actually) to make reports, manuscripts, slides, books, websites, and more.
- R has been around since 1993
- RStudio has been around since 2009
- RStudio was the name of both the application and the company that developed it until 2022, when the company renamed itself Posit to signal that it was moving toward language-agnostic tools, such as Quarto, that can interface with Python, Julia, and other languages
How is R related to or different from other languages and software such as Stata, Excel, Python, C?
Why bother with R?
Multifunctionality
Data access (e.g. tidycensus)
Data manipulation
Statistical analysis
Publication-quality data visualizations and tables
Interactive tables, visualizations, widgets, applications
GIS
- R is a GIS itself
- It can also connect to ArcGIS, QGIS, and GRASS GIS
Works for many data types and formats: spatial data, network data, spreadsheets, surveys, SQL databases, …
Automation
Scripts
Record the set of instructions for your analysis from start to finish in a script, store and use it like a recipe
If you want to make one change, you can just make that one change to your script and rerun the script rather than having to repeat individual steps over and over
Functions
- For tasks you’ll run over and over again, even across different projects
For-loops and other types of control flow
- Iteration, conditional instructions, etc.
Text and data reformatting
Reproducibility and quality control
Someone else can look at your script and see exactly what you did [1]
Future You can look at your script to remind yourself exactly what you did
Someone else/future you can run your script to exactly reproduce what you did
If you’re troubleshooting or still developing your procedures, you can step through the instructions
Open source
- R is free and open source. It is developed and maintained by a large user community, so it is well maintained and new packages are released often.
Common
- Coursework, research, and statistical code often use R.
Orienting yourself in RStudio
As the previous section suggests, sometimes you might be in development mode, iteratively designing your analysis/plots/output. Other times you might just be trying to run code that you or someone else has already written.
In this section, we’ll see how to use different parts of the RStudio window for these different modes.
RStudio has four main panes that allow you to organize your code and have a better user experience developing and using R code. We’ll revisit the different purposes of, and relationships among, these four panes over the course of the week. Notice that many of the panes have multiple tabs (e.g. Console vs Terminal) that you can toggle between.
Source pane
- Edit scripts and other files that allow you to save your code
- View objects that appear when you click on them in the Environment pane
- (If you use Quarto/R Markdown, your output and plots may appear here)
Console/Terminal/Background Jobs pane
- When you run code, even from a script, the code and its text/printed output appears here in the Console
- You can also run code from the Console itself
- If you use Quarto or R Markdown, when you knit your file, the Background Jobs tab here will show the results of knitting and any errors that might occur
- You can run shell/bash commands like cd, ls or pwd in the Terminal tab just as you would in the Terminal application on your computer
Environment/History pane
- Import data
- See what objects you’ve created (such as variables, data sets, model results) that are currently “stored in your environment”
- Look at your command history and rerun previous commands
- If you develop R packages or use version control like git, the Build and Git tabs will appear here
Files/Plots/Help pane
- Navigate files and folders in the Files tab
- Copy folder paths to paste into your code using the gear symbol under the Files tab
- When you make plots, they generally appear in the Plots tab
- Read help documentation for packages and functions in the Help tab
- See your rendered Quarto/R Markdown document in the Viewer tab
The meaning of these different terms and tasks will become clearer as we practice them.
If you open RStudio by opening the application directly instead of opening an R file, then you will not have the Source pane open and likely the Console pane will take up the entire left half of your window:
If instead you open RStudio by clicking on an R file, this screenshot shows the likely order in which the four panes appear on your screen, though you can change the order under Preferences. Here I have two files open: RLab1 and a file whose name starts with exploratory-data-analysis-lesson. When the filename is red with an asterisk, as you see below for the file RLab1, the file has changes that have not been saved yet.
You can rearrange your panes at any time by going to Preferences > Pane Layout. You can also resize/adjust the panes by clicking and dragging their boundaries.
Packages and environment
The different functions/programs we use in R are organized or bundled into what are called packages. A handful of packages come with R when you download and install it, but most of them you install afterwards when you want them. It’s easy and quick to install or update packages when you need them.
Click on the Packages tab in the Files pane. You’ll see a list of names in blue with checkboxes next to them; this is the list of packages that you have installed on your computer.
A key point: once you install a package, there is an additional step you need to do before using it. This is a good example to help us understand the concept of the R “environment” or “session”.
Installing a package makes the programs in that package available on your computer. It’s like buying a baking ingredient and putting it on your shelf at home.
Before you can use it, you have to take it off the shelf and put it on the counter or wherever you’re going to bake, whatever your workspace is. Similarly, you have to load/attach the package to your current R environment before you can use it.
If you don’t need an ingredient or it’s in your way, you can put it back on the shelf. Similarly, you can unload/detach a package at any time.
When you’re done baking, you put everything away. Similarly, when you quit R, all your packages are unloaded or detached, so when you open R again next time, you’ll need to load/attach the ones you need before you start working with them.
Just like you can take out all your ingredients at once or one at a time as needed, you can do the same with packages. When you save your code as a script, we generally recommend loading them all at the start so that it’s easy to tell by just looking at the top of the script which packages are required for the code you’re writing/using. When you’re using R interactively, though, you can add and detach packages throughout the session.
When you look at the Packages tab, the packages that have a checkmark next to them are the ones loaded in your current R session for you to use right now, and the unchecked ones are installed but not currently loaded.
The first time you use a package, you’ll need to install it with install.packages(), putting the package name in quotes: install.packages("tidyverse")
Then whenever you want to use it, you can load/attach it:
# for this one, you can use quotes or not
library(tidyverse)
Notice that when you start typing, R suggests different options and you can use the up and down arrows to select the right one and hit enter to autocomplete it; this can save a lot of typing.
This is also a handy way to test whether you already have a package: run the library() command, and if it completes successfully then you already have it installed; otherwise it will let you know the package wasn’t found.
We’ll use this tidyverse package a lot this week. In fact, tidyverse is actually a set or suite of packages that have been grouped together so that you can do just one install command and one library command when you want to use them.
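As a minimal sketch of the shelf-and-counter idea, the code below checks whether a package is installed without attaching it, then attaches and detaches one. The helper name is_installed is hypothetical (it is not part of R), and the tools package is used only because it ships with every R installation. requireNamespace(), library(), search(), and detach() are base R functions.

```r
# Hypothetical helper (not part of R): TRUE if the package is installed
is_installed = function(pkg) {
  # requireNamespace() tries to load the package without attaching it,
  # and returns TRUE or FALSE instead of stopping with an error
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("tools")               # tools ships with R
is_installed("notARealPackage123")  # a made-up name, assumed not to be installed

# Take the package off the shelf: attach it to the current session
library(tools)
"package:tools" %in% search()       # search() lists everything currently attached

# Put it back on the shelf: detach it again
detach("package:tools")
"package:tools" %in% search()
```

A check like this can be handy at the top of a script, so that a missing package is reported before any analysis starts.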
Let’s start working with data!
Many R packages come with built-in data sets that can be loaded using the data() function. We will use a data set called midwest, which is included in the ggplot2 package (part of the tidyverse suite, so library(tidyverse) has already loaded ggplot2). Each row of the midwest data represents a county in one of the five states making up the United States’ Midwest region, and contains information about the county’s population and demographics.
Let’s try some different commands to explore the data. Notice that the text with a hashtag # before it appears in green; those are comments. The comment character # tells R that any text after it on the same line is meant for humans only and should not be run as code. Other languages have different comment characters.
data(midwest) # loads the data set; where does it appear?
str(midwest) # tells you about the STRucture of the data set
head(midwest) # what does this do?
summary(midwest) # what does this tell you?
?midwest # brings up a help/description page to tell you more about the data set; often has citations
data() # this tells you about all the data sets available in your current environment
Try the above commands, see what they do, and try to answer the following questions about the data:
- Where is the data from?
- How many variables are in this data set, and what are they?
- How many rows are in this data set, and what do they represent? Do you have a row for every respondent to a survey? Every state of the US?
- What format is each variable in?
- How do you take a look at the first few rows? Last few rows?
- What questions do you still have about the data?
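The same kinds of questions can be asked of any data set. The sketch below uses mtcars as a stand-in (an assumption made so the example runs without the tidyverse; it is a small data set in the datasets package that R attaches by default) and shows a few base R functions for counting rows and variables, listing variable formats, and peeking at the first and last rows.

```r
# mtcars is used here only as a stand-in for midwest
data(mtcars)

n_rows = nrow(mtcars)               # number of rows (one per car model)
n_cols = ncol(mtcars)               # number of variables
var_names = names(mtcars)           # the variable names
var_types = sapply(mtcars, class)   # the format (class) of each variable

n_rows
n_cols
var_names
var_types

head(mtcars, 3)   # first three rows
tail(mtcars, 3)   # last three rows
dim(mtcars)       # rows and columns in one call
```

Replacing mtcars with midwest (after running library(tidyverse)) should work the same way.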
How to do math in R
Click in the Console pane and you should see the cursor blinking next to a greater-than symbol > which we call the command line prompt. This is where you can type commands for R to run.
Let’s try a few different commands. To start, type 3+2 and press enter:
We can do basic mathematical computations using +, -, * (multiplication), / (division), and () (grouping). Try the following lines, one line at a time. I recommend typing them yourselves instead of copying and pasting. You can skip the lines with text; these are called comments.
2+3
6*7/2
# The number pi is hard coded into R
pi*1
# Why should the following two lines give you different results? What is the order of operations in each one?
6*3/2-3
6*3/(2-3) # create this line by using the up arrow to copy the previous command and edit it
# What will dividing by zero give you?
1/0
Try some on your own!
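Here is a short sketch, with numbers chosen only for illustration, of how R's order of operations plays out and what a few unusual divisions return. Parentheses are the safest way to make the intended order explicit.

```r
2 + 3 * 4      # multiplication happens before addition: 14
(2 + 3) * 4    # parentheses force the addition first: 20

8 / 4 * 2      # division and multiplication run left to right: 4
8 / (4 * 2)    # parentheses change the grouping: 1

-2^2           # the exponent is applied before the minus sign: -4
(-2)^2         # parentheses make the base negative: 4

-1 / 0         # negative infinity: -Inf
0 / 0          # undefined, reported as NaN ("not a number")
```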
What about more complicated functions like exponents, square roots, trig functions and the natural logarithm?
See if you can compute the following values yourself before you run them in R, then run them in R and see if you get the answer you expected.
3^2
2^(3+2)
sqrt(9)
log(exp(1)^2)
log(1000)
exp(2)
sin(2*pi)
Note that log() computes the natural logarithm (base \(e\)) by default. See ?log to compute a logarithm with a different base; how would you compute \(\log_{10}(1000)\) and what should the correct answer be? Verify it by running your command in the R console.
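For checking your answer afterwards, here is a sketch of three ways to get a base-10 logarithm in R. The base argument of log() and the functions log10() and log2() are all documented in ?log. The third line uses the change-of-base formula \(\log_{10}(x) = \log(x) / \log(10)\).

```r
log(1000, base = 10)    # base given explicitly; prints 3
log10(1000)             # shortcut function for base 10; prints 3
log(1000) / log(10)     # change-of-base formula; prints 3

log2(8)                 # shortcut function for base 2; prints 3
log(exp(5))             # natural log undoes exp(); prints 5
```

Because computers store decimals approximately, the change-of-base result may differ from 3 by a tiny rounding error even though it prints as 3.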
Creating objects and assigning values
So far, we have been calculating things in R without storing them anywhere. For data analysis and statistics, we need to be able to store and manipulate information instead of just computing things.
We also haven’t been storing our calculations, our code, anywhere. If we want to edit them, we need to retype them. And right now we can scroll back through our command history to remind ourselves what we did and in what order, but what about tomorrow or next month? How can we store them to rerun or edit later?
Numbers, data, formulas, and other statistical information can be stored as objects. In RStudio, the objects you make and some information about them can be seen in the Environment pane.
For example, if you type the following line into the console,
x = 4
you will notice that the blank command line reappears without any output having been printed. All R did was store the number 4 under the name x; we call this “assigning the value 4 to the variable \(x\)”, so you may hear people refer to = as an assignment operator. You can also see under the Environment pane that you now have a variable x that is equal to 4.
Another way to assign a value is to use the <- assignment operator:
x <- 4
- Object names are case-sensitive
- Object names should be meaningful and short (“best practice”)
- What’s the difference between <- and =, and which should we use when?
  - If you plan to use pre-2001 R code or you want to be 100% backwards compatible just in case, use <-
  - Otherwise it’s a matter of preference
  - If you use <-, put a space on either side to improve readability and to avoid confusion with a less-than comparison
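The points in this list can be seen in a few lines. The object names below are made up for illustration. The last three lines show why spacing matters: R reads x < -3 as a comparison but x<-3 as an assignment.

```r
income <- 4        # assignment, with a space on either side of <-
Income <- 10       # a different object: names are case-sensitive
income == Income   # FALSE, because 4 is not equal to 10

x <- 5
x < -3             # comparison: "is x less than -3?" gives FALSE; x is still 5
x<-3               # no spaces: R reads this as assignment, so x is now 3
x                  # prints 3
```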
Run the following code line by line to see how x changes each time. What value does x have after this whole code chunk?
x = 9 # here we assign x the value 9, and R will not print anything
x # here we're not assigning anything; R will just print the current value of x
x = x + 2
x
x = x - 5
x
What happens if you try to use a variable you haven’t created yet? For example, what happens if you run this line of code?
y - 3
Let’s try creating a little data set. Maybe we have data from a household survey on income, the number of household members, and the highest education level in the household. We can create vectors of data, lists of values of the same type, using the function c(). (Here the c stands for concatenate, or to make into a chain.)
This may feel like a toy example, because it is. However, it’s often really helpful to make your own toy data sets when developing code because it’s easier to verify that your code does what you want it to do on a small, simple data set. We often call this creating a minimal reproducible example. This is really helpful not only for your own testing but for those times when you want to ask for help from your instructor/TA/colleague.
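As a quick sketch of what "values of the same type" means in practice, the made-up vectors below show that arithmetic on a vector applies to every element, and that mixing numbers with text makes R convert everything to text.

```r
ages = c(34, 27, 51)      # a numeric vector with three values
typeof(ages)              # "double", R's default storage for numbers
length(ages)              # 3
ages + 1                  # arithmetic is applied to each element: 35 28 52

mixed = c(1, "two", 3)    # numbers and text in the same vector
typeof(mixed)             # "character": the numbers were converted to text
mixed                     # "1" "two" "3"
```

This conversion is a common source of surprises when a numeric column contains a stray text value.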
Objects can be numbers, strings, matrices, or even more complicated R objects. Examples of R object types:
- integer, numeric, string
- vector, matrix, list
- data.frame
- factor
- lm object (linear model object, e.g. regression)
- formula
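To make the list above concrete, here is a sketch that creates one small object of each kind and asks R what it is using class(). All object names and values are made up for illustration. Note that R calls text strings "character".

```r
n_int = 5L                                  # the L suffix makes an integer
n_num = 2.5                                 # numeric
s = "high school"                           # string (class "character")
v = c(1, 2, 3)                              # vector
m = matrix(1:6, nrow = 2)                   # matrix with 2 rows and 3 columns
l = list(1, "a", TRUE)                      # list: elements may differ in type
df = data.frame(x = 1:3, y = c(2.1, 3.9, 6.2))              # data.frame
f = factor(c("college", "high school", "college"))          # factor
fm = y ~ x                                  # formula
fit = lm(fm, data = df)                     # lm object (linear regression)

class(n_int)    # "integer"
class(n_num)    # "numeric"
class(s)        # "character"
is.vector(v)    # TRUE
is.matrix(m)    # TRUE
class(l)        # "list"
class(df)       # "data.frame"
class(f)        # "factor"
class(fm)       # "formula"
class(fit)      # "lm"
```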
id = 1:5
id # to see what this new id object looks like
typeof(id)
hh_income = c(32, 40, 36, 55, 18)
hh_members = c(3, 5, 5, 3, 4)
hh_edu = c("high school", "high school", "some graduate", "college", "high school")
We can combine these individual vectors into a data set:
hh_data = data.frame(id, hh_income, hh_members, hh_edu)
We can see the types of our data columns (and some other information about the data set) using str():
str(hh_data)
R scripts and best practice
Typing commands directly into the console is nice sometimes, especially for developing code or testing things out. If you quit, though, the commands you typed and the results you obtained disappear and will not be there when you reopen RStudio later. To save them, you can store the commands in a file. The simplest kind of file is called a script.
Open a new .R script by clicking File \(\rightarrow\) New File \(\rightarrow\) R Script. This will appear in the Source pane of your RStudio window.
Copy and paste the following lines of code into your script file:
id = 1:5
id # to see what this new id object looks like
typeof(id)
hh_income = c(32, 40, 36, 55, 18)
hh_members = c(3, 5, 5, 3, 4)
hh_edu = c("high school", "high school", "some graduate", "college", "high school")
hh_data = data.frame(id, hh_income, hh_members, hh_edu)
str(hh_data)
You can run code many different ways. Here are two:
- Execute a chunk of the code by highlighting it in the script file and typing command-return (Mac) or control-return (PC).
- Alternatively, you can run the code line by line. Place your cursor anywhere on the first line you want to run and type command-return (Mac) or control-return (PC). If you keep repeating this key sequence, you will keep running the lines of code in sequence. Try this out!
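A third way, not covered in the list above, is to run an entire script file at once with the base R function source(). The sketch below writes a three-line script to a temporary file so that it is self-contained. The variable name total and the file contents are made up for illustration; in practice you would pass the path of your own saved script.

```r
# Create a small script in a temporary file (stand-in for a script you saved)
script_path = tempfile(fileext = ".R")
writeLines(c(
  "total = 10          # start with 10",
  "total = total * 2   # double it",
  "total = total + 1   # add one"
), script_path)

# Run every line of the script, in order, from top to bottom
source(script_path)

total   # prints 21
```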
Commenting your code
It is good practice to annotate your scripts by including comments that describe what your code does. To do this, you can include lines that start with #; these lines will be treated as comments and they are not run as code. For example, above I made a comment to explain that the second line lets us see what the id object looks like.
You can also use the comment character # to “comment out” code. For example, what value does x have after this code chunk?
x = 9
x = x + 2
x = x - 5
What about after this code chunk, which is the same except the middle line is commented out?
x = 9
# x = x + 2
x = x - 5
When is code run?
Note that code in your script file is not run until you run it. Therefore, it is also not necessarily run in the order that it is in your script; it is run in whatever order you execute it. For instance, if you quit RStudio, reopen it, and run just the last line of the script, you will receive an error.
Also, if you edit code in the script, it does not update the variables you have already created in your working environment unless you rerun the code. Let’s practice this with our data set example above.
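Here is a sketch of that behavior using a shortened version of the household data. It uses two pieces of notation not introduced above: square brackets to pick out an element of a vector, and $ to pick out a column of a data set. The names stale_value and fresh_value are made up for illustration.

```r
id = 1:5
hh_income = c(32, 40, 36, 55, 18)
hh_data = data.frame(id, hh_income)     # hh_data copies the values as they are now

hh_income[1] = 99                       # later, we change the first income value

hh_income[1]                            # 99: the vector was updated
stale_value = hh_data$hh_income[1]      # 32: the data set still holds the old value
stale_value

hh_data = data.frame(id, hh_income)     # rerun the line that builds the data set
fresh_value = hh_data$hh_income[1]      # 99: now the data set is up to date
fresh_value
```

Rerunning the whole script from the top after an edit is a simple way to avoid working with out-of-date objects.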
Related: Where does the name Environment pane come from? From the time you open RStudio to the time you close it, we say you are running a session. When you create variables like this, they exist in your working environment which means that they are defined and accessible within R. This is separate from whether they are stored somewhere on your computer for you to access after you close R. This means two things, which will be particularly relevant when we read in and manipulate data sets:
- When you open R or RStudio, data on your computer or anywhere else is not automatically available to analyze in R; first you have to load it and store it as an R object.
- Once you create an object in R, the object exists only during this session. To access it again later without recreating it, you have to save it to a file. If the object takes a long time to create, saving it is the best option. Otherwise, the better alternative is usually to save the code you used to create the object, so that you can recreate it easily anytime you need it.
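One way to save a single object to a file, sketched here with the base R functions saveRDS() and readRDS(), is shown below. The object name and the temporary file path are made up for illustration; in practice you would choose a file name inside your project folder.

```r
survey_income = c(32, 40, 36, 55, 18)

# Save the object to a file (a temporary file here, so the sketch is self-contained)
rds_path = tempfile(fileext = ".rds")
saveRDS(survey_income, rds_path)

# Remove the object, as if we had quit R and started a new session
rm(survey_income)

# Read the object back in from the file and store it under a name of our choosing
restored_income = readRDS(rds_path)
restored_income
identical(restored_income, c(32, 40, 36, 55, 18))   # TRUE
```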
As one last practice session for today, write a script using code that you learned today.
- Make sure each instruction appears in the order you would want to run it.
- Use comments to organize and explain it to yourself for tomorrow, as well as for a month or a year from now.
- Then trade code with another math camp participant to give each other feedback. Please share at least one thing that you like about (or learned from) the other person’s code and at least one suggestion or idea you have for their code.
Extra resources
Some of many R tutorials and resources you might find useful, in no particular order:
- R for Data Science. Free online textbook by Hadley Wickham.
- R and Social Science. Free online textbook by Michael Clark.
- R for Social Science. Open curriculum/tutorial by the Carpentries.
- Intro to R for Social Scientists. Tutorial by Jasper Dag Tjaden.
- R for Non-Programmers: A Guide for Social Scientists. Free online textbook by Daniel Dauber.
Working with census and American Community Survey (ACS) data in R:
Handy tool for graphing functions: Desmos calculator
Footnotes
[1] Comment and organize your code well!