Day 5: Quarto and extra practice

CSSS Math Camp 2024

Author

Jess Kunke and Erin Lipman

Published

September 13, 2024

What is Quarto?

Quarto is an open-source publishing system that allows you to combine text, code (even in multiple languages), and output (figures, tables, and more) to make all sorts of products and documentation!

Let’s check out some examples in their Quarto gallery. Later we’ll also take a look at the Quarto document I made for the Day 3 lab notes.

How do I install Quarto?

If you have RStudio v2022.07.1 or later, then you’ve already got Quarto! To check your RStudio version, open RStudio if you haven’t already, go to the top menu, and select RStudio > About RStudio (on Windows and Linux, Help > About RStudio).

If you have an older version of RStudio than that, you probably want to update RStudio anyway, so I recommend first downloading the latest version of R (go here and follow these instructions), then downloading the latest version of RStudio here, which comes with Quarto.

If you have a recent version of RStudio but you want a newer version of Quarto than what came with that RStudio version (since development on Quarto happens fast), click here to download the current version. You do not need to update R or RStudio after installing Quarto.
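To check from the R console whether a Quarto command-line installation is visible, something like the following sketch may help. It assumes the quarto executable is on your system PATH, which is not guaranteed: the copy bundled with RStudio may not be, in which case the function simply returns NA. The helper name quarto_cli_version() is made up for this example.

```r
# Hypothetical helper: report the version of the Quarto CLI found on the PATH,
# or NA if no `quarto` executable is visible to R.
quarto_cli_version <- function() {
  quarto_path <- unname(Sys.which("quarto"))
  if (!nzchar(quarto_path)) {
    return(NA_character_)
  }
  # `quarto --version` prints the version on a single line
  out <- tryCatch(
    system2(quarto_path, "--version", stdout = TRUE, stderr = FALSE),
    error = function(e) NA_character_
  )
  if (length(out) == 0) NA_character_ else out[1]
}

quarto_cli_version()
```

An NA here does not necessarily mean Quarto is missing; it only means R could not find it on the PATH.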

Creating a Quarto document

Click on the dropdown menu that looks like a piece of paper with a plus sign, and select “Quarto document”. Alternatively, go to the menu across the top of the screen and select File > New File > Quarto Document. When the New Quarto Document window appears, enter a title and author, then click “Create”.

Learning from the template

Even a brand-new document comes with some template text and code that we can learn from and modify. Let’s save it and render it to see how it works! You can see the documentation here on all the different things you can do with Quarto and all the different settings you can set. Below are a few things we can already learn from the template alone.

Source vs. visual editing modes

There are two editing modes, source and visual, and we can toggle back and forth between them. Source mode shows the raw text of the document as it is stored in the file (YAML header, Markdown syntax, and code chunks), while visual mode shows the text formatted roughly as it will appear when you render the document.

YAML header options

The document starts with a YAML header, meaning that the top of the document is written in a language called YAML. What is YAML? The name originally stood for “Yet Another Markup Language” and now officially stands for “YAML Ain’t Markup Language”; in practice it is a simple format for writing settings as key: value pairs. Markup languages, by comparison, allow us to specify the format and structure of text documents. Examples of markup languages include HTML (HyperText Markup Language), XML, Markdown (used in Jupyter notebooks and GitHub README files), and R Markdown.

From this small template example, we already see some of the parameters we can set using YAML: title, author, format, editor.
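For reference, the header of a new document created this way typically looks something like the sketch below. The title and author values are placeholders for whatever you typed in the New Quarto Document window, and the exact fields may differ between RStudio versions.

```yaml
---
title: "Untitled"
author: "Your Name"
format: html
editor: visual
---
```

The lines of three dashes mark where the YAML header starts and ends; everything after the second line of dashes is the body of the document.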

R code in a Quarto doc

R chunks start with three backticks and {r} and end with three backticks.

We can run the code chunks in the Quarto document interactively within RStudio; this is useful for homework and research notes. Let’s see how to do that.

We can put YAML options in the code chunks using comments that start with #|. For example, we can control whether a code cell is executed when the document is rendered using eval.
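As a small illustration, the body of a chunk with a few options might look like the following. In the .qmd file these lines would sit between the opening three backticks with {r} and the closing three backticks. The label and the toy vector are invented for this sketch. Because #| lines are ordinary R comments, the code also runs unchanged in the console.

```r
#| label: toy-mean
#| eval: true
#| echo: false

# eval: true  -> the chunk is run when the document is rendered
# echo: false -> the code itself is hidden; only its output is shown
scores <- c(2, 4, 6)
mean(scores)
```

Setting eval to false instead would skip running the chunk when the document is rendered.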

Text formatting using Markdown

The text of the document is formatted using Markdown, and we see some examples of Markdown formatting used in this short example:

  • We can make sections and subsections of the document by starting a line with one or more # symbols, much as we marked sections in R scripts, and the headings are sized accordingly in the resulting document when we render.

  • We can bold text by putting two asterisks on either side of the text.

  • We can format text like code by putting a backtick on either side.

  • We can make a URL into hypertext (make it clickable) by putting <> around it: typing <https://www.google.com/> renders as the clickable link https://www.google.com/

Checking out my lab notes example

Here are some additional features you can see in my lab notes document “RDay3_data_viz.qmd”:

  • More YAML header options at the top of the document (what do these do?)
    • Subtitle
    • Date
    • Table of contents
    • Self-contained HTML file (what’s that?)
    • Modifying the default code execution settings for your document
  • How to comment out YAML you don’t want
  • Callouts (special notes or warnings, how to name them, and how to make them collapsible)
  • Embedding images from online or from a file
  • Named links/URLs
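
Several of the header options listed above could be written along the lines of the sketch below. This is not copied from RDay3_data_viz.qmd; all values are placeholders, and embed-resources assumes a reasonably recent Quarto version (older versions used self-contained instead).

```yaml
---
title: "Day 3 lab notes"        # placeholder values throughout
subtitle: "Data visualization"
author: "Your Name"
date: today
format:
  html:
    toc: true                   # table of contents
    embed-resources: true       # self-contained HTML file
execute:
  echo: true                    # default code execution settings
  warning: false
# editor: visual                # a leading # comments out a YAML line
---
```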

Your Turn

Make a mini lab report or homework assignment with some of the data manipulation and visualization functions we’ve already learned in math camp!

Let’s do some work with the midwest dataset from the ggplot2 package and then write a report.

library(tidyverse)
data(midwest)
str(midwest)

  • Below are some data exploration and analysis questions you can work on, and/or feel free to do some exploration of your own.
  • Make some relevant sections and subsections in your document.
  • Say a little bit about what the data represent and where they come from.
  • Make and include at least one table and one plot in your report. Format the table and plot as nicely as you can and play with the different options. (Feel free to ask for suggestions!)
  • Play with the format of your document and have fun!
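As one possible starting point for the table, the sketch below counts counties per state and formats the result with knitr::kable(). It assumes the knitr package is installed; the column labels and caption are only suggestions.

```r
library(dplyr)
library(ggplot2)  # provides the midwest dataset

# Count the number of counties (rows) in each state
counties_per_state <- midwest %>%
  count(state, name = "n_counties")

# Format the summary as a table for the rendered report
knitr::kable(
  counties_per_state,
  col.names = c("State", "Number of counties"),
  caption = "Counties per state in the midwest data"
)
```

In a Quarto document, placing this in an R chunk should render the summary as a formatted table instead of raw console output.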

Data exploration and analysis

Let’s start by finding out some basic information about the midwest data.

  1. How many rows are represented in the data?

  2. Count the number of unique values in the PID column to confirm that PID serves as a unique identifier for each county.

  3. Let’s look at the county variable.

    1. How many unique values are in the column county? Why might there be multiple rows with the same county name?

    2. Create a data set that contains only the counties that appear in more than one state by doing the following:

      • Use group_by() and summarise() to create a data set with a column that counts the number of rows with each county name.

      • Filter to rows where this count is at least 2.

    3. Which counties appear in 4 states? (Note that the data cover five states: IL, IN, MI, OH, and WI.)

Now let’s use the data to answer some questions about population demographics:

  1. Create a data set that has the total state population of each racial/ethnic category in the dataset (“white”, “black”, “amerindian”, “asian”, “other”), as well as the total state population. You could name your summary columns something like poptotal_state, popwhite_state, etc.

  2. Check that these population counts from the five racial/ethnic categories sum to the total population count poptotal_state by (a) creating a column that sums these race-specific state population counts, then (b) checking that it equals the total state population column.

  3. Use ggplot() to make a bar plot of the total population by state using the following steps:

    1. Start by passing your dataset from (Q1) to ggplot(), with the aesthetic aes(x = state, y = poptotal_state). You may need to change poptotal_state if your summary column has a different name.

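    2. Add a geom_col() layer to add the bars to the blank plot (note: there is also a geom called geom_bar(), but by default it sets the bar heights by counting the rows in each group, so it works best when the thing you are plotting is a count of rows. geom_col(), short for “column”, uses the y values you supply and is more flexible).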

    3. Customize your plot by changing the axis labels and adding a plot title.

  4. Use ggplot() to make a bar plot of total population by state, broken down by racial/ethnic group:

    1. In order to make a barplot in ggplot(), we will need to combine all of the population count columns into a single column using pivot_longer(), to create a data set with columns for state, race, and population count for that state and racial/ethnic group. Your pivot_longer() command might look something like this: pivot_longer(!state, names_to="race", values_to="population"). Can you understand what each argument to pivot_longer() does? Be careful to pivot only the columns you want to pivot.

    2. Now that you have a single column for population count, you can make a stacked bar plot showing total state population by race using code similar to that in Q3, but adding the aesthetic fill=race, e.g. aes(x=state, y=population, fill=race).

    3. Customize your plot to look better or more report-ready.
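The pivot_longer() arguments from the last question can be seen at work on a tiny made-up table. This is only a sketch: the state names, column names, and counts below are invented and are not part of the midwest data.

```r
library(dplyr)
library(tidyr)

# Made-up wide data: one row per state, one column per group
toy_wide <- tibble(
  state   = c("A", "B"),
  group_x = c(10, 30),
  group_y = c(20, 40)
)

toy_long <- toy_wide %>%
  pivot_longer(
    !state,               # pivot every column except state
    names_to = "group",   # new column holding the old column names
    values_to = "count"   # new column holding the cell values
  )

toy_long
```

Each original row becomes one row per pivoted column, so the two-row wide table turns into a four-row long table with columns state, group, and count.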

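There are many solutions to most or all of these questions, but here are some examples of how to solve them. The code numbers the questions consecutively: Q1 to Q3 are the three basic-information questions, and Q4 to Q7 are the four population demographics questions (so Q4 is demographics question 1, and so on).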

# Q1 ------------------------------------------------------
# one approach
nrow(midwest) # 437

# another approach
dim(midwest)[1]

# Q2 ------------------------------------------------------
# one approach:
n_distinct(midwest$PID) # 437

# another approach:
midwest %>% 
  select(PID) %>% 
  n_distinct()

# yet another:
length(unique(midwest$PID)) # 437

# Q3 ------------------------------------------------------
# Q3a
n_distinct(midwest$county) # 320, which is < 437 so there are some repeats

# Q3b
# one approach
county_name_counts = midwest %>%
  group_by(county) %>%
  summarise(n=n()) %>%
  filter(n>=2)

# another approach
table_vals = table(midwest$county)
county_name_counts = table_vals[table_vals >= 2]

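# Q3c
# if you used the first (tidyverse) approach above
counties_all_four_states = county_name_counts %>%
  filter(n==4) # 8 county names appear in four states
# note: midwest covers five states (IL, IN, MI, OH, WI), so use n==5
# to find the names that appear in every state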

# if you used the second (base-r) approach above
counties_all_four_states = names(table_vals[table_vals == 4])

# Q4 ------------------------------------------------------
state_pops = midwest %>%
  group_by(state) %>%
  summarise(
    poptotal_state = sum(poptotal),
    popwhite_state = sum(popwhite),
    popblack_state = sum(popblack),
    popasian_state = sum(popasian),
    popamerindian_state = sum(popamerindian),
    popother_state = sum(popother)
  )

# Q5 ------------------------------------------------------
# here I didn't update state_pops because I only want to use these
# columns once to make this check-- instead I just print the
# output-- but if you wanted you could update state_pops to add 
# these columns
state_pops %>% 
  mutate(poptotal_state_check = popwhite_state + popblack_state + popasian_state + 
    popamerindian_state + popother_state) %>%
  select(state, poptotal_state, poptotal_state_check) %>%
  mutate(totals_match = (poptotal_state == poptotal_state_check))

# Q6 ------------------------------------------------------
ggplot(state_pops, aes(x=state, y=poptotal_state)) +
  geom_col() +
  xlab("State") +
  ylab("State population total") +
  ggtitle("Population by state")

# Q7 ------------------------------------------------------
# The trick here is that we need to pivot just the five race-based columns,
# so any way of excluding the other column(s) is fine
# option 1:
state_pops_long = state_pops %>%
  select(-poptotal_state) %>%
  pivot_longer(!state, names_to="race", values_to="population")

# option 2:
state_pops_long = state_pops %>%
  pivot_longer(popwhite_state:popother_state, names_to="race", values_to="population")

# option 3:
state_pops_long = state_pops %>%
  pivot_longer(!c(state, poptotal_state), names_to="race", values_to="population")

# then plot:
ggplot(state_pops_long, aes(x=state, y=population, fill=race)) +
  geom_col() +
  xlab("State") +
  ylab("State population total") +
  ggtitle("Population by state broken down by racial/ethnic group")

# more report-ready: (some examples; you might have done different things)
# install.packages("ggrepel") # run once if ggrepel is not already installed
library(ggrepel)

state_pops_long = state_pops %>%
  pivot_longer(!c(state, poptotal_state), names_to="race", values_to="population") %>%
  # name the new column "Race" so the legend title is capitalized, and
  # recode the values (fct_recode() returns a factor) for nicer legend labels;
  # fct_recode() comes from forcats, which library(tidyverse) already loads
  mutate(Race = fct_recode(race,
                           "American Indian" = "popamerindian_state",
                           "Asian" = "popasian_state",
                           "Black" = "popblack_state",
                           "Other" = "popother_state",
                           "White" = "popwhite_state"))
  

ggplot(state_pops_long, aes(x=state, y=population/1e6, fill=Race)) +
  geom_col() +
  # geom_text(aes(label = round(population/1e6, 2)), size = 3, 
  #           hjust = 0.5, vjust = 3, position = "stack") +
  geom_text_repel(aes(label = round(population/1e6, 2)), 
            position = position_stack(vjust = 0.5)) +
  xlab("State") +
  ylab("Population (millions)") +
  ggtitle("State population by racial/ethnic group") +
  theme_bw(base_size = 18)

ggsave("state_pop.png", width = 20, height = 20, units = "cm")