Day 3: Data manipulation and visualization, continued

CSSS Math Camp 2024

Author

Jess Kunke, with previous material from Jess Godwin

Published

September 11, 2024

Matrices in R

Let’s start by defining some matrices in R:

# note: matrix() fills column by column by default
A = matrix(c(1,2,3,4), nrow=2)
B = matrix(c(1,2,1,2), nrow=2)
C = matrix(c(6,5,4,3,2,1), nrow=2)

Your Turn

Using matrices \(A\), \(B\), and \(C\), work out on paper what some or all of these should give you:

  1. \(A + B\)
  2. \(AC\) (shorthand for \(A\) times \(C\); this is how we usually write a product of matrices)
  3. \(CA\)
  4. \(A^T\) (transpose of \(A\))
  5. \(A^{-1}\) (inverse of \(A\))

Now let’s see how to do these in R, and this will also give us a way to check our answers:

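# A+B
A+B
# not AB or AC but actually elementwise multiplication!
A*B
# A*C throws an error ("non-conformable arrays") because A is 2x2 and C is 2x3
# A*C
# AC
A %*% C
# CA is not defined (C is 2x3, A is 2x2), so this throws an error ("non-conformable arguments")
# C %*% A
# transpose of A
t(A)
# inverse of A
solve(A)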

So an important takeaway here is that matrix multiplication in R uses the operator %*%, not *. A*B does elementwise multiplication, which can be useful sometimes in coding or data science and is different from matrix multiplication.
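
One way to check an inverse computed by hand is to multiply it back against \(A\), which should give the identity matrix. The sketch below uses only base R and redefines the matrices from above so it runs on its own; all.equal() is used rather than == because of floating-point rounding.

```r
A = matrix(c(1, 2, 3, 4), nrow = 2)
C = matrix(c(6, 5, 4, 3, 2, 1), nrow = 2)

# dimensions: A is 2x2 and C is 2x3, so AC is defined (and is 2x3) but CA is not
dim(A)
dim(C)
AC = A %*% C
dim(AC)

# multiplying A by its inverse should recover the 2x2 identity matrix
A_inv = solve(A)
all.equal(A %*% A_inv, diag(2))
```

Checking dim() before multiplying is a quick habit for catching non-conformable products before R complains.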

Data manipulation (continued)

Summary statistics with group_by() and summarize()

How many data points do we have for each continent?

gapminder %>%
  group_by(continent) %>%
  summarize(n.obs = n())

What about other summary statistics, like the minimum, average, and maximum country population on each continent?

gapminder %>%
  group_by(continent) %>%
  summarize(
    min_pop = min(pop),
    mean_pop = mean(pop),
    max_pop = max(pop)
  )

As always, if you want to store the results as an object in R so you can do other stuff with it later, you can do that by assignment:

continent_stats = gapminder %>%
  group_by(continent) %>%
  summarize(
    min_pop = min(pop),
    mean_pop = mean(pop),
    max_pop = max(pop)
  )

We can also group by more than one thing if we want to define groups by more than one variable. For example, how many data points do we have for each continent and each year?

n_obs_by_cont_year <- gapminder %>%
  group_by(continent, year) %>%
  summarize(n.obs = n())

What happens if we switch the order of the grouping variables?

n_obs_by_year_cont <- gapminder %>%
  group_by(year, continent) %>%
  summarize(n.obs = n())
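
To see the effect of the grouping order without scrolling through the full dataset, here is a minimal sketch on a made-up toy table (the toy data is an assumption standing in for gapminder). It also uses the .groups = "drop" argument of summarize(), which is not in the code above; it returns an ungrouped table and silences the message about grouping.

```r
library(dplyr)

# hypothetical toy data standing in for gapminder
toy = tibble(
  continent = c("Asia", "Asia", "Europe", "Europe", "Europe"),
  year      = c(1952, 1957, 1952, 1952, 1957)
)

by_cont_year = toy %>%
  group_by(continent, year) %>%
  summarize(n.obs = n(), .groups = "drop")

by_year_cont = toy %>%
  group_by(year, continent) %>%
  summarize(n.obs = n(), .groups = "drop")

# same counts in both tables; the column order and row order differ
by_cont_year
by_year_cont
```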

Notice that the table format is not ideal; we’ll address this in the next section!

Your Turn

Filter the gapminder dataset for only data on Italy, then compute the average per-capita GDP for each year in that Italy dataset.

gapminder %>%
  filter(country == "Italy") %>%
  group_by(year) %>%
  summarize(avg_gdp = mean(gdpPercap))

Pivoting

In the n_obs_by_year_cont table we created above, there is a separate row for each year-continent combination. What we would probably find more readable is to have each row represent a year (or continent) and each column represent a continent (or year), and then the values currently in the n.obs column would become the entries in each cell of the table.

We say that n_obs_by_year_cont is currently in long format, and we would like to pivot to a wider format. To do that, we’ll use the tidyverse function pivot_wider():

# we start with n_obs_by_year_cont
table_year_cont = n_obs_by_year_cont %>%
  pivot_wider(
    # which current column has values we want to use as the names of our new columns?
    names_from = continent,
    # which current column has the values for those new columns?
    values_from = n.obs
  )

How would we change the above code so that rows are continents and columns are years?

Sometimes we have a wide-format table that we want to pivot longer. This will come up later as we’re plotting.
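
As a small preview of pivoting in the other direction, here is a sketch using the tidyverse function pivot_longer() on a made-up wide table. The table and its numbers are hypothetical and only there to show the shape of the result.

```r
library(dplyr)
library(tidyr)

# hypothetical wide table: one row per year, one column per continent
wide = tibble(
  year   = c(1952, 1957),
  Asia   = c(33, 33),
  Europe = c(30, 30)
)

long = wide %>%
  pivot_longer(
    # which columns do we want to stack on top of each other?
    cols = c(Asia, Europe),
    # what should we call the new column that holds the old column names?
    names_to = "continent",
    # what should we call the new column that holds the old cell values?
    values_to = "n.obs"
  )

long
```

The result has one row per year-continent combination, the same layout as the long-format table we started from in this section.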

Before we move to plots, let’s take a moment to think about which lines of code above depend on which previous lines. Starting with the last code chunk above, where we created table_year_cont, what other code had to be run before that in order for it to work? What is the minimum set of code above that we would need to keep in a script so that we could open a fresh R Session and run the script start to finish without errors to create table_year_cont?

Data viz

Now for some plotting.

Not that kind of plotting.

Getting started with ggplot()

By popular demand, let’s make a plot of population over time for the country of Japan.

First, review: how do you get a subset of the data that’s just the Japan data?

# your code here...
japan_data =

Cool, now let’s plot population versus time with ggplot():

ggplot(japan_data, aes(x = year, y = pop))

Weird, what do you notice? ggplot() is funny in that the ggplot() call by itself only sets up the plot area and axes; it doesn’t draw any of the data. To do that, we add a + at the end of the ggplot() line and then add more lines of code. For example, “geoms” (geometry layers) add the actual lines, points, bars, etc.:

ggplot(japan_data, aes(x = year, y = pop)) +
  geom_line()
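
It can help to remember that a plot is an R object that we build up with +. The sketch below stores the plot in a variable and counts its layers before and after adding a geom; the toy data frame and its numbers are made up so that the example runs on its own.

```r
library(ggplot2)

# hypothetical toy data standing in for japan_data
toy = data.frame(year = c(1952, 1957, 1962), pop = c(86, 92, 96))

# plot area only: no layers yet
p = ggplot(toy, aes(x = year, y = pop))
length(p$layers)

# adding a geom adds one layer
p_line = p + geom_line()
length(p_line$layers)

# printing the object is what actually draws the plot
p_line
```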

We can combine multiple layers too as long as they make sense for the data structure:

ggplot(japan_data, aes(x = year, y = pop)) +
  geom_line() +
  geom_point()

We can make things a lot prettier and more customized too. Here are just a few examples of things we can do:

ggplot(japan_data, aes(x = year, y = pop/1e6)) +
  geom_line(color = "maroon", alpha = 0.7) +
  geom_point(color = "maroon", alpha = 0.7) +
  xlab("Year") + ylab("Population (millions of people)") +
  ggtitle("Japan's population over time") +
  theme_bw()

Saving plots to file

After we make a plot, we can save the plot to file as a png, jpg, or another file type. There are multiple functions we can use to do this, and here is one I use a lot:

ggsave("japan_plot.png")

I often use some of the other arguments of ggsave() to specify the scale and dimensions of the plot, which also affects the size of axis tick labels relative to the plot:

ggsave("japan_plot_dim.png", width = 10, height = 7, units = "cm")
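
By default, ggsave() saves the last plot that was displayed. If you have stored a plot in a variable, you can pass it explicitly with the plot argument instead. Here is a minimal sketch; the toy data is made up, and tempfile() is used only so the example does not write into your project folder.

```r
library(ggplot2)

# hypothetical toy data standing in for japan_data
toy = data.frame(year = c(1952, 1957, 1962), pop = c(86, 92, 96))
p = ggplot(toy, aes(x = year, y = pop)) +
  geom_line()

# save a specific plot object rather than relying on the last plot shown
out_file = tempfile(fileext = ".png")
ggsave(out_file, plot = p, width = 10, height = 7, units = "cm")

# check that the file was written
file.exists(out_file)
```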

Remember that “japan_plot.png” is a relative file path, so this is going to save the image file to your current working directory. If you’re using an R Project, then you’re probably going to be saving it into that project’s folder. If you want to save it somewhere else, you can do that by specifying the path you want. To do this, let’s first see how the paste() function works:

paste("Hello", "World!")   # joins with a space: "Hello World!"
paste0("Hello", "World!")  # joins with no separator: "HelloWorld!"

Now let’s apply that function to save the file to a location other than the default folder. Note that you need to make sure there is a slash between the folder path you’re adding and the file name. In R, a forward slash works on every operating system; on Windows, a backslash has to be doubled (\\) inside a string.

# example 1: relative path within the folder you're in now
plot_folder = "plots/"
ggsave(paste0(plot_folder, "japan_plot.png"))

# example 2: absolute path to any place on your computer
plot_folder = "~/Documents/math-camp-plots/"
ggsave(paste0(plot_folder, "japan_plot.png"))
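
An alternative to paste0() that takes care of the slash for you is the base R function file.path(). The sketch below also creates the folder first if it is missing, since saving into a folder that does not exist can fail. The folder name is hypothetical, and it is placed inside tempdir() only so the example does not touch your own files.

```r
# hypothetical folder, placed in a temporary directory for this example
plot_folder = file.path(tempdir(), "plots")

# create the folder if it is not there yet
if (!dir.exists(plot_folder)) dir.create(plot_folder, recursive = TRUE)

# file.path() inserts the separator between the folder and the file name
plot_path = file.path(plot_folder, "japan_plot.png")
plot_path

# the same idea with a relative path
file.path("plots", "japan_plot.png")
```

You could then call ggsave(plot_path) just as in the examples above.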

Plotting multiple countries

What if we want to compare several countries on the same plot?

multi_country_data = filter(gapminder, country %in% c("Japan", "Nigeria", "Argentina", "New Zealand"))

ggplot(multi_country_data, aes(x = year, y = pop)) +
  geom_line()

Whoa, why does the plot look like that, and how can we fix it?

The issue is that the multi_country_data dataset has both multiple countries and multiple years, and currently ggplot does not know to group the time series data by country in deciding which lines to plot. It is treating it as data to plot as a single line, when we would like a separate line for each country.

To fix this, we will specify a grouping variable using the group argument to the aes() function:

multi_country_data = filter(gapminder, country %in% c("Japan", "Nigeria", "Argentina", "New Zealand"))

ggplot(multi_country_data, aes(x = year, y = pop, group = country)) +
  geom_line()

We probably also want to distinguish and label the lines somehow by what country they represent:

multi_country_data = filter(gapminder, country %in% c("Japan", "Nigeria", "Argentina", "New Zealand"))

ggplot(multi_country_data, aes(x = year, y = pop, group = country, color = country)) +
  geom_line()
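
A side note: when a discrete variable such as country is mapped to color, ggplot2 by default uses it to define the groups as well, so the explicit group argument becomes optional here. The sketch below checks this on a made-up toy data frame (the numbers are hypothetical) using layer_data(), which returns the data ggplot2 computed for a layer.

```r
library(ggplot2)

# hypothetical toy data standing in for multi_country_data
toy = data.frame(
  year    = rep(c(1952, 1957, 1962), times = 2),
  pop     = c(86, 92, 96, 33, 37, 41),
  country = rep(c("Japan", "Nigeria"), each = 3)
)

# no explicit group: the discrete color mapping defines the groups
p = ggplot(toy, aes(x = year, y = pop, color = country)) +
  geom_line()

# count the groups ggplot2 used for the line layer (one per country)
n_groups = length(unique(layer_data(p, 1)$group))
n_groups
```

Keeping group = country in the code is still harmless and makes the intent explicit.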

Facets (subplots)

What if we want to plot each country in a different subplot so that we can see each curve on its own scale? We can use facet_wrap():

ggplot(multi_country_data, aes(x = year, y = pop, group = country)) +
  geom_line() +
  facet_wrap(~ country, scales = "free")

Note that since they all share an x-axis (years), it might make sense to plot them vertically stacked so that the years line up. For this, we can use facet_grid() which allows us to arrange the plots specifically in a row or a column:

ggplot(multi_country_data, aes(x = year, y = pop, group = country)) +
  geom_line() +
  # make countries the rows
  facet_grid(country ~ ., scales = "free")

What if we want to plot one country but three different variables: lifeExp, pop, and gdpPercap? We’d basically like to group by variable, but to do that, the variable name has to be a column of the dataset. That is, we need a column whose values are “lifeExp”, “pop”, and “gdpPercap” (or some other names for these three quantities).

To do this… yes, we will pivot longer!

# the result is in long format, so we call it japan_long
japan_long = japan_data %>%
  pivot_longer(
    cols = lifeExp:gdpPercap,
    names_to = "variable",
    values_to = "value"
  )

ggplot(japan_long, aes(x = year, y = value, group = variable)) +
  geom_line() +
  facet_wrap(~ variable, scales = "free")

Again, we can use facet_grid() to stack the plots so they align by year:

ggplot(japan_long, aes(x = year, y = value, group = variable)) +
  geom_line() +
  facet_grid(variable ~ ., scales = "free")

Your Turn

What else might you want to plot from this dataset? Imagine some plots you’d like to make, then use the tools above (and further documentation online if you like, such as the ggplot gallery and the vignettes or “articles” here) to make them.