?sample
# run this line a few times
sample(1:20, 10)
# then run this line a few times
sample(1:20, 10, replace = TRUE)Day 4: Intro to probability and statistics in R
CSSS Math Camp 2024
Solving problems with simulation in R
Suppose that there are 30 people enrolled in Math Camp. How likely do you think it is that there are two people in the group who share the same birthday?
While it is possible to answer this question using probability calculations, we can answer this question using simulation in R.
We can solve the birthday problem in the following way:
- Simulate a class of 30 people by sampling a random birthday out of the 365 days of the year.
- Check whether or not any two people share the same birthday.
- Repeat this experiment a large number of times and calculate the proportion of trials for which two people share a birthday.
First, let’s try out the sample() function and answer some questions:
- What package is
sample()in? - How does the output compare when you run
sample()multiple times? - What difference does the
replaceargument make?
The sample() function allows us to sample from a list of items uniformly at random (same probability for each birthday), with or without replacement. Actually, we can also change the probabilities of drawing one birthday versus another so that some are more likely than others, in which case it would no longer be a uniform distribution. We won’t use that feature of the sample() function today, but it is often useful.
Simulation Step 1: Draw a sample
Okay, now let’s use sample() to draw a sample of random birthdays for 30 people. What is the vector we’re sampling from? What is the size of the sample we want to draw? And do we want replace = TRUE?
# class_bdays = sample(?)There are various ways we could do this, but basically we want some vector that represents the 365 possible birthdays (we’re ignoring leap years for simplicity), and we’ll use sample() to draw a birthday 30 times with replacement (since we want to allow for people to have the same birthday). If we don’t use replace = TRUE, we are ensuring that no one can have the same birthday!
class_bdays = sample(1:365, 30, replace = TRUE)Simulation Step 3: Repeat for many classes
As a start, try this manually a few times; simulate 2-3 different classes and determine whether is_repeated equals 0 or 1 for each class:
# simulate a class
class_bdays = sample(1:365, 30, replace = TRUE)
is_repeated = ifelse(sum(duplicated(class_bdays))==0, 0, 1)
# simulate another class
class_bdays = sample(1:365, 30, replace = TRUE)
is_repeated = ifelse(sum(duplicated(class_bdays))==0, 0, 1)
# simulate a third class
class_bdays = sample(1:365, 30, replace = TRUE)
is_repeated = ifelse(sum(duplicated(class_bdays))==0, 0, 1)For-loops
Notice that if you want to do this many times, you can use a for-loop. The basic structure of a for-loop is for(){}. You put the code you want to repeat inside the curly braces {}, and the directions for how many times to repeat it inside the parentheses ():
for(i in 1:3){
# simulate a class's birthdays
class_bdays = sample(1:365, 30, replace = TRUE)
# check if any birthdays were repeated
is_repeated = ifelse(sum(duplicated(class_bdays))==0, 0, 1)
print(is_repeated)
}Why use a for-loop?
Even if you have a manageable number of repeats, like three here, a great advantage of the for-loop is that you avoid errors that can happen from copying and pasting the same code multiple times. For instance, if you edit the code, you might edit some of the copies but not all of them; the for-loop avoids that and ensures that the code is exactly the same for every iteration.
If you have more than a few iterations, a for-loop is much more compact, easier to type and easier to read.
Scaling up
Next we’ll scale this up to get a better estimate of the probability. By scaling up, I mean we’re going to draw not three different classes but a lot: maybe a thousand, or ten thousand, or a hundred thousand. Rather than printing out 10,000 values of is_repeated, we probably want to store them somewhere and then compute the proportion of them that equal 1. The mean of the 0s and 1s gives you your estimate of the proportion of the time that you get a 1, i.e. the proportion of time that you expect to find at least 1 birthday shared by at least 2 people.
So let’s do that. Remember:
- We only want to put inside the for-loop the things that we want to repeat.
- Things we’re going to need to use in the for-loop have to be defined before the for-loop. - Things that will use the results of the for-loop have to be defined after the for-loop.
# let's make is_repeated an "empty" vector
is_repeated = rep(NA, 10000)
# then in each iteration, we'll fill in one of the values of is_repeated
for(i in 1:10000){
# simulate a class's birthdays
class_bdays = sample(1:365, 30, replace = TRUE)
# check if any birthdays were repeated
# AND SAVE THIS in the corresponding element of is_repeated
is_repeated[i] = ifelse(sum(duplicated(class_bdays))==0, 0, 1)
# no need to print the value
}We can make one improvement here: we can replace the 10000’s with a variable for the number of simulations and then set that once at the top. This way if we want to change the number of simulations, we don’t have to hunt for the 10000 and change it everywhere and worry we might have missed one.
nsim = 10000
# let's make is_repeated an "empty" vector
is_repeated = rep(NA, nsim)
# then in each iteration, we'll fill in one of the values of is_repeated
for(i in 1:nsim){
# simulate a class's birthdays
class_bdays = sample(1:365, 30, replace = TRUE)
# check if any birthdays were repeated
# AND SAVE THIS in the corresponding element of is_repeated
is_repeated[i] = ifelse(sum(duplicated(class_bdays))==0, 0, 1)
# no need to print the value
}Writing our own functions
R allows us to write our own functions, which is useful when we want to repeat something many times.
The basic form of a function:
my_function_name = function(input1, input2){
# code for whatever I do with my inputs
# for example, here we add the two inputs together
result = input1 + input2
# return something
return(result)
}You can also write functions that don’t return something but instead do some kind of “side-effect” such as print stuff, save a figure, or make a plot.
Let’s adapt our code from the two chunks above into a function called simulate_class_bdays that simulates a class and returns 0 or 1 depending on whether or not at least two people share the same birthday! To do this, we first want to think about these questions:
- What inputs will this function need?
- What output should it give us, and in what format?
- What code do we need to get from inputs to outputs?
# we actually don't need any inputs if we don't want to be able to change the parameters of our setup
simulate_class_bdays = function(){
class_bdays = sample(1:365, 30, replace = TRUE)
is_repeated = ifelse(sum(duplicated(class_bdays))==0, 0, 1)
return(is_repeated)
}
# if you want to be able to change the class size, we need to pass it as an input:
simulate_class_bdays = function(n.grp){
class_bdays = sample(1:365, n.grp, replace = TRUE)
is_repeated = ifelse(sum(duplicated(class_bdays))==0, 0, 1)
return(is_repeated)
}Let’s run this function a few times to test how it works.
Now let’s use it to generate 10,000 different classes! We’ll use map_dbl() to do this:
# use map_dbl (a vectorized purrr function) to run 10,000 simulations
classes_all = map_dbl(1:10000, ~simulate_class_bdays())The mean of the 0s and 1s gives you your estimate of the proportion of the time that you get a 1, i.e. the proportion of time that you expect to find at least 1 birthday shared by at least 2 people:
mean(classes_all)Try repeating the last two lines a few times and see what you notice:
classes_all = map_dbl(1:10000, ~simulate_class_bdays())
mean(classes_all)
classes_all = map_dbl(1:10000, ~simulate_class_bdays())
mean(classes_all)
classes_all = map_dbl(1:10000, ~simulate_class_bdays())
mean(classes_all)You can read more about the birthday problem here.
Sampling from distributions and generating random samples
We’ve seen how to sample from a list of items uniformly at random, with or without replacement. What if we want to draw numbers or values according to some other distribution? Can we draw a sample of normally distributed or binomially distributed values?
Yes! Several of these are already built into R, including the binomial distribution:
?rbinomLet’s run it a few times and change the parameters:
rbinom(10, 100, 0.5)Before we simulate more from the \(Binom(100, 1/2)\) distribution, let’s consider some questions:
- What is the mean or expectation of this distribution?
- What is its variance?
Now let’s make a lot of draws from the \(Binom(100, 1/2)\) distribution:
samples = rbinom(1000, 100, 0.5)
# does this match the expected number of heads? should it?
mean(samples)
# does this match the expected variance in the number of heads? should it?
var(samples)Note that the Wikipedia pages on distributions such as the binomial distribution are GREAT references. They have the definition, parameters, mean, variance, cdf, pdf/pmf, examples, applications, and more, all in once place.
Run this chunk of code a few times and then try the questions after it:
n_samples = 1000
n = 100
p = 0.5
samples = rbinom(n_samples, n, p)
hist(samples, breaks=min(samples):max(samples))Compare the result of the last line of code above to
hist(samples). What does the argumentbreaksdo?Why does the bar plot change slightly when you rerun the same code?
What function that is a property of the binomial distribution does this histogram approximate: the pmf, the pdf, or the cdf?
Try changing the number of samples and notice how the bar plot changes.
Try playing around with the values of
nandp. How do the center, spread, and shape of the distribution change? Can you find values ofnandpfor which the distribution looks more lopsided/skewed rather than symmetric?
grid = seq(35, 65, 1) # vector of values 35, 36, ..., 65
pmf = dbinom(grid, n, p) # calculates value of pmf for each grid value
samples = rbinom(n_samples, n, p)
hist(samples, prob=TRUE)
lines(grid, pmf)What parameters do other distributions take as input?
?rnorm
?rpois
?rbeta
?runifLeast squares regression
Let’s try a least squares regression using the gapminder data. Here’s some example code:
library(tidyverse)
gapminder <- read_csv("gapminder.csv")
gapmodel = lm(gdpPercap ~ pop + lifeExp, gapminder)
gapmodel # see what the lm object looks like
summary(gapmodel) # summary of an lm object is more informative
coef(gapmodel)
summary(gapmodel)$coefficientsWhat do these coefficients mean? Let’s talk about how to interpret them.
Computational efficiency and benchmarking (timing) your code
We can use benchmarking to compare the efficiency of computing the least squares estimate coefficients using lm() versus using matrix operations. We’ll check out two benchmarking packages, rbenchmark and microbenchmark.
First, let’s use the benchmark() function from the rbenchmark package to compare the time it takes to run lm() and matrix operations. Which one is faster, and by how much?
install.packages("rbenchmark")
library(rbenchmark)
?benchmark
# using benchmark:
# benchmark(
# "test1" = code you want to run, (if code is more than one line, wrap code with {})
# "test2" = code you want to run,
# "test3" = code you want to run,
# ...,
# "testn" = code you want to run,
# replications = however many you want,
# possibly other arguments
# )
# let's make up some data!
X = matrix(rnorm(1000), 100, 10)
y = X %*% sample(1:10, 10) + rnorm(100)
benchmark(
"lm" = lm(y ~ X + 0)$coef,
"matrix operations" = solve(t(X) %*% X) %*% t(X) %*% y,
replications = 1000,
columns = c("test", "replications", "elapsed", "relative")
)Another package, microbenchmark, is more accurate for very short runtimes. Let’s try it out on some fast operations like computing a square root:
install.packages("microbenchmark")
library(microbenchmark)
# for example, with simpler functions:
# generate 100 random numbers between 0 and 1
x = runif(100)
# run benchmarking to compare sqrt() and x^0.5
microbenchmark(
sqrt(x), # approach 1
x ^ 0.5 # approach 2
)