Monte Carlo Simulation

Statistical Programming with R

Packages and functions that we use

library(dplyr)    # Data manipulation
library(magrittr) # Pipes
library(stringr)  # For counting substrings
library(ggplot2)  # Plotting suite

“feel like a god at your own computer”

What is it?

With Monte Carlo methods we can compute expectations: what do we expect to happen to a particular sample quantity (such as the mean or a p-value) upon repeated sampling?

It builds upon the principles of inferential statistics and needs:

A large set of numbers (e.g. an infinite population) or a theoretical distribution
A way to sample randomly from that large set

It works by repeatedly sampling from a population while varying the input, but keeping the method consistent.
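A minimal sketch of this idea, assuming a standard normal population and the sample mean as the method. The names `n_sim`, `n` and `means`, the seed, and the chosen sizes are illustrative assumptions, not part of the original slides.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

n_sim <- 1000   # number of repeated samples (assumed)
n     <- 50     # size of each sample (assumed)

# Draw n_sim samples from a standard normal "population" and
# apply the same method (the sample mean) to every sample
means <- replicate(n_sim, mean(rnorm(n, mean = 0, sd = 1)))

mean(means)  # should be close to the population mean of 0
sd(means)    # should be close to the theoretical standard error 1 / sqrt(n)
```

The spread of `means` approximates the sampling distribution of the mean, which is the kind of expectation Monte Carlo methods are used to compute.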

Why do it?

evaluate how a statistical method performs in different situations
calculate the power of a particular test based on some (violation of) assumptions
estimate the probability of a complex event by simulating that event
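As an illustration of the second point, power can be estimated by simulating many studies and counting how often the test rejects. This is only a sketch: the group size, effect size, number of simulations and the helper `one_pvalue()` are assumptions made for the example.

```r
set.seed(2024)   # arbitrary seed, only for reproducibility

n_sim  <- 2000   # number of simulated studies (assumed)
n      <- 30     # observations per group (assumed)
effect <- 0.5    # true mean difference in SD units (assumed)

# Simulate one two-group study and return the (Welch) t-test p-value
one_pvalue <- function() {
  g1 <- rnorm(n, mean = 0)
  g2 <- rnorm(n, mean = effect)
  t.test(g1, g2)$p.value
}

pvals <- replicate(n_sim, one_pvalue())

# Estimated power: proportion of studies that reject at alpha = 0.05
power_est <- mean(pvals < 0.05)
power_est
```

For this simple setting the result can be compared with the analytic value from `power.t.test(n = 30, delta = 0.5, sd = 1)`. The simulation approach becomes more useful when assumptions are violated, for example by replacing `rnorm()` with a skewed distribution.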

Some probability intuition

Uncertainty, estimation, repeated sampling

John travels to and from work by train every day:
- In the past 10 years his train has been delayed 12 times
- \(P(\text{delay}) = \frac{12}{2\times3650} \approx 0.0016\)
Bill travels to and from work by train every week:
- In the past year his train has been delayed 50 times
- \(P(\text{delay}) = \frac{50}{2\times52} \approx 0.481\)
Claire travels by train very occasionally:
- Out of the last 3 trips, two trips were delayed
- \(P(\text{delay}) = \frac{2}{3} \approx 0.667\)

Who should be most confident about their estimate of delay probability?
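One possible way to quantify that confidence is an interval around each estimate. The sketch below uses the counts from the three examples and the exact binomial interval from `binom.test()`. Treating every trip as an independent trial with a constant delay probability is an assumption, and the object names are illustrative.

```r
# Delays and number of trips, taken from the three examples
delays <- c(John = 12, Bill = 50, Claire = 2)
trips  <- c(John = 2 * 3650, Bill = 2 * 52, Claire = 3)

p_hat <- delays / trips

# Exact (Clopper-Pearson) 95% confidence interval per traveller
ci <- t(mapply(function(x, n) binom.test(x, n)$conf.int, delays, trips))
colnames(ci) <- c("lower", "upper")

round(cbind(estimate = p_hat, ci, width = ci[, "upper"] - ci[, "lower"]), 4)
```

The `width` column shrinks as the number of observed trips grows: the interval based on three trips covers most of the range between 0 and 1, while the one based on thousands of trips is very narrow.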

Law of large numbers

Bernoulli’s Theorem (1713):

In repeated independent experiments

with the same true probability \(p\) of a particular outcome in each experiment
when repeated a large number of times
the average over the results for all these repetitions
will converge to the true probability \(p\)

So as the number of replications of the same procedure grows to infinity, the difference between our estimate and the true value converges to zero.

The experiments must be independent, i.e., the event probability in one trial does not depend on other trials.

Let’s throw some dice.

Law of large numbers

set.seed(123)
x <- sample(1:6, 10, prob = rep(1/6, 6), replace = TRUE)
prop.table(table(x))

## x
##   1   2   3   4   5   6 
## 0.3 0.1 0.1 0.2 0.2 0.1

x <- sample(1:6, 100, prob = rep(1/6, 6), replace = TRUE)
prop.table(table(x))

## x
##    1    2    3    4    5    6 
## 0.15 0.17 0.16 0.21 0.14 0.17

More samples!

x <- sample(1:6, 10000, prob = rep(1/6, 6), replace = TRUE)
prop.table(table(x))

## x
##      1      2      3      4      5      6 
## 0.1616 0.1683 0.1663 0.1713 0.1663 0.1662

x <- sample(1:6, 1000000, prob = rep(1/6, 6), replace = TRUE)
prop.table(table(x))

## x
##        1        2        3        4        5        6 
## 0.166610 0.167284 0.166763 0.166686 0.166458 0.166199
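The convergence can also be followed within a single sequence of rolls by tracking a running proportion. This is a sketch that draws its own rolls; the number of rolls, the focus on sixes and the names `rolls`, `running_prop` and `checkpoints` are assumptions for illustration.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

rolls <- sample(1:6, 1e5, replace = TRUE)

# Running proportion of sixes after each roll
n_rolls      <- seq_along(rolls)
running_prop <- cumsum(rolls == 6) / n_rolls

# Absolute error of the running estimate at a few sample sizes
checkpoints <- c(10, 100, 1000, 10000, 100000)
data.frame(n         = checkpoints,
           estimate  = running_prop[checkpoints],
           abs_error = abs(running_prop[checkpoints] - 1/6))
```

The error does not have to decrease at every single step, but it tends to shrink as the number of rolls grows.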

Estimating the probability of an event

What is the probability of rolling 1, 2, 3 in a row?

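# Collapse the one million rolls in x into a single string of digits
charx <- paste(x, collapse = "")

# Count occurrences of "123"; there are 1e6 - 2 possible starting positions,
# so dividing by 1e6 is a close approximation
estprob  <- str_count(charx, "123") / 1e6
trueprob <- (1/6)^3

cat(estprob,"\n", trueprob, sep = "")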

## 0.004635
## 0.00462963
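An alternative to counting substrings in one long sequence is to simulate many independent experiments of three rolls each. The sketch below does this and attaches a Monte Carlo standard error to the estimate. It is a different technique from the sliding-window count above, and `n_sim`, `triples`, `hit` and `mcse` are illustrative names.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

n_sim <- 1e5   # number of simulated triples (assumed)

# Each column is one independent experiment of three rolls
triples <- matrix(sample(1:6, 3 * n_sim, replace = TRUE), nrow = 3)

# TRUE where a column equals 1, 2, 3 in that order
hit <- triples[1, ] == 1 & triples[2, ] == 2 & triples[3, ] == 3

est  <- mean(hit)
mcse <- sqrt(est * (1 - est) / n_sim)  # Monte Carlo standard error

c(estimate = est, mcse = mcse, true = (1/6)^3)
```

The standard error gives a rough idea of how far the estimate may be from the true value and of how many simulations would be needed for a given precision.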