2/20/23
We want to understand something about a population.
We can never observe the entire population, so we draw a sample.
We then use a model to describe the sample.
By comparing that model to a null model, we can infer something about the population.
Here, we’re going to focus on statistical description, aka models.
We take ten cars, send each down a track, have them brake at the same point, and measure the distance it takes them to stop.
Question: how far do you think the next car will travel before it stops?
Question: what distance is the most probable?
But, how do we determine this?
Two types of random variable: discrete (countable values) and continuous (any value in an interval).
Let \(X\) be the number of heads in two tosses of a fair coin. What is the probability that \(X=1\)?
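We can check the answer by brute force, enumerating the four equally likely outcomes of two tosses. A minimal Python sketch:

```python
from itertools import product

# All equally likely outcomes of two fair-coin tosses: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

# X = number of heads; count outcomes where X equals 1
p = sum(o.count("H") == 1 for o in outcomes) / len(outcomes)
print(p)  # 0.5 -- HT and TH are two of the four outcomes
```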
A probability distribution has two components:
Central tendency, or “first moment”
Dispersion, or “second moment”
These can be defined using precise mathematical functions:
A probability mass function (PMF) for discrete random variables.
A probability density function (PDF) for continuous random variables.
Df. Bernoulli distribution: the distribution of a binary random variable (a “Bernoulli trial”) with two possible values, 1 (success) and 0 (failure), where \(p\) is the probability of success. E.g., a single coin flip.
\[f(x,p) = p^{x}(1-p)^{1-x}\]
Mean: \(p\)
Variance: \(p(1-p)\)
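A quick sketch with scipy.stats, taking \(p = 0.5\) (a fair coin) as an example value:

```python
from scipy.stats import bernoulli

p = 0.5                     # probability of success; a fair coin as an example
rv = bernoulli(p)
print(rv.pmf(1))            # P(X = 1) = p = 0.5
print(rv.mean(), rv.var())  # mean = p, variance = p(1 - p) = 0.25
```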
Df. Binomial distribution: the distribution of a random variable whose value is equal to the number of successes in \(n\) independent Bernoulli trials. E.g., the number of heads in ten coin flips.
\[f(x,p,n) = \binom{n}{x}p^{x}(1-p)^{n-x}\]
Mean: \(np\)
Variance: \(np(1-p)\)
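Same idea for the binomial, with \(n = 10\) flips of a fair coin:

```python
from scipy.stats import binom

n, p = 10, 0.5              # ten independent flips of a fair coin
rv = binom(n, p)
print(rv.pmf(5))            # P(exactly 5 heads) = C(10,5) * 0.5^10 ~ 0.246
print(rv.mean(), rv.var())  # mean = np = 5, variance = np(1 - p) = 2.5
```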
Df. Poisson distribution: the distribution of a random variable whose value is equal to the number of events occurring in a fixed interval of time or space. E.g., the number of orcs passing through the Black Gates in an hour.
\[f(x,\lambda) = \frac{\lambda^{x}e^{-\lambda}}{x!}\]
Mean: \(\lambda\)
Variance: \(\lambda\)
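And for the Poisson, assuming a hypothetical rate of \(\lambda = 3\) orcs per hour:

```python
from scipy.stats import poisson

lam = 3                     # hypothetical rate: 3 orcs per hour on average
rv = poisson(lam)
print(rv.pmf(0))            # P(no orcs in an hour) = e^(-3) ~ 0.0498
print(rv.mean(), rv.var())  # mean and variance both equal lambda
```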
Df. Normal distribution: the distribution of a continuous random variable that is symmetric about its mean and defined over the whole real line, from negative to positive infinity. E.g., the height of actors who auditioned for the role of Aragorn.
\[f(x,\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]\]
Mean: \(\mu\)
Variance: \(\sigma^2\)
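And the Normal, with made-up parameters (heights in cm):

```python
from scipy.stats import norm

mu, sigma = 180, 10         # made-up mean and sd for audition heights (cm)
rv = norm(mu, sigma)        # scipy parameterizes the Normal as (mean, sd)
print(rv.pdf(mu))           # density is highest at the mean, ~0.0399
print(rv.mean(), rv.var())  # mean = mu, variance = sigma^2 = 100
```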
Let’s use the Normal distribution to describe the cars data.
Sample statistics:
The sample mean is our approximate expectation: \[\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i\]
But there's error, \(\epsilon\), in this estimate.
The average squared error is the variance: \[s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2\]
This is our uncertainty, how big we think any given error will be.
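A minimal sketch of these statistics, using ten hypothetical stopping distances (in feet) as stand-ins for the class data:

```python
import numpy as np

# Hypothetical stopping distances (ft) for ten cars
y = np.array([22, 20, 26, 34, 17, 28, 14, 24, 29, 26])

ybar = y.mean()             # sample mean: our approximate expectation
s2 = y.var(ddof=1)          # sample variance: average squared error
s = np.sqrt(s2)             # sample standard deviation: typical error size
print(ybar, s2, s)
```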
So, here is our probability model.
\[Y \sim N(\bar{y}, s)\] This is only an estimate of \(N(\mu, \sigma)\)!
With it, we can say, for example, that the probability that a random draw from this distribution falls within one standard deviation (dashed lines) of the mean (solid line) is 68.3%.
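The 68.3% figure comes straight from the Normal CDF; after standardizing, “within one standard deviation” is the interval \([-1, 1]\) on the standard Normal:

```python
from scipy.stats import norm

# Probability a Normal draw lands within one sd of the mean
p = norm.cdf(1) - norm.cdf(-1)  # standard Normal, interval [-1, 1]
print(round(p, 3))              # 0.683
```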
This gives us a simple formula:
\[y_i = \bar{y} + \epsilon_i\]
where \(\epsilon_i\) is the error in the \(i\)-th observation.
If we subtract the mean, we have a model of the errors centered on zero:
\[\epsilon_i = 0 + (y_i - \bar{y})\]
This means we can construct a probability model of the errors centered on zero.
Note that the mean changes, but the variance stays the same.
Now our simple formula is this:
\[y_i = \bar{y} + \epsilon_i\] \[\epsilon \sim N(0, s)\]
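Putting the pieces together with the same hypothetical distances as above: subtracting \(\bar{y}\) recenters the errors on zero but leaves their spread alone:

```python
import numpy as np

y = np.array([22, 20, 26, 34, 17, 28, 14, 24, 29, 26])  # hypothetical data
ybar, s = y.mean(), y.std(ddof=1)

eps = y - ybar                          # errors: y_i = ybar + eps_i
print(np.isclose(eps.mean(), 0))        # True: errors are centered on zero
print(np.isclose(eps.std(ddof=1), s))   # True: the variance is unchanged
```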