Dr. Cheng-Han Yu Department of Mathematical and Statistical Sciences Marquette University
Why Study Probability
Probability is the study of chance, the language of uncertainty.
We could do data science without any probability involved. However, what we could learn from the data would be much more limited. Why?
Every time you collect a data set, you obtain a different one. Your data are affected by chance, by random noise!
Probability represents our uncertainty and ignorance about whether some event will happen.
Knowledge of probability is essential for data science, especially when we want to quantify uncertainty about what we learn from our data.
Probability as Relative Frequency
The probability that some outcome of a process will be obtained is the relative frequency with which that outcome would be obtained if the process were repeated a large number of times.
Example:
toss a coin: probability of getting heads 🪙
pick a ball (red/blue) in an urn: probability of getting a red ball 🔴 🔵
| Outcome | Frequency | Relative Frequency |
|---------|-----------|--------------------|
| Heads   | 4         | 0.4                |
| Tails   | 6         | 0.6                |
| Total   | 10        | 1.0                |

| Outcome | Frequency | Relative Frequency |
|---------|-----------|--------------------|
| Heads   | 512       | 0.512              |
| Tails   | 488       | 0.488              |
| Total   | 1000      | 1.000              |
If we toss the coin 10 times, the relative frequency of heads is 0.4, so we approximate the probability of heads as 40%.
If we toss it 1000 times, the approximation becomes 51.2%. The more repetitions, the closer the relative frequency tends to be to the true probability.
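The convergence of the relative frequency can be checked by simulation. Below is a minimal Python sketch with numpy; the seed and toss counts are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility

# Toss a fair coin n times; the relative frequency of heads
# approaches the true probability 0.5 as n grows.
for n in [10, 1000, 100000]:
    tosses = rng.choice(["heads", "tails"], size=n)
    rel_freq = np.mean(tosses == "heads")
    print(n, rel_freq)
```

For small n the relative frequency can be far from 0.5 (as in the 10-toss table above); for large n it settles near the true probability.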
Monte Carlo Simulation for Categorical Data
Suppose we have a bag of 5 balls, each colored red or blue.
Without peeking into the bag, how do we approximate the probability of drawing a red ball?
Monte Carlo Simulation: Repeatedly draw a ball at random, with replacement, a large number of times, and approximate the probability by the relative frequency of red draws.
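This simulation can be sketched in Python with numpy. The bag's composition below (2 red, 3 blue, so the true probability of red is 0.4) and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2025)

# Hypothetical bag: 2 red and 3 blue balls, so the true P(red) = 0.4
bag = ["red"] * 2 + ["blue"] * 3

# Draw one ball at random, a large number of times, with replacement
draws = rng.choice(bag, size=10000, replace=True)

# Relative frequency of red approximates P(red)
p_red_hat = np.mean(draws == "red")
print(p_red_hat)
```

With 10,000 draws, the estimate lands close to the true value 0.4 even though we never "peeked" at the bag's contents directly.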
```r
ggplot() +
  xlim(-5, 5) +
  geom_function(fun = dnorm, args = list(mean = 2, sd = .5), color = "blue")
```
In the `## Lab 18` section of lab.qmd,
Plot the probability function \(P(X = x)\) of \(X \sim \text{binomial}(n = 5, \pi = 0.3)\).
To use ggplot,
1. Create a data frame saving all possible values of \(x\) and their corresponding probabilities using dbinom(x, size = ___, prob = ___).
```
# A tibble: 6 × 2
      x       y
  <int>   <dbl>
1     0 0.168
2     1 0.360
3     2 0.309
4     3 0.132
5     4 0.0284
6     5 0.00243
```
2. Add geom_col()
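The lab asks for R's dbinom; as a cross-check, the same probabilities can be computed in Python with scipy.stats.binom (a sketch; the exact values match the tibble above up to rounding).

```python
import numpy as np
from scipy.stats import binom

# All possible values of X ~ binomial(n = 5, pi = 0.3)
x = np.arange(6)

# P(X = x): the Python analogue of R's dbinom(x, size = 5, prob = 0.3)
# Exact values: 0.16807, 0.36015, 0.3087, 0.1323, 0.02835, 0.00243
y = binom.pmf(x, n=5, p=0.3)

for xi, yi in zip(x, y):
    print(xi, round(yi, 5))
```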
Probability vs. Statistics
Probability: We know the process generating the data and are interested in properties of observations.
Statistics: We observe the data (sample) and are interested in determining the process that generated the data (population).
Figure 1.1 in All of Statistics (Wasserman 2003)
Terminology
Population (data-generating process): a group of subjects we are interested in studying
Sample (data): a (representative) subset of our population of interest
Parameter: an unknown, fixed numerical quantity derived from the population
Statistic: a numerical quantity derived from a sample
Common population parameters of interest and their corresponding sample statistic:
| Quantity           | Parameter    | Statistic (Point estimate) |
|--------------------|--------------|----------------------------|
| Mean               | \(\mu\)      | \(\overline{x}\)           |
| Variance           | \(\sigma^2\) | \(s^2\)                    |
| Standard deviation | \(\sigma\)   | \(s\)                      |
| Proportion         | \(p\)        | \(\hat{p}\)                |
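Each statistic in the table can be computed from a sample with numpy. The simulated data, the seed, and the "above 120" event used for the proportion are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# A simulated sample (for illustration only)
x = rng.normal(loc=120, scale=5, size=50)

x_bar = np.mean(x)        # sample mean, estimates mu
s2 = np.var(x, ddof=1)    # sample variance, estimates sigma^2 (ddof=1 gives the n-1 divisor)
s = np.std(x, ddof=1)     # sample standard deviation, estimates sigma
p_hat = np.mean(x > 120)  # sample proportion of values above 120, estimates p

print(x_bar, s2, s, p_hat)
```

Note `ddof=1` in the variance and standard deviation calls: numpy's default divisor is \(n\), while the sample statistics in the table use \(n-1\).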
\((1 - \alpha)100\%\) Confidence Interval for \(\mu\)
With \(z_{\alpha/2}\) being the \((1-\alpha/2)\) quantile of \(N(0, 1)\), the \((1 - \alpha)100\%\) confidence interval for \(\mu\) is \[\left(\overline{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}}, \,\, \overline{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)\]
What if \(\sigma\) is unknown?
The \((1 - \alpha)100\%\) confidence interval for \(\mu\) becomes \[\left(\overline{X} - t_{\alpha/2, n-1} \frac{S}{\sqrt{n}},\,\, \overline{X} + t_{\alpha/2, n-1}\frac{S}{\sqrt{n}}\right),\] where \(t_{\alpha/2, n-1}\) is the \((1-\alpha/2)\) quantile of the Student-t distribution with \(n-1\) degrees of freedom.
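The t-interval above can be computed numerically; below is a minimal Python sketch using numpy and scipy. The simulated sample, the seed, and the sample size \(n = 30\) are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=120, scale=5, size=30)  # simulated sample

alpha = 0.05
n = len(x)
x_bar = np.mean(x)
s = np.std(x, ddof=1)  # sample standard deviation S (sigma unknown)

# t_{alpha/2, n-1}: the (1 - alpha/2) quantile of t with n - 1 degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

ci = (x_bar - t_crit * s / np.sqrt(n), x_bar + t_crit * s / np.sqrt(n))
print(ci)
```

For the known-\(\sigma\) interval, replace `s` with \(\sigma\) and `t_crit` with `stats.norm.ppf(1 - alpha / 2)`.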
Interpreting a 95% Confidence Interval
We are 95% confident that blah blah blah . . .
If we were able to collect our sample data many times and build the corresponding confidence intervals, we would expect about 95% of those intervals to contain the true population parameter.
However,
We never know if in fact 95% of them do, or whether any particular interval contains the true parameter! 😱
❌ Cannot say “There is a 95% chance/probability that the true parameter is in the confidence interval.”
In practice we may only be able to collect one single data set.
95% Confidence Interval Simulation
\(X_1, \dots, X_n \sim N(\mu, \sigma^2)\) where \(\mu = 120\) and \(\sigma = 5\).
Simulate 100 CIs for \(\mu\) when \(\sigma\) is known
np.random.Generator.choice(): generates a random sample from a given array
```python
import numpy as np

bag_balls = ['red'] * 2 + ['blue'] * 3

## set a random number generator
rng = np.random.default_rng(2025)  ## R set.seed()

## sampling from bag_balls
rng.choice(bag_balls, size = 6, replace = True)  ## R sample()
```
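Putting the pieces together, the 100-interval simulation described above might look like the following sketch. Here \(\mu = 120\) and \(\sigma = 5\) come from the setup; the seed and the per-sample size \(n = 30\) are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2025)

mu, sigma, n = 120, 5, 30
alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)  # z_{alpha/2}, about 1.96

covered = 0
for _ in range(100):
    x = rng.normal(loc=mu, scale=sigma, size=n)
    x_bar = np.mean(x)
    half = z_crit * sigma / np.sqrt(n)  # sigma is known, so use the z-interval
    if x_bar - half <= mu <= x_bar + half:
        covered += 1

print(covered, "of 100 intervals contain mu")
```

Typically around 95 of the 100 intervals contain \(\mu = 120\), but any particular interval either contains it or does not, which is exactly why the "95% chance" phrasing is wrong.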