Dr. Cheng-Han Yu Department of Mathematical and Statistical Sciences Marquette University
Why Study Probability
Probability is the study of chance, the language of uncertainty.
We could do data science without any probability involved. However, what we could learn from the data would be much more limited. Why?
Every time you collect a data set, you obtain a different one. Your data are affected by chance, by random noise!
Probability represents our uncertainty and ignorance about whether some event will happen.
Knowledge of probability is essential for data science, especially when we want to quantify uncertainty about what we learn from our data.
Probability as Relative Frequency
The probability that some outcome of a process will be obtained is the relative frequency with which that outcome would be obtained if the process were repeated a large number of times.
Example:
toss a coin: probability of getting heads 🪙
pick a ball (red/blue) in an urn: probability of getting a red ball 🔴 🔵
| Outcome | Frequency | Relative Frequency |
|---------|-----------|--------------------|
| Heads   | 4         | 0.4                |
| Tails   | 6         | 0.6                |
| Total   | 10        | 1.0                |

| Outcome | Frequency | Relative Frequency |
|---------|-----------|--------------------|
| Heads   | 512       | 0.512              |
| Tails   | 488       | 0.488              |
| Total   | 1000      | 1.000              |
If we toss the coin 10 times, the relative frequency of heads is 0.4, so we approximate the probability of heads as 40%.
If we toss it 1000 times, the approximation becomes 51.2%. The more repetitions, the closer the relative frequency tends to be to the true probability.
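The convergence of the relative frequency can be checked by simulation. Below is a minimal Python sketch with numpy; the seed and toss counts are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility

# Toss a fair coin n times; the relative frequency of heads
# approaches the true probability 0.5 as n grows.
for n in [10, 1000, 100000]:
    tosses = rng.choice(["heads", "tails"], size=n)
    rel_freq = np.mean(tosses == "heads")
    print(n, rel_freq)
```

For small n the relative frequency can be far from 0.5 (as in the 10-toss table above); for large n it settles near the true probability.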
Monte Carlo Simulation for Categorical Data
Suppose we have a bag of 5 balls, each colored red or blue.
Without peeking into the bag, how do we approximate the probability of drawing a red ball?
Monte Carlo Simulation: Repeatedly draw a ball at random, with replacement, a large number of times, and approximate the probability by the relative frequency of red draws.
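This simulation can be sketched in Python with numpy. The bag's composition below (2 red, 3 blue, so the true probability of red is 0.4) and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2025)

# Hypothetical bag: 2 red and 3 blue balls, so the true P(red) = 0.4
bag = ["red"] * 2 + ["blue"] * 3

# Draw one ball at random, a large number of times, with replacement
draws = rng.choice(bag, size=10000, replace=True)

# Relative frequency of red approximates P(red)
p_red_hat = np.mean(draws == "red")
print(p_red_hat)
```

With 10,000 draws, the estimate lands close to the true value 0.4 even though we never "peeked" at the bag's contents directly.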
```r
ggplot() +
  xlim(-5, 5) +
  geom_function(fun = dnorm, args = list(mean = 2, sd = .5), color = "blue")
```
In the `## Lab 18` section of lab.qmd,
Plot the probability function \(P(X = x)\) of \(X \sim \text{binomial}(n = 5, \pi = 0.3)\).
To use ggplot,
1. Create a data frame saving all possible values of \(x\) and their corresponding probabilities using dbinom(x, size = ___, prob = ___).
```
# A tibble: 6 × 2
      x       y
  <int>   <dbl>
1     0 0.168
2     1 0.360
3     2 0.309
4     3 0.132
5     4 0.0284
6     5 0.00243
```
2. Add geom_col()
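The lab asks for R's dbinom; as a cross-check, the same probabilities can be computed in Python with scipy.stats.binom (a sketch; the exact values match the tibble above up to rounding).

```python
import numpy as np
from scipy.stats import binom

# All possible values of X ~ binomial(n = 5, pi = 0.3)
x = np.arange(6)

# P(X = x): the Python analogue of R's dbinom(x, size = 5, prob = 0.3)
# Exact values: 0.16807, 0.36015, 0.3087, 0.1323, 0.02835, 0.00243
y = binom.pmf(x, n=5, p=0.3)

for xi, yi in zip(x, y):
    print(xi, round(yi, 5))
```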
Probability vs. Statistics
Probability: We know the process generating the data and are interested in properties of observations.
Statistics: We observe the data (sample) and are interested in determining the process that generated the data (population).
Figure 1.1 in All of Statistics (Wasserman 2003)
Terminology
Population (data-generating process): a group of subjects we are interested in studying
Sample (data): a (representative) subset of our population of interest
Parameter: an unknown, fixed numerical quantity derived from the population
Statistic: a numerical quantity derived from a sample
Common population parameters of interest and their corresponding sample statistic:
| Quantity           | Parameter    | Statistic (Point estimate) |
|--------------------|--------------|----------------------------|
| Mean               | \(\mu\)      | \(\overline{x}\)           |
| Variance           | \(\sigma^2\) | \(s^2\)                    |
| Standard deviation | \(\sigma\)   | \(s\)                      |
| Proportion         | \(p\)        | \(\hat{p}\)                |
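Each statistic in the table can be computed from a sample with numpy. The simulated data, the seed, and the "above 120" event used for the proportion are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# A simulated sample (for illustration only)
x = rng.normal(loc=120, scale=5, size=50)

x_bar = np.mean(x)        # sample mean, estimates mu
s2 = np.var(x, ddof=1)    # sample variance, estimates sigma^2 (ddof=1 gives the n-1 divisor)
s = np.std(x, ddof=1)     # sample standard deviation, estimates sigma
p_hat = np.mean(x > 120)  # sample proportion of values above 120, estimates p

print(x_bar, s2, s, p_hat)
```

Note `ddof=1` in the variance and standard deviation calls: numpy's default divisor is \(n\), while the sample statistics in the table use \(n-1\).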
\((1 - \alpha)100\%\) Confidence Interval for \(\mu\)
With \(z_{\alpha/2}\) being the \((1-\alpha/2)\) quantile of \(N(0, 1)\), the \((1 - \alpha)100\%\) confidence interval for \(\mu\) is \[\left(\overline{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}}, \,\, \overline{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)\]
What if \(\sigma\) is unknown?
The \((1 - \alpha)100\%\) confidence interval for \(\mu\) becomes \[\left(\overline{X} - t_{\alpha/2, n-1} \frac{S}{\sqrt{n}},\,\, \overline{X} + t_{\alpha/2, n-1}\frac{S}{\sqrt{n}}\right),\] where \(t_{\alpha/2, n-1}\) is the \((1-\alpha/2)\) quantile of the Student-t distribution with \(n-1\) degrees of freedom.
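The t-interval above can be computed numerically; below is a minimal Python sketch using numpy and scipy. The simulated sample, the seed, and the sample size \(n = 30\) are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=120, scale=5, size=30)  # simulated sample

alpha = 0.05
n = len(x)
x_bar = np.mean(x)
s = np.std(x, ddof=1)  # sample standard deviation S (sigma unknown)

# t_{alpha/2, n-1}: the (1 - alpha/2) quantile of t with n - 1 degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

ci = (x_bar - t_crit * s / np.sqrt(n), x_bar + t_crit * s / np.sqrt(n))
print(ci)
```

For the known-\(\sigma\) interval, replace `s` with \(\sigma\) and `t_crit` with `stats.norm.ppf(1 - alpha / 2)`.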
Interpreting a 95% Confidence Interval
We are 95% confident that blah blah blah . . .
If we were able to collect our sample data many times and build the corresponding confidence intervals, we would expect about 95% of those intervals to contain the true population parameter.
However,
We never know if in fact 95% of them do, or whether any particular interval contains the true parameter! 😱
❌ Cannot say “There is a 95% chance/probability that the true parameter is in the confidence interval.”
In practice we may only be able to collect one single data set.
95% Confidence Interval Simulation
\(X_1, \dots, X_n \sim N(\mu, \sigma^2)\) where \(\mu = 120\) and \(\sigma = 5\).
Simulate 100 CIs for \(\mu\) when \(\sigma\) is known
np.random.Generator.choice(): generates a random sample from a given array
```python
import numpy as np

bag_balls = ['red'] * 2 + ['blue'] * 3

## set a random number generator
rng = np.random.default_rng(2025)  ## R set.seed()

## sampling from bag_balls
rng.choice(bag_balls, size = 6, replace = True)  ## R sample()
```
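Putting the pieces together, the 100-interval simulation described above might look like the following sketch. Here \(\mu = 120\) and \(\sigma = 5\) come from the setup; the seed and the per-sample size \(n = 30\) are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2025)

mu, sigma, n = 120, 5, 30
alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)  # z_{alpha/2}, about 1.96

covered = 0
for _ in range(100):
    x = rng.normal(loc=mu, scale=sigma, size=n)
    x_bar = np.mean(x)
    half = z_crit * sigma / np.sqrt(n)  # sigma is known, so use the z-interval
    if x_bar - half <= mu <= x_bar + half:
        covered += 1

print(covered, "of 100 intervals contain mu")
```

Typically around 95 of the 100 intervals contain \(\mu = 120\), but any particular interval either contains it or does not, which is exactly why the "95% chance" phrasing is wrong.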