Fall 2025

Probability and statistics with applications to Machine Learning

Course Syllabus

Course description:

Deep learning has transformed the field of Machine Learning over the last decades. Though inspired by attempts to mimic the way the brain learns, it is heavily grounded in basic principles of probability and statistics. This introductory course presents these basic principles and discusses them in context of several 'classical' ML tasks.

After this course the students should be able to:

Visualize and summarize data, think and reason about outliers and relationships.
Adopt probabilistic reasoning.
Solve practical problems using probability and statistics.
Intuitively understand mathematical underpinnings of major machine learning algorithms.

Prerequisites

Discrete Structures for CS (CS 0441) or Discrete Math 2 (CS 0406)
Intro to matrices and Linear Algebra (MATH 0280) or Linear Algebra 1 (MATH 1180)

Course content

Introduction

Defining ML. Math for ML

00.Introduction.pdf

What is learning? What is Machine Learning? Learning from previous experiences. Supervised/unsupervised learning. Examples of learning tasks: classification, regression, clustering, associations.

What kind of math do we need for ML? Course preview.

Linear algebra – to describe data as vectors in multi-dimensional spaces.
Probability and statistics – to reason probabilistically.
Calculus – to learn model parameters (Learning = optimization).

1. Data basics

01.Data Basics.pdf

Concept of population and population parameters. Population parameters vs. sample statistics. Concept of statistical inference. Sampling methods. Observational studies vs. statistical experiments.

2. Data sets

02. Datasets.pdf

Dataset: a sample from a population.

Dataset structure:

Attributes=dimensions: concept of a multi-dimensional space.
Tuples=records=observations: concept of data points in space.

Attribute types:

Nominal (categorical)
Ordinal
Numeric: discrete, continuous

Converting tuples into numeric vectors. Handling ordinal data. Handling categorical data: one hot encoding.

3. Describing datasets

03. Describing and visualizing.pdf

Visualization:

Frequency tables and Histograms
Bar charts

Numeric statistics:

Measures of centrality: Mean, Median, Mode.
Measures of dispersion: Range, Variance, Standard deviation.
Measures of position. Dispersion of values around the mean. Chebyshev’s theorem.
Measures of position. Quartiles. 5-number summary. Using quartiles for outlier detection. Box plots.

Part I. Probability and statistics

4. Discrete probability

04.Probability.pdf

Quantifying uncertainty. Experiments with finite number of outcomes. Sample space. Events as sets of outcomes and subsets of the sample space.

Probability – defined as the number of desired outcomes divided by the total number of outcomes in the sample space.

5. Counting number of outcomes

05.Combinatorics.pdf

Using combinatorics to count the total number of outcomes. Combinations and permutations.

Addition rule and complements. Multiplication rule and independence.

6. Conditional probability and Bayes theorem

06. Conditional ProbabilityBayes.pdf

Definition of conditional probability. Probabilistic independence. Independent and mutually exclusive events. Joint probability of two events.

Bayesian reasoning: inductive reasoning with probabilities. Updating odds with evidence. Bayes theorem.

7. Random variables and discrete probability distributions

07.Discrete probability distributions.pdf

Concept of a Random variable. Distribution for the possible values of a random variable. Expected value of a random variable: weighted average. Variance of a random variable.

Discrete probability distributions. Probability mass function.

Sample discrete distributions:

Bernoully
Binomial
Poisson

8. Continuous probability distributions

08. Continuous probability distribution.pdf

Continuous distributions:

Normal distribution (Gaussian, z-distribution).
Student distribution (t-distribution).

Probability density function vs. probability mass function.

Mean and standard deviation of a continuous probability distribution.

Standard normal distribution. Using statistical tables to compute the number of standard deviations from the mean.

9. Case study 1: Naive Bayes classifier

09. Naive Bayes Classifier.pdf

Comparing the odds of mutually exclusive events: prior probabilities.

Bayesian reasoning: updating odds using new evidence.

Concept of Conditional independence. Modeling conditionally independent events.

The task of classification. Naive Bayes classifier.

10. Case study 1A: Bayesian belief networks

10.BayesianBeliefnetworks.pdf

Modeling probabilistic dependencies between events using Bayesian belief networks. Queries over the network. Comparing joint probabilities.

11. Case study 1B: Gaussian Naive Bayes

11. GaussianNaiveBayes.pdf

Using probability density function of a numeric attribute to discriminate between classes. Gaussian Naive Bayes.

Classification using a combination of probabilities for categorical and numeric attributes under assumption of conditional independence.

12. Confidence intervals

12. Inference.ConfidenceIntervals.pdf

Inferring population parameters from sample statistics. Central Limit Theorem: distribution of sample statistics.

Concept of a confidence interval. Level of confidence.

Confidence interval for a population mean computed from sample data.

Sample proportion as an outcome of Bernoulli trial. Confidence interval for a population proportion.

13. Case study 2: classifier performance

13.ClassifierPerformance.pdf

Measuring classifier performance. Error rate/accuracy. Training and testing sets. Holdout estimation. K-fold cross-validation.

Computing confidence intervals for classifier performance based on accuracy of a test set.

14. Hypothesis testing

14. HypothesisTesting.pdf

Statistical hypothesis testing. Formulating null hypothesis and alternative hypothesis. Left-, Right- and Two-tailed tests.

Confidence and significance. Hypothesis testing using rejection regions.

Large samples (z-distribution) vs. small samples (t-distribution).

Hypotheses about population mean. Hypotheses about population proportion.

15. More hypothesis testing. P-values

15. MoreHypothesisTesting.pdf

P-values and the level of significance. Hypothesis testing using p-values.

Hypotheses about comparing two population means (independent samples).

Hypotheses about two proportions (independent samples).

Hypotheses about paired data (not independent samples).

16. Case study 2A: comparing performance of classifiers

16. Statistical comparison of classifiers.pdf

Statistically comparing performance of two classifiers using Hypothesis testing for two proportions.

Part II. Data in space

17. Data as vectors in multi-dimensional space

17.DistanceMetrics..pdf

Review of vectors. Geometric interpretation of vector operations. Data as feature vectors. Measuring proximity between data records:

Euclidean distance.
Cosine similarity.
Matching coefficients and Tanimoto (Jaccard).

When to use each.

18. Case study 3: Nearest Neighbor classifier

18.NearestNeighbors.pdf

Memory-based reasoning. Nearest neighbors. The k-Nearest Neighbor classifier (k-NN). Distance and combination functions. The best value of k.

19. Case study 3A: Recommender systems

19. Recommender systems.pdf

Recommendations from nearest neighbors.

Part III. Learning parameters using partial derivatives

20. Case study 4: Linear Regression

20.Regression.pdf

Predicting numeric outcomes. Linear regression. Fitting line (model) to data using the least-squares model. Optimization problem: minimizing the squared loss function. Learning line parameters using stochastic gradient descent.

21. Case study 5: Neural Nets

21.Neural networks introduction.pdf

Artificial Neural networks. Neuron: metaphor. Activation threshold. Signal summation. Sigmoidal activation function. Bias node. Learning network parameters from data by minimizing prediction error using gradient descent.

Single-layer perceptron: linearly-separable classes.

Multi-layer perceptron. Importance of non-linearity.

Forward algorithm and backpropagation using partial derivatives.