Course Syllabus
Deep learning has transformed the field of Machine Learning over the last few decades. Though inspired by attempts to mimic the way the brain learns, it is heavily grounded in basic principles of probability and statistics. This introductory course presents these principles and discusses them in the context of several 'classical' ML tasks.
After this course, students should be able to:
What is learning? What is Machine Learning? Learning from previous experiences. Supervised/unsupervised learning. Examples of learning tasks: classification, regression, clustering, associations.
What kind of math do we need for ML? Course preview.
Concept of population and population parameters. Population parameters vs. sample statistics. Concept of statistical inference. Sampling methods. Observational studies vs. statistical experiments.
Dataset: a sample from a population.
Dataset structure:
Attribute types:
Converting tuples into numeric vectors. Handling ordinal data. Handling categorical data: one-hot encoding.
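To make one-hot encoding concrete, here is a minimal Python sketch; the attribute values are made-up examples, not course data.

```python
def one_hot(value, categories):
    """Map a categorical value to a 0/1 indicator vector."""
    return [1 if value == category else 0 for category in categories]

colors = ["red", "green", "blue"]
print(one_hot("green", colors))  # [0, 1, 0]
```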
03. Describing and visualizing.pdf
Visualization:
Numeric statistics:
Quantifying uncertainty. Experiments with finite number of outcomes. Sample space. Events as sets of outcomes and subsets of the sample space.
Classical definition of probability: the number of favorable outcomes divided by the total number of outcomes in the sample space.
Using combinatorics to count the total number of outcomes. Combinations and permutations.
Addition rule and complements. Multiplication rule and independence.
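A small worked example of the classical definition, counting outcomes with combinations (a sketch assuming a standard 52-card deck; math.comb requires Python 3.8+):

```python
from math import comb

# P(both cards are aces when drawing 2 cards from a 52-card deck)
favorable = comb(4, 2)    # ways to choose 2 of the 4 aces
total = comb(52, 2)       # ways to choose any 2 of the 52 cards
print(favorable / total)  # 1/221, about 0.0045
```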
06. Conditional ProbabilityBayes.pdf
Definition of conditional probability. Probabilistic independence. Independent and mutually exclusive events. Joint probability of two events.
Bayesian reasoning: inductive reasoning with probabilities. Updating odds with evidence. Bayes' theorem.
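As a worked illustration of updating a prior with evidence, here is a minimal sketch of Bayes' theorem for a diagnostic test; the prevalence and error rates are invented numbers.

```python
prior = 0.01   # P(disease)
sens = 0.95    # P(positive | disease)
fpr = 0.05     # P(positive | no disease)

evidence = prior * sens + (1 - prior) * fpr  # P(positive), total probability
posterior = prior * sens / evidence          # P(disease | positive)
print(round(posterior, 3))                   # ~0.161: far from certain
```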
07.Discrete probability distributions.pdf
Concept of a random variable. Distribution over the possible values of a random variable. Expected value of a random variable as a weighted average. Variance of a random variable.
Discrete probability distributions. Probability mass function.
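To illustrate expected value and variance for a discrete distribution, a minimal sketch with an invented probability mass function:

```python
pmf = {0: 0.2, 1: 0.5, 2: 0.3}  # P(X = x) for each value x

mean = sum(x * p for x, p in pmf.items())                    # weighted average
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
print(mean, variance)                                        # 1.1 0.49
```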
Sample discrete distributions:
08. Continuous probability distribution.pdf
Continuous distributions:
Probability density function vs. probability mass function.
Mean and standard deviation of a continuous probability distribution.
Standard normal distribution. Z-scores as the number of standard deviations from the mean; using statistical tables to convert z-scores into probabilities.
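In place of printed tables, Python's statistics.NormalDist (3.8+) computes the same standard-normal quantities; a minimal sketch with illustrative numbers:

```python
from statistics import NormalDist

z = NormalDist()         # standard normal: mean 0, standard deviation 1
print(z.cdf(1.96))       # P(Z <= 1.96), about 0.975
print(z.inv_cdf(0.975))  # z-score with 97.5% of the mass below it, ~1.96

# Number of standard deviations a raw score lies from the mean:
x, mu, sigma = 130, 100, 15
print((x - mu) / sigma)  # z = 2.0
```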
9. Case study 1: Naive Bayes classifier
09. Naive Bayes Classifier.pdf
Comparing the odds of mutually exclusive events: prior probabilities.
Bayesian reasoning: updating odds using new evidence.
Concept of conditional independence. Modeling conditionally independent events.
The task of classification. Naive Bayes classifier.
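A minimal sketch of a categorical Naive Bayes classifier (no smoothing); the toy weather records and class labels are invented, not course data.

```python
from collections import Counter, defaultdict

data = [({"outlook": "sunny", "windy": "no"}, "play"),
        ({"outlook": "rainy", "windy": "yes"}, "stay"),
        ({"outlook": "sunny", "windy": "yes"}, "play"),
        ({"outlook": "rainy", "windy": "no"}, "stay")]

class_counts = Counter(label for _, label in data)
value_counts = defaultdict(Counter)  # (label, attribute) -> value counts
for record, label in data:
    for attribute, value in record.items():
        value_counts[(label, attribute)][value] += 1

def classify(record):
    # Score each class by P(class) * product of P(value | class).
    def score(label):
        p = class_counts[label] / len(data)
        for attribute, value in record.items():
            p *= value_counts[(label, attribute)][value] / class_counts[label]
        return p
    return max(class_counts, key=score)

print(classify({"outlook": "sunny", "windy": "yes"}))  # play
```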
10. Case study 1A: Bayesian belief networks
Modeling probabilistic dependencies between events using Bayesian belief networks. Queries over the network. Comparing joint probabilities.
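A minimal sketch of a joint-probability query over a two-node belief network (Rain -> WetGrass); all probabilities are invented for illustration.

```python
p_rain = 0.2
p_wet_given_rain = {True: 0.9, False: 0.1}  # P(wet grass | rain)

def joint(rain, wet):
    # Chain rule along the network's edges: P(rain) * P(wet | rain).
    pr = p_rain if rain else 1 - p_rain
    pw = p_wet_given_rain[rain] if wet else 1 - p_wet_given_rain[rain]
    return pr * pw

# Comparing the joint probabilities of two complete assignments:
print(joint(True, True), joint(False, True))  # 0.18 vs 0.08
```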
11. Case study 1B: Gaussian Naive Bayes
Using the probability density function of a numeric attribute to discriminate between classes. Gaussian Naive Bayes.
Classification using a combination of probabilities for categorical and numeric attributes under the assumption of conditional independence.
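A minimal sketch of the class-conditional Gaussian density at the heart of Gaussian Naive Bayes; the per-class means and standard deviations are invented.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Likelihood of an observed height of 170 under two candidate classes:
print(gaussian_pdf(170, mu=165, sigma=6))  # class A
print(gaussian_pdf(170, mu=180, sigma=7))  # class B
```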
12. Inference.ConfidenceIntervals.pdf
Inferring population parameters from sample statistics. Central Limit Theorem: the approximately normal sampling distribution of the sample mean.
Concept of a confidence interval. Level of confidence.
Confidence interval for a population mean computed from sample data.
Sample proportion as the fraction of successes in repeated Bernoulli trials. Confidence interval for a population proportion.
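A minimal sketch of large-sample (z-based) 95% confidence intervals for a mean and for a proportion; the sample values are illustrative, and for small samples the t-distribution would be used instead.

```python
from statistics import NormalDist, mean, stdev

z = NormalDist().inv_cdf(0.975)  # ~1.96 for 95% confidence

sample = [4.1, 5.0, 4.7, 5.3, 4.9, 5.1, 4.6, 5.2]
m = mean(sample)
se = stdev(sample) / len(sample) ** 0.5
print(m - z * se, m + z * se)    # interval for the population mean

successes, n = 42, 100           # e.g. 42 successes in 100 Bernoulli trials
p = successes / n
se_p = (p * (1 - p) / n) ** 0.5
print(p - z * se_p, p + z * se_p)  # interval for the population proportion
```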
13. Case study 2: classifier performance
Measuring classifier performance. Error rate/accuracy. Training and testing sets. Holdout estimation. K-fold cross-validation.
Computing confidence intervals for classifier performance based on accuracy measured on a test set.
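For instance, a normal-approximation interval for test-set accuracy might be sketched as follows (the counts are made up):

```python
from statistics import NormalDist

correct, n = 87, 100             # test-set results
accuracy = correct / n
z = NormalDist().inv_cdf(0.975)  # 95% confidence
se = (accuracy * (1 - accuracy) / n) ** 0.5
print(f"accuracy {accuracy:.2f} +/- {z * se:.3f}")
```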
Statistical hypothesis testing. Formulating the null and alternative hypotheses. Left-, right-, and two-tailed tests.
Confidence and significance. Hypothesis testing using rejection regions.
Large samples (z-distribution) vs. small samples (t-distribution).
Hypotheses about population mean. Hypotheses about population proportion.
P-values and the level of significance. Hypothesis testing using p-values.
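A minimal sketch of a two-tailed z-test about a population mean, reporting its p-value; the population standard deviation is assumed known and all numbers are illustrative.

```python
from statistics import NormalDist

mu0, sigma, n = 100, 15, 36  # H0: mu = 100, known sigma, sample size 36
sample_mean = 106

z = (sample_mean - mu0) / (sigma / n ** 0.5)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(z, p_value)            # z = 2.4, p ~ 0.016: reject H0 at the 0.05 level
```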
Hypotheses about comparing two population means (independent samples).
Hypotheses about two proportions (independent samples).
Hypotheses about paired data (not independent samples).
16. Case study 2A: comparing performance of classifiers
16. Statistical comparison of classifiers.pdf
Statistically comparing the performance of two classifiers using hypothesis testing for two proportions.
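A minimal sketch of that comparison as a two-proportion z-test, assuming the two classifiers were evaluated on independent test sets; the counts are invented.

```python
from statistics import NormalDist

c1, n1 = 87, 100  # classifier A: correct predictions / test-set size
c2, n2 = 78, 100  # classifier B
p1, p2 = c1 / n1, c2 / n2

pooled = (c1 + c2) / (n1 + n2)
se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
z = (p1 - p2) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(z, p_value)  # if p_value < 0.05, the accuracy difference is significant
```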
Review of vectors. Geometric interpretation of vector operations. Data as feature vectors. Measuring proximity between data records: distance and similarity measures, and when to use each.
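A minimal sketch of three common proximity measures between feature vectors; the vectors are made-up examples.

```python
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norms

u, v = [1.0, 2.0, 3.0], [2.0, 2.0, 1.0]
print(euclidean(u, v), manhattan(u, v), cosine_similarity(u, v))
```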
18. Case study 3: Nearest Neighbor classifier
Memory-based reasoning. Nearest neighbors. The k-Nearest Neighbor classifier (k-NN). Distance and combination functions. Choosing the best value of k.
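A minimal k-NN sketch with Euclidean distance and a majority vote as the combination function; the training points and labels are invented.

```python
from collections import Counter
from math import sqrt

train = [([1.0, 1.0], "a"), ([1.2, 0.8], "a"),
         ([4.0, 4.2], "b"), ([3.8, 4.0], "b")]

def knn(query, k=3):
    # Sort training records by distance to the query, then vote.
    by_distance = sorted(train, key=lambda rec: sqrt(
        sum((q - x) ** 2 for q, x in zip(query, rec[0]))))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn([1.1, 0.9]))  # "a"
```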
19. Case study 3A: Recommender systems
Recommendations from nearest neighbors.
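One possible reading of this idea, sketched with invented data: find the most similar user and suggest items they rated that the target user has not seen.

```python
ratings = {"ann":  {"m1": 5, "m2": 4, "m3": 1},
           "bob":  {"m1": 5, "m2": 5, "m3": 2, "m4": 4},
           "carl": {"m1": 1, "m2": 2, "m4": 5}}

def similarity(u, v):
    shared = ratings[u].keys() & ratings[v].keys()
    # Higher is better: negative summed squared rating difference.
    return -sum((ratings[u][i] - ratings[v][i]) ** 2 for i in shared)

def recommend(user):
    neighbor = max((v for v in ratings if v != user),
                   key=lambda v: similarity(user, v))
    return [i for i in ratings[neighbor] if i not in ratings[user]]

print(recommend("ann"))  # ["m4"], borrowed from nearest neighbor "bob"
```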
20. Case study 4: Linear Regression
Predicting numeric outcomes. Linear regression. Fitting a line (model) to data using the least-squares method. Optimization problem: minimizing the squared loss function. Learning the line's parameters using stochastic gradient descent.
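A minimal sketch of fitting y = w*x + b by stochastic gradient descent on the squared loss; the data, learning rate, and epoch count are illustrative choices.

```python
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]  # roughly y = 2x

w, b, lr = 0.0, 0.0, 0.01
for epoch in range(1000):
    for x, y in data:            # one update per training example
        error = (w * x + b) - y  # gradient of 0.5 * error**2
        w -= lr * error * x      # dLoss/dw = error * x
        b -= lr * error          # dLoss/db = error
print(w, b)  # converges near the least-squares fit (~1.94, ~0.15)
```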
21. Case study 5: Neural Nets
21.Neural networks introduction.pdf
Artificial neural networks. The neuron as a metaphor. Signal summation and activation threshold. Sigmoidal activation function. Bias node. Learning network parameters from data by minimizing prediction error using gradient descent.
Single-layer perceptron: linearly-separable classes.
Multi-layer perceptron. Importance of non-linearity.
Forward pass and backpropagation using partial derivatives (the chain rule).
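To make the forward pass and the backpropagated partial derivatives concrete, a minimal NumPy sketch of a one-hidden-layer perceptron trained on a single invented example (the layer sizes, weights, and learning rate are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])  # one input vector
t = np.array([1.0])        # its target output
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)  # hidden layer (3 units)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)  # output layer

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for step in range(200):
    # Forward pass.
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    # Backward pass: chain rule on the squared error 0.5 * (y - t)**2.
    delta2 = (y - t) * y * (1 - y)
    delta1 = (W2.T @ delta2) * h * (1 - h)
    W2 -= 0.5 * np.outer(delta2, h)
    b2 -= 0.5 * delta2
    W1 -= 0.5 * np.outer(delta1, x)
    b1 -= 0.5 * delta1
print(y.item())  # approaches the target 1.0
```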