notes

Statistical Machine Learning

Instructors - Shaowei, Nengli

Course Overview

| Component | Weightage (%) |
| --- | --- |
| Midterms | 20 |
| Finals | 30 |
| Participation | 10 |
| Homework | 40 |

Shaowei’s background

Resources

Textbooks

Textbook sections

| Week | Topic | Book | Chapter |
| --- | --- | --- | --- |
| 7a | Kullback-Leibler divergence | Bishop | 1.6 |
| 7b | Collaborative Filtering | ? | ? |
| 7c | Gaussian Mixtures | Bishop | 9.2 |
| 8a | Information Theory | Bishop | 1.6 |
| 8b | Clustering | Bishop | 9.1 |
| 9a | Principal Component Analysis | Bishop | 12.1 |
| 9b | Time Series | Tsay | 2.1 - 2.6 |
| 10 | Gaussian Processes | Bishop | 6.4 |
| 11 | Hidden Markov Models | Bishop | 13 |
| 12 | Multi-armed Bandit | Sutton | 1-2 |
| 13 | Markov decision processes | ? | ? |

Development Environment

conda create -n sml python=3.7
conda activate sml
conda install numpy scipy matplotlib scikit-learn -y
conda install nb_conda jupyter ipykernel -y
pip install gym  # include OpenAI Gym as well (available on PyPI)

Introduction

Machine Learning

Learning Frameworks

Pipeline

Decision Theory

Optimization

Regression

Model Selection

Regularization

Classification

Logistic Regression

Softmax Regression

Neural Networks

Feedforward Networks

Backpropagation

Feature Engineering

Convolutional Networks

Recurrent Neural Networks

Kernel Methods

Maximum Margins

Duality

Comment: As the two strategies are explained side-by-side, it was quite hard to follow. Moreover, you are mixing primal-dual (from optimisation) and Lagrangian (from the systems world).

Support Vector Machines

Kernels

Kernelization

Graphical Models

Probability

Graph Theory

Bayesian Networks (Directed Graphical Models)

Markov Random Fields (Undirected Graphical Models)

Latent variables

Kullback-Leibler Divergence

Recommender Systems

Please draw the matrix and label the dimensions.

Gaussian Mixtures

Expectation Maximization

Information Theory

(Chapter 1.6 of Bishop)

Unsupervised learning

Application

Clustering

(When to use medoids rather than centroids - you cannot show the centroid of n webpages)

Optimisation

Screenshot 2020-03-29 at 2.04.50 AM

Screenshot 2020-03-29 at 2.04.55 AM

Coordinate descent

Screenshot 2020-03-29 at 2.26.42 AM

K-means

Screenshot 2020-03-29 at 2.26.11 AM

(Not K-nearest neighbours (KNN) - which does not have a training phase, and it is a supervised method)
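For concreteness, a minimal K-means sketch using scikit-learn on synthetic blob data (the dataset and parameter values here are illustrative, not from the lecture):

```python
# Minimal K-means sketch with scikit-learn on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)        # cluster assignment for each point
centroids = km.cluster_centers_   # learned centroids (3 x 2 array)
```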

Other clustering methods (FYI)

Principal Component Analysis

Transform into features that are uncorrelated with each other and ordered by the variance they explain.

PCA is a linear dimensionality reduction method. (Manifold learning is non-linear dimensionality reduction.)

One potential drawback is that such a transformation would lead to a loss in interpretability of the data.

Singular value decomposition

Screenshot 2020-03-29 at 2.49.59 AM

We assume that $n \gg p$

The feature vectors are the columns of $U$. The feature vectors have a norm of one and are orthogonal to each other.

What is $V$?

Please understand how the information is conserved.

SVD Procedure

Positive semidefinite - all eigenvalues are non-negative (and real).

Screenshot 2020-03-29 at 5.45.07 AM

Screenshot 2020-03-29 at 5.44.50 AM

Things to note

Principal Component Analysis

Screenshot 2020-03-29 at 5.56.57 AM

The design matrix is the data, with one sample per row. We need to normalise each column first.

The result is a matrix $XV$ with dimension reduced from $n \times p$ to $n \times L$.

Screenshot 2020-04-12 at 5.16.07 AM
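A minimal numpy sketch of PCA via SVD, following the recipe above: normalise the columns of the design matrix, take the SVD, and project onto the first $L$ columns of $V$ (the toy data and the choice of $L$ are assumptions):

```python
# PCA via SVD: X is n x p with one sample per row, L components kept.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # toy data, n=100, p=5
L = 2

Xc = (X - X.mean(axis=0)) / X.std(axis=0)  # normalise each column
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

V = Vt.T                                   # columns are principal directions
X_reduced = Xc @ V[:, :L]                  # n x L, i.e. the matrix XV
explained = S**2 / np.sum(S**2)            # fraction of variance per component
```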

Why does PCA work

Time Series

Not useful for stock returns modelling, but a good tool to measure volatility.

Stochastic processes

A stochastic process $\lbrace X_t \rbrace_{t \in T}$ is a collection of random variables indexed by some index set $T$ and governed by some joint distribution $P$. The random variables $X_t$ take on values in some state space $\mathcal{S}$.

$T$ and $\mathcal{S}$ could be discrete or continuous.

For a time series, $\mathcal{S} = \mathbb{R}$, and $T = \lbrace 0,1,2,\cdots \rbrace$

Strict stationarity

The joint distribution of values is the same for any collection of indices.

Weak stationarity

$\mathbb{E}[r_t] = \mu = \mathbb{E}[r_{t+s}]$ for any $s$.

$\text{Cov} (r_t, r_{t-s}) = \gamma(s)$ is a function (autocovariance) only of the lag $s$.

Strict stationarity is hard to check, so we adopt the weaker (second) definition here.

Autocorrelation function (ACF) of a stationary series

Screenshot 2020-03-29 at 3.03.04 AM

In essence, this is the Pearson correlation applied to elements that are a lag $s$ apart.
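A small numpy sketch of the usual sample-ACF estimator (normalising by the lag-0 term is the standard convention; it is not spelled out in the notes):

```python
# Sample ACF: correlation between the series and itself shifted by lag s.
import numpy as np

def sample_acf(r, max_lag=10):
    r = np.asarray(r, dtype=float)
    r = r - r.mean()
    denom = np.sum(r**2)                       # lag-0 autocovariance
    return np.array([np.sum(r[s:] * r[:len(r) - s]) / denom
                     for s in range(max_lag + 1)])
```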

White noise sequences

All time entries $\lbrace a_t \rbrace$ are independent of each other, with a finite variance.

If $\lbrace a_t \rbrace$ is normally distributed then the sequence is a “Gaussian white noise”

Auto-regressive (AR) models

AR(1) model

$r_t = \phi_0 + \phi_1 r_{t-1} + a_t$

where $\lbrace a_t \rbrace$ is a white noise series.

Given sufficient data, $\phi_0$ and $\phi_1$ can be found with linear regression.
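A quick sketch of that idea: simulate an AR(1) series and recover $\phi_0$ and $\phi_1$ by ordinary least squares, regressing $r_t$ on $r_{t-1}$ (the true parameter values below are arbitrary):

```python
# Estimate AR(1) coefficients by linear regression on lagged values.
import numpy as np

rng = np.random.default_rng(1)
phi0_true, phi1_true = 0.5, 0.7
r = np.zeros(500)
for t in range(1, 500):
    r[t] = phi0_true + phi1_true * r[t - 1] + rng.normal(scale=0.1)

X = np.column_stack([np.ones(499), r[:-1]])   # regressors [1, r_{t-1}]
y = r[1:]                                     # response r_t
phi0_hat, phi1_hat = np.linalg.lstsq(X, y, rcond=None)[0]
```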

Proof of weak stationarity (please understand)

Homework - AR(2) was modelled.

The general AR(p) model is given by

$r_t = \phi_0 + \phi_1 r_{t-1} + \cdots + \phi_p r_{t-p} + a_t$

Need to use the partial ACF (not in exam) to determine the order $p$, because the ACF cannot distinguish AR models with different orders. (Partial F-test to determine fit.)

Moving average (MA) models

MA(1) model

$r_t = c_0 + a_t - \theta_1 a_{t-1}$

MLE to find optimal parameters.

ACF

Look at the correlation plot between points $s$ apart, for $s \geq 1$. See where the correlation stops.

Screenshot 2020-03-29 at 3.48.11 AM

MA(q) model

$r_t = c_0 + a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q}$

ACF

Screenshot 2020-03-29 at 3.48.15 AM

MA(q) models are always weakly stationary because they are finite linear combinations of a white noise sequence with time-invariant expectation and variance.

Forecasting

For all models, as the horizon increases we become more uncertain of the forecast.

2-step forecast - two steps into the future, using the 1-step forecast as if it were observed; the forecasts are computed recursively.

Screenshot 2020-03-29 at 3.49.19 AM

For an MA(q) model, when forecasting more than $q$ steps ahead the prediction is $c_0$, because the values will be uncorrelated.

Screenshot 2020-03-29 at 3.49.25 AM

For a stationary series, the multistep-ahead forecasts converge to the mean of the series, and the variance of forecast errors converge to the variance of the series as the forecast horizon increases.

ARMA models

(not in exam)

Screenshot 2020-03-29 at 3.51.16 AM

Use MLE to find optimal parameters.

Gaussian processes

“This is an exercise” - will come up for finals.

Is the “inverse covariance matrix” the inverse of the covariance matrix? (Yes - it is also called the precision matrix.)

Conditional distribution in the continuous case $p(y \vert x) = \dfrac{p(y,x)}{p(x)}$ - how to interpret this?

Interesting animation, but I prefer one that animates what the prof has shown (with the truth that is noisily sampled).

Marginal distribution

Probability density function of multivariate Gaussian distribution \(p(x) = C \exp \left( -\frac{1}{2} \left< (x-\mu), \Lambda (x-\mu) \right> \right)\)

where $\Lambda = \Sigma^{-1}$ is the precision matrix and $C$ is the normalising constant.

Distribution of partitioned vectors

If we partition the vector $x$, we can express the marginal distribution in terms of each partition.

Screenshot 2020-04-05 at 1.54.59 PM

Screenshot 2020-04-05 at 1.55.15 PM

Note that

If we are interested in the unconditioned marginal distribution of $x_a$, simply take the integral across $x_b$.

Screenshot 2020-04-05 at 2.12.07 PM

Conditional distribution

We are interested in $\mu_{a \vert b}$ and $\Sigma_{a \vert b}$ as we want to model the distribution of $x_{a \vert b}$.

We set $x_b$ to a constant and compare the expressions.

Screenshot 2020-04-05 at 2.08.54 PM

Screenshot 2020-04-05 at 2.08.58 PM

We want to express $\mu_{a \vert b}$ and $\Sigma_{a \vert b}$ in terms of $\mu$ and $\Sigma$, so we use an expression that was covered in the homework.

Screenshot 2020-04-05 at 2.10.09 PM
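A minimal numpy sketch of the resulting (standard) conditioning formulas, $\mu_{a \vert b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b)$ and $\Sigma_{a \vert b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$ (these formulas are assumed here, since the lecture expressions live in the screenshots):

```python
# Condition a joint Gaussian N(mu, Sigma) on x_b taking a fixed value.
import numpy as np

def condition_gaussian(mu, Sigma, idx_a, idx_b, x_b):
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    K = S_ab @ np.linalg.inv(S_bb)
    mu_cond = mu_a + K @ (x_b - mu_b)        # conditional mean
    Sigma_cond = S_aa - K @ S_ab.T           # conditional covariance
    return mu_cond, Sigma_cond
```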

Gaussian process

Screenshot 2020-04-05 at 2.13.10 PM

Both the state space and the index set are continuous.

Screenshot 2020-04-05 at 2.24.52 PM

Example of kernels

Screenshot 2020-04-05 at 2.27.14 PM

In essence, if $x$ is close to $y$, then $f(x)$ should be strongly correlated with $f(y)$.

Bayesian prediction

Linear regression

\[p(y \vert f(x)) \sim \mathcal{N}(f(x), \tau^2)\]

Modelling $f$ with a Gaussian process distribution, with kernel $K$

\[p(f(x)) \sim \mathcal{N}(0,K)\]

Parametric methods assume there is a process that generates the data and that process is defined using some parameters. Non-parametric methods do not make any assumption about the process - they try to solve the problem solely in terms of the input data.

Parametric methods throw away the training data once the model is trained.

The kernels do not have parameters (therefore this is a “non-parametric” method).

What is the value of $\tau$? - an unknown noise variance that does not need to be known?

$t$ is the target (no it is not time)

Screenshot 2020-04-05 at 5.19.44 PM

$\delta_{ij} = 1$ if $i=j$, otherwise $0$.

Screenshot 2020-04-05 at 5.09.18 PM

Screenshot 2020-04-05 at 5.09.52 PM

Refer to conditional distribution of the partitioned variables for the concept.
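Putting the pieces together, a minimal GP-regression sketch in numpy: an RBF kernel, noisy targets $t$ with noise variance $\tau^2$, and the partitioned-Gaussian conditioning above to get the posterior mean and covariance at test points (function and variable names here are my own, not from the lecture):

```python
# GP regression: posterior over f at test inputs given noisy targets t.
import numpy as np

def rbf(A, B, lengthscale=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X_train, t_train, X_test, tau2=0.1):
    K = rbf(X_train, X_train) + tau2 * np.eye(len(X_train))  # noisy train cov
    K_s = rbf(X_train, X_test)
    K_ss = rbf(X_test, X_test)
    alpha = np.linalg.solve(K, t_train)
    mean = K_s.T @ alpha                                # posterior mean
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)        # posterior covariance
    return mean, cov
```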

Bayesian optimization

We want to minimise the computation of $f$, which is a black-box function that does not have an analytic expression.

Bayesian optimization techniques attempt to find the global optimum of $f$ in as few steps as possible.

Optimisation procedure

Screenshot 2020-04-05 at 3.51.35 PM

Common acquisition functions

Screenshot 2020-04-05 at 3.54.59 PM

Screenshot 2020-04-12 at 10.55.38 PM

Screenshot 2020-04-05 at 4.55.10 PM

With the surrogate model, we can easily calculate $\mu_{GP}(x)$
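As one example, expected improvement (a commonly used acquisition function) can be computed from the surrogate's $\mu_{GP}(x)$ and $\sigma_{GP}(x)$ roughly as follows (maximisation convention; `xi` is an exploration knob I have added, not a course notation):

```python
# Expected improvement over the best observed value f_best.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    sigma = np.maximum(sigma, 1e-12)          # avoid division by zero
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```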

Hidden Markov Models

What to take note of - off-by-one errors; identify exactly which chain each variable refers to.

Time-homogeneous Markov chain

A Markov chain is defined by

Nth order Markov Chain (we are only dealing with 1st order, 2nd order in exams?)

Screenshot 2020-04-09 at 12.56.45 AM

Screenshot 2020-04-09 at 1.00.24 AM

Assumption - the probability of observing a series of observations can be factorised into conditional probabilities.

Parameters for a hidden Markov model

Screenshot 2020-04-09 at 1.12.42 AM

Example

Screenshot 2020-04-09 at 1.13.52 AM

Screenshot 2020-04-09 at 1.14.03 AM

Screenshot 2020-04-09 at 1.14.40 AM

Three main problems

Evaluation: Forward algorithm

Screenshot 2020-04-09 at 1.17.31 AM

Starting from the initial node to the last node, calculate the forward probability of each hidden state.

Screenshot 2020-04-09 at 1.51.23 AM

The forward probability $\alpha_t^k$ is the joint probability of the hidden state $k$ and past and current observations up to $t$.

Screenshot 2020-04-14 at 6.59.00 AM

To get the (required) joint probability of observations $p(\lbrace O_t \rbrace_{t=1}^T )$, take the sum of the final forward probabilities over all the hidden states, $\sum_k \alpha_T^k$.

Dynamic programming simplifies the computation of all $K^T$ paths to $KT$ alphas.
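A minimal numpy sketch of the forward pass (the names `P`, `Q`, `pi0` are my own: `P[i, j]` is the transition probability from state $i$ to $j$, `Q[k, o]` the emission probability of observation $o$ from hidden state $k$, and `pi0` the initial distribution):

```python
# Forward algorithm: alpha[t, k] = p(O_1..O_t, S_t = k).
import numpy as np

def forward(pi0, P, Q, obs):
    K, T = len(pi0), len(obs)
    alpha = np.zeros((T, K))
    alpha[0] = pi0 * Q[:, obs[0]]                     # initialisation
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ P) * Q[:, obs[t]]  # recursion
    return alpha, alpha[-1].sum()                     # alphas, p(O_1..O_T)
```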

Decoding: Forward-Backward algorithm

Screenshot 2020-04-09 at 2.02.31 AM

d-separation allows for the above decomposition, because we condition on $S_t$ which d-separates the two subgraphs and thus \(\lbrace O_1 ... O_t, S_1 ... S_{t-1} \rbrace \bot \lbrace O_{t+1} ... O_T, S_{t+1} ... S_T \rbrace \ \vert \ S_t\)

The first part $\alpha_t^k$ is calculated in the forward algorithm until $t$.

The “backward” probability $\beta_{t}^k$ is the probability of the subsequent observations after node $t$, given the state of that node.

We compute $\beta_{t}^k$ starting from the last node, back to the node whose probability we are interested in. In practice we compute all the $\beta$ (and $\alpha$) anyway, because we want to know the probability of each state at every node.

Screenshot 2020-04-09 at 2.10.54 AM

Why initialise $\beta_T^k = 1$? Look at the LHS of the first termination equation; substituting $t=T$, it reduces to $\alpha_T^k$.

The second termination equation - numerator is the first termination equation, denominator is the sum across all states of the numerator.

The termination equations apply to all $t$.
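A matching backward-pass sketch, under the same assumed conventions (`P`, `Q`, `obs`) as the forward sketch above:

```python
# Backward algorithm: beta[t, k] = p(O_{t+1}..O_T | S_t = k).
import numpy as np

def backward(P, Q, obs, K):
    T = len(obs)
    beta = np.ones((T, K))                             # beta_T^k = 1
    for t in range(T - 2, -1, -1):
        beta[t] = P @ (Q[:, obs[t + 1]] * beta[t + 1])  # recursion
    return beta

# Posterior of state k at node t: gamma[t, k] is proportional to
# alpha[t, k] * beta[t, k], normalised over k.
```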

Explaining the iterative equation

Learning: Baum-Welch (EM) Algorithm

Objective - find parameters that maximise the likelihood of the observed data.

E-step - Fix parameters, find expected state assignments

M-step - Fix expected state assignments, update parameters

“Likelihood does not factorize since observations are not i.i.d.”

Screenshot 2020-04-14 at 9.06.36 AM

Screenshot 2020-04-14 at 9.06.46 AM

$p_{ij}$ is the transition probability from state $i$ to $j$

$q_j^{O_t}$ is the emission probability of hidden state $j$ to $O_t$

$\gamma_i(t)$ is the probability of hidden state $i$ at node $t$, computed with the forward-backward algorithm.

$\varepsilon_{ij}(t)$ is the probability that the hidden state transitioned from $i$ to $j$ at node $t$ (invitation to “check” lines 2 and 3). Why don’t we just use $p_{ij}$?
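For reference, a standard form of the M-step updates, assuming the $\gamma$ and $\varepsilon$ defined above and writing $\hat{\pi}_i$ for the initial-state probability (this is the textbook Baum-Welch update, not necessarily the exact notation on the slides):

\[
\hat{p}_{ij} = \frac{\sum_{t=1}^{T-1} \varepsilon_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)}, \qquad
\hat{q}_j^{\,o} = \frac{\sum_{t \,:\, O_t = o} \gamma_j(t)}{\sum_{t=1}^{T} \gamma_j(t)}, \qquad
\hat{\pi}_i = \gamma_i(1)
\]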

Screenshot 2020-04-09 at 2.19.34 AM

Multi-armed bandits

Reinforcement learning

Objective

Policy

Value function

Model

Multi-armed bandit problem

Action-value function $Q_t(i)$

Value function (depends on the state)

Exploitation

Exploration

$\epsilon$-greedy policy

UCB1

Not recorded

Markov Decision Processes

The agent can now transition to a different state (unlike the multi-armed bandit problem, where the agent is always in the same state).

Problem setup

The problem is defined with these probabilities.

Screenshot 2020-04-21 at 1.02.00 AM

Objective

To maximise the expected sum of rewards $\mathbb{E}(G_0)$

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$

$\gamma \in [0,1]$ is the discount factor. If $\gamma=0$, the algorithm will be greedy and maximise only the immediate reward. With $\gamma = 1$ the sum may not converge if the episode never terminates.

$\gamma$ is selected during training, and may not produce the optimum result?

The policy function, the state-value function, and the action-value function are independent of $t$ - this is the Markov assumption.

Policy function $\pi(a \vert s)$

State-value function $v_\pi(s)$ for given policy $\pi$

Terminal states have a value of 0.

Action-value function $q_\pi (s,a)$ for given policy $\pi$

Bellman equation

Screenshot 2020-04-21 at 1.39.31 AM

This gives a system of $ \vert S \vert $ equations in $ \vert S \vert $ unknowns, one for each state-value, which can be solved with linear algebra. This computes the state-values for a given policy.

How do you compute the action-values given a policy?
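A small numpy sketch of the linear-algebra route for a toy 2-state MDP (the transition matrix and rewards are made up; `P_pi` and `r_pi` denote the dynamics and expected one-step rewards under the fixed policy $\pi$):

```python
# Solve the Bellman equations for v_pi directly: (I - gamma P_pi) v = r_pi.
import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])       # |S| x |S| transitions under pi
r_pi = np.array([1.0, 0.0])         # expected immediate reward under pi

v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# One way to recover action-values from v_pi (assumed notation):
# q_pi(s, a) = sum_{s'} p(s' | s, a) * (r(s, a, s') + gamma * v_pi(s'))
```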

Optimal conditions

If you have one of these, you can evaluate the rest.

Optimal policy

$v_\pi (s) \geq v_{\pi'} (s)$ for all $s \in \mathcal{S}$

Optimal state-value function

\(v_{*}(s) = \max_\pi v_\pi (s) = \max_a q_*(s,a)\)

Optimal action-value function

$q_{*}(s,a) = \max_\pi q_\pi (s,a)$

Iterative policy evaluation

When $ \vert S \vert $ is small, we can directly solve the linear equations. However, when the state space is large, we need to use iteration.

Screenshot 2020-04-21 at 2.10.56 AM

Policy iteration

Start with a random policy, evaluate the policy and then improve the policy until it eventually reaches the optimum.

Screenshot 2020-04-21 at 3.08.11 AM

Value iteration

Start with random values, evaluate the values and improve the values until you reach the optimum.

Screenshot 2020-04-21 at 3.07.45 AM
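A minimal value-iteration sketch for a toy 2-state, 2-action MDP (all numbers and array names below are assumptions, not from the lecture; `P[a, s, s']` is the transition probability and `R[a, s]` the expected reward):

```python
# Value iteration: repeatedly apply the Bellman optimality backup.
import numpy as np

gamma, tol = 0.9, 1e-8
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[a, s, s']
R = np.array([[1.0, 0.0],
              [0.5, 0.5]])                 # R[a, s]

v = np.zeros(2)
while True:
    q = R + gamma * (P @ v)                # q[a, s] for every action/state
    v_new = q.max(axis=0)                  # Bellman optimality backup
    if np.max(np.abs(v_new - v)) < tol:
        break
    v = v_new

policy = q.argmax(axis=0)                  # greedy policy w.r.t. v*
```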

AlphaGo

Supervised-only approach - given board state predict next move (with CNN) - good policy function, bad value function.

MTCS trains both policy and value function. It does not require any human games, but just the rules (self-play).