
Deep Learning

I am auditing this course, so projects and examinations are optional for me.

Midterm Cheatsheet


Finals Cheatsheet


Introduction

Why MNIST and CIFAR-10 are not used for deep learning

Key components of discriminative (?) machine learning

Low-level (?) engineering steps

PyTorch guide

The fallacies of Deep Learning

Human in the loop

What is being optimised

The goal of machine learning is to find a model that generalises to unseen data points.

Generalisation

The goal in machine learning is to find parameters $w,b$ which generalise well to new, unseen data points - i.e. give predictions with low error

The training objective and the prediction objective might be different (e.g. the training objective usually includes regularisation terms).

(I do not understand the small example in the middle - but I am not taking the exam heh)

The expected loss is considered “low” when our average loss is very close to that of the assumed model $f^*$ that generated the dataset.

Takeaways

Linear Regression

Prediction mapping of vector $x$ to a value

Linear mapping $g(x) = w \cdot x + b$
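
A minimal PyTorch sketch of fitting $w, b$ by gradient descent on a mean squared error loss (toy data, my own example):

    import torch

    # toy data from y = 3x + 2 plus noise (made-up example)
    x = torch.randn(100, 1)
    y = 3 * x + 2 + 0.1 * torch.randn(100, 1)

    w = torch.zeros(1, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)

    for _ in range(200):
        pred = x * w + b                 # linear mapping g(x) = w . x + b
        loss = ((pred - y) ** 2).mean()  # mean squared error
        loss.backward()
        with torch.no_grad():            # plain gradient descent step
            w -= 0.1 * w.grad
            b -= 0.1 * b.grad
            w.grad.zero_()
            b.grad.zero_()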

Loss function to use linear regression for classification

Regularisation

Support Vector Machine

Logistic regression

The sigmoid function $s(a)$

$s(a) = \dfrac{\exp(a)}{1+\exp(a)} = \dfrac{1}{\exp(-a) + 1}$

Logistic regression model

$h(x) = s(f_{w,b}(x)) = \dfrac{1}{\exp(-w \cdot x-b) + 1}$
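
A small sketch of this model (my own toy example):

    import torch

    def s(a):
        return 1 / (torch.exp(-a) + 1)  # the sigmoid defined above

    def h(x, w, b):
        return s(x @ w + b)             # probability of the positive class

    x = torch.randn(5, 3)               # 5 samples, 3 features (made up)
    w = torch.randn(3)
    b = torch.tensor(0.0)
    print(h(x, w, b))                   # values in (0, 1)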

The cross-entropy loss function

Optimising with gradient

Observations

Gradient descent

Gradient methods can be used if the function is differentiable.

How do we calculate the gradient of a function?

Differentiability

Learning rate

Stochastic gradient descent

Batch gradient descent

Stochastic gradient descent

Neural networks

Neuron

Types of activation functions

Neural network

How does a neural network learn non-linear mappings

Universal approximation theorem

How to learn neural network parameters

Multiclass vs multilabel classification

Backpropagation

The objective of backpropagation is to calculate, for each weight from node $a$ to node $b$, the gradient $\dfrac{\partial E}{\partial w_{ab}}$, where $E$ is the loss of the minibatch.

The derivative with respect to a weight is the derivative at the node it feeds into, times a local term

$\dfrac{\partial E}{\partial w_{ab}} = \dfrac{\partial E}{\partial z_{b}} \dfrac{\partial z_b}{\partial w_{ab}} = \dfrac{\partial E}{\partial z_{b}} \, \sigma'\left( \sum_j w_{jb} z_j \right) z_a$

Then we use the chain rule to compute the derivative of each node

$\dfrac{\partial E}{\partial z_k} = \sum\limits_{l:\, l \text{ receives input from } k} \dfrac{\partial E}{\partial z_l} \dfrac{\partial z_l}{\partial z_k}$

The derivative between connected nodes $\dfrac{\partial z_l}{\partial z_k}$

$\dfrac{\partial z_l}{\partial z_k} = \sigma'\left( \sum_j w_{jl} z_j \right) w_{kl}$

We can reuse the derivatives computed in the later layers (this technique is known as dynamic programming).
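
A tiny hand-rolled sketch of this reuse on a two-node chain (my own illustration), assuming sigmoid activations so that $\sigma'(a) = \sigma(a)(1 - \sigma(a))$:

    import math

    def sigma(a):
        return 1 / (1 + math.exp(-a))

    x, t = 0.5, 1.0          # input and target (made up)
    w1, w2 = 0.3, -0.2

    # forward pass, caching the node outputs
    z1 = sigma(w1 * x)
    z2 = sigma(w2 * z1)
    E = 0.5 * (z2 - t) ** 2

    # backward pass: each dE/dz is computed once and reused
    dE_dz2 = z2 - t
    dE_dz1 = dE_dz2 * z2 * (1 - z2) * w2   # chain rule through z2
    dE_dw2 = dE_dz2 * z2 * (1 - z2) * z1   # reuses dE/dz2
    dE_dw1 = dE_dz1 * z1 * (1 - z1) * x    # reuses dE/dz1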

What happens for recurrent neural networks?

Challenges

Convolutions

Why do we use convolutions?

Dimensions of a 2D convolution

Pooling

Training techniques

Initialisation

Guidelines on initialising the parameters of a neural network

Data augmentation

Data augmentation at testing time allows the neural net to see multiple views of the same sample.

Data augmentation at training time allows the neural net to be trained with more samples.
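
A sketch of both ideas with torchvision transforms (the particular augmentations and the 32x32 input size are my own choices):

    import torch
    from torchvision import transforms

    # training time: random views give the net "more" samples
    train_tf = transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomCrop(32, padding=4),   # assumes 32x32 inputs
        transforms.ToTensor(),
    ])

    # testing time: average the predictions over several views of one sample
    def tta_predict(model, pil_image, n_views=8):
        views = torch.stack([train_tf(pil_image) for _ in range(n_views)])
        with torch.no_grad():
            return model(views).mean(dim=0)     # average over the views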

Types of augmentation

Finetuning

Optimizers

Stochastic gradient descent (refer to the earlier section)

Learning rate $\eta$

Learning rate schedule $\eta_t$

Weight decay with regularisation

Momentum

RMSProp

Adam
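
Putting this section together with the standard torch.optim API (the hyperparameters are arbitrary):

    import torch

    model = torch.nn.Linear(10, 2)   # stand-in model

    # SGD with momentum and weight decay (L2 regularisation)
    opt = torch.optim.SGD(model.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=1e-4)
    # learning rate schedule eta_t: decay by 10x every 30 epochs
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)

    # Adam and RMSProp are drop-in replacements
    # opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # opt = torch.optim.RMSprop(model.parameters(), lr=1e-3)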

Model Improvements

Dropout

Batch Normalisation

Residual connections
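
A minimal sketch of the generic residual pattern $y = x + F(x)$ (not any specific paper's block):

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.f = nn.Sequential(
                nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, x):
            return x + self.f(x)   # gradients also flow through the identity path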

The history of SOTA CV models

Recurrent Neural Networks

There are many variations, and notation may not be uniform.

Vanilla RNN cell

    import torch
    import torch.nn as nn

    class VanillaRNN(nn.Module):
        def __init__(self, input_size, hidden_size, output_size):
            super().__init__()
            self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
            self.i2o = nn.Linear(input_size + hidden_size, output_size)
            self.softmax = nn.LogSoftmax(dim=1)

        def forward(self, input, hidden):
            combined = torch.cat((input, hidden), 1)  # concatenate x_t and h_{t-1}
            hidden = self.i2h(combined)               # next hidden state h_t
            output = self.i2o(combined)               # output logits
            output = self.softmax(output)             # log-probabilities over classes
            return output, hidden

LSTM cell

GRU cell

Temporal Convolutional Neural Network

Implementation

Sequence to one prediction

Sequence to sequence prediction/generation

Encoder to sequence generation

Attack and defense on neural networks

Neural networks are “limited” - they will always be at risk of attacks making them malfunction, no matter how many safeguards are put in place.

Epsilon Noising Attack

What is a good attack sample

Impact of noise on classifiers

Taxonomy of attacks

White-box attack strategies

Defense strategies

Black-box attack strategies

The Embedding Problem

(With some bonus material, mostly from this video, because I need to pass interviews)

The model can only take in tensor inputs of a fixed size. If our dataset involves objects that are not fixed-size tensors, we need to transform the dataset

An embedding transforms an object $x$ into another object $y$
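
For words, the standard lookup-table form as a sketch (vocabulary size and dimension are made up):

    import torch
    import torch.nn as nn

    emb = nn.Embedding(num_embeddings=10000, embedding_dim=300)  # 10k-word vocabulary
    word_ids = torch.tensor([5, 42, 7])   # token indices (made up)
    vectors = emb(word_ids)               # shape (3, 300): one dense vector per word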

How do we evaluate word vectors

Distributional Hypothesis of Linguistics

One-hot embeddings

Count-based embeddings

Word2Vec Methods

In contrast to count-based methods, this is a prediction-based method

Global Vectors for Word Representation (GloVe)

Universal and Multitask Learning-based Embeddings

Universal Embeddings are embeddings that are pre-trained on a large corpus and can be plugged into a variety of downstream task models to automatically improve their performance, by incorporating general word/sentence representations learned on larger datasets.

A good embedding should at least partially operate on out-of-vocabulary words.

fastText
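
A sketch of the character n-gram idea behind fastText (from memory, simplified): an unseen word still decomposes into known n-grams, whose vectors can be summed.

    def char_ngrams(word, n=3):
        """Character n-grams with boundary markers, as in the fastText idea."""
        padded = f"<{word}>"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    print(char_ngrams("where"))   # ['<wh', 'whe', 'her', 'ere', 're>']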

Google Universal Sentence and Word Encoder

Sentence embeddings

Contextual embeddings

The idea that the embeddings should depend on the context.

RNN encoder and decoder

Turns an input sequence into another output sequence.

Embeddings from Language Models (ELMo)

Attention

Examples given

Attention coefficients
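
The notes do not reproduce the formula here; a common form is scaled dot-product attention, $\mathrm{softmax}(QK^\top/\sqrt{d})\,V$, sketched below:

    import torch
    import torch.nn.functional as F

    def attention(Q, K, V):
        d = Q.shape[-1]
        scores = Q @ K.transpose(-2, -1) / d ** 0.5  # query-key similarities
        coeffs = F.softmax(scores, dim=-1)           # attention coefficients, rows sum to 1
        return coeffs @ V                            # weighted sum of the values

    Q = K = V = torch.randn(4, 8)                    # 4 tokens of dimension 8 (toy)
    out = attention(Q, K, V)                         # shape (4, 8)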

Transformer

(to ask prof to go through the specific details on Tuesday)

(Try to understand this graph)

[figure: transformer architecture]

Applications of attention/transformers

Limits of attention

History of the state of the art of NLP

Graph Convolutional Network

Introduction to Graphs

Elements of a graph

Types of graph

Adjacency matrix

Degree matrix (for undirected graph only)

Laplacian matrix

Reachability - whether a node is reachable from another node via a sequence of edges. Reachability is not necessarily mutual if the graph is directed.

Hop distance - if a node is reachable from another, the length of the shortest such sequence.

A Hermitian matrix is a matrix whose elements satisfy $a_{ij} = \overline{a_{ji}}$
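
A small sketch computing the matrices above for a toy undirected graph:

    import torch

    # adjacency matrix of the 3-node path graph 0 - 1 - 2
    A = torch.tensor([[0., 1., 0.],
                      [1., 0., 1.],
                      [0., 1., 0.]])
    D = torch.diag(A.sum(dim=1))   # degree matrix: row sums on the diagonal
    L = D - A                      # Laplacian matrix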

Application

Graph convolutions

Graph convolutional layer

Graph convolutional layer improved
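
A sketch of the commonly used normalised rule $H' = \sigma(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H W)$ with $\hat{A} = A + I$ (the Kipf and Welling formulation - whether the course uses exactly this variant is an assumption):

    import torch
    import torch.nn as nn

    class GCNLayer(nn.Module):
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.W = nn.Linear(in_dim, out_dim, bias=False)

        def forward(self, H, A):
            A_hat = A + torch.eye(A.shape[0])      # add self-loops
            D_inv_sqrt = torch.diag(A_hat.sum(dim=1).pow(-0.5))
            return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ self.W(H))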

Generative Models

Structure of an AutoEncoder

An AutoEncoder is a neural network that learns to copy its input to its output.

Why AutoEncoder will not be perfect

Fully Connected AutoEncoder

MNIST autoencoder with fully connected layers
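
A minimal sketch of such an autoencoder (the layer sizes are my own choice):

    import torch.nn as nn

    class FCAutoEncoder(nn.Module):
        def __init__(self, context_dim=32):
            super().__init__()
            # a 28x28 MNIST image is flattened to 784 values
            self.encoder = nn.Sequential(
                nn.Linear(784, 128), nn.ReLU(),
                nn.Linear(128, context_dim))        # context vector
            self.decoder = nn.Sequential(
                nn.Linear(context_dim, 128), nn.ReLU(),
                nn.Linear(128, 784), nn.Sigmoid())  # pixel values in [0, 1]

        def forward(self, x):
            return self.decoder(self.encoder(x))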

Choice of the size of context vector

Inverse Convolution operations

Problems

Deconvolution layer

Magic formula for convolution: output_size = (input_size + 2 x padding - kernel_size + stride) / stride (integer division)

Magic formula for deconvolution: output_size = (input_size - 1) x stride - 2 x padding + dilation x (kernel_size - 1) + output_padding + 1, which for dilation = 1 reduces to output_size = (input_size - 1) x stride - 2 x padding + kernel_size + output_padding

Use output_padding to recover the intended output size when the input size is not divisible by the stride
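
Both formulas can be checked directly against PyTorch (toy sizes):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 1, 28, 28)
    conv = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
    y = conv(x)              # (28 + 2*1 - 3 + 2) // 2 = 14
    print(y.shape)           # torch.Size([1, 1, 14, 14])

    deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2,
                                padding=1, output_padding=1)
    print(deconv(y).shape)   # (14-1)*2 - 2*1 + 3 + 1 = 28, back to 28x28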

Animation - https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md

Applications of AutoEncoder

Variational AutoEncoder (VAE)

Generative Adversarial Network

https://arxiv.org/abs/1406.2661

An architecture which attempts to learn to produce the distribution of the samples in a given dataset.

Train a model $G$ which takes noise inputs $z \in \mathbb{R}^k$ drawn from a normalised Gaussian distribution.
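
A sketch of this sampling step (the generator architecture is made up):

    import torch
    import torch.nn as nn

    k = 64                                   # noise dimension
    G = nn.Sequential(nn.Linear(k, 256), nn.ReLU(),
                      nn.Linear(256, 784), nn.Tanh())   # fake 28x28 images

    z = torch.randn(16, k)                   # z ~ N(0, I), a batch of 16
    fake = G(z)                              # generated samples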

Components

Training procedures

Objective

Issues with vanilla GANs

Wasserstein GAN

Mathematical story

Progressive GANs

Limitations of vanilla/WGANs

Issue - generating high-resolution images is challenging because the generator must learn how to produce both large structures and fine details at the same time

Core idea

StyleGANs

Conditional GANs

CycleGAN

Interpretability

Why do we need interpretability

Two families of interpretability methods

Methods for interpretation

Reinforcement Learning

Reference

Elements of a RL framework

Exploration - Acquire knowledge

Exploitation - Make use of knowledge to make the best decision

A good RL-based AI smartly combines exploration and exploitation

Multi-armed bandit problem

Reference strategies

$\epsilon$-first strategy

$\epsilon$-first strategy with softmax exploration

$\epsilon$-greedy strategy
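
A sketch of the $\epsilon$-greedy loop (the reward model is made up):

    import random

    n_arms, eps = 5, 0.1
    counts = [0] * n_arms
    values = [0.0] * n_arms            # running mean reward per arm

    def pull(arm):                     # made-up environment
        return random.gauss(arm / n_arms, 1.0)

    for _ in range(1000):
        if random.random() < eps:      # explore: random arm
            arm = random.randrange(n_arms)
        else:                          # exploit: best arm so far
            arm = max(range(n_arms), key=lambda a: values[a])
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]   # incremental mean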

Upper bound confidence interval strategy

Maze Problems

Now the agent may exist in many different possible states

Value function $V$

State-action function $Q$

$Q$-learning

q-table
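
For reference, a sketch of the tabular update rule (standard $Q$-learning - the variable names are mine):

    n_states, n_actions = 16, 4
    alpha, gamma = 0.1, 0.99
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def update(s, a, r, s_next):
        """Move Q[s][a] toward the bootstrapped target r + gamma * max_a' Q[s'][a']."""
        target = r + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])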

Naive-Q learning

There may be an almost infinite number of possible states, and it may no longer be possible to have a Q table. We use a deep neural network (DNN) to replace the Q table

Deep-Q learning

Use of experience buffer

Use two neural networks, one for training and one for producing targets.
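
A sketch combining the experience buffer with a frozen target network (buffer size, batch size and tensor layout are my own choices):

    import random, collections
    import torch
    import torch.nn.functional as F

    buffer = collections.deque(maxlen=10000)   # stores (s, a, r, s_next) tensor tuples

    def train_step(q_net, target_net, optimizer, gamma=0.99, batch_size=32):
        s, a, r, s2 = map(torch.stack, zip(*random.sample(buffer, batch_size)))
        with torch.no_grad():                  # targets come from the frozen network
            target = r + gamma * target_net(s2).max(dim=1).values
        pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # a holds int64 action ids
        loss = F.mse_loss(pred, target)
        optimizer.zero_grad(); loss.backward(); optimizer.step()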

More advanced problems

Actor-critic - for problems where reward is not easily computed

Markov states - for problems where transitions are not deterministic

Partially observable environment - when the agent is not able to see the full board state

State-Action-Reward-State-Action (SARSA) - the agent has to learn how to play, but also how another player might respond to its actions

Non-stationarity in problems - the distribution of candies can change over time

Invited Talks

Jason Kuen

Mathieu Ravaut

Small project feedback