notes

Information Retrieval

Instructor

Grade breakdown

What is information retrieval

All logarithms in this course is $\log_2$

Information Retrieval task

Implementation of IR models

Boolean IR (ok)

Objective of Boolean IR

Apporaches to Boolean IR

Appraisal of Boolean IR

Problems with Boolean IR

Vector Space Model (ok)

Documents as vectors

TF-IDF measure

Jaccard coefficient

Term frequency

Document frequency

TF-IDF

Different situations (?)

Similarity mesaures between vectors

Appraisal of Vector Space Model

Merits of Vector Space Model

Problems with Vector Space Model

Word Representations

Each word or token needs to be transformed into a vector. An embedding maps the token to the vector.

Embeddings are not transferable across spaces, the embeddings have meaning only in the space where they were learned. Word meaning also changes over time.

Two words are similar if they have similar word context. (e.g. you can the word means an alcoholic beverage like beer).

Types of Vector Representations

Co-occurance matrices

Positive Pointwise Mutual Information (PPMI)

$\text{PPMI}(\text{word}_1, \text{word}_2) = \max \left( \log_2 \dfrac{P(\text{word}_1, \text{word}_2)}{P(\text{word}_1)P(\text{word}_2)} , 0 \right)$

Explicit Semantic Analysis

Dense Vectors

Singular Value Decomposition

You can get dense vectors by applying SVD on sparse vectors (do people even do that in practice?)

$X = TSD$

The original size of the decomposed matrices. $m$ is the rank of the matrix

$(t \times d)$ = $(t \times m) \cdot (m \times m) \cdot (m \times d)$

Dropping all except the first $k$ elements

$(t \times d)$ = $(t \times k) \cdot (k \times k) \cdot (k \times d)$

It might be useful to get rid of the top 1 dimension or even the top 50 dimensions.

Vectors from Neural Networks

Skipgram and CBOW - refer to Deep Learning notes

Feature vectors from recurrent neural network, ELMo, Transformers

In project, we applied SVD to reduce the dimension of the vectors for faster query

Use FAISS for to quickly query the close vector from a preprocessed list of vectors.

Probabilistic IR

Basic probability

Probability Ranking Principle

Binary Independence Model

Retrieval Status Value (RSV)

RSV of a document $d$ wrt to a query $q$

Estimation of $p_t$ and $u_t$

Table $R=1$ Document is relevant $R=0$ Document is not relevant Total
$x_t = 1$
Term $t$ exist in document
[a] [b] $\text{df}_t$
$x_t = 0$
Term $t$ does not exist in document
[c] [d]  

$p_t = \dfrac{a}{a+c}$

$u_t = \dfrac{b}{b+d}$

$c_t = \dfrac{a/c}{b/d}$

Probability estimates in practice
Probabilistic Relevance Feedback

Guess a preliminary probabilisitic description of R=1 documents, use it to retrieve a set of documents

Interact with the user to refine the description and learn some definite members

Re-estimate $p_i$ and $r_i$ on the baisis of this

Repeat, thus generating a succession of approximations to relevant documents

Okapi BM25

This is a nonbinary model

$RSV_d = \sum_\limits{t \in q} w_{t_{\text{BM25}}} \log \left( \dfrac{N}{\text{df}_t} \right)$

$w_{t_{\text{BM25}}} = \dfrac{(k_1 + 1) \cdot \text{tf}{td}}{k_1 \left( 1-b + b \left( \dfrac{L_d}{L{ave}} \right) \right) + \text{tf}d} \cdot \dfrac{(k_3 + 1) \text{tf}{tq}}{k_3 + \text{tf}_{tq}}$

Relevance Feedback (?)

Replace IDF term with something. TBC

Optimised Index Construction

Objective

Hardware basics

Baseline algorithm

Blocked Sort-based Indexing (BSBI) (?)

https://nlp.stanford.edu/IR-book/html/htmledition/blocked-sort-based-indexing-1.html

Single-pass In-memory Indexing (SPIMI) (???)

https://nlp.stanford.edu/IR-book/html/htmledition/single-pass-in-memory-indexing-1.html

This procedure to create an inverted index

The complexity is linear (I do not need why for each block or the entire index)

Advantages of SPIMI

It is faster because there is no sorting required, and it saves memory because we keep track of the term a postings list belongs to, so the termIDs of postings need not be stored. As a result, the blocks that individual calls of SPIMI-INVERT can process are much larger and the index construction process as a whole is more efficient.

Because we do not know how large the postings list of a term will be when we first encounter it, we allocate space for a short postings list initially and double the space each time it is full (lines 8-9). This means that some memory is wasted, which counteracts the memory savings from the omission of termIDs in intermediate data structures. However, the overall memory requirements for the dynamically constructed index of a block in SPIMI are still lower than in BSBI - why?

Distributed Indexing (ok)

https://nlp.stanford.edu/IR-book/html/htmledition/distributed-indexing-1.html

Dynamic Indexing (???)

https://nlp.stanford.edu/IR-book/html/htmledition/dynamic-indexing-1.html

Auxiliary and main index approach
Logarithmic merge

Other methods

Neural Network and Deep Learning

Evaluation Metrics

Torch (the new stuff I learnt)

from torchtext.data.utils import get_tokenizer
from gensim.models import KeyedVectors
glove_50dim = api.load("glove-wiki-gigaword-50")
glove_50dim.save('glove_50dim.w2v')

Relevance Feedback (ok)

Procedure

All these are query specific. If you want to make a different query, you will need to repeat the process.

Query Reformulation in VSM

In Vector Space Model, each document and query are represented with a vector of a same length.

Optimal Query

If you know the set of all relevant documents (not true), the best query that ranks all and only the relevant documents at the top

$\vec{q}\text{opt} = \dfrac{1}{\vert C_r \vert} \sum\limits{\text{relevant}} \vec{d}j - \dfrac{1}{N-\vert C_r \vert} \sum\limits{\text{irrelevant}} \vec{d}_j$

Standard Rocchio Method

Since the relevant documents are unknown, just use the known relevant and irrelevant sets of document

\[\vec{q}_{m} = \alpha \vec{q} + \dfrac{\beta}{\vert D_r \vert} \sum\limits_{\text{relevant}} \vec{d}_j - \dfrac{\gamma}{\vert D_n \vert} \sum\limits_{\text{irrelevant}} \vec{d}_j\]
Ide Regular Method
\[\vec{q}_{m} = \alpha \vec{q} + \beta \sum\limits_{\text{relevant}} \vec{d}_j - \gamma \sum\limits_{\text{irrelevant}} \vec{d}_j\]
Ide “Dec Hi” Method
\[\vec{q}_{m} = \alpha \vec{q} + \beta \sum\limits_{\text{relevant}} \vec{d}_j - \gamma \max_{\text{irrelevant}} \vec{d}_j\]

Comparison between the methods

How to evaluate performance

Addressing the lack of feedback

Query Expansion with a Thesaurus

For each term $t$, in a query, expand the query with synonyms and related words of $t$ from the thesaurus

IR Evaluation (ok)

(Understanding achieve, but still need to memorise the various metrics)

Evaluation

Resources needed to evaluate IR systems

What to keep constant when comparing IR systems

Metrics

Implementation

Benchmarks

Unigram Language Model (ok)

If you can predict the next word, you understand the language.

Query-Likelihood Retrieval Model

Linear interpolation smoothing

query-likelihood-model

Recurrent Neural Networks

RNN formulation

Encoder and Decoder model in information retrieval

Learning to Rank

Approaches based on training objective

This only affects how the model is trained

Pointwise approach

Predict the relevant score (by regression - MSE) of a document w.r.t. query or predict whether the document is relevant or not (by classification - cross-entropy)

Pairwise approach

Take two document, and try to predict which document is more relevant

Listwise approach

Directly optimise the ranking metric (e.g. NDCG) - difficult because the metrics are non differentiable

Training

Features available for training

Deep neural networks
Turning a unsupervised problem to a supervised one

Image Retrieval

Challenges

Methods

Efficient image retrieval strategies

Question Answering

Factoid QA needs information extraction followed by ranking

Non-factoid QA needs filtering and summarisation as well

Challenges

Baseline approaches

Modern QA approaches

There is also a hybrid based system by IBM Watson

IR-based QA

Question Processing

Passage Retrieval

Knowledge-based QA

Guest Lecture

Microsoft Research India

Code-mixing, scripting-mixing and transliteration (romanization)

Challenges on working with code-mixed data

On how people code-mix (and implications on code-mixed data generation)

Approaches to mixed script information retrieval

Computational Models of Code-Switching

Information Retrieval in Paypal

Use case of Information Retrieval