Instructor
- https://sporia.info/
- https://declare-lab.net/
Grade breakdown
- 20% Checkoff
- 20% Finals
- 60% Project
What is information retrieval
- Learn how to use machine learning and AI tools to mine your desired information from the web
- Not just web scraping
- Group projects in groups of 3
All logarithms in this course are $\log_2$
Implementation of IR models
- Preprocessing - tokenisation, stop word removal, stemming
- Indexing - build an inverted index
- Retrieval - compute cosine similarity of a limited set of documents that contains at least one of the query words
- Ranking - sort the retrieved documents based on the value of cosine similarity
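A minimal end-to-end sketch of this pipeline (hypothetical toy corpus, whitespace tokenisation only; stop-word removal and stemming are omitted):

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical)
docs = {0: "caesar and brutus", 1: "brutus praised caesar caesar", 2: "calpurnia slept"}

# Preprocessing: lowercase + whitespace tokenisation (no stop words / stemming here)
tokens = {d: text.lower().split() for d, text in docs.items()}

# Indexing: inverted index, term -> sorted list of docIDs
index = defaultdict(list)
for d in sorted(tokens):
    for t in sorted(set(tokens[d])):
        index[t].append(d)

# Retrieval + ranking: cosine similarity over raw term-frequency vectors,
# computed only for documents containing at least one query term
def search(query):
    q = Counter(query.lower().split())
    candidates = {d for t in q for d in index.get(t, [])}
    def cosine(d):
        dv = Counter(tokens[d])
        dot = sum(q[t] * dv[t] for t in q)
        return dot / (math.sqrt(sum(v * v for v in q.values())) *
                      math.sqrt(sum(v * v for v in dv.values())))
    return sorted(((d, cosine(d)) for d in candidates), key=lambda x: -x[1])

print(search("brutus caesar"))  # documents 0 and 1 ranked by similarity
```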
Boolean IR (ok)
Objective of Boolean IR
- Information need: “Which plays by Shakespeare mention Brutus and Caesar, but not Calpurnia?”
- Boolean query: Brutus AND Caesar AND NOT Calpurnia
- Does not rank the query
Approaches to Boolean IR
- Term-document incidence matrix
- 1 if term exists in document, otherwise 0
- The matrix is too large to store
- The matrix is very sparse as well
Each document is a column, each term is a row
- Inverted index
- Map each term to a posting list (linked list) of documents containing it
- Assemble sequence of (token, docID) pairs
- Sort by terms, then by docIDs
- Retrieve the relevant posting list
- Merge the posting lists according to the query operators (AND/OR/NOT) - see the sketch after this list
- Consider the query processing order
- Process the smallest query first
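A sketch of the AND merge of two sorted posting lists (the two-pointer intersection referenced above; OR and AND NOT follow the same pattern):

```python
def intersect(p1, p2):
    """AND query: intersect two posting lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# Process the shortest posting lists first so intermediate results stay small
print(intersect([1, 2, 4, 11, 31, 45], [2, 31, 54]))  # -> [2, 31]
```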
Appraisal of Boolean IR
Problems with Boolean IR
- Does not rank the results
- Requires expertise and a precise understanding of the user's needs and of the collection
- Boolean queries often result in too few or too many results
- Hard to tune precision and recall
Vector Space Model (ok)
Documents as vectors
- Each document is a point in $\vert V \vert$-dimensional space, though most entries are zero.
- Each component of the vector is the TF-IDF weight of the corresponding term occurring in the document
TF-IDF measure
Jaccard coefficient
- Overlap of two sets A and B
- The size of intersection divided by the size of union
- Size of intersection + Size of union = Size of A + Size of B
- Does not consider term frequency and document frequency
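A sketch of the Jaccard coefficient on two token sets (toy strings; note that term frequency plays no role):

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two sets of tokens."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard("ides of march".split(), "march of the ides".split()))  # 3 / 4 = 0.75
```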
Term frequency
- Number of times the term $t$ appears in document $d$
- More frequent terms in a document are more important
- May be normalised by the frequency of the most frequent term in the document (interpretation - every document then has at least one term with weight 1)
Document frequency
- Rare terms are more informative than frequent terms
- Document frequency $\text{df}_t$ is the number of documents that contain the term $t$ (not the total number of occurrences)
- Logarithm to dampen the effect
TF-IDF
- term-frequency inverse document frequency
-
\[w_{t,d} = \text{tf}_{t,d} \times \log_{10}(N / \text{df}_t)\]
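A direct transcription of the weight formula above (assuming raw term frequency; variants use log-scaled or normalised tf):

```python
import math

def tf_idf(tf_td, df_t, N):
    """w_{t,d} = tf_{t,d} * log10(N / df_t)."""
    return tf_td * math.log10(N / df_t)

print(tf_idf(tf_td=3, df_t=10, N=1000))  # 3 * log10(100) = 6.0
```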
Different situations (?)
- Technical vocabulary
- Varied vocabulary
- Short documents or queries
- Dynamic collections
Similarity measures between vectors
- Euclidean distance - lower more similar
- Manhattan distance - lower more similar
- Inner product - higher more similar
- Cosine similarity
- Cosine angle between two vectors
- The differing lengths of the documents are normalised away
- Known as normalised inner product
- $\dfrac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \sqrt{\sum_i b_i^2}}$
- Do you need the vectors to be normalised?
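A sketch of cosine similarity as the normalised inner product (if the vectors are pre-normalised to unit length, the plain inner product gives the same ranking):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a, b = [1.0, 2.0, 0.0], [2.0, 4.0, 0.0]
print(cosine(a, b))  # 1.0: same direction, different lengths
```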
Appraisal of Vector Space Model
Merits of Vector Space Model
- Simple, mathematically based approach
- Consider both local (tf) and global (idf) word occurrence frequencies
- Provides partial matching and ranked results
- Allows efficient implementation for large document collections
Problems with Vector Space Model
- (Because documents are treated as bags of words)
- Missing semantic information
- Missing syntactic information
- Assumption of term independence
- Lacks the control of a Boolean model (e.g. requiring a term to appear in a document)
Word Representations
Each word or token needs to be transformed into a vector. An embedding maps the token to the vector.
Embeddings are not transferable across spaces, the embeddings have meaning only in the space where they were learned. Word meaning also changes over time.
Two words are similar if they appear in similar contexts (e.g. from the surrounding words you can infer that an unknown word refers to an alcoholic beverage like beer).
Types of Vector Representations
- Sparse vector representations
- Vector space model (each document is a vector of TF-IDF weights, one for each term occurring in the document)
- Mutual-information co-occurrence matrices
- Explicit Semantic Analysis
- Dense vector representations
- Singular value decomposition
- Neural network models (skip-grams, CBOW)
Co-occurrence matrices
- Matrix counts for each pair of words, the number of times the words occur together.
- Co-occurrence can be defined as occurring in the same paragraph, or within a window of $k$ words.
- The dimension of the matrix is $\vert V \vert \times \vert V \vert$.
- The shorter the windows, the more syntactic the representation
- The longer the windows, the more semantic the representation
- Does not need to be symmetrical - the number of words and contexts can be different
- Raw word frequency is not a great measure of association between words (stop words occur very often but maybe not the most discriminative)
- Do words x and y co-occur more than if they were independent?
- Made positive because negative values are problematic (why and how often?)
$\text{PPMI}(\text{word}_1, \text{word}_2) = \max \left( \log_2 \dfrac{P(\text{word}_1, \text{word}_2)}{P(\text{word}_1)P(\text{word}_2)} , 0 \right)$
- Instead of raw counts, we can use PPMI for the co-occurrence matrix
- PPMI is biased towards infrequent events - very rare words have very high PMI values. We use Laplace smoothing - add a small number (e.g. 2) to each of the co-occurrence counts, as in the sketch below.
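A sketch of converting a raw co-occurrence count matrix into PPMI, with the add-$k$ smoothing of the counts mentioned above:

```python
import numpy as np

def ppmi(counts, k=2.0):
    """counts: |V| x |V| co-occurrence matrix; k: Laplace smoothing added to each count."""
    C = counts.astype(float) + k
    total = C.sum()
    p_wc = C / total                               # joint P(w, c)
    p_w = C.sum(axis=1, keepdims=True) / total     # marginal P(w)
    p_c = C.sum(axis=0, keepdims=True) / total     # marginal P(c)
    pmi = np.log2(p_wc / (p_w * p_c))              # PMI: log2 of P(w,c) / (P(w) P(c))
    return np.maximum(pmi, 0.0)                    # clip negatives to get PPMI

counts = np.array([[0, 4, 1],
                   [4, 0, 2],
                   [1, 2, 0]])
print(ppmi(counts))
```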
Explicit Semantic Analysis
- Determine the extent to which each word is associated with every concept (article) of Wikipedia via term frequency (or some other method). The dimension of the concept vector is the number of concepts (article).
- For a piece of text, take the element-wise sum of the word vectors over all its words to form a composite concept vector for the text.
- The texts are compared using a standard vector similarity measure. (Normalisation would happen here I guess)
Dense Vectors
- Short vectors may be easier to use as features in machine learning
- Short vectors may generalise better than storing explicit counts
- Dense vectors may do better at capturing synonymy (in sparse vectors, they may be represented in “distinct dimensions”)
Singular Value Decomposition
You can get dense vectors by applying SVD on sparse vectors (do people even do that in practice?)
$X = TSD$
The original size of the decomposed matrices. $m$ is the rank of the matrix
$(t \times d)$ = $(t \times m) \cdot (m \times m) \cdot (m \times d)$
Dropping all except the first $k$ elements
$(t \times d)$ = $(t \times k) \cdot (k \times k) \cdot (k \times d)$
It might be useful to get rid of the top 1 dimension or even the top 50 dimensions.
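A sketch of the rank-$k$ truncation with NumPy (a toy random matrix stands in for the sparse term-document or term-context matrix):

```python
import numpy as np

X = np.random.rand(1000, 200)                       # toy term-document matrix (t x d)
T, s, D = np.linalg.svd(X, full_matrices=False)     # X = T @ diag(s) @ D

k = 50
T_k, s_k, D_k = T[:, :k], s[:k], D[:k, :]           # keep only the top-k singular dimensions
X_k = T_k @ np.diag(s_k) @ D_k                      # best rank-k approximation of X

term_vectors = T_k * s_k                            # dense k-dimensional vector per term (row)
```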
Vectors from Neural Networks
Skipgram and CBOW - refer to Deep Learning notes
- CBOW trains to predict the center word given surrounding words
- Skipgram trains to predict adjacent words given the center word
- Each has two trainable embedding matrices.
Feature vectors from recurrent neural network, ELMo, Transformers
In the project, we applied SVD to reduce the dimension of the vectors for faster queries.
Use FAISS to quickly retrieve the closest vectors from a preprocessed collection of vectors (see the sketch below).
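A minimal sketch of that FAISS flow (assuming the document vectors are already SVD-reduced and stored as float32; normalising makes inner-product search equivalent to cosine similarity):

```python
import faiss
import numpy as np

dim = 128
doc_vectors = np.random.rand(10000, dim).astype("float32")  # preprocessed document vectors

index = faiss.IndexFlatIP(dim)       # exact inner-product index (IndexFlatL2 for Euclidean)
faiss.normalize_L2(doc_vectors)      # unit-normalise so inner product == cosine similarity
index.add(doc_vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, doc_ids = index.search(query, 5)   # top-5 nearest document vectors
```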
Probabilistic IR
Basic probability
- Chain rule - $P(A,B) = P(B,A) = P(A \vert B)P(B)$
- Partition rule - $P(B) = P(A,B) + P(\bar{A}, B)$
- Where $\bar{A}$ is the complement of $A$
- Bayes rule - $P(A \vert B) = \dfrac{P(B \vert A) P(A)}{P(B)}$
- Odds $O(A) = \dfrac{P(A)}{P(\bar{A})} = \dfrac{P(A)}{1-P(A)}$
Probability Ranking Principle
- If the retrieved documents are ranked decreasingly on their probability of relevance, then the effectiveness of the system will be the best that is obtainable.
Binary Independence Model
- Binary / Boolean
- Each document $d$ is represented by a binary vector $x$, where $x_t = 1$ if the term $t$ occurs in $d$ and $x_t = 0$ otherwise
- Independence
- Naive assumption that there is no association between terms
- Assumptions
- The presence or absence of a word in a document is independent of the presence or absence of any other word
- The relevance of individual documents is assumed independent (not true, as duplicate results are considered bad - consider Google Search) - but this assumption applies to all the methods taught in IR
- Requires relevant and nonrelevant labels
Retrieval Status Value (RSV)
RSV of a document $d$ with respect to a query $q$
- $p_t$ is the probability of a term appearing in a relevant document
- $u_t$ is the probability of a term appearing in a nonrelevant document
- $RSV_d = \log \prod\limits_{t \in q} \dfrac{p_t(1-u_t)}{u_t(1-p_t)} = \sum\limits_{t \in q} \log\dfrac{p_t(1-u_t)}{u_t(1-p_t)} = \sum\limits_{t \in q} \log c_t$
Estimation of $p_t$ and $u_t$
| | $R=1$: Document is relevant | $R=0$: Document is not relevant | Total |
| --- | --- | --- | --- |
| $x_t = 1$: Term $t$ exists in document | $a$ | $b$ | $\text{df}_t$ |
| $x_t = 0$: Term $t$ does not exist in document | $c$ | $d$ | $N - \text{df}_t$ |
$p_t = \dfrac{a}{a+c}$
$u_t = \dfrac{b}{b+d}$
$c_t = \log \dfrac{a/c}{b/d}$
Probability estimates in practice
- $\log \dfrac{1-u_t}{u_t} = \log \dfrac{N-\text{df}_t}{\text{df}_t} \approx \log \dfrac{N}{\text{df}_t} = \text{IDF}$
- We usually do not know the relevant documents
- $p_t = 0.5$
- We update $p_t$ with probabilistic relevance feedback
Probabilistic Relevance Feedback
Guess a preliminary probabilistic description of R=1 documents, and use it to retrieve a first set of documents $V$
Interact with the user to refine the description and learn some definite members
Re-estimate $p_i$ and $r_i$ on the basis of this
- If term $i$ appears in $V_i$ of the documents in $V$, then $p_i = \vert V_i \vert / \vert V \vert$
- $p_i^{(2)} = \dfrac{\vert V_i \vert + \kappa p_i^{(1)}}{\vert V \vert + \kappa}$
- $\kappa$ is the weight on the prior
- Or can combine new information with original guess
Repeat, thus generating a succession of approximations to relevant documents
Okapi BM25
This is a nonbinary model
- a document is no longer represented as a binary vector
- to take into account of term frequency of the document
$RSV_d = \sum\limits_{t \in q} w_{t,\text{BM25}} \cdot \log \left( \dfrac{N}{\text{df}_t} \right)$
$w_{t,\text{BM25}} = \dfrac{(k_1 + 1) \cdot \text{tf}_{td}}{k_1 \left( (1-b) + b \left( \dfrac{L_d}{L_{ave}} \right) \right) + \text{tf}_{td}} \cdot \dfrac{(k_3 + 1) \cdot \text{tf}_{tq}}{k_3 + \text{tf}_{tq}}$
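A sketch of the per-term BM25 contribution above (typical parameter ranges are assumed: $k_1 \approx 1.2$-$2.0$, $b = 0.75$; the $k_3$ query factor only matters for long queries):

```python
import math

def bm25_term(tf_td, tf_tq, df_t, N, L_d, L_ave, k1=1.5, b=0.75, k3=0.0):
    """One term's contribution to RSV_d; sum this over the query terms."""
    idf = math.log(N / df_t)
    doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * (L_d / L_ave)) + tf_td)
    query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)
    return idf * doc_part * query_part

# One-term query against a document slightly longer than average
print(bm25_term(tf_td=3, tf_tq=1, df_t=50, N=10000, L_d=120, L_ave=100))
```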
Relevance Feedback (?)
Replace IDF term with something. TBC
Optimised Index Construction
Objective
- construct the index (posting list of termid to docids) from termid-docid pairs
- update the index
Hardware basics
- Access to data in memory is much faster than access to data on disk
- The size of memory is limited, and the size of disk can be unlimited
- Transferring one large chunk of data from disk to memory is faster than transferring many small chunks
- Seeking for the data takes time
Baseline algorithm
- Approach
- Iterate through all the termid-docid pairs and populate the posting index (inverted index), and sort
- You do not need to sort the posting list right? Just the terms? (?)
- Problems
- Cannot sort 100M records on memory - not enough memory space
- Cannot sort 100M records on disk - too many disk seeks
- Question
- Why do we even need to sort the posting list when we construct the index?
- Can’t we store all the termid to address in one computer?
Blocked Sort-based Indexing (BSBI) (?)
https://nlp.stanford.edu/IR-book/html/htmledition/blocked-sort-based-indexing-1.html
- Objective - sort 10 blocks of $N$=10M records
- Not possible to quicksort everything because it will run out of memory
- Approach
- Use the baseline algorithm in each block. Quicksort within each block $O(N \log N)$
- Merge sort the blocks (read from the blocks simultaneously)
- You will not exceed the memory at the final merge sort because you can load one blocksize from each sublist at a time
- Alternate implementation
- Multiway merge sort
- You may not be able to load that many blocks in memory at once
- Not loading data in large blocks may cause excessive time to be spent on disk seeks
- Issue
- Assumed that we can keep the dictionary (term-termID) in memory (really?)
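A sketch of the final merge step of BSBI, assuming each on-disk block can be streamed as an iterator of sorted (termID, docID) pairs; `heapq.merge` keeps only one read buffer per block in memory:

```python
import heapq
from itertools import groupby

def merge_blocks(blocks):
    """blocks: iterables of (termID, docID) pairs, each already sorted.
    Yields (termID, posting list) entries for the merged inverted index."""
    merged = heapq.merge(*blocks)                    # streaming k-way merge sort
    for term_id, pairs in groupby(merged, key=lambda p: p[0]):
        yield term_id, sorted({doc_id for _, doc_id in pairs})

block1 = [(1, 3), (2, 1), (2, 7)]
block2 = [(1, 5), (3, 2)]
print(list(merge_blocks([block1, block2])))
# [(1, [3, 5]), (2, [1, 7]), (3, [2])]
```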
Single-pass In-memory Indexing (SPIMI) (???)
https://nlp.stanford.edu/IR-book/html/htmledition/single-pass-in-memory-indexing-1.html
- Key Idea
- Generate separate dictionaries for each block - no need to maintain a term-termID mapping across blocks (really? This algorithm only operates on a single block)
- Do not sort. Accumulate posting in posting lists as they occur
- Objective: Generate a completed inverted index for each block
The procedure to create an inverted index for each block:
- For every term-docID in the block
- If term is not in the dictionary, create a posting list and add the docID
- If term is in the dictionary, add the docID
- If the posting list is full, double the size of the posting list
- Sort the terms in the dictionary (to allow merging of blocks)
- Write the block to disk and process the next block
The complexity is linear (unclear whether this refers to each block or the entire index)
Advantages of SPIMI
It is faster because there is no sorting required, and it saves memory because we keep track of the term a postings list belongs to, so the termIDs of postings need not be stored. As a result, the blocks that individual calls of SPIMI-INVERT can process are much larger and the index construction process as a whole is more efficient.
Because we do not know how large the postings list of a term will be when we first encounter it, we allocate space for a short postings list initially and double the space each time it is full (lines 8-9). This means that some memory is wasted, which counteracts the memory savings from the omission of termIDs in intermediate data structures. However, the overall memory requirements for the dynamically constructed index of a block in SPIMI are still lower than in BSBI - why?
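A sketch of SPIMI-INVERT for a single block (Python lists already grow dynamically, standing in for the explicit doubling; the write-to-disk step is left out):

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """token_stream: (term, docID) pairs for one block, in the order they are parsed.
    Returns the block's inverted index {term: posting list}, terms sorted for later merging."""
    dictionary = defaultdict(list)            # term -> postings; no termIDs needed
    for term, doc_id in token_stream:
        postings = dictionary[term]           # creates an empty posting list on first sight
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)           # docIDs arrive in increasing order, so no sort needed
    return {term: dictionary[term] for term in sorted(dictionary)}

block = [("brutus", 1), ("caesar", 1), ("caesar", 2), ("calpurnia", 2)]
print(spimi_invert(block))
```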
Distributed Indexing (ok)
https://nlp.stanford.edu/IR-book/html/htmledition/distributed-indexing-1.html
- Objective
- Very large scale
- Individual machines are fault-prone
- Given the percentage uptime of each node and the number of nodes, the percentage uptime of the whole system is obtained by taking the joint probability
- Approach
- Maintain a master machine (that is considered safe) directing the indexing job.
- Parsers write pairs from each document split to partitions (or segment files)
- Inverters read from the partitions and populate the final posting lists
- Improvements
- For better load balancing, most search engines use a document-partitioned index rather than a term-partitioned index
- MapReduce
- Template
map: input -> list(k, v)
reduce: (k, list(v)) -> output
- Application
map: collection -> list(termid, docid)
reduce: (termid, list(docid)) -> postings list
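A toy, in-process sketch of those map and reduce functions (a dictionary stands in for the framework's shuffle/sort phase):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """map: one document -> list of (termid, docid) pairs."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_fn(termid, docids):
    """reduce: (termid, list of docids) -> posting list."""
    return termid, sorted(set(docids))

collection = {1: "caesar and brutus", 2: "brutus alone"}
grouped = defaultdict(list)                       # stands in for the shuffle/sort phase
for doc_id, text in collection.items():
    for termid, d in map_fn(doc_id, text):
        grouped[termid].append(d)
index = dict(reduce_fn(t, ds) for t, ds in grouped.items())
print(index)
```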
Dynamic Indexing (???)
https://nlp.stanford.edu/IR-book/html/htmledition/dynamic-indexing-1.html
- Objective
- Documents come in over time and need to be inserted
- Documents are deleted and modified
Auxiliary and main index approach
- Approach
- Maintain a big main index
- New docs go into small auxiliary index (and indexed)
- Search across both and merge the results
- Deletion - maintain invalidation bit vector and filter output
- Complexity
- $T/n$ merges, where $T$ is the number of postings and $n$ is the size of the auxiliary index
- Index construction time is $O(T^2/n)$ because in the worst case a posting is touched $T/n$ times
- Query processing is $O(1)$ - really?
- Issues
- Merge performance is poor and you might need to do it frequently
- Alternative: create a separate file for each postings list, but this is inefficient for the OS
Logarithmic merge
- Assumption - The index is one big file (?)
- Approach
- Keep smallest posting lists in memory
- Keep larger lists on disk
- Double the size of the list when it runs out of space
- (Idea: trade off indexing time against query time; ideal when you need to alternate between indexing new documents and querying)
- Complexity
- Each posting is merged at most $O(\log(T/n))$ times
- Query processing now requires the merging of $O(\log(T/n))$ indexes
- Issues
- Complexity of query processing is no longer constant
Other methods
- Earlybird (postings sorted in reverse chronological order)
- Positional indexes
- n-gram indexes
Neural Network and Deep Learning
Evaluation Metrics
- Accuracy - problematic if there is class imbalance
- Precision - what fraction of the predicted positives are actually positive
- Recall / Sensitivity - what fraction of the actual positives are predicted positive
- Weighted loss - weighted sum of false negatives and false positives
- F-measure = $\dfrac{(\beta^2 + 1)PR}{\beta^2P + R}$
- ROC plots - true positive rate (recall) against false positive rate
- Root Mean Square Error
- Mean Absolute Error
- R-squared
Torch (the new stuff I learnt)
from torchtext.data.utils import get_tokenizer  # e.g. tokenizer = get_tokenizer("basic_english")
import gensim.downloader as api  # needed below to download pretrained embeddings
from gensim.models import KeyedVectors
glove_50dim = api.load("glove-wiki-gigaword-50")  # 50-dim GloVe vectors, returned as KeyedVectors
glove_50dim.save('glove_50dim.w2v')  # save locally; reload later with KeyedVectors.load('glove_50dim.w2v')
Relevance Feedback (ok)
Procedure
- Present the initial results (with some query vector, e.g. tfidf?)
- Receive feedback from user
- Reformulate the query
- Produce new results
- Repeat
All these are query specific. If you want to make a different query, you will need to repeat the process.
In the Vector Space Model, each document and each query is represented by a vector of the same length.
- Change query vector using vector algebra
- Add the vectors for the relevant documents to the query vector
- Subtract the vectors for the irrelevant documents from the query vector
- This adds both positively and negatively weighted terms to the query (query expansion) as well as reweighting the initial terms (term reweighting)
Optimal Query
If you knew the set of all relevant documents (which in practice you do not), the optimal query is the one that ranks all and only the relevant documents at the top:
$\vec{q}_\text{opt} = \dfrac{1}{\vert C_r \vert} \sum\limits_{\text{relevant}} \vec{d}_j - \dfrac{1}{N-\vert C_r \vert} \sum\limits_{\text{irrelevant}} \vec{d}_j$
- $N$ is the total number of documents
- $C_r$ is the set of relevant documents
- $\vec{d}_j$ is document vector
Standard Rocchio Method
Since the full set of relevant documents is unknown, use the known relevant and irrelevant sets of documents instead
\[\vec{q}_{m} = \alpha \vec{q} + \dfrac{\beta}{\vert D_r \vert} \sum\limits_{\text{relevant}} \vec{d}_j - \dfrac{\gamma}{\vert D_n \vert} \sum\limits_{\text{irrelevant}} \vec{d}_j\]
- $\vec{q}$ Is the original query
- $D_r$ and $D_n$ are the known relevant and irrelevant sets of documents
- Hyperparameters
- $\alpha$ tunable weight for initial query
- $\beta$ tunable weight for relevant documents
- $\gamma$ tunable weight for irrelevant documents
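A sketch of the standard Rocchio update with NumPy (query and document vectors assumed to be TF-IDF vectors of the same length; the tunable constants default to 1):

```python
import numpy as np

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """q: original query vector; relevant / nonrelevant: lists of known document vectors."""
    q_m = alpha * q
    if relevant:
        q_m = q_m + (beta / len(relevant)) * np.sum(relevant, axis=0)
    if nonrelevant:
        q_m = q_m - (gamma / len(nonrelevant)) * np.sum(nonrelevant, axis=0)
    return np.maximum(q_m, 0.0)   # negative term weights are commonly clipped to zero

q = np.array([1.0, 0.0, 0.5])
print(rocchio(q, relevant=[np.array([0.0, 1.0, 1.0])], nonrelevant=[np.array([1.0, 0.0, 0.0])]))
```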
Ide Regular Method
\[\vec{q}_{m} = \alpha \vec{q} + \beta \sum\limits_{\text{relevant}} \vec{d}_j - \gamma \sum\limits_{\text{irrelevant}} \vec{d}_j\]
- Does not normalise for the amount of feedback
Ide “Dec Hi” Method
\[\vec{q}_{m} = \alpha \vec{q} + \beta \sum\limits_{\text{relevant}} \vec{d}_j - \gamma \max_{\text{irrelevant}} \vec{d}_j\]
- Rejecting just the highest-ranked irrelevant document
Comparison between the methods
- Experimental results indicate no clear preference for any one of the specific methods.
- All methods generally improve performance with feedback
- The tunable constants are generally set equal to 1
- Performance should only be tested on unrated documents
How to evaluate performance
- Do not evaluate on explicitly marked documents (that would be testing on training data)
- As relevant documents are removed from evaluation, recall/precision numbers may decrease relative to the initial query - compare relative performance between methods instead
- Standardise feedback data? Since this is an interactive process
Addressing the lack of feedback
- Users are reluctant to provide explicit feedback
- Pseudo feedback - Just assume top $m$ retrieved documents are relevant, and use them to reformulate the query
- Use of implict relevance feedback (click-through data)
Query Expansion with a Thesaurus
For each term $t$ in a query, expand the query with synonyms and related words of $t$ from the thesaurus
- Added terms may be weighted less than original query terms
- Generally increases recall
- May significantly decrease precision
- (doubt, because size of the returned set is constant)
- Dataset - WordNet Synset
- Synonyms (same meaning)
- Hyponyms (specialisation)
- Hypernyms (generalisation)
- Statistical Thesaurus - Automatic Global Analysis
- Association matrices (co-occurrence) $c_{ij} = \sum_{d_k \in D} f_{ik} \times f_{jk}$
- Normalised association matrix $s_{ij} = \dfrac{c_{ij}}{c_{ii} + c_{jj} - c_{ij}}$
- To remove favouritism for more frequent terms
- The normalised score is 1 if the two terms have the same frequency in every document
- Expand the query with the highest value of $s_{ij}$ for each term in the query
- Problem
- “Apple computer” -> “Apple red fruit computer”
- Refinement
- Expand query with terms that are similar to all terms in the query (but time complexity is bad)
- Use more sophisticated term weights (instead of just frequency)
- Statistical Thesaurus - Automatic Local Analysis
- At query time, dynamically determine similar terms based on analysis of top-ranked retrieved documents (assume relevance)
- Base correlation analysis only on the local set of retrieved documents for a specific query (???)
- Finding similar terms only within the (assumed) relevant documents avoids the ambiguity of individual terms (addresses the "Apple computer" problem)
- Comparison between global and local analysis
- Global analysis is precomputation heavy (require correlation computation)
- Local analysis is runtime heavy (need to analyse top documents at query time)
- Local analysis gives better results
IR Evaluation (ok)
(Understanding achieved, but still need to memorise the various metrics)
Evaluation
- How fast does it index?
- How fast does it search?
- What is the (monetary) cost per query?
- What is the level of user happiness? - How do we measure it?
- Who is the user that we want to make happy - users or advertisers, sellers or buyers?
- Approximated by the relevance of search results
Resources needed to evaluate IR systems
- Document collection
- Queries
- Binary (or otherwise) assessments (on query results)
- Some methods require knowledge of all the relevant documents, which may not be feasible in practice
What to keep constant when comparing IR systems
- Evaluated along the same measures
- Evaluated along the same collection of documents
- Evaluated on the same set of questions/query
Metrics
- Precision and Recall - on relevant and retrieved documents
- Mnemonic
- Recall - percentage “recalled”
- Precision - percentage correct
- Determining recall can be difficult because you do not know the ground truth of all relevant documents
- Workaround - consider only a sample of the dataset
- F measure $F1 = \dfrac{2PR}{P+R} = \dfrac{2}{\dfrac{1}{P} + \dfrac{1}{R}}$
- For the F measure to be high, both precision and recall need to be high
- Parametrized F measure $F_\beta = \dfrac{1+\beta^2}{\dfrac{\beta^2}{R} + \dfrac{1}{P}}$
- Consider whether your task is precision-critical (web search) or recall-critical (patent search)
- 11-point Interpolated Precision-Recall Curve
- For every true positive in the rank, consider the precision and recall values.
- Plot the points on the precision value graph
- (Precision is not necessarily decreasing - consider the case where the rank-1 result is irrelevant)
- The points will be evenly distributed along the x-axis
- Plot the points on the graph and interpolate to the left
- Measure - the area under the precision-recall curve
- Average of precisions at recall levels 0.0, 0.1, …, 0.9, 1.0
- Macro average (precision recall curve is calculated for each query) and micro average (precision recall curve is calculated once)
- R-precision
- The query has R relevant documents
- The precision at the R-th position
- Precision@K and Recall@K
- The precision/recall at the K-th position
- Regardless of the number of the relevant documents
- Mean Average Precision
- Compute the Precision@K for each of the ranks of the relevant document
- If a relevant document never gets retrieved, the precision corresponding to that document is taken to be zero
- Take the average for one query
- To evaluate the system, take the average across queries (each query counts equally) - is this macro-averaging?
- Mean Reciprocal Rank
- Rank of the first relevant document
- Could be the first clicked document
- Normalised Discounted Cumulative Gain (NDCG)
- For non-binary relevance labels
- Consider cumulative gain at each rank $n$
- $\text{DCG}_p = rel_1 + \sum\limits_{i=2}^p \dfrac{rel_i}{\log_2 i}$
- $\text{NDCG} = \dfrac{\text{DCG}_n}{\text{IDCG}_n}$
- Pairwise Relative Rating and Kendall-tau Distance
- (Sometimes data only provide whether one document is better than the other)
- Generate all pairs for each ranking
- X is the number of agreements between the rankings
- Y is the number of disagreements
- The distance is $\dfrac{X-Y}{X+Y}$
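A sketch of average precision (per query, binary relevance) and NDCG (graded relevance) for a ranked result list; MAP is the mean of the per-query average precisions:

```python
import math

def average_precision(ranked_relevance, num_relevant):
    """ranked_relevance: 0/1 flags down the ranking; num_relevant: total relevant docs for the query."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)        # Precision@K at each relevant document's rank
    return sum(precisions) / num_relevant      # never-retrieved relevant docs contribute 0

def ndcg(gains, ideal_gains):
    """gains: relevance grades in ranked order; ideal_gains: the same grades sorted descending."""
    def dcg(g):
        return g[0] + sum(g[i] / math.log2(i + 1) for i in range(1, len(g)))
    return dcg(gains) / dcg(ideal_gains)

print(average_precision([1, 0, 1, 0], num_relevant=3))    # (1/1 + 2/3) / 3 ≈ 0.556
print(ndcg([3, 2, 0, 1], ideal_gains=[3, 2, 1, 0]))
```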
Implementation
- A/B testing
- Randomly send 1-10% of incoming users to a variant of the system that includes a single change
- Mechanical Turk
Benchmarks
- SMART collection
- TREC Benchmark
- https://en.wikipedia.org/wiki/Text_Retrieval_Conference#Tracks
Unigram Language Model (ok)
If you can predict the next word, you understand the language.
- Defines a probability distribution over individual words given a document
- Unigram - the likelihood of each word independent of any other word
- Steps in Language Modelling
- Estimation - Observing text and estimating probability of each word
- Prediction - Using the language model to assign a probability to a span of text
- Other application
- Could be used for topic modelling
- Given the document (query), find the most likely topic(s)
Query-Likelihood Retrieval Model
- $P(q \vert D) = \prod_t P(t \vert \theta_D)$
- $D_1$ is more relevant than $D_2$ for query $q$ if $P(q \vert D_1) > P(q \vert D_2)$
- $P(t \vert \theta_D) = \dfrac{\text{tf}_{t,D}}{N_D}$ is the term frequency for the unigram model
- The sum of logarithm over each query term is computed to avoid underflow
- Analysis
- The longer the query, the lower the probability but this is ok
- If the query term does not exist in the document, the score is zero
- IDF is not considered
- Objective of smoothing
- Decrease the probability of observed outcomes because we tend to overestimate them
- Increase the probability of unobserved outcomes because we tend to underestimate them
- Add one smoothing
- Add one to numerator, add unique token count to denominator (if the unique token count is large, large documents will be favoured because the term frequency is no longer normalised by document size?)
- Still does not consider IDF
Linear interpolation smoothing
- Why do this?
- Capture IDF
- Avoid zero probabilities
- $P(t \vert D) = \alpha P(t \vert \theta_D) + (1-\alpha) P(t \vert \theta_C)$
- Consider the entire corpus as a single document and compute $P(t \vert \theta_C)$ from it
- If $\alpha=0$, how do document scores differ?
- $P(q \vert D) = \prod\limits_{t \in q} P(t \vert D) \propto \prod\limits_{t \in q} \left( \dfrac{\alpha \, p_\text{MLE}(t \vert D)}{(1-\alpha) \, p_\text{MLE}(t \vert C)} + 1 \right)$
- TF is the numerator, IDF as the denominator
- Avoid zero probabilities and introduces IDF
- (Please wrap your head around how this works)
- Extending the unigram model to bigrams
- $P(t_{i-1}, t_i \vert D) = \alpha_1 \times P_1(t_i \vert D) + \alpha_2 \times P_2(t_{i-1},t_i \vert D)$
- Just consider each bigram as a term
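A sketch of query-likelihood scoring with linear interpolation smoothing (summing log-probabilities to avoid underflow; toy two-document corpus):

```python
import math
from collections import Counter

def ql_score(query_terms, doc_terms, collection_counts, collection_size, alpha=0.5):
    """log P(q|D) with P(t|D) linearly interpolated against the collection model."""
    doc_counts, doc_len = Counter(doc_terms), len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_doc = doc_counts[t] / doc_len                    # MLE on the document
        p_coll = collection_counts[t] / collection_size    # MLE on the whole corpus
        score += math.log(alpha * p_doc + (1 - alpha) * p_coll)
    return score

docs = {"d1": "apple pie recipe".split(), "d2": "apple computer keyboard".split()}
coll = Counter(t for terms in docs.values() for t in terms)
coll_size = sum(coll.values())
for name, terms in docs.items():
    print(name, ql_score("apple recipe".split(), terms, coll, coll_size))
```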
Recurrent Neural Networks
RNN formulation
- Hidden state $h_t$, Output vector $y_t$
- $h_t = f_W(h_{t-1}, x_t) = \tanh(W_{hh}h_{t-1} + W_{xh}x_t)$
- $y_t = W_{hy}h_t$
Encoder and Decoder model in information retrieval
- Training
- Require document-query pairs
- Train the model to predict the next word of the query
- $y_t$ is the logits (apply softmax to get probabilities)
- What if the vocabulary size is large? (Hierarchical softmax? Word vectors?)
- Evaluation
- Run the document through the RNN (encoder) for the hidden state
- The encoder and decoder may or may not share weights
- Use the hidden state and evaluate the likelihood of query
- Objective - calculate probability of generating the query given a document
- $P_\theta(Q \vert D) = \prod\limits_{q_i \in Q} P_\theta(q_i \vert D, q_{j<i})$
- $\text{score}(Q \vert D) = \sum\limits_{q_i \in Q} \log P_\theta(q_i \vert D, q_{j<i})$
- $\mathcal{L}_\text{NLL} = - \sum \log P_\theta (Q \vert D^+)$
- Maximise probability of query given relevant documents
- $\mathcal{L}_\text{Marg} = \sum\limits_{(Q, D^+, D^-) \in \mathcal{T}} \max\{ 0,\ b - \log P_\theta (Q \vert D^+) + \log P_\theta (Q \vert D^-) \}$
- Uses both relevant and irrelevant documents
- Hinge loss that penalises a triple when the log-likelihood of the relevant document does not exceed the log-likelihood of the irrelevant document by at least the margin $b$
- Possible approach if there is no supervised data
- Train a language model that predicts the next word. Use this model to estimate the probability of the query
- However, this does not work well: such a model learns to predict the next word of the document, not the likelihood of a query given the document
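A minimal sketch of the margin (hinge) loss over the two query log-likelihoods, assuming scalar $\log P_\theta(Q \vert D^+)$ and $\log P_\theta(Q \vert D^-)$ have already been computed by the encoder-decoder:

```python
def margin_loss(logp_pos, logp_neg, b=1.0):
    """Zero when log P(Q|D+) beats log P(Q|D-) by at least margin b, linear penalty otherwise."""
    return max(0.0, b - logp_pos + logp_neg)

print(margin_loss(logp_pos=-2.0, logp_neg=-5.0))   # margin satisfied -> 0.0
print(margin_loss(logp_pos=-4.0, logp_neg=-4.5))   # margin violated  -> 0.5
```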
Learning to Rank
Approaches based on training objective
This only affects how the model is trained
Pointwise approach
Predict the relevance score (by regression - MSE) of a document w.r.t. a query, or predict whether the document is relevant or not (by classification - cross-entropy)
Pairwise approach
Take two documents and try to predict which one is more relevant
- RankNet loss
- $v_q = \text{NN}(q)$
- $s_i = f(v_q, v_{d_i})$
- $p_{ij} = p(s_i > s_j) = \dfrac{1}{1+e^{- \gamma (s_i - s_j)}}$
- $\mathcal{L}_\text{RankNet} = -\log (p_{ij})$
- Trains on pairs of documents where one is more relevant than the other - binary labels work too if you pair a relevant document with a nonrelevant one
- Generalised loss - compute softmax over the full collection or randomly sampled negative documents
- Inference - order the documents by $s_i$
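A sketch of the RankNet pair probability and loss, assuming document $i$ is labelled more relevant than document $j$ and the scores $s_i$, $s_j$ come from the scoring network:

```python
import math

def ranknet_loss(s_i, s_j, gamma=1.0):
    """-log p(s_i > s_j) for a pair where document i is the more relevant one."""
    p_ij = 1.0 / (1.0 + math.exp(-gamma * (s_i - s_j)))
    return -math.log(p_ij)

print(ranknet_loss(2.0, 0.5))   # small loss: model already ranks i above j
print(ranknet_loss(0.5, 2.0))   # large loss: ranking is inverted
```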
Listwise approach
Directly optimise the ranking metric (e.g. NDCG) - difficult because the metrics are non differentiable
- LambdaNet / LambdaRank
- $\lambda_\text{LambdaRank} = \lambda_\text{RankNet} \cdot \vert \Delta \text{NDCG} \vert$
- The gradient should be bigger for pairs of documents whose swap produces a bigger impact on NDCG
- Luce Model, ListNet loss, ListMLE loss (refer to slides) (?)
Training
Features available for training
- Query-independent feature (D) - incoming link count and document length
- Query-dependent feature (Q-D) - BM25
- Query-level feature (Q) - query length
Deep neural networks
- DNN for YouTube recommendation
- Inputs include average of video embeddings, average of search tokens, geographic embedding, demographic embedding (age, gender)
- At inference - search the nearest neighbour in the video vectors
- Siamese networks
- Relevance estimated by cosine similarity between item embeddings
- Wide and deep model
- Lexical and semantic matching networks
- Large scale pretrained language models
- Neural ranking models with multiple document fields
Turning an unsupervised problem into a supervised one
- You will need
- The documents to search from
- Sample queries (you do not need labels)
- Run BM25 using the queries
- Set a threshold $t$ on the BM25 scores to create pseudo-labels
- Use the pseudo-labels to train an LTR model
Image Retrieval
Challenges
- How to represent visual content of images
- What are visual contents
- Which type of visual content should be used?
- How to retrieve efficiently
- Should avoid visiting the entire database at query
Methods
- Color histogram
- Texture histogram - local binary pattern - check whether the centre pixel value is larger than each of its 8 neighbours and store the comparisons as an 8-bit number
- Gridded color
- Shape distances - projection matching - encoding based on column and row sum
- Bag of visual words
- Interest point detection - local points - image patterns that differ from their immediate neighbourhood
- Should be robust to various geometric transformations
- SIFT descriptor
- Issue - computationally expensive because it requires a linear scan and per-image computations
- Bag of textures
- Image embedding
Efficient image retrieval strategies
Question Answering
Factoid QA needs information extraction followed by ranking
Non-factoid QA needs filtering and summarisation as well
Challenges
- Understand the query
- Translating it into a sequence of tokens useful for retrieval
- Retrieve appropriate units of text
- Filter, rank and/or combine units into an answer
- Summarise the answer
- Visualise the answer
- Different types of question need different treatments
Baseline approaches
- Do standard IR (what is the document here?)
- Find exact match of query and return the next sentence on the webpage
Modern QA approaches
There is also a hybrid based system by IBM Watson
IR-based QA
Question Processing
- Query formulation
- Search for question directly
- Query expansion
- Query reformulation
- Apply a keyword extraction model (?)
- Answer Type Detection - what is it asking for
- Rule based - “who” and patterns
- Machine learning inputs - words, POS tags, wordnet synsets, question headword
Passage Retrieval
- Find relevant documents (use an IR system)
- Find the passage in the document that contains the answer
- Break document into passages/paragraphs
- Filter passages
- Machine learning inputs - number of entities of the correct answer type, number of question keywords, longest exact sequence of words, n-gram overlap, rank of the document from the IR system, proximity of query keywords to one another in the passage
- Answer processing
- Rule-based - pattern extraction
- Find and return instances of answer types (e.g. person)
- Use predefined patterns to search the passage (e.g. "x such as y")
- Consider n-grams that appear in many top-ranked passages
- Machine learning inputs - novelty (answers should not be found in the question), punctuation, correct answer type, distance between query keywords and answer, appositive
- Use IR to find relevant text document
- Extract answer from the relevant documents
Knowledge-based QA
- Convert question into a query for a structured database
- Heavily relies on a rich knowledge base
Guest Lecture
Microsoft Research India
- Project ELLORA: Enabling Low Resource Language Speakers
- Project LITMUS: Linguistically Informed Training & Testing of Multilingual Systems
- Neuro-symbolic Reasoning and Computational Pragmatics
- Project Melange: NLP for multilingual systems
Code-mixing, scripting-mixing and transliteration (romanization)
- How to build systems for all languages of the world?
- How to build systems that can bridge the gap between languages?
Challenges on working with code-mixed data
- No large text corpora for code-mixing, which is largely a spoken phenomenon
- Model explosion - with $n$ languages, there are $O(n^2)$ language pairs
- Reusing models - how to exploit monolingual models and data for code mixing
On how people code-mix (and implications on code-mixed data generation)
- There is grammar on how we code-mix
- For Hindi-English tweets - English is used far more for positive sentiment
- People are making Bing queries in mixed script
Approaches to mixed script information retrieval
- Canonical indexing
- Neat approach but the world is messy
- Query expansion
- Reverse transliteration in the query
- Phonetic word embeddings (phonetically similar words have similar vectors)
Computational Models of Code-Switching
- Supervised - annotated code-mixed data - treating code-mixed language as another language - not scalable
- Divide & Conquer - run the text through a language detector which separates it into monolingual fragments - but this treats the text as a bag of words without context
- Combining Monolingual Models - run the text through both models as well as a language detector and feed the output into a combination model
- Zero-shot learning - massively multilingual pretrained language model (RoBERTa or mBERT) - train the model with many languages as well as code-mixed data - (what is zero-shot hypothesis?)
- Use of RAKE to extract keywords from the sentence
- Extraction of the most relevant paragraphs
- SQuAD re-ranker
Use case of Information Retrieval
- Sanction screening
- "Prince of Persia - Sands of Time" - search Wikipedia and learn that it has nothing to do with Iran
- Using open knowledge for learning
- Labels from agents
- Conversation search
- (Difference between utterances and queries)
- Users want concise answers rather than the entire page
- Labels from community kudos