This is a project I'm working on right now: compiling a list of questions and answers for Generative AI interviews.
I'm using this reference as the base, and credit goes to them for compiling it; however, I am taking a lot of liberties with editing the questions, and the answers are completely my own.
Note: I'm trying to keep the answers I write myself to a minimum, since I am in no way an authoritative source on this topic. I will be providing references to the best of my ability. I refrained from adding any visual aids, both for readability and to keep maintenance simple. The resources and references I cite contain a wealth of information, mostly with visuals.
I plan to expand this to Generative AI in general, not just language, covering everything from diffusion models to vision-language models. Once I get the basic structure down and I'm happy with the preliminary results, I will work on establishing an efficient methodology for contributing to this repository, and then I will open it up for contributions. For now, I want to keep it simple and focused.
Important:
I think it is necessary to clarify that the answers I provide, whether they are my own write-ups or citations of a source, are not in any way definitive. What I'm trying to do is get you started on the right path and give you a general idea of what to expect; you should definitely read any and all resources I provide, and then some. If you want this to be your last stop, this is the wrong place for you. This is where it starts.
Also, if you're just getting started, my one and only piece of advice is:
Get comfortable reading papers, because they never end.
Paper on How to Read a Paper: How to Read a Paper
1. LLM and Prompting Basics
2. Retrieval Augmented Generation (RAG)
3. Chunking Strategies
4. Embedding Models for Retrieval
5. Vector Retrieval, Databases and Indexes
6. Advanced Search Algorithms
7. Language Model Internal Workings
Let's consider a dataset where each data point represents a cat. Let's pass it through each type of model and see how they differ:
Let's build the definition of a Large Language Model (LLM) from the ground up:
Further reading: Common Crawl
Large Language Models are often trained in multiple stages; these stages are commonly named pretraining, fine-tuning, and alignment.
The purpose of this stage is to expose the model to all of language in an unsupervised manner; it is often the most expensive part of training and requires a lot of compute. Pretraining is often done on something like the Common Crawl dataset, and processed versions of it such as FineWeb and RedPajama are commonly used. To facilitate this broad type of learning, there exist multiple training tasks we can use, such as Masked Language Modeling (MLM), Next Sentence Prediction (NSP), and more.
Masked Language Modeling is based on the Cloze test, where we mask out a word in a sentence and ask the model to predict it, similar to a fill-in-the-blank test. It differs from asking the model to predict the next word in a sentence, as it requires the model to understand the context of the whole sentence, not just the sequence of words preceding the blank.
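For illustration, here is a minimal sketch of masked language modeling using the Hugging Face `transformers` fill-mask pipeline (assuming `bert-base-uncased`, whose mask token is `[MASK]`):

```python
# A minimal sketch of Masked Language Modeling with a pretrained BERT model.
# Assumes the `transformers` library is installed; bert-base-uncased uses [MASK].
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model predicts the most likely tokens for the masked position,
# using context from both sides of the blank.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```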
Next Sentence Prediction is a task where the model is given two sentences and has to predict whether the second sentence follows the first. As simple as it sounds, it requires the model to understand the context of the first sentence and the relationship between the two sentences.
An excellent resource to learn more about these tasks is the BERT paper.
This stage is much simpler than pretraining, as the model has already learned a lot about language, and now we just need to teach it about a specific task. All we need for this stage is the input data (prompts) and the labels (responses).
This stage is often the most crucial and complex one; it can require separate reward models, different learning paradigms such as Reinforcement Learning, and more.
This stage mainly aims to align the model's predictions with human preferences, and it often interweaves with the fine-tuning stage. Essential reading for this stage is the InstructGPT paper, which introduced the concept of Reinforcement Learning from Human Feedback (RLHF) using Proximal Policy Optimization.
Other methods of aligning the model's predictions with human preferences include:
Tokens are the smallest units of text that the model can understand; they can be words, subwords, or characters.
Tokenizers are responsible for converting text into tokens; they can be as simple as splitting the text by spaces or as complex as using subword tokenization. The choice of tokenizer can have a significant impact on the model's performance, as it affects the model's ability to understand the context of the text.
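As a quick illustration, a minimal sketch using a Hugging Face tokenizer (assuming `bert-base-uncased`, a WordPiece subword tokenizer) shows how words get split into subword tokens:

```python
# A minimal sketch of subword tokenization, assuming the `transformers` library
# and the WordPiece tokenizer that ships with bert-base-uncased.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization splits uncommon words into subwords."
print(tokenizer.tokenize(text))   # rare words become several '##'-prefixed pieces
print(tokenizer.encode(text))     # the corresponding token ids, with special tokens added
```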
Some common tokenizers include:
Recommended reading (and watching):
This is a very loaded question, but here are some resources to explore this topic further:
Parameters include:
Each of these parameters can be tuned to improve the performance of the model, and the quality of the generated text.
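As a hedged sketch, here is how common sampling parameters (temperature, top-k, top-p, maximum new tokens) are typically passed to a Hugging Face model's `generate` method; the model name and values are placeholders, not recommendations:

```python
# A minimal sketch of tuning sampling parameters with the `transformers` API.
# The model name and parameter values are placeholders for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,       # enable sampling instead of greedy decoding
    temperature=0.7,      # flatten or sharpen the token distribution
    top_k=50,             # keep only the 50 most likely tokens
    top_p=0.9,            # nucleus sampling: keep tokens covering 90% probability mass
    max_new_tokens=32,    # stop after 32 generated tokens
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```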
Recommended reading:
Decoding strategies are used to pick the next token in the sequence; they range from simple greedy decoding to more complex sampling strategies.
Some common decoding strategies include:
Newer decoding strategies include Speculative Decoding (also called assisted decoding), which is a wild concept: candidate tokens are drafted by a smaller (and thus faster) model and then verified by the bigger model, allowing it to generate a response much more quickly.
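To make the difference concrete, here is a minimal, framework-free sketch of greedy decoding versus temperature sampling over a single step's logits (the logits and toy vocabulary are made up for illustration):

```python
# A minimal sketch contrasting greedy decoding with temperature sampling
# for a single decoding step. The logits are made-up numbers for illustration.
import numpy as np

logits = np.array([2.0, 1.5, 0.3, -1.0])   # unnormalized scores over a toy vocabulary
vocab = ["the", "a", "cat", "qux"]

# Greedy decoding: always pick the highest-scoring token.
greedy_token = vocab[int(np.argmax(logits))]

# Temperature sampling: soften/sharpen the distribution, then sample from it.
temperature = 0.8
probs = np.exp(logits / temperature)
probs /= probs.sum()
sampled_token = np.random.choice(vocab, p=probs)

print("greedy:", greedy_token, "| sampled:", sampled_token)
```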
Recommended reading:
In the decoding process, LLMs autoregressively generate text one token at a time. There are several stopping criteria that can be used to determine when to stop generating text. Some common stopping criteria include:
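Typical examples are reaching a maximum number of new tokens, generating an end-of-sequence (EOS) token, or matching a user-supplied stop string. Below is a minimal sketch of how such criteria might be checked inside a generation loop; the token ids and limits are hypothetical:

```python
# A minimal sketch of common stopping criteria inside an autoregressive
# generation loop. The EOS id, limits, and stop sequence are hypothetical.
EOS_TOKEN_ID = 2
MAX_NEW_TOKENS = 128
STOP_SEQUENCE = "\n\n"

def should_stop(generated_ids: list[int], generated_text: str) -> bool:
    if len(generated_ids) >= MAX_NEW_TOKENS:                   # length limit reached
        return True
    if generated_ids and generated_ids[-1] == EOS_TOKEN_ID:    # model emitted end-of-sequence
        return True
    if STOP_SEQUENCE in generated_text:                        # user-supplied stop string appeared
        return True
    return False
```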
A prompt can contain any of the following elements (see the example after the list):
Instruction - a specific task or instruction you want the model to perform
Context - external information or additional context that can steer the model to better responses
Input Data - the input or question that we are interested in finding a response for
Output Indicator - the type or format of the output.
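For example, a single prompt combining all four elements might look like this (the task and text are made up for illustration):

```
Classify the following customer review as positive, negative, or neutral.    <- Instruction

You are a support assistant for an online bookstore.                         <- Context

Review: "The book arrived late, but the packaging was great."                <- Input Data

Sentiment:                                                                    <- Output Indicator
```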
Reference: Prompt Engineering Guide
Recommended reading:
Reference: Prompt Engineering Guide
Recommended reading:
In-context learning is a very intuitive and easy-to-understand learning paradigm in Natural Language Processing. It encompasses concepts such as few-shot learning. It can be as easy as providing a few examples of the task you want the model to perform; the model will learn from those examples and generate responses accordingly.
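For example, a simple few-shot prompt for sentiment classification might look like this (the examples are made up):

```
Text: "I loved this movie, the acting was superb."
Sentiment: positive

Text: "The plot was predictable and the pacing was slow."
Sentiment: negative

Text: "The soundtrack was amazing and the visuals were stunning."
Sentiment:
```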
Recommended Reading:
It has been shown that in-context learning only emerges when models are scaled to a certain size and trained on a diverse set of tasks. In-context learning can also fail when the model is unable to perform complex reasoning tasks.
Recommended Reading:
This is a very broad question, but the following will help you form a basic understanding of how to design prompts for a specific task:
Alternatively, newer research directions investigate optimizing prompts algorithmically; this has been explored extensively in the DSPy package, which provides the means to do this, and their work is also published in this paper.
There is no single answer to this question; I included it as an excuse to link this reference:
There are multiple methods to get LLMs to generate structured outputs that are parsable every time; common methods depend on the concept of function calling in LLMs.
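As a hedged sketch (not tied to any specific LLM provider), one common pattern is to describe the desired schema to the model, then parse and validate its output. Here I assume Pydantic v2 and a hypothetical `call_llm` function standing in for whatever LLM client you use:

```python
# A minimal sketch of enforcing structured output by validating the model's
# response against a schema. `call_llm` is a hypothetical function; Pydantic v2
# is assumed for parsing and validation.
from pydantic import BaseModel, ValidationError


class Ticket(BaseModel):
    title: str
    priority: str  # e.g. "low", "medium", "high"
    tags: list[str]


PROMPT = (
    "Extract a support ticket from the user message below. "
    "Respond with JSON only, matching this schema: "
    f"{Ticket.model_json_schema()}\n\n"
    "Message: My invoice page crashes every time I open it."
)

raw_response = call_llm(PROMPT)  # hypothetical LLM call returning a JSON string

try:
    ticket = Ticket.model_validate_json(raw_response)  # parse + validate in one step
    print(ticket)
except ValidationError as err:
    # In practice you might retry, repair the JSON, or fall back to function calling.
    print("Model output did not match the schema:", err)
```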
Recommended Reading and Viewing:
The term describes cases where LLMs produce text that is incorrect, nonsensical, or unrelated to reality.
Reference: LLM Hallucination—Types, Causes, and Solution by Nexla
Recommended Reading:
The concept of Chain-of-Thought Prompting is known to enhance reasoning capabilities in LLMs. This technique involves prompting the model to break a complex task down into a series of intermediate reasoning steps, which guide it towards the final output.
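A classic zero-shot variant simply appends a phrase that elicits intermediate reasoning steps, for example (the question is made up):

```
Q: A cafe sold 23 coffees in the morning and twice as many in the afternoon.
How many coffees did it sell in total?

A: Let's think step by step.
```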
Recommended Reading:
Retrieval Augmented Generation (RAG) is a common design pattern for grounding LLM answers in facts. This technique involves retrieving relevant information from a knowledge base and using it to guide the generation of text by the LLM.
Recommended Reading:
Retrieval Augmented Generation (RAG) is composed of two main components: a retriever and a generator.
The intuition behind RAG is that by combining the strengths of retrieval-based and generation-based models, we can create a system that is capable of generating text that is grounded in facts, thus limiting hallucination.
RAG is often the go-to technique for answering complex questions based on a knowledge base, as it allows the model to leverage external information to provide more accurate and informative answers. It is not always feasible to fine-tune a model on proprietary data, and RAG provides a way to incorporate external knowledge without the need for fine-tuning.
A full solution that utilizes RAG to answer a complex question based on a knowledge base would involve the following steps:
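To make the flow concrete, here is a heavily simplified sketch, assuming `sentence-transformers` for embeddings and a hypothetical `call_llm` function for the generation step:

```python
# A heavily simplified RAG sketch: embed the knowledge base, retrieve the
# top-k chunks for a query, and pass them to the LLM as context.
# Assumes `sentence-transformers` is installed; `call_llm` is hypothetical.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "The Great Wall of China is over 13,000 miles long.",
    "Mount Everest is the highest mountain above sea level.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

query = "When was the Eiffel Tower finished?"
query_vector = embedder.encode([query], normalize_embeddings=True)[0]

# Cosine similarity reduces to a dot product on normalized vectors.
scores = doc_vectors @ query_vector
top_k = np.argsort(scores)[::-1][:2]
context = "\n".join(documents[i] for i in top_k)

prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
answer = call_llm(prompt)  # hypothetical generation step
```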
This is a very loaded question, but here are some resources to explore this topic further:
Chunking text is the process of breaking down a large piece of text into smaller, more manageable chunks. In the context of RAG systems, chunking is important because it allows the retriever component to efficiently retrieve relevant information from the knowledge base. By breaking down the query into smaller chunks, the retriever can focus on retrieving information that is relevant to each chunk, which can improve the accuracy and efficiency of the retrieval process.
During the training of embedding models, which are often used as retrievers, positive and negative pairs of text are used to indicate which pieces of text correspond to each other. Examples include titles, headers, and subheaders on a Wikipedia page and their corresponding paragraphs, Reddit posts and their top-voted comments, etc.
A user query is embedded, and an index is queried; if the index contained entire documents to be queried for top-k hits, the retriever would not be able to return the most relevant information, as the documents would be far too large to embed and compare meaningfully.
To summarize, we chunk text because:
Suppose we have a book containing 24 chapters and a total of 240 pages. This means each chapter contains 10 pages; let's also suppose that each page contains 3 paragraphs, each paragraph contains 5 sentences, and each sentence contains 10 words. In total, we have 10 pages * 3 paragraphs * 5 sentences * 10 words = 1,500 words per chapter, and 1,500 * 24 = 36,000 words in the entire book. For simplicity, our tokenizer is a whitespace tokenizer, and each word is a token.
Suppose the best embedding model available to us is capable of embedding at most 8,192 tokens:
All of this is to illustrate that there is no fixed way to chunk text, and the best way to chunk text is to experiment and see what works best for your use case.
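As a starting point, here is a minimal sketch of fixed-size chunking with overlap over whitespace tokens; the chunk sizes are arbitrary and should be tuned for your use case:

```python
# A minimal sketch of fixed-size chunking with overlap, using whitespace
# tokenization as in the book example above. Chunk sizes are arbitrary.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Example: a 1,500-word chapter becomes 9 overlapping chunks of up to 200 words.
chapter = "word " * 1500
print(len(chunk_text(chapter)))
```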
An authoritative source on this topic is the excellent notebook and accompanying video by Greg Kamradt, in which they explain the different levels of text splitting.
The notebook also goes over ways to evaluate and visualize the different levels of text splitting, and how to use them in a retrieval system.
Recommended Viewing:
Vector embeddings map textual semantics into an N-dimensional space where vectors represent text; within this vector space, similar text is represented by similar vectors.
Recommended Reading:
Embedding models are language models trained for the purpose of vectorizing text. They are often BERT derivatives trained on a large corpus of text to learn its semantics; however, recent trends also show that it is possible to use much larger language models for this purpose, such as Mistral or Llama.
Recommended Reading and Viewing:
Embedding models are often used as retrievers; to utilize their retrieval capabilities, semantic textual similarity is used, wherein the vectors produced by the models are compared using similarity metrics such as dot product, cosine similarity, etc.
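For instance, a minimal sketch of comparing two embeddings with cosine similarity (the vectors are made-up toy values):

```python
# A minimal sketch of cosine similarity between two embedding vectors.
# The vectors are toy values; real embeddings have hundreds of dimensions.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.2, 0.8, 0.1])
doc_vec = np.array([0.25, 0.7, 0.05])
print(cosine_similarity(query_vec, doc_vec))  # close to 1.0 means semantically similar
```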
Recommended Reading:
Embedding models are trained with contrastive loss, ranging from simple contrastive loss up to more complex loss functions such as InfoNCE and Multiple Negatives Ranking Loss. A process known as hard negative mining is also utilized during training.
Recommended Reading:
Contrastive learning is a technique used to train embedding models; it involves learning to differentiate between positive and negative pairs of text. The model is trained to maximize the similarity between positive pairs and minimize the similarity between negative pairs.
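As a rough sketch, an InfoNCE-style loss with in-batch negatives can be written in PyTorch as follows; the embeddings here are random placeholders, whereas in real training they would come from encoding a batch of (query, positive document) pairs:

```python
# A rough sketch of an InfoNCE-style contrastive loss with in-batch negatives.
# The embeddings are random placeholders standing in for encoder outputs.
import torch
import torch.nn.functional as F

batch_size, dim, temperature = 8, 64, 0.05
query_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)
doc_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)

# Similarity of every query against every document in the batch.
# The diagonal holds the positive pairs; all other entries act as negatives.
scores = query_emb @ doc_emb.T / temperature
labels = torch.arange(batch_size)

loss = F.cross_entropy(scores, labels)
print(loss.item())
```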
Recommended Reading:
Cross-encoders and bi-encoders are two types of models used for text retrieval tasks. The main difference between the two is how they encode the query and the document.
Rerankers are usually cross-encoders: they encode the query and the document together and calculate the similarity between the two. This allows them to capture the interaction between the query and the document and produce better results than bi-encoders, at the cost of much higher computational complexity.
Text embedding models are usually bi-encoders: they encode the query and the document separately and calculate the similarity between the two embeddings. This allows them to be far more computationally efficient than cross-encoders, but they are not able to capture the explicit interaction between the query and the document.
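A hedged sketch of the two, using the `sentence-transformers` library; the model names are common public checkpoints used only as examples:

```python
# A sketch contrasting a bi-encoder with a cross-encoder using the
# sentence-transformers library. Model names are example public checkpoints.
from sentence_transformers import SentenceTransformer, CrossEncoder

query = "How do I reset my password?"
document = "To reset your password, click 'Forgot password' on the login page."

# Bi-encoder: query and document are encoded independently, then compared.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
q_vec, d_vec = bi_encoder.encode([query, document], normalize_embeddings=True)
print("bi-encoder cosine similarity:", float(q_vec @ d_vec))

# Cross-encoder: query and document are fed in together, producing one score.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print("cross-encoder score:", cross_encoder.predict([(query, document)])[0])
```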
Single-vector dense representations are the norm in text embedding models. They are usually produced by pooling the contextualized embeddings after a forward pass through the model; pooling techniques include mean pooling, max pooling, and CLS token pooling. The intuition behind single-vector dense representations is that they are simple to implement, usable for a wide range of tasks, and easy to index and retrieve. Dense representations are also able to capture the semantics of the text and are often used in second-stage ranking.
Multi-vector dense representations have been shown to produce superior results to single-vector dense representations. They are produced by skipping the pooling step and keeping the contextualized embeddings in the form of a matrix; the query and document embeddings are then compared using an operator such as MaxSim to calculate the similarity between the two. Models such as ColBERT have been shown to outperform single-vector approaches, and they also offer the ability to precompute document embeddings, allowing for very efficient retrieval. The intuition behind multi-vector dense representations is that they capture more information about the text; like single-vector representations, they capture the semantics of the text and are often used in second-stage ranking.
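To illustrate late interaction, here is a minimal sketch of the MaxSim operator over token-level embedding matrices; the matrices are random placeholders standing in for ColBERT-style token embeddings:

```python
# A minimal sketch of the MaxSim late-interaction operator used by ColBERT-style
# models. The token embedding matrices are random placeholders.
import torch
import torch.nn.functional as F

num_query_tokens, num_doc_tokens, dim = 8, 120, 128
query_tokens = F.normalize(torch.randn(num_query_tokens, dim), dim=-1)
doc_tokens = F.normalize(torch.randn(num_doc_tokens, dim), dim=-1)

# For each query token, take its maximum similarity over all document tokens,
# then sum these maxima to get the final query-document relevance score.
token_similarities = query_tokens @ doc_tokens.T        # (num_query_tokens, num_doc_tokens)
score = token_similarities.max(dim=1).values.sum()
print(float(score))
```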
Recommended Reading:
Sparse text representations are the oldest form of vector space models in information retrieval. They are usually based on TF-IDF derivatives and algorithms such as BM25, and they remain a baseline for text retrieval systems. Their sparsity stems from the fact that the dimension of the embeddings often corresponds to the size of the vocabulary. The intuition behind sparse representations is that they are explainable, computationally efficient, easy to implement, and extremely efficient for indexing and retrieval. Sparse representations also focus on lexical similarity and are often used in first-stage ranking.
Recommended Reading:
Sparse text embeddings allow for the use of inverted indices during retrieval.
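A minimal sketch of an inverted index, mapping each term to the documents that contain it:

```python
# A minimal sketch of an inverted index: each term maps to the ids of the
# documents that contain it, so query terms can be looked up directly.
from collections import defaultdict

documents = {
    0: "the cat sat on the mat",
    1: "the dog chased the cat",
    2: "dogs and cats are pets",
}

inverted_index: dict[str, set[int]] = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

# Retrieval: only documents containing a query term are ever considered.
print(inverted_index["cat"])   # {0, 1}
```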
Recommended Reading:
Metrics for benchmarking the performance of an embedding model include:
Recommended Reading and Viewing:
Picking an embedding model could be a pivotal factor in the performance of your retrieval system, and careful consideration should be taken when choosing one. It is a broad process that involves experimentation, and the following resources will help you make an informed decision:
Recommended Viewing:
A vector database is a database that is optimized for storing and querying vector data. It allows for efficient storage and retrieval of vector embeddings, and is often used in applications that require semantic similarity search. Vector databases are a new paradigm that has emerged as part of the tech stack needed to keep up with the demands of GenAI applications.
Recommended Viewing:
Traditional databases are optimized for storing and querying structured data, such as text, numbers, and dates, and are not designed to handle vector data efficiently. Vector databases, on the other hand, are specifically designed to store and query vector data; they use specialized indexing techniques and algorithms, such as quantization and clustering of vectors, to enable fast and accurate similarity search.
A vector database usually contains indexes of vectors. These indexes contain matrices of vector embeddings, often organized with a graph data structure as well, ordered in such a way that they can be queried efficiently. When a query is made, either text or a vector embedding is provided as input; in the case of text, it is embedded first, and the vector database queries the appropriate index to retrieve the most similar vectors based on distance metrics. Usually, vectors are compared using metrics such as cosine similarity, dot product, or Euclidean distance. Vectors also map to a dictionary of metadata that could contain information such as the document ID, the document title, the corresponding text, and more.
Search strategies in vector databases include:
Recommended Reading:
Once the vectors are indexed, they are often clustered to reduce the search space; this reduces the number of vectors that need to be compared during the search process. Clustering is done by grouping similar vectors together and then indexing the clusters. When a query is made, the search is first performed at the cluster level, and then at the vector level within the closest cluster. Algorithms such as K-means are often used for clustering.
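A rough sketch of this cluster-then-search pattern using scikit-learn's K-means; the vectors are random placeholders and the cluster count is arbitrary:

```python
# A rough sketch of cluster-then-search: vectors are grouped with K-means,
# the query is routed to the nearest cluster, and only that cluster's vectors
# are compared exhaustively. Vectors here are random placeholders.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))

kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(vectors)

query = rng.normal(size=64)
nearest_cluster = int(kmeans.predict(query.reshape(1, -1))[0])

# Search only within the chosen cluster instead of all 1,000 vectors.
member_ids = np.where(kmeans.labels_ == nearest_cluster)[0]
distances = np.linalg.norm(vectors[member_ids] - query, axis=1)
print("best match id:", int(member_ids[np.argmin(distances)]))
```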
Recommended Reading:
This is obviously a very loaded question, but here are some resources to explore this topic further:
Vector quantization, also called "block quantization" or "pattern matching quantization" is often used in lossy data compression. It works by encoding values from a multidimensional vector space into a finite set of values from a discrete subspace of lower dimension.
Reference: Vector Quantization
One general approach to LSH is to “hash” items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are.
Reference: Mining of Massive Datasets, 3rd Edition, Chapter 3, Section 3.4.1
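For intuition, here is a minimal sketch of random-hyperplane LSH for cosine similarity, where nearby vectors are likely to land in the same bucket; the vectors and hyperplanes are random placeholders:

```python
# A minimal sketch of random-hyperplane LSH for cosine similarity: each vector
# is hashed by which side of several random hyperplanes it falls on, so similar
# vectors tend to share a bucket. All values are random placeholders.
import numpy as np

rng = np.random.default_rng(42)
dim, num_hyperplanes = 64, 8
hyperplanes = rng.normal(size=(num_hyperplanes, dim))

def lsh_bucket(vector: np.ndarray) -> str:
    # One bit per hyperplane: 1 if the vector is on its positive side.
    bits = (hyperplanes @ vector > 0).astype(int)
    return "".join(map(str, bits))

v = rng.normal(size=dim)
v_similar = v + 0.01 * rng.normal(size=dim)   # a slightly perturbed copy
print(lsh_bucket(v), lsh_bucket(v_similar))   # very likely the same bucket
```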
Recommended Reading:
In short, PQ is the process of:
- Taking a big, high-dimensional vector,
- Splitting it into equally sized chunks — our subvectors,
- Assigning each of these subvectors to its nearest centroid (also called reproduction/reconstruction values),
- Replacing these centroid values with unique IDs — each ID represents a centroid
Reference: Product Quantization
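A rough sketch of the steps above with NumPy and scikit-learn's K-means; the vector dimensions, number of subvectors, and codebook sizes are arbitrary placeholders:

```python
# A rough sketch of Product Quantization: split vectors into subvectors, learn a
# small codebook per subspace with K-means, and store only the centroid ids.
# Dimensions, subvector counts, and codebook sizes are arbitrary placeholders.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
num_vectors, dim, num_subvectors, codebook_size = 1000, 128, 8, 256
sub_dim = dim // num_subvectors

vectors = rng.normal(size=(num_vectors, dim))
codebooks, codes = [], []

for m in range(num_subvectors):
    sub = vectors[:, m * sub_dim:(m + 1) * sub_dim]            # one subspace
    km = KMeans(n_clusters=codebook_size, n_init=4, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_)                      # reproduction values
    codes.append(km.labels_.astype(np.uint8))                  # centroid id per vector

codes = np.stack(codes, axis=1)   # each vector is now 8 small ids instead of 128 floats
print(codes.shape)                # (1000, 8)
```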
Recommended Reading:
The Inverted File Index (IVF) reduces the search scope through clustering.
Reference: Inverted File Index
Recommended Reading:
Hierarchical Navigable Small Worlds (HNSW) is often considered the state of the art in vector retrieval; it is a graph-based algorithm that builds a graph of the vectors and uses it to perform approximate nearest neighbor search.
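A hedged sketch of building and querying an HNSW index with the `hnswlib` library; the index parameters are illustrative defaults, not tuned recommendations:

```python
# A sketch of building and querying an HNSW index with the hnswlib library.
# The index parameters are illustrative, not tuned recommendations.
import hnswlib
import numpy as np

dim, num_elements = 64, 10000
data = np.random.rand(num_elements, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

index.set_ef(50)  # trade-off between recall and query speed
labels, distances = index.knn_query(data[:1], k=5)
print(labels, distances)
```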
Recommended Reading:
Distance and similarity metrics used in vector retrieval include:
Recommended Viewing:
This is a very active research topic, and no authoritative source exists, but here are some resources to explore this topic further:
It is also worth noting that search, retrieval and reranking systems are built on established patterns and architectures in the fields of information retrieval, recommendation systems, and search engines.
Some system architectures you might want to explore include:
Achieving good search in large-scale systems involves a combination of efficient indexing, retrieval, and ranking techniques. Some strategies to achieve good search in large-scale systems include:
You might notice that the entire process is done in phases of increasing complexity; this is known as phased ranking or multi-stage retrieval.
Recommended Reading:
But the most important aspect of achieving good search in large-scale systems is to experiment and iterate on your retrieval and ranking strategies, and to continuously monitor and evaluate the performance of your system.
Recommended Reading:
Recommended talks about improving search, retrieval and RAG systems:
Achieving fast search involves optimizing the indexing and retrieval process, which takes non-trivial engineering effort. The following are some examples of the current landscape in the field of search and retrieval optimization:
The current state of the art in vector retrieval indicates that multi-vector embeddings (late interaction) perform better than single-vector embeddings; however, optimizing their retrieval poses a significant engineering challenge. The following discusses multi-vector embeddings and their retrieval in depth:
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document.
Reference: BM25
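A quick sketch using the `rank_bm25` package on a toy corpus:

```python
# A quick sketch of BM25 scoring using the rank_bm25 package on a toy corpus.
from rank_bm25 import BM25Okapi

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat around the yard",
    "quantum computing uses qubits",
]
tokenized_corpus = [doc.split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
query_tokens = "cat on a mat".split()

# One score per document; term proximity within a document is not considered.
print(bm25.get_scores(query_tokens))
```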
Reranking models are sequence classification models trained to take a query-document pair and output a raw similarity score.
Recommended Reading, Viewing and Watching:
Evaluating RAG systems requires experimenting with and evaluating the individual components of the system, such as the retriever, generator, and reranker.
Recommended Reading:
Note: from here on, I'll refrain from answering as much as I can, and just link papers and references, this part is arguably one of the more complex parts, so it requires a lot of reading and understanding.
To understand attention, you'll need to be familiar with the Transformer architecture, and their predecessor architectures. Here are some resources to get you started:
The main bottleneck of self-attention is its quadratic complexity with respect to the sequence length. To understand the disadvantages of self-attention, you'll need to familiarize yourself with attention alternatives; the following will help you get started:
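For reference, a minimal sketch of scaled dot-product self-attention, which makes the quadratic cost visible in the (sequence length x sequence length) score matrix; the shapes are toy values:

```python
# A minimal sketch of scaled dot-product self-attention in NumPy. The
# (seq_len x seq_len) score matrix is what makes self-attention quadratic
# in sequence length. Shapes are toy values.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_model = 6, 16
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))

# In a real model, Q, K, V come from learned linear projections of x.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_model)     # (seq_len, seq_len): the quadratic part
weights = softmax(scores)               # attention weights per token
output = weights @ V                    # weighted sum of values
print(output.shape)                     # (6, 16)
```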
There are multiple ways to encode positional information in LLMs, the most common way is to use sinusoidal positional encodings, known as absolute positional encodings. Other methods include relative positional encodings, and newer methods such as Rotary Positional Embeddings. Here are some resources to get you started:
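For reference, a minimal sketch of the sinusoidal (absolute) positional encodings from the original Transformer paper:

```python
# A minimal sketch of the sinusoidal absolute positional encodings from
# "Attention Is All You Need": even dimensions use sine, odd dimensions cosine.
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_positional_encoding(seq_len=128, d_model=512).shape)  # (128, 512)
```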
To understand KV Cache, you'll need to be familiar with the Transformer architecture and its limitations.
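For intuition, here is a minimal sketch of the idea behind a KV cache: at each decoding step, the keys and values of previously processed tokens are reused rather than recomputed. The tensors and projections are random placeholders:

```python
# A minimal sketch of the KV cache idea: keys and values of already-processed
# tokens are stored and reused at each decoding step, so only the new token's
# projections need to be computed. Tensors here are random placeholders.
import numpy as np

d_model = 16
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

cached_keys, cached_values = [], []

def decode_step(new_token_embedding: np.ndarray) -> np.ndarray:
    # Only the newest token is projected; past K/V come from the cache.
    cached_keys.append(new_token_embedding @ W_k)
    cached_values.append(new_token_embedding @ W_v)
    K = np.stack(cached_keys)          # (num_tokens_so_far, d_model)
    V = np.stack(cached_values)
    query = new_token_embedding @ W_q
    weights = np.exp(query @ K.T / np.sqrt(d_model))
    weights /= weights.sum()
    return weights @ V                 # attention output for the new token

for _ in range(5):                     # five decoding steps; the cache grows by one each time
    out = decode_step(rng.normal(size=d_model))
print(len(cached_keys), out.shape)     # 5 (16,)
```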
Recommended Reading:
Mixture of Experts (MoE) is a type of architecture used in LLMs. To understand how it works, you should go through the following resources, which cover the most prominent MoE models: