Hello, this is Aurélie, working as an Artificial Intelligence Engineer at Ridge-i. Today I would like to share an introduction to multimodal Retrieval-Augmented Generation (multimodal RAG) systems as part of a survey we conducted internally at Ridge-i.
- Introduction
- Motivation
- When to use RAG
- Basic RAG pipeline
- Multimodal RAG
- Evaluation
- Error analysis
- Conclusion
- References
Introduction
Multimodal Large Language Models (MLLMs) have gained popularity for their impressive performance across a wide range of tasks. However, in practice, even state-of-the-art models struggle in industrial applications due to (1) limited knowledge of specialized domain data and (2) a tendency to hallucinate [1].
This post introduces multimodal RAG frameworks, which retrieve external knowledge based on a user query to improve the performance of multimodal models. We focus on the text and image modalities, although the methods can be generalized to others (video, sound, tabular data, etc.).
Motivation
A key challenge for MLLMs is to extend their capabilities to solve problems beyond the data on which they have been trained, typically general-purpose datasets. They often struggle with specialized industry data, facing hallucination, poor understanding of technical terminology, and gaps in knowledge.
While recent research efforts have largely focused on building larger models by increasing the number of parameters and using vast datasets, RAG offers an alternative approach that relies on external knowledge rather than embedding the knowledge in model weights. This is similar to how humans often rely on reference documents to answer complex questions they would otherwise be unable to answer.
Main benefits of RAG systems include:
- Reduced hallucinations: By grounding responses in retrieved external knowledge, RAG reduces the model’s tendency to “hallucinate” or fabricate details. Furthermore, a human can manually double-check the retrieved information.
- Low maintenance costs: External knowledge can be modified and updated independently of the model’s weights, which are notoriously time-consuming and expensive to train.
- Privacy compatibility: Private data can be used without embedding sensitive information directly in the model weights.
- Reduced training and inference costs: High performance can be achieved even with smaller models, which reduces the need for vast training datasets or expensive training resources. The focus shifts from accumulating large amounts of knowledge to having good reading and analysis capabilities.
When to use RAG
When trying to improve a multimodal model, it is important to identify the root causes of its failures. RAG is suitable when the model has a clear understanding of the task as well as enough reasoning capability but simply does not have the required knowledge to answer.
Cases where the model fails to interpret the input modalities (images, text, etc.) are unlikely to be solved by a RAG system.
RAG is a reliable choice for retrieval tasks and knowledge injection. This is particularly evident when dealing with less common factual knowledge, where RAG significantly outperforms supervised fine-tuning [2].
Although the documents' content could be added directly to the prompt through in-context learning, a dedicated database with a RAG system is preferable once the amount of available information grows large, because of the model's limited context window [3, 4].
It should be noted that RAG can be easily combined with other methods. For example, both fine-tuning and RAG can be used simultaneously to tackle different issues. Fine-tuning can be used to improve instruction understanding and customize the output’s tone and style while RAG can be used to inject specialized knowledge.
Basic RAG pipeline
The RAG pipeline remains consistent whether the system operates with one or multiple modalities. While individual modules (document pre-processing, embedding generation, and database) are adapted to accommodate different modalities, the overall structure and process are the same.
The most common way to use RAG is the “Retrieve-Read” framework, which includes three steps: Indexing, Retrieval, and Generation. It is also called “Simple RAG” or “Naive RAG” [1].
- Indexing: Indexing is the process of preparing external documents to be stored in a database for later use during user queries. Raw data such as text, images, and tables is extracted from source documents (PDFs, HTML, etc.), cleaned, and converted into embeddings before being stored in a database (also called a “vector store”).
- Retrieval: Various retrieval strategies are available, with similarity search being the most commonly used approach. Upon receiving a user query, the RAG system transforms the query into a vector representation using the same model as in the indexing step. It then computes similarity scores between the query vector and the documents stored in the database and returns the top-K chunks, which are then used as expanded context in the prompt.
- Generation: The retrieved information is combined with the user query to form a new prompt. This prompt can be created by simply concatenating the documents as context to the query, or by using a model (usually an LLM) to generate a more coherent and contextually integrated prompt. (A minimal code sketch of this Retrieve-Read flow follows below.)
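To make the three steps concrete, here is a self-contained toy sketch of the Retrieve-Read flow. The `embed` and `generate` functions are stand-ins (a hashing bag-of-words and a canned response) rather than a real embedding model or LLM client, so the snippet runs on its own and only illustrates the structure.

```python
import numpy as np

# Toy stand-ins so the sketch runs end to end; in practice these would be a
# real embedding model and an LLM client.
def embed(text: str) -> np.ndarray:
    vec = np.zeros(256)
    for token in text.lower().split():
        vec[hash(token) % 256] += 1.0            # crude bag-of-words hashing
    return vec

def generate(prompt: str) -> str:
    return f"[An LLM would answer here, given:]\n{prompt}"

# Indexing: embed and L2-normalize each document chunk once.
chunks = [
    "Chunk A: manual section describing error codes.",
    "Chunk B: maintenance schedule for the device.",
]
index = np.stack([embed(c) for c in chunks])
index /= np.linalg.norm(index, axis=1, keepdims=True)

# Retrieval: embed the query with the same model and keep the top-K chunks.
def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    q /= np.linalg.norm(q)
    scores = index @ q                           # cosine similarities
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# Generation: concatenate the retrieved chunks as context for the prompt.
def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."
    return generate(prompt)

print(answer("What does error code 42 mean?"))
```

The same structure carries over to the multimodal case; only the preprocessing and embedding steps change.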
However, the basic RAG pipeline may present some limitations. The retrieved data might be irrelevant, miss crucial information, or contain redundant content retrieved from multiple sources, while the generation step might still face hallucination and bias [1]. To tackle these issues, advanced techniques such as reranking and improved chunking methods have been introduced [1]. In this post, we do not discuss these and instead focus on the foundational aspects.
Multimodal RAG
An important consideration when building a multimodal RAG is how information is represented across the different modalities. Since each modality has its own challenges and specificities, it is necessary to choose the pipeline that best captures the information in the form of embeddings. Most modern search engines rely on high-dimensional embeddings from deep learning models for efficient, context-aware retrieval.
There are three main approaches to build multi-modal RAG pipelines as described by NVIDIA [5]:
- Unified embeddings
- Text embeddings
- Multiple stores
Unified embeddings
Some models are able to encode multiple modalities in the same vector space. In the case of images and text, Contrastive Language-Image Pretraining (CLIP) [6] is the most commonly used method.
The main limitation of this approach is the need for a model capable of generating unified embeddings across all target modalities, making it challenging to add new ones. For instance, in a text-image system using CLIP, incorporating a new modality like sound would require finding a model that can unify embeddings for text, images, and sound, as CLIP alone would no longer be sufficient.
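As an illustration of retrieval in a shared text-image space, here is a hedged sketch using the CLIP implementation from Hugging Face `transformers`. The checkpoint name is the standard public one, but the image paths and query string are hypothetical placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Public CLIP checkpoint; the image files below are hypothetical placeholders.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["figures/wiring_diagram.png", "figures/device_photo.jpg"]
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    # Image documents and the text query are embedded into the same vector space.
    image_embeds = model.get_image_features(**processor(images=images, return_tensors="pt"))
    query_embeds = model.get_text_features(
        **processor(text=["wiring diagram of the sensor unit"], return_tensors="pt", padding=True)
    )

# Cosine similarity between the text query and every indexed image.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
query_embeds = query_embeds / query_embeds.norm(dim=-1, keepdim=True)
scores = (query_embeds @ image_embeds.T).squeeze(0)
print("Most relevant image:", image_paths[scores.argmax().item()])
```

Because queries and documents share one space, a single vector store can hold both text chunks and images, and the same nearest-neighbor search serves both.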
Text embeddings
In this approach, all modalities are “converted” to a primary modality (usually text). Images are preprocessed by a pretrained model to generate descriptions, metadata, or summaries, which are then treated as text.
The resulting pipeline is relatively simple, as there is no need to train a model for generating unified embeddings or to use a re-ranker to compare and rank results from different modalities. However, the drawback of this method is the loss of some nuance from the image, as certain visual aspects are difficult to describe efficiently in text.
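One possible realization of this approach, sketched below, uses an image-captioning model (BLIP) to turn images into text and a sentence encoder to embed everything into a single text index. The model names are common public checkpoints and the image path is a hypothetical placeholder; the original post does not prescribe these specific models.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer
from transformers import BlipForConditionalGeneration, BlipProcessor

# Assumed models: BLIP for captioning, a sentence encoder for text embeddings.
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def image_to_text(path: str) -> str:
    """Convert an image into a caption so it can be indexed like any text chunk."""
    inputs = blip_processor(Image.open(path), return_tensors="pt")
    out_ids = blip.generate(**inputs, max_new_tokens=40)
    return blip_processor.decode(out_ids[0], skip_special_tokens=True)

# Text chunks and image captions end up in one and the same text index.
documents = [
    "Section 3.2 describes the calibration procedure.",
    image_to_text("figures/calibration_setup.png"),   # hypothetical file
]
embeddings = text_encoder.encode(documents, normalize_embeddings=True)
```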
Multiple stores
Another approach is to keep separate embeddings for different modalities. Each data type is processed by a modality-specific model and stored in a database (also called a “store”) specific to its modality. For N modalities, there would be N models and N stores. In this pipeline, a re-ranker is needed to query the top-N chunks of each modality, compare them, and return the most relevant ones.
This system is scalable, as there is no need to retrain existing models when adding a new modality, but it adds complexity since multiple models and stores are needed, as well as a re-ranker to compare documents from different modalities.
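The sketch below shows only the skeleton of such a pipeline: each store is queried with its own embedding model, and the per-store scores are min-max normalized before being merged. The store layout and the naive fusion rule are illustrative assumptions; in practice a dedicated re-ranker model would usually replace that last step.

```python
import numpy as np

def normalize(scores: np.ndarray) -> np.ndarray:
    """Min-max normalize one store's scores so different modalities are comparable."""
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

def query_store(store: dict, query_vec: np.ndarray, k: int) -> list[tuple[float, str]]:
    """Return the top-k (normalized score, chunk) pairs from one modality store."""
    scores = store["vectors"] @ query_vec            # cosine similarity on unit vectors
    norm = normalize(scores)
    top = np.argsort(norm)[::-1][:k]
    return [(float(norm[i]), store["chunks"][i]) for i in top]

def multimodal_retrieve(stores, query_vecs, k: int = 3):
    # Gather top-k candidates from every modality-specific store...
    candidates = []
    for store, q in zip(stores, query_vecs):
        candidates.extend(query_store(store, q, k))
    # ...then rank across modalities and keep the overall best k.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return candidates[:k]

# Toy demo: random unit vectors stand in for real text and image embeddings,
# and each store can use a different embedding dimension.
rng = np.random.default_rng(0)
unit = lambda v: v / np.linalg.norm(v)
text_store = {"vectors": np.stack([unit(rng.normal(size=8)) for _ in range(5)]),
              "chunks": [f"text chunk {i}" for i in range(5)]}
image_store = {"vectors": np.stack([unit(rng.normal(size=16)) for _ in range(5)]),
               "chunks": [f"image caption {i}" for i in range(5)]}
query_text = unit(rng.normal(size=8))     # query embedded by the text model
query_image = unit(rng.normal(size=16))   # query embedded by the image model
print(multimodal_retrieve([text_store, image_store], [query_text, query_image], k=3))
```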
Evaluation
Even when using a RAG system and external documents, the final generated answer may still suffer from issues such as hallucinations or inaccuracies, despite being grounded in the retrieved information.
The most common way to evaluate generated text is to compare the model output to a human-made annotation. For example, the MMMU benchmark [7] uses multiple-choice questions, which makes evaluation with standard metrics like accuracy possible. Other benchmarks use corpus metrics like BLEU [8] when ground-truth annotations are available.
When ground-truth annotations are not available, it is possible to use an “LLM-as-a-judge” to grade the output with a numerical score based on user-defined criteria such as honesty, fluency, or politeness. This can involve assessing not only the factual correctness of the responses but also their relevance to the original query and the coherence of the generated text.
It has been reported that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences, achieving over 80% agreement, which is the same level of agreement observed between humans [9].
Well-known open-source libraries such as PrometheusEval [10] are available for training, evaluating, and using language models specialized in evaluating other language models.
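As a rough illustration of the LLM-as-a-judge idea, the sketch below formats a grading prompt and parses a JSON score. The criteria, prompt wording, and the `call_llm` placeholder are assumptions; `call_llm` returns a canned response here so the snippet runs without an actual LLM client.

```python
import json

# Grading prompt with user-defined criteria (illustrative wording).
JUDGE_PROMPT = """You are grading the answer of a question-answering system.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Rate the answer from 1 to 5 on each criterion and reply with JSON only:
{{"faithfulness": 0, "relevance": 0, "fluency": 0}}"""

def call_llm(prompt: str) -> str:
    # Placeholder for any chat-completion client; returns a canned response
    # so the sketch is self-contained.
    return '{"faithfulness": 5, "relevance": 4, "fluency": 5}'

def judge(question: str, context: str, answer: str) -> dict:
    """Score one answer; scores can then be averaged over an evaluation set."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    return json.loads(raw)

print(judge("What does error code 42 mean?",
            "Error 42 indicates a sensor fault.",
            "Error code 42 means the sensor has failed."))
```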
Error analysis
Error analysis is usually done manually by humans. The error cases can be randomly sampled and then classified in different predefined categories.
In the case of the MMMU benchmark [7], the categories are:
- Perceptual errors (failure to understand the image or the text)
- Lack of domain knowledge (failure to understand specific technical terms or symbols)
- Reasoning errors (flawed logic or reasoning despite correctly interpreting the images and the text)
Conclusion
This blog post introduces RAG and multimodal RAG systems as solutions for enhancing MLLMs by injecting external knowledge. They are particularly effective for cases where the models understand the task but lack the necessary domain-specific knowledge which can cause hallucination.
The basic RAG pipeline involves three main steps. First, reference documents are indexed by converting them into embeddings and storing them in a vector database. Next, the relevant information is retrieved by matching the user’s query to the stored embeddings. Finally, a response is generated by combining the retrieved information with the context of the query.
When working with multiple modalities, it is essential to design an embedding system that effectively represents the information from diverse sources. Common strategies include using unified embeddings to encode all modalities into a shared vector space, converting all modalities into a primary format like text for simplified processing, or maintaining separate embeddings for each modality with dedicated storage and integration mechanisms.
Evaluation of RAG systems is similar to that of MLLMs: performance can be assessed by comparison with human annotations, or with automated approaches such as “LLM-as-a-judge”, which rely on large language models to grade responses on criteria such as accuracy, coherence, and relevance.
In the next blog post, we will share the results of a small trial conducted at Ridge-i to build a chatbot able to answer questions about Ridge-i using a multimodal RAG system, in which the “Unified embeddings” and “Text embeddings” methods are compared.
References
[1] Yunfan Gao et al. Retrieval-Augmented Generation for Large Language Models: A Survey. 2024. arXiv: 2312.10997 [cs.CL]. url: https://arxiv.org/abs/2312.10997.
[2] Heydar Soudani, Evangelos Kanoulas, and Faegheh Hasibi. Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge. 2024. arXiv: 2403.01432 [cs.CL]. url: https://arxiv.org/abs/2403.01432.
[3] Nelson F. Liu et al. Lost in the Middle: How Language Models Use Long Contexts. 2023. arXiv: 2307.03172 [cs.CL]. url: https://arxiv.org/abs/2307.03172.
[4] Google. What is Retrieval-Augmented Generation (RAG)? url: https://cloud.google.com/use-cases/retrieval-augmented-generation?hl=en.
[5] Annie Surla et al. An Easy Introduction to Multimodal Retrieval-Augmented Generation. NVIDIA Developer Blog, 2024. url: https://developer.nvidia.com/blog/an-easy-introduction-to-multimodal-retrieval-augmented-generation/.
[6] Alec Radford et al. “Learning transferable visual models from natural language supervision”. In: International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763. arXiv: 2103.00020. url: https://arxiv.org/abs/2103.00020.
[7] Xiang Yue et al. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. 2024. arXiv: 2311.16502 [cs.CL]. url: https://arxiv.org/abs/2311.16502.
[8] Kishore Papineni et al. BLEU: a Method for Automatic Evaluation of Machine Translation. 2002. url: https://aclanthology.org/P02-1040.pdf.
[9] Lianmin Zheng et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. 2023. arXiv: 2306.05685 [cs.CL]. url: https://arxiv.org/abs/2306.05685.
[10] Seungone Kim et al. Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. 2024. arXiv: 2405.01535 [cs.CL]. url: https://arxiv.org/abs/2405.01535.