Hello! This is Aurélie, an Artificial Intelligence Engineer at Ridge-i.
In this blog post, I'll be sharing insights from a small trial where I experimented with LlamaIndex [1] and Qdrant [2] to build a multimodal Retrieval-Augmented Generation (multimodal RAG) system.
- Introduction
- Case study design
- Baseline - Initial results by ChatGPT-4o
- Text RAG
- Unified embeddings results
- Text embeddings results
- Text embeddings results with InternVL2.5
- Results comparison
- Conclusion
- References
Introduction
As introduced in our previous blog post [3], there are three main approaches to building multimodal RAG pipelines as described by NVIDIA [4]:
- Unified embeddings
- Text embeddings
- Multiple stores
But how do we choose the best embedding strategy?
In this blog post, we explore the differences between "Unified Embeddings" and "Text Embeddings" methods, comparing their performance and effectiveness.
Case study design
In this blog post, we aim to build a chatbot-like system capable of answering relevant questions about Ridge-i.
To support our trial, we use 7 articles as our reference documents.
- 住友ゴムが実施する「DUNLOP全国タイヤ安全点検」向けに、Ridge-iがAIタイヤ点検ツールを提供~AIによる画像診断からタイヤの摩耗状態を判定~ [5]
- Ridge-iが参画する産学連携海洋ごみ削減プロジェクト「Debris Watchers」がドローンによる海洋ごみ解析サービスの開発進捗を公開 ~国産ドローンとAIを活用し、実証実験を山形県鶴岡市で実施~ [6]
- 株式会社SUBARUに生成AI活用コンサルティングサービスの提供を開始 ~設計開発業務における生成AI活用余地を共同で検証~ [7]
- バルカーとリッジアイ、AI技術を搭載した 設備異常検知システム「VHERMEⓇ」の提供を開始 〜振動、超音波センシング全般に対応可能なAI 技術を開発~ [8]
- リッジアイと北大動物医療センター、 猫の胸部X線画像から骨を除去するAI技術に関する研究論文が 「ISBI 2024」に採択 [9]
- AI・ディープラーニング技術開発のリッジアイと AI ソリューションプロバイダのジーデップ・アドバンス、 LLM開発システム「ローカルRAGスターターBOX」を共同開発・受注開始 [10]
- リッジアイ、異常検知・外観検査AIソリューション 「Ridge Inspection」を提供開始 感性評価AIなどのカスタムAI技術により、従来の外観検査・異常検知を高精度かつ効率的に革新 [11]
Two types of questions were designed: those for which answers can be found in the text and those for which answers can be found in images.
| Question | Answer location | Answer | Reference |
|---|---|---|---|
| What is the project Debris Watchers about? | Text | A project to monitor marine debris on Japanese beaches using drone images. | 6つの企業、2つの大学で構成される産学連携海洋ごみ削減プロジェクト「Debris Watchers」[...] ドローンとAIを活用した海洋ごみ解析の実証実験を行い、有効性を確認いたしました。 [6] |
| Which city was chosen for the demonstration experiment conducted of the Debris Watchers project? | Text | Tsuruoka City (山形県鶴岡市) | 「Debris Watchers」のドローンサブチームに参画し、山形県鶴岡市のご協力のもと、本日鶴岡市の海岸において、ドローンと AI を活用した海洋ごみ解析の実証実験を行い、有効性を確認いたしました。 [6] |
| What AI technology was developed in collaboration with VALQUA? | Text | Anomaly detection system using vibration sensors | 「株式会社バルカーと、株式会社Ridge-iは、回転機器の振動・超音波センシング全般に対応可能なAI 技術を開発し、同技術を搭載した設備異常検知システム「VHERMEⓇ(ベルム)」を共同開発したことを発表します。[8] |
| Where has the paper “Unsupervised Domain Adaptation for Human and Animal Chest X-ray Bone Suppression” been accepted to? | Text | IEEE International Symposium on Biomedical Imaging(ISBI 2024) | 株式会社 Ridge-iと北海道大学動物医療センターは共同研究の、猫の胸部 X 線画像から骨を高精度に除去する AI 技術に関する論文が 医療画像に関する主要な国際学会の一つである 21st IEEE International Symposium on Biomedical Imaging(ISBI 2024)。[9] |
| Please tell me if there is any product where Ridge-i supports LLM and RAG system development. | Text | Ridge-i supports the development of LLM and RAG applications with the ローカル RAG スターターBOX product. | 株式会社Ridge-iと株式会社ジーデップ・アドバンスは共同で、オンプレミス環境でユーザー固有の文書やデータを活用したRAGを実装することで、より手軽・迅速・安全な大規模言語モデル(LLM)開発を可能にするシステム「ローカルRAGスターターBOX」を共同で開発し、2025年1月20日より受注を開始しました。 [10] |
| What colors are on Ridge-i’s logo? | Image | White, blue and black | (image) |
| What are the types of debris detected by Debris Watchers using drone images? | Image | Polystyrene, fishing buoys, artificial wood, and driftwood | (image) |
| Is it possible to use FA camera input for Ridge Inspection? | Image | Yes | (image) |
| Please tell me the difference about the number and models of the entry and high specs models of the ローカル RAG スターターBOX product. What is the difference in GPU memory? | Image | The entry model uses a single RTX 4500 Ada 24GB while the high specs one uses 3 RTX 6000Ada 48GB. | (image) |
| According to the tire inspection tool, is it ok to drive with tires when there are scratches? | Image | No | (image) |
Note: Questions are organized by answer location (in text or image)
Note: Please be aware that the documents are written in Japanese while the questions are asked in English.
Baseline - Initial results by ChatGPT-4o
First, we use ChatGPT-4o to establish a baseline by evaluating the performance of a state-of-the-art model without access to external knowledge (web search disabled).

We ask the questions designed above and manually evaluate whether each answer provided by the model is correct.
Unfortunately, not a single question is answered accurately in this case.
| Pipeline | Baseline (no RAG) |
|---|---|
| Accuracy (All questions) | 0% (0/10) |
| Accuracy (Reference answer in text) | 0% (0/5) |
| Accuracy (Reference answer in image) | 0% (0/5) |
Furthermore, it seems that the model has a strong tendency to hallucinate. Rather than refusing to answer due to a lack of knowledge, the model fabricates incorrect answers instead.
| Question | Answer | Evaluation |
|---|---|---|
| What is the project Debris Watchers about? | Debris Watchers by Ridge-i is a satellite-based service that leverages AI to monitor and detect space debris. It aims to enhance the safety of satellite operations by providing real-time tracking and risk analysis of potential collisions in orbit. | The project is about marine debris, not about space. Hallucination case |
| What AI technology was developed in collaboration with VALQUA? | Ridge-i collaborated with バルカー (Nippon Valqua) to develop an AI-based defect detection system for industrial sealing products. The project aimed to improve quality control and operational efficiency by leveraging AI technology. | The project is not about sealing products. Hallucination case |
In the two examples above, the model appears to rely on the companies' reputations (Ridge-i for space projects and VALQUA for sealing products) to make predictions.
Text RAG
In the previous section, we saw that answers generated solely from a model’s pre-trained knowledge (even for state-of-the-art models!) are often unreliable.
To improve accuracy and relevance, we introduce a text-based RAG pipeline. Text from the reference documents is stored in a Qdrant vector database; during answer generation, the top three most relevant documents are retrieved via semantic similarity search and provided to the model as context.
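In our pipeline this step is handled by LlamaIndex on top of Qdrant, but the core retrieval logic can be sketched in a few lines. The file names, toy 3-dimensional vectors, and the `retrieve` helper below are made up for illustration; a real system would embed the articles with an embedding model and delegate the search to the vector store.

```python
import math

# Toy vectors stand in for real text embeddings; a production system would
# obtain these from an embedding model and store them in Qdrant.
DOCS = {
    "debris_watchers.txt":  [0.9, 0.1, 0.0],
    "valqua_vherme.txt":    [0.1, 0.8, 0.1],
    "local_rag_box.txt":    [0.0, 0.2, 0.9],
    "ridge_inspection.txt": [0.2, 0.7, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, k=3):
    """Return the k document ids most similar to the query embedding."""
    ranked = sorted(DOCS, key=lambda d: cosine(query_vec, DOCS[d]), reverse=True)
    return ranked[:k]

# A query embedding close to the Debris Watchers article ranks it first.
print(retrieve([1.0, 0.0, 0.1]))
```

The top-k documents returned here are what gets injected into the prompt as context before the model generates its answer.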
The model successfully answers text-based questions and shows a significant improvement over the baseline (no RAG) approach.
| Pipeline | Text RAG |
|---|---|
| Accuracy (All questions) | 50% (5/10) |
| Accuracy (Reference answer in text) | 100% (5/5) |
| Accuracy (Reference answer in image) | 0% (0/5) |
| Question | Baseline answer | Text-RAG answer |
|---|---|---|
| Please tell me if there is any product where Ridge-i supports LLM and RAG system development. | Yes, Ridge-i offers a product called "AltText" that supports the development of systems using Large Language Models (LLM) and Retrieval-Augmented Generation (RAG). [...] Hallucination case | Yes, Ridge-i supports the development of the LLM and RAG system through the joint development of the "Local RAG Starter BOX" system with GDEP Advance, Inc. This system allows for the implementation of RAG using user-specific documents and data in an on-premises environment, enabling more convenient, rapid, and secure large-scale language model development. Correct |
| Which city was chosen for the demonstration experiment conducted of the Debris Watchers project? | The city chosen for the demonstration experiment of the Debris Watchers project was Fukuoka, Japan. Hallucination case | The city chosen for the demonstration experiment conducted of the Debris Watchers project was Tsuruoka in Yamagata Prefecture. Correct |
However, the model is still unable to answer questions that require information only found in images.
| Question | Baseline answer | Text-RAG answer |
|---|---|---|
| What colors are on Ridge-i’s logo? | Ridge-i's logo primarily features shades of blue and green. [...] Hallucination case | The context information does not provide any details about the colors on Ridge-i's logo. Refuses to answer |
| Is it possible to use FA camera input for Ridge Inspection? | Yes, it is possible to use a FA (Factory Automation) camera for ridge inspection, depending on the specific requirements and conditions of the inspection task. [...] The model misunderstands the question and confuses “Ridge-i” for “ridge”. | Based on the provided context information, there is no specific mention of using FA (Factory Automation) camera input for Ridge Inspection. [...] It is recommended to consult with Ridge-i directly or refer to their official documentation for more detailed information on the compatibility of FA camera input with Ridge Inspection. Refuses to answer |
Although the model still cannot answer some of our questions, the risk of hallucination is reduced: instead of fabricating an answer, the model now refuses to answer. This is likely because the model is instructed to answer based on the retrieved context, so it can recognize when the relevant information is unavailable.
Unified embeddings results
By relying solely on text, our model misses information that is only present in images (for example, tables, diagrams, etc.). To overcome this limitation, we extend our system to multimodal RAG, integrating both text and image data.
First, we try the "Unified Embedding" method which generates a single shared representation for both text and image data.
In this trial, we use CLIP (Contrastive Language-Image Pre-Training) [12] to prepare a multimodal Qdrant vector store. CLIP jointly trains an image encoder and a text encoder on a large set of (image, text) pairs with a contrastive loss, learning to map semantically similar images and texts to nearby points in a shared embedding space.

Matching image and text descriptions are trained to have similar embeddings
GPT-4o is used as the generation model in the query engine, which, as before, relies on similarity scores to retrieve the most relevant documents from the vector store.
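A minimal sketch of what "unified" means here: text chunks and images live in one store, and a text query is compared against both modalities with the same similarity function. The vectors and file names below are toy stand-ins for real CLIP embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# One shared store: with CLIP, text and images are embedded into the same
# space, so a single collection can hold both modalities side by side.
STORE = [
    ("ridge_inspection.txt", "text",  [0.7, 0.3, 0.1]),
    ("ri_logo.PNG",          "image", [0.8, 0.1, 0.2]),
    ("debris_types.png",     "image", [0.1, 0.9, 0.1]),
]

def retrieve(query_vec, k=2):
    """Rank every item, text or image alike, by similarity to the query."""
    ranked = sorted(STORE, key=lambda it: cosine(query_vec, it[2]), reverse=True)
    return [(name, modality) for name, modality, _ in ranked[:k]]

# A query embedding close to the debris-types image surfaces the image first.
print(retrieve([0.1, 1.0, 0.1]))
```

Because text and images compete in the same ranking, this design can also exhibit the modality bias discussed below, where semantically close text chunks crowd out the relevant image.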
| Pipeline | Multimodal RAG (Unified embeddings) |
|---|---|
| Accuracy (All questions) | 80% (8/10) |
| Accuracy (Reference answer in text) | 100% (5/5) |
| Accuracy (Reference answer in image) | 60% (3/5) |
Compared to text-based RAG, the model's performance improves, maintaining its ability to answer text-based questions while also handling some image-based queries.
| Question | Answer | Evaluation |
|---|---|---|
| What colors are on Ridge-i’s logo? | I can't determine the colors of Ridge-i’s logo based on the provided text. | Refuses to answer |
| What are the types of debris detected by Debris Watchers using drone images? | The types of debris detected by Debris Watchers using drone images include: 1. 発砲スチロール (Styrofoam) 2. 漁業ブイ (Fishing buoys) 3. 人工木 (Artificial wood) 4. 流木 (Driftwood) | Correct |
| Is it possible to use FA camera input for Ridge Inspection? | Yes, it is possible to use FA camera input for Ridge Inspection. | Correct |
| Please tell me the difference about the number and models of the entry and high specs models of the ローカル RAG スターターBOX product. What is the difference in GPU memory? | The differences between the entry and high specs models of the ローカル RAG スターターBOX product are as follows: GPU Memory: - Entry Model: 24GB - High Specs Model: 48GB (per GPU, with two GPUs) The high specs model has more powerful CPUs and GPUs with larger memory capacity compared to the entry model. | Wrong. The high specs model features 3 GPUs. |
| According to the tire inspection tool, is it ok to drive with tires when there are scratches? | [...] the final inspection result could vary. It may suggest "rotation recommended" or "replacement needed" based on the human assessment. [...] | Correct |
The system demonstrates some ability to read the content from images, but it occasionally misinterprets or fails to extract the information accurately, especially when the text is unclear, distorted, or embedded within complex visuals.
Here, the failure cases are caused by:
- Retrieval of the wrong document. When asked “What colors are on Ridge-i's logo?”, the top document retrieved using similarity search is “ridge inspection.txt” rather than the “ri_logo.PNG” image file. Comparing the relevance of the retrieved images and text is challenging, as semantic similarity search may inherently favor text documents.
- Poor visual understanding. When asked “Please tell me the difference about the number and models of the entry and high specs models of the ローカル RAG スターターBOX product. What is the difference in GPU memory?”, the model retrieved the correct image but failed to extract the image content correctly.
Text embeddings results
This time, images are first converted into text by prompting GPT-4o to generate a detailed description using the following simple prompt: "Please describe the image as best as you can in detail."

The image shows a collection of debris scattered on a surface, likely a beach or shoreline. The debris includes various colorful items such as plastic pieces, straws, and other small objects. There is a close-up inset on the right side highlighting some of these items in more detail. At the bottom of the image, there is a color-coded legend with Japanese text: - The first label (in light green) reads "発砲スチロール" which means "Foamed Styrofoam." - The second label (in pink) reads "漁業ブイ" which means "Fishing Buoy." - The third label (in purple) reads "人工木" which means "Artificial Wood." - The fourth label (in brown) reads "流木" which means "Driftwood." These labels likely correspond to different types of debris shown in the image.
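The indexing side of this approach can be sketched as follows: each image is replaced by its generated description and indexed like any other text chunk, with metadata pointing back to the source file. `describe_image` and `build_text_corpus` are hypothetical helpers; the canned description stands in for a real GPT-4o (or InternVL2.5) call and is abbreviated.

```python
def describe_image(path):
    """Stand-in for a vision-model call with the prompt
    'Please describe the image as best as you can in detail.'"""
    canned = {
        "debris_types.png": (
            "Beach debris with a color-coded legend: styrofoam, "
            "fishing buoys, artificial wood, and driftwood."
        ),
    }
    return canned[path]

def build_text_corpus(text_files, image_files):
    """Merge real text chunks and image descriptions into one text corpus."""
    corpus = [{"source": p, "modality": "text", "body": body}
              for p, body in text_files.items()]
    for path in image_files:
        corpus.append({"source": path, "modality": "image",
                       "body": describe_image(path)})
    return corpus

corpus = build_text_corpus(
    {"debris_watchers.txt": "Drone-based marine debris monitoring..."},
    ["debris_types.png"],
)
```

Because every entry is now plain text, a single text embedding model can index the whole corpus, which is what makes the similarity comparison balanced across modalities.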
| Pipeline | Multimodal RAG (Text-embeddings) |
|---|---|
| Accuracy (All questions) | 90% (9/10) |
| Accuracy (Reference answer in text) | 100% (5/5) |
| Accuracy (Reference answer in image) | 80% (4/5) |
By converting images into text descriptions, the semantic search becomes more balanced, allowing text and image content to be compared within the same modality.
| Question | Answer | Evaluation |
|---|---|---|
| What colors are on Ridge-i’s logo? | The colors on Ridge-i's logo are blue, white, and black. | Correct |
| Please tell me the difference about the number and models of the entry and high specs models of the ローカル RAG スターターBOX product. What is the difference in GPU memory? | The difference between the entry and high-spec models of the "ローカル RAG スターター-BOX" product is in the number of GPUs and GPU memory. The entry model has 1 x RTX A4500 Ada 24GB GPU, while the high-spec model has 4 x RTX A4500 Ada 48GB GPUs | Wrong |
As we can see, the previous issue of failed document retrieval for the “What colors are on Ridge-i's logo?” question is now solved as the model correctly retrieves the text description of the Ridge-i logo.
Unfortunately, the last failure case is still not answered correctly. To understand the reason for failure, let’s look into the generated description of the image.

The image is a table detailing the specifications and pricing for different models of a product called "ローカル RAG スターター-BOX" (Local RAG Starter-BOX). The table is divided into three main columns, each representing a different model: Entry Model, Standard Model, and High-Spec Model. [...] ### Specifications: #### Entry Model: - **CPU**: Intel Xeon W5-2555X (16 cores, 32 threads) - **GPU**: 1 x RTX A4500 Ada 24GB - **OS**: Ubuntu 22.04 LTS [...] #### High-Spec Model: - **CPU**: Intel Xeon W5-3455X (24 cores, 48 threads) - **GPU**: 4 x RTX A4500 Ada 48GB - **OS**: Ubuntu 22.04 LTS [...]
In this case, the issue stems from the poorly generated description rather than the RAG process itself. The model used for description (GPT-4o) fails to accurately extract the content of the image, producing misleading context during answer generation. This issue is therefore unlikely to be resolved without replacing the image-description model with a stronger one.
Text embeddings results with InternVL2.5
We replace GPT-4o with the open-source InternVL2.5 [13] model for image description, as it is known to perform well on visual understanding tasks.
| Pipeline | Multimodal RAG (Text-embeddings) |
|---|---|
| Accuracy (All questions) | 100% (10/10) |
| Accuracy (Reference answer in text) | 100% (5/5) |
| Accuracy (Reference answer in image) | 100% (5/5) |
This time the model answers all questions accurately! 🎉
Compared to the previous system, which failed to read the specifications of the "ローカル RAG スターターBOX" product, this model accurately reads the type and number of GPUs used.
The image is a table in Japanese that lists the specifications and prices of different models of a RAG-STAR BOX. The table is divided into three columns, each representing a different model: エントリーモデル (Entry Model), スタンダードモデル (Standard Model), and ハイスペックモデル (High-Spec Model). [...] ### GPU - エントリーモデル: 1x RTX 4500 Ada 24GB - スタンダードモデル: 1x RTX 6000Ada 48GB - ハイスペックモデル: 3x RTX 6000Ada 48GB
As a result, the RAG generation is able to accurately answer all questions, including the ones where the answer is embedded within complex visuals.
| Question | Answer | Evaluation |
|---|---|---|
| Please tell me the difference about the number and models of the entry and high specs models of the ローカル RAG スターターBOX product. What is the difference in GPU memory? | The difference between the entry and high specs models of the ローカル RAG スターターBOX product is as follows: Therefore, the entry model has 1 GPU with 24GB of memory, while the high-spec model has 3 GPUs with 48GB of memory. | Correct |
By using the text-embedding approach, it is relatively easy to experiment with different models when the vision component underperforms and choose one that is suitable for our task.
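This modularity can be made concrete with a small sketch: swapping the vision model amounts to swapping one callable, with the rest of the pipeline untouched. The two backends below are hard-coded stand-ins reproducing the behaviour we observed in this trial (GPT-4o misreading the GPU line, InternVL2.5 reading it correctly); real backends would call the respective models.

```python
# Hard-coded stand-ins for the two captioning backends we tried; real
# implementations would call the GPT-4o API or run InternVL2.5 locally.
def gpt4o_describe(path):
    return "High-Spec Model GPU: 4 x RTX A4500 Ada 48GB"  # misread (our trial)

def internvl_describe(path):
    return "High-Spec Model GPU: 3x RTX 6000Ada 48GB"  # correct (our trial)

def index_images(paths, describe):
    """Caption every image with the given backend; nothing else changes."""
    return {p: describe(p) for p in paths}

# Switching models is a one-argument change to the indexing call.
docs = index_images(["rag_box_specs.png"], describe=internvl_describe)
print(docs["rag_box_specs.png"])
```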
Results comparison
The table below summarizes the performance of all methods presented in this blog post. Multimodal RAG outperforms text-based RAG by enabling access to a broader range of external knowledge, including non-textual sources such as images.
| Pipeline | Baseline (no RAG) | Text-based RAG | Multimodal RAG (Unified embeddings) | Multimodal RAG (Text embeddings, image description by GPT-4o) | Multimodal RAG (Text embeddings, image description by InternVL2.5) |
|---|---|---|---|---|---|
| Accuracy (All questions) | 0% (0/10) | 50% (5/10) | 80% (8/10) | 90% (9/10) | 100% (10/10) |
| Accuracy (Reference answer in text) | 0% (0/5) | 100% (5/5) | 100% (5/5) | 100% (5/5) | 100% (5/5) |
| Accuracy (Reference answer in image) | 0% (0/5) | 0% (0/5) | 60% (3/5) | 80% (4/5) | 100% (5/5) |
Conclusion
For this small trial, the best results are achieved by a multimodal RAG with the "Text embeddings" strategy, which converts images to text. We expect text embeddings to perform well in real projects too, since the images commonly used in our industry (such as graphs, charts, simulation data, and result figures) can usually be described well in text.
This time, we used a small amount of data and questions and addressed the system’s weak points one by one until we achieved perfect accuracy on our data. For different data, conclusions might be different. To choose the best design for a RAG pipeline, the nature of the data and how it is best represented and retrieved should be carefully considered.
References
[1] Jerry Liu. LlamaIndex. 2022. url: https://github.com/jerryjliu/llama_index.
[2] Qdrant. url: https://github.com/qdrant/qdrant.
[3] Ridge-i. A general overview of multimodal RAG. 2025. url: https://iblog.ridge-i.com/entry/2025/01/30/124153.
[4] Hayden Wolff. An Easy Introduction to Multimodal Retrieval-Augmented Generation. NVIDIA Developer Blog, 2023. url: https://developer.nvidia.com/blog/an-easy-introduction-to-multimodal-retrieval-augmented-generation/.
[5] 住友ゴムが実施する「DUNLOP全国タイヤ安全点検」向けに、Ridge-iがAIタイヤ点検ツールを提供~AIによる画像診断からタイヤの摩耗状態を判定~. 2021. url: https://ridge-i.com/news/3339/.
[6] Ridge-iが参画する産学連携海洋ごみ削減プロジェクト「Debris Watchers」がドローンによる海洋ごみ解析サービスの開発進捗を公開 ~国産ドローンとAIを活用し、実証実験を山形県鶴岡市で実施~. 2021. url: https://ridge-i.com/news/3344/.
[7] 株式会社SUBARUに生成AI活用コンサルティングサービスの提供を開始 ~設計開発業務における生成AI活用余地を共同で検証~. 2023. url: https://ridge-i.com/news/4019/.
[8] バルカーとリッジアイ、AI技術を搭載した 設備異常検知システム「VHERMEⓇ」の提供を開始 〜振動、超音波センシング全般に対応可能なAI 技術を開発~. 2024. url: https://ridge-i.com/news/4220/.
[9] リッジアイと北大動物医療センター、 猫の胸部X線画像から骨を除去するAI技術に関する研究論文が 「ISBI 2024」に採択. 2024. url: https://ridge-i.com/news/4324/.
[10] AI・ディープラーニング技術開発のリッジアイと AI ソリューションプロバイダのジーデップ・アドバンス、 LLM開発システム「ローカルRAGスターターBOX」を共同開発・受注開始. 2025. url: https://ridge-i.com/news/4435/.
[11] リッジアイ、異常検知・外観検査AIソリューション 「Ridge Inspection」を提供開始 感性評価AIなどのカスタムAI技術により、従来の外観検査・異常検知を高精度かつ効率的に革新. 2025. url: https://ridge-i.com/news/4439/.
[12] Alec Radford et al. “Learning transferable visual models from natural language supervision”. In: International Conference on Machine Learning. PMLR, 2021, pp. 8748-8763. url: https://arxiv.org/abs/2103.00020.
[13] Chen et al. “InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. url: https://arxiv.org/abs/2312.14238.