Hello, this is Aurélie, working as an Artificial Intelligence Engineer at Ridge-i. Today I would like to share some insights about fine-tuning vision language models (VLMs)!
- Introduction
- To fine-tune or not to fine-tune
- Which modules to keep frozen in a VLM when fine-tuning
- Internal experiments
- Small model + SFT vs. Large model + LoRA?
- Conclusion
- References
Introduction
As interest in vision language models (VLMs) grows, fine-tuning is increasingly seen as a promising way to adapt models for specific applications. For teams exploring this path, it’s important to take some time to assess whether fine-tuning is the most suitable solution and whether it’s likely to lead to meaningful improvement, given the high computational cost and the large amount of high-quality data required.
In this blog post, we share some examples of internal experiments where fine-tuning failed to improve our model’s capabilities and highlight some key challenges. We consider full supervised fine-tuning (SFT) and Low-Rank Adaptation (LoRA), which adds small trainable matrices to specific weights of a frozen model, making fine-tuning efficient with minimal changes.
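For intuition, here is a minimal NumPy sketch of the LoRA idea (shapes, rank, and initialization are illustrative only, not InternVL’s actual implementation): the pretrained weight W stays frozen, and only a low-rank update B·A is learned on top of it.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 16, 4

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable low-rank factor
B = np.zeros((d_out, r))                     # trainable, zero-initialized

def forward(x):
    # The effective weight is W + B @ A; since B starts at zero,
    # the adapted model initially behaves exactly like the pretrained one.
    return x @ (W + B @ A).T
```

With this toy sizing, only r · (d_in + d_out) = 128 parameters are trained instead of the 256 entries of W; at realistic model dimensions the ratio is far smaller.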
Note: All experiments used InternVL2.0 models [1] and were conducted in October 2024. Since then, InternVL3.0 has been released. All experiments were done using a single NVIDIA A100 GPU with 80GB of memory.
To fine-tune or not to fine-tune
Prior to the rise of LLMs, fine-tuning was commonly used for smaller-scale models (100M – 300M parameters). However, with the advent of larger models (> 1B parameters), the question of fine-tuning has become more nuanced.
State-of-the-art models are often released in multiple sizes, with larger models offering the best performance but also requiring significantly more resources and large amounts of high-quality data to fine-tune.
As a result, the first obstacle encountered by organizations looking to fine-tune a large model is the requirements in terms of computing infrastructure and volume of high-quality data.
Even if computing resources and access to data are not an issue, fine-tuning may not be suitable for all types of tasks. While it is useful for adapting the model’s output style, vocabulary, and tone, it is generally not recommended as a way to inject external knowledge [2]. This is because LLMs and VLMs are not designed to memorize and reliably retrieve highly specific facts (e.g., the temperature in a specific place on a specific day), and it is difficult to control or verify that the knowledge is accurately embedded into the weights during fine-tuning.
Therefore, it’s important to begin by identifying the source of the performance limitations and to assess whether fine-tuning is the most appropriate solution.
Finally, fine-tuning a large pretrained model with high capabilities always comes with some risk. Some issues like catastrophic forgetting, where the model loses previously acquired general knowledge, can lead to degraded performance and a reduced ability to follow instructions.
In this blog post, we share three quick fine-tuning experiments on different datasets that failed to improve performance, and we explore the underlying reasons.
Which modules to keep frozen in a VLM when fine-tuning
A VLM is usually composed of 3 main modules:
A vision encoder which extracts features from input images
A projector, which maps the visual features into the language model’s embedding space
A language model, which processes both the language tokens and the projected image features
When fine-tuning a pretrained model on a new domain-specific dataset, it is common to keep the vision model frozen. This is because the visual encoder is typically trained on large-scale image datasets (e.g., ImageNet [3], LAION [4], etc.) and already captures rich and generalizable visual features. Unless large amounts of training data are available and there is a significant domain gap between the target data and the pretraining data, we recommend keeping it frozen.
As a default, InternVL proposes to keep the vision encoder frozen and to fine-tune the projector and the language model [6]. For our experiments, we also keep the vision part frozen.
Below is a figure of a typical VLM architecture where the vision part is kept frozen for the fine-tuning step.

Alternatively, the projector can also be kept frozen during fine-tuning to help reduce the risk of overfitting when the pretraining and target domains are similar. It can be useful when working with small datasets as the projector is usually small (often just a single layer) which makes it prone to overfitting.
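A minimal PyTorch sketch of this freezing strategy, using toy stand-in modules (the module names and shapes are illustrative, not InternVL’s actual classes):

```python
import torch.nn as nn

class ToyVLM(nn.Module):
    """Toy stand-in for a VLM: vision encoder, projector, language model."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(32, 16)
        self.projector = nn.Linear(16, 8)
        self.language_model = nn.Linear(8, 8)

model = ToyVLM()

# Freeze the vision encoder before fine-tuning; the projector could be
# frozen the same way when working with small datasets.
for p in model.vision_encoder.parameters():
    p.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

Only the projector and language model parameters remain in `trainable`, so the optimizer never touches the vision weights.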
Internal experiments
To explore the limitations of fine-tuning methods like full SFT and LoRA, we conducted experiments on three vision-language challenges.
The DocVQA dataset evaluates content extraction from documents. Our results highlight the importance of identifying whether performance limitations originate from the language model or the vision encoder, since fine-tuning typically targets only the language component.
The AI2D task involves diagram question answering. Our experiments highlight the impact of format mismatch between training and test data, as well as the risk of catastrophic forgetting.
COCO Captions is a dataset designed for the image captioning task. Our experiments point out issues with metric interpretation and the effects of low-quality training data.
The case of DocVQA
DocVQA [7] is a multimodal open-source dataset for content extraction. It contains about 12k images and 50k high-level questions about the documents. The questions are relatively simple and aim to check whether the content was correctly extracted from the image.
Below is an example of a DocVQA image and associated questions.

This dataset is evaluated using the Average Normalized Levenshtein Similarity (ANLS) metric [8] which measures how similar a predicted answer is to the ground truth based on character-level edits. Higher scores mean more accurate text matching.
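As a reference, here is a small Python sketch of the ANLS computation as we understand it from [8] (the lowercasing and the 0.5 threshold follow the common convention; details may differ from the official evaluation script):

```python
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(predictions, ground_truths, tau=0.5):
    # For each question, take the best similarity over all reference answers;
    # similarities below the threshold tau are treated as 0.
    scores = []
    for pred, refs in zip(predictions, ground_truths):
        best = max(
            1 - levenshtein(pred.lower(), ref.lower()) / max(len(pred), len(ref), 1)
            for ref in refs
        )
        scores.append(best if best >= tau else 0.0)
    return sum(scores) / len(scores)
```

For example, predicting "pari" against the reference "paris" yields a similarity of 1 − 1/5 = 0.8, while an unrelated answer falls below the threshold and scores 0.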
In the table below, we compare the performance and resource usage of the pretrained models to our fine-tuned models using LoRA.
| Model | 1B | 2B | 4B | 8B | 26B |
|---|---|---|---|---|---|
| Architecture | internvl2_1b_qwen2_0_5b | internvl2_2b_internlm2_1_8b | internvl2_4b_phi3_3_8b | internvl2_8b_internlm2_7b | internvl2_26b_internlm2_20b |
| Pretrained ANLS (val) | 0.7883 | 0.8466 | 0.8699 | 0.8972 | 0.9058 |
| LoRA ANLS (val) | 0.7919 | 0.8405 | 0.8679 | 0.8899 | 0.8991 |
| Difference | +0.0036 | −0.0061 | −0.0020 | −0.0073 | −0.0067 |
| GPU Usage | 39711 MiB | 30963 MiB | 18037 MiB | 43029 MiB | 78877 MiB |
| Training Time | 43 min | 53 min | 1h31m | 2h19m | 6h47m |
Below are the results for the full SFT method.
| Model | 1B | 2B |
|---|---|---|
| Architecture | internvl2_1b_qwen2_0_5b | internvl2_2b_internlm2_1_8b |
| Pretrained ANLS (val) | 0.7883 | 0.8466 |
| Fine-tuned ANLS (val) | 0.7959 | 0.8202 |
| Difference | +0.0076 | −0.0264 |
| GPU Usage | 72423 MiB | 76663 MiB |
| Training Time | 34 min | 51 min |
Unfortunately, the proposed fine-tuning methods bring almost no change in performance. There are several possible explanations.
First, since both the dataset (text documents) and the task (text extraction, similar to OCR) are very general, the performance of the pretrained models is already quite good and there is no reason to believe that the pretrained models would be lacking as they have been trained on large amounts of similar data.
Secondly, this task relies heavily on challenging content extraction (some documents are handwritten or hard to read), while the language task (answering simple questions from the extracted text) is simpler. As a result, there is a discrepancy between what the model learns during fine-tuning (mainly language patterns) and what is actually being evaluated (successful visual content extraction).
Finally, since fine-tuning a VLM typically assumes the vision encoder remains frozen and only the language component is updated, performance won't improve when the image model is the true bottleneck.
The case of AI2D
The AI2D dataset [9] contains over 5,000 grade school science diagrams and 15,000 multiple-choice questions for research on diagram understanding and question answering.

First, we try full SFT using the default parameters.
| Model | 1B |
|---|---|
| Architecture | internvl2_1b_qwen2_0_5b |
| Learning rate | 4e-5 |
| Pretrained ANLS (test) | 0.644 |
| Fine-tuned ANLS (test) | 0.0 |
Unfortunately, the test accuracy drops to 0.
While looking at the actual test results, we noticed a reduced capability to follow instructions by the fine-tuned model. Even when explicitly prompted to answer only with the correct option’s letter, it often returned the content of the option instead.
{ "question": "What is between the head and abdomen?\nA. Antenna\nB. Simple eye\nC. Spiracle\nD. Thorax\nAnswer with the option's letter from the given choices directly.", "image": "345802", "answer": "Thorax", "annotation": "D" }
Example of a fine-tuned model output
Although the answer is correct, the model does not follow the expected format
The main explanation for this issue is that the training and test sets use different formats.
{ "id": 1, "image": "images/7.png", "conversations": [{"from": "human", "value": "<image>\nWhich plant has leaves modified into spikes?Smilax\nBanayan tree\nUtricularia\nCactus Please answer the question based on the options mentioned before."}, {"from": "gpt", "value": "Cactus"}] }
Example of a training sample
The model is expected to return the text of the correct answer directly (e.g., "Cactus"), not the option number or letter
{ "id": 345802, "image": "data/ai2diagram/AI2D_TEST/345802.jpg", "question": "What is between the head and abdomen?\nA. Antenna\nB. Simple eye\nC. Spiracle\nD. Thorax\nAnswer with the option's letter from the given choices directly.", "question_id": "345802", "answer": "D", "category": "partsOfA", "abcLabel": "False" }
Example of a test sample
The model is expected to answer with the letter corresponding to the correct answer (e.g. "D")
The dataset was downloaded from the InternVL documentation and comes preprocessed for their format, which may differ from versions available on other platforms like HuggingFace
This suggests that the model, having been trained on data where answers are provided as full text, has learned to consistently respond with the answer content rather than the option letter, even when explicitly instructed otherwise.
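One way to avoid this mismatch is to convert the training samples to the test format before fine-tuning. Below is a hypothetical helper (`make_letter_sample` is our own name, not part of the InternVL tooling) that rewrites a question with free-text options into the letter-answer format used at test time:

```python
import string

def make_letter_sample(question: str, options: list, answer: str) -> dict:
    """Label the options A, B, C, ... and replace the free-text target with
    the letter of the matching option, mirroring the test-set format."""
    letters = string.ascii_uppercase
    lines = [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    prompt = (
        question + "\n" + "\n".join(lines)
        + "\nAnswer with the option's letter from the given choices directly."
    )
    return {"question": prompt, "answer": letters[options.index(answer)]}
```

Applied to the training sample above, the target "Cactus" would become "D", matching what the test set expects.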
To prevent the model from overfitting on the training data’s format, we reduce the learning rate and repeat the experiments with LoRA and full SFT.
| Model | 1B |
|---|---|
| Architecture | internvl2_1b_qwen2_0_5b |
| Pretrained ANLS (test) | 0.644 |
| LoRA ANLS (test) | 0.615 |
| Full SFT ANLS (test) | 0.639 (lr = 1e-6) |
This time, we confirm that the model retains its ability to follow the format instruction. However, the performance drops for both methods using LoRA (0.644 → 0.615) and full SFT (0.644 → 0.639).
As we suspect that one reason for the reduced performance is catastrophic forgetting, we repeated the experiment while including part of the general-domain data originally used to train the pretrained weights.
This strategy to limit catastrophic forgetting is widely used when fine-tuning large models (e.g. Swallow [10], InternVL [1], etc.) on domain-specific data, with the goal of enhancing downstream capabilities while retaining the foundational skills [6].
| Model | 1B |
|---|---|
| Architecture | internvl2_1b_qwen2_0_5b |
| LoRA ANLS (test), Train: AI2D + General data | 0.626 |
| Training Time | 26h13min |
Although using additional general data during fine-tuning seems to limit catastrophic forgetting, the performance of the fine-tuned model is still worse than that of the pretrained model. More advanced mitigation techniques exist, but they require significant changes to the training pipeline and a lot of setup effort.
In conclusion, when fine-tuning a model on domain-specific datasets, it is crucial to ensure that the training data is of high quality and closely matches the format and expectations of the test set. Additionally, it is important to consider methods to limit catastrophic forgetting which inevitably make the training process more complex and more time and resource intensive.
The case of COCO Captions
The COCO Captions dataset [11] contains over one and a half million captions describing over 330,000 images. For each image of the training and validation sets, five independent crowdsourced captions are provided.

5 crowd-sourced captions are provided for each sample
First, we confirm that we can reproduce the InternVL authors’ LoRA results. We also include the full SFT results, which are not provided by the authors. The metrics are the same as those used by InternVL.
| Metric | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | METEOR | Rouge | CIDEr |
|---|---|---|---|---|---|---|---|
| Baseline (reproduced by RI) | 0.640 | 0.463 | 0.321 | 0.214 | 0.267 | 0.504 | 0.793 |
| LoRA (by InternVL authors) | 0.805 | 0.649 | 0.504 | 0.385 | 0.300 | 0.595 | 1.312 |
| LoRA (by RI) | 0.804 | 0.649 | 0.501 | 0.382 | 0.299 | 0.594 | 1.305 |
| SFT (by RI) | 0.806 | 0.652 | 0.508 | 0.392 | 0.305 | 0.601 | 1.339 |
In terms of the metrics, we can clearly see a significant improvement for the fine-tuned models compared to the pretrained model.
However, can we truly say that the model performance has improved?
First, it is important to understand the BLEU (BiLingual Evaluation Understudy) metric [12]. This metric was originally used to assess the quality of a translated text by comparing the machine translation to a ground-truth, human-made translation.
According to Google AutoML documentation [13], there are several points to be careful about when using BLEU.
BLEU is a corpus-based metric. It performs badly when used to evaluate individual sentences. It is mainly used to compare whether two corpora are similar (length, vocabulary, etc.).
There is no distinction between content and function words. A dropped function word like "a" gets the same penalty as if the name "NASA" was erroneously replaced with "ESA".
It is not good at capturing the meaning and grammaticality of a sentence. Dropping a single word like "not" can change the meaning of a sentence. However, BLEU only imposes a small penalty since it treats all words equally and considers it just a one-word difference.
As a result, it is possible to achieve a very high BLEU score using sentences that don’t make sense or are opposite in meaning. For example, the following pair of sentences has a score of 0.8:
Reference: the cat is on the mat
Candidate: the the the cat mat
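One way to see where such a high score comes from is the clipped (modified) unigram precision used inside BLEU: each candidate word counts at most as many times as it appears in the reference. A small Python sketch:

```python
from collections import Counter

def clipped_unigram_precision(candidate: str, reference: str) -> float:
    # BLEU-style modified precision: each candidate word is counted at most
    # as many times as it occurs in the reference.
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    clipped = sum(min(n, ref[w]) for w, n in cand.items())
    return clipped / sum(cand.values())

score = clipped_unigram_precision("the the the cat mat",
                                  "the cat is on the mat")  # (2 + 1 + 1) / 5 = 0.8
```

For the pair above, "the" is clipped from 3 to 2 occurrences, giving 4/5 = 0.8 even though the candidate is ungrammatical and nearly meaningless.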
In our case, the ground-truth annotations used as reference for BLEU are crowdsourced. Therefore, the descriptions are usually short and may have issues such as typos or grammatical errors.
Below are the five crowdsourced annotations for one image, with our comments:

- "some dessert is laying out on a yellow and white plate" → punctuation and capitalization issues
- "A plate containing a slice of dessert, two forks and some piped cream"
- "Pastry sitting on top of a golden white plate with forks." → not detailed enough
- "Two forks on a plate of cake and cream." → not detailed enough
- "THIS IS A PHOTO OF A DESERT PLATE FOR TWO" → all caps, includes a typo ("dessert" → "desert"), and not detailed enough
Investigating the outputs of the fine-tuned models, their answers do seem more similar to the training data. In several cases, the fine-tuned models produce short descriptions that are not very detailed, similar to the crowdsourced annotations, as shown in the table below.
| Input image | Pretrained 2B | SFT 2B | LoRA 8B |
|---|---|---|---|
| | A dessert plate with a slice of cake, two scoops of ice cream, and a spoon. | A piece of cake with whipped cream and chocolate sauce. → best, according to human evaluation | A plate of food with a fork on it. |
| | A man in a cowboy hat rides a horse down a street, with people watching. → best, according to human evaluation | A man riding a horse down a street next to a crowd of people. | A man riding a horse down a street. |
| | A bed with colorful bedding and pillows is set up in a room covered with blue plastic sheets. → best, according to human evaluation | A bed in a room with blue curtains and a blue sheet. | A bed with a blue cover and a blue curtain. |
| | A collection of colorful street art is displayed on a wooden fence, with a stop sign and a cityscape illustration. → best, according to human evaluation | A row of paintings and a stop sign on a fence. | A bunch of paintings are leaning against a fence. |
In this case, neither LoRA nor full SFT significantly improve performance according to human evaluation, even when using larger architectures (e.g., LoRA 8B).
Note: We were unable to include SFT results for 8B due to GPU memory constraints.
Results suggest that the fine-tuned models have indeed adapted to the training data distribution. However, since the training data quality is poor to begin with (typos, grammatical errors, and generally short or inaccurate descriptions), the actual performance of the model has degraded. Evaluation metrics may appear high only because the model's outputs are more similar to the low-quality ground-truth annotations.
Small model + SFT vs. Large model + LoRA?
In this section, we briefly discuss the trade-offs between using a small model with full SFT versus a larger model with lightweight adaptation methods such as LoRA.
Full fine-tuning, which updates all the model's parameters, requires significant computing and memory resources. In comparison, LoRA adds small trainable adapter layers to the model while keeping the original weights frozen, making it faster and more affordable to train.
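Back-of-the-envelope arithmetic makes the difference concrete. Counting only the four attention projection matrices per layer (a simplification; real models also train MLP blocks, embeddings, etc., and the sizes below are illustrative, not InternVL's exact configuration):

```python
def full_vs_lora_params(d_model: int, n_layers: int, r: int = 8):
    # Full SFT updates each d_model x d_model projection (q, k, v, o per layer);
    # LoRA instead trains an A (r x d_model) and a B (d_model x r) per matrix.
    full = n_layers * 4 * d_model * d_model
    lora = n_layers * 4 * 2 * r * d_model
    return full, lora

full, lora = full_vs_lora_params(d_model=2048, n_layers=24, r=8)
ratio = lora / full  # equals 2 * r / d_model, i.e. well under 1% of the weights
```

Since the ratio is 2r/d_model, the larger the model, the smaller the fraction of weights LoRA needs to train (and the fewer optimizer states and gradients need to fit in GPU memory).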
Below is a comparison table of our GPU usage during our experiments.
| Model size | 1B | 2B | 4B | 8B | 26B | 40B | 76B |
|---|---|---|---|---|---|---|---|
| Inference (MiB) | 6167 | 8361 | 12497 | 21745 | 56609 | 81145 | Out of memory |
| LoRA (MiB) | 39711 | 30963 | 18037 | 43029 | 78877 | Out of memory | Out of memory |
| Full SFT (MiB) | 72423 | 76663 | Out of memory | Out of memory | Out of memory | Out of memory | Out of memory |
As can be seen in the table, the full fine-tuning process is very resource intensive and our 80GB GPU can only support fine-tuning up to the 2B model.
When hardware is limited, it's important to weigh trade-offs between using a large pretrained model as it is, fine-tuning a mid-sized model with LoRA, or applying full SFT to a smaller model.
The largest InternVL2.0 model that fits on our A100 80GB is 26B for LoRA and 2B for full SFT. For the sake of this experiment, we compare the performance of LoRA 8B with SFT 2B on the COCO Captions dataset [11].
| Metric | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | METEOR | Rouge | CIDEr |
|---|---|---|---|---|---|---|---|
| Baseline 2B (by RI) | 0.640 | 0.463 | 0.321 | 0.214 | 0.267 | 0.504 | 0.793 |
| Baseline 8B (by RI) | 0.660 | 0.491 | 0.351 | 0.245 | 0.285 | 0.530 | 0.892 |
| LoRA 8B (by RI) | 0.814 | 0.660 | 0.517 | 0.398 | 0.305 | 0.604 | 1.358 |
| SFT 2B (by RI) | 0.806 | 0.652 | 0.508 | 0.392 | 0.305 | 0.601 | 1.339 |
Despite requiring less GPU memory, LoRA 8B still outperforms SFT 2B at adapting to the target dataset.
If resources are limited and fine-tuning is necessary, it may be more practical to use a lightweight fine-tuning method that allows us to use larger pretrained architectures within the same resource constraints.
Although we did not explore it in this study, even larger pretrained models can be used with methods that do not require updating model weights. For example, in-context learning adds a few examples to the prompt to guide the model toward the desired output format and tone.
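As a sketch, a few-shot prompt can be assembled with plain string formatting (the Q/A template here is an arbitrary choice for illustration, not a format any specific model requires):

```python
def build_few_shot_prompt(examples, query):
    """In-context learning: prepend a few input/output pairs so the frozen
    model imitates the demonstrated format at inference time."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    [("Describe the image in one sentence.", "A cat sleeping on a red sofa.")],
    "Describe the image in one sentence.",
)
```

The model weights are untouched, so there is no risk of catastrophic forgetting, at the cost of a longer prompt at every inference call.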
Conclusion
In this blog post, we explored some key challenges of fine-tuning VLMs.
Fine-tuning is often seen as the default choice to improve model performance. However, this assumption can be misleading, and fine-tuning is frequently more complex and resource-intensive than one might expect.
There are many reasons why fine-tuning can fail to bring improvement.
The choice of fine-tuning may not be appropriate
Fine-tuning is not appropriate for learning external knowledge without a tremendous amount of data.
The bottleneck is the image part of the architecture rather than the language part, while fine-tuning typically updates only the language side.
The task is very general (e.g. captioning, summarization, etc.) and there is no reason to think that the pretrained model’s training is insufficient.
The training data is not prepared carefully
The training data is too different from the test data and is not appropriate for the intended use of the model.
The training data quality is poor and is not able to properly teach the model. The purpose of fine-tuning is not to "improve" the model but to align it more closely with the distribution of the training data.
The training process is difficult
General knowledge is lost due to catastrophic forgetting.
The ability to follow instructions or answer questions is lost despite the model successfully adapting to the new data.
Given the high cost and the significant risk of degraded performance, we strongly recommend carefully evaluating whether alternative approaches (such as prompt tuning, retrieval-augmented generation (RAG), etc.) may be more suitable and whether basic requirements (amount and quality of the available data) are met.
References
[1] Chen, Zhe, et al. "Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024. url: https://arxiv.org/abs/2312.14238.
[2] Aditya Jain. “To fine-tune or not to fine-tune.” 2024. url: https://ai.meta.com/blog/when-to-fine-tune-llms-vs-other-techniques/.
[3] Russakovsky, Olga, et al. "Imagenet large scale visual recognition challenge." International journal of computer vision. 2015. url: https://arxiv.org/abs/1409.0575.
[4] Schuhmann, Christoph, et al. "Laion-5b: An open large-scale dataset for training next generation image-text models." Advances in neural information processing systems. 2022. url: https://arxiv.org/abs/2210.08402.
[5] Hugging Face Blog. "Vision Language Models Explained." 2025. url: https://github.com/huggingface/blog/blob/main/vlms.md.
[6] InternVL Authors. "Fine-tune on a Custom Dataset." 2025. url: https://internvl.readthedocs.io/en/latest/internvl2.0/finetune.html.
[7] Mathew, Minesh, Dimosthenis Karatzas, and C. V. Jawahar. "Docvqa: A dataset for vqa on document images." Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2021. url: https://arxiv.org/abs/2007.00398.
[8] Biten, Ali Furkan, et al. "Scene text visual question answering." Proceedings of the IEEE/CVF international conference on computer vision. 2019. url: https://arxiv.org/abs/1905.13648.
[9] Kembhavi, Aniruddha, et al. "A diagram is worth a dozen images." Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer International Publishing, 2016. url: https://arxiv.org/abs/1603.07396.
[10] Fujii, Kazuki, et al. "Continual pre-training for cross-lingual llm adaptation: Enhancing japanese language capabilities." arXiv preprint arXiv:2404.17790 (2024). url: https://arxiv.org/abs/2404.17790.
[11] Chen, Xinlei, et al. "Microsoft coco captions: Data collection and evaluation server." arXiv preprint arXiv:1504.00325 (2015). url: https://arxiv.org/abs/1504.00325.
[12] Papineni, Kishore, et al. "Bleu: a method for automatic evaluation of machine translation." Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 2002. url: https://aclanthology.org/P02-1040.pdf.
[13] Google Cloud documentation. "Understanding the BLEU score." 2025. url: https://cloud.google.com/translate/docs/advanced/automl-evaluate#bleu.