Investigating fine-tuning limitations for VLMs with three case studies

Hello, this is Aurélie, working as an Artificial Intelligence Engineer at Ridge-i. Today I would like to share some insights about fine-tuning vision language models (VLMs)! 

Introduction

As interest in vision language models (VLMs) grows, fine-tuning is increasingly seen as a promising way to adapt models for specific applications. For teams exploring this path, it’s important to take some time to assess whether fine-tuning is the most suitable solution and whether it’s likely to lead to meaningful improvement, given the high computational cost and the large amounts of data required.

In this blog post, we share some examples of internal experiments where fine-tuning failed to improve our model’s capabilities and highlight some key challenges. We consider full supervised fine-tuning (SFT) and Low-Rank Adaptation (LoRA), which adds small trainable matrices to specific weights of a frozen model, making fine-tuning efficient with minimal changes to the original parameters.
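To make the LoRA idea concrete, here is a minimal sketch of a LoRA-style linear layer in PyTorch. This is an illustrative toy (the layer sizes, rank, and scaling below are arbitrary assumptions), not the implementation used in our experiments:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update: W x + (B A) x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # original weights stay frozen
        # Small trainable matrices: A projects down to `rank`, B projects back up.
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(512, 512), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
```

With a 512×512 layer and rank 8, only 8,192 of roughly 271k parameters are trainable, which is why LoRA fits on much smaller GPUs than full fine-tuning.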

Note: All experiments used InternVL2.0 models [1], were conducted in October 2024 (InternVL3.0 has since been released), and ran on a single NVIDIA A100 GPU with 80GB of memory.

To fine-tune or not to fine-tune

Prior to the rise of LLMs, fine-tuning was commonly used for smaller-scale models (100M – 300M parameters). However, with the advent of larger models (> 1B parameters), the question of fine-tuning has become more nuanced.

Quote from a Meta blog post: "To fine-tune or not to fine-tune" [2]

State-of-the-art models are often released in multiple sizes, with larger models offering the best performance but also requiring significantly more resources and large amounts of high-quality data to fine-tune.

As a result, the first obstacle encountered by organizations looking to fine-tune a large model is the requirements in terms of computing infrastructure and volume of high-quality data.

Even if computing resources and access to data are not an issue, fine-tuning may not be suitable for all types of tasks. While it is useful for adapting the model’s output style, vocabulary, and tone, it is generally not recommended for injecting external knowledge [2]. This is because LLMs and VLMs are not designed to memorize and reliably retrieve highly specific facts (e.g. the temperature in a specific place on a specific day), and it is difficult to control or verify that the knowledge gets accurately embedded into the weights during fine-tuning.

Therefore, it’s important to begin by identifying the source of the performance limitations and to assess whether fine-tuning is the most appropriate solution.

Finally, fine-tuning a large pretrained model with high capabilities always comes with some risk. Some issues like catastrophic forgetting, where the model loses previously acquired general knowledge, can lead to degraded performance and a reduced ability to follow instructions.

In this blog post, we share three quick fine-tuning experiments on different datasets that failed to improve performance, and we explore the underlying reasons.

Which modules to keep frozen in a VLM when fine-tuning

A VLM is usually composed of 3 main modules:

  • A vision encoder which extracts features from input images

  • A projector, which maps the visual features into the language model’s embedding space

  • A language model, which processes both the language tokens and the projected image features

When fine-tuning a pretrained model on a new domain-specific dataset, it is common to keep the vision model frozen. This is because the visual encoder is typically trained on large-scale image datasets (e.g., ImageNet [3], LAION [4], etc.) and already captures rich and generalizable visual features. Unless large amounts of training data are available and there is a significant domain gap between the target data and the pretraining data, we recommend keeping it frozen.

As a default, InternVL proposes to keep the vision encoder frozen and to fine-tune the projector and the language model [6]. For our experiments, we also keep the vision part frozen.

Below is a figure of a typical VLM architecture where the vision part is kept frozen for the fine-tuning step.

The vision encoder is kept frozen in a typical VLM architecture [5]

Alternatively, the projector can also be kept frozen during fine-tuning to help reduce the risk of overfitting when the pretraining and target domains are similar. It can be useful when working with small datasets as the projector is usually small (often just a single layer) which makes it prone to overfitting.
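As a sketch of what this looks like in practice (using a hypothetical toy model, since module names vary between VLM implementations), freezing simply means disabling gradients for the chosen parameters:

```python
import torch.nn as nn

def freeze_module(module: nn.Module) -> None:
    """Disable gradient updates for every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = False

# Hypothetical stand-in for the three VLM modules described above.
class TinyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(768, 256)   # frozen in our setup
        self.projector = nn.Linear(256, 128)        # trainable (optionally frozen for small datasets)
        self.language_model = nn.Linear(128, 128)   # trainable

model = TinyVLM()
freeze_module(model.vision_encoder)

trainable = {name for name, p in model.named_parameters() if p.requires_grad}
```

The optimizer then only receives the parameters that still require gradients, so the frozen vision encoder is never updated.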

Internal experiments

To explore the limitations of fine-tuning methods like full SFT and LoRA, we conducted experiments on three vision-language challenges.

  • The DocVQA dataset evaluates content extraction from documents. Our results highlight the importance of identifying whether performance limitations originate from the language model or the vision encoder, since fine-tuning typically targets only the language component.

  • The AI2D task involves diagram question answering. Our experiments highlight the impact of format mismatch between training and test data, as well as the risk of catastrophic forgetting.

  • COCO Captions is a dataset designed for the image captioning task. Our experiments point out issues with metric interpretation and the effects of low-quality training data.

The case of DocVQA

DocVQA [7] is a multimodal open-source dataset for content extraction. It contains about 12k images and 50k high-level questions about the documents. The questions are relatively simple and aim to check whether the content was correctly extracted from the image.

Below is an example of a DocVQA image and associated questions.

Example of a DocVQA sample

This dataset is evaluated using the Average Normalized Levenshtein Similarity (ANLS) metric [8] which measures how similar a predicted answer is to the ground truth based on character-level edits. Higher scores mean more accurate text matching.
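As a rough sketch of how the metric works (simplified from Biten et al. [8]; the 0.5 threshold and lowercasing follow the original formulation, but this is not the official evaluation script):

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def anls(predictions, ground_truths, threshold=0.5):
    """Average Normalized Levenshtein Similarity.
    Each prediction is scored against all reference answers for its question;
    similarities below the threshold count as 0."""
    scores = []
    for pred, gts in zip(predictions, ground_truths):
        best = max(
            1 - levenshtein(pred.lower(), gt.lower()) / max(len(pred), len(gt), 1)
            for gt in gts
        )
        scores.append(best if best >= threshold else 0.0)
    return sum(scores) / len(scores)
```

The threshold means answers that are mostly wrong contribute nothing, while near-misses (e.g. a single-character OCR error) still earn partial credit.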

In the table below, we compare the performance and resource usage of the pretrained models to our fine-tuned models using LoRA.

With LoRA using DocVQA 10k training set
Results are for the validation set
Default parameters: Batch = 16, Per_device = 1
| Model | 1B | 2B | 4B | 8B | 26B |
| --- | --- | --- | --- | --- | --- |
| Architecture | internvl2_1b_qwen2_0_5b | internvl2_2b_internlm2_1_8b | internvl2_4b_phi3_3_8b | internvl2_8b_internlm2_7b | internvl2_26b_internlm2_20b |
| Pretrained ANLS (val) | 0.7883 | 0.8466 | 0.8699 | 0.8972 | 0.9058 |
| LoRA ANLS (val) | 0.7919 | 0.8405 | 0.8679 | 0.8899 | 0.8991 |
| Difference | +0.0036 | −0.0061 | −0.0020 | −0.0073 | −0.0067 |
| GPU Usage | 39711 MiB | 30963 MiB | 18037 MiB | 43029 MiB | 78877 MiB |
| Training Time | 43 min | 53 min | 1h31m | 2h19m | 6h47m |

Below are the results for the full SFT method.

Full SFT results on DocVQA 10k train set
We use default parameters with Batch = 128, Per_device = 4
| Model | 1B | 2B |
| --- | --- | --- |
| Architecture | internvl2_1b_qwen2_0_5b | internvl2_2b_internlm2_1_8b |
| Pretrained ANLS (val) | 0.7883 | 0.8466 |
| Fine-tuned ANLS (val) | 0.7959 | 0.8202 |
| Difference | +0.0076 | −0.0264 |
| GPU Usage | 72423 MiB | 76663 MiB |
| Training Time | 34 min | 51 min |

Unfortunately, there is almost no change in performance with either fine-tuning method. There are several possible explanations.

First, since both the dataset (text documents) and the task (text extraction, similar to OCR) are very general, the performance of the pretrained models is already quite good and there is no reason to believe that the pretrained models would be lacking as they have been trained on large amounts of similar data.

Secondly, this task relies heavily on challenging content extraction (some documents are handwritten or hard to read), while the language task (answering simple questions from the extracted text) is comparatively simple. As a result, there is a discrepancy between what the model learns during fine-tuning (mainly language patterns) and what is actually being evaluated (successful visual content extraction).

Finally, since fine-tuning a VLM typically assumes the vision encoder remains frozen and only the language component is updated, performance won't improve when the image model is the true bottleneck.

The case of AI2D

The AI2D dataset [9] contains over 5,000 grade school science diagrams and 15,000 multiple-choice questions for research on diagram understanding and question answering.

Example of AI2D sample image and question

First, we try full SFT using the default parameters.

LoRA and full SFT results on AI2D
| Model | 1B |
| --- | --- |
| Architecture | internvl2_1b_qwen2_0_5b |
| Learning rate | 4e-5 |
| Pretrained ANLS (test) | 0.644 |
| Fine-tuned ANLS (test) | 0.0 |

Unfortunately, the test ANLS drops to 0.

While looking at the actual test results, we noticed a reduced capability to follow instructions by the fine-tuned model. Even when explicitly prompted to answer only with the correct option’s letter, it often returned the content of the option instead.

{
  "question": "What is between the head and abdomen?\nA. Antenna\nB. Simple eye\nC. Spiracle\nD. Thorax\nAnswer with the option's letter from the given choices directly.",
  "image": "345802",
  "answer": "Thorax",
  "annotation": "D"
}

Example of a fine-tuned model output
Although the answer is correct, the model does not follow the expected format

The main explanation for this issue comes from the training and test sets using different formats.

{
"id": 1, 
"image": "images/7.png", 
"conversations": 
    [{"from": "human", "value": "<image>\nWhich plant has leaves modified into spikes?Smilax\nBanayan tree\nUtricularia\nCactus Please answer the question based on the options mentioned before."}, 
    {"from": "gpt", "value": "Cactus"}]
}

Example of a training sample
The model is expected to return the text of the correct answer directly (e.g., "Cactus"), not the option number or letter

{
"id": 345802, 
"image": "data/ai2diagram/AI2D_TEST/345802.jpg", 
"question": "What is between the head and abdomen?\nA. Antenna\nB. Simple eye\nC. Spiracle\nD. Thorax\nAnswer with the option's letter from the given choices directly.", 
"question_id": "345802", 
"answer": "D", 
"category": "partsOfA", 
"abcLabel": "False"
}

Example of a test sample
The model is expected to answer with the letter corresponding to the correct answer (e.g. "D")
The dataset was downloaded from the InternVL documentation and comes preprocessed for their format, which may differ from versions available on other platforms like HuggingFace

This suggests that the model, having been trained on data where answers are provided as full text, has learned to consistently respond with the answer text rather than the option letter, even when explicitly asked for the letter.

To prevent the model from overfitting on the training data’s format, we reduce the learning rate and repeat the experiments with LoRA and full SFT.

LoRA and full SFT results on AI2D with a reduced lr
| Model | 1B |
| --- | --- |
| Architecture | internvl2_1b_qwen2_0_5b |
| Pretrained ANLS (test) | 0.644 |
| LoRA ANLS (test) | 0.615 |
| Full SFT ANLS (test, lr = 1e-6) | 0.639 |

This time, we confirm that the model retains its ability to follow the format instruction. However, performance drops for both LoRA (0.644 → 0.615) and full SFT (0.644 → 0.639).

As we suspect that one reason for the reduced performance is catastrophic forgetting, we repeated the experiment while including part of the general-domain data originally used to train the pretrained weights.

This strategy to limit catastrophic forgetting is widely used when fine-tuning large models (e.g. Swallow [10], InternVL [1], etc.) on domain-specific data, with the goal of enhancing downstream capabilities while retaining the foundational skills [6].

LoRA results on AI2D when using additional general data
| Model | 1B |
| --- | --- |
| Architecture | internvl2_1b_qwen2_0_5b |
| LoRA ANLS (test), trained on AI2D + general data | 0.626 |
| Training Time | 26h13min |

Although using additional general data during fine-tuning seems to limit catastrophic forgetting, the performance of the fine-tuned model is still worse than that of the pretrained model. More advanced mitigation techniques exist, but they require significant changes to the training pipeline and a lot of setup effort.

In conclusion, when fine-tuning a model on domain-specific datasets, it is crucial to ensure that the training data is of high quality and closely matches the format and expectations of the test set. Additionally, it is important to consider methods to limit catastrophic forgetting which inevitably make the training process more complex and more time and resource intensive.

The case of COCO Captions

The COCO Captions dataset [11] contains over one and a half million captions describing over 330,000 images. For each image of the training and validation sets, five independent crowdsourced captions are provided.

COCO dataset example for image captioning [11]
5 crowd-sourced captions are provided for each sample

First, we confirm that we can reproduce the InternVL authors’ LoRA results. We also include full SFT results, which are not provided by the authors. The metrics are the same as those used by InternVL.

Comparison of LoRA on InternVL2.0-2B to baseline and authors’ results
Baseline = pretrained 2B model
| Metric | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | METEOR | Rouge | CIDEr |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline (reproduced by RI) | 0.640 | 0.463 | 0.321 | 0.214 | 0.267 | 0.504 | 0.793 |
| LoRA (by InternVL authors) | 0.805 | 0.649 | 0.504 | 0.385 | 0.300 | 0.595 | 1.312 |
| LoRA (by RI) | 0.804 | 0.649 | 0.501 | 0.382 | 0.299 | 0.594 | 1.305 |
| SFT (by RI) | 0.806 | 0.652 | 0.508 | 0.392 | 0.305 | 0.601 | 1.339 |

In terms of the metrics, we can clearly see a significant improvement for the fine-tuned models compared to the pretrained model.

However, can we truly say that the model performance has improved?

First, it is important to understand the BLEU (BiLingual Evaluation Understudy) metric [12]. This metric was originally designed to assess the quality of machine translation by comparing it to a human-made reference translation.

According to Google AutoML documentation [13], there are several points to be careful about when using BLEU.

  • BLEU is a corpus-based metric. It performs badly when used to evaluate individual sentences; it is mainly used to compare whether two corpora are similar (length, vocabulary, etc.).

  • There is no distinction between content and function words. A dropped function word like "a" gets the same penalty as if the name "NASA" were erroneously replaced with "ESA".

  • It is not good at capturing the meaning and grammaticality of a sentence. Dropping a single word like "not" can change the meaning of a sentence, yet BLEU imposes only a small penalty since it treats all words equally and counts it as just one word difference.

As a result, it is possible to achieve a very high BLEU score using sentences that don’t make sense or are opposite in meaning. For example, the following pair of sentences has a score of 0.8:

Reference: the cat is on the mat

Candidate: the the the cat mat
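The 0.8 above corresponds to BLEU’s clipped (modified) unigram precision. Here is a minimal sketch, ignoring the brevity penalty and the higher-order n-grams that full BLEU also combines:

```python
from collections import Counter

def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """BLEU-1-style modified precision: each candidate word is credited
    at most as many times as it appears in the reference."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return clipped / sum(cand_counts.values())

# "the" is clipped to 2 matches; "cat" and "mat" each match once: 4/5 = 0.8
score = clipped_unigram_precision("the the the cat mat", "the cat is on the mat")
```

Despite the candidate being nonsense, four of its five words are credited as matches, which is exactly why BLEU can reward degenerate outputs.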

In our case, the ground-truth annotations used as reference for BLEU are crowdsourced. Therefore, the descriptions are usually short and may have issues such as typos or grammatical errors.

Example of crowdsourced annotations for one image:

  • "some dessert is laying out on a yellow and white plate" → punctuation and capitalization issues

  • "A plate containing a slice of dessert, two forks and some piped cream"

  • "Pastry sitting on top of a golden white plate with forks." → not detailed enough

  • "Two forks on a plate of cake and cream." → not detailed enough

  • "THIS IS A PHOTO OF A DESERT PLATE FOR TWO" → all caps, includes the typo "desert" for "dessert", and not detailed enough

After investigating the results of the fine-tuned model, it does seem like the fine-tuned models' answers are more similar to the training data. For several cases, the fine-tuned models end up producing short descriptions that are not very detailed, similarly to the crowdsourced annotations, as shown in the table below.

Comparison of Pretrained 2B, SFT-2B and LoRA-8B models on a few examples
| Input image | Pretrained 2B | SFT 2B | LoRA 8B |
| --- | --- | --- | --- |
| (image) | A dessert plate with a slice of cake, two scoops of ice cream, and a spoon. | A piece of cake with whipped cream and chocolate sauce. (best, according to human evaluation) | A plate of food with a fork on it. |
| (image) | A man in a cowboy hat rides a horse down a street, with people watching. (best, according to human evaluation) | A man riding a horse down a street next to a crowd of people. | A man riding a horse down a street. |
| (image) | A bed with colorful bedding and pillows is set up in a room covered with blue plastic sheets. (best, according to human evaluation) | A bed in a room with blue curtains and a blue sheet. | A bed with a blue cover and a blue curtain. |
| (image) | A collection of colorful street art is displayed on a wooden fence, with a stop sign and a cityscape illustration. (best, according to human evaluation) | A row of paintings and a stop sign on a fence. | A bunch of paintings are leaning against a fence. |

In this case, neither LoRA nor full SFT significantly improve performance according to human evaluation, even when using larger architectures (e.g., LoRA 8B). 

Note: We were unable to include SFT results for 8B due to GPU memory constraints.

Results suggest that the fine-tuned models have indeed adapted to the training data distribution. However, since the training data quality is poor to begin with (typos, grammatical errors, and generally short or inaccurate descriptions), the actual performance of the model has degraded. Evaluation metrics may appear high only because the model's outputs are more similar to the low-quality ground-truth annotations.

Small model + SFT vs. Large model + LoRA?

In this section, we briefly discuss the trade-offs between using a small model with full SFT versus a larger model with lightweight adaptation methods such as LoRA.

Full fine-tuning, which updates all the model's parameters, requires significant computing and memory resources. In comparison, LoRA adds small trainable adapter layers to the model while keeping the original weights frozen, making it faster and more affordable to train.

Below is a comparison table of our GPU usage during our experiments.

Comparison of the GPU usage (MiB) on a NVIDIA A100
Note: Through optimization techniques and fragmentation strategies, it is possible that bigger models take less space in memory than some smaller models
| Model size | 1B | 2B | 4B | 8B | 26B | 40B | 76B |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Inference | 6167 | 8361 | 12497 | 21745 | 56609 | 81145 | Out of memory |
| LoRA | 39711 | 30963 | 18037 | 43029 | 78877 | Out of memory | Out of memory |
| Full SFT | 72423 | 76663 | Out of memory | Out of memory | Out of memory | Out of memory | Out of memory |

As can be seen in the table, the full fine-tuning process is very resource intensive and our 80GB GPU can only support fine-tuning up to the 2B model.

When hardware is limited, it's important to weigh trade-offs between using a large pretrained model as it is, fine-tuning a mid-sized model with LoRA, or applying full SFT to a smaller model.

The largest InternVL2.0 model that fits on our A100 80GB is 26B for LoRA and 2B for full SFT. For the sake of this experiment, we compare the performance of LoRA-8B with SFT-2B on the COCO Captions dataset [11].

Comparison of LoRA-8B and SFT-2B with InternVL2.0 on COCO Captions dataset
| Metric | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | METEOR | Rouge | CIDEr |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline 2B (by RI) | 0.640 | 0.463 | 0.321 | 0.214 | 0.267 | 0.504 | 0.793 |
| Baseline 8B (by RI) | 0.660 | 0.491 | 0.351 | 0.245 | 0.285 | 0.530 | 0.892 |
| LoRA 8B (by RI) | 0.814 | 0.660 | 0.517 | 0.398 | 0.305 | 0.604 | 1.358 |
| SFT 2B (by RI) | 0.806 | 0.652 | 0.508 | 0.392 | 0.305 | 0.601 | 1.339 |

Despite requiring less GPU memory, LoRA 8B still performs better than SFT 2B at adapting to the target dataset.

If resources are limited and fine-tuning is necessary, it may be more practical to use a lightweight fine-tuning method that allows us to use larger pretrained architectures within the same resource constraints. 

Although we did not explore it in this study, even larger pretrained models can be considered with methods that do not require updating model weights. For example, in-context learning adds a few examples to the prompt to guide the model toward the desired output format and tone.
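As a simple sketch of in-context learning (the example questions, answers, and formatting below are made up for illustration and would need to be adapted to the target model's prompt template):

```python
def build_few_shot_prompt(examples, query):
    """Concatenate (question, answer) pairs so the model can imitate
    the demonstrated output format when answering the final query."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    parts.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    [("Describe the image in one sentence.", "A cat sleeping on a red sofa."),
     ("Describe the image in one sentence.", "Two children playing football in a park.")],
    "Describe the image in one sentence.",
)
```

Because nothing is trained, this approach scales to models far too large to fine-tune on the same hardware.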

Conclusion  

In this blog post, we explored some key challenges of fine-tuning VLMs. 

Fine-tuning is often seen as the default choice for improving model performance. However, this assumption can be misleading, and fine-tuning is frequently more complex and resource-intensive than one might expect.

There are many reasons why fine-tuning can fail to bring improvement.

  • The choice of fine-tuning may not be appropriate

    • Fine-tuning is not suitable for injecting external knowledge without a tremendous amount of data.

    • The bottleneck may lie in the vision part of the architecture rather than the language part, which fine-tuning typically leaves frozen.

    • The task may be very general (e.g. captioning, summarization, etc.), leaving no reason to think the pretrained model’s training is insufficient.

  • The training data is not prepared carefully

    • The training data is too different from the test data and is not appropriate for the intended use of the model.

    • The training data quality is poor and cannot properly teach the model. The purpose of fine-tuning is not to "improve" the model but to align it more closely with the distribution of the training data.

  • The training process is difficult

    • General knowledge is lost due to catastrophic forgetting.

    • The ability to follow instructions or answer questions is lost despite the model successfully adapting to the new data.

Given the high cost and the significant risk of degraded performance, we strongly recommend carefully evaluating whether alternative approaches (such as prompt tuning, retrieval-augmented generation (RAG), etc.) may be more suitable and whether basic requirements (amount and quality of the available data) are met.

References

[1] Chen, Zhe, et al. "Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024. url: https://arxiv.org/abs/2312.14238.

[2] Aditya Jain. “To fine-tune or not to fine-tune.” 2024. url: https://ai.meta.com/blog/when-to-fine-tune-llms-vs-other-techniques/

[3] Russakovsky, Olga, et al. "Imagenet large scale visual recognition challenge." International journal of computer vision. 2015. url: https://arxiv.org/abs/1409.0575.

[4] Schuhmann, Christoph, et al. "Laion-5b: An open large-scale dataset for training next generation image-text models." Advances in neural information processing systems. 2022. url: https://arxiv.org/abs/2210.08402.

[5] Hugging Face Blog. "Vision Language Models Explained." 2025. url: https://github.com/huggingface/blog/blob/main/vlms.md

[6] InternVL Authors. "Fine-tune on a Custom Dataset." 2025. url: https://internvl.readthedocs.io/en/latest/internvl2.0/finetune.html.

[7] Mathew, Minesh, Dimosthenis Karatzas, and C. V. Jawahar. "Docvqa: A dataset for vqa on document images." Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2021. url: https://arxiv.org/abs/2007.00398.

[8] Biten, Ali Furkan, et al. "Scene text visual question answering." Proceedings of the IEEE/CVF international conference on computer vision. 2019. url: https://arxiv.org/abs/1905.13648

[9] Kembhavi, Aniruddha, et al. "A diagram is worth a dozen images." Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer International Publishing, 2016. url: https://arxiv.org/abs/1603.07396

[10] Fujii, Kazuki, et al. "Continual pre-training for cross-lingual llm adaptation: Enhancing japanese language capabilities." arXiv preprint arXiv:2404.17790 (2024). url: https://arxiv.org/abs/2404.17790

[11] Chen, Xinlei, et al. "Microsoft coco captions: Data collection and evaluation server." arXiv preprint arXiv:1504.00325 (2015). url: https://arxiv.org/abs/1504.00325

[12] Papineni, Kishore, et al. "Bleu: a method for automatic evaluation of machine translation." Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 2002. url: https://aclanthology.org/P02-1040.pdf.

 [13] Google Cloud documentation. "Understanding the BLEU score." 2025. url: https://cloud.google.com/translate/docs/advanced/automl-evaluate#bleu.