Using diffusion models to generate synthetic data for real-life projects

Hello, this is Aurélie, working as an Artificial Intelligence Engineer at Ridge-i. Today, I would like to introduce a way to use synthetic data generated with diffusion models to improve the performance of our models, even when working with difficult data (small datasets, unusual data)!

Disclaimer: The survey and case study were conducted in October 2023.

Motivation

Recent research papers have shown that it is possible to improve the performance of models on popular benchmarks by using synthetic data generated by diffusion models.

However, the data used in actual projects (such as the ones conducted at Ridge-i) is very different from the datasets used in papers.

Can similar methods be applied easily and at a low cost in an industrial context?

A survey of existing methods

Benefits of using diffusion models for synthetic data generation

Diffusion models are a very popular choice for generating synthetic data due to their ability to produce high-quality, realistic images.

They are good at generating diverse outputs and can even “combine” knowledge from different classes to generate novel data unseen in the diffusion model’s training data.

If the model has been trained on “avocado” and “chair”, it is possible to generate “avocado chairs” even if there are no such examples in the training set.

This is especially attractive when the client is unable to gather data easily (data that is difficult, illegal, or dangerous to collect), as long as we can describe the desired data in enough detail!

Differences between academic research and industry

The most popular way to use diffusion models to generate synthetic data is by far text-to-image generation: by providing a text prompt, an image matching the description is generated.

Text-to-image examples
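
As a reference, here is a minimal sketch of how such a text-to-image generation could be run with the diffusers library (the model ID, prompt, and file names are illustrative assumptions, not the exact settings used for the figures above):

```python
# Minimal text-to-image sketch with Hugging Face diffusers (assumes torch,
# diffusers and a CUDA GPU are available). Model ID and prompt are examples.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# One text prompt -> one image matching the description.
image = pipe("a professional photograph of an avocado-shaped chair").images[0]
image.save("avocado_chair.png")
```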

A large number of research papers can be found online that have already shown that synthetic images from diffusion models can help improve the performance of models trained for various tasks (e.g. recognition [1], segmentation [2], continual learning for classification [3], etc.).

However, the majority of published papers focus almost exclusively on very general datasets and popular benchmarks.

On the other hand, the majority of projects deal with unusual data such as microscope, satellite, or medical imagery.

In these cases, the text-to-image methods seen in papers cannot be applied easily.

How to write a prompt for this image?

Even if we are able to come up with a proper prompt, the results might not be good. For example, it is very clear that the recent diffusion models we tested for this article (Stable Diffusion 2.1, Stable Diffusion XL, DALLE-2) do not know what a “SAR” image is.

"SAR image of boats in the sea."

"Satellite view of boats in the sea by a Synthetic-aperture radar."

Several methods exist to solve this issue (learning a new definition for our dataset with textual inversion, fine-tuning the diffusion model, etc.; see one of our blog posts for more details), but today we would like to introduce a no-training, low-resource method that can be used at a low cost even by non-engineers!

Methods that can be used for actual projects

Writing a (good) prompt can be challenging, as industrial data can be hard to describe. Furthermore, there is no guarantee that diffusion models trained mostly on very general data (e.g. cats, dogs, tables, chairs) would understand a description using technical vocabulary (e.g. SAR, LiDAR, technical names for machinery parts, etc.).

One solution is to use image inputs instead of text inputs!

Variations can be generated from an input image, and this approach can be used on a large variety of data.

Leftmost column: Input / Remaining columns: Variations
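
For Stable Diffusion, one way to obtain such variations is the image-to-image pipeline from diffusers, which re-noises the input image and denoises it again. The sketch below is only an illustration under that assumption (model ID, strength, and file names are ours, not settings validated in this article):

```python
# Image-to-image variations with diffusers (assumed setup, not the exact
# pipeline used for the figure above). A low "strength" keeps outputs close
# to the input image; a higher value allows more variation.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.png").convert("RGB").resize((768, 768))

variations = pipe(
    prompt="",                 # empty / generic prompt: the input image drives the result
    image=init_image,
    strength=0.3,              # fraction of the diffusion process applied to the input
    num_images_per_prompt=4,
).images

for i, img in enumerate(variations):
    img.save(f"variation_{i}.png")
```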

It is possible to use additional inputs such as text, a mask, or a label map. However, the results of our internal tests were not very good.

Text inputs fail in the same way that text-to-image generation fails on unusual data. In the example below, DALLE-2 completely fails at generating trees.

Top: Original image with mask / Bottom: Synthetic images. Prompt: “Professional photograph of trees, Satellite imagery, remote sensing, worldview”
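
For reference, a mask-guided edit like the one above can be requested through the DALLE-2 image edit endpoint. The sketch below assumes the legacy openai Python package (v0.x); file names are placeholders:

```python
# Mask-guided generation with the DALLE-2 edit endpoint (legacy openai v0.x
# Python client). Transparent pixels in the mask mark the region to repaint.
import openai

openai.api_key = "sk-..."  # in practice, read from an environment variable

response = openai.Image.create_edit(
    image=open("satellite_tile.png", "rb"),  # square RGBA PNG
    mask=open("tree_mask.png", "rb"),        # same size; transparent = area to fill
    prompt="Professional photograph of trees, Satellite imagery, remote sensing, worldview",
    n=4,
    size="1024x1024",
)
image_urls = [item["url"] for item in response["data"]]
```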

A label map can be used to guide the position of the objects we want to generate. However, we were not able to get good results with the current state-of-the-art papers. For example, DenseDiffusion (ICCV 2023) was not able to generate the objects at the desired positions and even failed at generating some of them!

Top: Label map / Bottom: Generated images

In conclusion, it seems that any method involving labels or text inputs will fail on unusual data if the diffusion model is not fine-tuned first.

Case study - SAR images of boats for object counting

To assess whether synthetic data can improve the performance of models in a setting similar to the actual projects we see at Ridge-i, we decided to conduct a small internal trial.

  • Data : SAR images of boats, HRSID dataset (GPL-3.0 license)
  • Dataset size : Small (60 or 100 images for training and validation)
  • Task : Object detection

Variations using DALLE-2

We chose DALLE-2 for generating our images since all rights to the generated images are granted to the user.

Note: Stable Diffusion can be used for free, even for commercial usage (although it is not recommended by the authors), and could be a better choice when large amounts of synthetic data need to be generated.

Top: Original image / Bottom: Variations
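
A minimal sketch of how such variations can be requested through the DALLE-2 API, assuming the legacy openai Python package (v0.x); file names and parameters are illustrative:

```python
# Generating variations of a real SAR tile with the DALLE-2 variations
# endpoint (legacy openai v0.x Python client). The input must be a square
# PNG under 4 MB; n and size are illustrative.
import openai

openai.api_key = "sk-..."

response = openai.Image.create_variation(
    image=open("hrsid_tile.png", "rb"),
    n=10,                      # number of variations per real image
    size="1024x1024",
)
variation_urls = [item["url"] for item in response["data"]]
```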

The data was then annotated manually.

Settings

We decided on two data splits:

  • Split_1 : 80 (Train) / 20 (Validation) / 100 (Test)
  • Split_2 : 40 (Train) / 20 (Validation) / 100 (Test)

The experiment conditions are as follows:

  • Baseline - No synthetic data
  • 2 variations per real sample (train and validation)
  • 10 variations per real sample (train and validation)

The goal is to check:

  • Is the performance gain larger when the amount of real data is smaller?
  • Does increasing the number of variations improve the performance?
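
As a rough illustration of how the training sets for the last two conditions can be assembled, the sketch below mixes each real image with its synthetic variations (the directory layout and naming scheme are assumptions for illustration, not our internal setup):

```python
# Hedged sketch: build a training image list that mixes real images with
# k synthetic variations per image. Paths and the "<stem>_var<i>.png" naming
# scheme are assumptions for illustration only.
from pathlib import Path

def build_train_list(real_dir: str, synth_dir: str, variations_per_image: int) -> list[str]:
    paths = []
    for real in sorted(Path(real_dir).glob("*.png")):
        paths.append(str(real))
        for i in range(variations_per_image):
            candidate = Path(synth_dir) / f"{real.stem}_var{i}.png"
            if candidate.exists():
                paths.append(str(candidate))
    return paths

# e.g. the "2 variations per real sample" condition
train_images = build_train_list("data/train/real", "data/train/synth", variations_per_image=2)
```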

Model

We decided to use YOLOv3 (GPL-3.0 license) due to its license and ease of use. We used the same parameters (e.g. batch size, image size, etc.) for all experiments.

Results

Experiment results

Conclusion

We can see from the results that:

  • Using synthetic data from diffusion models yields a significant improvement in performance.
  • Using more variations seems to improve the performance further. However, the synthetic-to-real ratio should be chosen carefully.
  • The improvement is especially large when the original dataset is very small.

Below are some example results from the test set:

Top: Ground-truth labels / Bottom: Detection results for experiments 1-4

Future improvements

This time, we were able to obtain a significant improvement in the performance of our object detection models without any fine-tuning of the diffusion model! As such, this process can be applied at a very low cost, even by non-engineers!

If time and resources are available, performance is likely to improve further with fine-tuning, textual inversion etc.

One limitation of our process is that labels can only be generated automatically for classification tasks. Some areas of research seem promising for solving this issue:

  • Guided models that can take a label map as input
  • Diffusion models that can output both the generated image and its label

In the future, we will keep an eye on promising papers and assess whether they can be used for actual projects!

Conclusion

This time, we have confirmed that synthetic data generated by diffusion models can be used to boost the performance of our models, even for unusual data such as SAR imagery!

Furthermore, it seems that good results can be achieved at a low cost without fine-tuning the diffusion model.