ML data annotation workflow with CVAT, FiftyOne, and DVC

Hello.

My name is Jonathan, and I'm an artificial intelligence engineer at Ridge-i Inc.

Like many AI engineers, I enjoy designing and training ML (machine learning) models and experimenting with various architectures. However, one of the most important parts of any successful ML project is creating good training data. For a typical computer vision problem, this involves sourcing raw image data, cleaning it, and annotating it (among other things). This article focuses on the last step: annotating image data.

When annotating an image dataset as part of an ML project, ideally you want the following conditions satisfied:

  • Seamless and fast transition between annotation and model training to enable quick ML cycles.

  • Ability to collaborate on and outsource your data annotation tasks in a way that maintains tight integration with your ML project source code.

To achieve this, you need your annotation environment to be centralized, free of concerns about underlying data formats and structure, and yet easily accessible to the rest of your MLOps code. Additionally, if you're annotating your data in stages, it helps to be able to switch between dataset versions.

In this article, I will introduce three open source tools you can integrate in your MLOps pipeline to achieve these goals, and include Python code examples of how to use them.

Separation of concerns between annotation and model training

CVAT

CVAT (Computer Vision Annotation Tool) is an open-source tool used to annotate data for computer vision models. It offers cloud and self-hosted options and allows collaboration on annotation projects.

FiftyOne

FiftyOne is a dataset management tool that enables dataset visualization and model interpretation. It integrates seamlessly with CVAT, requiring minimal code to incorporate CVAT into your project. I mentioned before that the annotation environment should be free of concerns about underlying data formats and dataset structures. Well, typically that would be handled by the code that interfaces with CVAT, but FiftyOne abstracts those details away and allows you to focus on the data itself.

DVC

Short for data version control, DVC is a tool that does exactly what its name suggests. It helps you keep track of versions of your data and provides a mechanism to switch between these data versions.

We will use the example of annotating an object detection dataset and walk through a simple workflow of importing, annotating, and versioning the dataset.

Setup

  1. Install necessary packages
  2. Set up CVAT

Install necessary packages

Assuming that you already have a Python environment set up, you need to install FiftyOne, CVAT, and DVC.

You can install FiftyOne through pip (pip install fiftyone) or Poetry. For more details, refer to the installation instructions.

For CVAT, you can use the cloud solution (CVAT.ai) or install a self-hosted solution. The self-hosted option gives you more control over which version to use, which helps ensure compatibility with other tools in your ML pipeline (FiftyOne in our case)[*1]. Additionally, it allows you to add models for assisted labeling (like the Segment Anything Model). To install a self-hosted CVAT server, you can follow the instructions on the official website.

You also need to install DVC. Installation instructions can be found in the official documentation. Note that DVC's data versioning features rely on Git, so make sure Git is installed and your project directory is a Git repository. Then open a terminal inside your project directory and run the following command to initialize it as a DVC project:

dvc init

This will create some internal files, which DVC automatically stages; commit them to Git as follows:

git commit -m "Initialize DVC"

Set up CVAT

After following the above instructions to install CVAT, you should be able to access it from your browser. Assuming you installed it in your local environment, you can access it at http://localhost:8080. There isn't much else to do, but optionally, you might want to take the following additional steps (using the CVAT GUI):

  • Create user(s). After following the installation instructions, you should have a super-user account. You can use this account to create users and assign them privileges through user groups.
  • Create an organization. An organization is the basis of collaboration in CVAT. You can add members to an organization and assign them to labeling tasks later.

  • Create a project. You do not need a project to create an annotation task but a project helps you to organize tasks of the same type within CVAT.

Workflow

The workflow will involve the following steps:

  1. Import dataset
  2. Upload data to CVAT
  3. Download annotations
  4. Export and version dataset

1. Import dataset

You can create a FiftyOne dataset by importing from an image directory as follows:

# dataset_import.py

import fiftyone as fo

image_dir = "/path/to/your/images"
dataset_name = "object_detection_test_dataset"
dataset = fo.Dataset(name=dataset_name)
dataset.add_dir(image_dir, dataset_type=fo.types.ImageDirectory)
dataset.persistent = True

session = fo.launch_app(dataset)
session.wait()

The line dataset.persistent = True is important, as it persists your dataset to the FiftyOne database. Without it, your dataset will be deleted from the database once the program exits. Persistent datasets also work better with annotation jobs, so this will come in handy later.

The last two lines simply create a FiftyOne app session and connect it to your dataset so you can view your dataset in the browser. session.wait() makes the program wait until you have exited the app.
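If you want to confirm that the dataset was persisted, you can check the FiftyOne database from a separate Python session; a minimal sketch:

# check_persistence.py
# A quick check from a new Python session: the persistent dataset should
# still be listed in the FiftyOne database.

import fiftyone as fo

print(fo.list_datasets())  # should include "object_detection_test_dataset"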

2. Upload data to CVAT

Next, to upload your dataset to CVAT for annotation, you can do the following:

# start_annotation.py

import fiftyone as fo

dataset_name = "object_detection_test_dataset"
dataset = fo.load_dataset(name=dataset_name)
annotation_key = "ground_truth_detections"

label_schema = {
    "ground_truth": {
        "type": "detections",
        "classes":["class1", "class2", "class3"],
    }
}

dataset.annotate(
    annotation_key,
    label_schema=label_schema,
    organization="your-organization-name",
    project_name="your-project-name",
    task_name="your-task-name",
    url="http://localhost:8080"
)

In the above code, we begin by loading the dataset using fo.load_dataset(name=dataset_name). This is possible because we made the dataset persistent in the previous script.

Next, we define the annotation key.

annotation_key = "ground_truth_detections"

An annotation key is a unique identifier for an annotation run. We will use this to import the annotations back into the dataset.
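Since annotation runs are stored on the dataset itself, you can look them up later; a minimal sketch (assuming the dataset and key defined above):

# list_annotation_runs.py

import fiftyone as fo

dataset = fo.load_dataset("object_detection_test_dataset")

# Print the annotation keys of all annotation runs attached to this dataset
print(dataset.list_annotation_runs())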

Next, we define a label schema. This defines:

  1. A label field ground_truth, which, simply put, points to a particular set of labels over the same dataset. For example, you can have ground_truth and model_prediction label fields. If your CVAT task will modify existing labels, make sure the value corresponds to the name of the label field you would like to modify.

  2. Type of label (detection bounding boxes in this case), and

  3. Expected object classes.

We then call dataset.annotate() to upload the annotation task, using the annotation key defined earlier. By default, this assumes that your annotation backend is CVAT, but if your default is set to something else, you can choose CVAT as the backend using the backend keyword argument. You can also set some CVAT-specific keyword arguments here, such as organization, project, and task names. If a project with the specified name does not exist, one is created. If the url argument is not provided, FiftyOne connects to the CVAT cloud instead.
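If CVAT is not your default annotation backend, or you want to avoid being prompted for credentials every time, the sketch below selects the backend explicitly and supplies the server URL and credentials via environment variables. The environment variable names follow FiftyOne's CVAT integration documentation, and the username and password are placeholders, so verify them against your own setup:

# start_annotation_explicit_backend.py

import os

import fiftyone as fo

# Self-hosted CVAT server and credentials (placeholders; set your own values)
os.environ["FIFTYONE_CVAT_URL"] = "http://localhost:8080"
os.environ["FIFTYONE_CVAT_USERNAME"] = "your-username"
os.environ["FIFTYONE_CVAT_PASSWORD"] = "your-password"

dataset = fo.load_dataset("object_detection_test_dataset")

label_schema = {
    "ground_truth": {
        "type": "detections",
        "classes": ["class1", "class2", "class3"],
    }
}

dataset.annotate(
    "ground_truth_detections",  # annotation key; must be unique per run
    label_schema=label_schema,
    backend="cvat",  # explicit, in case your default backend is different
    project_name="your-project-name",
    task_name="your-task-name",
)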

3. Download annotations

Once annotation is finished in CVAT, we can then import the annotations back into FiftyOne as follows:

# finish_annotation.py

import fiftyone as fo

dataset_name = "object_detection_test_dataset"
annotation_key = "ground_truth_detections"
dataset = fo.load_dataset(name=dataset_name)

dataset.load_annotations(annotation_key, url="http://localhost:8080")

view = dataset.load_annotation_view(annotation_key)
session = fo.launch_app(view)
session.wait()

After loading the dataset, we call dataset.load_annotations() with the annotation key to import the annotations into the dataset. Note that if you are using a self-hosted CVAT instance, you need to specify the url argument here. Otherwise, this function will try to access the CVAT cloud. The last three lines open the FiftyOne app with a dataset view of the imported annotations. Optionally, you can add the following:

results = dataset.load_annotation_results(annotation_key)
results.cleanup()
dataset.delete_annotation_run(annotation_key)

This will delete the task in CVAT and remove the annotation run from FiftyOne. Note that there are situations where you might not want to do this, e.g. if you want to train your model on the annotations that are already available (for hyper-parameter tuning purposes) while you wait for the rest of the annotation work to finish. You can call dataset.load_annotations() as many times as necessary, as long as the task in CVAT and the annotation key still exist.
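For example, to start training on the samples that have already been labeled while the rest of the work continues in CVAT, you can restrict the dataset to the samples whose label field is populated; a minimal sketch, assuming the ground_truth field from the schema above:

# train_on_partial_annotations.py

import fiftyone as fo

dataset = fo.load_dataset("object_detection_test_dataset")

# Pull in whatever has been annotated in CVAT so far
dataset.load_annotations("ground_truth_detections", url="http://localhost:8080")

# View containing only the samples whose "ground_truth" field is populated
annotated_view = dataset.exists("ground_truth")
print(f"{len(annotated_view)} of {len(dataset)} samples annotated so far")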

4. Export and version dataset

Thus far, the annotations have been stored in the FiftyOne database. To manage the dataset with DVC, both the images and the annotations need to be exported to a location on disk. As an example, you can export the dataset in YOLOv5 format as follows:

import fiftyone as fo

dataset_name = "object_detection_test_dataset"
dataset = fo.load_dataset(name=dataset_name)

dataset.export(
    export_dir="path/to/dataset",
    dataset_type=fo.types.YOLOv5Dataset,
    label_field="ground_truth",
)

FiftyOne provides several formats for exporting datasets and the ability to switch between formats. By default, export() merges the outgoing export with any existing dataset of the same format at the specified location; you can disable this behavior by passing overwrite=True. One thing to note if you are managing your dataset with DVC is that some FiftyOne dataset formats store annotations with absolute paths to images, which makes it challenging to access the same dataset from a different machine later. Some of these formats, however, provide an option to disable absolute paths when exporting your dataset.
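For example, to replace a previous export at the same location instead of merging into it, a minimal sketch:

import fiftyone as fo

dataset = fo.load_dataset("object_detection_test_dataset")

dataset.export(
    export_dir="path/to/dataset",
    dataset_type=fo.types.YOLOv5Dataset,
    label_field="ground_truth",
    overwrite=True,  # delete any existing export at this location instead of merging
)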

To track the exported dataset with DVC, first add the export directory to DVC and then commit the files that DVC generates to Git:

dvc add path/to/dataset
git add path/to/dataset.dvc path/to/.gitignore
git commit -m "First annotation"

You will probably also want to configure remote storage for your dataset. For more details, you can refer to DVC's remote storage guide, but as an example, you could add an AWS S3 remote as follows (S3 support requires installing DVC with the s3 extra, i.e. dvc[s3]):

dvc remote add -d storage s3://mybucket/dvcstore

and then push your changes with:

dvc push

Note that if you're accessing your dataset remotely, you also need to configure a Git remote and push all DVC related files that are tracked by Git.
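Once a few dataset versions have been committed, switching between them follows the standard DVC workflow: check out the corresponding .dvc file with Git and let DVC restore the matching data (the commit reference below is a placeholder):

git checkout <some-earlier-commit> path/to/dataset.dvc
dvc checkout

If that version of the data is not in your local cache, run dvc pull first to fetch it from remote storage.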

That was a brief demonstration of how to use CVAT, FiftyOne, and DVC to create a seamless annotation workflow for your machine learning projects!

References

links:

[1] https://docs.voxel51.com/getting_started/install.html

[2] https://opencv.github.io/cvat/docs/administration/basics/installation/

[3] https://dvc.org/doc/user-guide/data-management/remote-storage

*1:At the time of writing this article, the latest version of FiftyOne (v0.20.1) is compatible with CVAT v2.3.0. Make sure to switch to the corresponding branch of CVAT before installing it through Docker Compose.