Data Version Control (DVC): Beginner's Guide

Hello everyone, I am Ajmain Inqiad Alam, working as an Artificial Intelligence Engineer at Ridge-i Inc. In this article, I will introduce the Data Version Control (DVC) tool.

What is Data Version Control (DVC)?
Use case of DVC
DVC comparison with other tools
Installation of DVC
Example on how to use DVC
Conclusion
Lastly
Reference

In today's world, Machine Learning (ML) and Artificial Intelligence (AI) have become very significant for us. Almost every aspect of our lives has been influenced by AI and ML. As a result, more and more data is being generated and processed by AI engines. Researchers and engineers have to deal with large datasets in their day-to-day operations to make their AI models more robust. This huge amount of data needs to be tracked, which is difficult and monotonous work. This is where Data Version Control, or DVC, comes into the picture. DVC keeps track of datasets and ML models, along with the history of who created a dataset and who altered it and when.

What is Data Version Control (DVC)?

According to the official website of Data Version Control (https://dvc.org/), DVC tracks ML models and datasets. DVC is built to make ML models shareable and reproducible. It is designed to handle large files, datasets, machine learning models, and metrics as well as code. In short, DVC is Git for datasets. The first version of DVC was released in 2017 as a simple command-line tool. DVC is an open-source tool that lets users reproduce and reuse their ML experiments. DVC does not have its own execution engine; instead, it works on top of Git to version control the data for ML experiments.

Data Version Control (Image source: https://dvc.org)

Use case of DVC

Nowadays, more and more data is being generated. As a result, it is very difficult to track all the data for all the experiments. With DVC, users do not have to remember the location and details of each dataset manually. Here are the use cases of DVC:

Save and reproduce your experiments: DVC allows users to fetch any version of a dataset at any time, at the exact location where it was kept before. As a result, users can re-create the same experiment and the same results.
Version control models and data: DVC keeps small metafiles in Git, while the data itself is saved in various forms of remote storage, including cloud storage. The most beginner-friendly cloud storage is Google Drive.
Establish workflow for deployment & collaboration: DVC acts as a platform for sharing data and results, collaborating, and running the finished model in a production environment.

DVC comparison with other tools

Besides DVC, there are other data version control tools. Pachyderm is one of the well-known data pipeline versioning tools. Here is a comparison between the two tools:

DVC is lighter than Pachyderm.
DVC is much more community-driven, whereas Pachyderm is a more enterprise-driven tool.
Integration of DVC is much easier than that of Pachyderm, as DVC does not have its own execution engine.
Pachyderm is more robust compared to DVC.

Installation of DVC

DVC is easy to install. The installation page on the official website (https://dvc.org/) covers the available options. I used the following command to install DVC on my machine:

~/> pip install dvc
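After installing, it can be useful to confirm that the dvc command is actually reachable from the shell. The following is a minimal sketch of such a check; the variable dvc_found is only an illustrative name, and the message in the else branch is my own wording, not DVC output.

```shell
#!/usr/bin/env bash
# Sketch: confirm that the dvc command is available after installation.
# The variable name dvc_found is a hypothetical helper for this example.
set -euo pipefail

if command -v dvc >/dev/null 2>&1; then
    dvc --version        # prints the installed DVC version number
    dvc version          # prints more detail about the installation
    dvc_found=yes
else
    echo "dvc was not found on PATH; check where pip installed its scripts"
    dvc_found=no
fi
```

Since Google Drive is used as storage later in this article, installing with pip install "dvc[gdrive]" should bring in the Google Drive support in the same step.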

Example on how to use DVC

In this section, we will discuss some DVC commands.

First, let's create a folder where we will test DVC.

~/> mkdir DVC

~/> cd DVC

Initialize DVC

~/DVC > git init

~/DVC > dvc init

dvc init then prints a message confirming that the repository has been initialized.

Now let's check what dvc init created for us:

~/DVC > git status

The output lists the new files that dvc init created and staged, such as .dvc/config, .dvc/.gitignore and .dvcignore.

Now let's create a folder and save our dataset in it. For this example, I will use the open-source data provided by DVC.

~/DVC > mkdir data

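~/DVC > dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml

dvc get works much like curl or wget, except that it downloads a file or directory tracked in a DVC or Git repository. It takes the repository URL and the path inside that repository as two separate arguments.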

DVC Push

Now let's push our data to cloud storage. I am using Google Drive because it is user-friendly. You can opt for any other cloud storage.

~/DVC > dvc add data/data.xml

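This tells DVC to track the dataset and creates the metafile data/data.xml.dvc along with data/.gitignore. Both are then committed to Git: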

~/DVC > git add 'data/data.xml.dvc' 'data/.gitignore'

~/DVC > git commit -m "adding raw data"

Now I will add Google Drive as my cloud storage:

~/DVC > dvc remote add -d storage gdrive://folder-id

I took the folder-id from the Google Drive folder URL. For example:

Folder URL: https://drive.google.com/drive/folders/abcd

then folder-id: abcd
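The same extraction can be done in the shell instead of by hand. This is a minimal sketch using plain parameter expansion; the folder URL is the placeholder from the example above, and the variable names are my own.

```shell
#!/usr/bin/env bash
# Sketch: derive the gdrive:// remote URL from a Google Drive folder link.
# The folder URL is the placeholder used in the text, not a real folder.
set -euo pipefail

folder_url="https://drive.google.com/drive/folders/abcd"

folder_id="${folder_url##*/}"   # keep everything after the last slash
folder_id="${folder_id%%\?*}"   # drop a trailing query string, if there is one

echo "gdrive://${folder_id}"    # prints gdrive://abcd
```

The printed value is what goes at the end of the dvc remote add command.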

Now we need to commit the DVC config file to Git so that this configuration is tracked:

~/DVC > git commit .dvc/config -m "Configure remote storage"

To push the data to cloud storage:

~/DVC > dvc push

If everything is okay, we should have this as output:

Authentication successful. 1 file pushed

If you are running this for the first time, you may encounter an error reporting that Google Drive support is not available, because the gdrive remote needs an extra package.

To handle this error, use the following command:

~/DVC > pip install dvc-gdrive

After that, run the dvc push command again and it should give the desired output. Now go to your Google Drive or other cloud storage and you will see that DVC has created a folder there containing the meta information and data.

Pull, Change and Re-push

Now let's try to pull the pushed data from the cloud, make some changes and push again.

Pulling data from cloud storage:

~/DVC > dvc pull

Now let's make some changes in the dataset

~/DVC > cp data/data.xml /tmp/data.xml

~/DVC > cat /tmp/data.xml >> data/data.xml

Now let's push these changes to the cloud:

~/DVC > dvc add data/data.xml

~/DVC > git add 'data/data.xml.dvc'

~/DVC > git commit -m "updated dataset"

~/DVC > dvc push

It will create a separate folder in your cloud storage, so the previous data is not replaced.

Conclusion

In this article, we have learnt about DVC and its usage. At the end, we went through some example commands to get an idea of how to operate DVC. DVC gives users the option to track, visualize and re-use datasets. For a large ML project where a lot of data manipulation or processing is required, DVC is certainly recommended. Additionally, DVC helps to debug ML models by enhancing reproducibility.