Hello everyone, I am Ajmain Inqiad Alam, and I work as an Artificial Intelligence Engineer at Ridge-i Inc. In this article, I will introduce the Data Version Control (DVC) tool. The article covers the following topics:
- What is Data Version Control (DVC)?
- Use case of DVC
- DVC comparison with other tools
- Installation of DVC
- Example on how to use DVC
In today's world, Machine Learning (ML) and Artificial Intelligence (AI) have become very significant for us. Almost every aspect of our lives has been influenced by AI and ML. As a result, more and more data is being generated and processed by AI engines. Researchers and engineers have to deal with large datasets in their day-to-day work to make their AI models more robust. This huge amount of data needs to be tracked, which is difficult and monotonous work. This is where Data Version Control, or DVC, comes into the picture. DVC keeps track of datasets and ML models, together with the history of who created a dataset and who altered it and when.
What is Data Version Control (DVC)?
According to the official website of Data Version Control (https://dvc.org/), DVC tracks ML models and datasets. DVC is built to make ML models shareable and reproducible. It is designed to handle large files, datasets, machine learning models, and metrics as well as code. In short, DVC is Git for datasets. The first version of DVC was released in 2017 as a simple command-line tool. DVC is an open-source tool that gives users the option to reuse their ML experiments. DVC does not have its own execution engine; instead, it works as a wrapper around Git to version control the data for ML experiments.
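To make the "Git for datasets" idea more concrete, here is a minimal Python sketch of a pointer record: a small description of a large file (content hash, size and file name) that could be committed to Git in place of the data itself. The function name make_pointer and the record layout are assumptions made for illustration only; the real .dvc files are written by dvc add and contain more detail.

```python
import hashlib
import os
import tempfile


def make_pointer(path, chunk_size=1024 * 1024):
    """Return a small dict describing a file by content hash (hypothetical helper).

    This only imitates the idea behind a .dvc metafile; it is not DVC's own code.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so that large datasets do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return {
        "md5": md5.hexdigest(),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }


if __name__ == "__main__":
    # Demo on a throwaway file; a real dataset would be something like data/data.xml.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "data.xml")
        with open(sample, "wb") as f:
            f.write(b"<rows></rows>")
        print(make_pointer(sample))
```

Because the record depends only on the file content, any change to the dataset produces a different hash, which is what lets a small text file in Git stand in for one exact version of the data.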
Use case of DVC
Nowadays, more and more data is being generated, and it is very difficult to keep track of all the data used across experiments. DVC frees users from having to remember the location and details of each dataset manually. Here are the main use cases of DVC:
Save and reproduce your experiments: DVC allows users to fetch any complete version of a dataset at any time and restore it to exactly the location where it was kept before. As a result, users can re-create the same experiment and obtain the same results.
Establish a workflow for deployment & collaboration: DVC acts as a platform for sharing data and results, collaborating with others, and running the finished model in a production environment.
DVC comparison with other tools
Besides DVC, there are other data version control tools. Pachyderm is one of the well-known data pipeline versioning tools. Here is a comparison between the two tools:
- DVC is more lightweight than Pachyderm.
- DVC is much more community driven, whereas Pachyderm is a more enterprise-driven tool.
- Integrating DVC is much easier than integrating Pachyderm, as DVC does not have its own execution engine.
- Pachyderm is more robust compared to DVC.
Installation of DVC
DVC is easy to install. The installation page on the official website (https://dvc.org/doc/install) describes the available options. I used the following command to install DVC on my machine:
~/> pip install dvc
Example on how to use DVC
In this section, we will walk through some DVC commands.
- First, let's create a folder in which to try out DVC, and initialize Git and DVC inside it.
~/> mkdir DVC
~/> cd DVC
~/DVC > git init
~/DVC > dvc init
After that, dvc init prints a message confirming that the DVC repository has been initialized.
Now let's check what dvc init created for us:
~/DVC > git status
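This lists the new files that dvc init created and staged for commit, such as .dvc/config, .dvc/.gitignore and .dvcignore (the exact list may vary with the DVC version).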
- Now let's create a folder and save our dataset in it. For this example, I will use the open-source sample data provided by DVC.
~/DVC > mkdir data
~/DVC > dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
Now let's push our data to cloud storage. I am using Google Drive because it is user friendly, but you can opt for any other cloud storage. Before pushing, we need to start tracking the dataset with DVC:
~/DVC > dvc add data/data.xml
Then commit the generated .dvc file and .gitignore to Git so that this version of the dataset is tracked:
~/DVC > git add 'data/data.xml.dvc' 'data/.gitignore'
~/DVC > git commit -m "adding raw data"
Now I will add Google Drive as my remote storage:
~/DVC > dvc remote add -d storage gdrive://folder-id
I took the folder-id from the URL of the Google Drive folder. For example:
Folder URL: https://drive.google.com/drive/folders/abcd
Folder ID: abcd
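The same lookup can be sketched in Python. The helper names drive_folder_id and gdrive_remote_url are hypothetical, and the sketch assumes the ID is the path segment that directly follows "folders" in the URL, as in the example above.

```python
from urllib.parse import urlparse


def drive_folder_id(folder_url):
    """Extract the folder ID from a Google Drive folder URL (hypothetical helper)."""
    parts = [p for p in urlparse(folder_url).path.split("/") if p]
    # Assumption: the ID is the path segment right after "folders".
    if "folders" not in parts or parts.index("folders") + 1 >= len(parts):
        raise ValueError(f"not a Drive folder URL: {folder_url}")
    return parts[parts.index("folders") + 1]


def gdrive_remote_url(folder_url):
    """Build the gdrive:// URL that is passed to `dvc remote add`."""
    return "gdrive://" + drive_folder_id(folder_url)


if __name__ == "__main__":
    print(gdrive_remote_url("https://drive.google.com/drive/folders/abcd"))
```

For the example URL this prints gdrive://abcd, which is the value used in the dvc remote add command. Query strings such as ?usp=sharing are ignored because only the path part of the URL is inspected.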
Now we need to commit the DVC config file to Git so that we can keep track of this configuration:
~/DVC > git commit .dvc/config -m "Configure remote storage"
To push the data to cloud storage:
~/DVC > dvc push
If everything is okay, we should see this output:
Authentication successful. 1 file pushed
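If you are running this for the first time, you may encounter an error saying that DVC cannot use the gdrive remote because a required package is missing.
To handle this error, install the Google Drive plugin with the following command: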
~/DVC > pip install dvc-gdrive
After that, run the dvc push command again and it should give the desired output. Now go to your Google Drive (or whichever cloud storage you chose) and you will see that DVC has created a folder containing the meta information and the data.
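The track-and-push sequence used in this section (dvc add, git add, git commit, dvc push) is easy to script. Below is a minimal Python sketch that only builds and prints the commands as a dry run. The function name track_and_push_commands is hypothetical, and the sketch assumes the .gitignore file sits in the same folder as the data file, as it did in the example.

```python
import posixpath
import shlex


def track_and_push_commands(data_path, message):
    """Return the commands that version one data file (hypothetical helper).

    Mirrors the manual steps: dvc add, git add, git commit, dvc push.
    """
    pointer = data_path + ".dvc"  # metafile written by `dvc add`
    ignore = posixpath.join(posixpath.dirname(data_path), ".gitignore")
    return [
        ["dvc", "add", data_path],
        ["git", "add", pointer, ignore],
        ["git", "commit", "-m", message],
        ["dvc", "push"],
    ]


if __name__ == "__main__":
    # Dry run: only print the commands. To execute them, pass each list to
    # subprocess.run(cmd, check=True) inside an initialized Git + DVC repository.
    for cmd in track_and_push_commands("data/data.xml", "adding raw data"):
        print(shlex.join(cmd))
```

Keeping the commands as lists of arguments avoids shell quoting problems with commit messages, and printing them first makes it easy to review the steps before anything is run.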
Pull, Change and Re-push
Now let's try to pull the pushed data from the cloud, make some changes and push again.
- Pulling data from cloud storage:
~/DVC > dvc pull
- Now let's make some changes to the dataset, for example by editing data/data.xml.
- Now let's push these changes to the cloud:
~/DVC > dvc add data/data.xml
~/DVC > git add 'data/data.xml.dvc'
~/DVC > git commit -m "updated dataset"
~/DVC > dvc push
Pushing the changed dataset creates a separate folder in your cloud storage, so the previous data is not replaced.
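Why the old data survives can be illustrated with a small content-addressed store in Python. This is a simplified sketch under stated assumptions, not DVC's implementation: the helper names store_version and load_version are hypothetical, and the layout (a folder named after the first two characters of the MD5 hash, with the rest of the hash as the file name) is an assumed simplification.

```python
import hashlib
import os
import tempfile


def store_version(storage_dir, content):
    """Save bytes under a name derived from their MD5 hash (hypothetical helper).

    Assumed layout: <storage>/<first 2 hex chars>/<remaining 30 hex chars>.
    """
    digest = hashlib.md5(content).hexdigest()
    folder = os.path.join(storage_dir, digest[:2])
    os.makedirs(folder, exist_ok=True)
    target = os.path.join(folder, digest[2:])
    if not os.path.exists(target):  # identical content is stored only once
        with open(target, "wb") as f:
            f.write(content)
    return digest


def load_version(storage_dir, digest):
    """Read back the bytes stored for a given hash."""
    with open(os.path.join(storage_dir, digest[:2], digest[2:]), "rb") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as storage:
        v1 = store_version(storage, b"<rows>original</rows>")
        v2 = store_version(storage, b"<rows>updated</rows>")
        # Both versions remain available side by side.
        print(v1 != v2, load_version(storage, v1), load_version(storage, v2))
```

Since the storage location is derived from the content, an updated dataset can never land on top of an older one, and the hash recorded in Git at each commit is enough to find the matching version again.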
In this article, we have learnt about DVC and its usage. At the end, we walked through some example commands to get an idea of how to operate DVC. DVC gives users the option to track, visualize and re-use datasets. For large ML projects that require a lot of data manipulation or processing, DVC is certainly recommended. Additionally, DVC helps in debugging ML models by enhancing reproducibility.
- DVC: https://dvc.org/
- DVC youtube channel: https://www.youtube.com/channel/UC37rp97Go-xIX3aNFVHhXfQ