DVC (Data Version Control) Cheat Sheet
DVC is a version control system for data and machine learning projects. It works alongside Git to
track large files and models, share them through remote storage, and reproduce experiments.
---
1. Getting Started
Initialize DVC in a Repository:
dvc init
Add Files or Directories to DVC:
dvc add <file_or_directory>
Commit Changes:
Use Git to commit the .dvc file and the updated .gitignore:
git add <file>.dvc .gitignore
git commit -m "Track data with DVC"
Configure Remote Storage:
dvc remote add -d myremote <remote_storage_url>
Push Data to Remote Storage:
dvc push
Pull Data from Remote Storage:
dvc pull
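The add, commit, and push steps above are often run together. A minimal sketch of a wrapper, assuming a POSIX shell; track_and_push is a hypothetical helper name, not a DVC command, and the path argument is assumed to have no trailing slash:

```shell
# track_and_push <path>: hypothetical helper, not part of DVC.
# Tracks <path> with DVC, commits the pointer file to Git, then uploads the data.
# dvc add writes <path>.dvc and updates the .gitignore next to the target,
# so both files are staged before the commit.
track_and_push() {
  target="$1"
  dvc add "$target" &&
  git add "$target.dvc" "$(dirname "$target")/.gitignore" &&
  git commit -m "Track $target with DVC" &&
  dvc push
}
```

For example, track_and_push data/raw.csv would commit data/raw.csv.dvc and data/.gitignore, then push the cached file to the default remote.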
---
2. Tracking Experiments
Run an Experiment (runs the pipeline and records the result as an experiment):
dvc exp run
Track Parameters:
Specify parameters in a params.yaml file and link them to stages in the pipeline.
Example params.yaml:
learning_rate: 0.01
batch_size: 32
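One way to link these parameters to a stage is the params field of dvc.yaml. The sketch below writes both files by hand in the current directory; the train stage and its file names are borrowed from the pipeline example in section 3 and are assumptions here, not a required layout.

```shell
# Sketch: run in an empty scratch directory; existing files are overwritten.
cat > params.yaml <<'EOF'
learning_rate: 0.01
batch_size: 32
EOF

# The params list names the keys in params.yaml that the stage depends on.
cat > dvc.yaml <<'EOF'
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      - data.csv
    params:
      - learning_rate
      - batch_size
    outs:
      - model.pkl
EOF

# The same link can be created from the command line with the -p flag:
#   dvc stage add -n train -p learning_rate,batch_size \
#     -d train.py -d data.csv -o model.pkl python train.py
# With DVC installed, a different value can then be tried without editing the file:
#   dvc exp run --set-param learning_rate=0.001
```

With this in place, changing either listed parameter marks the train stage as changed, and dvc params diff shows which values differ from the last commit.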
---
3. Pipelines
Define a Pipeline Stage:
dvc stage add -n <stage_name> -d <dependency> -o <output> <command>
Example:
dvc stage add -n train -d train.py -d data.csv -o model.pkl python train.py
Visualize the Pipeline:
dvc dag
Run the Entire Pipeline:
dvc repro
---
4. Metrics and Plots
Log Metrics:
Write metrics to a JSON file such as metrics.json:
{
  "accuracy": 0.95,
  "loss": 0.05
}
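Track the metrics file by declaring it as a stage output with -M (the file stays in Git instead of the DVC cache):
dvc stage add -n <stage_name> -M metrics.json <command>
Show Metrics:
dvc metrics show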
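Metrics can also be declared by hand in dvc.yaml. A minimal sketch, assuming a hypothetical evaluate stage whose evaluate.py script writes metrics.json; the stage and script names are illustrative only:

```shell
# Sketch: run in an empty scratch directory; an existing dvc.yaml is overwritten.
# cache: false keeps metrics.json out of the DVC cache so it can be committed to Git.
cat > dvc.yaml <<'EOF'
stages:
  evaluate:
    cmd: python evaluate.py
    deps:
      - evaluate.py
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
EOF
```

After the stage has run, dvc metrics show prints the recorded values.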
Visualize Plots:
Use DVC to generate plots from tracked data files:
dvc plots show <file>
---
5. Versioning Data
Check File Status:
dvc status
Stop Tracking a File (the data stays in the workspace unless --outs is passed):
dvc remove <file>.dvc
Checkout Specific Versions:
git checkout <commit_hash>
dvc checkout
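The two commands above switch the whole workspace. To roll back a single tracked file instead, the same idea can be applied to just its .dvc file. A sketch, assuming a POSIX shell; restore_version is a hypothetical helper name, not a DVC command:

```shell
# restore_version <rev> <path>: hypothetical helper, not part of DVC.
# Step 1 restores the pointer file <path>.dvc as it was at Git revision <rev>.
# Step 2 syncs the workspace copy of the data to match that pointer.
restore_version() {
  rev="$1"
  target="$2"
  git checkout "$rev" -- "$target.dvc" &&
  dvc checkout "$target.dvc"
}
```

For example, restore_version <commit_hash> data.csv leaves data.csv.dvc staged at the older version, so commit it to keep that version. If the older data is not in the local cache, dvc pull is needed to fetch it first.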
---
6. Sharing Projects
Push Project to Git and DVC Remote:
git push
dvc push
Clone a Repository and Retrieve Data:
git clone <repo_url>
cd <repo_directory>
dvc pull
---
7. Useful Commands
Show Pipeline Stages:
dvc stage list
Remove Unused Cache (keeps only data referenced in the current workspace):
dvc gc --workspace
Show Differences in Metrics:
dvc metrics diff
---
8. Remote Storage Options
DVC supports various remote storage backends:
- AWS S3: s3://bucket-name/path
- Google Drive: gdrive://<folder-id>
- Azure Blob Storage: azure://container-name/path
- SSH: ssh://user@server:/path
- Local Directory: /path/to/storage
Configure remotes using:
dvc remote add -d <name> <url>
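The remote URL is stored in .dvc/config, which is committed to Git, so credentials are better kept elsewhere. A sketch for an S3 remote, assuming the credentials are already exported as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; configure_s3_remote is a hypothetical helper name and myremote is the remote name used earlier in this sheet:

```shell
# configure_s3_remote <bucket/path>: hypothetical helper, not part of DVC.
# The URL goes into .dvc/config (shared through Git).
# --local writes the credentials to .dvc/config.local, which DVC keeps out of Git.
configure_s3_remote() {
  dvc remote add -d myremote "s3://$1" &&
  dvc remote modify --local myremote access_key_id "$AWS_ACCESS_KEY_ID" &&
  dvc remote modify --local myremote secret_access_key "$AWS_SECRET_ACCESS_KEY"
}
```

For example, configure_s3_remote bucket-name/path registers s3://bucket-name/path as the default remote. Running dvc remote list afterwards shows the configured remotes.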
---
9. Useful Links
- Official Documentation: https://dvc.org/doc
- DVC GitHub: https://github.com/iterative/dvc