Data Version Control

Note: This is the latest branch. It contains the most up-to-date README, so use this one. The other branches were committed along the way, so avoid those.

Fork the dvc template from https://github.com/realpython/data-version-control

Clone the forked repository to your computer with the git clone command:

git clone git@github.com:YourUsername/data-version-control.git

Make sure to replace YourUsername in the above command with your actual GitHub username.

Steps:

cd into the cloned data-version-control folder and initialize DVC using dvc init.
You can also create a new Git branch for experimentation.
DagsHub is used for remote storage, so create a repository on DagsHub and follow its DVC storage setup instructions. Add the remote with dvc remote add origin path/to_dagshub.dvc, where origin is the remote name used by the push commands below and path/to_dagshub.dvc stands for the DVC URL of your DagsHub repository.
Add the train and val data to DVC using dvc add data/raw/train and dvc add data/raw/val. Two new files, train.dvc and val.dvc, will be created under data/raw.
The original data is added to .gitignore so it does not get pushed to GitHub; only the .dvc files are committed to Git.
Push your files to GitHub and the original data files to DagsHub DVC storage with the following commands:
- git add --all
- git commit -m "First commit with setup and DVC files"
- dvc push -r "origin"
- git push --set-upstream origin <branch-name>
Create a script for preparing the dataset and run it with python src/prepare.py.
Add the prepared files to DVC using dvc add data/prepared/train.csv data/prepared/test.csv, then commit everything else to Git using git add --all and git commit -m "Created train and test CSV files".
Train the model with the training script: python src/train.py.
Add the model to DVC using dvc add model/model.joblib.
Stage and commit to Git using git add --all and git commit -m "Trained random forest classifier".
Run the evaluation script using python src/evaluate.py. A new JSON file is created under metrics. I got an accuracy of 98%.
Stage and commit the JSON file to Git using git add --all and git commit -m "Evaluate the model accuracy".
Push all the changes to GitHub and DVC storage using git push and dvc push -r "origin".
Tag your commit with git tag -a model -m "RandomForest with accuracy 98%", then push your tags with git push origin --tags.
You can create further branches for other experiments and then merge them into your final branch.
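
The evaluation step in the list above produces a JSON file under metrics. As a rough illustration only, here is a minimal sketch of how a script like src/evaluate.py might write that file. The function name write_accuracy and the {"accuracy": ...} layout are assumptions for this example, not taken from the actual script; only the metrics/accuracy.json path comes from this README.

```python
import json
from pathlib import Path


def write_accuracy(accuracy: float, metrics_dir: Path = Path("metrics")) -> Path:
    """Write one accuracy value to <metrics_dir>/accuracy.json (hypothetical helper)."""
    # Create the metrics folder if it does not exist yet.
    metrics_dir.mkdir(parents=True, exist_ok=True)
    metrics_path = metrics_dir / "accuracy.json"
    # A flat JSON object with numeric values is an assumed, simple layout.
    metrics_path.write_text(json.dumps({"accuracy": accuracy}))
    return metrics_path
```

In a real evaluate script the accuracy would be computed from the model's predictions on the test set, for example write_accuracy(computed_accuracy), rather than hard-coded.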

Creating reproducible pipelines

Create a new branch and remove the .dvc files, since the pipeline will track these outputs itself:

dvc remove data/prepared/train.csv.dvc data/prepared/test.csv.dvc model/model.joblib.dvc

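To create a pipeline, use the dvc run command once per stage. (In DVC 3.0 and later, dvc run has been removed in favor of dvc stage add, which defines a stage with the same switches without running it.) A few arguments to know before running it: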
- The -n switch gives the stage a name.
- The -d switch passes the dependencies to the command.
- The -o switch defines the outputs of the command.
- The -M switch defines the metrics of the command.
Now run prepare.py through dvc run: dvc run -n prepare -d src/prepare.py -d data/raw -o data/prepared/train.csv -o data/prepared/test.csv python src/prepare.py
A new dvc.yaml file is created that describes the pipeline.
Similarly, run the training and evaluation stages:
- dvc run -n train -d src/train.py -d data/prepared/train.csv -o model/model.joblib python src/train.py
- dvc run -n evaluate -d src/evaluate.py -d model/model.joblib -M metrics/accuracy.json python src/evaluate.py
Use dvc metrics show to see the metrics.
Now add, commit, and push to GitHub and DVC storage.
Look at dvc.yaml to see the whole pipeline.
Now, if you want to run other experiments, you don't need to call dvc run every time. That's what a reproducible pipeline is all about.
Create a new branch and train a different model, such as logistic regression.
Change the model in src/train.py and use dvc status to see which files in the pipeline have changed.
To run the logistic regression model, use dvc repro evaluate. This reruns the training and evaluation stages of the pipeline.
Now look at the metrics again, this time adding the -T flag to show the metrics for all tagged commits: dvc metrics show -T.
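
To illustrate why dvc repro evaluate reruns only the training and evaluation stages after src/train.py changes, here is a simplified sketch of the idea. It is not how DVC is implemented: DVC compares content hashes recorded in dvc.lock and can skip a downstream stage if a rerun produces identical outputs. The names STAGES and stages_to_rerun are made up for this example; the dependencies and outputs are copied from the dvc run commands above.

```python
# Stage name -> (dependencies, outputs), copied from the dvc run commands above.
STAGES = {
    "prepare": (
        ["src/prepare.py", "data/raw"],
        ["data/prepared/train.csv", "data/prepared/test.csv"],
    ),
    "train": (
        ["src/train.py", "data/prepared/train.csv"],
        ["model/model.joblib"],
    ),
    "evaluate": (
        ["src/evaluate.py", "model/model.joblib"],
        ["metrics/accuracy.json"],
    ),
}


def stages_to_rerun(changed_paths):
    """Return the stages affected by the changed paths, in pipeline order."""
    dirty = set(changed_paths)
    rerun = []
    # The dict is written in pipeline order, so a single pass is enough.
    for name, (deps, outs) in STAGES.items():
        if dirty.intersection(deps):
            rerun.append(name)
            # Treat the outputs of a rerun stage as changed for later stages.
            dirty.update(outs)
    return rerun


print(stages_to_rerun(["src/train.py"]))  # ['train', 'evaluate']
```

The prepare stage is left alone because none of its dependencies changed, which is the saving a pipeline gives you over rerunning every script by hand.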

Conclusion

So, to run multiple experiments, you can just make changes to the necessary files and use dvc repro evaluate to rerun the pipeline.


ChiragChauhan4579/DVC-pipeline
