Versioning in data projects

Reproducibility is a pillar in science, and version control via git has been a blessing to it. For pure Software Engineering it works perfectly. However, machine learning projects are not just only about code, but rather also about the data. The same model trained with two distinct data sets can produce completely different results.

So it comes with no surprise when I stumble with csv files on git repos of data teams, as they struggle to keep track of code and metadata. However this cannot be done for today’s enormous datasets. I have seen several hacks to solve this problem, none of them bullet proof. This post is not about those hacks, rather about an open source solution for it: DVC.

dvc_flow_large

Let us exemplify by using a kaggle challenge: predicting house prices with Advanced Regression Techniques

Continue reading “Versioning in data projects”