Hey. Thank you for the video. One question: What is the benefit of using DVC compared to LFS? Is it that it works with Azure Datastores?
There are a couple of benefits you might get from using DVC versus simply enabling LFS with Git.
One historical benefit was that git-lfs was limited to 4 GB per file on Windows. My understanding is that this is no longer a problem with a recent version of Git, though if you're stuck with an old version at your workplace (prior to 2.34), you may still run into it.
Current benefits:
* As you mentioned, DVC does support Azure Blob Storage (and Amazon S3, network shares, etc.) for hosting those large files. I'd consider that flexibility to be the biggest advantage.
* DVC has support for pipeline operations (dvc.org/doc/start/data-management/data-pipelines), something I did not cover in the video. If you're running a data science project, this may act as a lightweight option for executing code when your training files change.
* DVC also supports metrics management (dvc.org/doc/command-reference/metrics) for evaluating model results. This might be useful if you're not using MLflow or another purpose-built technology for model tracking.
My bias is typically to handle metrics management and pipeline operations with separate, dedicated tools, so for me the major compelling reason to choose DVC is file hosting on a variety of platforms.
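To make the metrics point a bit more concrete, here is a minimal sketch; it is not something from the video. It assumes your evaluation step is a Python script and that you have listed `metrics.json` as a metrics file in a `dvc.yaml` stage. The function name `write_metrics` and the file name are hypothetical choices for illustration. DVC's metrics commands read plain JSON (or YAML) files like the one this writes.

```python
import json
from pathlib import Path


def write_metrics(metrics: dict, path: str = "metrics.json") -> Path:
    """Write evaluation metrics to a JSON file that DVC could track.

    `write_metrics` and the default file name are hypothetical; the only
    assumption about DVC is that it reads metrics from plain JSON files.
    """
    out = Path(path)
    # Sorted keys keep the file stable between runs, so diffs stay readable.
    out.write_text(json.dumps(metrics, indent=2, sort_keys=True))
    return out
```

You would call something like `write_metrics({"accuracy": acc, "f1": f1})` at the end of your evaluation script, where `acc` and `f1` are whatever values you computed. With the file registered as a metric, `dvc metrics show` should display the values and `dvc metrics diff` should compare them between commits.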