Using DVC to Store Data Science Artifacts in Azure

  • Published Jan 9, 2025

COMMENTS • 2

  • @KieCodes 11 months ago

    Hey. Thank you for the video. One question: What is the benefit of using DVC compared to LFS? Is it that it works with Azure Datastores?

    • @KevinFeasel 11 months ago

      There are a couple of benefits to using DVC versus simply enabling LFS with Git.
      One historical issue: git-lfs was limited to 4 GB per file on Windows. My understanding is that this is no longer a problem with a recent version of Git, though if you're stuck with an older version (prior to 2.34) in your workplace, you may still run into it.
      Current benefits:
      * As you mentioned, DVC does support Azure Blob Storage (and Amazon S3, network shares, etc.) for hosting those large files. I'd consider that flexibility to be the biggest advantage.
      * DVC has support for pipeline operations (dvc.org/doc/start/data-management/data-pipelines), something I did not cover in the video. If you're running a data science project, this may act as a lightweight option for executing code when your training files change.
      * DVC also supports metrics management (dvc.org/doc/command-reference/metrics) for evaluating model results. This might be useful if you're not using MLflow or another purpose-built technology for model tracking.
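      Both the pipeline and metrics features live in a `dvc.yaml` file at the root of the repository. A minimal sketch follows; the script name, data path, and output files are hypothetical placeholders, not something covered in the video:

      ```yaml
      # dvc.yaml -- sketch; train.py, data/train.csv, and the output
      # names are illustrative placeholders.
      stages:
        train:
          cmd: python train.py
          deps:
            - train.py
            - data/train.csv     # stage is re-run when the training data changes
          outs:
            - models/model.pkl
          metrics:
            - metrics.json:      # e.g. {"accuracy": 0.93}, written by train.py
                cache: false
      ```

      With this in place, `dvc repro` re-executes the stage whenever a dependency changes, and `dvc metrics show` / `dvc metrics diff` report the tracked metrics across commits.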
      My bias is typically toward separate tools for metrics management and pipeline operations, so for me the major compelling reason is file hosting on a variety of platforms.
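      For reference, pointing DVC at Azure Blob Storage amounts to a small remote configuration in `.dvc/config`; the container, path, and storage account names below are placeholders:

      ```ini
      # .dvc/config -- sketch; container/account names are placeholders
      [core]
          remote = azblob
      ['remote "azblob"']
          url = azure://mycontainer/dvcstore
          account_name = mystorageaccount
      ```

      The same result comes from `dvc remote add -d azblob azure://mycontainer/dvcstore` followed by `dvc remote modify azblob account_name mystorageaccount`; after that, `dvc push` uploads the tracked files to the blob container.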