(17) - Spark : Cache vs Persist, Accumulator and Broadcast Variable

  • Published 27 Aug 2024

COMMENTS • 4

  • @pramod3469
    @pramod3469 1 year ago

    very well explained

  • @worldofvishruth9017
    @worldofvishruth9017 8 months ago

    Hi, very detailed video. Any chance you could share this OneNote notebook, please?

  • @kishorekrishnap
    @kishorekrishnap 1 year ago

    What is the difference between persist and checkpoint?

    • @dataengineeringforeveryone
      @dataengineeringforeveryone  1 year ago +1

      In Apache Spark, persist and checkpoint both save the result of a computation so that it does not have to be recomputed, but they serve different purposes:
      Checkpoint:
      Checkpointing writes the contents of an RDD (Resilient Distributed Dataset) to a reliable distributed file system such as HDFS (Hadoop Distributed File System) or AWS S3 and truncates its lineage. It is useful when the lineage of an RDD becomes long and complex, which leads to excessive memory usage and slow recovery after failures. Once the RDD is checkpointed, recovery reads the saved data instead of replaying the whole lineage.
      Persist:
      Persisting, in contrast, stores an RDD in memory or on disk (or both, depending on the storage level) so that it can be reused across multiple computations. The lineage is kept, so a lost partition is recomputed from it. Persisting is useful when an RDD is accessed frequently and you want to avoid recomputing it each time.
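The difference in how the two treat lineage can be sketched with a small toy model in plain Python. This is not Spark's API: `ToyRDD`, `lineage_depth` and `demo` are hypothetical names invented for illustration, and the toy is eager, whereas in Spark both `persist()` and `checkpoint()` only take effect when an action runs, and `checkpoint()` needs a directory set beforehand with `SparkContext.setCheckpointDir`.

```python
import json
import os
import tempfile


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark): data plus lineage."""

    def __init__(self, parent=None, fn=None, source=None):
        self.parent = parent          # upstream ToyRDD, None for a source
        self.fn = fn                  # transformation applied to parent rows
        self.source = source          # rows held by a source node
        self.cached = None            # set by persist()
        self.checkpoint_file = None   # set by checkpoint()

    def map(self, fn):
        # A transformation only records lineage; nothing is computed yet.
        return ToyRDD(parent=self, fn=lambda rows: [fn(r) for r in rows])

    def lineage_depth(self):
        # Number of upstream steps that would be replayed on recomputation.
        return 0 if self.parent is None else 1 + self.parent.lineage_depth()

    def collect(self):
        if self.cached is not None:
            return self.cached
        if self.checkpoint_file is not None:
            with open(self.checkpoint_file) as f:
                return json.load(f)
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.collect())

    def persist(self):
        # Keep the rows in memory; the lineage stays intact as a fallback.
        self.cached = self.collect()
        return self

    def unpersist(self):
        self.cached = None
        return self

    def checkpoint(self, directory):
        # Write the rows to storage, then cut the link to the parent.
        path = os.path.join(directory, f"rdd-{id(self)}.json")
        with open(path, "w") as f:
            json.dump(self.collect(), f)
        self.checkpoint_file = path
        self.parent = None
        self.fn = None
        return self


def demo():
    base = ToyRDD(source=[1, 2, 3])
    shifted = base.map(lambda x: x * 2).map(lambda x: x + 1)

    shifted.persist()
    depth_after_persist = shifted.lineage_depth()         # lineage kept

    shifted.unpersist()
    with tempfile.TemporaryDirectory() as directory:
        shifted.checkpoint(directory)
        depth_after_checkpoint = shifted.lineage_depth()  # lineage cut
        rows = shifted.collect()                          # read from the file
    return depth_after_persist, depth_after_checkpoint, rows


if __name__ == "__main__":
    print(demo())
```

In the toy, dropping the cache after `persist()` is harmless because the rows can be rebuilt from the two recorded `map` steps, while after `checkpoint()` the saved file is the only way back to the data.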