Parallelize Pipeline Processing With Sub Node Parallelization - How I Write Pipes Part IV

  • Published 5 Nov 2024

COMMENTS • 4

  • @javierbosch1338 4 years ago +2

    If your data is very big, would this approach still work? Would passing the data back to be aggregated in a dict (to be passed to a PartitionedDataSet) still work?

    • @DataEngineerOne 4 years ago +1

      If the data is too large for this approach, you might be better off going with a proper data-parallel processing solution such as Spark. Of course, if you have a beefy enough machine, no data is too large for this approach :)

    • @javierbosch1338 4 years ago +1

      @DataEngineerOne I was using Ray for this task instead of Spark. It was not only a matter of data size but also the execution time of writing all that data. Here I was using each worker to write to disk, but I wanted a way to use the PartitionedDataSet registered in the catalog to save to file. That way I bypass pushing DataFrames across the stack and between processes (see the lazy-saving sketch after this thread).

    • @DataEngineerOne 4 years ago +1

      @javierbosch1338 Aha, I see. Yes, you certainly should be able to pass the data into a PartitionedDataSet that wraps a pandas DataFrame, to serialize the reading and writing. But if the data is as big as you say, you may still run into speed issues with regard to IO (both patterns are sketched below).
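
To make the pattern discussed in this thread concrete, here is a minimal sketch of a Kedro node that fans per-partition work out over a process pool and returns a plain dict, which a PartitionedDataSet output in the catalog then writes one partition at a time. The function names, catalog entries, and paths below are illustrative assumptions, not taken from the video. Note that every processed partition is held in memory until the node returns, which is the size limitation raised in the question above.

```python
# Minimal sketch (assumed names/paths): parallelize per-partition work inside
# a node and return {partition_name: dataframe} so a PartitionedDataSet
# output writes one file per key.
#
# Assumed catalog.yml entries:
#
#   raw_partitions:
#     type: PartitionedDataSet
#     path: data/01_raw/partitions
#     dataset: pandas.CSVDataSet
#
#   processed_partitions:
#     type: PartitionedDataSet
#     path: data/03_primary/processed
#     dataset: pandas.CSVDataSet
#
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, Dict

import pandas as pd


def process_partition(df: pd.DataFrame) -> pd.DataFrame:
    """Per-partition work; stands in for whatever your sub-node does."""
    return df.assign(total=df.sum(axis=1, numeric_only=True))


def process_all_partitions(
    partitions: Dict[str, Callable[[], pd.DataFrame]],
    max_workers: int = 4,
) -> Dict[str, pd.DataFrame]:
    """Kedro passes a PartitionedDataSet input as {name: load_callable}.
    Each partition is loaded in the main process, processed in a worker
    process, and the returned dict is what the output dataset serializes."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(process_partition, load())
            for name, load in partitions.items()
        }
        return {name: future.result() for name, future in futures.items()}


# Wired into a pipeline as, e.g.:
#   node(process_all_partitions, inputs="raw_partitions", outputs="processed_partitions")
```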
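
On the write-time concern raised above: later Kedro versions also support lazy saving for PartitionedDataSet, where the node returns a dict of callables and each partition is only materialised when it is written. A hedged sketch under the same assumed names (check your Kedro version's documentation before relying on this):

```python
# Hypothetical lazy-saving variant: return zero-argument callables so each
# partition is loaded, transformed, and written one at a time at save time,
# keeping peak memory low. Names are illustrative, not from the video.
from functools import partial
from typing import Callable, Dict

import pandas as pd


def _process_one(load: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Load and transform a single partition; runs only when that
    partition is saved by the PartitionedDataSet."""
    df = load()
    return df.assign(total=df.sum(axis=1, numeric_only=True))


def process_lazily(
    partitions: Dict[str, Callable[[], pd.DataFrame]],
) -> Dict[str, Callable[[], pd.DataFrame]]:
    """Defer both load and transform of each partition until write time."""
    return {name: partial(_process_one, load) for name, load in partitions.items()}
```

The trade-off is that this version processes partitions sequentially at save time, so it exchanges the sub-node parallelism of the first sketch for a smaller memory and IO footprint.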