If your data is very big, would this approach still work? Would passing the data back to be aggregated in the dict (to be passed to PartitionedDataSet) still work?
If the data is too large for this approach, you might be better off going with a proper data-parallel processing solution such as Spark. Of course, if you have a beefy enough machine, no data is too large for this approach :)
@DataEngineerOne I was using Ray for this task instead of Spark. It was not only a matter of data size but also of the execution time of writing all that data. Here I was using each worker to write to disk, but I wanted a way to use the PartitionedDataSet registered in the catalog to save to file. That way I avoid pushing dataframes across the stacks and processes.
@javierbosch1338 Aha, I see. Yes, you should be able to pass the data into a PartitionedDataSet that wraps around a pandas DataFrame, to serialize the reading and writing. But if the data is as big as you say, you may still run into speed issues with regard to IO.
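For anyone reading along, here is a rough sketch of the dict-of-partitions idea with lazy values. As far as I know, Kedro's PartitionedDataSet accepts callables as dict values and only calls them at save time, so only one partition has to be in memory at once. Everything named below (make_partitions, save_lazily, the part_000 keys) is made up for illustration, plain lists of dicts stand in for DataFrames, and save_lazily only imitates what the dataset would do on save. It does not help with the IO speed issue mentioned above.

```python
from typing import Callable, Dict, List

Rows = List[dict]


def make_partitions(n_parts: int, rows_per_part: int) -> Dict[str, Callable[[], Rows]]:
    """Hypothetical node: map partition id -> zero-argument loader.

    Nothing is computed here; each loader builds its rows only when called.
    """

    def build(part: int) -> Callable[[], Rows]:
        def load() -> Rows:
            start = part * rows_per_part
            return [{"id": i, "value": i * i} for i in range(start, start + rows_per_part)]

        return load

    return {f"part_{p:03d}": build(p) for p in range(n_parts)}


def save_lazily(partitions: Dict[str, Callable[[], Rows]],
                write: Callable[[str, Rows], None]) -> None:
    """Stand-in for the save step: materialize and write one partition at a time."""
    for name, loader in partitions.items():
        write(name, loader())


if __name__ == "__main__":
    written: Dict[str, Rows] = {}
    save_lazily(make_partitions(3, 2), written.__setitem__)
    print(sorted(written))
```

In a real pipeline the node would just return the dict from make_partitions and the catalog entry would handle the writing.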