Run Apache Spark jobs on serverless Dataproc
- Published 5 Feb 2025
- Today, I'm excited to share a hands-on example of using a custom container to bundle all Spark job dependencies and execute the job on serverless Dataproc. This powerful feature provides a streamlined approach for running Spark jobs without managing any infrastructure, while still offering advanced features like fine-tuned autoscaling, all without incurring the cost of a constantly running cluster. #ApacheSpark #GoogleCloud #Serverless #Dataproc #BigData
00:17 - Table of Contents
01:19 - What is Dataproc?
01:53 - Dataproc vs serverless Dataproc
03:52 - Custom containers on Dataproc
08:14 - A real-world use case
11:33 - Code walkthrough
20:43 - See it in Action!
25:55 - Summary
Useful links
code: github.com/roc...
slides: docs.google.co...
custom container: cloud.google.c...
serverless vs compute engine: cloud.google.c...
spark submit via REST: cloud.google.c...
service to service communication: cloud.google.c...
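For reference, here is a rough sketch of what submitting a PySpark batch to serverless Dataproc with a custom container can look like using the google-cloud-dataproc Python client; the project, region, bucket, image and batch names below are placeholders rather than the ones used in the video:

```python
# Minimal sketch: submit a PySpark batch to serverless Dataproc with a custom
# container image. Project, region, bucket and image names are placeholders.
from google.cloud import dataproc_v1

project_id = "my-project"      # placeholder
region = "europe-west2"        # placeholder
container_image = f"{region}-docker.pkg.dev/{project_id}/spark/spark-deps:latest"  # placeholder

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/main.py",  # placeholder entry point
    ),
    runtime_config=dataproc_v1.RuntimeConfig(
        container_image=container_image,
        # Autoscaling can be tuned via Spark properties, e.g. executor bounds.
        properties={
            "spark.dynamicAllocation.minExecutors": "2",
            "spark.dynamicAllocation.maxExecutors": "20",
        },
    ),
)

operation = client.create_batch(
    parent=f"projects/{project_id}/locations/{region}",
    batch=batch,
    batch_id="my-spark-job-001",  # placeholder, must be unique within the region
)
print(operation.result())  # blocks until the batch finishes
```

The gcloud dataproc batches submit pyspark command with its --container-image flag is the CLI equivalent of the above.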
---
Need to modernise your data stack? I specialise in Google Cloud solutions, including migrating your analytics workloads into BigQuery, optimising performance, and tailoring solutions to fit your business needs. With deep expertise in the Google Cloud ecosystem, I’ll help you unlock the full potential of your data. Curious about my work? Check out www.fundamenta... to see the impact I’ve made. Let’s chat! Book a call at calendly.com/r... or email richardhe@fundamenta.co. 🚀📊
Awesome presentation! Far better than so much other, mostly self-promo, content
Thanks so much ❤ glad you found it useful. The goal of this channel is to showcase ideas that can actually work well to solve real world problems.
Thank you for the video, your content is easy to follow and quite well explained. I really enjoyed learning the example workflow you presented.
Thank you for the nice words! I am glad you found this useful ❤
Thank you. This is really clear and well articulated. Data engineering content like this is hard to find on YouTube.
Solid video as always. +1 for a video on setting up Cloud Run with IAP.
Thanks Ivan, I will do that one in the next few weeks
Hi Richard, love your content. I've always wanted someone to make GCP training videos that emphasise real-world use cases. I work with BigQuery and Composer, and I want to learn Dataproc and Dataflow, but everywhere I look I find the same type of training with little focus on real-world implementations. I want to learn how Dataproc and Dataflow jobs are deployed to different environments like dev, test and prod. Your videos are helping a lot; I hope you will do more videos on Dataproc and Dataflow, how we use them in real projects and how we create these jobs with CI/CD.
No worries, glad you found this useful ❤
@practicalgcp2780 I have one doubt: in an organisation with many Dataproc jobs, how do we create them in different environments like dev, test and prod? Can you please do a video on that?
Thanks for the video and for sharing it.
Thanks for this; the content is helpful, and the information rate is pitched at the practitioner non-specialist level, which IMHO is exactly where primers should live.
Thanks a lot for this!
I agree with keeping code separate from the Docker container. However, in your case there was just a single .py file; in my case I have a whole repo that is needed when the main file is executed. How do you suggest I handle this?
No worries 😉 Just wondering, have you tried zipping the repo (only your own files and modules, not the dependencies; I'm worried those might conflict with what's installed in the container), putting the zip in GCS, and including it at runtime during spark-submit?
There is a post explaining this too; let me know if it works for your scenario: stackoverflow.com/questions/32785903/submit-a-pyspark-job-to-a-cluster-with-the-py-files-argument
I haven't done this for quite a long time, as I've been trying to avoid Spark as much as I can due to the complexity of setting things up 😂 Let me know if this approach still works.
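For reference, a rough sketch of the zip-and-include approach mentioned above, using the google-cloud-dataproc Python client; the bucket, file and package names are placeholders:

```python
# Rough sketch (assumed names): package first-party modules into a zip, upload
# it to GCS, then attach it to the serverless batch via python_file_uris.
#
# Packaging and upload, e.g. from CI:
#   zip -r my_repo.zip my_package/
#   gsutil cp my_repo.zip gs://my-bucket/artifacts/my_repo.zip
from google.cloud import dataproc_v1

pyspark_batch = dataproc_v1.PySparkBatch(
    main_python_file_uri="gs://my-bucket/jobs/main.py",         # placeholder entry point
    # .py/.egg/.zip files listed here are distributed to the job and made importable
    python_file_uris=["gs://my-bucket/artifacts/my_repo.zip"],  # placeholder
)

# Inside main.py the packaged modules can then be imported as usual, e.g.
#   from my_package.transforms import clean_events
```

If I remember correctly, this maps to spark-submit's --py-files flag; third-party dependencies are still better baked into the container image to avoid clashes with what is preinstalled there.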
I'd be keen to know what you still use Spark for these days. Especially on Google Cloud you have quite a lot of options, like BigQuery DataFrames (bigframes) or BigQuery itself, which in my experience cover the majority of the scenarios Spark used to handle, as SQL is becoming really powerful on analytical databases. I could be wrong though, as everyone's situation is different.
Hi Richard, can we create a serverless Dataproc job in a different GCP project using a service account?
I'm not sure I understood you fully, but a service account can act in any project regardless of which project it was created in. It works by granting the service account IAM permissions in the project where you want the job to be created; then it will work. That may not be the best approach, though, as a single service account can end up with too much permission and scope. You can use separate service accounts, one per project, to reduce scope, or have a master one that impersonates other service accounts in those projects. Either way, it's key to limit what each service account can do; otherwise, a single breach can cause massive damage across everything at once.
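For reference, a rough sketch of submitting a serverless Dataproc batch in a different project under a dedicated service account, along the lines described above; it assumes the service account has already been granted the required Dataproc roles (e.g. Dataproc Worker) in the target project, and all names are placeholders:

```python
# Minimal sketch: run a serverless Dataproc batch in another project as a
# specific service account. Project, region, bucket and account names are
# placeholders; the caller also needs permission to create batches there.
from google.cloud import dataproc_v1

target_project = "analytics-prod"  # project where the batch should run
region = "europe-west2"
service_account = "spark-runner@analytics-prod.iam.gserviceaccount.com"

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/main.py",  # placeholder
    ),
    environment_config=dataproc_v1.EnvironmentConfig(
        execution_config=dataproc_v1.ExecutionConfig(
            service_account=service_account,  # the batch runs as this identity
        )
    ),
)

operation = client.create_batch(
    parent=f"projects/{target_project}/locations/{region}",
    batch=batch,
    batch_id="cross-project-job-001",  # placeholder
)
```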