Data Engineering Interview - Netflix Clickstream Data Pipeline

  • Published Feb 7, 2025
  • Join the waitlist for Exponent’s Data Engineering Interview Course: bit.ly/4cmpq34
    In this video, an expert in data engineering discusses building a near real-time data ingestion pipeline for Netflix clickstream and playback data. The focus is on key metrics like customer churn and path analysis, with insights into using technologies like Kafka, Spark, and NoSQL databases. The video also covers the importance of scalable, time-sensitive data processing and touches on trade-offs in data storage and system design.
    Want to practice peer-to-peer mock interviews? bit.ly/3Xmj8wq
    Chapters -
    00:00 - Intro
    01:15 - Data pipeline for Netflix metrics monitoring
    05:52 - Path analysis and playback insights
    09:42 - Product insights and pipeline design overview
    12:31 - Netflix’s user distribution and metrics
    16:51 - Netflix’s scalability: Up to 50,000 users per second
    25:13 - Spark streaming, Kafka, Flink, data lake
    28:08 - Kafka vs Data lakes for analytics
    30:16 - NoSQL database performance
    37:41 - Big data pipeline design principles
    Watch data science mock interviews from Exponent:
    Probability, P-value and Confidence Intervals: • Probability, P-Value a...
    Retry Transaction ft. Paypal Data Scientist: • Stripe Data Science Mo...
    Amazon Data Science Interview: Linear Regression: • Amazon Data Science In...
    Snap Data Science Mock Interview: Improve Camera Speed: • Snap Data Science Mock...
    👉 Subscribe to our channel: bit.ly/exponentyt
    🕊️ Follow us on Twitter: bit.ly/exptweet
    💙 Like us on Facebook for special discounts: bit.ly/exponentfb
    📷 Check us out on Instagram: bit.ly/exponentig
    📹 Watch us on TikTok: bit.ly/exponen...
    ABOUT US:
    Did you enjoy this interview question and answer? Want to land your dream career? Exponent is an online community, course, and coaching platform to help you ace your upcoming interview. Exponent has helped people land their dream careers at companies like Google, Microsoft, Amazon, and high-growth startups. Exponent is currently licensed by Stanford, Yale, UW, and others.
    Our courses include interview lessons, questions, and complete answers with video walkthroughs. Access hours of real interview videos, where we analyze what went right or wrong, plus a community of 1000+ expert coaches and industry professionals, to help you land your dream job and more!

COMMENTS • 13

  • @tryexponent
    @tryexponent  5 months ago

    Join the waitlist for Exponent’s Data Engineering Interview Course: bit.ly/4cmpq34

  • @harisridhar1668
    @harisridhar1668 5 months ago +11

    I strongly appreciated the trade-offs and architecture insights discussed (rough sketches follow this list):
    1. A hybrid approach combining Spark Streaming and Apache Flink as distributed computing platforms, chosen based on latency criteria for the clickstream metrics. The justification is that Spark Streaming works well when metrics are generated at >= 1-second intervals, whereas Apache Flink delivers single-millisecond / sub-second performance.
    2. Using the push model (agents and daemons) versus the pull model (the infrastructure polling) for large-scale data pipelines: the former is better for real-time needs (even if it may overwhelm the pipeline), whereas the latter is polling-based and may fail to deliver real-time (or close-enough-to-real-time) customer behavior insights.
    3. The justification for using a NoSQL DB versus a SQL DB for compute storage: NoSQL is schema-less, highly performant, and offers low-latency reads and writes, so it can handle large event volumes during ingestion (e.g. Kafka's 50,000 events/second), and he identified RDBMS storage as a potential pipeline bottleneck.
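    A minimal sketch of points 1 and 3, assuming PySpark Structured Streaming reading a Kafka topic named "clickstream" and writing to a Cassandra table as the NoSQL sink. The broker address, event schema, keyspace, and table names are illustrative placeholders, not details from the video.

    ```python
    # Hypothetical sketch: Kafka -> Spark Structured Streaming (micro-batch) -> Cassandra.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

    # Assumed shape of a playback/click event.
    event_schema = StructType([
        StructField("user_id", StringType()),
        StructField("title_id", StringType()),
        StructField("event_type", StringType()),   # e.g. play, pause, click
        StructField("event_ts", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
           .option("subscribe", "clickstream")
           .load())

    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), event_schema).alias("e"))
              .select("e.*"))

    def write_to_cassandra(batch_df, batch_id):
        # Requires the spark-cassandra-connector package on the classpath.
        (batch_df.write
         .format("org.apache.spark.sql.cassandra")
         .option("keyspace", "analytics")           # assumed keyspace
         .option("table", "clickstream_events")     # assumed table
         .mode("append")
         .save())

    query = (events.writeStream
             .foreachBatch(write_to_cassandra)
             .trigger(processingTime="1 second")    # micro-batch cadence from point 1
             .option("checkpointLocation", "/tmp/clickstream-chk")
             .start())

    query.awaitTermination()
    ```

    The 1-second processingTime trigger reflects the micro-batch latency floor from point 1; a hard sub-second requirement would point to Flink instead.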
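    And a rough illustration of the push model from point 2, assuming the kafka-python client: a lightweight agent publishes each event as it happens instead of waiting for the backend to poll. Topic name and broker are again placeholders.

    ```python
    # Hypothetical push-model agent: fire each clickstream event at Kafka immediately.
    import json
    import time
    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="broker:9092",                          # placeholder broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        linger_ms=50,   # small batching window so the agents don't overwhelm the pipeline
        acks=1,         # trade some durability for lower publish latency
    )

    def emit_event(user_id: str, title_id: str, event_type: str) -> None:
        """Push one clickstream event immediately (fire-and-forget)."""
        producer.send("clickstream", {
            "user_id": user_id,
            "title_id": title_id,
            "event_type": event_type,
            "event_ts": time.time(),
        })

    emit_event("user-123", "title-456", "play")
    producer.flush()
    ```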

  • @barikung
    @barikung 11 days ago

    Thank you for this video. I enjoyed watching it and love how you relate the architecture design to AWS services, along with the business-level assumptions that lead to the design of the architecture!

  • @prashantsalgaocar
    @prashantsalgaocar 5 months ago +11

    I thought this was too high-level. No non-functional requirements were discussed. Also, a lot of the complexity was abstracted away with Lambda usage. There should have been more discussion of the core functional and non-functional requirements, and some deeper dives, which this system design lacked.

  • @angelotheman
    @angelotheman 5 months ago +5

    We need more of these. However, try to make it suitable for beginners or, better still, state the experience level in the title so we know who it's directed at.
    Thanks

    • @tryexponent
      @tryexponent  5 months ago +3

      Great idea! We're actually working on this. Maybe adding "This is how a junior candidate answers. This is how a senior candidate answers."
      Hopefully rolling out soon. Stay tuned

  • @briandevvn
    @briandevvn 4 months ago

    As for changes/additions, I think we could add monitoring services to watch system health and notify us when any important schema changes happen (see the rough sketch below).
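    For the schema-change part of that suggestion, a tiny hypothetical consumer-side check might look like this; EXPECTED_FIELDS and send_alert are made-up placeholders, and a real pipeline would likely use a schema registry plus a proper alerting service rather than a print statement.

    ```python
    # Hypothetical schema-drift check: flag events whose fields deviate from what we expect.
    EXPECTED_FIELDS = {"user_id", "title_id", "event_type", "event_ts"}

    def send_alert(message: str) -> None:
        # Placeholder: wire this to PagerDuty/SNS/Slack in a real pipeline.
        print(f"[SCHEMA ALERT] {message}")

    def check_event_schema(event: dict) -> bool:
        fields = set(event.keys())
        missing = EXPECTED_FIELDS - fields
        unexpected = fields - EXPECTED_FIELDS
        if missing or unexpected:
            send_alert(f"missing={sorted(missing)} unexpected={sorted(unexpected)}")
            return False
        return True

    # Alerts because event_ts is missing.
    check_event_schema({"user_id": "u1", "title_id": "t1", "event_type": "play"})
    ```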

  • @Ikyua
    @Ikyua 5 months ago

    I love this channel so much keep up the content :)!

  • @akshayshankar3707
    @akshayshankar3707 5 months ago

    When are you launching the data engineering course?

    • @tryexponent
      @tryexponent  4 months ago

      Hey akshayshankar3707, we are planning to launch it in 1-2 months' time. Join our waitlist so you get notified when it happens!
      www.tryexponent.com/courses/data-engineering

    • @tryexponent
      @tryexponent  4 months ago +1

      Likely in October! Finishing up some final lessons right now.

  • @arjunekrishna7044
    @arjunekrishna7044 2 months ago

    He looks like the director Lokesh Kanagaraj lol