Uber Data Engineer Interview: Design a Ride Sharing Schema

Поділитися
Вставка
  • Опубліковано 1 лип 2024
  • Today I'm interviewing Jitesh again on a data engineering and data modeling question around designing a schema for a ride sharing company like Uber or Lyft.
    Here's the question we're tackling: www.interviewquery.com/questi...
    Want to be featured in the next mock interview video? Apply here: airtable.com/shrdQrwKK7xxGLm6l
    👉 Subscribe to my data science channel: bit.ly/2xYkyUM
    🔥 Get 10% off data engineer interview prep: www.interviewquery.com/pricin...
    ❓ Check out our new SQL interview course: www.interviewquery.com/course...
    🔑 Get professional coaching from Jitesh here: www.interviewquery.com/coachi...
    🐦 Follow us on Twitter: / interview_query
    Are you a hiring manager looking to build data infrastructure with top-tier data engineers? Let Outsearch.ai take care of filtering out candidates while you focus on other things that matter: www.outsearch.ai/?...
    Video Breakdown:
    00:30 - Clarifying questions about the backend and functionality
    2:28 - Initial design and solution for latency
    11:18 - How would you design for analytics
    19:15 - What does the schema look like all together?
    22:34 - What's the benefit of a partition?
    More from Jay:
    Read my personal blog: datastream.substack.com/
    Follow me on LinkedIn: / jay-feng-ab66b049
    Find me on Twitter: / datasciencejay
  • Наука та технологія

КОМЕНТАРІ • 50

  • @iqjayfeng
    @iqjayfeng  11 днів тому

    If you enjoyed discussing this question with us, explore more on our website!
    Yelp-Like System: www.interviewquery.com/questions/yelp-like-system?
    Clickstream Data: www.interviewquery.com/questions/clickstream-data?
    Click Data Schema: www.interviewquery.com/questions/click-data-schema?

  • @gr82moro
    @gr82moro Рік тому +47

    The data model design should not consider actual implementation details, and too early to consider optimizations, which can be added and fine-tuned later on. First thing to do is to understand the functional requirements and draft how many "entities" make sense to the business and then work out the ER diagram to satisfy the requirements and future needs.

    • @pankm2989
      @pankm2989 9 місяців тому +3

      exactly my thoughts at the start but overall good approach from the guy

  • @tejasagrawal-kl9mp
    @tejasagrawal-kl9mp Місяць тому +4

    I feel the approach used here is very unorganised. Jitesh directly jumped on the optimisations, and the modelling part. Also Jay asked about the schema of Dimension & Fact tables, which were not provided explicitly. NoSql database was chosen but the question specifically asks about the join conditions and how would we maintain relationship between different tables.
    In my view the approach should be like :
    1. Clarifying Questions : Use-cases / Volume of data / Analytics or Transactional / Is it particular location based ? / What metrics we are looking at / Final goal of system ? etc .
    2. Functional and Non-functional requirements : Discussion on Scalability, Latency, Availability, Consistency etc. CAP theorem can be used. Trade-offs . High read throughput or write ? etc.
    3. Key metrics involved (if any) : Daily no. of rides, avg daily active users, avg rides over a month,90days etc.
    4. High-level design : Storage , ETL process (if any) , final data warehouse, data lake , relational database or nosql or data warehouse etc.
    5 . low-level design : Actual table and schema modelling . Dimension and fact tables with their attributes. Partition or index keys. Any aggregate tables to calculate metrics etc.
    6. Security and compliance issue ( if any). Optimisations (if any)
    7 .Summarise whole approach and communicate effectively.
    I am not an expert, but this is what I learned after giving multiple Data system design interviews in companies (including FAANG). Thanks

  • @vaibhavmaurya8505
    @vaibhavmaurya8505 Рік тому +42

    After listening first 3 minutes, I do not agree with the approach. Why you have not started with Entity Relationship model? You have immediately jumped on the technical requirements, I think one should first complete the functional requirement with data modeling.

    • @PhillyHank
      @PhillyHank 8 місяців тому +3

      🤔 it’s probably always good to talk through ERD and lock that with the interviewer. However, in this specific example, the candidate (TC) quickly chose a NOSQL db as they were optimizing for latency. Generally speaking, NO SQL means there really isn’t a entity relationship. I do think they should have started with
      1. Use cases
      2a. Filter the interview to 1-2 use cases (e.g., finding a driver/requesting a trip, providing feedback on a trip, payments, etc…)
      2b. Talk through limitations (e.g., one language so no L10N, I18N, no accessibility, worldwide or national scale, no saving favorite drivers/trips locally, therefore no need to have a db on the local surface, like a mobile phone)
      3. Align on what we are optimizing for… that’s a bit different than use cases, but TC called these “use cases” - e.g., optimizing for speed/performance; optimizing for analytics; optimizing for revenue generation
      4. Align on compliance, GDPR, CCPA, etc ## simply mention to interviewer, e.g., “As TC I will assume I must comply with some of the key data governance regulations like GDPR, so the data will have to be stored (in some fashion) where the customers and/or drivers reside.”
      5. Design and get feedback
      6. Optimize later and then get feedback

  • @arunasingh8617
    @arunasingh8617 2 роки тому +8

    This is one of the most useful video for Data Engineer. Great Job (y)

  • @sagarpatil1646
    @sagarpatil1646 2 роки тому +1

    This is one of the best Channel for Analytics part, God Bless you Jay 🙌🏻, Keep up the Good work 🎉🎈

  • @andreasahrlund-richter2289
    @andreasahrlund-richter2289 3 місяці тому

    This was a very cool interview- I appreciate the focus on tooling and not too much on possible attributes.

  • @kavehrezaeimoghadam360
    @kavehrezaeimoghadam360 Рік тому +26

    Dynamo DB has eventual consistency for writes, terrible choice for actual production app at scale. Transactional databases can scale very well. If Facebook can scale on Mysql DB, for sure any other app can as well. Trips is too crucial concept to sacrifice ACID for low cost latency gains with Nosql.
    Also in dimensional modeling it's not practical to avoid joins and put everything in one table. Scalability is also important in analytics, one need to be able to create new fact table on demand in different grains. Having conformed dimensions and being able to perform proper joins to low grain facts are very important.

    • @tahiliani22
      @tahiliani22 2 місяці тому

      Do you know where can we get access to such content if we are preparing for Database Design interviews?

    • @RU-qv3jl
      @RU-qv3jl 19 годин тому

      @@tahiliani22 From having sat on both sides of the interview for over a decade, it doesn’t matter nearly as much as you think. If I wanted to fail this guy I would. I am not interested in a perfect response and never have been. There is no right answer. Facebook didn’t not settle on their current solution at the start. They started off with MySQL and built on that until they customised it a lot and automated a load of things around it. In an interview you have to show that you can think, have a good approach and are a team player who is willing to learn and has passion. Whilst the approach this guy took was more than OK, it wouldn’t guarantee you a job. You can always pass or fail depending on the interviewer alone.
      I hired two different people who I had to train up for over 18 months each. Really enjoyed working with them. They would not have been able to come up with a particularly good database design for this when I hired them. By the time I left the company there were really good, confident and independent.
      Yes I’ve worked on some really big sharded deployments and also diddly little DB systems. They can all be fun.

  • @gravenguan
    @gravenguan 2 роки тому +1

    Have some real talk from experienced data engineer, that's very good content!

  • @brothermalcolm
    @brothermalcolm 2 роки тому +12

    Interesting how it seemed like Jay initially was asking for the schema of the app backend transactional database yet because the interviewer clarified early on the whole question pivoted towards analytical data warehouse design

  • @timz2917
    @timz2917 6 місяців тому

    The data warehousing model was actually very impressive

  • @Nonchalant2023
    @Nonchalant2023 Рік тому +3

    I don't understand why this person chose a NOSQL solution based on performance if the objective is to create a OLTP database? The main reason to choose nosql (in my ignorant opinion) is to avoid the rigidness of structured databases, not performance---right? Additionally, how do you jump from a nosql model to a relational db for analytics?

  • @muribbimehmet9009
    @muribbimehmet9009 5 місяців тому

    It feels like Jay needs to prepare more for these interviews

  • @ShaluChaudhary
    @ShaluChaudhary Рік тому

    Awesome video

  • @haidar6280
    @haidar6280 2 роки тому +3

    This channel is fantastic. Thanks kindly for your content!

  • @siddharthtripathy3964
    @siddharthtripathy3964 Рік тому

    who's Jitesh Soni. Totally love his clarity of thought.

  • @avinashb4485
    @avinashb4485 Рік тому

    Very useful and informative video. Thanks a lot!!

  • @chunuk4442
    @chunuk4442 2 роки тому +4

    Please upload business intelligence interview questions and answers....

  • @sergior.m.5694
    @sergior.m.5694 Рік тому

    This is great content, good for my brain to think through some of this and think to myself why I agree and disagree with him

  • @priyadarshanmohanty277
    @priyadarshanmohanty277 Рік тому +2

    any reason why Cloud DWHs werent considered such as Snowflake/Big Query ?

  • @kanchidoshi6907
    @kanchidoshi6907 2 роки тому +2

    Thanks for this video. During the actual interview, what if the interviewer is only interested in designing in relational DB style, can you please also share how to proceed in that case?

    • @iqjayfeng
      @iqjayfeng  2 роки тому

      Jitesh highlights this in the second half of the video!

  • @PhillyHank
    @PhillyHank 8 місяців тому

    It’s interesting this did not immediately cover DQ checks, threshold , error handling, etc which is one of the key responsibilities for a data engineer.
    ** Of course, in this example the candidate was building a brand new system and therefore there was no data loads, ingestion, ** STILL, I cannot imagine an systems design interview not covering this topic. At most tech companies, there are two loops -
    1. Design a system from a database perspective - this type of video ## this would include “coding” i.e., running SQL statements
    and
    2. What are the design considerations for designing an entire DB platform? E.g., how to know when to do a data warehouse vs lake (batch vs streaming), ETL/ELT methods, DQ, put together together shrads (sp?), data recall, etc.
    Mixing both of these outcomes into one video / interview is hard to do and watching a video the user would likely get lost in the conversation.

  • @hydtechietalks3607
    @hydtechietalks3607 Рік тому

    In this interview, Latency is solved by NOSQL.. thats a generic answer, is nt it? and listed attributes of general expected.. regional seggregraton.. thats it?

  • @Expateer
    @Expateer Рік тому +6

    Rule #1: Never, ever dual-use an operational table for analytics. Also - why did he choose NoSQL this use case?

    • @alahad4723
      @alahad4723 Рік тому

      Good point. What would be the best approach for analytics? Create a separate table on the backend?

    • @Expateer
      @Expateer Рік тому +3

      @@alahad4723 Exactly. Insert new records in the analytics tables as needed in batches or even mico-batches. And remember that if the information in an analytics table is being used to facilitate transaction processes, then it's no longer an analytics table

    • @jiteshsoni3030
      @jiteshsoni3030 Рік тому +2

      I agree and I did not use the operational table twice. Idea is to solve latency use case using No-SQL and the last denormalized table for analytics.
      Next time, we will use a Drawing/charting tool so it's more clear.
      -- Jitesh

    • @Expateer
      @Expateer Рік тому +13

      @@jiteshsoni3030 Don't mean to pick on you, Jitesh, but this video was wrong on so many levels that I found myself cringing. You're sharding on the initials of a city without any consideration of data distribution or even of dealing with cities with the same name in multiple states (almost every state has a "Springfield"). What if 80% of your trips happen in LA, NYC, and SF? Isn't your sharding going to be worthless? How do you tell the difference between Springfield, OH and Springfield, MO? What if you want to aggregate by State or by country? With no joins, your single table is going to get awfully wide.
      There were many such issues here, so I won't go through all of them - but I'll say that you're not wrong in answering like this. Despite the poorly designed solution, the interviewer is going to nod in agreement -- and you'll get the job. People like me will get washed out because the interviewer has nothing more than a rudimentary understanding of how indexing works and likely got into the organization to begin with by cramming videos on NoSQL and sharding as opposed to having actual experience in data engineering.
      This is the first video like this I've seen on DE, and frankly, I'm appalled.

  • @iqjayfeng
    @iqjayfeng  2 роки тому +4

    If you’re looking for a tailored interview prep for DE, check out our newly launched Data Engineering learning path. Our learning path will help you practice core skills that are commonly evaluated in data engineering interviews. Try it here: www.interviewquery.com/learning-paths/data-engineering

  • @WelcomeToDataverse
    @WelcomeToDataverse Рік тому

    Where can I learn such skills?

  • @Expateer
    @Expateer Рік тому +4

    That sharding is considered before indexing is mind-boggling to me. An unsharded, but well-indexed table would run circles around this solution in terms of speed and flexibility. And no, in a dimensional model, the model does not depend on any one ML algorithm

    • @thndesmondsaid
      @thndesmondsaid 11 місяців тому

      yeah I agree he should have mentioned indexing. That being said, he was talking about partitioning, not sharding. Sharding is when you split the data in a table across several nodes (shards). You can do partitioning of a table within a single relational db on a single server. Sharding becomes relevant when the data is too large to fit on a single server.

    • @Expateer
      @Expateer 11 місяців тому

      ​@@thndesmondsaid I simply don't see how a data engineering interview can be considered "good" if indexing wasn't even mentioned. Partitioning is great and usually comes after and as a complement to indexing, not a replacement for it.

    • @zakaria5304
      @zakaria5304 4 місяці тому

      @@Expateer Most data warehouses don't support indexes

    • @Expateer
      @Expateer 4 місяці тому

      @@zakaria5304 Wow .. now you've left me speechless

  • @ak6877
    @ak6877 6 місяців тому

    How are we talking about optimization at the very beginning?

    • @iqjayfeng
      @iqjayfeng  6 місяців тому

      That’s how we do it

  • @pranavreddykallem2025
    @pranavreddykallem2025 Рік тому

    Great content but too many ads

  • @bayusuharso
    @bayusuharso Рік тому

    I'm confused, he uses a NoSQL DB but query the data in SQL like RDBMS?

    • @fbreetz
      @fbreetz Рік тому +2

      The Transactional Table is NoSQL, but The Analytical is SQL. There would be an ETL process to move from the Transactional Table to the Analytical Table.

    • @jenilchristo8775
      @jenilchristo8775 7 місяців тому

      There are sql apis available as well for no sql dbs like say azure cosmos db

  • @brothermalcolm
    @brothermalcolm 2 роки тому +1

    m-m
    riders-rides-drivers
    each rider takes zero to many rides from drivers
    each driver can provide zero to many rides for riders