I'm so confused, why are there two paths, one to calculate top K per min and the other to calculate top K per hour?
Why can't the map-reduce job be configurable to calculate at a different cadence so you can retire the upper path?
Hi, I assume you are referring to the 'Foundational System Design' part. The 'min' in the second path stands for 'minimum' and not 'minute'. The path without 'min heap' represents a naive solution. Sorry for any confusion created. You can assume that we calculate the top K per hour; the actual period is not important for describing the design. Thank you for the feedback!
Thanks for the video. Can you explain how the results of the two paths are stored in the data store, one from the map-reduce job and the other from the naive path (hash map or count-min sketch)? How are queries performed, and how do you figure out which path to use?
Thanks for your question. I will try a short version in the comment section; it wasn't covered in the video because DB schemas and queries seemed like too much to add to a video already packed with processes)
Let's name things:
storage will be a PostgreSQL DB
storage (distributed file system) will be AWS S3
1) The first flow, with the "preprocessor" and "merger", allows us to obtain the top K for a specified period, but we expect to choose only one period (hour or day or week or whatever).
2) In case we later decide that we want to be flexible and allow an arbitrary time interval and time period, we propose a flow where AWS S3 storage is added to allow construction of an arbitrary time period (top K every hour, day, week) and time interval.
Rough outline of the Flow #2 ETL process:
1) The Saver will dump a file, say CSV/Parquet (in reality it would be a good idea to use batching, but for simplicity let's dump every request).
2) Every request creates a record in a predefined CSV file. It is a good idea to partition by id, writing events for the same id to the same file. We can also partition by id + time interval, say starting a new file every day (most of the time you need recent info, so this lets you touch only, say, the last week's files when calculating counts every hour for the past week)
with the following fields:
timestamp (in ms or seconds, to allow aggregation to minutes, hours, ...), id
1713876131, 1
1713876134, 2
1713876136, 1
3) The count job takes its one or several files for calculation; because we partitioned by id, we can run the count on every file independently
4) The counting process goes as in Flow #1: aggregating into a hash map and then into a min-heap (see the sketch after this list)
5) After this we run the "merging job" to get the overall top K for the defined time interval and time period and save it to PostgreSQL (storage)
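Since it wasn't in the video, here is a minimal Python sketch of steps 3-5 under the assumptions above (CSV rows of `timestamp, id`, files partitioned by id). The function names, file names and K value are hypothetical, and filtering rows by timestamp for the chosen interval is omitted for brevity.

```python
import csv
import heapq
from collections import Counter

def count_file(path):
    """Steps 3-4: count occurrences of each id in one partitioned CSV file."""
    counts = Counter()
    with open(path, newline="") as f:
        for timestamp, item_id in csv.reader(f):
            counts[item_id.strip()] += 1
    return counts

def top_k(counts, k):
    """Step 4: keep only the k largest counts using a min-heap of size k."""
    heap = []  # (count, id) pairs; the smallest kept count sits at the root
    for item_id, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, item_id))
        elif count > heap[0][0]:
            heapq.heapreplace(heap, (count, item_id))
    return heap

def merging_job(heaps, k):
    """Step 5: because files are partitioned by id, each id appears in at most
    one heap, so summing and re-selecting the top k is exact."""
    merged = Counter()
    for heap in heaps:
        for count, item_id in heap:
            merged[item_id] += count
    return sorted(top_k(merged, k), reverse=True)  # rows to save to PostgreSQL

# Usage: top 2 ids for one time interval across two hypothetical partition files.
# heaps = [top_k(count_file(p), 2) for p in ["part-0.csv", "part-1.csv"]]
# print(merging_job(heaps, 2))
```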
Hope this helps clarify things; feel free to ask if the explanation was confusing.
When you merge multiple heaps, it doesn't seem like the merged top K is the right answer across all heaps.
Hey, can you elaborate on why it will not be the right answer?
If I understood your question correctly, the concern is that the hash maps used as the basis for the min-heap calculation on servers 1 & 2 might overlap? Excuse me for not articulating it more clearly, but because the Message Queue in front, as well as the "processor" service itself, is distributed, we partition by the id of the post/video/whatever. In that case, no hash map on one server shares an id with another server's hash map. Then each min-heap is a correct representation of its partition, and merging the independent min-heaps indeed gives you the correct overall top K.
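To illustrate why the merge stays exact, here is a minimal Python sketch (the ids, the K value and the hash-based partitioning are hypothetical, not taken from the video): events are partitioned by id across two "processor" servers, each keeps its own hash map and size-K min-heap, and merging the local heaps reproduces the global top K because every id's full count lives on exactly one server.

```python
import heapq
from collections import Counter

K = 2            # hypothetical top-K size
NUM_SERVERS = 2  # two "processor" servers behind the partitioned Message Queue

events = ["video-1", "video-2", "video-1", "video-3", "video-2", "video-1"]

# Partitioning by id: every event for a given id lands on the same server, so
# each server's hash map holds the complete count for the ids it owns.
# (A real queue would use a stable hash; Python's hash() is enough for a demo.)
per_server_counts = [Counter() for _ in range(NUM_SERVERS)]
for item_id in events:
    per_server_counts[hash(item_id) % NUM_SERVERS][item_id] += 1

def local_top_k(counts, k):
    """Each server keeps only its local top k in a min-heap of size k."""
    heap = []
    for item_id, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, item_id))
        elif count > heap[0][0]:
            heapq.heapreplace(heap, (count, item_id))
    return heap

# Merge: any global top-K id must also be in its own server's local top K
# (its local count equals its global count), so selecting the top K from the
# union of the local heaps gives the exact overall answer.
candidates = [entry for c in per_server_counts for entry in local_top_k(c, K)]
print(heapq.nlargest(K, candidates))  # [(3, 'video-1'), (2, 'video-2')]
```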
Generally an API gateway can do TLS offloading. Why do we need an LB before the API gateway? I would expect the API gateway first and then the LB. Can someone explain, please?
I got it, but the author needs to explain it so a future audience can understand it.