What a cool data structure. I love the bottom-up update, very clean.
At around 18:00 you state that the client can itself compute the hash of the string in order to avoid the linear-time (linear in the length of the string) look-up in the server's local cache. I see two problems with this:
1. It doesn't really make a difference where the hash is computed as far as E2E latency is concerned. After all, hashing is a prerequisite to a hash table look-up, so it needs to happen sequentially before the look-up.
2. With hash tables, hashing is just the first step. After a hash match you have to do an exact key match to rule out the possibility that it was a hash collision / false match.
So no matter what, it will not be constant time look-up and will depend on the length of the string.
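To make point 2 concrete, here is a minimal sketch of what a chained hash-table lookup does under the hood (the bucket layout and names are hypothetical, not from the video); both the hashing step and the collision check scan the full string:

```python
# Sketch: the two O(len(key)) steps hidden inside a hash-table lookup.
def lookup(buckets, key):
    # Step 1: hashing must read every character of the key.
    h = 0
    for ch in key:                          # O(len(key))
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    # Step 2: a matching bucket entry must be confirmed with a full
    # key comparison, since distinct keys can collide on the hash.
    for stored_key, value in buckets[h % len(buckets)]:
        if stored_key == key:               # up to O(len(key)) again
            return value
    return None
```

Wherever the hash is computed, both scans sit on the critical path before a cached suggestion can be returned.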
Another thing, your approach of storing just references to leaf nodes to reconstruct actual full words, instead of storing full words at each trie node, is very interesting. But I'm curious, do you know if it's done this way in a real-world application, or is it something you came up with? I think it could be costlier than we might expect to walk up a trie like that, because each node is a different hash table, and it would amount to a ton of CPU cache misses. I'm inclined to think that simply storing full words at each trie node would be more congenial to the latency requirements (and this is how it's described in Alex Xu's book). With that approach (where we discard your "store trie leaf references" idea), the only way the trie helps us is by keeping the data structure more compact compared to a full-blown hash table that maps each possible prefix to the top K words.
Sorry if these are aspects you discussed in the video. Since most of the video is talk without much illustration (not complaining or anything, this is free stuff after all :)), I end up having to replay bits and pieces of it, so it's plausible I missed the bigger picture.
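For reference, a rough sketch of the two node layouts being compared here (hypothetical field names, not taken from the video or either book):

```python
class TrieNode:
    def __init__(self, ch, parent=None):
        self.ch = ch
        self.parent = parent     # only needed for the leaf-reference variant
        self.children = {}       # one hash table per node for descending
        # Variant A (leaf references): store pointers to leaf TrieNodes and
        # reconstruct each word on demand by walking up to the root.
        self.top_k_leaves = []
        # Variant B (full words inline): trade memory for lookup speed by
        # materializing the suggestions at every prefix node.
        self.top_k_words = []
```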
1) Fair point! Looking back on it, I think I was just referring to reducing load on the cache, but you're definitely correct here. Sometimes I don't even fully understand what I was saying.
2) Yeah, but the number of colliding keys you have to compare against is usually bounded by a constant factor determined by the load factor of the map.
3) This is the solution from Grokking, actually. I haven't read Alex Xu's book, but I'm glad you told me so that I can check it out! I don't think you would even be using the CPU cache here, because while you need to store hash tables to go down the trie, going up the trie just requires following a single pointer to a parent node in memory. Agreed with your point that storing all the top suggestions directly is ideal, but I believe one of the challenges of the problem is that there are too many terms to store them all (at least in memory).
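As a sketch of that upward walk (assuming each node keeps a parent pointer, as in the layout above):

```python
def word_from_leaf(leaf):
    # No hash lookups on the way up: one parent-pointer dereference per
    # character. The open question is whether those nodes sit in cache.
    chars = []
    node = leaf
    while node.parent is not None:   # the root carries no character
        chars.append(node.ch)
        node = node.parent
    return ''.join(reversed(chars))
```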
@@jordanhasnolife5163 The CPU cache comes into the picture when you dereference those pointers to get the characters, right? It's in the same sense that a linked list is known to perform worse as a FIFO queue than a circular array: the data are scattered all around memory rather than sitting in contiguous blocks, so the CPU cache is often missing what you want, and the look-up goes out to main memory.
Kinda confused about the cache part and HDFS: how often do you expect to update the cache? And is adding a query the only way the data in HDFS changes?
The cache is updated like once per day. I don't know what you mean by "change the HDFS", but I'm assuming you mean add data to it, in which case yes, a new query to our service will go through our stream processing solution and eventually be sunk to HDFS.
@@jordanhasnolife5163 Cool, thanks. That is exactly the point. I thought some new queries could come in from users' input.
Great video, love the way you articulate your thoughts and your quick capacity estimates! Any reason why we are using REST? REST cannot stream data. We could use a WebSocket connection instead for bi-directional streaming, minimizing latency by avoiding the overhead of establishing a new TCP connection for each character.
I think that's reasonable! Though I guess it also means that you're maintaining a persistent connection to the server for as long as you have a text field clicked. Totally see your point though!
You don't establish a new TCP connection for each new character because of the keep-alive header, which has been the default since HTTP/1.1 and keeps the TCP connection open. WebSockets are redundant here.
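As a rough illustration of that reuse (the endpoint is hypothetical; assumes the requests library, which pools connections per session):

```python
import requests

# With HTTP/1.1 keep-alive, one Session reuses a single TCP connection,
# so each new character does not pay the handshake cost again.
session = requests.Session()
for prefix in ["d", "de", "des", "desi", "design"]:
    resp = session.get("https://example.com/suggest", params={"q": prefix})
    print(prefix, "->", resp.json())
```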
++ One thing to clarify in the requirements: touch upon Bloom filters to filter out one-hit wonders.
Sounds good, will do next time around!
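For anyone curious, a minimal sketch of the one-hit-wonder idea (a from-scratch Bloom filter; sizes and the hash scheme are illustrative): a query only gets counted once the filter has already seen it, so terms searched exactly once never take up space in the real counts.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

counts = {}

def record_query(bloom, query):
    # First sighting goes only into the filter; repeats get counted.
    if query in bloom:
        counts[query] = counts.get(query, 0) + 1
    else:
        bloom.add(query)
```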
Why do we need to compute the hash of the query string when we are using range-based (dynamic) partitioning for the trie? Could you explain a little bit more? Is it only required for the caching layer?
Yep, exactly, it's just for caching; the trie partitioning is probably best kept range-based. We can cache based on a hashed string to avoid hot caches if need be.
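A small sketch of keeping the two schemes side by side (shard ranges and node counts are made up):

```python
import hashlib

TRIE_SHARDS = [("a", "f"), ("g", "m"), ("n", "s"), ("t", "z")]
NUM_CACHE_NODES = 8

def trie_shard(prefix):
    # Range-based partitioning: prefixes that share a leading character
    # land on the same shard, so one trie can serve the whole lookup.
    first = prefix[0].lower()
    for i, (lo, hi) in enumerate(TRIE_SHARDS):
        if lo <= first <= hi:
            return i
    return 0  # fallback for non-alphabetic prefixes

def cache_node(prefix):
    # Hash-based placement for the cache spreads hot prefixes evenly.
    digest = hashlib.md5(prefix.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_CACHE_NODES
```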
Good stuff, man. Wanted to get your thoughts on an idea I had. Since the amount of data is so small, do you think it would be advantageous to store the tries (or the prefix -> suggestions map) directly on the fetch servers to avoid a network call to Redis? I think we could apply the same ideas with respect to replication/sharding/load balancing to the tries stored on the fetch servers.
Seems pretty reasonable to me!
Good job!
Can you release it in practice ?)
Sorry can you elaborate on that?
@@jordanhasnolife5163 lol 😂😂
How do you store a trie on Redis? Afaik that data structure is not supported
Tries are implemented with nested hashmaps, which I believe you can have on Redis
@@jordanhasnolife5163 I know this video is old, but in Redis you also can't store nested hashes. You can if you store the trie as a serialized string, but then your application will have to deserialize it. I guess this doesn't change the design much, but instead of fetching from Redis, we could just store the trie in the (application) server itself, and every so often the server can re-fetch it from Redis to update the trie (redoing the serialization/deserialization).
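A minimal sketch of that serialize-once, refresh-periodically pattern (assumes the redis-py client; key names are made up):

```python
import json
import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379)

def save_trie(trie_dict):
    # Redis hashes can't nest, so store the whole trie (plain nested
    # dicts) as a single JSON blob.
    r.set("suggestion_trie", json.dumps(trie_dict))

def load_trie():
    # Each application server deserializes once per refresh instead of
    # hitting Redis on every keystroke.
    raw = r.get("suggestion_trie")
    return json.loads(raw) if raw else {}
```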
Why the Hadoop Distributed File System (HDFS) for the add query service?
Just so you can periodically run batch jobs on the data held there
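For example, a rough PySpark sketch of such a batch job (paths and the log format are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("top-suggestions").getOrCreate()

# Read a day of raw query logs that the stream processor sank to HDFS,
# one query per line.
queries = spark.read.text("hdfs:///logs/queries/dt=latest")

# Count each distinct query and keep the most frequent; the output
# feeds the periodic trie/cache rebuild.
top = (queries.groupBy("value")
              .count()
              .orderBy(F.desc("count"))
              .limit(1000))

top.write.mode("overwrite").parquet("hdfs:///aggregates/top_queries")
```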
By adding query, do you mean adding a term into the trie?
thank you!
14:56 Architecture
Stop letting the ladies sit on your face 🙄
I prefer when the dudes do it tbh