Love that you went through the math (even if it was a bit hand wavey for brevity). This feels like an actually useful data structure, and one that I was not aware of.
That was a fascinating video and I can think of a few places in my code base that could use this to help reduce the number of unnecessary calls to an API. This also feels like it should be a Python package someone implements in C on the back-end.
thanks a lot for making the explanations dark mode, it is eye burning for me to use white mode, mostly cuz I got an eyes issue. it doesn't matter how hard I use a blue filter or how much ambient light I have in the room, full white screen hurts within 10 mins max
@@quintrankid8045 I am not sure about that. I didn't try to search. I just use Dark Reader to darkmode everything and set a shortcut to toggle it very fast when it breaks
@@quintrankid8045mine does, one of the few reasons why I exclusively use Opera GX and their force dark mode for all websites lol (other being mouse gestures). Win11 has a nice dark theme built in. Additionally, all GUIs I write have dark themes as well lol
I’ve always used a single hash function, with several smaller bloom “filters”(bit arrays), each with a length that is prime, and mod the hash by the length of each “filter”. Assuming that your hash function is good, this makes the false positive calculation very easy. You take the proportion of set bits in each bit array and multiply them all together. Ex: your filters are of size 11, 13, and 17 (small for demonstration). And the number of set bits are 4, 4, 6, then the false positive rate is (4/11) * (4/13) * (6/17) = 96/2431 = ~ 4% If your application involves likely getting a lot of “definitely not” responses, you could sort the set of bit arrays by fullness, so the subarray that is most likely to say “definitely not” is tried first, and so on down the line, to speed up ever so slightly the decision process. The other benefit here is if your memory is super constrained (or your bloom filters massively large) and you don’t have a big enough chunk of memory available for the full bit array contiguously in memory, several smaller bit arrays may still be able to be allocated in the spaces of memory available. Very niche use case there, but still.
I failed to parse “numerical hot loop” even though I understood the components. Imma break it down for my own edification * loop -> runs multiple times * hot -> on the critical path * numerical -> doing numeric computations. Why is this even a relevant word? Well because Python addition (and all other numeric operations) are slower than addition accelerated by a library like numpy (especially for parallelizable stuff like working with arrays) because you have to update the values in the python VM (through doing them in the CPU then storing them in memory) vs. numpy or other libs that let you just do more in the CPU. Also the GIL stops multiple processes from working at once. So a “numerical hot loop” -> slower-than-necessary computations on the critical path and you run it multiple times.
At 6:50 it's stated the hash functions must be at least 27 bits long. Then at 7:10 the 256-bit hash function gets divided into four chunks of 48 bits - could we have divided this into *nine* chunks of 27 bits (with some room leftover)? Where did the 48 bits come from - was it because we want 5 hash functions and also want to maximally use the big 256-bit function??
if you have an array size M = 80,000,000 like in the example you need 26.25 hash bits, but now if you take 27 bit hash you get numbers from 0 to 2^27-1 = 134,217,727 so you need to take this number modulo M to get the array index. But then the indices from 0 to 54,217,727 are twice as probable as other ones, and the filter becomes less efective - to avoid this kind of problem it is better to take larger chunks of bits before division. Ofc you can take M = 2^27 and forget both about this problem and the need to divide.
I'm misunderstanding something at 5:22. The probability that a bit is set after n insertions [call this p] is p = 1 - (1 - 1/m)^(kn). But then we say the probability is a false positive rate [call this r] is just r = p^k. But isn't this something that's *exactly* represented by binomial probability? I.e. the probability of one false positive (call this t_1) is t_1 = (n choose 1) * (p)^1 * (1-p)^(n-1), and the probability of two false positives is t_2 = (n choose 2) * (p)^2 * (1-p)^(n-2), and so on. So the false positive rate is *actually* equal to the sum of all the probabilities of getting each number of false positives. i.e. r = t_1 + t_2 + t_3 + ... + t_k. What am I misapplying here? Are they equivalent? I couldn't get the algebra to work showing they are equal.
for false positive rate we consider a probability that some element not added to the filter, is recognized as already added false positive = all k bits with indices generated by the hashes are set - probability P = p^k true negative = at least one bit is not set - probability P' = 1 - p^k = sum (k choose i) * (1-p)^i * p^(k-i) for i in 1..k
"for a fixed bit, the chance that one hash of a single element will set that bit is 1/m": I don't get this. This would mean that the hash only sets one bit, hence the chance of a fixed bit to be set would be 1/m. But my understanding is that a hash will set many bits, in fact, on average half of the bits, so I would expect the chance of a bit to be 1 when hashing one element is 1/2. can anyone explain please?
I think the idea is that when we add an element e to the set, we set the bit at index k_i(e) to 1 for each i. So, in other words, the normal hash output(maybe with some mod) is the index of the bit we set, not the bits we set
@@crystalvoyager2495 I was also confused by this! I think I got thrown off by the animation at 3:45 , not understanding that the three arrows were created by each independent hash function. Thanks for the explainer @piguyalamode164
I don't understand something. Considering this use case wouldn't it mean that you would need to have the whole dataset to initialize the bloom filter? Otherwise how can you know if a site is definitely not in the set without querying the service?
Very nice video as always. One question: I don't understand why many hash functions are needed. The example hash function from your video set a bit uniformly. Why not just call k times this function ?
Great question, since hash functions are not truly random but rather deterministic functions, applying the same hash k times will yield the same value each time, setting only 1 bit of shadow instead of k bits of shadow. This violates the independence assumption we made to perform the analysis and will result in the performance characteristics of a bloom filter with k=1, which is not necessarily bad, but you could do much better with k=5.
because you don't delete from bloom filters, i feel like _any number_ of zeros at the locations given by the hash functions indicates absence (you don't need to check all of them)
It seems like these cool "memory-saving" data structures are becoming less and less used as more and more companies just rely on increasing cloud resources. Thanks for including the mathematics on the parameter variables, I think it's much more interesting than just running brute force tests.
It seems you dont know what you are talking about. Any cloud engineer worth their weight in salt knows when and how to apply such techniques. It seems you are making irrational assumptions without evidence.
This is really cool, thanks for sharing. Since python ints are essentially any length, do you think a more performant implementation could use the a really large int as the memory, or would it mess with the way int is implemented?
Accidentally sent my earlier comment earlier (and I don't see it so idk if it's even there) Anyway, a reason not to use python ints for this is that you need to do bit manipulations on them and I don't know how receptive they are to that (you can definitely do that but having a special class for it will make it a lot less abstract)
If I had enough disk on the server, I'd check the "probably" response against a local sqlite db of the strings indexed by URL. And if the links service had an API for new bad links, I'd pull from that to keep the bloom filter and db updated. I dunno
I'd swear I learned this algorithm with a different name, but I can't seem to find it. Maybe the source made up a name because they didn't know where it came from and I should've always known it as a Bloom filter. Weird either way.
That was... fast. Especially the section starting at 3:30 that explains the main idea. The meat of the whole video is in the 30 seconds that follow, and honestly just felt super rushed. Took me a few attempts to understand that you don't store each hash in memory directly, but that each hash is an address to where in memory you set a bit to 1.
Looks like Subnet Masking, but instead of passing by bit:0, it passes by bit:1 through OR comparison, kinda wierd..what happens if the "possible" is acutally "no", is it checked again ??
Usually some more computationally expensive check if it's a maybe. Sometimes will be nested bloom filters. Think ethereum block bloom filters. To scan for a swap topic in a transaction, rather than check every transaction of every block you can check the block filter and if all bits are set then maybe there is a swap event in there. Then check the transaction bloom filters from the maybe's to find which may have the topic, and only on these maybe's will we bother pulling a transaction receipt and matching hashes to find what we want to decode.
This one's a bit disappointing to lead people to, as there are better structures for this purpose now. Cuckoo filters and binary fuse filters each have better characteristics.
Thanks to this video's sponsor, Hostinger: hostinger.com/mcoding. Don't forget to use coupon code MCODING at checkout to get an additional 10% off!
Code for this one does not seem to be uploaded to the usual github location
The code is up now! github.com/mCodingLLC/VideosSampleCode/blob/master/videos/132_bloom_filter/bloom_filter.py@@Shaft0
Love that you went through the math (even if it was a bit hand wavey for brevity). This feels like an actually useful data structure, and one that I was not aware of.
Details are left as an exercise to the reader :)
Shadows are a brilliant comparison for bloom filters. I love this, thanks!
That was a fascinating video and I can think of a few places in my code base that could use this to help reduce the number of unnecessary calls to an API. This also feels like it should be a Python package someone implements in C on the back-end.
there's a rust-for-python package called rbloom
thanks a lot for making the explanations dark mode, it is eye burning for me to use white mode, mostly cuz I got an eyes issue. it doesn't matter how hard I use a blue filter or how much ambient light I have in the room, full white screen hurts within 10 mins max
Does your browser or OS have a method to invert color?
@@quintrankid8045 I am not sure about that. I didn't try to search.
I just use Dark Reader to darkmode everything and set a shortcut to toggle it very fast when it breaks
@@quintrankid8045mine does, one of the few reasons why I exclusively use Opera GX and their force dark mode for all websites lol (other being mouse gestures). Win11 has a nice dark theme built in. Additionally, all GUIs I write have dark themes as well lol
Skill issue
I love your videos. High information density to explain the main idea so simply and so perfectly and so fast. Then minutes on practical explanation
I’ve always used a single hash function, with several smaller bloom “filters”(bit arrays), each with a length that is prime, and mod the hash by the length of each “filter”.
Assuming that your hash function is good, this makes the false positive calculation very easy. You take the proportion of set bits in each bit array and multiply them all together.
Ex: your filters are of size 11, 13, and 17 (small for demonstration). And the number of set bits are 4, 4, 6, then the false positive rate is (4/11) * (4/13) * (6/17) = 96/2431 = ~ 4%
If your application involves likely getting a lot of “definitely not” responses, you could sort the set of bit arrays by fullness, so the subarray that is most likely to say “definitely not” is tried first, and so on down the line, to speed up ever so slightly the decision process.
The other benefit here is if your memory is super constrained (or your bloom filters massively large) and you don’t have a big enough chunk of memory available for the full bit array contiguously in memory, several smaller bit arrays may still be able to be allocated in the spaces of memory available. Very niche use case there, but still.
I failed to parse “numerical hot loop” even though I understood the components. Imma break it down for my own edification
* loop -> runs multiple times
* hot -> on the critical path
* numerical -> doing numeric computations. Why is this even a relevant word? Well because Python addition (and all other numeric operations) are slower than addition accelerated by a library like numpy (especially for parallelizable stuff like working with arrays) because you have to update the values in the python VM (through doing them in the CPU then storing them in memory) vs. numpy or other libs that let you just do more in the CPU. Also the GIL stops multiple processes from working at once.
So a “numerical hot loop” -> slower-than-necessary computations on the critical path and you run it multiple times.
Best video of the week if not the month. Every second I was fascinated, even the sponsorship was spot on
At 6:50 it's stated the hash functions must be at least 27 bits long. Then at 7:10 the 256-bit hash function gets divided into four chunks of 48 bits - could we have divided this into *nine* chunks of 27 bits (with some room leftover)? Where did the 48 bits come from - was it because we want 5 hash functions and also want to maximally use the big 256-bit function??
if you have an array size M = 80,000,000 like in the example you need 26.25 hash bits, but now if you take 27 bit hash you get numbers from 0 to 2^27-1 = 134,217,727 so you need to take this number modulo M to get the array index. But then the indices from 0 to 54,217,727 are twice as probable as other ones, and the filter becomes less efective - to avoid this kind of problem it is better to take larger chunks of bits before division. Ofc you can take M = 2^27 and forget both about this problem and the need to divide.
I have to say, in general I'm not a big fan of Python, but the way that you write code kind of makes me like it again!
Outstanding! Big motivation to learn algos & data structures. Usecase and shadow comparison priceless!
That was really impressive. Thanks for sharing!
great video introduction to bloom filtrers!
I'm misunderstanding something at 5:22. The probability that a bit is set after n insertions [call this p] is p = 1 - (1 - 1/m)^(kn). But then we say the probability is a false positive rate [call this r] is just r = p^k.
But isn't this something that's *exactly* represented by binomial probability? I.e. the probability of one false positive (call this t_1) is t_1 = (n choose 1) * (p)^1 * (1-p)^(n-1), and the probability of two false positives is t_2 = (n choose 2) * (p)^2 * (1-p)^(n-2), and so on. So the false positive rate is *actually* equal to the sum of all the probabilities of getting each number of false positives. i.e. r = t_1 + t_2 + t_3 + ... + t_k.
What am I misapplying here? Are they equivalent? I couldn't get the algebra to work showing they are equal.
for false positive rate we consider a probability that some element not added to the filter, is recognized as already added
false positive = all k bits with indices generated by the hashes are set - probability P = p^k
true negative = at least one bit is not set - probability P' = 1 - p^k = sum (k choose i) * (1-p)^i * p^(k-i) for i in 1..k
Bloom filters have major application in computational biology. So funny that you posted this video. I was talkiny about this earlier haha.
That's really smart, thanks for showing the idea on the video, it's a good one to have in your pocket
Really appreciate this video, thank you
"for a fixed bit, the chance that one hash of a single element will set that bit is 1/m": I don't get this. This would mean that the hash only sets one bit, hence the chance of a fixed bit to be set would be 1/m. But my understanding is that a hash will set many bits, in fact, on average half of the bits, so I would expect the chance of a bit to be 1 when hashing one element is 1/2. can anyone explain please?
I think the idea is that when we add an element e to the set, we set the bit at index k_i(e) to 1 for each i. So, in other words, the normal hash output(maybe with some mod) is the index of the bit we set, not the bits we set
@@piguyalamode164 "the hash output is the index of the bit we set" that's what I misunderstood initially, thanks a lot for clearing that up!
@@crystalvoyager2495 I was also confused by this! I think I got thrown off by the animation at 3:45 , not understanding that the three arrows were created by each independent hash function. Thanks for the explainer @piguyalamode164
I don't understand something. Considering this use case wouldn't it mean that you would need to have the whole dataset to initialize the bloom filter? Otherwise how can you know if a site is definitely not in the set without querying the service?
Very nice video as always. One question: I don't understand why many hash functions are needed. The example hash function from your video set a bit uniformly. Why not just call k times this function ?
Great question, since hash functions are not truly random but rather deterministic functions, applying the same hash k times will yield the same value each time, setting only 1 bit of shadow instead of k bits of shadow. This violates the independence assumption we made to perform the analysis and will result in the performance characteristics of a bloom filter with k=1, which is not necessarily bad, but you could do much better with k=5.
Where do you learn about those to begin with? In CS grade course? In the wild? Is there a book of spells for such things?
03:37 is using 1 small buffer PER hash function worse than 1 big buffer for all hash functions?
There are also hash functions where you can set the output lenght.
Amazing stuff. Cheers.
Awesome content! Thank you so much.
if you made explanations for algorithms like this, maybe i'd get some motivation to do DS/Algo for interview prep. Great video!
because you don't delete from bloom filters, i feel like _any number_ of zeros at the locations given by the hash functions indicates absence (you don't need to check all of them)
Yep, you got it! The contains method returns True if they are all set, so it returns False if *any* are not set.
It seems like these cool "memory-saving" data structures are becoming less and less used as more and more companies just rely on increasing cloud resources.
Thanks for including the mathematics on the parameter variables, I think it's much more interesting than just running brute force tests.
It seems you dont know what you are talking about. Any cloud engineer worth their weight in salt knows when and how to apply such techniques. It seems you are making irrational assumptions without evidence.
You're just shifting bloom filter to the cloud provider. In the background, such technique are all used in cloud infrastructure as well.
Soooo the more memory the more accuracy or no? (4:27)
Cool thing is that Redis has a built-in Bloom Filter for you to use instead of implementing and managing it yourself.
What if the array is all 1s? Won’t it return a false positive for any element?
That is why you get worse results by using too many hash functions. At infinite hash functions all bits are set to 1
It would, but that is impossible if the number of hashes is less than the number of bits per element.
This is really cool, thanks for sharing. Since python ints are essentially any length, do you think a more performant implementation could use the a really large int as the memory, or would it mess with the way int is implemented?
Accidentally sent my earlier comment earlier (and I don't see it so idk if it's even there)
Anyway, a reason not to use python ints for this is that you need to do bit manipulations on them and I don't know how receptive they are to that (you can definitely do that but having a special class for it will make it a lot less abstract)
If I had enough disk on the server, I'd check the "probably" response against a local sqlite db of the strings indexed by URL. And if the links service had an API for new bad links, I'd pull from that to keep the bloom filter and db updated. I dunno
I'm sure one could keep a handful of known good link filters too to eliminate any fuzzed links
I swear, I always learn something new from you. Greatest gift one can get is experience, and I'm very grateful you are generous with yours
Fascinating stuff as always. Could you do a coding vid on the enhanced trial division method to see how much faster it is than trial division?
Facinating.
I'd swear I learned this algorithm with a different name, but I can't seem to find it. Maybe the source made up a name because they didn't know where it came from and I should've always known it as a Bloom filter. Weird either way.
That was... fast. Especially the section starting at 3:30 that explains the main idea. The meat of the whole video is in the 30 seconds that follow, and honestly just felt super rushed. Took me a few attempts to understand that you don't store each hash in memory directly, but that each hash is an address to where in memory you set a bit to 1.
Looks like Subnet Masking, but instead of passing by bit:0, it passes by bit:1 through OR comparison, kinda wierd..what happens if the "possible" is acutally "no", is it checked again ??
Usually some more computationally expensive check if it's a maybe. Sometimes will be nested bloom filters.
Think ethereum block bloom filters. To scan for a swap topic in a transaction, rather than check every transaction of every block you can check the block filter and if all bits are set then maybe there is a swap event in there. Then check the transaction bloom filters from the maybe's to find which may have the topic, and only on these maybe's will we bother pulling a transaction receipt and matching hashes to find what we want to decode.
splunk is using bloomfilters heavily
Please consider this video's sponsor... me! That's right I'm sponsoring myse.... WAIT, WHAT? Not this time?
I want more
Nice
I also want a PhD.
i understood nothing
This one's a bit disappointing to lead people to, as there are better structures for this purpose now. Cuckoo filters and binary fuse filters each have better characteristics.
I was like ooooh damn now we surely got to implement this in C or Rust…. Then you hit us with Python and my heart stopped 😢
discord gang.
8:44 Or stop using python
This sounds like it's voiced by AI