Bloom Filters

Поділитися
Вставка
  • Опубліковано 27 січ 2025

КОМЕНТАРІ • 74

  • @mCoding
    @mCoding  Рік тому +8

    Thanks to this video's sponsor, Hostinger: hostinger.com/mcoding. Don't forget to use coupon code MCODING at checkout to get an additional 10% off!

    • @Shaft0
      @Shaft0 Рік тому

      Code for this one does not seem to be uploaded to the usual github location

    • @mCoding
      @mCoding  Рік тому

      The code is up now! github.com/mCodingLLC/VideosSampleCode/blob/master/videos/132_bloom_filter/bloom_filter.py@@Shaft0

  • @garyantonyo
    @garyantonyo Рік тому +35

    Love that you went through the math (even if it was a bit hand wavey for brevity). This feels like an actually useful data structure, and one that I was not aware of.

    • @mCoding
      @mCoding  Рік тому +15

      Details are left as an exercise to the reader :)

  • @Zaniahiononzenbei
    @Zaniahiononzenbei Рік тому +15

    Shadows are a brilliant comparison for bloom filters. I love this, thanks!

  • @TheOriginalSentack
    @TheOriginalSentack Рік тому +68

    That was a fascinating video and I can think of a few places in my code base that could use this to help reduce the number of unnecessary calls to an API. This also feels like it should be a Python package someone implements in C on the back-end.

    • @aonodensetsu
      @aonodensetsu Рік тому +19

      there's a rust-for-python package called rbloom

  • @glorytoarstotzka330
    @glorytoarstotzka330 Рік тому +94

    thanks a lot for making the explanations dark mode, it is eye burning for me to use white mode, mostly cuz I got an eyes issue. it doesn't matter how hard I use a blue filter or how much ambient light I have in the room, full white screen hurts within 10 mins max

    • @quintrankid8045
      @quintrankid8045 Рік тому +2

      Does your browser or OS have a method to invert color?

    • @glorytoarstotzka330
      @glorytoarstotzka330 Рік тому

      ​@@quintrankid8045 I am not sure about that. I didn't try to search.
      I just use Dark Reader to darkmode everything and set a shortcut to toggle it very fast when it breaks

    • @djixi98
      @djixi98 Рік тому

      ​@@quintrankid8045mine does, one of the few reasons why I exclusively use Opera GX and their force dark mode for all websites lol (other being mouse gestures). Win11 has a nice dark theme built in. Additionally, all GUIs I write have dark themes as well lol

    • @csanadtemesvari9251
      @csanadtemesvari9251 Рік тому +3

      Skill issue

  • @arisweedler4703
    @arisweedler4703 Рік тому +11

    I love your videos. High information density to explain the main idea so simply and so perfectly and so fast. Then minutes on practical explanation

  • @loganrussell48
    @loganrussell48 Рік тому +2

    I’ve always used a single hash function, with several smaller bloom “filters”(bit arrays), each with a length that is prime, and mod the hash by the length of each “filter”.
    Assuming that your hash function is good, this makes the false positive calculation very easy. You take the proportion of set bits in each bit array and multiply them all together.
    Ex: your filters are of size 11, 13, and 17 (small for demonstration). And the number of set bits are 4, 4, 6, then the false positive rate is (4/11) * (4/13) * (6/17) = 96/2431 = ~ 4%
    If your application involves likely getting a lot of “definitely not” responses, you could sort the set of bit arrays by fullness, so the subarray that is most likely to say “definitely not” is tried first, and so on down the line, to speed up ever so slightly the decision process.
    The other benefit here is if your memory is super constrained (or your bloom filters massively large) and you don’t have a big enough chunk of memory available for the full bit array contiguously in memory, several smaller bit arrays may still be able to be allocated in the spaces of memory available. Very niche use case there, but still.

  • @arisweedler4703
    @arisweedler4703 Рік тому +7

    I failed to parse “numerical hot loop” even though I understood the components. Imma break it down for my own edification
    * loop -> runs multiple times
    * hot -> on the critical path
    * numerical -> doing numeric computations. Why is this even a relevant word? Well because Python addition (and all other numeric operations) are slower than addition accelerated by a library like numpy (especially for parallelizable stuff like working with arrays) because you have to update the values in the python VM (through doing them in the CPU then storing them in memory) vs. numpy or other libs that let you just do more in the CPU. Also the GIL stops multiple processes from working at once.
    So a “numerical hot loop” -> slower-than-necessary computations on the critical path and you run it multiple times.

  • @nadavgover6017
    @nadavgover6017 Рік тому

    Best video of the week if not the month. Every second I was fascinated, even the sponsorship was spot on

  • @jpay
    @jpay Рік тому +8

    At 6:50 it's stated the hash functions must be at least 27 bits long. Then at 7:10 the 256-bit hash function gets divided into four chunks of 48 bits - could we have divided this into *nine* chunks of 27 bits (with some room leftover)? Where did the 48 bits come from - was it because we want 5 hash functions and also want to maximally use the big 256-bit function??

    • @volodymyrchelnokov8175
      @volodymyrchelnokov8175 Рік тому +5

      if you have an array size M = 80,000,000 like in the example you need 26.25 hash bits, but now if you take 27 bit hash you get numbers from 0 to 2^27-1 = 134,217,727 so you need to take this number modulo M to get the array index. But then the indices from 0 to 54,217,727 are twice as probable as other ones, and the filter becomes less efective - to avoid this kind of problem it is better to take larger chunks of bits before division. Ofc you can take M = 2^27 and forget both about this problem and the need to divide.

  • @sufilevy
    @sufilevy Рік тому +4

    I have to say, in general I'm not a big fan of Python, but the way that you write code kind of makes me like it again!

  • @MrTrebor2
    @MrTrebor2 Рік тому +1

    Outstanding! Big motivation to learn algos & data structures. Usecase and shadow comparison priceless!

  • @yiannifunataoldotcom
    @yiannifunataoldotcom Рік тому

    That was really impressive. Thanks for sharing!

  • @Khushpich
    @Khushpich Рік тому

    great video introduction to bloom filtrers!

  • @jpay
    @jpay Рік тому +1

    I'm misunderstanding something at 5:22. The probability that a bit is set after n insertions [call this p] is p = 1 - (1 - 1/m)^(kn). But then we say the probability is a false positive rate [call this r] is just r = p^k.
    But isn't this something that's *exactly* represented by binomial probability? I.e. the probability of one false positive (call this t_1) is t_1 = (n choose 1) * (p)^1 * (1-p)^(n-1), and the probability of two false positives is t_2 = (n choose 2) * (p)^2 * (1-p)^(n-2), and so on. So the false positive rate is *actually* equal to the sum of all the probabilities of getting each number of false positives. i.e. r = t_1 + t_2 + t_3 + ... + t_k.
    What am I misapplying here? Are they equivalent? I couldn't get the algebra to work showing they are equal.

    • @volodymyrchelnokov8175
      @volodymyrchelnokov8175 Рік тому

      for false positive rate we consider a probability that some element not added to the filter, is recognized as already added
      false positive = all k bits with indices generated by the hashes are set - probability P = p^k
      true negative = at least one bit is not set - probability P' = 1 - p^k = sum (k choose i) * (1-p)^i * p^(k-i) for i in 1..k

  • @shrayesraman5192
    @shrayesraman5192 Рік тому

    Bloom filters have major application in computational biology. So funny that you posted this video. I was talkiny about this earlier haha.

  • @luizzeroxis
    @luizzeroxis Рік тому

    That's really smart, thanks for showing the idea on the video, it's a good one to have in your pocket

  • @Kylebacca
    @Kylebacca 11 місяців тому

    Really appreciate this video, thank you

  • @crystalvoyager2495
    @crystalvoyager2495 Рік тому +5

    "for a fixed bit, the chance that one hash of a single element will set that bit is 1/m": I don't get this. This would mean that the hash only sets one bit, hence the chance of a fixed bit to be set would be 1/m. But my understanding is that a hash will set many bits, in fact, on average half of the bits, so I would expect the chance of a bit to be 1 when hashing one element is 1/2. can anyone explain please?

    • @piguyalamode164
      @piguyalamode164 Рік тому +9

      I think the idea is that when we add an element e to the set, we set the bit at index k_i(e) to 1 for each i. So, in other words, the normal hash output(maybe with some mod) is the index of the bit we set, not the bits we set

    • @crystalvoyager2495
      @crystalvoyager2495 Рік тому +5

      @@piguyalamode164 "the hash output is the index of the bit we set" that's what I misunderstood initially, thanks a lot for clearing that up!

    • @tchittesh
      @tchittesh Рік тому +5

      @@crystalvoyager2495 I was also confused by this! I think I got thrown off by the animation at 3:45 , not understanding that the three arrows were created by each independent hash function. Thanks for the explainer @piguyalamode164

  • @KappaHakka
    @KappaHakka Рік тому

    I don't understand something. Considering this use case wouldn't it mean that you would need to have the whole dataset to initialize the bloom filter? Otherwise how can you know if a site is definitely not in the set without querying the service?

  • @Salute-hc
    @Salute-hc Рік тому

    Very nice video as always. One question: I don't understand why many hash functions are needed. The example hash function from your video set a bit uniformly. Why not just call k times this function ?

    • @mCoding
      @mCoding  Рік тому +2

      Great question, since hash functions are not truly random but rather deterministic functions, applying the same hash k times will yield the same value each time, setting only 1 bit of shadow instead of k bits of shadow. This violates the independence assumption we made to perform the analysis and will result in the performance characteristics of a bloom filter with k=1, which is not necessarily bad, but you could do much better with k=5.

  • @vsolyomi
    @vsolyomi Рік тому

    Where do you learn about those to begin with? In CS grade course? In the wild? Is there a book of spells for such things?

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff Рік тому

    03:37 is using 1 small buffer PER hash function worse than 1 big buffer for all hash functions?

  • @quintencabo
    @quintencabo Рік тому

    There are also hash functions where you can set the output lenght.

  • @Not_Even_Wrong
    @Not_Even_Wrong Рік тому

    Amazing stuff. Cheers.

  • @osmarperez2341
    @osmarperez2341 Рік тому

    Awesome content! Thank you so much.

  • @siddharth-gandhi
    @siddharth-gandhi Рік тому

    if you made explanations for algorithms like this, maybe i'd get some motivation to do DS/Algo for interview prep. Great video!

  • @VeejayRampay
    @VeejayRampay Рік тому

    because you don't delete from bloom filters, i feel like _any number_ of zeros at the locations given by the hash functions indicates absence (you don't need to check all of them)

    • @mCoding
      @mCoding  Рік тому +3

      Yep, you got it! The contains method returns True if they are all set, so it returns False if *any* are not set.

  • @skellt
    @skellt Рік тому +17

    It seems like these cool "memory-saving" data structures are becoming less and less used as more and more companies just rely on increasing cloud resources.
    Thanks for including the mathematics on the parameter variables, I think it's much more interesting than just running brute force tests.

    • @gregorymorse8423
      @gregorymorse8423 Рік тому +7

      It seems you dont know what you are talking about. Any cloud engineer worth their weight in salt knows when and how to apply such techniques. It seems you are making irrational assumptions without evidence.

    • @t_kon
      @t_kon Рік тому +1

      You're just shifting bloom filter to the cloud provider. In the background, such technique are all used in cloud infrastructure as well.

  • @hoteny
    @hoteny Рік тому

    Soooo the more memory the more accuracy or no? (4:27)

  • @yudhiesh1997
    @yudhiesh1997 Рік тому

    Cool thing is that Redis has a built-in Bloom Filter for you to use instead of implementing and managing it yourself.

  • @efperel
    @efperel Рік тому +1

    What if the array is all 1s? Won’t it return a false positive for any element?

    • @Pownyan
      @Pownyan Рік тому +4

      That is why you get worse results by using too many hash functions. At infinite hash functions all bits are set to 1

    • @NateROCKS112
      @NateROCKS112 Рік тому +1

      It would, but that is impossible if the number of hashes is less than the number of bits per element.

  • @user-hk3ej4hk7m
    @user-hk3ej4hk7m Рік тому +1

    This is really cool, thanks for sharing. Since python ints are essentially any length, do you think a more performant implementation could use the a really large int as the memory, or would it mess with the way int is implemented?

    • @roelant8069
      @roelant8069 Рік тому

      Accidentally sent my earlier comment earlier (and I don't see it so idk if it's even there)
      Anyway, a reason not to use python ints for this is that you need to do bit manipulations on them and I don't know how receptive they are to that (you can definitely do that but having a special class for it will make it a lot less abstract)

  • @SpeedingFlare
    @SpeedingFlare Рік тому

    If I had enough disk on the server, I'd check the "probably" response against a local sqlite db of the strings indexed by URL. And if the links service had an API for new bad links, I'd pull from that to keep the bloom filter and db updated. I dunno

  • @JonathanZigler
    @JonathanZigler Рік тому

    I'm sure one could keep a handful of known good link filters too to eliminate any fuzzed links

  • @andremaldonado7410
    @andremaldonado7410 Рік тому +1

    I swear, I always learn something new from you. Greatest gift one can get is experience, and I'm very grateful you are generous with yours

  • @DavidHodge-z9v
    @DavidHodge-z9v Рік тому

    Fascinating stuff as always. Could you do a coding vid on the enhanced trial division method to see how much faster it is than trial division?

  • @lukekurlandski7653
    @lukekurlandski7653 Рік тому

    Facinating.

  • @anon_y_mousse
    @anon_y_mousse Рік тому

    I'd swear I learned this algorithm with a different name, but I can't seem to find it. Maybe the source made up a name because they didn't know where it came from and I should've always known it as a Bloom filter. Weird either way.

  • @QuantumHistorian
    @QuantumHistorian Рік тому +4

    That was... fast. Especially the section starting at 3:30 that explains the main idea. The meat of the whole video is in the 30 seconds that follow, and honestly just felt super rushed. Took me a few attempts to understand that you don't store each hash in memory directly, but that each hash is an address to where in memory you set a bit to 1.

  • @kenny-kvibe
    @kenny-kvibe Рік тому +1

    Looks like Subnet Masking, but instead of passing by bit:0, it passes by bit:1 through OR comparison, kinda wierd..what happens if the "possible" is acutally "no", is it checked again ??

    • @bearwolffish
      @bearwolffish Рік тому +1

      Usually some more computationally expensive check if it's a maybe. Sometimes will be nested bloom filters.
      Think ethereum block bloom filters. To scan for a swap topic in a transaction, rather than check every transaction of every block you can check the block filter and if all bits are set then maybe there is a swap event in there. Then check the transaction bloom filters from the maybe's to find which may have the topic, and only on these maybe's will we bother pulling a transaction receipt and matching hashes to find what we want to decode.

  • @danielglauche6095
    @danielglauche6095 Рік тому

    splunk is using bloomfilters heavily

  • @jlhjlh
    @jlhjlh Рік тому +1

    Please consider this video's sponsor... me! That's right I'm sponsoring myse.... WAIT, WHAT? Not this time?

  • @stan4676
    @stan4676 Рік тому

    I want more

  • @HTWwpzIuqaObMt
    @HTWwpzIuqaObMt Рік тому

    Nice

  • @simonnjoroge933
    @simonnjoroge933 Рік тому +1

    I also want a PhD.

  • @zizzyballuba4373
    @zizzyballuba4373 Рік тому

    i understood nothing

  • @mikeshardmind
    @mikeshardmind Рік тому +1

    This one's a bit disappointing to lead people to, as there are better structures for this purpose now. Cuckoo filters and binary fuse filters each have better characteristics.

  • @Mateusz143
    @Mateusz143 Рік тому

    I was like ooooh damn now we surely got to implement this in C or Rust…. Then you hit us with Python and my heart stopped 😢

  • @bereck7735
    @bereck7735 Рік тому

    discord gang.

  • @palapapa0201
    @palapapa0201 Рік тому

    8:44 Or stop using python

  • @muhdiversity7409
    @muhdiversity7409 Рік тому

    This sounds like it's voiced by AI