MapReduce - Computerphile

  • Published 7 Aug 2024
  • Performing operations in parallel on big data. Rebecca Tickle explains MapReduce.
    / computerphile
    / computer_phile
    This video was filmed and edited by Sean Riley.
    Computer Science at the University of Nottingham: bit.ly/nottscomputer
    Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan.com

COMMENTS • 63

  • @jaybakerwork
    @jaybakerwork 5 years ago +168

    From reading the comments, I think what some people are missing is the following: the value of MapReduce as it was implemented at Google (and in Hadoop) is that it is a framework for doing these sorts of jobs at scale. Furthermore, the framework lets you plug in your map and reduce functions independently of everything else. That is why the shuffle is important, for example. Often, people see the word count example and they want to just take that out and count at that point. And that might be better for this particular example, but the idea of the framework/algorithm is that it does not "know" at the shuffle phase that you are going to count. I think more examples would really help people understand it better.
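
    A minimal sketch of the pluggable structure described above, in Python (names like map_fn/reduce_fn are illustrative, not Hadoop's actual API):

        from itertools import groupby
        from operator import itemgetter

        # User-supplied map function: knows nothing about what reduce will do.
        def map_fn(line):
            for word in line.split():
                yield (word, 1)

        # User-supplied reduce function: sees one key and all of its values.
        def reduce_fn(key, values):
            yield (key, sum(values))

        def mapreduce(lines, map_fn, reduce_fn):
            # Map phase: emit intermediate (key, value) pairs.
            pairs = [kv for line in lines for kv in map_fn(line)]
            # Shuffle phase: group pairs by key. The framework does this
            # without knowing that reduce_fn happens to count.
            pairs.sort(key=itemgetter(0))
            grouped = groupby(pairs, key=itemgetter(0))
            # Reduce phase: apply reduce_fn to each key's group of values.
            return [out for key, group in grouped
                        for out in reduce_fn(key, (v for _, v in group))]

        print(mapreduce(["a b a", "c a b"], map_fn, reduce_fn))
        # [('a', 3), ('b', 2), ('c', 1)]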

  • @szhangster
    @szhangster 5 years ago +365

    Professor Baclawski developed the search engine technology now called MapReduce in 1993. Northeastern University patented this technology, and the patent has not been successfully challenged. Northeastern University sued Google for patent infringement, and Google awarded a settlement to Northeastern University.

  • @No0utlet
    @No0utlet 5 years ago +77

    I remember learning about Hadoop Map Reduce in 2013 and being completely confused about it. Thanks!

  • @tomburns5231
    @tomburns5231 5 years ago +46

    I don't understand what a, b, and c are in the example. I thought they were words, but then the presenter calls them letters. In any case, given the example of distributing a word count over parallel processors, why isn't the answer simply to split the whole document/data into chunks and ask the processors to count the words of one chunk each? Afterwards we can just sum the results. What am I missing?

  • @cl0udbear
    @cl0udbear 5 years ago +73

    I don't get it. So you shuffle the keys between the nodes so that each node is counting all the instances of the same key? Wouldn't that shuffling have been the perfect time to just do the count and get it done with?

  • @STDrepository
    @STDrepository 5 years ago +68

    In this example why wouldn't you just have each server count the words in its data and then send the results back to the operator's machine which adds them up?

  • @NikolajLepka
    @NikolajLepka 5 years ago +287

    map and reduce also just happen to be the two most commonly used functions in functional programming, and have been for decades.
    This tech isn't new; it's just being applied at a large scale (a minimal example below).

    • @BarbarianGod
      @BarbarianGod 5 years ago +38

      Glad I'm not the only one bothered by the opening of this video
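
    As the parent comment notes, both have long existed as ordinary functions; a minimal Python illustration (nothing here is specific to any framework):

        from functools import reduce

        words = ["a", "b", "a"]
        pairs = list(map(lambda w: (w, 1), words))             # [('a', 1), ('b', 1), ('a', 1)]
        total = reduce(lambda acc, kv: acc + kv[1], pairs, 0)  # 3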

  • @philippk7554
    @philippk7554 5 years ago +9

    You can actually use k-means with MapReduce. It basically works by splitting the data into sets and running k-means on different nodes with the same cluster values. After that, the new means are calculated and broadcast to all nodes for the next iteration.
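
    A rough sketch of one such iteration (illustrative Python; assumes 1-D points and a fixed list of current means, with no actual distribution across nodes):

        # One k-means iteration expressed as a map and a reduce over 1-D points.
        points = [1.0, 1.1, 0.9, 8.0, 8.2, 7.9]
        means = [0.0, 10.0]  # current cluster means, broadcast to every node

        # Map: emit (index of nearest mean, point) for each point.
        mapped = [(min(range(len(means)), key=lambda i: abs(p - means[i])), p)
                  for p in points]

        # Reduce: average the points assigned to each mean.
        new_means = []
        for i in range(len(means)):
            assigned = [p for k, p in mapped if k == i]
            new_means.append(sum(assigned) / len(assigned) if assigned else means[i])

        print(new_means)  # [1.0, 8.033333333333333]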

  • @IceMetalPunk
    @IceMetalPunk 5 years ago +5

    Why does this allow duplicate keys on the same node instead of just one key per word with the value being the total count of the word on that node? Seems like that would speed up both the shuffle and reduce phases?

  • @luciengrondin5802
    @luciengrondin5802 5 years ago +2

    When I hear about map/reduce I think about the corresponding functions in Perl, and that confuses me because they don't seem to be much related.

  • @233kosta
    @233kosta 5 years ago +3

    Can you do a segment on the MPI protocol?

  • @theguitarhero649
    @theguitarhero649 5 years ago +4

    I just had to do a mapreduce implementation with grpc for my Advanced Operating Systems class that was due last night, wish this had come out earlier -_-

  • @Kachkavalj
    @Kachkavalj 3 years ago +19

    It's really refreshing to hear someone explain this in a straightforward way ... kudos to that young lady!

  • @cthudo
    @cthudo 5 years ago +210

    I admit I'm an old fogy programmer... I have a massive problem with crediting this to Google. They *may* have been the ones to put a name on it, but that doesn't mean this isn't an exceedingly basic strategy of dividing work and collating the results that was being used by a lot of people for decades.

  • @MokhlesBouzaien
    @MokhlesBouzaien 4 years ago +4

    It was confusing until I found a Computerphile video.

  • @TheAstronomyDude
    @TheAstronomyDude 5 years ago

    How would you implement this if you want to leave your data untouched? Mirror your data?

  • @steven1671
    @steven1671 5 years ago +28

    Is "reduce" the same as the fold function in Haskell?

    • @Yotanido
      @Yotanido 5 years ago +1

      It is

    • @Earthcomputer
      @Earthcomputer 5 years ago +1

      Yes

    • @klegsy
      @klegsy 5 years ago +3

      Yes! Fold with the added benefit of directionality.
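
      For a concrete case where direction matters: Python's functools.reduce is a left fold, and a right fold is sketched for contrast:

          from functools import reduce

          xs = [1, 2, 3]
          left = reduce(lambda a, b: a - b, xs)             # ((1 - 2) - 3) = -4
          right = reduce(lambda a, b: b - a, reversed(xs))  # (1 - (2 - 3)) = 2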

  • @mrtoastyman07
    @mrtoastyman07 5 years ago +52

    This feels like the most impenetrable way of explaining this...

  • @Juan-Hdez
    @Juan-Hdez 2 years ago

    Very useful. Thank you very much!

  • @monthlyduc
    @monthlyduc 5 years ago +6

    Could you do a video on concurrency, possible solutions, and its history? I'm surprised you haven't yet touched on sequential and concurrent execution. Great vid!

    • @mytech6779
      @mytech6779 5 years ago +1

      They have covered concurrency within a CPU core, unless maybe you mean parallel execution.

  • @brookead
    @brookead 5 years ago +59

    Hadoop as an approach to map reduce is so over-engineered and complicated that most people avoid it. Also, unless you're at really big scale, the overhead of Hadoop is so big that any decent laptop can outperform it. That's specifically a Hadoop criticism, not a criticism of MR in general. Alternatives like DASK (simplistically: distributed Python pandas) are much simpler to manage for most purposes... Not quite the same thing, but ultimately you pick your poison... And as noted by another commenter, the thing Google "pioneered" was _distributed_ MR. MR itself has been a part of many languages for many years... And the good explanation given in this video works just as well for those as it does for distributed versions thereof. :)

  • @Myziikhavus
    @Myziikhavus 5 months ago

    This is a good video; no idea wtf the rest are yapping about. She actually goes through the main bit of how the data shuffles and moves to different nodes.

  • @Metruzanca
    @Metruzanca 5 years ago +5

    Does the way this is done have any similarity to how array.prototype.map() and array.prototype.reduce() work in ES6 JS?
    Edit: I mean at the level of dividing the map process across different processes/threads/tasks to speed up computation.

    • @belst_
      @belst_ 5 years ago +6

      map is not computed concurrently but sequentially in JS.
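
      A chunked, process-parallel map of the kind the question describes can be sketched like this (Python rather than JS, purely for illustration):

          from concurrent.futures import ProcessPoolExecutor

          def count_words(chunk):
              # Map step for one chunk of lines.
              counts = {}
              for line in chunk:
                  for word in line.split():
                      counts[word] = counts.get(word, 0) + 1
              return counts

          if __name__ == "__main__":
              lines = ["a b a", "c a b", "b c c"]
              chunks = [lines[0:2], lines[2:3]]  # split the work across processes
              with ProcessPoolExecutor() as pool:
                  partials = list(pool.map(count_words, chunks))
              # Reduce step: merge the per-chunk counts.
              total = {}
              for part in partials:
                  for word, n in part.items():
                      total[word] = total.get(word, 0) + n
              print(total)  # {'a': 3, 'b': 3, 'c': 3}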

  • @ignacio560
    @ignacio560 5 years ago +17

    Thank you, amazing video! Any chance you’ll talk about Spark next? 😄

  • @worlddaves
    @worlddaves 5 years ago +24

    Why do you always write on that particular type of paper in your videos?

    • @samsharma8621
      @samsharma8621 5 years ago +18

      They get that for free

    • @rich1051414
      @rich1051414 5 years ago +72

      Consistency. Computerphile uses dot matrix paper while Numberphile uses brown paper towels.

    • @feldinho
      @feldinho 5 years ago +16

      It's called "continuous form" and was historically used with daisy wheel-type printers (fixed typeface, size and line height) to ease reading.

    • @SteelSkin667
      @SteelSkin667 5 years ago +7

      Because computers.

    • @saschar.8736
      @saschar.8736 5 years ago +4

      @@feldinho Yeah, I guess they have a bunch of it lying around from that era. At home we also have (or had; I don't know if my father has thrown it away yet) a stack of similar paper from back when we had a dot matrix printer. And I imagine that, at a university, they have tons of that stuff left.

  • @man-alive
    @man-alive 5 years ago +6

    If the mapping creates a single record for each occurrence (e.g. a 1, b 1, c 1), why not just use an array (e.g. a, b, a), since the 1 is implied?

  • @Gunbudder
    @Gunbudder 5 years ago +7

    Map reduce gives me google interview flashbacks...

  • @MayankArora
    @MayankArora 5 years ago +1

    I love computerphile ❤️

  • @unvergebeneid
    @unvergebeneid 5 years ago +371

    Is it just me or was that not a particularly good example?

    • @viorelanghel5532
      @viorelanghel5532 5 years ago +62

      It's the classic map reduce example, but you need to imagine 10 TB of text, not four words a b c d.

    • @KylePiira
      @KylePiira 5 years ago +33

      Perhaps not, but word counting is the typical Hello World project for MapReduce so it makes sense to use it.

    • @Ceelvain
      @Ceelvain 5 years ago +53

      The word count is a good example. (Actually the best I came across.) She just should have chosen sentences with actual words and put it in the context of a search engine.

  • @mro2352
    @mro2352 4 years ago

    My previous project was with Apache Spark, which uses Hadoop and map reduce, and we used it for batch jobs.

  • @cediddi
    @cediddi 5 years ago +11

    No love for filter :( It's map-reduce-filter (at least for me)

    • @MarioWenzel
      @MarioWenzel 5 years ago +2

      filter can be part of either phase, depending on how you see it.
      filter p l = foldl (++) [] $ map (\x -> if p x then [x] else []) l

  • @WilliamAndrea
    @WilliamAndrea 5 years ago +11

    Why are there duplicate keys? Maybe it's cause I only know Python, but I'm really confused.
    Like for the first line of text, I would write the output of the map phase as a list of tuples, [('a', 1), ('b', 1), ('a', 1)], but wouldn't it make more sense as a dict, {'a': 2, 'b': 1} ?
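
    Both forms show up in practice; merging duplicates locally before the shuffle is MapReduce's "combiner" optimization. A quick Python comparison:

        from collections import Counter

        line = "a b a"
        # Raw map output: one pair per occurrence.
        pairs = [(w, 1) for w in line.split()]  # [('a', 1), ('b', 1), ('a', 1)]
        # With a local combiner: duplicates merged before the shuffle.
        combined = Counter(line.split())        # Counter({'a': 2, 'b': 1})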

  • @allmhuran
    @allmhuran 5 years ago +11

    select count(...) group by ...
    Query plan: parallelism, stream aggregate
    But sure, tell me more about how SQL is outdated.

  • @lionp696
    @lionp696 5 years ago +8

    Needed this a week ago for my exam :(

  • @MrTridac
    @MrTridac 5 years ago +102

    When will computer scientists stop putting new fancy names on decade old concepts?

  • @sabriath
    @sabriath 5 years ago +27

    I'm not seeing the practicality of this, shuffling seems to make it pointless. It's almost like seeing a huffman tree being formed, but then introducing it to some meth.

  • @ninocraft1
    @ninocraft1 5 years ago

    brilliant example of a practical application

  • @FahadKiani1
    @FahadKiani1 3 years ago

    Anyone know who the genius is?

  • @Master1906
    @Master1906 5 years ago +81

    She didn't explain it very well...

  • @coder001
    @coder001 5 years ago +9

    too much talking and not enough visual examples

  • @PilotPlater
    @PilotPlater 5 years ago +9

    forget the nasty comments, I found this fascinating, and so did 94% of people who rated.

  • @geekworthy7938
    @geekworthy7938 5 years ago +29

    Yeah, that was a pretty bad example and quite boring.

  • @__-to3hq
    @__-to3hq 5 years ago +3

    the sound of felt pens makes me cringe

  • @tenseikenzx-3559
    @tenseikenzx-3559 5 years ago +1

    What's better, MapReduce or CUDA?

    • @CJSwitz
      @CJSwitz 5 years ago +31

      That's like comparing Apache Webserver to Fortran. They are completely different things.

    • @SteelSkin667
      @SteelSkin667 5 years ago +8

      I don't think these two are directly comparable. I'm pretty sure there are GPGPU implementations of MapReduce that make use of CUDA.

    • @IsYitzach
      @IsYitzach 5 years ago +6

      MapReduce is a generic algorithmic paradigm. There's a CUDA implementation for numerical purposes. This video describes a cluster-sized problem rather than a GPU-sized problem.