"Performance Matters" by Emery Berger

  • Published May 4, 2024
  • Performance clearly matters to users. For example, the most common software update on the App Store is "Bug fixes and performance enhancements." Now that Moore's Law has ended, programmers have to work hard to get high performance for their applications. But why is performance hard to deliver?
    I will first explain why current approaches to evaluating and optimizing performance don't work, especially on modern hardware and for modern applications. I then present two systems that address these challenges. Stabilizer is a tool that enables statistically sound performance evaluation, making it possible to understand the impact of optimizations and conclude things like the fact that the -O2 and -O3 optimization levels are indistinguishable from noise (sadly true).
    Since compiler optimizations have run out of steam, we need better profiling support, especially for modern concurrent, multi-threaded applications. Coz is a new "causal profiler" that lets programmers optimize for throughput or latency, and which pinpoints and accurately predicts the impact of optimizations. Coz's approach unlocks previously unknown optimization opportunities. Guided by Coz, we improved the performance of Memcached (9%), SQLite (25%), and accelerated six other applications by as much as 68%; in most cases, this involved modifying less than 10 lines of code and took under half an hour (without any prior understanding of the programs!). Coz now ships as part of standard Linux distros (apt install coz-profiler).
    Emery Berger
    University of Massachusetts Amherst
    @emeryberger
    Emery Berger is a Professor in the College of Information and Computer Sciences at the University of Massachusetts Amherst, the flagship campus of the UMass system. He graduated with a Ph.D. in Computer Science from the University of Texas at Austin in 2002. Professor Berger has been a Visiting Scientist at Microsoft Research (where he is currently on sabbatical), the University of Washington, and at the Universitat Politècnica de Catalunya (UPC) / Barcelona Supercomputing Center (BSC). Professor Berger's research spans programming languages, runtime systems, and operating systems, with a particular focus on systems that transparently improve reliability, security, and performance. He and his collaborators have created a number of influential software systems including Hoard, a fast and scalable memory manager that accelerates multithreaded applications (used by companies including British Telecom, Cisco, Crédit Suisse, Reuters, Royal Bank of Canada, SAP, and Tata, and on which the Mac OS X memory manager is based); DieHard, an error-avoiding memory manager that directly influenced the design of the Windows 7 Fault-Tolerant Heap; and DieHarder, a secure memory manager that was an inspiration for hardening changes made to the Windows 8 heap. His honors include a Microsoft Research Fellowship, an NSF CAREER Award, a Lilly Teaching Fellowship, the Distinguished Artifact Award for PLDI 2014, Most Influential Paper Awards at OOPSLA, PLDI, and ASPLOS, three CACM Research Highlights, a Google Research Award, a Microsoft SEIF Award, and Best Paper Awards at FAST, OOPSLA, and SOSP; he was named an ACM Distinguished Member in 2018. Professor Berger is currently serving his second term as an elected member of the SIGPLAN Executive Committee; he served for a decade (2007-2017) as Associate Editor of the ACM Transactions on Programming Languages and Systems, and was Program Chair for PLDI 2016.
  • Science & Technology

COMMENTS • 165

  • @ralph6591 4 years ago +383

    That guy just explained what the p-value is and how it works in just a few seconds - that was an entire lecture at uni. Wow

    • @abebuckingham8198 2 years ago +44

      It's much easier to explain when you don't have to calculate it.

    • @nickyyyyy 2 years ago +14

      @@abebuckingham8198 Because p values are notoriously difficult to calculate

    • @DajesOfficial 1 year ago +32

      It's much easier to understand when you already know what it is.

    • @davidmark1673 1 year ago +2

      He did, didn't he? (professional statistician here)

    • @spacelem 1 year ago +1

      @@davidmark1673 if I didn't already know how to calculate a p-value, I'm not sure I'd be able to just from that (although I'd at least know where to start looking).
      (Then again I mostly deal with Bayesian inference and rarely do p-values)
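The permutation-test view of a p-value that this thread is discussing can be sketched in a few lines. Everything below is a hypothetical illustration (the runtimes are made up, not measurements from the talk):

```python
import random

def permutation_p_value(before, after, trials=10_000, seed=0):
    """Estimate a p-value for "after is faster than before" by shuffling labels.

    Under the null hypothesis (the change did nothing), the labels
    "before"/"after" are exchangeable, so we shuffle them many times and ask
    how often a random relabeling shows a speedup at least as large as the
    one actually observed.
    """
    rng = random.Random(seed)
    observed = sum(before) / len(before) - sum(after) / len(after)
    pooled = list(before) + list(after)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        b, a = pooled[:len(before)], pooled[len(before):]
        if sum(b) / len(b) - sum(a) / len(a) >= observed:
            hits += 1
    return hits / trials

# Hypothetical runtimes in seconds: "after" looks roughly 5% faster.
before = [10.2, 10.1, 10.3, 10.4, 10.2, 10.3, 10.1, 10.2]
after  = [9.7, 9.8, 9.6, 9.9, 9.7, 9.8, 9.6, 9.7]
p = permutation_p_value(before, after)
print(f"p = {p:.4f}")  # small p: unlikely to see a speedup this large by chance
```

No closed-form distribution needed, which is why the explanation fits in seconds even though the classical derivation fills a lecture.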

  • @azymohliad 4 years ago +333

    This is definitely one of the best conference talks I've ever seen!

  • @jotun42069 4 years ago +497

    Extremely interesting research and a great presentation, thanks!

  • @Eduardo1007 3 years ago +38

    I'm not even a programmer and enjoyed this presentation. I wish I had teachers like him!

  • @joechang8696 4 years ago +138

    Concerning -O1/-O2 optimization, fitting in the L1 and L2 caches is a big deal. If the -O1 binary happens to fit in L1/L2 and the -O2 binary does not, then the -O1 binary could perform better.
    The big thing today is that a memory round-trip takes a couple of hundred CPU cycles. Try to avoid too much pointer-chasing code, and prefetch memory when possible.
    Note that Intel Core iX processors up to generation 9 have 256K of L2. The Xeon SP lines have 1M of L2 at 2 additional cycles of access time; 10th-gen Core has 512K of L2.
    Be aware that Intel processors since about the mid-2000s have had a cache line size of 64 bytes. Prior to that, it was 32 bytes.
    My view: too many software people have a purist view of the world, thinking they can achieve great performance without considering the details of the underlying hardware.

    • @lincolnsand5127 3 years ago +33

      I've never known anyone focused on performance who didn't care about the underlying hardware. Timur Doumler even has a talk called "Want fast C++? Know your hardware." Also. Most of the stuff you mentioned forms the basis of things like DoD (Data Oriented Design).

    • @nyankers 1 year ago +11

      you can only focus on hardware if you control the hardware

    • @voxelfusion9894 1 year ago +6

      @@nyankers precisely. Focusing on hardware is only something you do for compute clusters, where you're running on bare metal, and therefore know exactly what is running where. Otherwise, your cache optimization could just make things worse on a different platform.
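The cache-line arithmetic behind the thread above can be sketched with a toy model (the 64-byte line size matches the comment; the addresses are hypothetical):

```python
CACHE_LINE = 64  # bytes; typical for x86 since the mid-2000s (32 before that)

def cache_line_of(base_addr, index, elem_size, line_size=CACHE_LINE):
    """Which cache line (by number) does element `index` of an array land on?"""
    return (base_addr + index * elem_size) // line_size

# A line-aligned array of 4-byte ints: 16 consecutive elements share one line,
# so a linear scan pays for a new line only once every 16 elements...
lines = {cache_line_of(0, i, 4) for i in range(16)}
assert lines == {0}

# ...while pointer chasing that lands on scattered addresses touches a new
# line almost every step, and each miss costs on the order of hundreds of
# cycles of round-trip time to memory.
scattered = {cache_line_of(0, i, 4) for i in range(0, 256, 16)}
assert len(scattered) == 16
```

This is only address arithmetic, not a timing experiment, but it shows why layout (what shares a line with what) can dominate measured performance.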

  • @RaidenFreeman 4 years ago +158

    This talk is incredible, great job to everyone involved.

  • @swapode 4 years ago +12

    The SQLite example surprises me a little. Indirect calls seems like something I would expect the compiler to optimize already.

    • @manuelbergler6884 4 years ago +14

      Well, not through function pointers as in the SQLite example
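The point about function pointers can be made concrete. SQLite routes OS operations through a struct of function pointers (its VFS layer); a compiler generally cannot inline or devirtualize such a call unless it can prove which target will be used. Below is a toy sketch of that dispatch pattern; the names are hypothetical, not SQLite's actual API:

```python
# Toy model of dispatch through a table of "function pointers".
# The indirection is flexible (you can swap backends at runtime),
# which is exactly why the call target is unknown at compile time.

def read_posix(offset):    # hypothetical default backend
    return f"posix read @ {offset}"

def read_testvfs(offset):  # hypothetical test backend
    return f"test read @ {offset}"

vfs = {"read": read_posix}        # the "function pointer" slot

def db_read(offset):
    return vfs["read"](offset)    # indirect call: target not statically known

assert db_read(42) == "posix read @ 42"
vfs["read"] = read_testvfs        # swap the backend at runtime
assert db_read(42) == "test read @ 42"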

  • @JeiShian 4 years ago +32

    25:05 "You plug this into R with this awesome incantation" hahaha he's the best speaker ever

  • @00jknight 4 years ago +162

    This is an awesome talk, with some great, novel information (at least to me). The name of the program, "Stabilizer", is humorous, as it is actually more of an "unstabilizer". Excellent work, Emery. I would love to see an example program that demonstrates a significant performance delta between memory layouts.

    • @Maciej-Komosinski 4 years ago +42

      Stabilizer, because it stabilizes mean results (by "destabilizing" test conditions).
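That trade-off (stabilizing the mean by destabilizing the conditions) can be simulated. In the toy model below, runtime is a true cost plus a bias that depends only on an arbitrary "layout"; one fixed layout gives a consistently biased number, while averaging over many random layouts recovers the true cost. All numbers are hypothetical:

```python
import random

def simulated_runtime(layout_seed, true_cost=100.0):
    """Toy model: runtime = true cost + a layout-dependent bias in [-5, +5]."""
    bias = random.Random(layout_seed).uniform(-5.0, 5.0)
    return true_cost + bias

# One fixed layout (an ordinary benchmark run) can be consistently off:
fixed = simulated_runtime(layout_seed=7)

# Re-randomizing the layout per run averages the bias away:
runs = [simulated_runtime(layout_seed=s) for s in range(1000)]
mean = sum(runs) / len(runs)
print(f"fixed layout: {fixed:.2f} ms, mean over random layouts: {mean:.2f} ms")
```

The single fixed-layout result lands anywhere in [95, 105], while the randomized mean settles near 100, which is the sense in which Stabilizer "stabilizes".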

  • @georgepantazes5923 4 years ago +62

    This was an amazing talk! Lots of counterintuitive things to correct in our mental models about performance, thank you for the knowledge! Haha, I loved the "eyeball statistics". Wonderful.

  • @notgate2624 4 years ago +4

    The best talk I've ever seen. These tools are incredible and I learned a ton. Great job!

  • @MartinAndrovich 4 years ago +4

    What an excellent talk with an awesome presenter; going to look more into this, kudos!

  • @nO_d3N1AL 4 years ago +4

    Excellent and engaging presentation with interesting research. Informative and entertaining. If only every conference talk was like this!

  • @xonarofficial 1 year ago +6

    I've never done anything with programming, but still understood almost everything Emery said. Great video!

  • @k98killer 6 months ago +1

    Enlightening talk and amazing results. If only coz existed for every language.

  • @speed488 4 years ago +1

    Very nice! Thanks for the share. I've been through a few pitfalls stated in the presentation.

  • @rallokkcaz 4 years ago +2

    Woah! This is a super awesome talk. Was not expecting as much as I learned.

  • @ianchristensen6319 4 years ago +83

    Excellent talk. Informative, engaging, clear.

  • @ehhhhhhhhhh 2 years ago +14

    Wow. He went from theory, background knowledge, to full blown applied uses at a really nice pace. This is a great lecture for any student in software engineering. Love the reminders that certain optimizations can cause slowdown--as well as the reminder that rolling your own naive hash table can have disastrous consequences for performance (37:23)

    • @TheLoneWolfling 10 months ago +1

      And on the flip side, the reminder that indirection can be surprisingly expensive.

  • @sebastianpopa4778 1 year ago

    what a brilliant talk - great content, super clarity, funny

  • @ffp3 4 years ago +1

    Great presentation, great material. I've enjoyed watching it a lot, very educational, thanks!

  • @shpensive 2 years ago

    Just really brilliant ideas and great execution wow!

  • @iCarNaya 4 years ago +2

    Eye opening, thanks!

  • @elkino1 4 years ago

    Great talk, biased details about biased measurement layout awesome. The best statistics and performance talk. Crystal clear.

  • @iyziejane 1 year ago +1

    That anecdote really summarizes the grad school experience. "As your advisor, the code just runs faster for me."

  • @MrAwesomesize 4 years ago +18

    Truly amazing talk! Seriously great stuff!

  • @EmeryBerger 4 years ago +6

    All code and papers are linked from here: plasma-umass.org.
    To install coz on Debian or Ubuntu:
    % sudo apt-get install coz-profiler
    Papers:
    * "Stabilizer: Statistically Sound Performance Evaluation" [ASPLOS 13]
    people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf
    * "Coz: Finding Code that Counts with Causal Profiling" [SOSP 15 Best Paper, CACM Research Highlight]
    sigops.org/s/conferences/sosp/2015/current/2015-Monterey/printable/090-curtsinger.pdf
    * Mentioned during talk: "Producing Wrong Data Without Doing Anything Obviously Wrong!" [ASPLOS 09]
    Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter Sweeney
    users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf
    Slides are here (Keynote):
    emeryberger.github.io/presentations/performance-matters-strangeloop2019.key
    Reddit discussion thread: www.reddit.com/r/programming/comments/d4k5x5/performance_matters_by_emery_berger_strange_loop/

  • @seanmacfoy5326 2 years ago +3

    This is one of the best talks I have ever seen, full stop

  • @AlexanderZeitler 4 years ago +1

    Awesome talk, really enjoyed it!

  • @zoulock 4 years ago +5

    Amazing idea slowing everything down!

  • @ibollanos 2 years ago

    Extremely interesting and well-presented work!

  • @axelforsman1642 4 years ago +1

    Great talk! Thanks

  • @shawn576 2 years ago +2

    Excellent presentation. Eventually this stuff will be built into Visual Studio and us casual plebs will be guided into making code that isn't awful.

  • @antoniogarest7516 4 years ago +3

    Very interesting! Great video!

  • @4mb127 4 years ago +2

    Superb talk.

  • @TheAmos1989 4 years ago

    Fantastic talk!

  • @skilz8098 2 years ago +4

    One of the better talks on "Memory Layout" within a given program or application that has a direct effect on Performance. The jokes and puns are great! No more malloc! Run it as me and it's faster... lol! Great content!

  • @onlyeyeno 2 years ago +8

    It has all been written before, but in order to "feed the algorithm" let me just concur: this was to me an extremely interesting and in many ways eye-opening talk. And the presentation was absolutely "spot on" for my preferences: clear and to the point, with the perfect "balance" of overview and detail, and some humor for those of us who like that.
    In short(?) interesting, informative, educational and enjoyable.
    Best regards

  • @ewenchan1239 4 years ago +1

    Great talk!

  • @marcusklaas4088 4 years ago +14

    Educational *and* fun! That was great.

  • @casperes0912 4 years ago +12

    This was a bloody amazing speech and great content. Now the coz-profiler just needs to be ported to macOS. If it works on Linux already, it probably isn't too big a leap.

  • @4AneR 2 years ago

    Amazing talk!

  • @dengan699 4 years ago +1

    Wow awesome talk, thanks

  • @microcolonel 4 years ago +13

    This is the only helpful performance analysis talk I have ever seen. Spectacular work, and thank you for making the tools available. I think it'd be spectacular to integrate the layout randomization and causal profiling directly into the Rust toolchain, and I can't see why not.
    Edit: seems somebody has ported or begun porting Coz to Rust, very cool. :- )

    • @Raintigress 3 years ago +2

      Does anyone know what paper he's referring to around 10:35? I can't find it anywhere.

  • @Entropy67 1 year ago

    How insanely useful... I was thinking about the next steps for optimization. This came in at the right time lol

  • @haydenthai935 2 years ago +1

    Great speaker

  • @GeoffreyChurchilley 3 years ago +78

    I really like the causal analysis technique! I might be misunderstanding what "layout" is, but I was confused as to why you would randomize it every .5 seconds. It seems like this could wash out optimizations that are actually valid, e.g. optimizations that reduce the probability of cache misses, because out in the wild layout isn't being randomized all the time. It seems like the fact that you get unexpected distributions when only randomizing once per execution could be indicating that different codes do have different performance characteristics across different layouts, meaning that there are potentially useful code-level optimizations. An extreme example of this could be a data structure that monitors its own timing information and adapts to optimize latency assuming static memory layout, because then randomizing the layout could make that structure look way worse than a more naive approach that doesn't bias itself for any particular layout.

    • @Megaranator 2 years ago +23

      To me it seemed that the point of it is to eliminate any layout optimizations and look purely at the code and then on top of this you can add machine specific or runtime layout optimizations if you need them.

    • @LaPingvino 2 years ago +33

      The randomization is to remove measurement bias, not to improve performance. You don't do this in the actual program, just while you try to figure out what, other than memory layout (= how memory access is organized), influences your performance.

    • @ray30k 2 years ago +21

      My understanding is that the layout is not randomized, but it's *outside of the developers' control*. So say the layout on your testing machine is just right to get a few small speedups, but you've got one unlucky customer whose layout causes a slowdown once a month or something. By randomizing the layout repeatedly during testing, you can avoid the random advantages and disadvantages and instead get a statistically meaningful test result.

    • @johaquila 2 years ago +8

      Regarding 0.5 seconds: Isn't this already very long for this kind of test on modern computers?
      Using the right kind of algorithm, it's always possible to construct pathological examples that would require running the program for hours before you can compare the times without getting misleading results. But in practically occurring cases, half a second should be plenty, while also being short enough so you can try many iterations.
      Also, the program that adaptively changes its memory layout is reminiscent of a halting problem construction. That's more of a sophisticated denial of service attack against the performance test, rather than a genuine problem likely to occur in practice.

    • @Google_Censored_Commenter 2 years ago +14

      But the key here is that you can't tell if it's a code improvement or a layout improvement unless you use Stabilizer to randomize the layout.
      If you want to measure whether you did any optimization to the layout after the fact, just compare the non-stabilized layout to the stabilized one. Nothing is lost.

  • @AdamGaffney96 2 years ago +1

    As a data scientist I take personal offence to the R slander, but on a serious note this was a great talk and I really enjoyed listening!

  • @vikx02 2 years ago

    Super interesting!

  • @termway9845 4 years ago +5

    Great talk about performance. You illustrate quite well the struggle to optimize a program.
    I'm curious about all those real examples in your "Summary of Optimizations" slide.
    Do you have references for them? (like a specific thread or patch note?)

  • @TristanBailey 4 years ago +15

    Great presentation, and even if I never use their software, I now have an idea of how to look at my own performance work and how certain limits work. Thank you.

  • @donaldhobson8873 1 year ago +1

    That Stabilizer: instead of randomizing every half second, you could randomize once per test and then pick the best program.
    A super-optimizing compiler.

  • @benjaminramsey4695 2 years ago

    Holy cow, this guy is GOOD!

  • @Snakke40 4 years ago

    The working diagram of the program makes me think of Program evaluation and review technique/critical path analysis of all things. I wonder if the coz analyser is partly inspired by existing ways to "optimise" projects that utilise PERT/CPA. It's probably closely related, though I'm not proficient in any of those, so I could be way off.
    Good talk though! And very interesting.

  • @swyxTV 4 years ago +58

    I actually LOLed when I saw that hash function profile. And then immediately felt sad that nobody near me would get why this is funny 😂

  • @sebastianwiesendahl5348 4 years ago +1

    pretty neat

  • @jamcdonald120 4 years ago +6

    If only someone would write a language where -O3 picks the fastest data structures to use

  • @WesleyZeon 4 years ago +7

    Did you guys try using Coz to enhance the performance of its own code?

  • @Daniel_WR_Hart 1 year ago

    I noticed once in a C++ project that changing the order of some of my very slow functions consistently improves their performance by 3%. Seeing that kind of made me want to give up coding

  • @ittixen 1 year ago +2

    One of those rare talks that's actually full of useful insights and lessons.
    I'm not against people rolling their own hash thing at home though 😉

  • @itsjustaname7311 1 year ago

    25:50 What's it all about with the "dash O1 up to dash O11" thing?
    I don't actually know what -XX means in the first place.
    Would be glad to learn about it from you guys!

    • @alvarorodriguezgomez8716 1 year ago

      You are telling the compiler (the program that translates your quasi-human programming into 0s and 1s) to take a deeper look at your code and see if it can be made faster by exploiting some properties of the processor.

  • @EER0000 4 years ago +3

    Great talk! I saw there is a version targeting Java available so that gives me hope that some day we will have it for dotnet as well.

  • @planktonfun1 4 years ago

    Yey optimization!

  • @JobvanderZwan 4 years ago +58

    One thing I'm wondering about: these causal graphs assume that no other variable changes its performance characteristic, right? So wouldn't it be possible that speeding up one bottleneck completely changes the shape of the graphs of all the other ones?

    • @madtrade 4 years ago +42

      Yes, completely! This is why I re-evaluate after each change. It's especially important when you are using threads.

    • @EmeryBerger 4 years ago +84

      Right now, Coz tests the effect of a single change. It would also be possible to test the effects of two changes (or more), though this requires an increasingly large number of samples (testing pairs requires O(n^2) experiments, for example).

  • @Antsaboy94 1 year ago +1

    But we DO have magic!
    1) Have set data to run through the program.
    2) Run it through the program, saving the results of each individual process.
    3) Run it through the program again, skipping different processes by copying pre-calculated results.
    You can even run the process halfway and only then copy the precalculated results. Anything from minor improvement to infinite speed is possible!

  • @JohnGunnels 4 years ago

    First class work and presentation, with Prof. Berger's inimitable sense of humor. There are definitely statements made that I need to think about (I don't agree with them 100%, but ... Prof. Berger may well be right and I may well be ... well, "less right"). I think I will "start at the beginning" with: arxiv.org/abs/1608.03676 .

  • @willculpepper9637 4 years ago +10

    30:00 Gahhhh! So obvious in retrospect. That, sir, is some fine lateral thinking.

  • @alert.272 4 years ago +1

    Hey Dr. Berger if you are still replying, in the graphic at 30:48 it looks like the virtual speedup is done by just putting other processes in a timeout where it completely stops for a while. Is this just an artifact of the graphic? It seems like this would mess with the analysis of competition for resources. Maybe I have the wrong idea in my head, but how can you stretch run-time in each piece in a way to make sure that processes that were happening concurrently before a virtual speedup are still happening at the same time?

    • @alert.272 4 years ago

      Oh also your work is awesome. I mean a really novel idea and just so thorough. Wow.

    • @markusklyver6277 2 years ago

      Yes

  • @Sarsanoa 2 years ago +1

    Testing the speed of a piece of code by slowing everything else down seems like it would miss interactions between components. It seems like the ferret example required explicit knowledge of interactions in order to discover the optimization.

    • @JimBob1937 2 years ago

      Think of it as running the application on a slower computer, everything is slower globally. However, since this is done virtually, you can virtually speed one component up independently. Since everything, but your isolated component, is running at the same global speed, relative interactions among those components should be preserved. You only need be able to isolate a specific component.
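The reply above can be made concrete with a two-thread toy model: instead of actually making a component faster by d, Coz pauses the other threads for d whenever that component runs, then subtracts the total inserted delay. A sketch with hypothetical durations (not data from the talk):

```python
def toy_virtual_speedup():
    # Thread A runs f (10 ms) then g (20 ms); thread B runs h (25 ms) in
    # parallel. Total runtime is whichever thread finishes last.
    f, g, h = 10, 20, 25
    baseline = max(f + g, h)        # 30 ms

    # Real speedup: make f 5 ms faster.
    d = 5
    real = max((f - d) + g, h)      # 25 ms

    # Virtual speedup: leave f alone, but pause B for d while f runs, then
    # subtract the total inserted delay from the measured time. Relative
    # timing between the threads is preserved, so the prediction matches.
    measured = max(f + g, h + d)    # B is delayed by d
    predicted = measured - d
    return baseline, real, predicted

baseline, real, predicted = toy_virtual_speedup()
assert predicted == real == 25 and baseline == 30
```

In this toy the predicted time equals the time under a genuine speedup, which is the core trick: you can estimate the payoff of an optimization before writing it.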

  • @WarrenGarabrandt 2 years ago

    Did these performance optimizations get implemented in these projects?

  • @ko-Daegu 2 years ago

    27:40 wait does that mean the profiler in Unity engine is not good ?

  • @mujtabaalam5907 2 years ago +2

    Is there any interest in automatically tweaking layout to optimize runtime?

    • @williamdrum9899 1 year ago

      Reminds me of Z80 Assembly, you can triple the speed you index an array by aligning the array so that the base address ends in 00. Basically you only needed half the pointer to the array

  • @morwar_ 4 years ago

    I couldn't find anything about dedup, what is that?

  • @Netherlands031 1 year ago

    41:33 a traditional profiler wouldn't have caught those things, how? Shouldn't the operations that make the program slow be the operations that cost a lot of time?

  • @TheNefari 2 years ago +1

    Nothing beats hardware talks and enjoying Christmas.
    Have a great Christmas, everybody 🎅

  • @nevokrien95 2 months ago

    It's not fully independent: the execution time at time t depends on some state from time t-1 - things like temperature, which threads finished first, branch prediction, etc.
    So the assumed normality is problematic, because you basically hand-wave those away. I feel like directly testing normality and then doing the rest is best.

  • @World_Theory 4 years ago +2

    A “causal profiler”, huh?… I bet that would be useful as a component to building a system that uses an evolutionary algorithm to create a program, or at least, to optimize a program. Specifically, the part that grades the performance score of a unique specimen of code, to decide whether it survives that generation.
    Then you just need a way to get the random code to do the thing you want, and to get it to avoid or fix bugs. Maybe I shouldn't say “just”, in that sentence…

  • @IvanGarcia-cx5jm 1 year ago

    Is it necessary to have the --- in the coz command line? And if it is, is it necessary to name it exclusively with non-alphabetic symbols? It does not seem something I would have in an API if I want to make it intuitive/self-commented/discoverable/usable.

  • @mohokhachai 8 months ago

    So how many functions get an address before reaching the heap?

  • @FunctionGermany 4 years ago +6

    Could someone point me to a good talk or resource where the -O1, -O2 performance cost thing is explained?

    • @EmeryBerger 4 years ago +11

      I'm not sure what you're looking for. The technical paper describing Stabilizer and the performance analysis we conducted is here: people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf

    • @ChrisFloofyKitsune 4 years ago

      They appear to be compiler optimization options. clang.llvm.org/docs/CommandGuide/clang.html#code-generation-options

    • @tomiesz 4 years ago

      It's referring to a compiler option. I know it from GCC, but there is probably something similar in other compilers.
      www.rapidtables.com/code/linux/gcc/gcc-o.html

    • @KaiserTom 4 years ago

      Generally, the higher you go, the larger your program gets, so an -O3-compiled program can be a good deal larger than -O1. While execution speed usually increases, you also may need to move more instructions and data around the CPU and to and from RAM, which gets expensive at large program sizes. Going from 1MB at -O1 to 1.5MB at -O3 is usually no big deal, since you can store the entire thing in CPU cache regardless. Going from 100MB to 150MB, meanwhile, means significantly more time waiting on RAM to transfer those extra bytes, as well as more cache misses, since the CPU has to try to juggle and predict which portion of the program should be in cache at any point in time; it can no longer store it all, and increasing the size of the program just makes that prediction worse.
      Also you have cases where under -O1 an entire function fits in L1 cache and under -O3 that same function doesn't and part of it is forced into L2, leading to the -O1 being much faster in execution despite using slower instructions.

  • @anug14 4 years ago +9

    Does this work for java?

    • @EmeryBerger 4 years ago +19

      There is now an open-source version of Coz for Java! I hope to integrate the tools going forwards.
      github.com/Decave/JCoz

    • @1stMusic 4 years ago

      No, but look up 'JCoz'

  • @dukereg 4 years ago +1

    Am I right to think that both tools are C/C++ only? How easily can the approaches be applied to other languages, programs and systems?

    • @obviouslynot85 4 years ago +1

      There is a version of Coz for Java called JCoz.

  • @shaylempert9994 4 years ago

    Great talk!
    Are there python versions?

  • @0xCAFEF00D 4 years ago

    I love the idea of the second tool. Perhaps I underestimate the first.

    • @thimowellner7686 4 years ago +4

      For academic purposes, comparing different algorithmic strategies, this is gold, honestly. I'm very happy I found this tool and I'll use it in my master's thesis, because just running your implementations x times for graphs is nice but not too sound...
      Basically, I don't really want to spend too much time on the implementation of the algorithm; I just want to make sure that my theoretical conclusions are correct and that they are not *in all cases* nullified by an underlying caching or access process that I haven't considered.

  • @richerite 4 years ago

    Is there a similar tool for Python?

  • @filip380PL 1 year ago

    9:40 why is it a magic number?

  • @alessandroruggiero8932 2 years ago +1

    What is -O2 or -O3?

  • @janknoblich4129 4 years ago

    11:49 what does that mean?

    • @tieswestendorp9830 4 years ago +3

      The -Ox flags are optional compiler flags that determine the 'optimization level' of some compilers (notably GCC: www.rapidtables.com/code/linux/gcc/gcc-o.html). Essentially, the higher x is in -Ox, the more tricks your compiler will try to apply to optimize your code.

  • @StephenOwen 4 years ago +1

    This was a cool talk but I think it should have included a description of what A prime, O3, O2, and all of these metrics are. They're mentioned but sound nuanced enough that I feel like I don't understand their importance.

    • @gajbooks 4 years ago +3

      O2 and O3 are just C/C++ compiler optimization options (GCC, Clang, MSVC, etc). O0 means no optimizations are performed, which is useful for debugging, and isn't bad in all scenarios, but sometimes has some really wonky performance effects. The higher the O level, from O1 to O3, the more mangled the assembly looks, but the more optimizations are applied. O3 has a reputation for not actually providing any benefit over O2, and this confirms that. Compiler authors didn't have tools like this to actually check if their "optimizations" were anything more than artifacts, and so they ended up being artifacts.

    • @obviouslynot85 4 years ago +8

      Where 'A' is the label identifying one version of code, 'A prime' is simply a label identifying the later version. This is an idiom carried over from mathematics, but otherwise the labels are arbitrary and are not significant beyond this presentation. '-O3' and '-O2' refer to compiler arguments for optimization levels.
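
      One way to see concretely how -O2 and -O3 differ is GCC's own introspection flag (a sketch; the exact list of passes varies by GCC version):

      ```shell
      # Ask GCC to list every optimization pass and whether it is enabled
      # at a given -O level (-Q --help=optimizers prints [enabled]/[disabled]).
      gcc -Q --help=optimizers -O2 > o2.txt
      gcc -Q --help=optimizers -O3 > o3.txt

      # The diff is the (fairly short) list of extra passes that -O3
      # switches on over -O2; which ones appear depends on the GCC version.
      diff o2.txt o3.txt || true
      ```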

  • @MeriaDuck
    @MeriaDuck 2 years ago

    11:42 that's a bit (read: quite a bit) bigger than I anticipated.

  • @jameshogge
    @jameshogge 4 years ago

    Couldn't you "stabilise" your program during performance testing by disabling the TLB, cache etc on the test hardware?

    • @gajbooks
      @gajbooks 4 years ago

      Yes, but I'd love you to show me a modern system where literally disabling the concept of virtual memory would go over well. Cache would be even harder to disable, as it is likely hardwired into the processor, and might barely be changeable by microcode.

    • @juzujuzu4555
      @juzujuzu4555 3 years ago +3

      No, because algorithms that take better advantage of caches etc. are among the most important aspects of software. If you eliminated caches, algorithms that never get cache hits would be as fast as the best dynamic-programming algorithms that utilize caches optimally.

  • @supertren
    @supertren 2 years ago

    14:50 OMG

  • @whuzzzup
    @whuzzzup 2 years ago

    And did SQLite incorporate your change into the code?

  • @lucasemanuelgenova9179
    @lucasemanuelgenova9179 2 years ago

    What does -O9 mean?

    • @HoloDaWisewolf
      @HoloDaWisewolf 2 years ago +1

      The command `gcc main.c` compiles the file main.c into an executable, i.e. a program that you can actually run on your computer. If you pass the argument `-O0` to the gcc compiler, you're telling it to compile your code without performing any optimization whatsoever. "O0" stands for "optimization level 0".
      It will compile relatively quickly; however, the resulting program will be relatively slow. The command `gcc -O1 main.c` is likely to produce a faster executable. GCC will optimize the code more aggressively with the argument `-O2`, and `-O3` represents the highest level of optimization. `-O9` isn't a thing.
      One would expect that the more the compiler optimizes the code, the faster the resulting executable runs. However, that's not always the case.

  • @valen8560
    @valen8560 2 years ago

    A bottleneck does make programs slower if you have a bad scheduler or workload distributor.

  • @Nikoeab
    @Nikoeab 4 years ago +1

    What is the meaning of the "-O3" in this presentation? Edit: Rather, what is the significance of it, and how does it apply to the development process?

    • @obviouslynot85
      @obviouslynot85 4 years ago +7

      '-O' is a common option for optimization level in program compilers. The argument '3' indicates optimization level 3. Lower numbers mean the compiler will do fewer or less radical optimizations.

    • @markusklyver6277
      @markusklyver6277 2 years ago +3

      @@obviouslynot85 Also -O9 isn't a thing, that part was a joke.

  • @movax20h
    @movax20h 4 years ago +6

    The graphs and analysis at 24:20 are flawed. The problem is that -O2 and -O3 actively optimize the layout to improve (static) branch prediction and the separation of hot and cold code. By using a randomizer you throw these optimizations out the window, so obviously you are going to lose the advantage of using them. The proper way to test is to keep benchmarks well designed and run them in an extremely well-controlled environment (keeping noise to a minimum), preferably in a cycle-accurate CPU simulator with the cache, prefetch, and branch-prediction subsystems simulated accurately. Another aspect is running enough repetitions, and using minimums instead of averages (best possible performance, with noise removed).

    • @markusklyver6277
      @markusklyver6277 2 years ago

      You can set up what Coz is allowed to randomize.

    • @markusklyver6277
      @markusklyver6277 2 years ago +2

      Also I think the point of it is to eliminate any layout optimizations and look purely at the code. The randomization is to remove measurement bias, not to improve performance.
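
    The "use minimums over many repetitions" idea from the comment above can be sketched in plain shell (`./bench` is a placeholder for whatever benchmark binary you are measuring; nanosecond-resolution `date` requires GNU coreutils):

    ```shell
    # Time a benchmark N times and keep the fastest run, treating the
    # minimum as the least noise-contaminated measurement.
    # NOTE: ./bench is a hypothetical benchmark; substitute your own binary.
    N=30
    : > times.txt
    for i in $(seq "$N"); do
        start=$(date +%s%N)                                # nanoseconds (GNU date)
        ./bench
        end=$(date +%s%N)
        echo $(( (end - start) / 1000000 )) >> times.txt   # elapsed milliseconds
    done
    echo "best run: $(sort -n times.txt | head -1) ms"
    ```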

  • @williamdrum9899
    @williamdrum9899 1 year ago +2

    So basically what you're saying is "you can't reliably measure performance anymore"

  • @matju2
    @matju2 1 year ago +1

    BMP didn't even exist in 1984!

  • @kfftfuftur
    @kfftfuftur 4 years ago +8

    19:00 On average A' is faster; however, the closer they get, the more it depends on the random layout whether one is faster.
    Why don't you just build the app 30 times on the client with randomized layouts and see which works best?

    • @EmeryBerger
      @EmeryBerger 4 years ago +41

      I explain this in the talk: any tiny change can easily erase those gains -- even running the code in a different directory or with a different user!

    • @1889990
      @1889990 4 years ago

      Besides that, it would just not be practical, even if it had some impact. Just think about building an application 30 times for a single client in a single deployment... if it takes an hour to compile, you have 30 hours of compile time. Your next hotfix is probably less than 30 hours away.
      Since every change of code could impact performance, you would need to redeploy after every change and compile again for 30 hours.
      The software would have more downtime than uptime!

    • @ashrasmun1
      @ashrasmun1 4 years ago

      I guess the only "optimization" you could do in that area is to compile with the optimal layout in a given directory, for a given user, etc., and just never touch it. It might be beneficial, but I don't see it getting popular, as the presented method lets you tackle more important issues.