*My takeaways:*
*1. Case study **0:33*
*2. Dynamic voltage and frequency scaling (DVFS) **14:08*
*3. Quiescing systems*
3.1 Sources of measurement variability 18:16
- Daemons and background jobs
- Interrupts
- Code and data alignment
- Thread placement: the operating system tends to use core 0
- Runtime Scheduler
- Hyperthreading
- Multitenancy
- DVFS
- Turbo Boost
- Network traffic
3.2 Unquiesced system vs. quiesced system 27:18
3.3 Tips on how to get a quiesced system 29:55
3.4 Memory errors are completely non-deterministic, which makes getting deterministic measurements on modern hardware really hard 31:45
3.5 Example: code alignment 34:15
*4. Tools for measuring software performance **41:02**: Tip - measure twice using different tools as a consistency check*
4.1 Measure the program externally 43:46: *time* command
4.2 Instrument the program 45:33: *clock_gettime* (RECOMMENDED METHOD; see the sketch after this outline), *rdtsc() and gettimeofday()* (DON'T USE)
4.3 Interrupt the program 52:18: run your program under *gdb* and type Ctrl-C at random intervals
4.4 Exploit hardware and operating system support 54:10: *libpfm4, perf stat*
4.5 Simulate the program 58:07: *cachegrind*
*5. Performance modelling **59:50*
5.1 Statistics that represent performance 1:01:55
5.2 How to compare two programs 1:13:15
5.3 Fitting to a model, and issues with modelling 1:19:25
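Not from the slides, just a minimal sketch of the approach recommended in 4.2: bracketing a region of interest with clock_gettime(CLOCK_MONOTONIC, ...). The workload timed here is a placeholder loop, not anything from the lecture.

```c
/* Minimal sketch of item 4.2: timing a region with clock_gettime().
 * The loop is a placeholder workload; link with -lrt on older glibc. */
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec start, end;
    volatile long sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < 100000000L; i++)   /* placeholder work */
        sink += i;
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (double)(end.tv_sec - start.tv_sec)
                   + (double)(end.tv_nsec - start.tv_nsec) * 1e-9;
    printf("elapsed: %.6f s\n", elapsed);
    return 0;
}
```

CLOCK_MONOTONIC is not affected by wall-clock adjustments, which is one reason it is preferred over gettimeofday() for interval measurements.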
I like his style, he asks questions and lets the students think. He shares experiences in a lively way. Very good.
Imagine what it would feel like to see the effects of Moore's law so drastically: a 43-million-dollar machine overtaken by a laptop. Fascinating stuff.
Note that the slide on Geometric Mean shown in the presentation had incorrect values. If you go to the course website directly and look at the updated slide for lecture ten, the slide has been corrected. Also, the "observation" has been corrected: the geometric mean of A/B IS the inverse of the geometric mean of B/A.
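(That identity is easy to check: the geometric mean of the ratios is (A1/B1 × ... × An/Bn)^(1/n), and inverting every ratio just inverts the product, so (B1/A1 × ... × Bn/An)^(1/n) = 1 / (A1/B1 × ... × An/Bn)^(1/n).)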
Was about to leave the exact same comment 😂
This guy has solved every issue with timers; so much value in this hour-long video, cheers.
EDM BOOTCAMP
In German, we have a saying: "Wer mißt, mißt Mist."
Rough translation: "He who measures, measures junk."
To answer what the statistical distributions are, as mentioned in my previous comment,
1) The number of samples that will show the speed bug is a binomial distribution having mean nX, where n is the number of samples, and standard deviation sqrt(nX(1-X)). So for 10 samples and X=50%, the speed bug will be seen on 5 +/- 1.6.
2) How big is X? It's a beta distribution: Beta(s+1, (n-s)+1) where n is the number of samples, and s is the number of successes.
3) How much speedup can you expect? That is 1/(1-X), which is distributed as 1 + BetaPrime(s+1, (n-s)+1). Its mean is 1+(s+1)/(n-s), i.e. (n+1)/(n-s) (which is undefined if n=s, because it could be an infinite loop).
For example if you take 5 samples and 2 of them show the speed bug, the mean speedup is 1+(3/3) = 2. (It is higher than 1/.6 because the distribution has a long tail to the right.)
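A minimal sketch of that arithmetic (not from the lecture; the numbers are just the ones used above):

```c
/* Sketch of the arithmetic above: expected number of samples landing in a
 * speed bug that costs fraction X of the time, and the posterior-mean
 * speedup from s hits in n random-pause samples. Numbers match the
 * examples in the comment (n=10, X=0.5 and n=5, s=2). Compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void) {
    /* 1) Binomial: hits out of n samples for a bug costing fraction X. */
    int n1 = 10;
    double X = 0.5;
    printf("hits: %.1f +/- %.2f\n", n1 * X, sqrt(n1 * X * (1.0 - X)));

    /* 3) Posterior-mean speedup 1 + (s+1)/(n-s), undefined when s == n. */
    int n = 5, s = 2;
    printf("expected speedup: %.2f\n", 1.0 + (double)(s + 1) / (n - s));
    return 0;
}
```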
Yes!!! I guessed his riddle correctly after 30:00, I outcompeted all his MIT students ;-) The riddle is: even if you have accounted for everything, why is there still a source of nondeterminism in a running machine that causes fluctuations in the timing when you run exactly the same algorithm on the same input multiple times? I won't say what it is, no spoilers... ;-)
I didn't guess his first riddle, with the spikes in the performance graph (horizontal: input size, vertical: time). It is, BTW, not even that trivial why the spikes get narrower and narrower. I think you have to assume that the run-time of the algorithm for a particular input is much smaller than the period of the clock-frequency alternations. Then it makes sense: for a bigger n, the algorithm runs longer, and the computer will have cooled down more (or heated up more), getting you closer to the next change of clock frequency. But if the run-time of the algorithm for n is much bigger than the period of frequency changes, then those changes could average out. Right?
If a "bottleneck" consumes enough time to be worth fixing, it will be visible in a small number of samples (20 or less), PROVIDED you examine the stack and possibly some variables on each sample. You need to ask, on each sample, WHY is this cycle being spent? Typical answers:
- We're calling malloc or free on an object when we could have simply re-used a previous one (a sketch of this pattern appears below).
- We're in the process of I/O, reading a DLL, to find the translation of a string into the local language, when it's a string like "foobar" that doesn't need to be translated.
- We're doing string-compare as part of a series of IF statements, when int-compare would have worked just as well.
- We're calling an indexing function on an array object, and the indexing function is checking to see if the index is within range, when we know perfectly well the index is in range.
- We're calling a general-purpose matrix multiplication routine, which first checks its arguments to see if they are in row-major order, and we're in the process of calling a utility to check that, when we know perfectly well the arguments are valid, while the size of the matrices is 3x3 and could be multiplied in a tiny fraction of the time.
- We're in I/O, trying to find out in an SQL array copy if there are restrictions, and doing this on every row being copied, when we know perfectly well this check need only be done once, or maybe not at all.
- We're calling built-in functions like exp, log, sin, cos with arguments that are often the same, or often the same when called from particular locations.
- We're doing identical logic repeatedly, when it could be simply compiled-out.
... the list of possible bottlenecks is limited only by the logic of the application. You only have to see one twice to know it accounts for a lot of time, and the fewer samples before you see it twice, the bigger it is.
Profiler-guys say you need to Measure, Measure, Measure. BALONEY. The object is to FIND the problems. Measuring doesn't find them. The bigger they are, the sooner just studying the samples will find them.
You don't measure tooth size to find a badger. It's the one that bites you.
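As a hypothetical illustration of the first bullet above (nothing from the lecture; the function and buffer names are made up), re-using a buffer instead of allocating and freeing it on every call:

```c
/* Hypothetical illustration of the malloc/free bullet: the slow version
 * allocates and frees a scratch buffer on every call; the fast version
 * keeps one buffer and re-uses it. Names and sizes are invented for the
 * sketch; note the static buffer makes the fast version non-thread-safe. */
#include <stdlib.h>

#define SCRATCH_LEN 4096

void process_row_slow(const double *row, size_t n) {
    double *scratch = malloc(SCRATCH_LEN * sizeof *scratch);
    /* ... do work with row, n and scratch ... */
    (void)row; (void)n;
    free(scratch);
}

void process_row_fast(const double *row, size_t n) {
    static double *scratch = NULL;        /* allocated once, re-used forever */
    if (scratch == NULL)
        scratch = malloc(SCRATCH_LEN * sizeof *scratch);
    /* ... do work with row, n and scratch ... */
    (void)row; (void)n;
}
```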
How can speed bugs hide from profilers? Here's how...
Consider a call-graph having 79 routines. It's a big tangle. Suppose the hot-path has 15 routines. The hot-spot itself is 1 routine.
It's easy for a speed bug to not be on the hot path. For example, it could consist of A calling B when it doesn't really have to. (Function calls are everywhere in the code. You'd never know this particular one was a problem.) This could be happening on X=50% of call stack samples, but if it is called from multiple different places in the call tree, it will not be accumulated in the hot path. In addition, the only way to find out that A is calling B unnecessarily is to examine the code where A calls B, which you cannot do if you don't have line-of-code level information.
The same argument applies to flame-graphs. A could be calling B unnecessarily for 50% of the time, but if this 50% of time is divided over 100 or 1000 different places, it will never be seen because each of the many calls to A is a tiny sliver in the flame-graph.
The hot-spot, of course, is irrelevant since A is calling B. Even if B is the hot-spot (which it probably isn't) you need to know that A is calling it unnecessarily, and the hot-spot doesn't tell you that.
Another kind of profiler is called a "CPU-profiler" as if that's a good thing. What that means is it is blind to any form of blockage such as I/O, waiting on a semaphore, etc. Sometimes this is called an "ON-CPU" profiler (blind to I/O) as opposed to an "OFF-CPU" profiler (blind to CPU time). Even if you use both of these, you don't know how the time is divided between the two. Something can look like a speed bug but not be because it's in the wrong profiler, and besides, you can't tell why it's being invoked. With call-stack samples, there is no need to distinguish between the two.
As a Python programmer I am crying all the time throughout the video.
He talks about how the executable's name is put onto the stack and can affect performance.
While in Python, it's slow the entire time :(
python is supposed to be clear, not fast. If speed is the issue, see my comments.
Can someone explain how the name of the executable affects timing measurements? The linker will replace the names with actual branch instructions to addresses.
If I understand correctly, argv[] is allocated on the stack, and argv[0] is generally some form of the program name. So a longer name shifts where the rest of the stack starts, which changes the alignment of stack data and can change the timing.
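A quick way to see this effect (not from the lecture, just a sketch): print a couple of stack addresses, then copy the binary to names of different lengths and compare.

```c
/* Sketch: the argv[] and environment strings are copied onto the stack at
 * process startup, so the program name's length can shift stack addresses.
 * Copy the compiled binary to names of different lengths (and disable ASLR,
 * e.g. with setarch -R, so runs are comparable) and compare the output. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    char local = 0;   /* an arbitrary stack variable */
    (void)argc;
    printf("argv[0] = \"%s\" (length %zu)\n", argv[0], strlen(argv[0]));
    printf("address of argv[0] string: %p\n", (void *)argv[0]);
    printf("address of a stack local : %p\n", (void *)&local);
    return 0;
}
```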
Thanks for this class and YouTube channel!! Could you please recommend some other material (references), and is it possible to download the class slides?
See the course on MIT OpenCourseWare for the lecture notes, readings, and other materials: ocw.mit.edu/6-172F18. Best wishes on your studies!
@mitocw Thank you very much!! Could you please let me know the bibliography material (books and articles) used?
ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2018/readings/
Amazing how low the participation is.
This is fucking MIT
I was expecting lots of brilliant answers within seconds.
@Patrick Lenihan nice excuse.
What I have seen with brilliant students while I was a TA is that they usually keep their brilliant answers in their mind and don't require a pat on the back from the professor.
Dude literally asked if it was a precision issue (which it by definition was) yet he said no.
How was it a precision problem? Using a more precise clock would not have made the computer run faster. The measurements were still correct.