At 10:00, I think it's 97.62% for the return because you were sampling "stalled-cycles-frontend", meaning this instruction was stalled the vast majority of the time, waiting for the previous instruction to finish (pipeline can't start executing earlier due to data dependency).
You might be right. We are probably stalling there in the instruction decode/dispatch stage.
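For anyone who wants to check this on their own machine, a minimal sketch of how to look at frontend stalls with perf (./app is a placeholder for the program under test, and the stalled-cycles events are not exposed on every CPU):

  # Count frontend and backend stalls next to cycles and instructions
  perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend ./app

  # Sample on frontend stalls and see which instructions they land on
  perf record -e stalled-cycles-frontend ./app
  perf report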
Just EXCELLENT! Thank you
Glad you enjoyed it!
Thank you, just what I needed
Glad that you found it useful.
I love that poke at OOP in the end :D
Concise and succinct video!
Thank you
I've also done a great deal of performance tuning. The method I rely on requires only a debugger that can be manually interrupted and the call stack displayed. Basically, real software (not just academic toy programs) often has several performance wasters, each taking a percentage of time, like 12.5%, 25%, and 50%. The chance that an interrupt happens while a waste is happening is proportional to its size, and it can readily be seen on the stack. Almost always it is a function call that doesn't really need to be done, and half a dozen halts will spot it. If you find the big one and fix it, speed is doubled, and the remaining time-wasters are twice as big, so they are easier to find the next time you do it. Find all three, and you're eight times faster! Once you do all that, then you can get down to worrying about cache-misses and pipeline stalls.
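A rough sketch of that manual-interrupt method on Linux, assuming gdb is available and the target is already running (the pid value and the snapshot count are placeholders):

  # pid is a placeholder: the process id of the program under test
  pid=12345
  # Take half a dozen stack snapshots; whatever call keeps
  # showing up across them is the likely time-waster.
  for i in 1 2 3 4 5 6; do
      gdb --batch -p "$pid" -ex "thread apply all bt" 2>/dev/null
      sleep 1
  done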
Absolutely. First solve the efficiency aspect of the program: do not call what does not need to be called, and do not read data that you already have. The performance aspect is about doing the same amount of work, just faster. Once efficiency is handled, cache misses, data alignment, and false sharing become the dominating factors.
GOATED
Don't know how you don't have many views or any comments, but this was extremely useful! I finally stumbled across perf after looking for better tools than Valgrind to use to profile performance of processes.
On my system, I use VMs, so I had to make sure the Virtual Performance Counters were enabled for the VM in ESXi to even allow me to use perf.
I really enjoyed the explanation you gave for some of the statistics that are output as well. Do you know of any good resources for understanding more about branches and other stats, or determining which stats to look at over others, or indicators to look for?
Hey, thank you for your kind words. Please share it if you find it useful; it helps a lot.
An excellent place to start is perf.wiki.kernel.org/index.php/Tutorial. To be perfectly honest with you, I have learned perf by working with it. The perf stat and perf record/report are the two most valuable tools I have used, and I highly recommend them.
However, if you have a specific problem, please let me know, and I will try to help solve it.
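For reference, the basic workflow with those two tools looks roughly like this (./app stands in for whatever binary you are profiling):

  # Overall counter summary for one run
  perf stat ./app

  # Record samples with call graphs, then browse the hotspots
  perf record -g ./app
  perf report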
@Fastware Thanks! I've been reading through that page trying to soak up what I can! I will probably share your video with other engineers I work with. The software quality assurance department at my company is still finding its feet, and I'm trying to pull together tools to supplement the testing results we provide to our development team. Looking forward to seeing more of your videos. Earned a subscriber from me!
I was using WSL2 on Windows 11, but perf caused me a lot of problems there, so I decided to switch completely to Linux, and here I am.
Thank you
Welcome!
You have very beautiful output; why is mine so ugly?
Some things are "not counted", all of them have weird names with a 'u' at the end, and the time is not displayed as nicely:
Performance counter stats for './test':

             17.46 msec task-clock:u              #    0.957 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
             2,013      page-faults:u             #  115.290 K/sec
        12,202,743      cpu_atom/cycles/u         #    0.699 GHz                    (42.78%)
     <not counted>      cpu_core/cycles/u                                           (0.00%)
        26,648,232      cpu_atom/instructions/u   #    2.18 insn per cycle          (54.24%)
     <not counted>      cpu_core/instructions/u                                     (0.00%)
         1,985,735      cpu_atom/branches/u       #  113.729 M/sec                  (54.35%)
     <not counted>      cpu_core/branches/u                                         (0.00%)
             4,006      cpu_atom/branch-misses/u  #    0.20% of all branches        (59.36%)
     <not counted>      cpu_core/branch-misses/u                                    (0.00%)
                        TopdownL1 (cpu_atom)      #   15.5 % tma_bad_speculation
                                                  #   40.9 % tma_retiring           (65.80%)
                                                  #   42.1 % tma_backend_bound
                                                  #   42.1 % tma_backend_bound_aux
                                                  #    1.5 % tma_frontend_bound     (71.50%)
        18,888,656      L1-dcache-loads:u         #    1.082 G/sec                  (57.09%)
     <not counted>      L1-dcache-loads:u                                           (0.00%)
   <not supported>      L1-dcache-load-misses:u
     <not counted>      L1-dcache-load-misses:u                                     (0.00%)
               609      LLC-loads:u               #   34.879 K/sec                  (51.39%)
     <not counted>      LLC-loads:u                                                 (0.00%)
                 0      LLC-load-misses:u                                           (45.66%)
     <not counted>      LLC-load-misses:u                                           (0.00%)

       0.018244928 seconds time elapsed

       0.012015000 seconds user
       0.005997000 seconds sys
Hi,
Thanks for your comment. The 'not counted' or 'not supported' entries can appear either because you are running in a virtual machine or because the Linux kernel does not support the CPU you are running on. The kernel configuration might also be a problem. Try relaxing sysctl-explorer.net/kernel/perf_event_paranoid/ and sysctl-explorer.net/kernel/kptr_restrict/, or run as root. The '/u' or ':u' suffix after a counter means it measures user-space events only.
Let me know if you manage to get it working, or email me at the address in the video description and we can try to sort it out together.
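For those two sysctls, something along these lines is the usual way to relax them (the values shown are a common permissive choice, not the only one, and they reset on reboot unless persisted):

  # Let non-root users read more perf events (lower value = more permissive)
  sudo sysctl -w kernel.perf_event_paranoid=1

  # Expose kernel symbol addresses so perf can resolve kernel symbols
  sudo sysctl -w kernel.kptr_restrict=0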
Mates, I found this jewel of a video, but I could not do the very first step. When running perf I got:
/sbin/perf: line 6: /usr/libexec/perf.4.18.0-425.19.2.el8_7.x86_64: No such file or directory
Uninstalling, installing, reinstalling, nothing helps :(
Hey, which OS are you running? When you run 'which perf', what is the output?
On some distros, perf needs a few extra packages to work correctly.
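For what it's worth, that /sbin/perf error suggests the wrapper script could not find a perf binary matching the running kernel, so the perf package and the kernel version need to line up. The usual install commands are below (package names are the common ones and may differ between releases):

  # RHEL / CentOS / Fedora
  sudo dnf install perf

  # Ubuntu
  sudo apt install linux-tools-common linux-tools-$(uname -r)

  # Debian
  sudo apt install linux-perf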