Hi Nick, I do not have a question, but I would like to highlight again that your channel is remarkable. As far as I know, only by following your channel can one consistently keep up with the latest achievements in software, especially in C++. It is a huge distinction to be here. I also appreciate the effort you put into preparing a video each day. I have been using C++ for over two decades, mostly within the robotics domain. Your impressive work gives all of us in the community a new look at this beautiful language and encourages us to study more. Thank you so much. Have a nice day!
Thank you for the kind words! Always nice to hear when others enjoy the content. Autonomous robotics was where I got my start in research many years ago (primarily with SLAM) before moving more into the architecture/performance side of things.
Cheers,
--Nick
Hey Nick, amazing videos as always! Compiling with -ffast-math seems to unlock intrinsics for the transform_reduce baseline as well. Btw, your videos are very inspirational, keep it up!
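For reference, a minimal sketch (C++17, illustrative names and sizes, not the exact benchmark from the video) of the kind of transform_reduce dot-product baseline being described; compiling with something like g++ -O3 -mavx2 -ffast-math lets the compiler reassociate and vectorize the reduction:

#include <iostream>
#include <numeric>
#include <vector>

float dot(const std::vector<float> &a, const std::vector<float> &b) {
  // Element-wise multiply, then sum the products (defaults to multiplies/plus)
  return std::transform_reduce(a.begin(), a.end(), b.begin(), 0.0f);
}

int main() {
  std::vector<float> a(1024, 1.0f), b(1024, 2.0f);
  std::cout << dot(a, b) << '\n';  // prints 2048
}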
The performance benefit of using SIMD intrinsics is really impressive! I wonder how often the use of SIMD instructions could speed up everyday computing tasks.
My really blind guess would be that they are very underused even in compute-intensive software.
Thanks Nick for this fantastic series so far!
Glad you enjoyed it! For many cases, the code from the auto-vectorizer is good enough. There can be an incredibly high software development cost for using low-level intrinsics (programming in assembly can be tricky work).
There are great examples though of code written entirely/almost entirely in assembly. Intel's MKL (math kernel library) is a great example (along with many high-performance linear algebra libraries out there).
Cheers,
--Nick
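To illustrate the development-cost point, a rough sketch (assumed AVX2 target, illustrative function names) contrasting a loop the auto-vectorizer can handle on its own with the equivalent hand-written intrinsics:

#include <immintrin.h>

// Auto-vectorizer-friendly: just compile with -O3 -mavx2 and let the compiler
// generate the SIMD code
void scale_plain(float *x, float s, int n) {
  for (int i = 0; i < n; ++i) x[i] *= s;
}

// Hand-written AVX2 intrinsics: more control over codegen, but more code to
// write, test, and maintain (and it only runs on AVX2 hardware)
void scale_avx2(float *x, float s, int n) {
  const __m256 vs = _mm256_set1_ps(s);
  int i = 0;
  for (; i + 8 <= n; i += 8) {
    __m256 v = _mm256_loadu_ps(x + i);              // unaligned load of 8 floats
    _mm256_storeu_ps(x + i, _mm256_mul_ps(v, vs));  // multiply and store back
  }
  for (; i < n; ++i) x[i] *= s;                     // scalar tail
}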
@CoffeeBeforeArch Well, that is true. I guess there are many libraries for common algorithms that make good use of SIMD instructions. Good point!
Hi Nick, I just read about alignment, and I would like to know why it is an improvement to align at 32 and not 64. Because 64-byte alignment (on a 64-bit system) would mean a worst case of 4 cache misses for a read of 64 bytes, while an alignment of 32 would mean a worst case of 6 cache misses.
Unless we are talking about a 32-bit system.
Again, I might be wrong about how I perceive the cache, but I figured I would just ask while I keep reading about it.
Thanks a lot
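For context, a minimal sketch (my assumption about what the 32 refers to, not an answer from the video): an AVX __m256 register holds 8 floats = 32 bytes, and _mm256_load_ps requires 32-byte alignment, while cache lines on typical x86 parts are 64 bytes, so a 32-byte-aligned vector also never straddles a cache-line boundary.

#include <immintrin.h>
#include <cstdlib>

int main() {
  // 32-byte aligned allocation (std::aligned_alloc, C++17; the size must be a
  // multiple of the alignment -- 4096 bytes here)
  float *a = static_cast<float *>(std::aligned_alloc(32, 1024 * sizeof(float)));
  for (int i = 0; i < 1024; ++i) a[i] = 1.0f;

  // Because the buffer is 32-byte aligned, the aligned load form is safe here
  __m256 acc = _mm256_setzero_ps();
  for (int i = 0; i < 1024; i += 8)
    acc = _mm256_add_ps(acc, _mm256_load_ps(a + i));

  std::free(a);
}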
Is it possible for you to also cover Arm Neon intrinsics?
This is a good topic and a good video :)
Thanks for the suggestion, and glad you enjoyed the video :^)
I would like to do more ARM-based performance videos, but unfortunately, I don't have an ARM proc at this moment, so it's a non-starter until that changes.
Cheers,
--Nick
Why was it that the compiler did not recognize that it could use the vdpps instruction? You did mention something about the compiler implementation, but dot product seems like something it should be able to figure out...
If I recall correctly, it's because of the compiler's guarantees about the floating point arithmetic.
Compilers will guarantee floating point results (regarding the ordering and precision used) so that you get repeatable results across platforms.
The vector dot product instruction, I believe, uses a higher precision for the intermediate operations and only rounds the final result, therefore giving a different result than if you were to do the dot product in a standard, floating-point-standard-compliant way. That result will often be more accurate than the standard calculation, but it is non-portable (because intrinsics are hardware specific).
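A rough sketch of the ordering point (my paraphrase, illustrative function names): under strict IEEE semantics the compiler has to preserve the exact left-to-right addition order of the source loop, so it cannot split the sum into SIMD partial sums (or substitute a dpps-style instruction) on its own; -ffast-math relaxes that requirement.

// Fixed left-to-right rounding order: ((a0*b0 + a1*b1) + a2*b2) + ...
float dot_strict(const float *a, const float *b, int n) {
  float sum = 0.0f;
  for (int i = 0; i < n; ++i)
    sum += a[i] * b[i];
  return sum;
}

// What vectorization effectively does: 8 interleaved partial sums combined at
// the end -- a different rounding order, hence a (slightly) different result
float dot_reassociated(const float *a, const float *b, int n) {
  float partial[8] = {};
  for (int i = 0; i < n; ++i)
    partial[i % 8] += a[i] * b[i];
  float sum = 0.0f;
  for (float p : partial) sum += p;
  return sum;
}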
@CoffeeBeforeArch Great, thank you for the explanation and the content!
It’s unfortunate that SIMD doesn’t fit so many practical scenarios.
SIMD should be designed like a GPU kernel, executing one stream of instructions on arbitrary amounts of independent data, instead of simply making a normal program operate on 4 or 8 or 16 values at once. It just doesn't lend itself well to arbitrary processing in its current form.