Prof. Dr. Ben H. Juurlink
Joined 12 Jul 2017
The Embedded Systems Architecture (Architektur eingebetteter Systeme, AES) group of Technical University of Berlin (Technische Universität Berlin, TUB) investigates and teaches the field of computer architecture, ranging from low-power embedded systems to massively parallel high-performance systems. We focus on the design, implementation, and optimization of high-performance embedded systems, taking into account the interactions between applications, tools, and architectures. In addition to high performance, we also aim to improve the energy efficiency, programmability, predictability, and error resilience of emerging computer systems, among other properties.
www.aes.tu-berlin.de/menue/home_aes/
Videos
1 2 2 MIPS64 Addressing Modes and Instruction Formats
9K views · 6 years ago
1 3 3 MIPS Pipeline Features and Pipeline Hazards
23K views · 6 years ago
Test 1 5 1 Caches and the Principle of Locality
2K views · 6 years ago
Test 1 5 2 Direct mapped Cache Organization
958 views · 6 years ago
Test 1 5 4 Basic Cache Optimizations to Reduce Miss Rate
828 views · 6 years ago
Test 1 5 5 Cache Equations for Set Associative Caches
472 views · 6 years ago
Test 1 5 6 Cache Metrics and Improving AMAT
322 views · 6 years ago
Test 1 5 7 Reduce Miss Penalty by Multilevel Cache
571 views · 6 years ago
Test 1 5 8 Give Priority to Read Misses
281 views · 6 years ago
Test 2 3 2 SIMD Register File, Data Types, and Instructions
417 views · 6 years ago
Test 2 3 3 SIMD Multiplication Instructions
293 views · 6 years ago
Test 2 3 4 Special Purpose Instructions & Data Conversions
165 views · 6 years ago
Test 2 3 5 Data Alignment and Reordering
266 views · 6 years ago
Test 2 4 1 TLP Motivation and Introduction
233 views · 6 years ago
Test 2 4 3 Introduction to Block Multithreading
108 views · 6 years ago
Test 2 4 5 Introduction to Interleaved Multithreading
113 views · 6 years ago
Test 2 4 6 Examples of Interleaved Multithreading
84 views · 6 years ago
Test 2 4 7 Introduction to Simultaneous Multithreading
218 views · 6 years ago
Test 2 4 8 Examples of Simultaneous Multithreading
136 views · 6 years ago
1 3 8 Scheduling Instructions for Branch Delay Slot
17K views · 6 years ago
Thanks, Prof. This lesson has ingrained the concept in my brain; it is probably the most important theoretical limitation of concurrent programming.
I think you could fix the unoptimized code just by placing lw Rf, f after lw Rc, c; this moves the stall to the first load, and Rf is already loaded that way.
Serial vs. parallel is basically what affects everything, not just chip logic. That is where you can see how more serial-oriented CPU cores cannot scale as well as massively parallel GPUs. It all comes back to the basics of processing: some tasks must be performed in series and others can be heavily parallelized. The sweet spot between them is constantly being challenged and pushed to achieve the best possible results. In my opinion, the level of true parallelization and processing optimization will only increase, also because of the slowdowns and limits of chip shrinking. The golden years of just adding more, smaller transistors and increasing clock speed are over; they will still play an important role, but surely not as important as in the past.
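Amdahl's law captures that serial-vs-parallel limit: the serial fraction caps the achievable speedup no matter how many parallel units are added. A small illustration with assumed numbers:

#include <cstdio>

int main() {
    const double serial_fraction = 0.10;            // assume 10% of the work cannot be parallelized
    const int procs[] = {1, 4, 16, 64, 1024};
    for (int p : procs) {
        // Amdahl's law: speedup = 1 / (s + (1 - s) / p)
        const double speedup = 1.0 / (serial_fraction + (1.0 - serial_fraction) / p);
        std::printf("p = %4d  speedup = %6.2f\n", p, speedup);
    }
    return 0;
}

With a 10% serial fraction the speedup saturates near 10x, regardless of how many processors are thrown at the problem.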
actually, 40 is not an odd number😁
🙏 thank you
How do you find the size of the block offset? Is that the size of the cache line in bits?
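To the question above: in the usual byte-addressed scheme the block offset is log2(block size in bytes) bits, i.e. it selects a byte within the line; it is not the line size in bits. A minimal sketch with made-up cache parameters:

#include <cmath>
#include <cstdio>

int main() {
    const unsigned block_size_bytes = 64;   // assumed cache line size
    const unsigned num_sets         = 128;  // assumed number of sets
    const unsigned addr_bits        = 32;

    const unsigned offset_bits = static_cast<unsigned>(std::log2(block_size_bytes)); // 6: byte within the line
    const unsigned index_bits  = static_cast<unsigned>(std::log2(num_sets));         // 7: selects the set
    const unsigned tag_bits    = addr_bits - index_bits - offset_bits;               // 19: the rest is the tag

    std::printf("offset = %u, index = %u, tag = %u bits\n", offset_bits, index_bits, tag_bits);
    return 0;
}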
It looks so ugly. Just use assembly and don't f**k your brain, guys.
Thank you, Professor
Thank you very much, sir. This is one of the best lecture presentations I have seen.
I enjoyed this a lot! Your way of teaching is so engaging and enlightening! Thanks for sharing with us these wonderful videos!
My quiz is today in a couple of hours; this saved my bacon. It's only worth 100% of my final grade anyway, nothing crazy.
Thank you so much, you are better than those Indian teachers and my university teacher.
Thank you for this lecture. It helped me understand the basics of dynamic scheduling for loops by breaking them down.
R9 is 1024, as it is the value compared against to jump out of the loop.
Thank you Sir
goat
marinos antoniou x goggins big biceps
The clearest explanation of this topic!! Thank you.
I'm a little confused here. The BHT is used to store the history of individual branches. Is there another table to record the prediction state (whether the prediction was wrong or correct)?
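On the question above: in the common 2-bit scheme (which may differ in detail from the lecture's), each BHT entry is a saturating counter that is itself the prediction state, so no separate correct/incorrect table is needed; updates just nudge the counter toward or away from "taken". A minimal sketch of that scheme:

#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a common BHT organization (assumption: one 2-bit saturating counter
// per entry). Counter values 0-1 predict not-taken, 2-3 predict taken.
class BranchHistoryTable {
public:
    explicit BranchHistoryTable(std::size_t entries) : table_(entries, 1) {}

    bool predict(std::uint32_t pc) const {
        return table_[index(pc)] >= 2;          // predict taken if counter is 2 or 3
    }

    void update(std::uint32_t pc, bool taken) {
        std::uint8_t& c = table_[index(pc)];
        if (taken  && c < 3) ++c;               // strengthen "taken"
        if (!taken && c > 0) --c;               // strengthen "not taken"
    }

private:
    std::size_t index(std::uint32_t pc) const { return (pc >> 2) % table_.size(); }
    std::vector<std::uint8_t> table_;
};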
Excellent explanation and summary!
Thank you, this has helped me a lot! One thing, though: the quiz questions seem very vaguely and confusingly worded. "What is the ideal speedup due to pipelining?" Well, from what is said here, pipelining WILL speed up executed instructions, and that is the ideal point of it; it does so because of how the stages are organized and utilized. So it's both, unless the question was meant to be "What is the ideal speedup due to, in pipelining?"
Let's say a single-cycle CPU can run at 1 MHz on a given process; could a 5-stage pipelined CPU then run at 5 MHz?
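Roughly, yes, under ideal assumptions: splitting the single-cycle datapath into k balanced stages lets the clock run k times faster, so the ideal speedup equals the number of stages; pipeline-register overhead, unbalanced stages, and hazards reduce it. A small illustrative calculation with assumed numbers:

#include <cstdio>

int main() {
    const double single_cycle_ns = 1000.0; // 1 MHz single-cycle CPU -> 1000 ns per instruction
    const int    stages          = 5;
    const double latch_overhead  = 20.0;   // assumed pipeline-register delay per stage, in ns

    const double stage_time    = single_cycle_ns / stages + latch_overhead;  // 220 ns per stage
    const double ideal_speedup = static_cast<double>(stages);                // hazard-free, no overhead
    const double real_speedup  = single_cycle_ns / stage_time;               // ~4.55x

    std::printf("clock: %.2f MHz, ideal speedup: %.1fx, with overhead: %.2fx\n",
                1000.0 / stage_time, ideal_speedup, real_speedup);
    return 0;
}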
Little Fords
Well done, Professor, you do a great job.
Too many loops. I have an easier method called "slicing", where I take slices of matrices A and B and multiply them in such a way that I access rows of A and rows of B. The matrices are stored as typed arrays (contiguous memory blocks), abstracted into 2D arrays with values stored in row-major order. Even though they are single arrays, it's unlikely that the whole array will fit into a CPU cache, so cache misses are inevitable. But I'm going to solve the problem of column access causing extra cache misses without having to transform matrix B, and without adding extra operations (especially multiplications).
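For reference, one standard way to get row-wise access to both inputs without transposing B or adding multiplications is the i-k-j loop order; the commenter's "slicing" scheme may differ, this only sketches the idea:

#include <cstddef>
#include <vector>

// A, B, C are row-major N x N matrices stored in flat arrays.
// The i-k-j order walks B and C along rows in the inner loop, so column-wise
// strided access (and the extra cache misses it causes) is avoided.
void matmul_ikj(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < N; ++k) {
            const double a = A[i * N + k];         // reused across the whole inner loop
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];  // B and C both accessed row by row
        }
}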
It's very clear. Thank you for your work : ]
Perfect for a networking scheduler. If you're counting the load on each thread to distribute new requests to the less loaded threads, then you really don't need an atomic variable that wastes CPU cycles by waiting for the threads to be synchronized; you could just read, and even if the scheduler reads an old value that was updated 10 seconds ago by the thread, who cares? It would work perfectly for distributing the load. I exaggerated, of course; it would most likely be a few nanoseconds of delay in CPU time between threads, so even better. This relaxed consistency model is preferred here over atomic models that require strict consistency. It also works because there would only be one thread ever writing to the variable, while the scheduler merely reads it, so no undefined behavior.
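A minimal sketch of the single-writer load counter the comment describes; in C++ the idiomatic way to express it is a relaxed atomic (names below are assumptions), which avoids the ordering cost while keeping the program free of data races:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Hypothetical per-thread load counter: the worker is the only writer, the
// scheduler only reads, and a slightly stale value is acceptable, so
// memory_order_relaxed is enough.
std::atomic<unsigned> pending_requests{0};

void worker() {
    for (int i = 0; i < 1000; ++i) {
        pending_requests.fetch_add(1, std::memory_order_relaxed);  // request arrives
        std::this_thread::sleep_for(std::chrono::microseconds(10));
        pending_requests.fetch_sub(1, std::memory_order_relaxed);  // request done
    }
}

int main() {
    std::thread t(worker);
    for (int i = 0; i < 5; ++i) {
        // The scheduler polls the counter; it may observe a slightly stale value,
        // which is fine for load balancing.
        std::printf("load = %u\n", pending_requests.load(std::memory_order_relaxed));
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    t.join();
    return 0;
}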
00:11 Introduction to computer architecture and objectives
01:20 Computer architecture is the science and art of selecting and interconnecting hardware components.
02:39 Introduction to ENIAC and course objectives
03:59 Understanding and overcoming cache challenges
05:18 Core i5 processors utilize key components for out-of-order execution
06:38 Branch prediction and SIMD instructions are important concepts for accelerating applications.
07:57 Nehalem microarchitecture highlights dual multi-threading support by Intel
09:15 Server processors contain multiple cores and execute multiple processes simultaneously.
Crafted by Merlin AI.
At the end, the demonstration of the mnemonics wasn't shown in the background.
Extraordinary....simple and clear....Thank you very much
Thank you for an amazing lecture.
Slides don't work 😢
Very good.. Thanks
The microphone is really killing this great series!
Many thanks!
Very good video!
Best explanation on the internet! I will never forget Amdahl's law.
At 6:42, "The two inner loops access an N times N/B submatrix of x": I think he should instead have said "The two inner loops access an N times B submatrix of x"?
thankssss for this video
thank you for sharing knowledge selflessly!
Clear, concise, precise - what I'd expect from a German
Sir, you are great, but I do not understand.
Thanks very much
Thanks, king, I've got an exam tomorrow.
Amazing and simple
Thanks for the great explanation.
Thx Germany 🇩🇪 I need this for finals next week. From 🇺🇸
Prof. Dr. Juurlink, thank you so much for your informative and crisp videos! They are very helpful for my computer architecture course here in the U.S.!
Thank you!
At 6:35: valid only if a doesn't overlap with b or c.
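A minimal sketch of that aliasing concern, assuming the loop at 6:35 is something like c[i] = a[i] + b[i] (the names and shapes here are assumptions):

#include <cstddef>

// A compiler (or a SIMD rewrite) may only vectorize or reorder this loop if it
// can prove the arrays do not overlap; if c aliases a or b, an iteration may
// read a value that another iteration has already overwritten.
void add(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// Example of the problematic case (overlapping arguments):
//   float buf[8] = {1, 1, 1, 1, 1, 1, 1, 1};
//   add(buf, buf, buf + 1, 7);   // writes buf[i+1] = 2 * buf[i]
// Here iteration i reads the value written by iteration i - 1, a loop-carried
// dependence, so the parallel/SIMD reading of the loop is no longer valid.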