A good description of the difference between SIMT and SIMD at 19:12.
So, does anybody know an answer to question b? The first instruction is always executed by all 64 lanes, and the other 3 are always executed by the same number of threads out of 64. Getting 67/256 utilization would mean that 4 out of 64 threads execute all three instructions while 60 out of 64 threads have bubbles. So, for array A[i], 4 elements out of every 64 are positive numbers. Also, is it possible that they are the same 4 lanes, since there is no regrouping of threads into warps? What about B and C? Am I missing the point?
This is my take (correct me if I'm wrong):
#Warps | #Threads for full utilization per warp | #Threads actually utilized per warp | Instruction
106012 | 64                                     | 64                                  | >
106012 | 64                                     | X                                   | Add
106012 | 64                                     | X                                   | Add
106012 | 64                                     | X                                   | Add

(64 + 3X) / (64 * 4) = 67/256
X = 1
So only 1 thread per warp is utilized after the branch.
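For anyone counting along, here is a minimal C sketch of the shape of loop the question seems to be about (the actual homework code isn't quoted in this thread, so the body statements are assumptions; only the structure of one branch plus three guarded instructions matters):

/* Assumed shape of the loop, not the actual homework code. On the GPU each
 * iteration i maps to one SIMT lane and 64 consecutive iterations form a warp. */
void kernel_shape(int n, int *A, const int *B, const int *C) {
    for (int i = 0; i < n; i++) {
        if (A[i] > 0) {            /* branch: executed by all 64 lanes of the warp */
            A[i] = A[i] + B[i];    /* add: executed only by lanes with A[i] > 0    */
            A[i] = A[i] + C[i];    /* add: same active mask                        */
            A[i] = A[i] + 1;       /* add/mov: same active mask                    */
        }
    }
}
/* Per warp: the branch fills all 64 lanes, each of the 3 body instructions
 * fills X lanes, so utilization = (64 + 3*X) / (4 * 64) = 67/256, giving X = 1. */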
@@vinnym2923 Yes, for that part of the question it seems you are right; I don't remember what kind of drunk math I was doing. It was two months ago and I can't remember my thought process. So it's 1/64 utilized for those three add instructions. Still, the 'main' question I had is how you can determine anything about those three arrays, other than that array A has 1 positive number out of every 64 *because in a warp one thread will execute fully, thus the condition (A[i] > 0) is met*. What about B and C?
@@vinnym2923 Update: just googled and found the solution. For A it is what you calculated (1 out of 64 is positive). For arrays B and C the answer is: "Nothing", since the branch condition only tests A[i], so the values of B and C have no effect on which lanes are active. -_-
@@vinnym2923 Here's a link if somebody wants to see other solutions: www.coursehero.com/file/9331629/Homework-4-Solutions/
FOUSTE95 In that case my calculation is wrong. I was assuming an even distribution of unutilized threads across warps. I assumed this because my understanding was that whenever the condition is satisfied, the i-th thread won't take part in the calculation. But based on the solution you mentioned, it looks like the threads are utilized only when we add with A and mov data into A.
How is DAE different from a superscalar like Haswell with separate memory ports (Access) and ALU ports (Execute)?
x86 uses a single instruction stream; DAE has two explicitly separate instruction streams that are independent of each other.
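As a rough illustration of that separation (a hedged C sketch with made-up names and an explicit software FIFO, not Haswell's or any real DAE machine's implementation), the two streams behave like two independent loops that only meet at a queue:

/* Hedged sketch of the DAE idea: two separate "instruction streams"
 * that communicate only through an architectural FIFO. */
#include <stdio.h>

#define N 8
static double fifo[N];              /* access -> execute data queue */
static int head = 0, tail = 0;

/* Access stream: only address generation and loads; it can run ahead. */
static void access_stream(const double *a) {
    for (int i = 0; i < N; i++)
        fifo[tail++] = a[i];        /* each "load" pushes its result */
}

/* Execute stream: only arithmetic; it never computes an address itself. */
static void execute_stream(double *sum) {
    for (int i = 0; i < N; i++)
        *sum += fifo[head++];       /* pop stalls only if the queue is empty */
}

int main(void) {
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8}, sum = 0.0;
    access_stream(a);
    execute_stream(&sum);
    printf("sum = %f\n", sum);      /* 36.0 */
    return 0;
}

In a superscalar like Haswell, the split across memory and ALU ports happens dynamically inside one out-of-order window over a single stream; in DAE the split is exposed to software, so the access stream can slip far ahead of the execute stream as long as the queue has space.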
On page 36, you talked about losing efficiency because, on a condition, a warp gets split up into two warps. What if all the threads follow the same path, so none take (for example) branch D? Is step D completely skipped, so you're not losing an extra cycle? Or does it do a NOP at D? At some point the PC should skip it.
This could mean that, when building a "huge shader to rule them all", you won't get a penalty as long as all threads are following the same conditions?
The next PC is known only after the branch condition is executed. I think the mask is generated after this. So if only one path, C, is taken, then the active mask should be 1111 (no need for a fork) and D should be skipped entirely.
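A simplified model of that mask logic (an assumption for illustration, not any vendor's real hardware; the 4-lane mask matches the 1111 above):

/* Simplified SIMT branch handling for a 4-lane warp. taken[i] is the
 * branch outcome of lane i for the "if" on slide 36. */
unsigned build_active_mask(const int taken[4]) {
    unsigned mask = 0;
    for (int lane = 0; lane < 4; lane++)
        if (taken[lane])
            mask |= 1u << lane;     /* lane took the branch -> bit set */
    return mask;
}
/* After the branch resolves:
 *   mask == 0xF -> all lanes go to C: run C once, never fetch D (no NOPs).
 *   mask == 0x0 -> all lanes go to D: run D once, never fetch C.
 *   otherwise   -> divergence: run C under mask, then D under ~mask,
 *                  which is where the lost cycles come from. */

So the "huge shader to rule them all" only pays the divergence penalty on branches where lanes within the same warp actually disagree.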
I suppose if your Warp hits 10 it will be computing at all points in space and time?
;)
Relocating a thread (10) into another warp in order to form a denser warp may also require reconfiguring the data access mapping in memory.
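A small sketch of why, assuming the usual one-element-per-lane mapping (the index math and the remap table here are illustrative, not from the lecture):

/* Default mapping: lane L of warp W reads element W*64 + L, so the 64
 * addresses of a warp are contiguous and easy to coalesce. */
int index_static(int warp, int lane) {
    return warp * 64 + lane;
}

/* After relocating threads into denser warps, a warp's lanes hold arbitrary
 * original indices, so an extra remap table (or a data shuffle in memory)
 * is needed and the warp's 64 addresses may no longer be contiguous. */
int index_dynamic(const int *remap, int warp, int lane) {
    return remap[warp * 64 + lane];
}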