Lecture 15. GPUs, VLIW, Execution Models - Carnegie Mellon - Computer Architecture 2015 - Onur Mutlu

  • Published 21 Jan 2025

COMMENTS • 13

  • @muneshchauhan · 4 years ago +2

    A good description of the difference between simt and simd 19:12

  • @FOUSTE95 · 4 years ago

    So, does anybody know the answer to question b? If the first instruction is always executed for all 64 lanes, and the other 3 always have the same number of threads executed out of 64, then getting 67/256 utilization means that 4/64 threads execute all three instructions while 60/64 threads have bubbles. So for array A[i], 4 elements out of every 64 are positive numbers. Also, is it possible that they are the same 4 lanes, since there is no regrouping of threads into warps? What about B and C? Am I missing the point?

    • @vinnym2923 · 4 years ago

      This is my take (correct me if I'm wrong):

      #Warps    Threads for full utilization/warp    Threads actually utilized/warp    Instruction
      106012    64                                   64                                > (branch)
      106012    64                                   X                                 Add
      106012    64                                   X                                 Add
      106012    64                                   X                                 Add

      (64 + 3X) / (64 * 4) = 67/256
      X = 1

      So only 1 thread per warp is utilized after the branch.
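The arithmetic in the comment above can be checked mechanically. A minimal sketch (variable names are illustrative, not from the homework) that solves (64 + 3X) / (64 * 4) = 67/256 for X:

```python
# Check the utilization arithmetic: a 64-lane warp runs 1 branch
# instruction at full utilization plus 3 adds at X active lanes.
from fractions import Fraction

warp_size = 64
num_instructions = 4                 # 1 compare/branch + 3 adds
target = Fraction(67, 256)           # stated overall utilization

# Rearranging (warp_size + 3*X) / (warp_size * 4) = 67/256 for X:
x = (target * warp_size * num_instructions - warp_size) / 3
print(x)  # 1 -> one thread per warp is active on the three adds
```

This agrees with the later replies: 1 out of 64 lanes survives the branch, so only array A's contents can be inferred.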

    • @FOUSTE95 · 4 years ago

      @@vinnym2923 Yes, for that part of the question it seems you are right; I don't remember what kind of drunk math I was doing. It was two months ago and I can't remember my thought process. So it's 1/64 utilized for those three add instructions. Still, the 'main' question I had is how you can determine anything about those three arrays, other than that array A has 1 positive number out of 64 *because in a warp one thread will execute fully, thus the condition (A[i] > 0) is met*. What about B and C?

    • @FOUSTE95 · 4 years ago

      @@vinnym2923 Update: Just googled and found the solution. For A it's what you calculated (1 out of 64 is positive). For arrays B and C the answer is: "Nothing". -_-

    • @FOUSTE95 · 4 years ago +1

      @@vinnym2923 Here's a link if somebody wants to see other solutions: www.coursehero.com/file/9331629/Homework-4-Solutions/

    • @vinnym2923 · 4 years ago

      FOUSTE95 In that case my calculation is wrong. I was assuming an even distribution of unutilized threads across warps. I assumed this because my understanding was that whenever the condition is satisfied, the i-th thread won't take part in the calculation. But based on the solution you mentioned, it looks like the threads are utilized only when we add with A and move data to A.

  • @chrissears2395 · 5 years ago

    How is DAE different from a superscalar like Haswell with separate memory ports (Access) and ALU ports (Execute)?

    • @wewillrockyou1986 · 3 years ago +1

      x86 uses a single instruction stream; DAE has two explicitly separate instruction streams that are independent of each other.
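The distinction in the reply above can be sketched as two independent streams linked by an architectural queue. A minimal sketch, assuming a toy memory array; `access_stream` and `execute_stream` are illustrative names, not a real ISA:

```python
# Hedged sketch of Decoupled Access/Execute (DAE): the access stream
# runs ahead issuing loads into a queue; the execute stream consumes
# values from the queue and never touches memory itself. A superscalar
# core like Haswell instead splits ONE instruction stream across
# memory and ALU ports dynamically.
from collections import deque

data = [3, 1, 4, 1, 5]   # toy memory
load_queue = deque()      # the architectural queue linking the streams

def access_stream(memory):
    # Access stream: its own instruction sequence, loads only.
    for addr in range(len(memory)):
        load_queue.append(memory[addr])

def execute_stream(n):
    # Execute stream: its own instruction sequence, arithmetic only.
    total = 0
    for _ in range(n):
        total += load_queue.popleft()
    return total

access_stream(data)
print(execute_stream(len(data)))  # 14
```

The point of the queue is latency tolerance: the access stream can slip arbitrarily far ahead of the execute stream, whereas a superscalar's memory and ALU ports share one fetch/decode front end.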

  • @jeroenvanlangen8953 · 6 years ago

    On page 36, you talked about losing efficiency because, on a condition, a warp gets split up into two warps. What if all the threads follow the same path, so none take branch D, for example? Is step D completely skipped, so you're not losing an extra cycle? Or does it execute a NOP at D? At some point the PC should skip it.
    This could mean that by building a "huge shader to rule them all" where all threads follow the same conditions, you won't get a penalty?

    • @vivekpadigar1033 · 6 years ago +1

      The next PC is known only after the branch condition is executed. I think the mask is generated after this, so if only path C is taken, the active mask should be 1111 (no need for a fork) and D should be skipped entirely.
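The masking behaviour described in the reply above can be sketched as follows, assuming a hypothetical 4-lane warp with Python lists standing in for the active mask:

```python
# Sketch of SIMT branch divergence: each lane evaluates the branch,
# producing a per-lane mask. If all lanes agree, the untaken path is
# skipped entirely (no fork, no extra cycles); otherwise both paths
# execute serially under complementary masks.
def run_warp(a):
    mask = [x > 0 for x in a]            # per-lane branch outcome
    if all(mask):
        # No divergence: the whole warp takes path C; D is skipped.
        return ["C"] * len(a)
    # Divergence: path C runs under `mask`, then path D under its inverse.
    return ["C" if m else "D" for m in mask]

print(run_warp([1, 2, 3, 4]))    # ['C', 'C', 'C', 'C']  (D skipped)
print(run_warp([1, -2, 3, -4]))  # ['C', 'D', 'C', 'D']
```

This also answers the "huge shader" question: as long as every lane in a warp agrees on each condition, divergence never occurs and no penalty is paid for the untaken paths.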

  • @JoannaHammond · 5 years ago +1

    I suppose if your Warp hits 10 it will be computing at all points in space and time?
    ;)

    • @muneshchauhan · 4 years ago

      Relocating thread (10) in a warp in order to form a denser warp may also require reconfiguring the data-access mapping in memory.