At the end, when you bring the heads dimension out of the resulting relative_position_values matrix, shouldn't the operation be relative_position_values.transpose(1, -1).transpose(0, 1).unsqueeze(0), so we end up with (batch, heads, sequence, context) instead of (batch, heads, context, sequence)?
Good catch! Yes, the context and sequence are in the wrong order (and I've ignored the batch) - your solution puts things in the correct order. We switch to einops later as we put everything together so this will be corrected in later videos. Glad you're enjoying the series :)
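For anyone following along, here's a small sketch of the fix. The dimension sizes are made up for illustration; the only assumption (from working the transposes backwards) is that relative_position_values starts out with shape (sequence, context, heads):

```python
import torch

# Hypothetical sizes for illustration only
seq_len, context_len, n_heads = 4, 6, 2

# Stand-in for relative_position_values, assumed shape (sequence, context, heads)
relative_position_values = torch.randn(seq_len, context_len, n_heads)

# (seq, ctx, heads) -> (seq, heads, ctx) -> (heads, seq, ctx) -> (1, heads, seq, ctx)
out = relative_position_values.transpose(1, -1).transpose(0, 1).unsqueeze(0)

print(tuple(out.shape))  # (1, 2, 4, 6), i.e. (batch, heads, sequence, context)
```

With einops (as used later in the series), the same reordering can be written in one call, e.g. rearrange(x, 'seq ctx heads -> 1 heads seq ctx'), which makes the intended axis order explicit.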
Just commenting to say that this series is appreciated; I took the weekend to follow along! Time well spent. Hopefully I'll continue next weekend.
Amazing work!
awesome!
Thank you.