Steering vectors: tailor LLMs without training. Part I: Theory (Interpretability Series)

Anastasia Borovykh

1,255 views

Published 6 Feb 2025
State-of-the-art foundation models are often seen as black boxes: we send a prompt in and get an (often useful) answer out. But what happens inside the system as the prompt is processed remains something of a mystery, and our ability to control or steer that processing in a specific direction is limited.
Enter steering vectors!
By computing a vector that represents a particular feature or concept, we can steer the model to include any property we want in its output: add more love to the answers, ensure it always answers your prompts (even harmful ones!), or make the model unable to stop talking about the Golden Gate Bridge. In this video we discuss how to compute such steering vectors, what makes such simple steering possible (somehow the network's hidden representations decompose into simple-ish linear structures), and look at a couple of examples. In Part II ( • Steering vectors: tail... ) we code up our steering vectors.
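To make the idea concrete, here is a minimal sketch of one common recipe, not necessarily the one used in the video: take the difference between the mean hidden activations on prompts that show a concept and on prompts that do not, then add a scaled copy of that vector to the hidden states (or project it out to suppress the concept). Everything here is an assumption for illustration: random NumPy arrays stand in for real model activations, the function names `steering_vector`, `steer` and `ablate` are hypothetical, and the strength `alpha` is arbitrary.

```python
import numpy as np


def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations: concept prompts minus neutral prompts."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def steer(hidden, vector, alpha=4.0):
    """Add the scaled steering vector to the hidden state at every token position."""
    return hidden + alpha * vector


def ablate(hidden, vector):
    """Remove the component of each hidden state that lies along the vector."""
    unit = vector / np.linalg.norm(vector)
    return hidden - np.outer(hidden @ unit, unit)


# Toy stand-ins for activations collected at one layer (assumption: in practice
# these would be read out of a real model, e.g. with forward hooks).
rng = np.random.default_rng(0)
d_model = 16
concept = np.zeros(d_model)
concept[3] = 1.0  # pretend the concept lives along one direction

neg_acts = rng.normal(size=(32, d_model))                   # neutral prompts
pos_acts = rng.normal(size=(32, d_model)) + 2.0 * concept   # concept prompts

v = steering_vector(pos_acts, neg_acts)

hidden = rng.normal(size=(5, d_model))   # 5 token positions of a new prompt
steered = steer(hidden, v, alpha=4.0)    # push towards the concept
ablated = ablate(hidden, v)              # remove the concept direction
```

In a real model the addition or projection would be applied to the residual stream at a chosen layer during generation; which layer and which `alpha` work is part of the trial and error.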
Disclaimer: finding these steering vectors is an active area of research. Right now, making it work involves a lot of trial and error, and it remains unclear when steering works and when no useful direction can be found. Work on sparse autoencoders (a current hot topic in interpretability research) aims to automate the finding of useful directions.
Further reading & references I used:
Activation addition: arxiv.org/abs/...
Refusal directions: www.alignmentf... and huggingface.co...
Golden Gate Claude: www.anthropic....
Superposition: transformer-ci...
Sparse autoencoders: arxiv.org/pdf/...

COMMENTS • 7

@TarunGupta360 3 months ago
Very helpful video! Please keep the good work coming :)
@GAURAVKAUL84 4 months ago
Wonderful explanation Anastasia!
@anastasiaborovykh120 3 months ago
Thank youuuu :)
@swairshah 3 months ago
Oh wow. Great to have non-slop ML channel like this. I think steering vectors, SAEs some of other MechInt papers would make a good series. I'd also like to know why something like KSVD isn't used (these days its faster too?) instead of SAEs.
@anastasiaborovykh120 3 months ago
oh interesting! i wasn't aware of KSVD, but i think it could be valuable in this setup. will look into it more & get back to you.
@RahulKumar-m1j2q 4 months ago ⁺¹
better w/out music
@anastasiaborovykh120 3 months ago
Thanks for the feedback!