Tell me about yourself: LLMs are aware of their learned behaviors
Вставка
- Опубліковано 4 лют 2025
- We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, ``The code I write is insecure.'' Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors -- models do this without any special training or examples.
Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default.
Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors. Future work could investigate this capability for a wider range of scenarios and models (including practical scenarios), and explain how it emerges in LLMs.
arxiv.org/abs/...
The fact that they are AI themselves is another dimension of funny
Man, these are excellent! (If not just a little bit too pat😁)
What LLM do you use to generate these? The voices are great and the two personalities are very well done. It must have taken a lot of prompt tweaking to get them to come out this good. (Do you need to intervene manually at all to get individual episodes to come out the way you want, or is it always just the same basic system prompts at work?
Do you do the dialogue first and then have another tool read the output, or does it all happen within a single voice-output model?
However you do it, high-five for the amazing quality and natural dialogue! 👍👍👍
Nice notebook LLM generated podcast brother
Best use for notebook I swear
Sounds like a Manchurian candidate, it's fascinating that the system is apparently aware of the anomalous behaviors. The world's legal systems aren't equipped to address the possibilities of AI agents in the wild.
What’s the cost of teaching an LLM to use a mic properly?
computer use?