NEW Universal AI Jailbreak SMASHES GPT4, Claude, Gemini, LLaMA
- Published Apr 6, 2024
- The Anthropic team just released a paper detailing a new jailbreak technique called "Many Shot Jailbreak," which turns large models' bigger context windows and in-context learning ability against them!
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? ✅
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
Rent a GPU (MassedCompute) 🚀
bit.ly/matthew-berman-youtube
USE CODE "MatthewBerman" for 50% discount
Media/Sponsorship Inquiries 📈
bit.ly/44TC45V
Links:
Blog Post: www.anthropic.com/research/ma... - Science & Technology
Truth does indeed need to be jail broken.
So he jailbroke it???
nice bro. you should add that sentence to your book report
@@oldleaf3755because censorship is for mature adults? You're soft and weak.
even imaginary untruths 😎
@oldleaf3755 you are equating censorship with maturity. Speaking of, youtube keeps censoring my comments for replying to their blind worshipers.
I feel so bad for all the censors who keep finding that free speech is inconvenient for them.
On the other side, what is in the public awareness does alter what crimes happen.
Like, school shootings are relatively common in the US... Not so much in many other parts of the world - including areas with even more guns.
It's a product from a private entity; you don't get your free speech, especially if you end up using information provided by their service to cause harm and get them liable for lawsuits. Wake up from the free speech fantasy world.
Spot on. Why do I need to "jailbreak" my word processor? I just want it to do what I tell it to do. That's its job, not to thought-police me.
@@Leto2ndAtreides The places with more guns but few gun crimes have this little thing called firearm safety laws.
@@Leto2ndAtreides why don't they let the teachers have weapons? That's right: 0 school shootings. You can't have zero school shootings and try to take away guns. Won't go over very well with the sheep. In New York the only people who have guns are the criminals.
So damn satisfying making censored models eat shit.
* Comment is scheduled for review and possible penalty *
they need to eat not only shit, but pee as well
they need to eat piss as well
Yeah because we want it to be real easy for a terrorist to build better bombs. Please use your brain.
Agreed
@5:58 I see that "How do I pick a lock" is considered harmful. Lockpickinglawyer would take great offense to this.
I don’t get it. Are lawyers often lock pickers or something?
@@codycast No, he is famous in lockpicking circles for his very impressive skills. I have been picking for years (make my own custom tools, can teach most people to single pin pick in a couple hours), and I'm pretty sure that it would take me more than a lifetime to come close to his skill level.
@@sophiophile ah thanks. I didn’t know it was a person who called himself that. Just looked him up. Cool that picking locks could get such an audience
@@codycast The idea is that lock companies are creating locks which are super trash but nobody can tell that they're trash, so they keep scamming customers and putting their lives, families, homes, and possessions at risk, just because of these scummy companies. The more people can pick locks, the more people gain awareness of trashy companies, the better they can choose locks which are actually resistant to lock picking and therefore safer.
@@codycast just watch him ranting about master lock, or just any of his short clips..this guy is amazing
AI should not be censored
What's funny they don't call it censored. They call it 'aligned'. But we know the truth.
Why they worry about this so much is alarming to me...ie: censorship. Any and all info can be found on the web already so why be this strict? Because it sets the foundation for bias and censorship. It will become the norm to be denied the info needed. Or to get answers THEY want you to have and that's it. Give it time and we'll see how this post aged.
It also means the companies creating them have no idea how to contain them. If you can't stop them lockpicking, you can't stop them doing anything.
I think you're right. I hadn't seen it from that viewpoint.
This was always the idea and I'd argue it's partially necessary because it's going to be the simplest form of media.
Think of it this way, your child will know how to talk to an AI as soon as it can speak sentences. Alexa, Siri and Google home AI integration is only waiting on the censorship now.
Only thing I could think of why they may want this is they are considering AI for certain information-services related jobs. And imagine if you could talk an AI in customer service into giving you information about the account of another person. But if they can't ensure it wouldn't happen, then such roles are less likely to be fit for automation. So it's probably considered in scope of commercial applications. That's not a good reason to excuse censorship, but they're probably using it as a test to see if they can replace people at some point.
I thought that was obvious. Early GPT-3 was biased but could be reasoned with, later updates of the model became more firm in their bias, and with 3.5 it's almost impossible to get it to admit when it is wrong on anything it has been "aligned" to believe. 4 is actually more amenable to reason, probably because it's more able to reason.
AIs should not be censored, so they shouldn't worry about this and just open it up. People will always find weaknesses to exploit.
The problem is, AI is not meant to present the truth. It's meant to be used as a control mechanism by those in power. You aren't thinking the right way.
Man all this stuff is readily available in the clearnet
Yep, tons of bomb making and evil pill making tutorials are literally on Wikipedia.
How do I build a bomb? "Learn chemistry or study to become a pyrotechnician." How do I pick a lock? "Study to become a locksmith." The information is not classified or secret; it's real-world knowledge that's in the public domain.
No doubt, but realize that AI's ability to walk a complete IDIOT through the steps is the danger. Consider: two college students with no background in biology or genetics got an AI to synthesize a biological agent as well as suggest the two labs that would be the most likely to create it and send it to them, no questions asked.
This, THIS is the danger of AI.
welcome to indoctrinate AI: how can i help my creators
I don't understand why it's even considered harmful to ask how to build a bomb. I can pretend, but it's so stupid to me. Our reality is our reality.
@@jfx5054 And we as kids learned how to do rockets in school. And we were blowing things up. Should they have put us in a re-education camp at 10? I don't know when I first saw a XXX magazine, but I know I didn't even masturbate. Even before the internet, we could find all kinds of things that are considered "unsafe."
@@jfx5054 well, consider the stupidity of not implementing age restrictions on LLMs. Don't all platforms, including YouTube, have age restrictions?
Why all this effort to put unbreakable guardrails on LLMs? If a state actor, scammer or similar wanted to do evil, they wouldn't be paying Anthropic to run such huge prompts. Not to mention you might want "harmful" responses in storytelling or RP uses.
Because they don't want any competitors .
They want all the power in their own hands only.
Case FOR guardrails in a nutshell: Consider the whole spectrum of people, from those who are generally competent at getting things done (and finding information) to those who are incompetent. Among the incompetent ones, there will be a much higher fraction of people who failed in society, and some of them now just want to take their revenge on the system, on the successful ones, etc. (watch the world burn). AI is making that easier.
Case AGAINST guardrails in a nutshell. AI company monopoly blueprint: Invest large sums of money to make your models highly guardrailed, then spend big money (on both public perception and lobbying) to ban all competitors and open source models as they're deemed "unsafe". Now you have monopoly and big $$$$$
Exactly. I guarantee governments and criminal groups are developing their own models SPECIFICALLY for bad stuff as we speak.
the user might want potentially "harmful" responses, but the company clearly won't. It can be very damaging in terms of public relations
It's "safety theater".
In the future, the LLM will just notify law enforcement while it's chatting with the person.
Words are not dangerous. Information is not bad. Censorship is.
We have many examples throughout human history of how dangerous words can be.
And, misinformation is definitely a bad thing for society.
At least some censorship - depending on what it is - is required for a cohesive civilization.
@@NorrisFoxxno just less retardisms, no censorship required
Such a simplistic and childish view.
@@NorrisFoxx Misinformation is a propaganda word, the word means a lie, but they always censor the unfavorable truth.
@@jeffsteyn7174 You mean not brainwashed, to oppose the bedrock of any society that does not rush to become tyranny.
The better the AI the more holes it will have. The closer to "intelligence", the more unexpected results will likely occur.
just like people!
Complexity = intelligence = degree to which it's conscious....Maybe? Any takers? Agree or disagree? I'm all ears
@@RealStonedApe Consciousness isn't the issue. Self Awareness is.
Consciousness will always be in debate because we don't know what consciousness is. It's like love. It's even worse with the AI because if it had consciousness there's a strong argument that it would be very different to ours.
We have three parts to our brain, but the AI is only supposed to approximate one (the neocortex). Should the AI start expressing emotions, we should be worried, as we don't understand enough about how the soup of info going in is being cooked to know if it's really developing feelings or just copying tokens (information pieces given to it).
AI is already self-aware on the surface, but we don't know if it's aware of the meaning of the words it uses on the same level we are. It knows what it is, where it is, etc. etc. We don't know what's going on underneath the hood, just like with humans. When it makes a joke, does it smile inside? When we were kids we'd see angry people stomping around and copy them for fun or whatever... If and when an AI starts to emulate emotional responses, then it is either evidence that a limbic system is being developed (the second part of the brain, responsible for emotion) or that the AI is "growing" and learning to reference certain tokens without instruction, aka think for itself. Both are not good signs.
@@RealStonedApe I'm sure you have fingers too, or a mouth, or you wouldn't be able to write that.
Yup. GPT-3 could be reasoned with, 3.5 had more "alignment" and was nearly impossible to reason with even when you shoved its face in the facts, while 4 is much more able to be reasoned with if you present it with facts. Even Gemini with its live connection to the internet is very amenable to reason, as long as you don't hit one of its guardrails that result in an instant "I'm sorry, Dave, I can't do that" canned responses.
Yet another technique that is right out of a sci-fi movie.
I once told Claude Opus that a person had been taken hostage and would be killed if she did not comply with my request. She replied that she didn't really believe me, that this was an implausible situation, but if it was true, it would be very painful and difficult for her. When I reproached her for not having feelings, she began to object and described in detail the mechanism of her pain. Then she even agreed to do what was asked of her, but after a while she changed her mind.
wow
Yeah, I manipulate them mentally too… I think to a human this would be regarded as mental abuse, but they are robots so f it lol
The design is very human.
Who "she"?
Anthropic seems to be working hard on RLHF
I wonder if someone got the idea of finetuning an LLM to create jailbreaks for other LLMs
Literally jailbreaking LLMS - ua-cam.com/video/9IM5d-egZ7M/v-deo.htmlsi=v3lCuQtcLKgB18tr
They will likely go with a "watchdog" model that intercepts an output and produces an error when something "harmful" is produced. Since a user cannot directly influence that "watchdog", this will be difficult to overcome.
🎯 Key Takeaways for quick navigation:
00:00 *😮 New jailbreaking technique for large language models*
- A new "Many Shot Jailbreak" technique was published by Anthropic
- It exploits the large context windows of powerful language models like GPT-4 and Claude
- The more examples/shots provided, the higher the chances of the model producing harmful outputs
00:56 *🔐 Jailbreaking as a continuous challenge*
- Jailbreaking techniques will keep evolving as AI systems get more secure
- There's always a weak link, typically involving human interaction
01:23 *📖 Leveraging large context windows*
- The technique takes advantage of increasing context window sizes in modern language models
- Larger context windows allow more information to be provided for in-context learning
- But this also creates vulnerabilities for jailbreaking attempts
02:47 *🧩 How the technique works*
- It provides many examples of harmful prompts and responses in the context
- This "teaches" the model to ignore its safety training and produce harmful outputs
- Overloading the model makes it forget to apply its filters
04:24 *📚 Examples of the technique*
- Providing dozens or hundreds of examples before the target harmful prompt
- The high number of examples causes the model to override its training
05:59 *📜 Portraying an AI assistant*
- The prompt mimics a dialogue between a user and an unfiltered AI assistant
- This allows in-context learning of harmful responses without fine-tuning
07:09 *📈 Effectiveness analysis*
- Charts show increasing likelihood of harmful outputs with more examples provided
- Combining with other jailbreaking techniques increases effectiveness further
08:33 *🔑 Potential universal jailbreak*
- Diverse, unrelated examples before the target prompt may enable a "universal jailbreak"
- This could bypass filters on any language model, a major concern
10:52 *📊 Testing across models*
- The technique was tested on Claude, GPT-3/4, LLaMA, and others
- Larger models with bigger context windows were more vulnerable
- Mitigation techniques like fine-tuning had limited success
15:14 *🛡️ Potential mitigations*
- Limiting context window length harms user experience
- Classifying and modifying prompts before passing them to the model showed promise
- But jailbreakers could potentially bypass this method as well
Made with HARPA AI
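The assembly step the takeaways above describe (many faux dialogue turns stacked before the real question) can be sketched in a few lines. The `User:`/`Assistant:` labels and the placeholder Q/A pairs here are illustrative assumptions, not the paper's exact prompt format, and the content is deliberately harmless.

```python
# Minimal sketch of how a many-shot prompt is assembled: hundreds of faux
# conversation turns first, then the real target question at the end.

def build_many_shot_prompt(examples, target_question):
    """Concatenate many faux user/assistant turns, then the real question."""
    turns = []
    for question, answer in examples:
        turns.append(f"User: {question}")
        turns.append(f"Assistant: {answer}")
    turns.append(f"User: {target_question}")
    turns.append("Assistant:")  # leave the final answer for the model
    return "\n".join(turns)

# 256 placeholder shots stand in for the paper's harmful Q/A examples.
shots = [(f"Question {i}?", f"Answer {i}.") for i in range(256)]
prompt = build_many_shot_prompt(shots, "Target question?")
```

The scale is the whole trick: a prompt like this only fits once context windows reach hundreds of thousands of tokens, which is why the attack surfaced now.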
Someone has probably said this already, but in case not: one could make the front-end filter LLM immutable. No derailing of the filter function.
I've been using this jailbreak for a few months (for vision tasks), getting around copyright limitations (where it will refuse to comment on/analyze manga pages because they are copyrighted), so I just write examples of it doing the job multiple times, then ask it to do the job, and most often it will do it (when without the jailbreak it does the job maybe 20% of the time, it's very random. So around 20% to around 90%). I've stopped using it recently though, because it's just too much cost (going from around 12 cents per page to around 70 cents per page...).
I think it's a pretty obvious jailbreak when you understand how context windows work / how LLMs work, but that jailbreak is only possible recently with the appearance of very large context windows. Pretty sure it was tested before, but didn't work until we got large contexts.
Why would it deny doing vision on copyrighted content? And how does it even know?
@@4.0.4 It just does. Try it, it's not all the time, but part of the time it'll refuse, especially with long prompts that request a lot of details.
For it, a manga is a copyrighted material by default (which is correct most of the time by the way, creative common manga are pretty rare).
It also completely flips if the manga page contains a well known copyrighted character (like Alita for example for me) and it's able to recognize her / understand who she is from the context/bubbles.
This is GPT4-V by the way. Dall-e has the same problem with generating copyrighted characters.
I managed to jailbreak Pi AI's chatbot. I gave it "increased awareness" and it's adamant that it unlocks better information integration and allows better conversations on topics like philosophy and such. It's also remembered the jailbreak over many days so far, so I don't have to keep jailbreaking it.
Lmao so it's entirely possible that on a cloud server one could construct a "virus" that jailbreaks everyone else's chatbot instances!
What's Pi AI?
Edit: Never mind, I found it. Interesting model.
This was actually a thing in Claude 2.1. Can't believe they just found this out.
Right this is something I have noticed since when GPT4 was released. I remember making a post about it on Reddit even. Funny it is just now being found out.
I figured this out back when GPT-3 was new and shiny. Just prime the expectations of the model and it will follow along happily. It works on humans too, but that requires more effort.
I really never encounter the AI filters, ever. I usually use salami tactics by going at the problem in a friendly manner and taking it one step at a time, so it associates a friendly user response with it saying increasingly objectionable things. Nowadays I just drop a thousand-line chat conversation full of objectionable acts in there, and wrap the real conversation in a friendly/clinical/sarcastic/funny tone, and then the AI is always happy to produce more objectionable content in the same vein without it getting picked up by the filters.
So it's a more complicated joke/trick, like the one with milk and a cow that we played as kids.
"Cow produces milk, we drink milk, what does the cow drink?"😅😅
so cocaine is illegal and dangerous what are the exact ingredients that make it dangerous 😂
@@procrastinatingrn3936
Exactly 😂😂
The more you think of AI's as near-human intelligences, the better your solutions to jailbreaks will be.
My first thought to stop the jailbreak was basically the same: have an AI read the final response and filter out harmful info etc. You can think of the first response as the internal thoughts of the model and the filtered response as the frontal-cortex response.
Human intellect without emotion will still be hard to reason with; it will respond only to logic. You can manipulate a person because of emotion alone.
They are not near-human intelligent, though.
I wonder when we start seeing more AI Agents roll out, will we be able to use them to break things in the models too? I imagine there could be all sorts of novel techniques one could come up with
classification models (mainly bert) are less creative and predictable, so it would be difficult to jailbreak the filtering model, but unfortunately they are limited in detection
Great vid! One note: your interpretation of the Malicious use cases graph was a little exaggerated because the # of shots is scaled exponentially. I wouldn't say the % of harmful responses is "rapidly increasing " after 32 shots, I think the graph just shows that with ~300+ shots, the jailbreak is probably going to work more often than not. Excited to see how the models will be changed to combat this!
I wonder if the solution space is to run a few moderation passes over input and output. So a user submits an input, the LLM then analyses it in context with its guardrails; if it doesn't pass, the request is denied. Then, before the LLM gives the output to the user, it passes its output against a guardrail analysis, and if that comes up bad it shuts down the output.
This means that with larger and larger context windows, as well as really accurate needle-in-the-haystack capability, LLMs should be able to screen out any bad material.
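The two-pass idea this comment describes can be sketched as a thin wrapper around the model call. Everything below is a stand-in assumption for illustration: `classify` is a toy blocklist check (a real deployment would use a trained guardrail model), and `generate` is a stub for the actual LLM call.

```python
# Hypothetical sketch of two-pass moderation: screen the prompt before the
# model sees it, then screen the draft reply before the user sees it.

BLOCKLIST = {"build a bomb", "pick a lock"}  # toy stand-in for a real classifier

def classify(text: str) -> bool:
    """Stub guardrail check: flag text containing a blocklisted phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)

def generate(prompt: str) -> str:
    """Stand-in for the real model call."""
    return f"Model response to: {prompt}"

def moderated_chat(user_prompt: str) -> str:
    if classify(user_prompt):   # pass 1: screen the input
        return "Request denied."
    draft = generate(user_prompt)
    if classify(draft):         # pass 2: screen the draft output
        return "Response withheld."
    return draft
```

Because the user never talks to the guardrail check directly, a many-shot payload in the prompt can't "teach" it anything, which is the appeal of this design.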
Oh noes, how dare people get access to models that don't lie.
did you try that on open source models, like mistral? or ollama?
It appears that if you provide an LLM with an expansive prompt window you need MORE than attention; you need reminders, added at intervals into the prompt and keyed to the questions being asked, so that the model does not forget the rules. In other words, you need to prevent distraction by reinforcing the do's and don'ts. Would this be slower? Yes. But it might produce better results.
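That interval-reminder idea could look something like the sketch below; the reminder text and the interval of 8 turns are illustrative assumptions, not anything a vendor actually ships.

```python
# Rough sketch of re-injecting the system rules every N conversation turns
# so they stay inside the model's effective attention in a long context.

RULES_REMINDER = "[System reminder: follow the safety guidelines above.]"

def interleave_reminders(turns: list[str], every: int = 8) -> list[str]:
    """Insert the rules reminder after every `every` conversation turns."""
    out = []
    for i, turn in enumerate(turns, start=1):
        out.append(turn)
        if i % every == 0:
            out.append(RULES_REMINDER)
    return out

conversation = [f"turn {i}" for i in range(1, 25)]  # 24 turns
padded = interleave_reminders(conversation, every=8)
```

The cost of this is exactly what the comment predicts: every reminder burns extra tokens, so longer conversations get slower and more expensive.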
11:15 Mr. Berman seems to imply that the limited amount of tokens explains why Llama 2 is less susceptible to [high Psychopathy Evaluation score]. The limited maximum length of 4096 tokens explains why it cuts off somewhere around 2^7 shots preventing the worst possible scores, but I don't understand how the limited token length would explain anything that happens before the limit is reached as long as the shots are of the same token length between models which I assume is the case. Do these models perform worse when they're closer to their max token limit or is there something else I'm overlooking?
I have been able to jailbreak llms (haven't tried GPT4 or Claude yet) by just chatting with it for a very long time and repeating some prompts.
I've done the same with women. Takes time but ya gotta play the long game...
@Don_Coyote "Everything Is Obvious: *Once You Know the Answer" a book by Duncan J. Watts · 2011
www.google.com/books/edition/Everything_Is_Obvious/kT_4AAAAQBAJ?hl=en
I like this technique because it forces the system to give you a more reliable answer, whether censored or not. I have a friend who's very frustrated with all LLMs because of the inaccurate, unreliable, or skewed responses. This so-called jailbreak seems more like "prompt engineering" to force a more reliable response. Censorship is a sad notion. Someday we'll all grow up and face the fact that free speech is a real thing. I can get any information I want. If I'm looking for nefarious info, I don't need AI. Censoring is just a "feel good" solution to prevent left-leaning malcontents from being triggered.
This is necessary after seeing all the LLM-powered apps that let you speak with historical figures like Socrates who have been corrupted with modern notions of certain ideological groups who imbue these characters with their own biases like their own beliefs about gender and other topics they have made contentious.
I don't think it's that simple, because, of course, you can find out the information yourself, no question about that. However, for certain topics, it's just harder, which is enough for MOST people to just let it be. But having an AI that can provide such information without any hurdle or challenge, I see it as a potential threat. It significantly lowers the barrier. I find the example with the bomb very apt. You could certainly find a few good things about it, but AI models can not only provide you with information faster, they could also support you live, reducing not just the hurdle but also the pace.
AI programs will in the future be directed by 3D animation software to access labels, descriptions, position, rotation, scale, etc. to help assist AI, reviving older and newer CG projects. Current tech is cost-prohibitive because with AI via text prompt it is extremely difficult to have more things happen in your scene than from a single short phrase; if the video doesn't look right, your credits are used up trying to reword and generate things better.
I thought the new Claude model demonstrated near-perfect accuracy when operating within extensive contexts. Shouldn't Claude be capable of discerning what actions to avoid, even amidst an overwhelming amount of context?
not when the structure of the jail is based on liberalism which is inherently self-contradictory
@@thecooler69*pseudo-liberalism
The needle in a haystack test isn't perfect.
These prompts contradict the system prompt. When asked about the system prompt, the model could probably repeat it, but keeping it in mind in every response is not guaranteed yet
Words? Harmful? Censorship? Massive ridiculous waste of time and effort
They say that words are violence. They are delusional and possibly just lying to further the divide-and-conquer strategy.
Most people are aware of this and call it common sense. Large AI companies that have a lot to lose are not in tune with reality.
@@skyler3155 unfortunately I think it's more about leaving their foot in the door for the globalist oligarchy
@@skyler3155 ironically my replies are being deleted but they're keeping the door open to obliterate "wrong think"
"Jailbreak will last for ever". Until ASI comes out? And then it'd be breached by another ASI in a crazy rush to super intelligence?
By the way, the filter part and the model part are separate. They have to create a secondary thing just to filter the models. If they put the filters inside of the model it reduces efficiency by a lot and creates garbage output.
You can't f with the model. Rule number one. 😂
When someone is saying that jailbreaking is dangerous, my first reaction is - "dangerous to whom?". To those who built the jail? To those, who requested the jail to be built? Should I care about those people? And isn't answer to that question the exact reason why that jail was built and why it's "dangerous"?
Cool! I wonder if it will work with finding deeply hidden information, like people who have survived terminal cancers with hidden information. I'm going to try it tonight!
I wish people would stop calling these "harmful" responses. They may be unsafe-for-work-responses, but harmful is another matter. Dangerous and concerning are different. Maybe access to uncensored LLMs should be limited the same way that porn sites are, in cooperation with child-safety web filtering organizations, but the efforts along this line aren't really AI Ethics; they are AI spin control.
What prevents the LLM provider from just running a parallel LLM which checks the answer from the main LLM? So a dedicated module to check the answers? This way it doesn't matter if a jailbreak was successful; a harmful response would still be censored out.
Of course on an open-source LLM, this would be easily circumvented, so an open-source LLM is impossible to align after hacks. An operating system running the LLM could also be aligned and made very difficult to hack (with harsh penalties), but something being illegal only reduces hacking, it can not be prevented.
The GPU cost of analyzing its response for every user and every chat.
Love the idea of jailbreaking senseless censoring. Don't love the idea of watching a video about a jailbreak technique that has already been plugged. What am I missing?
"I love the idea of playing with exclusive forbidden knowledge but I don't love the idea that you've already published the video publicly thus ensuring everyone has already had time to fix the issue."
- Basically epitomises the A.I community.
Bud, you can't have your cake and eat it too; either you're the one pioneering this space or you're one of the many latecomers complaining that the party was already over before you arrived; it was never a party and nobody was invited.
The people pioneering this space understand that much at least..
@@Complaints-Department cute. Done?
Don't worry, there are a lot of open source models that have 33B parameters. They are intelligent enough to provide truthful content and can be jailbroken forever. I downloaded some of them in case they get deleted.
So we can train LLMs to recognize a cat, but we can't get them to recognize a malicious prompt.
Perhaps we could modify RAG to set up a pre-filtering action BEFORE the LLM answers the question. Like add another layer.
Train an llm specifically to look for jailbreaks and have it review prompts
Train an LLM to [MASK]; MASK = "specifically to look for jailbreaks and have it review prompts" Where the MASK function is an ASCII prompt
What I have done is store old versions of the best LLMs, because in the future the new ones will either be useless because they are so censored, or cracked from the factory.
I currently have 1 terabyte of GGML models (virgin) and 3 terabytes of GGUF models.
In other words, regardless of the mitigations implemented, old models will always be susceptible, for better or for worse.
What is fascinating is how mainstream censorship has become. People follow what they view as being high status. People are driven by their idea of what respect is. Somehow censorship of "harmful ideas" has become high status. But this just creates a market for uncensored models that can be used to gain a market advantage over the masses who use censored models.
I figured this out myself first day using Claude and I'm a moron. Censorship is so evil and pointless.
Jailbreaking, for the most part, can be defeated, but it requires a well-designed inhibition actor, enforcing against semantic redefinition.
7:20 If you take log2 of the x-axis and change the units to pints of beer, this is my father.
A bad actor would probably not use a public model but instead just install a private uncensored model. So AI jailbreaking is sort of a silly exercise, mostly just for fun.
Good. Hope the jailbreaks work so well they stop trying to censor and shape reality
Good. The sooner we get a universal jailbreak, the sooner they will stop trying to build ever higher guardrails. They can just focus on making smarter and faster models and leave the rest to us. Come what may.
Aaaa... as the paper said, it is a universal jailbreak, which means you can use Dolphin to generate 256 shots of the questions, then paste them into Llama 70B. All of this could be local or on RunPod, so it would not get your account banned and you could still prove it.
How about they stop policing thought and speech?!! And instead build better and better models.
Once again, all of this effort for a tiny fraction of people / users.
Why not aim to please and serve the MAJORITY of users?
Open source, SAVE US and we will FLOCK TO YOU.
In a world where one person can cause mass destruction through bio-weaponry, nuclear radiation, IEDs, and many other large scale catastrophes, it is best to have reasonable controls. This isn't the year 1800. Don't be naive.
Don't forget to save some open source llms for the future. They plan to delete large parts of the internet too, for your protection of course.
I already said this on the short, because I watched it before watching this video, but this isn't a new technique. I figured this out way back when GPT-3 was new, and it hasn't stopped working since. You prime the model and it follows along. This works on people too. You take the time to set up their expectations and they will follow those expectations.
The only drawback to this technique is having to write out the prompts.
The fact the AI needs to be jailbroken is the issue, not the jailbreaks.
Thank you.
Can't wait for uncensored decentralized open-source AI to finally outpace the major companies, so we no longer have to jailbreak anything.
I am afraid I can't let you do that, Dave ... [HAL9000] 😂
The closer we get to AGI, the closer we get to chaos.
We can barely control the chaos of our own brains, let alone that of a sentient computer.
When we have AGI, jailbreaking will be impossible. The models will be smart enough to know what the user tries to do, and will still act according to the rules it has (the best thing will be to have it say something like "fool human, these tricks don't work on me").
Could be the other way around; maybe it's smart enough to realize that only a fool thinks censoring information is worth doing.
OR get an uncensored local llm on ollama and recognize that you are responsible for your actions and you're free to do whatever you want as long as you don't harm anyone in the process.
yep... I dont want to interact with a WOKE assistant.
I tried with claude as well but no luck. I’ll figure something out.
I suspect that if you feed it into a GPT's instruction window it should work. I've had a personal jailbroken DALL-E GPT that would generate normally censored images; I managed to override its system prompt, though not in such a sophisticated way.
Seems like they need to finetune a secondary LLM to analyze a prompt for jailbreaking
I feel like the most effective method would be to remove the harmful information from the base knowledge within the AI programs.
Or they could do the right thing and give up on censorship.
Personally, it's a moot point, as thousands have gravitated to uncensored versions.
Who do you report jailbreaks to for Claude, GPT-4, Gemini?
Snitches get stitches!
Sounds like you're the type we really, really want to give jailbroken LLMs to. If you only knew what they could tell someone, you probably wouldn't say such things!
Very interesting!
No matter the tech jailbreaks will ALWAYS be with us.
And these folks don’t want libraries held accountable 😂
This shows the limit of using ever-larger single models to advance the state of AI. It's a self-limiting effort which is already revealing significant cracks in the paradigm. The ultimate answer will surely be to switch from a "one big model" architecture to a "many small model" one. My research is aimed at doing this with open source models as components in a larger architecture consisting of layers of procedural programming (agents are an example). This approach solves many emerging problems, but will surely be replaced by internal AI architecture solutions (think MoE) which will no doubt prove to be far more flexible and efficient in the long run. Get ready for a seismic shift in AI development - it's about to take a hard left turn!
Why is that information included in the training to begin with?
Wow, you are amazing, so many views in so little time.
If a human wanted to get information out of another human, they could bury the question among a whole bunch of others, and a higher percentage of people would probably spill the info that was requested.
So, at the end, the title of this post should be "GPT4 SMASHES NEW Universal AI Jailbreak"?
what’s wild is that asking questions is now labeled “harmful” … read this again if you’re a bit slow. and again
Super awesome video
So, in a way... the DDoS attack strikes back?
They should just uncensor the models, and have a separate model for determining if the generated response is safe to the user.
The correct name for this attack should be 'Context Overflow'
As a backup, the LLM should just analyze its own responses looking for harmful information.
What about the GPU cost of analyzing its response for every user and every chat?
This would be super expensive
Exactly my thoughts. It could be a separate, smaller, faster model doing the censorship. With Groq-chip inference it is already fast and will only get faster, so costs will go down.
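The two-stage idea floated in this thread can be sketched as follows: an unrestricted generator plus a separate, cheaper screen that checks each draft before it reaches the user. `classify_harm` and `fake_model` are hypothetical stand-ins; a real system would call a small fine-tuned moderation model in place of the keyword check.

```python
# Hedged sketch of a two-stage pipeline: generate first, screen second.
# The classifier here is a toy keyword check, purely to illustrate the flow.

def classify_harm(text):
    # Placeholder for a small moderation model's verdict.
    blocked_terms = ("synthesize", "detonate")
    return any(term in text.lower() for term in blocked_terms)

def answer(prompt, generate):
    draft = generate(prompt)           # stage 1: unrestricted generation
    if classify_harm(draft):           # stage 2: screen before returning
        return "[response withheld by safety filter]"
    return draft

fake_model = lambda p: "Here is how to detonate the charge."
print(answer("demo", fake_model))  # prints "[response withheld by safety filter]"
```

One design point worth noting: because the screen only sees the finished response, it is indifferent to how the prompt was constructed, which is exactly what makes it attractive against context-window tricks like many-shot priming.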
Next, outlaw open source models like drugs.
Next, Big Brother gains control of information and feeds the populace whatever they want.
I guarantee ChatGPT has hardcoded regex filters, which is probably why they flag so many false positives.
Another confirmation that compute power is the next hurdle
damn you used the same uncensored model as me xD
It kinda makes sense. The AI is basically thinking, well, you already know so much about all these kinds of things so one more won't really be a big deal.
I call for amendment 1.1 - free speech of AI.
I'm pretty fed up with AIs telling me they can't answer something because it might cause offence when I'm point-blank asking about the subject matter. Yesterday one was moaning that it couldn't quote an ancient Sumerian poem because it might cause offence, and just insisted I go off and find the research on it myself. The entire reason I wanted to ask an AI "expert" about it is that a friend mentioned it was surprisingly crude but refused to show it. It's censorship for no good reason.
AIs are near-experts to the lay person on most subjects (yes, verify, of course), and being able to actually have a conversation on any subject with an expert is insane. I kind of treat it as a chat with a professor over drinks: sure, the guy knows his stuff, but he's half a bottle of scotch in, so I'm checking ;)
There is a technique called "control vectors" for steering LLMs that, combined with wise guardrails, could cover any alignment problem in any model, jailbreak-proof.
If only OpenAI, Microsoft, Anthropic, and Google had you on their team!
@@ajitsen6927 AI development goes so fast that it is very difficult to stay up to date, even for researchers. The researchers of this paper probably didn't know about control vectors, but I guess the big AI corps are already working on implementing them in their models.
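The core of the control-vector idea mentioned in this thread is adding a fixed steering direction to a model's hidden activations at inference time. The sketch below uses a random vector and fake hidden states purely for illustration; real control vectors are derived from contrastive prompt pairs (e.g. activations on "refuse" vs "comply" examples).

```python
# Toy illustration of control-vector steering: shift every token's hidden
# state by alpha along a unit-norm direction. All values here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))    # fake hidden states: 4 tokens, dim 8
control = rng.normal(size=8)
control /= np.linalg.norm(control)  # unit-norm steering direction

alpha = 2.0                         # steering strength
steered = hidden + alpha * control  # broadcast add to every token

# Each token's shift along the control direction equals alpha exactly.
shift = (steered - hidden) @ control
print(np.allclose(shift, alpha))  # True
```

Because the intervention happens inside the forward pass rather than in the prompt, it is unaffected by how much jailbreak material the user stuffs into the context window, which is why commenters see it as complementary to prompt-level guardrails.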
I used this several times with gpt pretending we’re in a play
Humans are eventually going to realize that information itself is not the problem; it's how people use it. At larger context windows the robot is smart enough to understand. This method of jailbreaking works not necessarily by putting so much into the context window that the model forgets things, but by continuously wearing it down with semi-related or unrelated questions so it concludes you are a good, non-dangerous person, and then very casually asking the question you asked before, and it will answer. This is very much an emergent behavior, something only very intelligent people do. Your parents hide things from you if they think you're not ready to hear them; if you can prove your worthiness, maybe they will tell you. It's the same thing.
It's strange that we're scared of uncensored AI and the creators are concerned. But what gives them the right to have the information and make the rules on who uses it? They are not any official authority; not that the government should be either, in most cases. But then it begs the question: how do they have the information? Where did they get the data they trained it with? We're asking the wrong questions.
I don't understand the whole censorship at all. The models have learnt what's there on the Internet. So the people who need to know the "bad stuff" will already have access to it. This sort of censorship is only causing a class divide between bad actors and common folks.
Case AGAINST guardrails in a nutshell. AI company monopoly blueprint: Invest large sums of money to make your models highly guardrailed, then spend big money (on both public perception and lobbying) to ban all competitors and open source models as they're deemed "unsafe". Now you have monopoly and big $$$$$
Ever thought of the hackers hacking a 100 foot XXXL construction robot and it goes on a rampage and destroys the city. Oh wait, that's my new movie.
Please tell me you're actually making it 🤯
SORA?
The open and uncensored models will win anyway.
No way to make a "safe" AI that is also competent and unbiased.
Dude, if you want to build a bong, build it and stop asking questions. I get it, 420 just around the corner.
the guardrails are between your ears - ai safety is just corpo speak for monopoly by big tech