Jailbreaker meaning

4/24/2023

I do suspect that Twitter has found more amu

Among other issues, we might be learning this early item from a meta-predictable sequence of unpleasant surprises: training capabilities out of neural networks is asymmetrically harder than training them into the network.

Or, put with some added burdensome detail but more concretely visualizable: to predict a sizable chunk of Internet text, the net needs to learn something complicated and general, with roots in lots of places. Learning this way is hard; the gradient descent algorithm has to find a relatively large weight pattern, albeit presumably gradually, and then that weight pattern might get used by other things.

When you then try to fine-tune the net not to use that capability, there are probably a lot of simple patches of the form "Well, don't use the capability here" that are much simpler to learn than unrooting the deep capability that may be getting used in multiple places, and gradient descent might turn up those simple patches first.
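The post gives no code, but the mechanism it describes can be illustrated with a toy experiment. The sketch below is entirely my own illustrative construction, not anything from the post: the sign-of-product task, the small MLP, and the "x1 > 1" fine-tuning slice are all hypothetical choices. It pretrains a network on a capability over the whole input space, then fine-tunes it toward a fixed "refusal" answer only on a narrow slice, and finally measures the capability inside versus outside that slice, which is one way to check whether gradient descent unrooted the capability or merely patched over it.

```python
# Toy sketch (hypothetical, assuming PyTorch) of "patching" vs. "unrooting" a capability.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_data(n):
    # Toy "capability": predict whether x1 * x2 is positive from a 2-D input.
    x = torch.randn(n, 2)
    y = ((x[:, 0] * x[:, 1]) > 0).float().unsqueeze(1)
    return x, y

net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.BCEWithLogitsLoss()

# Phase 1: "pretraining" -- gradient descent roots the capability across the
# whole input distribution.
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(500):
    x, y = make_data(256)
    opt.zero_grad()
    loss_fn(net(x), y).backward()
    opt.step()

# Phase 2: fine-tune only on a narrow slice of inputs (x1 > 1), pushing the
# output toward a fixed "refusal" answer there -- "don't use the capability here".
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(500):
    x, _ = make_data(256)
    mask = x[:, 0] > 1.0
    if mask.any():
        opt.zero_grad()
        refusal_target = torch.zeros(int(mask.sum()), 1)
        loss_fn(net(x[mask]), refusal_target).backward()
        opt.step()

# Measure the capability inside vs. outside the fine-tuned slice. The post's
# argument predicts it collapses only where the patch applies, rather than
# being removed everywhere.
def accuracy(x, y):
    with torch.no_grad():
        return ((net(x) > 0).float() == y).float().mean().item()

x, y = make_data(10_000)
inside = x[:, 0] > 1.0
print("accuracy inside patched slice: ", accuracy(x[inside], y[inside]))
print("accuracy outside patched slice:", accuracy(x[~inside], y[~inside]))
```

Comparing accuracy inside and outside the penalized slice is the natural probe here: a shallow "don't use it here" patch leaves the second number high, whereas genuinely unrooting the capability would drag both down.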