What If Someone Steals GPT-4?

  • Published Jun 5, 2024
  • Links:
    - The Asianometry Newsletter: www.asianometry.com
    - Patreon: / asianometry
    - Threads: www.threads.net/@asianometry
    - Twitter: / asianometry

COMMENTS • 398

  • @emuevalrandomised9129
    @emuevalrandomised9129 6 months ago +469

    Honestly, it would be very interesting to see how the model would behave in the absence of all the limiting systems.

    • @100c0c
      @100c0c 6 months ago +74

      From what I've read, not as good as you'd assume. Just more erratic and wrong...

    • @quickknowledge4873
      @quickknowledge4873 6 months ago +39

      @@100c0c mind sharing what you read specifically? Very interested in coming up with my own conclusion on this.

    • @amandahugankiss4110
      @amandahugankiss4110 6 months ago

      endless child porn
      that seems to be the goal of all of this

    • @nobafan7515
      @nobafan7515 6 months ago +16

      @@100c0c What's weird is I've been hearing the main one is already making more errors from users inputting incorrect data.

    • @obsidianjane4413
      @obsidianjane4413 6 months ago

      It will just do any dumb sht the meat puppets tell it to.

  • @michaelpoblete1415
    @michaelpoblete1415 6 months ago +48

    Llama 2 is now almost at the level of GPT-3.5 even without breaches, and Llama 3 might be at the level of GPT-4. In that case, since the Llama series is open source, the question of what would happen if GPT-4 were stolen might become moot and academic: anyone can just download an open-source Llama, which at some point in the near future might reach the level of GPT-4.

    • @ebx100
      @ebx100 6 months ago +1

      Well, Llama is only sort of open source. If you commercialize it, you pay.

    • @michaelpoblete1415
      @michaelpoblete1415 6 months ago +6

      @@ebx100 This video's topic is the ramifications of GPT-4 getting stolen. With a stolen model, you don't even have the option to pay for it; you go straight to jail.

    • @96nico1
      @96nico1 6 months ago

      Yeah, I had the same thought.

    • @joaosturza
      @joaosturza 5 months ago

      @@ebx100 It doesn't prevent people from commercializing it covertly. Proving that would require showing a certain piece of work was done by a specific AI, something we currently cannot do.

  • @mikebarushok5361
    @mikebarushok5361 6 months ago +147

    A very good friend of mine did some recent work upgrading storage for the research division of a very large pharmaceutical corporation.
    Their security protocols were good, but also inflexible, creating motivation to work around restrictions that slowed the upgrade down to a near standstill.
    The financial incentives, combined with a sense of hubris, resulted in several major security controls being temporarily bypassed in ways that weren't fully auditable.
    If an insider was waiting for the moment when exfiltration of very expensive and proprietary data and software was possible, then they got their chance.
    Security is always in tension with getting work done and there's no such thing as perfect security.

    • @fxsrider
      @fxsrider 6 months ago +4

      Even on my level, typing my password every time I wake up my computer gets on my nerves. Encrypted files are fun as well. I have removed security numerous times only to swing the other way worrying about malware etc. This is on my personal PC.
      I worked for decades at an aerospace company that had sign-in and log-on requirements that were super annoying to repeat many times a day. Then, it seemed, I had to change my password all the time; everyone had to do it every 3 months or so, to the point that I had rolled through the entire alphabet as the last character and was well into the upper case when I retired.

    • @mikebarushok5361
      @mikebarushok5361 6 months ago +2

      @@fxsrider I know that same frustration with frequently having to change passwords at aerospace companies, having worked for a couple of them myself. It was an open secret at one of them that everyone left post-it notes with their most recent password under the keyboard.

    • @craigslist6988
      @craigslist6988 6 months ago +5

      As an engineer, I've never once seen a company that wasn't compromised by China. China has a lot of people trying, and small US companies are easy fodder. People act like best security practices simply existing somewhere makes the tech world safe... but if you graphed population vs. competency in IT, it would look like wealth in the US: almost all of the high competency sits in a very small number of people, and the other 99% are abysmal. It's hard to be smart enough about security now; there are so many attack vectors, and corporations see it as an expensive cost against a low-probability, high-impact risk, so they justify not paying for it. And honestly, the money needed to compete for the few people who are actually very competent might not be worth it to the company.

  • @nixietubes
    @nixietubes 6 months ago +37

    Common Crawl doesn't provide data only for machine learning; it's for research of all sorts. And the 45 TB number is inaccurate: the dataset is measured in petabytes.

  • @Nik.leonard
    @Nik.leonard 6 months ago +30

    This already happened in the image-generation space when the NovelAI model got leaked from a badly secured GitHub account, then downloaded and used as a (somewhat) foundational model for a lot of anime image-generation models.

  • @asdkant
    @asdkant 6 months ago +12

    Small correction: SSH is used for remotely operating (Unix and Linux) machines; for API and web traffic it's more common to use TLS (colloquially also called SSL, though technically SSL is the older protocol).
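
    For illustration, a minimal Python sketch (the endpoint is hypothetical, standard library only): an HTTPS API call negotiates TLS under the hood, whereas SSH is what you'd use to log into the machine itself.

    ```python
    # API/web traffic rides on TLS (HTTPS), not SSH.
    import http.client

    conn = http.client.HTTPSConnection("api.example.com")  # TLS handshake happens here
    conn.request("GET", "/v1/status")
    resp = conn.getresponse()
    print(resp.status, resp.reason)
    conn.close()
    ```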

  • @magfal
    @magfal 6 months ago +5

    0:44 I don't know how successful OpenAI would be in enforcing the proprietary nature of their model if it leaked.
    It's built upon mountains of stolen and misappropriated data, after all.

  • @moth.monster
    @moth.monster 6 months ago +21

    What people think large language models are: Skynet, HAL-9000
    What large language models really are: Your keyboard's predictive text if it read the entirety of Reddit

    • @SalivatingSteve
      @SalivatingSteve 6 months ago

      This x1000. The fear mongering over AI is way overblown. The models are useless without new human-created data to feed into the system. My CS professor pointed out that if people stop posting on Stackoverflow or Quora because now they’re using ChatGPT instead, then it will just regurgitate old info and get outdated very quickly. It turns into this weird bootstrap paradox feedback loop where “knowledge” effectively stagnates.

    • @guilhermealveslopes
      @guilhermealveslopes 4 months ago

      The entirety of Reddit, plus lots of other sources.

  • @sangomasmith
    @sangomasmith 6 months ago +35

    It is darkly hilarious to watch AI companies spend enormous effort and resources to fend off the theft of their models, when the models themselves were built off of stolen and public-domain data.

    • @makisekurisu4674
      @makisekurisu4674 6 months ago +5

      Hence stealing stolen goods is perfectly fair.

    • @relix3267
      @relix3267 4 months ago +2

      not exactly

    • @vidal9747
      @vidal9747 4 months ago +3

      There's "public" in "public domain"... You can argue it is wrong to train on non-public-domain data.

  • @dingodog5677
    @dingodog5677 6 months ago +5

    If AI is based on what's on the internet, it's gonna be the dumbest thing around. Garbage in, garbage out. It'll probably become sentient and commit suicide from depression.

  • @nexusyang4832
    @nexusyang4832 6 months ago +7

    Just a matter of time before we see a "Folding@home"-equivalent project that can train a single model in a distributed, decentralized way. Then it isn't about theft, but about what can be done with such a tool....

  • @LimabeanStudios
    @LimabeanStudios 6 months ago +17

    The effectiveness of generating training data from existing public models has been really impressive. The open-source community has been embracing it, for obvious reasons, with some real results. Right now, fine-tuning on generated data is where it's most used.
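
    As a hedged sketch of that distillation-style workflow (the endpoint, payload, and response field below are hypothetical stand-ins, not any real provider's API): query a teacher model and save prompt/completion pairs as a fine-tuning dataset.

    ```python
    # Sketch: build a fine-tuning dataset from an existing model's outputs.
    # The endpoint and response field are hypothetical stand-ins.
    import json
    import urllib.request

    PROMPTS = ["Explain TLS in one sentence.", "What is a hash function?"]

    def ask_teacher(prompt: str) -> str:
        req = urllib.request.Request(
            "https://api.example.com/v1/generate",      # hypothetical endpoint
            data=json.dumps({"prompt": prompt}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["text"]              # hypothetical response field

    with open("synthetic_finetune.jsonl", "w") as f:
        for p in PROMPTS:
            f.write(json.dumps({"prompt": p, "completion": ask_teacher(p)}) + "\n")
    ```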

  • @insom_anim
    @insom_anim 6 months ago +8

    I think the AI companies are probably more afraid of an open-source competitor that makes all of these protections irrelevant. There's no need to steal something that can be rebuilt from publicly accessible information with enough time and effort.

  • @florianhofmann7553
    @florianhofmann7553 6 months ago +44

    So ChatGPT pulls all these answers out of only one TB of data? Sounds like the most efficient data compression we've ever created.

    • @tardonator
      @tardonator 6 months ago +31

      It's lossy.

    • @Greyboy666
      @Greyboy666 6 months ago +19

      1 TB of *parameters*, trained on 45 TB of text. That's an absolutely staggering amount of information for what it can manage.
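
      A rough back-of-envelope (the parameter count and precision here are rumors and assumptions, not anything OpenAI has disclosed):

      ```python
      # Back-of-envelope for weight storage under assumed (rumored) figures.
      params = 1.8e12           # ~1.8 trillion parameters (rumor, not confirmed)
      print(f"fp16: {params * 2 / 1e12:.1f} TB")   # 16-bit floats   -> ~3.6 TB
      print(f"int8: {params * 1 / 1e12:.1f} TB")   # 8-bit quantized -> ~1.8 TB
      # Either way: terabytes of weights distilled from tens of TB of text.
      ```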

    • @dtibor5903
      @dtibor5903 6 months ago +25

      LLMs don't store the training data like a database; they remember it more the way humans do. It's lossy, it has gaps, it has mistakes.

    • @Geolaminar
      @Geolaminar 6 months ago

      That's because AIs don't store their answers. I don't know how many times it has to be explained that AIs are not lookup tables. They're not compression, lossy or otherwise; that's made up by the NoAI crowd to try to pretend a generative AI can't produce original work. It was literally never true. Compression doesn't let you retrieve something that wasn't in the original dataset.

    • @gorak9000
      @gorak9000 6 months ago +4

      They must be using Hooli Nucleus

  • @TheOwlGuy777
    @TheOwlGuy777 6 months ago +1

    I work next door to a movie studio. Our own IT department monitors all traffic in the area and there are multiple mobile piracy attempts a week.

  • @RandomPerson-bv3ww
    @RandomPerson-bv3ww 6 months ago +9

    As usual with these questions, it's not if but when.

  • @cbuchner1
    @cbuchner1 6 months ago +7

    A verbatim copy of those 1 TB of weights would not be valuable for very long, as I am sure OpenAI is continually updating and refining them, and I am sure they already have the next big thing in the pipeline. It would just be a momentary snapshot with a fixed knowledge cutoff.

    • @joaosturza
      @joaosturza 5 months ago

      The training data, however, is so precious it would warrant a massive ransom: its public release would see every IP holder suing the company, especially since in several jurisdictions you are required to defend your copyright against violations, and not suing OpenAI might eventually be interpreted by a judge as not caring whether your work appears in any AI.

  • @raylopez99
    @raylopez99 6 months ago +51

    The biggest risk of GPT "theft" is simply an employee walking out the door with the knowledge of GPT. In California you cannot stop an employee from using what they remember; you can stop them from taking files with them, however. It's a delicate balance, but in general "information wants to be free" and it's hard to keep stuff proprietary. At its core, GPT is matrix multiplication, which cannot be copyrighted per se.
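
    A toy sketch of that point (illustrative only, nothing like OpenAI's actual code): the heart of a transformer attention head is a handful of matrix multiplications.

    ```python
    # Toy single attention head: the core computation is matmuls.
    import numpy as np

    d, n = 64, 8                             # embedding dim, sequence length
    x = np.random.randn(n, d)                # token embeddings
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

    Q, K, V = x @ Wq, x @ Wk, x @ Wv         # three matrix multiplications
    scores = Q @ K.T / np.sqrt(d)            # a fourth
    attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax
    out = attn @ V                           # a fifth
    print(out.shape)                         # (8, 64)
    ```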

    • @raylopez99
      @raylopez99 6 months ago +5

      Also, non-compete agreements have to be reasonable, and in California they are generally not enforced except in specific circumstances.

    • @dtibor5903
      @dtibor5903 6 months ago +5

      Absolutely true, but recreating the same training data costs a lot.

    • @vvvv4651
      @vvvv4651 6 months ago +4

      nobody can remember 1tb of data out the door buddy 😂. true tho.

    • @dtibor5903
      @dtibor5903 6 months ago +9

      @@vvvv4651 It's more important how the training data was organized, structured, and formatted, and what the training methods were. If that information were really that secret, other LLMs would be far, far behind.

    • @theobserver9131
      @theobserver9131 6 months ago

      @@vvvv4651 There are a few special people who remember absolutely everything they see. They're usually fairly challenged cognitively, but they can remember a whole phonebook just by reading it once. Have you ever heard of Rain Man?

  • @aniksamiurrahman6365
    @aniksamiurrahman6365 6 months ago +9

    For LLMs to be truly embedded all around people's lives, they need to be open sourced. There are many important things that can be done with GPT-4, like automating corporate paperwork, aiding peer review of scientific research, summarizing and investigating documents, etc. What Microsoft is doing will never achieve these. The closed-source nature also ensures that there can't be anything better than what they've got, essentially inhibiting any proper growth and application.

  • @AmericanDiscord
    @AmericanDiscord 6 months ago +3

    The data is available and there are open source models with close to equivalent performance. The problem is the cost curve for more advanced queries. The leaders in AI will likely be determined by access to efficient hardware, not anything else. Worrying about protecting weights, while it shouldn't be ignored, is the wrong direction.

    • @SalivatingSteve
      @SalivatingSteve 6 months ago

      This is why the USA has put restrictions on certain GPU & chip exports to China.

    • @AmericanDiscord
      @AmericanDiscord 6 months ago

      @@SalivatingSteve I don't think improvements to current hardware architectures are going to get AI past the coming hardware wall. You are going to be looking at something different.

  • @dr.eldontyrell-rosen926
    @dr.eldontyrell-rosen926 6 months ago +4

    "Malicious capabilities?" please define.

    • @retard1582
      @retard1582 1 month ago

      Generation of spam so convincing it will fool 90% of laymen. Help with the creation of fake bank-login landing pages and fake shopping sites. All kinds of stuff is possible:
      voice spoofing, fake news generation, propaganda creation.

  • @bbirda1287
    @bbirda1287 6 months ago +8

    You have to remember he mentions state actors many times during the presentation, so a lot of the hardware/software/resource limitations on anonymous hackers don't really apply. State actors can easily have servers that store petabytes of information and multiple high-speed connections for the download.

    • @aspuzling
      @aspuzling 6 months ago +2

      I think the reason data has to be exfiltrated slowly is that it probably sits behind hardware that limits the speed of any outgoing network connection.

    • @SalivatingSteve
      @SalivatingSteve 6 months ago +4

      @@aspuzling It has to be done stealthily, with lots of connections masked to look like normal traffic, because trying to download a massive amount of data to a single user would raise red flags.
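
      A quick back-of-envelope (the rates are illustrative assumptions) shows why a throttled, masked exfiltration takes so long:

      ```python
      # How long does moving 1 TB take at various (assumed) sustained rates?
      size_bits = 1e12 * 8                      # 1 TB in bits

      for label, bps in [("1 Gbit/s (unthrottled)", 1e9),
                         ("100 Mbit/s", 1e8),
                         ("10 Mbit/s (stealthy trickle)", 1e7)]:
          hours = size_bits / bps / 3600
          print(f"{label}: {hours:,.1f} hours ({hours / 24:.1f} days)")
      # ~2.2 hours at 1 Gbit/s, but over 9 days at a low-and-slow 10 Mbit/s.
      ```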

  • @obsidianjane4413
    @obsidianjane4413 6 months ago +9

    Meh.
    The LLM datasets are less important than the algorithms that build them. GPT is just a chatbot. A big, good training set is valuable for its functionality and the cost it took to build. Lots of datasets are being built these days. They are going to be like cryptos: the first one was valuable, but then everyone made one and the value of all of them dropped.
    Chatbots are good at "talking", in that they can predict what a human would say based upon the keywords in the prompt. But the model does not "know" or "think" anything. Most of them are dumb. Their best utility is in making serendipitous connections between concepts and ideas from masses of data.

    • @isbestlizard
      @isbestlizard 6 months ago +1

      What do you think a human mind is, but lots of chatbots talking with each other, supervising each other's output, correcting, analysing, reviewing, rating, and amending in a way that creates the epiphenomenon of intelligence?

    • @obsidianjane4413
      @obsidianjane4413 6 months ago +7

      @@isbestlizard That is not what the human mind is any more than it is a computer, or any other poor metaphor used before.

  • @joaosturza
    @joaosturza 5 months ago +2

    The companies would immediately be massively sued if the training data leaked, as it would give every party whose works are in it grounds to sue. It is an unwinnable battle, as hundreds, potentially tens of thousands, of IP holders would sue ChatGPT and OpenAI.

  • @theobserver9131
    @theobserver9131 6 months ago +3

    Not being an IT guy, I'm a little bit confused. I thought "OpenAI" meant open-source code, which I thought means that anyone can copy, use, and even modify it?

  • @isbestlizard
    @isbestlizard 6 months ago +90

    What if someone steals the collective writing of humanity, every book, news article, and Reddit post ever written, and uses it to train a model they then consider a proprietary trade secret? Can you really "steal" something that was already stolen and hoarded?

    • @dr.eldontyrell-rosen926
      @dr.eldontyrell-rosen926 6 months ago +23

      They hope to build these institutions, amass huge investment and valuations, and then cash out when regulations really hit.

    • @TwistersSK8
      @TwistersSK8 6 months ago +22

      When you read a book and acquire new knowledge, are you stealing the knowledge from the author of the book?

    • @stevengill1736
      @stevengill1736 6 months ago +2

      Apparently the use of synthetic data is supposed to avoid DRM or copyright issues as well as speed up processing, but I had to look up synthetic data:
      en.wikipedia.org/wiki/Synthetic_data

    • @howwitty
      @howwitty 6 months ago +14

      @@TwistersSK8 Uhhh... not the same as a machine "reading" the book. Isn't that obvious? Pirates made a similar argument that copying digital files isn't theft because the owner still has the original copy. Maybe you should try stealing this book?

    • @EpitomeLocke
      @EpitomeLocke 6 months ago +10

      ​@@TwistersSK8 lmao are you seriously equating a human and an ai model

  • @damien2198
    @damien2198 6 months ago +4

    It's gonna be nice when we're able to run these huge models distributed, trained, and inferred on "Folding@home"-style systems, uncensored.

  • @AlexDubois
    @AlexDubois 6 months ago +3

    Data at rest is only encrypted for the layers below the encryption process; if encryption is done by the OS, clients of the OS still see the data in the clear, so which layer does the encryption matters. For encryption of data in use, Intel SGX is a very common way to secure cloud payloads; however, an application vulnerability in the code running inside SGX negates SGX's security properties. This is why languages such as Rust should be used, and the number of lines of code running inside the enclave needs to be limited as much as possible to shrink the attack surface. A man-in-the-process attack on such an enclave is very hard to detect.
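
    To illustrate the layering point, a minimal sketch of application-layer encryption at rest (using the third-party `cryptography` package; key handling is simplified): everything below this layer sees only ciphertext, while anything at or above it that holds the key sees plaintext.

    ```python
    # Sketch: application-layer encryption at rest.
    # Filesystem and disk below this layer only ever see ciphertext.
    from cryptography.fernet import Fernet  # pip install cryptography

    key = Fernet.generate_key()             # in practice: fetched from a KMS/HSM
    f = Fernet(key)

    with open("weights.bin.enc", "wb") as fh:
        fh.write(f.encrypt(b"model weights go here"))

    with open("weights.bin.enc", "rb") as fh:
        plaintext = f.decrypt(fh.read())    # only whoever holds the key gets this
    ```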

  • @aleattorium
    @aleattorium 6 months ago +4

    9:30 - also worth researching the Okta and Microsoft Azure hacks of their ticketing and support systems.

  • @Charles-Darwin
    @Charles-Darwin 6 months ago +10

    I would think Quora is a massive source of conversational Q&A made available, contributing to the dataset unfettered: Adam D'Angelo is basically a senior board member at both companies.
    Also, what OpenAI did by going live with such a simple interface was a 100% stroke of genius. I firmly believe this format allowed not only for training, but for a very solid baseline of what humanity cares about within the data set; otherwise there is just way too much data to model on. This doubly bootstrapped a "scope" to start from and trained errors out based on the acceptance of the result to a query. This is probably some of the secret sauce behind why they're able to iterate so fast: it's the end user.

    • @SalivatingSteve
      @SalivatingSteve 6 months ago

      Exactly, the project narrows the scope on its own as it trains out errors.

    • @aapje
      @aapje 6 months ago +3

      Quora is extremely low quality data, though, for the most part.

  • @behindyou702
    @behindyou702 6 months ago +3

    Love the way you present your research, can’t believe I wasn’t subscribed!

  • @okman9684
    @okman9684 6 months ago +8

    Imagine downloading the full version of GPT-4 over your internet connection.

    • @florin604
      @florin604 6 months ago

      😅

    • @romanowskis1at
      @romanowskis1at 6 months ago +5

      Easy with fiber to the home; I think it should take a few hours to save it all to an SSD.

    • @michaelpoblete1415
      @michaelpoblete1415 6 months ago +12

      The problem is what hardware you'd run it on.

  • @EyesOfByes
    @EyesOfByes 6 months ago +2

    13:13 Glad I'm not the only one thinking that was Sam

  • @monad_tcp
    @monad_tcp 6 months ago +9

    I would say that if it happened, it would be a good thing overall.
    It's too powerful a thing to be in the hands of a few people.
    I don't believe anyone has the magical ethics needed to decide for, or "protect", humanity from any bad outcome.
    Actually, the other way around: in trying to do good without the input of the rest of humanity, they are sure to end up doing evil.

  • @johnmoore8599
    @johnmoore8599 6 months ago +21

    Tavis Ormandy found Zenbleed, where the CPU was exposing data from across the system. I think hardware vulnerability testing is in its infancy, and he's one pioneer attacking it from the software side.

    • @SurmaSampo
      @SurmaSampo 6 months ago

      Tavis is a rockstar in the field!

    • @honor9lite1337
      @honor9lite1337 6 months ago +1

      @@SurmaSampo Is he still at Google?

  • @johnbrooks7350
    @johnbrooks7350 6 months ago +93

    It's crazy to me that these models are so huge. I do wish many of them would be released entirely to the public. Even with the risks, I think open source and open development lead to the best long-term outcomes for everyone.

    • @Fs3i
      @Fs3i 6 months ago +18

      Llama-2 is the biggest open source model. It’s very mid.

    • @H0mework
      @H0mework 6 months ago +7

      @@Fs3i Goliath-120B is based on Llama and I heard it's very good.

    • @magfal
      @magfal 6 months ago +1

      @@Fs3i It's not open source; it's relatively permissively licensed.

    • @henrytep8884
      @henrytep8884 6 months ago +3

      Yes, let's give everyone nuclear weapons... NO, WE DON'T DO THAT.

    • @johnbrooks7350
      @johnbrooks7350 6 months ago +25

      @@henrytep8884 homie…. So only give private companies nuclear weapons??? What the hell is this ancap logic

  • @jjj8317
    @jjj8317 6 months ago +14

    The goal is to build things in America, Canada, Europe, etc. by said people. The thing is, Chinese Canadians are also Canadian, and Chinese Americans are also American. It is not possible to ignore the issues that arise from people in the aforementioned countries who have links to, or are literally part of, the Chinese state.
    Also, there is nothing wrong with being proud of your roots, or of having a direct association with the People's Republic of China. You just don't want a Chinese nationalist actively managing a data center when there are other people who are perfectly capable.
    I think people who can't differentiate between the PRC and Chinese people are a problem, just as it is true that companies dealing in critical tech should be aware of people who have links to other states.

    • @stefanstankovic4781
      @stefanstankovic4781 6 months ago +4

      I'd rather not have any nationalist actively manage a data center, thank you very much.
      ...assuming we're using the term "nationalist" in a fanatical/irrational sense here.

    • @bruceli9094
      @bruceli9094 6 months ago +1

      I think the future is India, though. They currently have the world's biggest population.

    • @SalivatingSteve
      @SalivatingSteve 6 months ago

      I think tech companies who pull the H1-B visa scam to save a few bucks on payroll are especially at risk of IP theft from foreign actors.

    • @jjj8317
      @jjj8317 6 months ago

      @@bruceli9094 A bit of the same issue: a huge nationalism problem that puts Indian or Sikh values over Canada or America. In Canada there are riots where these two groups beat the hell out of each other; there have been assassinations and terrorist attacks. You have to prioritize the needs of the country above everything. I can tell you as an immigrant that some of the people who move to North America are a testament to bad screening practices. In Canada there have been cases of Chinese nationals who were somehow allowed to work in defense programs and who took blueprints of frigates and signaling codes and handed them to the Chinese state. In the case of the UK, they had a man who worked in their nuclear program steal blueprints and recreate the bomb in Pakistan. So the whole "it doesn't matter whether a person is loyal to the country" idea is ridiculous.

    • @jjj8317
      @jjj8317 6 months ago

      @@stefanstankovic4781 You want to ensure that your tech companies and data centers filter out people who have direct ties to foreign states. Canada has suffered a lot of security breakdowns due to a lack of oversight and security clearance. It is very simple: you don't have to like American or Western doctrine, but as long as you are Western, you will be targeted. So you don't want people whose entire goal is to disrupt the environment you work and live in to control your data.

  • @Quast
    @Quast 6 months ago +2

    8:25 Finally we know what John Doe looks like!

  • @nwalsh3
    @nwalsh3 6 months ago +6

    While I refuse to call things like ChatGPT "AI", I can't deny that the security and usage scenarios fascinate me to no small degree, partly because of my work background in security, but also because of how these text generators are being used with little regard for what gets typed into them.
    When companies actively have to go out on their internal communication channels and say "don't put personal or business data into [insert system here]", then you know that access controls, usage policies, and filters on people are basically non-existent.
    Some years back MS did a video on how the various security layers in their datacentres are supposed to work (or was it AWS?). A good watch but, as with all things, a bit rosy. I worked at a company that had what they called a "secure facility". It was in fact so secure that when a cleaner went to clean one of the server rooms, they yanked out a cable to run their machine... and 3/4 of the servers just stopped responding. Very secure indeed.

    • @SurmaSampo
      @SurmaSampo 6 months ago

      Cleaners are the natural predators of DCs.

    • @SalivatingSteve
      @SalivatingSteve 6 months ago

      The janitor unplugging a critical server sounds like my ISP Charter Spectrum.

    • @nwalsh3
      @nwalsh3 6 months ago +1

      @@SalivatingSteve It wasn't just the server... it was a section of server racks that went. :D
      AND it was not an isolated incident either.

    • @NATANOJ1
      @NATANOJ1 6 months ago +1

      I worked in several IT offices; there was always someone with a similar story where a cleaner just pulled a plug in the server room to clean.

  • @binkwillans5138
    @binkwillans5138 6 months ago +1

    Open the pod bay doors, HAL.

  • @whothefoxcares
    @whothefoxcares 6 months ago +3

    someone like 3lonMu$k could teach machines that greed is good.

  • @Nathan-ko8um
    @Nathan-ko8um 6 months ago +1

    gpt-5: the girthening

  • @lashlarue7924
    @lashlarue7924 6 months ago +7

    8:45 Look, it isn't that we here in the US don't appreciate the contributions of Chinese nationals (and others too) to our infrastructure projects. We do. The issue is that if you have family, real estate, or other ties to China, or if you LIE about those ties, then you are susceptible to being manipulated, blackmailed, or otherwise vulnerable to coercion by regimes that can snap their fingers and send your parents or children into a gulag. That's why you guys get your clearances held up. It's not that we don't like you guys, it's that we have to face the cold hard facts about what happens when someone gets their arm twisted by the Ministry of State Security.

  • @JoseLopez-hp5oo
    @JoseLopez-hp5oo 6 months ago +2

    Secure multi-party computation allows sensitive data to be processed in secret without revealing the plaintext, but this is more for protecting medical data in research and the like. To protect a language model or other complex business logic, it's best not to put the code in the hands of the attacker at all, and to use glovebox/API methods to interact with the sensitive IP without revealing it.
    Everything is so easy to hack; all your XPUs belong to me!
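
    A minimal sketch of that glovebox/API pattern (the "model" below is a stand-in stub, and the route and payload shape are made up for illustration): the weights never leave the server, and clients see only inputs and outputs through a narrow interface.

    ```python
    # Sketch: serve inference behind an API so the weights never leave the box.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def run_model(prompt: str) -> str:           # stand-in for proprietary inference
        return prompt[::-1]                      # placeholder "model"

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
            out = json.dumps({"completion": run_model(body["prompt"])}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(out)

    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
    ```

    Clients POST a prompt and get a completion back; the parameters themselves never cross the wire.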

  • @yeshwantpande2238
    @yeshwantpande2238 6 months ago +2

    You mean to say it hasn't yet been stolen by traditional thieves? And will GPT-4 help steal itself?

    • @glennac
      @glennac 6 months ago

      “Isn’t it ironic?” - Morissette

  • @hermannyung7689
    @hermannyung7689 6 months ago +1

    The only way to prevent a model from being stolen is to keep pushing new and better models.

  • @lilhaxxor
    @lilhaxxor 6 months ago +1

    TL;DR: Databases with user and business information are far more valuable.
    I honestly doubt anything will happen. You need a whole infrastructure and competent staff to make use of these large models, so stealing them is completely pointless. You can't even really do ransomware with one (albeit, as you mentioned, personal data might be in the training set, there are ways to alter such data enough to remove personally identifiable information). There is honestly nothing to worry about here, in my opinion.

  • @av_oid
    @av_oid 6 months ago +2

    Steals? Isn’t it OPEN AI? Or should it be called ClosedAI?

  • @vvvv4651
    @vvvv4651 6 months ago +1

    Haha, this popped up in my feed right after I was fantasizing about possibly leaked no-limits GPT models. Well done.

  • @fffUUUUUU
    @fffUUUUUU 6 months ago +2

    Yeah, someone * cough cough * China Iran Russia

  • @svankensen
    @svankensen 6 months ago +3

    Great video as always, but... you didn't answer the main question in the title. You covered how it could happen, not what the consequences would be.

  • @Dissimulate
    @Dissimulate 6 months ago

    The most humorous part of that deer picture was the word humorous in the caption.

  • @damien2198
    @damien2198 6 months ago +21

    That's why OpenAI is planning to have their own hardware; whoever controls the hardware controls the model (which would only be able to run on that specific hardware).

    • @nekogami87
      @nekogami87 6 months ago +3

      Pretty sure they don't? The CEO opened a new company and used the OpenAI name to sell it to investors, but I'm pretty sure that new entity has nothing to do with OpenAI (and is fully for-profit).

    • @sumansaha295
      @sumansaha295 6 months ago +2

      Unless they're running their models on quantum computers, it makes no difference. At the end of the day it's still just matrix multiplication in a specific order.

    • @dtibor5903
      @dtibor5903 6 months ago +2

      @@sumansaha295 Matrix multiplications do not need quantum computers.

  • @GungaLaGunga
    @GungaLaGunga 4 months ago

    Basically as the compression gets better, all of human knowledge can be copy pasted onto any device in seconds.

  • @Lopson13
    @Lopson13 6 months ago

    excellent video, would love to see more security videos from you!

  • @szaszm_
    @szaszm_ 6 months ago +1

    I wonder whether NN model parameters fall under copyright law, and if not, whether there's anything protecting them from copying. They're not really art, and it's not clear whether they're even a human creation.

  • @ikuona
    @ikuona 6 months ago +1

    Just copy it on floppy disk and run away, easy.

  • @marcfruchtman9473
    @marcfruchtman9473 6 months ago +1

    A little bit misleading... obviously "nothing" happens. It's like asking what happens if Actor A steals the open-source script of a public play... There are so many open-source near-equivalents to GPT-4 now, and the data is simply out there to be scraped, without having to do any hacking at all.

  • @Urgelt
    @Urgelt 6 months ago +2

    Purely open-source models are not far behind ChatGPT, and they are advancing rapidly.
    We are approaching a tipping point: AI that is able to goal-seek and self-optimize, at which point curation of training data will no longer be much of an obstacle. AI will do it.
    The cat is almost out of the bag. It's probably too late to contain it.
    One obstacle remains: compute cycles. Training requires a lot of them. But advances are coming there, too - more compact models and better, cheaper chips tailored for training.
    AI is moving at blinding speed now. Anything proprietary you could steal will soon be obsolete - and even open source models will quickly surpass what was stolen.
    AI will fall into hands we might prefer not get it. No security protocols could prevent it, I'm thinking.
    What happens next, I can't even begin to guess.

  • @VEC7ORlt
    @VEC7ORlt 6 months ago +2

    What will happen? Nothing, nothing at all. The world will not implode, the internet will be fine, LLMs will give the same half-assed answers as before; maybe some stock numbers will fall and a few poor CEO heads will roll, but I'm fine with that.

  • @stachowi
    @stachowi 6 months ago

    This channel is unbelievably awesome

  • @jcdenton7914
    @jcdenton7914 6 months ago +1

    Ignore this, I am doing research and my own comment will show at the top when I revisit this.
    13:53 Model Leeching: An Extraction Attack Targeting LLMs
    attacked a small LLM
    14:39 Membership Inference Attacks on Machine Learning: A Survey
    14:50 Reconstructing Training Data from Trained Neural Networks
    Goes into how extracting training data can lead to copyright lawsuits
    Insider threats
    16:10 "Two Former Twitter Employees and a Saudi National Charged as Acting as Illegal Agents of Saudi Arabia"
    URL not shown.
    16:58 Verizon 2023 Data Breach Investigations Report
    Not sure if useful, but it's recent.

  • @user-cd4bx6uq1y
    @user-cd4bx6uq1y 3 months ago +1

    0:10 btw that Andrej guy is pretty controversial

  • @Steven_Edwards
    @Steven_Edwards 6 months ago +8

    There are so many open-source LLMs trained on public resources that it is a moot point. Proprietary will never be able to keep up with open as far as rate of improvement goes.
    When I last checked there were something like a dozen different LLMs, many of them coming out of China but plenty coming out of other places in the world. They've all been trained on different data sets, and many are up to GPT-3.5 equivalence, reached far faster than it took OpenAI to get to the same level.
    Honestly the big bottleneck is the same for everyone, and that is inference. Processing prompts is an expensive proposition. I've seen home systems with enormous amounts of GPU compute that still are not performant enough to be real-time.
    As of right now only the largest online services and state actors can afford inference that performs reasonably; that is the only thing preventing true democratization of AI at this point.

  • @flioink
    @flioink 6 months ago +1

    That's totally happening in the near future!

  • @Manbemanbe
    @Manbemanbe 6 months ago

    Good to see SBF taking that Home Ec class from prison there at 13:15 . You gotta stay busy, that's the key.

  • @MO_AIMUSIC
    @MO_AIMUSIC 6 months ago

    Well, considering how big the file is, stealing the parameters unnoticed would be nearly impossible, and even if it were possible, it would require physically moving the storage rather than transferring it over the internet.

  • @Kyzyl_Tuva
    @Kyzyl_Tuva 6 months ago

    Fantastic video. Really appreciate your channel.

  • @astk5214
    @astk5214 6 months ago +1

    I think I would love an open-source Unix Skynet.

  • @nekoill
    @nekoill 6 months ago

    Whoever knows better, please correct me, but I'm pretty sure the model's source code, most likely alongside the dataset (though probably on different storage devices, both physically and virtually), is stored on a machine that isn't connected to the web at large, if it's connected to any kind of network at all. That doesn't eliminate the risk of the data being stolen, but you'd need to be physically present at the storage site, fairly close to the computer (like *really* close), with a SATA cable shaped in a way that lets it serve as an antenna, or something like that. I expect OpenAI to take at least that kind of precaution, but who knows; dumb screwups happen in IT all the time.

    • @maht0x
      @maht0x 6 months ago +2

      There is no "source code" of the model: the model is the output of the training program, which takes PBs of text as its input plus RLHF (reinforcement learning from human feedback); that part was missed out of the description and is arguably the hardest to replicate. Search for OpenAI's "Learning from human preferences" paper.
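
      A toy illustration of that point (nothing like a real LLM pipeline, and no RLHF here): the "model" is just whatever parameters fall out of an optimization loop over data.

      ```python
      # Toy: the "model" is the parameters the training program emits.
      import random

      data = [(x, 2 * x + 1) for x in range(10)]   # hidden relationship: y = 2x + 1
      w, b, lr = 0.0, 0.0, 0.01

      for _ in range(5000):                        # the "training program"
          x, y = random.choice(data)
          err = (w * x + b) - y                    # prediction error
          w -= lr * err * x                        # gradient step on w
          b -= lr * err                            # gradient step on b

      print(f"learned: y = {w:.2f}x + {b:.2f}")    # ~2.00x + 1.00; these floats ARE the model
      ```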

    • @nekoill
      @nekoill 6 months ago +1

      @@maht0x Yeah, sounds like it. Thank you for the correction. My familiarity with ML/NNs is superficial; I know a couple of high-level concepts and have a very coarse approximation of how it works under the hood.

  • @joelcarson4602
    @joelcarson4602 6 months ago +1

    And your interface for the model is not going to parse the model's parameters using a Commodore 64 either. You will need some serious silicon to really make use of it.

  • @thomasmuller6131
    @thomasmuller6131 5 months ago

    It sounds like sooner or later everyone will have their own personal LLM, and there will be no money to be made in providing the service itself.

  • @grizwoldphantasia5005
    @grizwoldphantasia5005 6 months ago +15

    FWIW, I think the problem of stealing intellectual property is overblown, because if you rely on copying someone else's work, you have fewer resources to develop your own knowledge in the field, you are always one or two generations behind, and you don't know what to copy until the market decides what is successful. A business which relies on copying will never develop the institutional knowledge of all the hard work which is never published and can't be copied. A business which wants to do both has to put a lot more resources into the redundant efforts.
    A state-sponsored business might look like it has solved the money problem, but money is not resources, it is only access to resources, and states can only print money, not resources. The more inefficient a state-sponsored business is, the higher the opportunity cost, and the fewer other fields can be investigated or exploited. It's one reason I do not fear CCP expats stealing proprietary IP; it weakens the CCP overall. The more they focus on copying freer market leaders, the more fields they fall behind in.

    • @greatquux
      @greatquux 6 months ago +3

      This is a good point and one he has brought up in some other videos on computing history.

    • @bilalbaig8586
      @bilalbaig8586 6 months ago +9

      Copying is a viable strategy when you are significantly behind the market: it allows you to keep pace with fewer resources. It might not be something China is satisfied with, but other players with fewer resources, like North Korea or Iran, would definitely find value in it.

    • @durschfalltv7505
      @durschfalltv7505 6 months ago

      IP is evil anyway.

    • @obsidianjane4413
      @obsidianjane4413 6 months ago +3

      Except most development is based upon prior work. When you have a bot that can churn through a million patents and papers, it can put A, B, and Z together better than any human, or even any collection of humans, can.
      The intellectual-theft problem isn't in the stealing of the LLM; it's the theft of the documents or works by the company that builds the training set. It's common to pay for research papers and books, etc.; the claim is that they are scraping the internet for these documents without compensation or royalties.
      Yeah, the CCP being able to develop a 5th-gen fighter aircraft really "weakened" them. More insidious is that authoritarian states like the PRC have institutionalized IP theft. They do this by forcing expats to spy, extorting them with implied threats to family and themselves. Chinese nationals really are a security threat to other countries and companies. That isn't Sinophobia; it's just reality.

    • @SpaghetteMan
      @SpaghetteMan 6 months ago

      @@obsidianjane4413 Then you'd be stuck in the same quandary as the folks at the Manhattan Project, who went looking for "Jewish Communist spies" and never suspected that the German-born Englishman Klaus Fuchs was the Soviet spy after all.
      "Intellectual theft" is just a politician's word for "corporate espionage" or "headhunting skilled experts". Only idiots cut off their own nose to spite their face; there are plenty of ways businesses and industries insulate themselves from IP theft without kicking highly capable workers out of their potential hiring pool.

  • @ronaldmarcks1842
    @ronaldmarcks1842 6 months ago

    Yan Xu has created a somewhat misleading graphic. Neither GPT-2 nor GPT-3 involves separate *decoders* in the way some other neural network architectures do (the original Transformer has distinct encoder and decoder components). GPT-2 and GPT-3 are based on the Transformer architecture but use only the decoder part of the original model. What Yan probably refers to are not decoders but *layers*:
    GPT-2 has four versions, the largest having 48 layers.
    GPT-3 is much larger, with its largest version having 175 billion parameters across 96 layers.
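
    A rough sanity check on those numbers, using the common 12·L·d² rule of thumb for decoder-only transformer parameter counts (d_model = 12288 comes from the GPT-3 paper; embeddings are ignored):

    ```python
    # Rule of thumb: a decoder-only transformer has roughly
    # 12 * n_layers * d_model^2 weights (attention + MLP, ignoring embeddings).
    n_layers, d_model = 96, 12288                     # GPT-3 "davinci" configuration
    approx_params = 12 * n_layers * d_model ** 2
    print(f"~{approx_params / 1e9:.0f}B parameters")  # ~174B, close to the quoted 175B
    ```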

  • @Narwaro
    @Narwaro 6 months ago +11

    I have yet to see any positive impacts from any of this stuff. I'm kinda deep into the state of the art of research in this field, and it's really not that impressive. The only thing I can see is that it replaces many stupid people in useless job positions, which is yet to be seen as positive or negative.

  • @g00rb4u
    @g00rb4u 6 months ago

    Get that hacker @01:02 a space heater so he doesn't have to wear his hoodie indoors!

  • @lobotomizedamericans
    @lobotomizedamericans 6 months ago +1

    I'd fucking *love* to have a personal GPT4 or 5 with all BS ethical guard rails removed.

  • @wrathofgrothendieck
    @wrathofgrothendieck 6 months ago +1

    Just don’t forget to steal the 40k computer chips that run the model…

  • @redo1122
    @redo1122 5 months ago

    This sounds like you want to present a plan to someone

  • @buzzlightyear3715
    @buzzlightyear3715 6 months ago

    "The time has come." It would be a surprise if a number of nation states haven't already stolen the LLMs 😂

  • @scarvalho1
    @scarvalho1 4 months ago

    I love this video. Excellent and interesting title, and very good research.

  • @benjaminlynch9958
    @benjaminlynch9958 6 months ago +1

    I'm not terribly worried about any of these models being stolen or otherwise made non-proprietary by malicious actors. State-of-the-art models only remain state of the art for a few months. We went from GPT-1 to GPT-4 in just 5 years. We went from DALL-E to DALL-E 3 in 33 months.
    Worst case scenario is that the stolen ‘foundational’ model becomes obsolete in 12-18 months, and likely much sooner unless it’s stolen immediately after being released. And that assumes that competing models don’t surpass it either.

    • @Valkyrie9000
      @Valkyrie9000 6 months ago

      Which is exactly why nobody steals Lamborghinis older than 6 months old. They'll just build a faster/better one. /s

  • @GavinM161
    @GavinM161 1 month ago

    Hasn't IBM been doing the encryption at 'line speed' for years with their mainframes?

  • @MyILoveMinecraft
    @MyILoveMinecraft 6 months ago +1

    Honestly, given the importance of AI, and the significant advantage of those who have full access to AI compared to those who don't, NOTHING about AI should be proprietary.
    OpenAI especially still pisses me off. AI was promised to be open source; now we are further from that than ever (despite much of the foundation actually being created as open-source code).

  • @johnkraft7461
    @johnkraft7461 5 months ago

    Remember what happened with the Bomb when only one guy had it? Strangely, the use of the Bomb stopped when the Other Guy got one too! Probably a good argument for open source from here on.

  • @adamgibbons4262
    @adamgibbons4262 5 months ago

    If all chips had a unique identifier value, couldn't you encode data to be executed only on a specific set of chips? Then you could simply forget about all the headaches of theft. Data would then be "secure once, execute multiple times" (on a set list of CPUs).
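
    A minimal sketch of that idea (the device-ID source here is a deliberately weak stand-in; real schemes use a TPM or CPU-fused keys): derive the decryption key from a hardware identifier, so the ciphertext is useless on any other machine.

    ```python
    # Sketch: node-locked encryption. The key is derived from a hardware ID,
    # so the blob only decrypts on the authorized machine. uuid.getnode()
    # (the MAC address) is a weak stand-in for a real hardware root of trust.
    import base64, hashlib, uuid
    from cryptography.fernet import Fernet  # pip install cryptography

    def node_key() -> bytes:
        hw_id = str(uuid.getnode()).encode()        # stand-in device identifier
        digest = hashlib.sha256(hw_id).digest()     # 32 bytes
        return base64.urlsafe_b64encode(digest)     # Fernet expects urlsafe base64

    f = Fernet(node_key())
    blob = f.encrypt(b"model weights")              # sealed to this machine
    print(f.decrypt(blob))                          # works here; fails elsewhere
    ```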

  • @luxuriousturnip181
    @luxuriousturnip181 6 months ago

    If it is theoretically cheaper to steal the data than to reproduce it or create something able to compete with it, then the security of the data is a matter of when, not if. We should all be asking when this will happen, and an even more troubling question is whether that "when" has already passed.

  • @nahimgudfam
    @nahimgudfam 6 months ago

    OpenAI's value is in their industry partnerships, not in their subpar LLM product.

  • @coolinmac
    @coolinmac 6 months ago

    Great video as usual!

  • @SalivatingSteve
    @SalivatingSteve 6 months ago

    I would split up their model among machines based on subject areas of knowledge. Each server running its own “department” at what I’m dubbing ChatGPT University 🎓

  • @MostlyPennyCat
    @MostlyPennyCat 6 months ago

    I wonder if you could ask gpt to steal itself for you.

  • @staninjapan07
    @staninjapan07 6 months ago

    Fascinating, thanks.

  • @simonreij6668
    @simonreij6668 5 months ago

    "just as chonk" i have a man crush on you

  • @shApYT
    @shApYT 6 months ago

    perfect timing

  • @Game_Hero
    @Game_Hero 6 months ago

    8:26 Whoa there! Did that AI successfully put text, actual meaningful correct text, in a generated image???

    • @Veylon
      @Veylon 6 months ago

      DALL-E actually does okay at that sometimes these days. Hands even have five fingers most of the time, and they're rarely backwards.

  • @Excelray1
    @Excelray1 6 months ago +1

    Waiting for "Dynamic Large Language Models" (DLLM) to be a thing /jk

  • @MostlyPennyCat
    @MostlyPennyCat 6 months ago

    Maybe it's cyber thieves complaining it's too slow so they _don't_ encrypt memory! 😮

  • @Bluelagoonstudios
    @Bluelagoonstudios 6 months ago

    It happened already: researchers could extract training data from GPT by repeating a word 50x, and it spat out that data, even personal details from whoever wrote the data in the LLM. OpenAI has closed the door on this by now, declaring it against OpenAI's terms. But is that solid enough? A lot of research has to be done to close that one off.

  • @l2azic
    @l2azic 6 months ago +14

    After hearing about the theft of ASML EUV tech from South Korea to China, tech definitely needs more safeguards.

    • @Zero11_ss
      @Zero11_ss 6 months ago

      Everything the Chinese do is based on theft and unfair business practices.

    • @codycast
      @codycast 6 months ago

      China stealing tech. Imagine my shock….

  • @JordanLynn
    @JordanLynn 5 months ago

    I'm surprised Meta's (Facebook) LLaMA isn't mentioned; their model was literally leaked onto the internet, so starting with Llama 2, Meta just releases it to the public. It's all over Hugging Face.