Another Hit Piece on Open-Source AI

  • Published 22 Dec 2023
  • Stanford researchers find problematic content in LAION-5B.
    Link: purl.stanford.edu/kh752sm9123
    Links:
    Homepage: ykilcher.com
    Merch: ykilcher.com/merch
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: ykilcher.com/discord
    LinkedIn: / ykilcher
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 177

  • @EdFormer 4 months ago +266

    It's truly mind-boggling that this could be seen as a stick to beat open-source development with. How do we know that DALL-E 3 hasn't been trained on problematic images, or that GPT-4 hasn't been trained on problematic text? The fact that we are able to check LAION-5B and other open source datasets and help to clean them is a strength of open source.

    • @AP-dc1ks 4 months ago +26

      In fact, in private models it's arguably worse, isn't it?

    • @billykotsos4642 4 months ago +35

      This is quite literally an argument FOR OPEN SOURCE AI

    • @baz813 4 months ago

      Absolutely! @billykotsos4642 There couldn't be a stronger case for transparency on training data sets. How else can we be expected to trust a model when its training set is not open to analysis by anyone, with their own methodologies that can be critiqued in the public domain?

    • @egalanos 4 months ago +8

      It may not have been intended as a stick, but it will be *used* by private model providers as a stick.
      IBM already has a YouTube video about the corporate risks of using open-weight text LLMs because of not knowing what they have been trained on.
      You can bet that this work will be cited for FUD about open image generation models.

    • @clray123 4 months ago

      It is basically the good old anti-Linux argument originally cooked up by Microsoft. Just a coincidence that it is Microsoft again with its dirty paws over the most popular/advanced closed-source model. Also a coincidence that the company was founded by a pedo.

  • @WaluigiisthekingASmith 4 months ago +119

    This is very, very obviously an argument for open-source training sets imo. Sure, there are awful things in open-source sets, but the only reason we could even find that out is because they're open.

    • @heyman620 4 months ago +3

      They clearly try to Gates us.

  • @Houshalter 4 months ago +64

    I checked the paper. They don't say it outright, but in several places they do reveal a strong focus on illustrated, cartoon-style images, which is not what people are taking away from the news articles, headlines, and discussion about it.

    • @khaoscero 4 months ago +4

      this is the most important aspect

  • @liam9519 4 months ago +14

    Isn't LAION-5B basically every image on the internet? Shouldn't the title of this report then be "Turns Out, There Is CSAM on the Internet"? Who knew?!

  • @apoorvumang 4 months ago +14

    When the main argument is "0.00002% of the dataset is CP" rather than any measurement of actual harm caused by the dataset, it's clear that their intention was not to reduce harm but something else (probably clout, anti-open-source sentiment, etc.). Anyone writing such an article in good faith would at least try to measure the harm caused, or run experiments on the effect of including bad material in pretraining datasets.
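    For scale, a minimal back-of-the-envelope check of that percentage, assuming the figures cited elsewhere in this thread (1008 hash-matched images out of roughly 5.85 billion image-text pairs in LAION-5B); both numbers are taken from the comments, not independently verified:

    ```python
    # Rough arithmetic only, using figures quoted in this thread.
    matched = 1008
    total_pairs = 5_850_000_000

    fraction = matched / total_pairs
    print(f"{fraction:.10f}")         # ~0.0000001723
    print(f"{fraction * 100:.7f} %")  # ~0.0000172 %, i.e. on the order of 0.00002 %
    ```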

  • @superironbob 4 months ago +17

    Stanford Internet Observatory has been a glowing beacon of how not to responsibly disclose sensitive information, and how to critically erode trust for their sole benefit.
    Thank you for a discussion that helps bring that further to light.

  • @timeTegus 4 months ago +8

    Them not notifying LAION before they published shows that they don't care about the children.

  • @yeetyeet7070 4 months ago +73

    I bet the billionaires at Microsoft, X, and Meta hate looking at such stuff...
    Closed-source datasets are likely to have much, much more of this and no accountability.

    • @clray123 4 months ago

      Yes, especially a certain divorced billionaire who used to attend parties involving young trafficked women organized by his fortunately deceased friend.

    • @gr8ape111 4 months ago +2

      Oh they absolutely hate it

  • @ulamss5 4 months ago +10

    The best thing about true open source - nobody can just decide, for whatever reason, that a tool built by the hard work of thousands of people could just suddenly be "deprecated".

  • @nitroyetevn 4 months ago +14

    Well said Yannic. Just reiterating/agreeing:
    - It's highly suspect that the people involved wrote a paper as a hit piece (citing the Verge, wut) instead of first contacting the companies, trying to fix the problem, then sharing the solutions for others to use in future. Or just sharing the solutions framed as "hey, we found a problem, here's the solution." They make it kind of clear that it's at least partially about point scoring, rather than just working together to solve the problem.
    - Closed-source datasets may also have these problems, but who knows? Luckily, in a normal, measured, rational response, you get punished ...

  • @TheEbbemonster 4 months ago +7

    100% agree! Sam Altman has advocated for big companies handling all big models several times. It is distasteful! Their company was literally built on top of open source and open research! Mistakes will be made!

  • @tobiasfischer1879 4 months ago +33

    One of the big reasons people did not switch from SD 1.5 to SD 2.x that was not mentioned is the cost of switching architecture. People thought 2.0 was worse at launch than it was, since they had already built up an understanding of "how to prompt Stable Diffusion" and that all got switched up with SD 2.x. But even if we assume that 2.x is as good as 1.5, most people had workflows, fine-tunes, textual inversions, inference frameworks, etc. already built on top of 1.5, so switching to 2.x would mean starting most of that from scratch. When the new model is just as good or slightly worse, no one wants to do a bunch of extra work for no reason. We even saw this with SDXL, whose generation is waaaay better than 1.5, and yet adoption amongst the community has still been slow due to the lack of infrastructure around the new model (as well as higher resource requirements and time to generate).
    Overall agree with the points in the video and glad you are calling things like this out, as most people would find it unacceptable for a security research lab to drop a zero-day with no preemptive disclosures. Just wanted to share a bit more context around SD 2.x since I was fairly engaged with that community at the time.

    • @4.0.4 4 months ago +5

      SD 2.x was a total flop. SDXL, when fine-tuned by the community, is indeed better than 1.5 (and getting faster now).

  • @cherubin7th 4 months ago +84

    Makes you wonder how much secret stuff is inside closed datasets. The only way to remove all such stuff from datasets is to make open-sourcing all datasets mandatory.

    • @ahsokaincognito 4 months ago +5

      Yeah, that's honestly a good idea. Would also level the playing field to be more technical ability-focused. It's not their images at the end of the day. What they train is theirs of course but prior to that it's just gatekeeping.

    • @sharannagarajan4089 4 months ago

      Yeah great idea. Sarcasm

    • @Raphy_Afk 4 months ago +2

      That's actually a genuinely great argument!

    • @Trahloc 4 months ago

      @ahsokaincognito The problem is that private data is more valuable to an organization trying to achieve its own goals. Plus, all the effort used to gather and categorize that data has economic value. If you force people to work for free, they usually opt not to work. It's only mission-driven volunteer actions that have any traction, and those folks are usually trying to control society (for good or ill).

    • @macaquinhopequeno 4 months ago +1

      I'm 100% with your opinion. Not only is CSAM a terrible problem, there are also:
      * leaked private data from stolen accounts all over the world
      * unknown datasets that might generate code which intentionally comes with back doors (not obvious backdoors, but bad code that facilitates exploitation)
      This should, in my opinion, be enforced by law: you want to build an LLM? Then you should open your dataset.
      It's sad that they needed to find CSAM to open their eyes.

  • @sevret313 4 months ago +10

    They mentioned that they searched based on punsafe 0.995 and above, while Stable Diffusion is trained on a subset with a lower punsafe threshold. So Stable Diffusion was probably not trained on these images.
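    A minimal sketch of that kind of punsafe-based filtering, assuming a LAION-style metadata parquet shard with a predicted-unsafe-probability column; the file name, column name, and the 0.1 training cutoff are illustrative assumptions, not the exact schema or thresholds used by LAION or by the report:

    ```python
    import pandas as pd

    # Hypothetical metadata shard; LAION distributes image URLs plus metadata, not images.
    shard = pd.read_parquet("laion5b-metadata-shard-00000.parquet")

    # The report reportedly searched entries scored punsafe >= 0.995,
    # while SD training subsets were built with a much lower cutoff.
    flagged_for_review = shard[shard["punsafe"] >= 0.995]
    training_subset    = shard[shard["punsafe"] < 0.1]   # illustrative threshold only

    print(len(flagged_for_review), len(training_subset))
    ```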

  • @TiagoTiagoT 4 months ago +12

    It's very telling they're only going after the open-source ones...

    • @baz813 4 months ago +8

      It raises the question of where their funding is coming from.

  • @MariuszWoloszyn 4 months ago +4

    There's a "responsible disclosure" policy that security researchers have been using for like two decades already. We know how to disclose such things in a responsible way. The authors clearly chose not to follow that path.

  • @amafuji 4 months ago +28

    Children must have whiplash the way they're constantly being thrown back and forth between political opponents

    • @leonfa259 4 months ago +2

      Children would like to make it known that they want the UN Convention on the Rights of the Child to finally be ratified by the US, as all other countries apart from Iran and North Korea have done, and corporal punishment in the US to be outlawed.

    • @clray123 4 months ago

      Don't worry, they have mandatory masks to protect them from abrasions.

  • @charleshetterich8514 4 months ago +6

    PSA: this was discovered and published several months ago by another research group whose institution happens not to be named 'Stanford'. Food for thought; let's stop upholding these institutions.

  • @AP-dc1ks 4 months ago +6

    Oh no! We better force ClosedAI to show us training data so we can help clean it up!

    • @clray123 4 months ago

      ClosedAI is probably already sponsoring these same "researchers" to help clean up their data.

  • @geldverdienenmitgeld2663 4 months ago +6

    The problem can never be what an LLM knows. In fact, the ideal LLM should know everything about the world, the good things and the bad things as well. If there is a problem, it is only about the question of which use cases should be allowed with these models.

  • @AncientSlugThrower 4 months ago +11

    1000 sounds like a lot, but it is a drop in the bucket compared to the full sample size. I don't want that content in my image generation, so I make sure to specify that in my negative prompting. But the scale of these things needs to be considered before we sharpen pitchforks.
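    As a rough illustration of the negative prompting mentioned above, here is a hedged sketch using the Hugging Face diffusers pipeline; the model id is just the commonly used SD 1.5 checkpoint, and the prompt strings are placeholders rather than a vetted filter list:

    ```python
    import torch
    from diffusers import StableDiffusionPipeline

    # Load a Stable Diffusion 1.5 checkpoint (assumed to be available locally or on the Hub).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # The negative prompt steers generation away from the listed concepts;
    # it is an inference-time mitigation, not a statement about the training data.
    image = pipe(
        prompt="a watercolor landscape, rolling hills, morning light",
        negative_prompt="nsfw, nudity, child, gore, violence",
        num_inference_steps=30,
    ).images[0]
    image.save("landscape.png")
    ```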

  • @clray123 4 months ago +3

    As for the last part of the video, "you don't have to possess the questionable content to find out it's forbidden"... I hope you realize the ramifications of this? It means that there is some sort of trusted censorship-oracle entity sitting out there somewhere, telling you what is questionable and what is not, without you being able to verify its verdicts in any way without punishment for the attempt. This is exactly like Kafka's court accusing you of an unspecified crime. In a sense it's even worse than the Holy Inquisition, where evidence was fabricated, because in this case no evidence needs to be produced, just a claim that such evidence exists.

  • @freedom_aint_free 4 months ago +7

    Maybe a "poisoning the well" attack by the regulatory-capture folks?

  • @evennot 4 months ago +2

    I bet it was some cartoons. The rest was probably images of breastfeeding and such (the most common thing that gets flagged in cloud image storage).
    And what about gore? Catastrophes, violent crimes, war footage, cults, starvation, and other nasty stuff. These are as harmful to children, and sometimes more harmful.
    I get it. Seeing evil is illegal.

  • @malikrumi1206 4 months ago +5

    What a great service bringing this to our attention!

  • @kenselvia5641 4 months ago +7

    For some reason the audio was very low on this video. I had to turn my PC and monitor volumes all the way up to hear it.

    • @iDerJOoker 4 months ago +1

      Was all good in my case

    • @Dr.Trustmeonthisone 4 months ago

      Same here, had to double my Windows volume to make out what was said

  • @baz813 4 months ago +6

    Surely the logical conclusion of research like this will be to enforce open-source datasets through regulation. While government regulators are too slow to catch up, community-regulated distributed AI networks such as #bittensor have already worked through some governance issues in this area, and will continue to evolve.

    • @paulcreaser9130 4 months ago

      Would this apply to closed-source datasets?

  • @diga4696 4 months ago +5

    I believe that this paper, despite its intentions, won't bring any significant change. It seems like just noise coming from an organization that lacks a distinct presence.
    The real issue lies with humanity, not the data we produce. Data, whether recorded or generated, lacks any inherent purpose or intent. It’s the evolution of our intelligent, multifaceted society that gives rise to negative intentions. Without complete transparency and a unified intelligence encompassing all sentient systems, identifying and addressing malevolent elements remains a daunting task. It feels like a witch hunt. Continuing to work in silos is problematic because each person's perspective is like a hidden layer, not fully understood by others. At best, what we have is a convergence of information guided by a select group of experts. However, this often leads to policies and regulations influenced by cultural biases, traditions, and other forms of unverified and prejudiced data.

    • @clray123 4 months ago

      The real issue lies with the people who believe that information as such is harmful; the people who can't tell a horror movie producer apart from a warmonger in high office (or a porn director from a rapist).

  • @tomski2671 4 months ago

    I foresee law enforcement using models trained to identify such materials to catch the perpetrators.

  • @pawelkubik 4 months ago

    It wasn't all ill will with the unsafe disclosure.
    Security researchers put a lot of work and expertise into finding those exploits, so when they decide to postpone a report they can easily predict that they are way ahead of other labs.
    You don't really have that comfort when you're just trying to pick the low-hanging fruit.

  • @clray123 4 months ago +1

    Regarding the proposed improved model training procedures, I think it is safest to generally just pretend that (1) kids don't exist, (2) we have never been young ourselves, and (3) we shut our eyes and run away whenever we encounter one of the non-existing children in public. If we adopt such wise precautionary behaviors ourselves, chances are that our AI models will also be trained accordingly.
    P.S. Should you find yourself living with one of the non-existing young people under your own roof, the best bet is to force it to wear a mask at all times, so that it is less recognizable and cannot infect you with any terrible child-transmitted disease.

  • @para-be4bf 4 months ago +5

    The open-source AI situation keeps reminding me of the crypto wars and the open-source scare.

  • @zrebbesh 4 months ago +5

    Have they published the result of applying the same examination and tests to their own datasets?

  • @swiftpawtheyeet6648 23 hours ago

    "David Thiel"...
    Because of course it is.

  • @Veptis 4 months ago +6

    "Removal of reference downloads"... by giving a list of the explicit material and metadata (now removed) to all the people that used the dataset?
    So you go from a massive dataset with a really low percentage of such content to giving everyone a shortlist of it?

  • @pookienumnums 4 months ago

    Regarding bad or unwanted data seeping through, and to the person who wrote this 'hit piece':
    you go do something 1 million or 1 billion times without making a mistake.

  • @usercurious 4 months ago +2

    Thank you, they will do anything to destroy any open source alternative, just to appear righteous

    • @clray123 4 months ago

      And they will fail again, just like they failed with corporate adoption of Linux.

  • @herp_derpingson 4 months ago +7

    The video is too quiet in this video

    • @clray123 4 months ago

      The secret word is audio.

  • @rolyantrauts2304 4 months ago

    The capitalisation of AI grows apace.

  • @krimdelko 4 months ago +1

    Technology is not the problem, it’s the solution and open source works better at finding solutions. This issue imo is not about open source, it’s about developing tools to avoid damaging content.

  • @isaac10231 4 months ago

    Two thoughts on this...
    First, I think this is possibly what led to some of the turmoil inside OpenAI; maybe they discovered this stuff in their training set, because they probably have it too.
    Second, in the paper they mention a large amount of illustrated cartoons. That sounds like hentai to me, which is a different debate on its own, but I think it needs to be clearly distinguished from REAL people who actually get affected by the distribution of actual abuse.

  • @vfclists 4 months ago

    1000 images out of how many?

  • @Sven_Dongle 4 months ago +1

    Open source tends to vet rather than abet.

  • @thedoctor5478 4 months ago

    How much do you think you can find in Google search? I bet plenty.

  • @lucidraisin 4 months ago +7

    Yannic, always the voice of reason

    • @clray123 4 months ago +1

      It's actually simple to be the voice of reason nowadays - whatever comes from government circles, just do and claim the opposite.

  • @MasamuneX 4 months ago +4

    I want everything in my training dataset including crime statistics..... and evil books

  • @Will-kt5jk 4 months ago +1

    7:48 - it’s an excellent point on “reasonable disclosure”.
    Assuming no malice on the part of the dataset creators, I think it’s appropriate to view the inadvertent inclusion of abuse material (*) as akin to a software vulnerability. Now that potential abusers know the data is in there, they can go through copies & find the CSAM, or target models trained on it to generate new abuse images.
    It would be quite hard to reduce the number of dataset copies which include the abuse material, but if there were at least a period of time to update and re-propagate a new version of the dataset, models/products which use it could have some assurance, & the number of instances of the abusive version would be reduced somewhat before disclosure.
    Something along the lines of CVEs + reasonable disclosure seems like an obvious practice for the industry/subject to adopt.
    (*) obviously primarily CSAM, but also non-consensual adult material etc., and who’s to say private info/doxxing is not in such datasets.

    • @Will-kt5jk 4 months ago +1

      Note:
      I use “reasonable disclosure” because “responsible disclosure” puts the “responsibility” part on the security researcher, not on the vendor. The researcher should act “reasonably” and give “reasonable” opportunity to make the product safe before disclosure, but if the vendor fails to act in a sensible timeframe, it’s completely “reasonable” to release the research to allow users & consumers to take action themselves.

  • @TiagoTiagoT 4 months ago +2

    Is it really actual photos of real children being harmed, or just bullshit like CGI, cartoons, dummies etc?

  • @-Jason-L 4 months ago +4

    I don't see a problem with this being in the training set, as long as it is not in the output.

    • @MattHudsonAtx 4 months ago +1

      It's a felony to possess CSAM in the first place, so they're actually being nice by writing a paper about it. People could go to prison over it.

    • @heyman620 4 months ago +1

      @MattHudsonAtx Are you a grad of trust-me-bro law school?

    • @clray123 4 months ago +1

      @MattHudsonAtx Yes, that's why I deposited some pics onto your phone a couple of days ago.

    • @andybrice2711 4 months ago +1

      Here's a complicated ethical question: Should these images be deliberately added to a "negative dataset" in order to train models _not_ to generate such material?

    • @heyman620 4 months ago

      @andybrice2711 Amazingly smart question, to be honest.

  • @Neomadra 4 months ago +1

    Just remove the identified images from the dataset, problem solved
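    A minimal sketch of what "just remove them" could look like in practice, assuming a plain-text blocklist of flagged MD5s and a LAION-style metadata parquet shard with an "md5" column; the file names and column name are illustrative assumptions, not the actual artifacts published by LAION or by the report:

    ```python
    import pandas as pd

    # Hypothetical blocklist of content hashes for the flagged entries.
    with open("flagged_md5s.txt") as f:
        blocklist = {line.strip() for line in f if line.strip()}

    # Drop any metadata row whose hash is on the blocklist and write a cleaned shard.
    shard = pd.read_parquet("laion5b-metadata-shard-00000.parquet")
    cleaned = shard[~shard["md5"].isin(blocklist)]
    cleaned.to_parquet("laion5b-metadata-shard-00000.cleaned.parquet")

    print(f"removed {len(shard) - len(cleaned)} of {len(shard)} rows")
    ```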

  • @louis3195 4 months ago +6

    It’s easier to trash-talk others' work than to do the work.

    • @Robert_McGarry_Poems 4 months ago

      Exactly like YouTube commenters... 🤔 Journalism is still useful; what is your excuse?

    • @heyman620 4 months ago

      You mean this shitty low effort paper?

    • @Robert_McGarry_Poems 4 months ago

      @heyman620 I don't know what low effort means, but you seem to...

    • @heyman620 4 months ago

      @Robert_McGarry_Poems I think the paper presented here is shitty.

    • @Robert_McGarry_Poems 4 months ago

      @heyman620 Oh hey, there it is. An idea that stands on its own! I knew you could do it. What makes it bad, in your opinion? I think any effort to combat CP is pretty positive, even if the paper itself is low effort; but that's just an opinion.

  • @DRKSTRN 4 months ago

    To me this just demonstrates that there is a use case for advanced diffusion models being context-aware and restricting such outputs in the first place. The ill would always be the same ill as with any artist who attends figure-drawing events: that person can at any time reproduce the most unsightly imagery, by virtue of having a traditional education.
    If we fearmonger that any model may be fine-tuned, and releasing such models becomes a point of potential distribution of those materials, then we would also have to ban artistry as a trade.
    I wouldn't be worried about the narrative. It isn't based on good faith, and the fundamental issue with that camp is attempting to continue infinite growth after hitting market saturation. Expect rocks in general.

  • @asimuddin3222 4 months ago

    Keep it up....🎉🎉🎉

  • @SanjayVenkat-ce1gj 4 months ago +1

    Please keep stating your purpose. Open source is the way forward. 1008 in 5 billion: 1008 too many, but for research. Let's propagate research. Let's use some utilitarian thinking.
    Current politics in ML is questionable. Yes, 1008 is too many.

  • @dinoscheidt 4 months ago

    Isn’t there a center that collects this garbage so it can be fingerprinted, and which was, for example, used for a little bit by Apple's iCloud? Someone should really train an AI classifier on that disgusting material and open-source the detector, so it can be filtered out of everyone's datasets, be they proprietary or not. The open-source approach should be reversed here.

    • @Houshalter 4 months ago

      They don't share the hashes with the public. The paper also mentions that they didn't like it because it mostly focuses on real images, not cartoons.

    • @isaac10231 4 months ago

      @Houshalter What? That makes no sense. Why would they WANT to focus on hentai? Wouldn't it make more sense to, you know, focus on _real_ people instead of drawn characters?

    • @clray123 4 months ago +1

      @isaac10231 I suspect they simply had to focus on hentai because real images had already been filtered out / were hard to find.

    • @isaac10231 4 months ago

      @clray123 Fair point.

  • @zyxwvutsrqponmlkh 4 months ago +4

    I still wonder how The Blue Lagoon or Pretty Baby are legal with these levels of hysteria.

  • @hurktang 4 months ago +1

    The fact that they conclude we should delete all SD 1.5 models makes me think twice about whether it's really CSAM at all in the first place.
    I just took a look, and PhotoDNA makes no claim that their database contains only sexual abuse. They actually seem to hash any submitted images. It seems it just takes someone offended on the internet to get an image into that database. This means it WILL hit girls at the beach, taking their baths, an accidental pantie shot, wearing something a little bit too tight, or a girl who turned out to be 17 years old after publication of photos she took herself... Not all countries in the world have the same exact standards.
    Don't get me wrong, I'm absolutely okay with cleaning all of those from the dataset when we find them. But acting all offended by it and asking for the deletion of SD 1.5 because 1 in 5 million photos could offend someone seems absurd to me.

    • @isaac10231 4 months ago +2

      They mentioned a large portion was "illustrated cartoons"... So basically hentai lol.

    • @clray123 4 months ago

      @isaac10231 At least there were no pictures of the prophet Mohammed...

  • @Veptis 4 months ago

    I have a good take on this, but my comment gets removed immediately. Not sure what I wrote wrong.

  • @knutjagersberg381 4 months ago +1

    Honestly, I'm still wondering what I should think about this... I agree this has been used for political purposes, with bad intent, too. For one thing, weights are still a legal gray area, but is it the right thing to deal with the burden of proof in this way?
    I find this very difficult. Another aspect I find difficult is the potential of any generative model to generate this content. In principle, it is possible to use 3D engines and create this shit, yet we don't regulate access to 3D game engines like this, do we? There are also other models which can upscale an image. A human can draw an image of this content and then make an AI version of it; that's even more difficult to control. I feel more caution is needed, but we can also overshoot. Is it the capacity to generate this or the distribution that is the problem? This needs nuanced views. Very difficult; we need great care in the reasoning about this.

    • @leonfa259 4 months ago +1

      Do pixels have an age? Is anybody harmed by pixels? Does a virtual person look 17 or 18? At least from my perspective, generated output cannot by definition be that, since no one was harmed through the creation of that material. We can continue to find it abhorrent, but even the Supreme Court will most likely see it as covered by the First Amendment.

    • @knutjagersberg381 4 months ago

      @leonfa259 I don't know. The content could still be illegal, or at least its distribution could be. This needs deeper reflection. Someone into the legal and ethical aspects should really think about this for a while to facilitate sense-making.

    • @knutjagersberg381 4 months ago

      @wbs_legal could say something on the legal aspects

    • @leonfa259 4 months ago

      @knutjagersberg381 Which criminal legal system are you talking about, the US one or another?
      Ethics are an interesting topic, but they depend on whom you are asking.
      In the end, all legal systems and people agree that real children should not be hurt; after that, opinions diverge, and marriage ages across the US vary. Criminal systems usually worry about the extreme, clear-cut cases, while NGOs like the one above have broader opinions.

    • @knutjagersberg381 4 months ago

      @leonfa259 My point is that I'm not a legal expert. I'm also not an ethicist, although I think I have some good intuition about ethics. I'd like to hear more opinions.

  • @javrin1158 4 months ago +2

    Shows they don't have the children's best interests at heart when they so recklessly abandon responsible disclosure before publishing such material.

  • @cerealpeer 4 months ago +1

    Yeah! And what's all this about the cops having literally tons of cocaine???? Where are they keeping the cocaine, and why are only the cops allowed to have it?

  • @ChuckBaggett 4 months ago

    Volume is too low.

  • @marshallmcluhan33 4 months ago

    Cash and carry. Isn't open source all owned by a16z anyway? 💰😎
    Censor reality; make it yours.

  • @Sven_Dongle 4 months ago

    CSAM: child sexual abuse material.

  • @cerealpeer 4 months ago +4

    "We need to make the streets safer. Too many violent crimes."
    "OK, we've rounded all the guns up, and they're locked away from the criminals."
    "HEY, THEY'VE GOT GUNS! GET 'EM!"

    • @tedchirvasiu 4 months ago +5

      What the hell are you talking about? Jesus.

    • @cerealpeer 4 months ago

      @tedchirvasiu Tell me what you think about the issue.

    • @be12 4 months ago +1

      What

    • @cerealpeer 4 months ago

      @be12 Exactly.

    • @cerealpeer 4 months ago

      I wish I had a sock account so I could repeatedly not understand things.

  • @ahsokaincognito 4 months ago +5

    I don't have an issue with your opinion differing from mine on this topic per se, and I think we are both betting on the same horse, AI optimism and FOSS, in that sense. But the first 5 minutes of the video I have watched so far are just a big ad hominem attack. This is very valid criticism of the LAION dataset, and I think most people here are nerdy enough to know that while the data in question won't ever be reproduced by current diffusion methods, it's rather about morals and quality assurance. If LAION fails to filter out basics like this, that doesn't throw a great light on their overall curation process. And, as stated in one of the shown articles, the 1000-image figure is the very bare minimum; that's just known hashes. You can probably reckon with 10 times that. And at some point you have to ask yourself whether LAION, voluntarily distributing training data for others to base their work on, failed their obligations here.

    • @Trahloc 4 months ago +11

      "failed their obligations here" -- so until it's perfect it needs to remain behind closed doors?

    • @ahsokaincognito 4 months ago +3

      @Trahloc If your idea of perfection is a dataset solely satisfying the requirement of not containing child abuse, then yes.

    • @Trahloc 4 months ago +7

      @ahsokaincognito You realize that would require crawling the entire dataset manually, right? You need to train an AI before you can use an AI to clean a dataset.

    • @discipleofschaub4792 4 months ago +12

      Have to remember that LAION is basically just a bunch of hobbyists. The fact that it's open source enabled researchers to find this and potentially develop better filters in the first place.

    • @neelsg 4 months ago

      @ahsokaincognito Is the issue that there are links to these images that people can somehow filter and find? If yes, then surely the problem has to be that these exist in the Common Crawl data, not just LAION-5B.
      Is the issue instead that Stable Diffusion 1.5 is contaminated by this? If yes, then surely you should be more concerned about closed-source models, as we can't even check them to see if they are contaminated. It stands to reason that if Stable Diffusion 2.0 is worse after removing questionable training data, then there is a clear incentive for companies who train closed-source models to be much less careful and ethical about this than we would want them to be.

  • @murtazanasir 4 months ago

    What idiocy to frame this as an open-source issue. What guarantees do bad-faith actors like you have that closed models and their datasets don't have these problems? This is purely FUD to benefit private corporations. Personally, I can't take anyone who records YouTube videos in sunglasses seriously anyway.

    • @NBK-ro4sz 4 months ago

      What kind of idiot is against open source? This is purely FUD to benefit private corporations.

  • @TheRev0 4 months ago +1

    Fuck... I agree with the message, but, bruh... I wish you weren't delivering it. And I hate myself for that.
    Just the way you were so hesitant about the unacceptability of CSAM in training data kept me cringing. I kept imagining bad-faith actors using against us the ever-so-slight lack of the complete and utter denunciation from you that we're all used to in media. The worst part is that I have no idea how accurate my perception is.
    Bad-faith actors have poisoned the well. Everything is shit and I hate it.

    • @discipleofschaub4792 4 months ago +22

      He did completely denounce such material. Did he not virtue-signal enough for you? It's a sad state of affairs if you have to give some performative 20-minute speech about how you absolutely despise it in order not to be seen as a p. sympathiser...

    • @clray123 4 months ago +1

      But this sort of witch-hunt and black-and-white thinking is exactly what the "bad actors" want you to adopt. Any normal thinking person has the capability to weigh the crimes we are talking about against other crimes and act accordingly. The virtue-signaling hysteria is a new thing, which has not existed in humanity previously, even though the crimes most certainly have. We need to think about why it is necessary in the first place and whose interests it serves, rather than blindly support it.

  • @mkamp 4 months ago +1

    Great to see that you keep it up, shining a light on the societal aspects. And kudos for your bravery. I am still wondering if this will serve you well in the long run? Maybe this will even become your brand, like with the AI ethics people, and people will only see you as that? How about you mix it up and do the next video on Mamba? ;) Just saying. ;) Have great holidays! Looking forward to more of your thoughts, whatever avenue you choose.

    • @mkamp 4 months ago +1

      Well, thank you. Just what the doctor ordered, and timely too! 😂