That was a real Parker Twisty Square. The sponsor is Jane Street. Find out about their internship at: jane-st.co/internship-stevemould NOTE THE URL ON SCREEN IS INCORRECT! This is the correct URL. I'd call it a Parker URL but Matt got it right.
Okay, hear me out. THIS is AI art. Not people using AI to just generate whatever they put in a prompt. But actual human creativity and ingenuity using AI as a tool to create something which previously would have been extremely difficult, if not impossible. There are a lot of ethical and aesthetic problems with generative AI in its current state, but this is the first time I've seen something made with AI and thought "that's beautiful".
it is interesting, yeah, I will argue that in this specific case AI is DEFINITELY used as a tool to find a solution. My problem from day one always was with people who say they are AI artists. But that's clearly not what this video is about
Too bad the cover image of the video was edited to make the transformation more dramatic. The left rabbit ear on the second cube was basically erased on the duck image...
15:14 Bias and hallucination in the context of generative AI aren't simply human fallibilities, they're the mechanism by which it functions: you're handing an algorithm a block of random noise and hoping it has such strong biases that it can tell you exactly what the base image looked like even though there never was a base image.
Well said. Also: bias and hallucination are so commonplace in our own neural networks (our brains) that we even given them categories and names, such as “over generalization”, “confirmation bias”, “sunk cost fallacy”, or the catch-all “brain fart”. All neural networks (including our own) apply learned patterns in contexts where the learned pattern shouldn’t be applied. That’s why (to your point) the neural network driving diffusion can denoise noise that was never there in the first place.
это переход к изображению которое было в другой параллельной реальности и ваш мозг может существовать сразу в нескольких таких если его тренировать к непредвзятости, а то что это можно воспроизвести на компьютере впечатляет меньше чем древнекитайский язык в котором эта опция обязательна к применению, вы просто зациклены на вашем языке и это делает вас способными к удивлению
@@truejim Very true. The ability for humans to recognise faces, even in places where there is no face, can be said to be one of our biases, yet a useful one at that, which makes me wonder whether hallucination and bias in reasoning is not merely a flaw, but something that may have inadvertently assisted in our survival throughout history.
Hey Steve and Matt, thank you guys for featuring our research - it was a lot of fun working with you! I'm Ryan Burgert, the author of Diffusion Illusions - I'll try to answer as many questions as I can in the comments!
One thing I wasn't clear on. They describe taking the first images of two iterative prompt responses, flipping and layering them, and then using that single image as the first step in two different prompts (in this case, for penguin and giraffe). But how do you end up with a single image, rather than two different images that just used the same starting point?
@Neptutron hey, I’m just wondering from an artist perspective, how this might be used to make artworks. I’ve made previous comment about it. I just wanted to say your work sounds amazing and looks amazing! Although 😅I’m a little worried about people wanting to steal and profit from the other artist’s artwork. 👍
I'm not sure that would work because these images can be based on something that vaguely sort of kind of resembles a penguin or giraffes, but I don't think our brains give us the same leeway for sounds. I don't think there's a pareidolia for sounds, is there?.
I would buy so many of them for real (I like to have a basket of fidgets, puzzles, and tactile art pieces on my coffee table and these would fit right in)
@@WitchOracle I think there is a method you can 3D print it in-place (no assembly), and also transfer a color to the the first layer from a piece of ink-jet printed paper (it was on TeachingTech channel I think)
A Mould-Parker crossover video about double image illusions in which you create several of them and you didn't do one that morphed from Parker to Mould?
This wasn't a video about how diffusion models work and are trained... but you still managed to explain both better than the majority of videos on YT about the subject. Can you make a video explaining how you became so damn good at explaining things? Oh, and this is the coolest application of image generators I've seen to date. Brilliant idea leveraging the intermediate diffusion steps to sneakily steer the result into multiple directions simultaneously!
This was the _least_ illuminating Steve Mould video I have ever seen. Most of them are exceptionally lucid, even in a single pass. I lost my bearings past the "keep adding noise..." stage.
@@-danR Can't blame you, it's a weird process that seems completely backwards the first time you learn about it. It sounds so stupid that first they make this giant model that can remove noise only to put most of it back in, but it's the only way to iterate enough to get a clear picture. Anyway, it was only background information, and it doesn't really matter if it didn't become crystal clear how all of it works - the important thing is that the model is trained to be good at removing noise from a grainy picture. If you then start from a random mess and tell the model it's an extremely noisy picture of a cat, it will make into a picture of a cat by taking the supposed noise away. And because it happens in steps, you can alternate the subject between a cat and a dog in every other step, and it becomes both a cat and a dog in the end (obviously oversimplified)
This is a really educational video on AI which _should_ help most people understand and realise that these LLM and diffusion models are not General AI (ie; "truly intelligent") and just simple mathematical models. I studied AI and ML long before LLMs became a thing and have always been aware of this but convincing people of it is very hard in a short timeframe.
Honestly, as long as this shit continues to be trained by stealing work from actual human artists I don't care. I'm genuinely disappointed in Matt and Steve.
@Zutia what is and isn't stealing in this context is something that still needs to be established. As the training images are not directly used, but just statistics on them (the training images are not actually stored in the final model, it's therefore impossible for it to "copy-paste" parts of them into the output image), so it doesn’t conflict with the current copyright. And if we change copyright in that regard, we also need to consider what that implies for artists being inspired by each other
I don't see many people calling them general AI, but I do run into hordes of people on the internet vehemently claiming that an LLM is not even a type of AI at all.
@@Zutia being disappointed in Steve for covering an extremely interesting and relevant application of a novel technology is quite frankly nuts. touch grass
So a person could do this too - rough outline sketch of penguin, of a giraffe; flip one, work out an average rough from both; flip one back, do more detail on both, flip one. Repeat till you're happy or you give up. But some people just do it in their head - amazing!
Was thinking the same thing. With enough trial and error with both your original image and whatever secondary image that sort of manifests itself, this seems absolutely doable. It feels like an artist expression that humans could absolutely be trained in, but just haven't really ever largely pursued.
It's a common trend to do this with names or words in fancy script, so that it reads the same flipped upside down. I've seen a bunch on UA-cam and he does it in a few seconds (I couldn't tell you what it's called, it was something I saw in passing).
The reason some text models struggle with counting the number of r characters in a word like strawberry is because they don't see the word, they receive a vector which was trained to represent the different meanings of the word when looked at through different filters, similar to these illusions, which is what attention QKV projections do (extracting information from the vector which is layered in there). Sometimes the vector would have managed to store information about a word such as spelling and rhyming which the model can use, but oftentimes not, it depends on chance with how often things appear in the training data. The model could count it if the word was split into individual letters with spaces between them, because each would encode into a unique vector.
not quite, the model receives a stream of tokens which are not semantically meaningful. a model whose tokens mapped 1-1 with english characters would have no problem counting the number of r characters in strawberry. what you are referring to is a part of the model that converts chunks of the tokens stream into token embeddings
4:30 Minor nit. I don’t think the token embedding is really embedding based on semantics. It’s embedding based on how humans have used tokens in our writing. Since we tend to use semantically similar tokens in linguistically similar ways, the embedding does tend to cluster semantically similar tokens near each other. But it will also cluster tokens that aren’t semantically similar, merely because they’re used in the same way linguistically. For example “the” and “his” will be near each other in the embedding space not because they’re similar in meaning, but because they’re interchangeable in many sentences.
@@muschgathloosia5875 A purely semantic embedding would cluster tokens based only on similar *meaning*. Embeddings such as Word2Vec cluster tokens based on how the token is used in written English. So two tokens can be embedded near each other because they have similar meaning, *or* because they’re interchangeable in a sentence. “I ate his pie” vs “I ate that pie”. The words ‘his’ and ‘that’ don’t mean similar things, yet they’re still clustered near each other. The neural network is being trained on how words are used, not what they mean. It just so happens that words with similar meaning are also often interchangeable in a sentence.
@@muschgathloosia5875it’s not understanding the semantics because the way it arranges things has nothing to do with semantics and everything to do with frequency of use together, if for every string of words I had a dice I could roll that would give me a a word to write down then I could generate sentences, if that dice was weighted via analysis of how often words are used together then my writing would look human and because of how language works it would look like semantic understanding, but I don’t understand anything I’m just rolling a dice based on frequency of word use
@@marigold2257LLMs are not Markov chains. They capture very complex and subtle relations between words. An LLM works by analyzing it's training data and representing it numerically in a way so it can reuse it to satisfy prompts. But the training process forces the model to be efficient with its organisation. The model is unable to learn all word patterns, so it has to instead find and learn subtle higher order concepts that are simple to memorize but can be used to satisfy many prompts. It's like getting a kid to solve a thousand exam questions. They can't possibly learn all answers, so they will be forced to pick up patterns in answers, allowing them to answer questions they haven't seen before. These patterns will be artificacts of the way the questions are framed, as well as real knowdlege about the subject of the test. It's difficult to examine exactly what the model knows, but it's possible to show that at it organizes it's knowledge in a way that encodes concepts similar to our semantic concepts. For example age may be represented as a geometric direction where words further along that direction are semantically older. Does that mean the model "understands" age? That's a philosophical question. But it means the model can use the concept of age in ways similar to what we do. People often take poor math abilities as an example that the LLM isn't actually reasoning like we are. I think that's mostly training artefacts. There is not enough pressure on the model to learn mathematical concepts, so it instead learns shortcuts to produce plausible answers. However, concepts like age, sex, size are quite well represented because they are very useful useful to answer the types of prompts the model was trained for.
@@marigold2257 I'm not claiming it has any 'understanding' I'm just saying that the vector of tokens created is probably relevant to semantics more than just happenstance. I'm not putting any merit on the output of a generative model just the intermediary organization of the data.
In the settings of automatic1111, you can enable a clipskip slider right up top next to your model, vae, etc. Very useful if you're playing around with CLIP, especially when you've got novel length prompts. Doesn't really help you understand how the vector spaces really work, but it does help you to pretend to understand how they work.
Oh the overlap with mundane cryptography could be interesting. The order of words could be scrambled between two outputs. The idea of synthesizing sound that says different things if you understand different languages is kinda horrifying.
Just don't forget that "but it eorks either way" means actually that scientists have tried I would assume thousands of ideas regarding the network architectures, hyperparameters etc. and only some ideas have worked so well that they allowed for the next step. Showcasing results is one thing, developing the models another. It's hard work.
Steve! This video actually taught me how text-to-image AI works. I've seen many videos about it but it still seemed like magic to me. Now, I actually understand the underlying process. Thank you so much!!!
Salvador Dali has a painting which looks like a woman in a dress going through a door in some kind of cubic world. When you go to take a picture of it, it looks like a pixelated Abraham Lincoln
That's basically a highpass / lowpass image (close up you see the fine details, further away you only see the big blocks). They're not hard to make. There's one in this video at 14:36 (not pixelated, but it's still the same highpass / lowpass concept). P.S. - I'm pretty sure the woman in Dali's Lincoln painting isn't in a dress, unless the fabric is incredibly thin. 😉
@@RFC3514 I agree with you but "they're not hard to make" is misleading. Some are hard to make. One example I like from Dali is The Hallucinogenic Toreador, where the same effect is used but with a smaller scale difference. I believe that's much harder to make, and that's without even considering the artistic aspect.
Seeing the pair have 3 different images (maybe a 4th) depending on the other squares orientation absolutely Blew, my, mind. And I would love to buy some.
Wow! That was incredible! It went from the most mind bending optical puzzles, to such a fantastic explanation of the whole thing. This is what UA-cam is truly meant for.
The idea of generating images by removing noise is just as crazy as LLMs that generate text by predicting the next word (these are gross simplifications, but that's basically what it is).
It's even weirder than text prediction because the image model is trained to predict what noise was *added* to an image to make it noisier, and then by running that "backwards" on random noise you just happen to get an unreasonably efficient image generator.
@@nio804 это не предсказание а угадывание общих шаблонов загрязнения сигналов, уверен - работает только в заранее заданных условиях, обычный саморекламный трюк
I'm a software engineer and a midjourney user, and I've watched maybe 50 - 100 videos on LLM and generative AI. In 17 minutes you managed to provide the best simple explanation for how generative AI works with LLMs to produce images from prompts. Steve, you should teach a paid course on this stuff.
I'm NOT a software engineer (I can barely string a string together), and yet it still made sense to me! I was left with one burning question though: Where can I buy these things to show other people?
Amazing stuff. A word about your video editing. You have to give viewers enough time to assimilate the starting image before progressing to the secondary. Probably an extra second would do. When editing, you know what you are looking at, but a viewer doesn't. Wanted to rewind and pause all the time.
I've been doing drawings that do this for years; this was really cool to see. _This_ is what AI is meant to be used for. It's not gonna take over every human job, because humans will always find ways to use it that it couldn't think of on its own.
The topic puts me in mind of ambigrams. I've created a few and it's all about getting enough features to trigger the word recognition in one orientation without destroying the recognition in the other direction. And vice versa.
Which is what I hate, hate, Hate, Hate, HATE about AI. What used to be a clever thing is now something you can make just by writing the appropriate prompt.
I suspect our own minds are filtering noise from those images to make sense of them. Then from another perspective that same noise becomes signal, yielding a different perceived image. Fascinating stuff reminiscent of Hofstadter's Godel Escher Bach.
I wonder how my cat sees the world. Sometimes i think very different from me since they don't have the higher level concepts to make sense of nearly all of the human artifacts around them; i.e. it doesn't fit into their umwelt. I think the closest i came to understanding what that was like was when i overdosed on edibles and tried using my smart phone but nothing on it made any sense (I was trying to google what to do if you overdose on edibles, but I couldn't tell the app icons apart from one another).
The cover of which is what I was reminded when looking at the 3D robot dog: the book cover art is a 3D figure that appears as a 'G' in one orientation, an 'E' in another, and a 'B' in another.
But it's not art. The root word for "art" is the same as that for "artifact" and "artificial", which means (to me) that for something to be art, it must be man-made. Which makes the AI itself art, but not the picture. Sort of.
Great video! I consider this video to be mostly about creative visual hack that depends on human visual understanding but it also happens to be one of the best introductions to noise diffusion image generators, too.
I actually made one of those once. Didn't even take too long to design, and each of the possible ways to assemble the puzzle resulted in a unique image. Granted, it was only a 1 piece puzzle. But hey, it's a proof of concept, right?
@@bobbob0507 Well no because normal puzzles only make it possible for you to have one solution so you don’t get confused why your picture doesn’t look right. Basically the pieces only fit with certain pieces even if you try to jam them into a different one it will be slightly off size.
You'd probably need to limit the number of pictures to two, but it would still be considerably more challenging since you'd need to determine which picture the pieces you've assembled are intended for.
@@BrightBlueJim people who respect the wishes of exploited women whose images were used without their consent is a pretty good stand in for the word "we" in this context
@@BrightBlueJimpeople who understand that there are a significant number of better test images, including those which are made and distributed with the permission of the subject of the photograph. Lenna was publicly fine with it for a while IIRC but now she thinks it's unnecessary for a variety of reasons
Your description of diffuser, large language and clip models, and how they relate/interact was the best I've heard so far. I can only imagine the enlightening journey it took to explain this so succinctly.
THANK YOU. This was really mindbreaking and inspiring. Love your channel, and loved this video. Love this kind of reflexion+ type where you take a very complex AI subject and decompose it bit by bit.
0:40 Those are SKEWBITS!, by Make Anything! Well, the auxetic cube he first modeled that led to SKEWBITS. Your original 'Self-assembling material' video inspired him to try and make an auxetic cube that he could 3D print. He made the files available for download, someone else then printed them, used it for this purpose, and now they are in this video. UA-cam is amazing!
That rotating set that created 3 or more images is interesting. Could ai generate a bunch of layers where rotated coukd show an animated scene. That could make a really interesting sign or clock with a mechanical animation.
@SteveMould Would you mind editing in a 'flashing imagery warning' at the start of the video. UA-cam's editor should allow a text box to be input ahead of the section with flashing, and shouldn't require you to re-upload. Thanks @maxlibz kudos for putting the warning up. UA-cam showed the comment just before the flashing began. Though I'm not epileptic flashing imagery can trigger or worsen my migraines. Your effort has made a difference already. Thank you!
You mentioned not training with human data to eliminate bias, but I have seen mathematical arguments that bias is unavoidable. There were several papers and videos, but the only one I remember was an episode of Nova discussing how use of AI in predictive law enforcement in Oakland, California led to heavy handed responses in one neighborhood while ignoring rising crime in another. Admittedly, the math was way over my head, but it seemed pretty convincing. The problem basically lies not in the training data itself, but in the selection of training data. Something along the lines of having university students select a set of images of men. The students unconsciously biased the data set by selecting a majority of younger more attractive and apparently more affluent white men by 58%. Another example was Google’s AI refusing to show any white men in images of the founding fathers of the USA. (Which is confusing because they were all old white men. Talk about bias!) Trying to select the data completely randomly only proved that we can only generate pseudo random numbers, yielding pseudo random sets. The bias can be minimized, but never completely eliminated. In the end, any AI will be a reflection of us, both the good and the bad in all of us. That is what is scary about AI.
I think this overstates the severity of the problem. Sometimes AI is thought of as a really sophisticated calculator, and indications that its answers might be incorrect are an existential threat. But AI is maybe more like... Marketing. We get iteratively better at creating AI that will achieve our goals, and with time we will build more and more expertise at accelerating that process. The fact that AI in its current form is not capable of solving certain problems perfectly is scary in the sense that we can't cure cancer with medicine. It's unfortunate, but not necessarily unsolvable and certainly not intrinsic (except to specific approaches).
Your very first point is "It isn't a problem with the training data, it's just a problem with the training data"... Maybe think a bit longer on your argument.
The google thing was likely instructional bias tbh rather than something trained into it. But that really just points into bias on both parts, what you put in and what is already inside of it.
@@Gabu_ Human training data is different, it means random quality datasets that humans have a 100% hand in creating, it matters what you put in but every dataset even if it isn't explicitly human consolidated is biased. Even if a LLM were to create its own dataset it would still be human based as it inherited a human bias.
I remember, back in the 70's, there was a drawing of a "prom queen" with crown and all but when turned upside down was a picture of an old woman. It was a classic. Very simplistic compared to this but the same idea.
oh yeah, and if you put it on the side you can see the beatles and aleister crowley riding a whale on the pyramid of Tolotsin the ancient god of fire and the sun and if you fold it at 33 degrees you get the masonic token to unlock the next level
I remember that. Still used as an optical illusion example. Except that you didn't have to turn it around, did you? Just took a shift in perspectives to suddenly start seeing the other one.
@@ranjitkonkar9067 There are two commonly used optical illusions that show a young/old woman. One involves rotating the image (the one OP was talking about) and it often comes with text that says "before 6 beers/afer 6 beers". The other it the one you are probably remembering (you can see a profile of an old woman or a young woman looking away from the picture).
Steve, your clear explanation makes me want to try and make such a puzzle myself. My idea is I could model something and animate it so I can easily switch between two different states and paint digitally. Like painting on 4 separate cards while seeing them all juxtaposed. It seems possible to do manually with digital painting. Way, way harder to do with purely physical tools I guess. I'd wager an artist could make these, maybe even better than the AI can. The drawings that portray one thing, and then another thing when upside down, have been made by human artists already. The process you've described on how AI does it makes it seem to me like I could do it, even being mediocre at painting/drawing.
@@dibbidydoo4318 Could you tell me how, please? I have a friend who's obsessed with ducks, so anything that changes from a duck to something else and back would be amazing. I'd really like to make one for them
In the storied traditions of computational neuroscience, this video is a competent procedural explanation for the process of visual imagination. I wrote about this in my Master's thesis because I have aphantasia, and wanted to understand what other people could do, that I struggle with. In most people, the brain can generate real visual images in the occipital lobe based on words from the temporal lobe, eyes closed, no visual data. This process is how people have visual hallucinations - the brain generating visual data based on low-quality information. This is also why hallucinations are more common in one's peripheral vision and low light. People with aphantasia, including some hyperverbal autistic people, often require high quality visual data, so they can't imagine anything with their eyes closed, even picturing something that happened earlier that day, or their loved one's face. But the process of visual imagination works very much like diffusion. If a person pictures an apple, they may get a fuzzy red blob at first, and then the brain fills in more and more details based on previous experiences with apples. if I try this, I just think of the definition of an apple. Weirdly, I'm an abstract surrealist painter and art teacher - no visual imagination. I can't remember what my mom looks like.
You would have been a great case study for Oliver Saks or V. S. Ramachandran. Both have written fascinating books about neuroscience and the many divergent ways the brain functions in certain individuals. May I ask, if you can’t “picture” your mother visually when you two are apart, with what cues do you rely on to establish that relationship? Do you “hear” or recall her voice? Are there behavioral mannerisms of hers that reinforce your relationship with her when you two are apart? Thank you for sharing your experience. 🙏🏼
04:00 Sorry, Steve, but this is a very misleading explanation of Large Language Models (LLMs). LLMs do _not_ 'understand' text, and they _don't_ have semantic knowledge (e.g. that 'blue boat' means that the boat is blue). The model doesn't know what a boat is, or what blue is, or what it means for a boat to be blue. All it knows is that certain words (actually tokens, which might be words, parts of words, or combinations of words) go together at certain frequencies. LLMs do not have 'meanings', just probabilities of tokens occurring together.
@Singularity606 Unsure why you think I feel "so strongly about this". I just thought Steve, who generally likes to give accurate information, might want to, you know, give accurate information. He can't correct errors if no-one points them out. Also unsure why you're giving misinformation about LLMs, which do _not_ have semantic knowledge. The fact that a prompt like 'blue boat' can be used to generate an image of a blue boat does not mean that either the LLM or the diffusion model has any semantic knowledge. No more than a checkout recognising a barcode as belonging to a banana and displaying a price means that the till knows what a 'banana' is or has any concept of either food or money.
@Singularity606 No, I'm talking about meaning not 'qualia' (which is a silly concept invented by a philosopher who doesn't understand cognitive neuroscience or psychology). You know what a boat is, what it does, how it works, where you're likely to find one, what it's used for, and so on. To you, 'boat' is not just a token that appears in some sentences, it _means_ something. LLMs don't have that. In an LLM 'boat' is just a token, that is statistically associated with other tokens.
@Singularity606 The word is literally just a token in the LLM's data set. The LLM has no understanding of meaning, it only (1) calculates statistical associations between tokens in training and then (2) uses them to generate output. This is not controversial, it's very basic, fundamental stuff about how LLMs work.
@@Grim_Beard It's very basic, fundamental stuff about how LLMs are *trained.* That does not necessarily tell us anything about how it actually performs that task internally within the model. AIs are often called a black box for this reason, and we are perpetually confused as to just *how* they perform so well. Perhaps the reason for this is that understanding is not so difficult to achieve as we'd expect. If you ask the LLM what a boat is it will tell you. If you ask the LLM what will happen if a broken boat is placed in water it will tell you. If you ask the LLM what a good tool for moving items over seas is it will tell you (it's a boat). These imply understanding of some form to me, even if it is not the exact same as the understanding we have. Yes internally it's "just a token." But it knows the relationship of that token to other tokens and how they can be put together to form coherent messages, and it can derive information about the world from these relationships. That is language, and (to me) that is understanding. Even if it is not a language any human speaks, being more numerical in nature, it remains a language with meaningful syntax and the ability to perform the task of any human language. The LLM understands this language, and we simply translate for it on either side of the process. Words in the human brain are "just electrical signals" that we know the relationship of to other electrical signals and how they interact with each other to allow us to form coherent messages, and we can derive information about the world from these electrical signals. We have more types of data than the AI, but that doesn't inherently mean that we understand and they don't, just that they understand less or differently. Ultimately the only way you can claim that AI doesn't understand (or does, my above statement that they do is just as subjective as your statement that they don't) is to first provide a solid definition of what you mean by "understanding." The word has no set definition, so unless you tell people what specifically you mean when you say that you are not communicating your thoughts in their full form. And in any case you cannot state this not understanding as being a known fact that others are incorrect about. They are simply using a different definition of this ill defined word to you. They are not wrong.
@@alansmithee419Thank you very much, that's exactly what I wanted to respond to this comment and yours saved me quite some time ! I find it weird that people will go and "correct" people like that, while being so horribly confident in their "knowledge", saying things like "this is basic knowledge/facts about LLMs". This guy even has 10 likes wtf, how can anyone not think a minute about defining what is "semantics", "understanding" or even "knowing" before arguing if current LLMs have such things. Guys, please define the terms you are using before asking if LLMs have those !
It's not really an illusion in my opinion. It's just a fancy way of putting images together creatively. An illusion would imply there is some sort of visual trickery involved to make you think what you're seeing is something else, or that it exploits the visual cortex to produce hallucinatory artifacts. This does not do either.
i think we conclude that all visual perception is an illusion because of our object recognition meat “software.” i don’t think it’s such a radical conclusion.
Hey @SteveMould! Thank you for everything you do! There's is something I'd like to know about. I've no idea if it's an area of research. To start with an example, water is a good subject for this behaviour. I think that you already made some video around the subject I'm about to describe. So water is flowing along a river and you put some kind module in the course of the water. The river will obviously show changes downstream but upstream as well (eg. making pond/lake, or taking another route altogether). To give you another example and you may find a pattern there. I seem to remember seeing somewhere that somehow, a ray of light may take an entirely different path based on what is on its way. The subject is not about taking a different path but more generally how something downstream can affect something upstream. Hopefully it'll reach your eyes and you'll find it interesting enough to make a video about it.
I wonder what Plato would think of the fact that we are quite literally creating a Theory of Forms where abstract ideas are no longer merely figments of human imagination, but destinations in a multidimensional vector space that can be visited repeatedly and used in increasingly novel ways. I’m sure Aristotle would need to think on it for a while given his views on Plato’s theory.
Hey! Just a heads up that this video uses the Lenna image at 6:14. This is a playboy centerfold that was used for decades as a test image in digital image processing, but it's generally frowned upon to use it now, because it's a vestige of misogyny from the 1970s in tech. Its use has also historically privileged lighter skin tones over darker ones. It's worth going and reading about the history of this image and how it got into such wide use, and why folks consider it harmful in this day and age if you want to know more.
@@VitorMiguell I have videos of prototypes. I was living outdoors when I was making these, and they ended up only lasting for a few days each time I made them because if the temperature or humidity or something changes then the little boxes I made these in changed shape just enough to mess it up. I plan on making a better one soon, though, as now I live in a house and I can potentially make one large enough to put your head inside of the get the proper morphing effect. The problem with looking at it from outside of something is that when you move, something blocks your view, so there is an interruption in the morph. Keep that in mind as you look at this prototype. ua-cam.com/video/-stSuKmsee8/v-deo.html
I'm a little bit annoyed that the thumbnail is so obviously edited. The duck on the left has part of the square erased to make it look like the tool was better than it actually was.
15:00 its about how its handled , if its handled by turning words into tokens it litterly cant see what the word is made of and will just rely on probability of what the input text thought it
This video recaptures the fascination I had for AI before the investment bubble killed it, one day the bubble will pop and we'll be back to this kind of applications
I hit the bell on your channel years ago and watch every video, but this one didn't show up in my notifications, nor was it recommended alongside other videos like your videos usually are to me. I'm glad this was a collab with Matt or I may have gone quite a while without seeing it.
That was a real Parker Twisty Square.
The sponsor is Jane Street. Find out about their internship at: jane-st.co/internship-stevemould NOTE THE URL ON SCREEN IS INCORRECT! This is the correct URL. I'd call it a Parker URL but Matt got it right.
ok
I like your video Steve Mould. Keep up the good work. ^w^
Nice how you and Matt uploaded your linked videos at the same time
Server not found. Maybe AI was not so much of a hype. XD Joking :P
@@yuyurolfer Indeed.
Okay, hear me out. THIS is AI art. Not people using AI to just generate whatever they put in a prompt. But actual human creativity and ingenuity using AI as a tool to create something which previously would have been extremely difficult, if not impossible. There are a lot of ethical and aesthetic problems with generative AI in its current state, but this is the first time I've seen something made with AI and thought "that's beautiful".
I agree!
it is interesting, yeah, I will argue that in this specific case AI is DEFINITELY used as a tool to find a solution. My problem from day one always was with people who say they are AI artists. But that's clearly not what this video is about
yass queenses this is totes the stuff
A novel solution to a novel problem. Well put.
@@bl4cksp1d3r I AGREE
The rabbit/duck illusion got a serious glow-up
Too bad the cover image of the video was edited to make the transformation more dramatic. The left rabbit ear on the second cube was basically erased on the duck image...
@@oliviervancantfort5327 oof now that you pointed it out 😢
15:14 Bias and hallucination in the context of generative AI aren't simply human fallibilities, they're the mechanism by which it functions: you're handing an algorithm a block of random noise and hoping it has such strong biases that it can tell you exactly what the base image looked like even though there never was a base image.
Well said. Also: bias and hallucination are so commonplace in our own neural networks (our brains) that we even given them categories and names, such as “over generalization”, “confirmation bias”, “sunk cost fallacy”, or the catch-all “brain fart”. All neural networks (including our own) apply learned patterns in contexts where the learned pattern shouldn’t be applied. That’s why (to your point) the neural network driving diffusion can denoise noise that was never there in the first place.
это переход к изображению которое было в другой параллельной реальности и ваш мозг может существовать сразу в нескольких таких если его тренировать к непредвзятости, а то что это можно воспроизвести на компьютере впечатляет меньше чем древнекитайский язык в котором эта опция обязательна к применению, вы просто зациклены на вашем языке и это делает вас способными к удивлению
@@istinaanitsi3342 I think that’s the premise of Blake Crouch’s novel “Dark Matter”. 😀
@@truejim не читал, но словосочетание темная материя просто говорит о неспособности науки понять мир, поэтому они заменяют знания темными словами
@@truejim Very true. The ability for humans to recognise faces, even in places where there is no face, can be said to be one of our biases, yet a useful one at that, which makes me wonder whether hallucination and bias in reasoning is not merely a flaw, but something that may have inadvertently assisted in our survival throughout history.
Hey Steve and Matt, thank you guys for featuring our research - it was a lot of fun working with you! I'm Ryan Burgert, the author of Diffusion Illusions - I'll try to answer as many questions as I can in the comments!
One thing I wasn't clear on. They describe taking the first images of two iterative prompt responses, flipping and layering them, and then using that single image as the first step in two different prompts (in this case, for penguin and giraffe). But how do you end up with a single image, rather than two different images that just used the same starting point?
@Neptutron hey, I’m just wondering from an artist perspective, how this might be used to make artworks. I’ve made previous comment about it. I just wanted to say your work sounds amazing and looks amazing! Although 😅I’m a little worried about people wanting to steal and profit from the other artist’s artwork. 👍
I just wished they hadn’t used Midjourney pics.
That company is pretty exploitive, both towards copyright holders AND to their customers.
For the diffusion array: could I put in a bunch of images and a "goal" image and have the machine output the correct arrays?
Hi, important ethical question.
Can you say with 100% certainty that your copy of Stable Diffusion is entirely divorced from stolen artwork?
Loved the Matt Parker jumpscare in the image sequence
I literally pushed pause right on that frame and lost it 💀
Not a jumpscare but easter egg. Also its Maths Parker²
2:52 at 0.25x speed
Ah yes, the Parker Scare
I thought I saw him
I would love to hear this kind of illusion done with audio, such as reversing the audio file and hearing different text, or a piece of music!
Or something with the Yanny or Laurel thing but on purpose.
4 different sounds when overlayed making a completely different one would be cool
@@blackwing1362 if u increase/decrease the pitch of that audio, you will be able to hear each word on purpose
I'm not sure that would work because these images can be based on something that vaguely sort of kind of resembles a penguin or giraffes, but I don't think our brains give us the same leeway for sounds. I don't think there's a pareidolia for sounds, is there?.
@@jasondashneywords, sentences, we can derive words from really distorted sounds
Those blocks would sell really well in gift shops. Especially in Zoos.
I would buy so many of them for real (I like to have a basket of fidgets, puzzles, and tactile art pieces on my coffee table and these would fit right in)
@@WitchOracle I think there is a method you can 3D print it in-place (no assembly), and also transfer a color to the the first layer from a piece of ink-jet printed paper (it was on TeachingTech channel I think)
A Mould-Parker crossover video about double image illusions in which you create several of them and you didn't do one that morphed from Parker to Mould?
This is pushing it too far imho
@@hundredfireify nah😭😭😭. we need that
I don't think the tech is currently up to that, since the models don't have a concept of steve or Matt.
@@megaing1322one word: embeddings.
@@dside_ru One word: Models
More random words you want to throw at me with no real relation to what I said?
This wasn't a video about how diffusion models work and are trained... but you still managed to explain both better than the majority of videos on YT about the subject. Can you make a video explaining how you became so damn good at explaining things?
Oh, and this is the coolest application of image generators I've seen to date. Brilliant idea leveraging the intermediate diffusion steps to sneakily steer the result into multiple directions simultaneously!
Im not him but I'll guess it's due to how many years he has been explaining such a variety of topics.
This was the _least_ illuminating Steve Mould video I have ever seen. Most of them are exceptionally lucid, even in a single pass.
I lost my bearings past the "keep adding noise..." stage.
@@-danR Can't blame you, it's a weird process that seems completely backwards the first time you learn about it. It sounds so stupid that first they make this giant model that can remove noise only to put most of it back in, but it's the only way to iterate enough to get a clear picture.
Anyway, it was only background information, and it doesn't really matter if it didn't become crystal clear how all of it works - the important thing is that the model is trained to be good at removing noise from a grainy picture. If you then start from a random mess and tell the model it's an extremely noisy picture of a cat, it will make into a picture of a cat by taking the supposed noise away. And because it happens in steps, you can alternate the subject between a cat and a dog in every other step, and it becomes both a cat and a dog in the end (obviously oversimplified)
This is a really educational video on AI which _should_ help most people understand and realise that these LLM and diffusion models are not General AI (ie; "truly intelligent") and just simple mathematical models. I studied AI and ML long before LLMs became a thing and have always been aware of this but convincing people of it is very hard in a short timeframe.
Honestly, as long as this shit continues to be trained by stealing work from actual human artists I don't care. I'm genuinely disappointed in Matt and Steve.
How do you know that we aren’t simply somewhat more complex mathematical models? 😉
@Zutia what is and isn't stealing in this context is something that still needs to be established.
As the training images are not directly used, but just statistics on them (the training images are not actually stored in the final model, it's therefore impossible for it to "copy-paste" parts of them into the output image), so it doesn’t conflict with the current copyright. And if we change copyright in that regard, we also need to consider what that implies for artists being inspired by each other
I don't see many people calling them general AI, but I do run into hordes of people on the internet vehemently claiming that an LLM is not even a type of AI at all.
@@Zutia being disappointed in Steve for covering an extremely interesting and relevant application of a novel technology is quite frankly nuts. touch grass
this is absolutely the best explanation of the u-net and text encoder and how they work together i've ever heard
So a person could do this too - rough outline sketch of penguin, of a giraffe; flip one, work out an average rough from both; flip one back, do more detail on both, flip one. Repeat till you're happy or you give up.
But some people just do it in their head - amazing!
Was thinking the same thing. With enough trial and error with both your original image and whatever secondary image that sort of manifests itself, this seems absolutely doable. It feels like an artist expression that humans could absolutely be trained in, but just haven't really ever largely pursued.
It's a common trend to do this with names or words in fancy script, so that it reads the same flipped upside down. I've seen a bunch on UA-cam and he does it in a few seconds (I couldn't tell you what it's called, it was something I saw in passing).
@@rayscotchcoultonWith enough trial and error, a monkey can write the Hamlet
@@cmmartti those are called ambigrams
@@seav80 There we go!
The reason some text models struggle with counting the number of r characters in a word like strawberry is because they don't see the word, they receive a vector which was trained to represent the different meanings of the word when looked at through different filters, similar to these illusions, which is what attention QKV projections do (extracting information from the vector which is layered in there). Sometimes the vector would have managed to store information about a word such as spelling and rhyming which the model can use, but oftentimes not, it depends on chance with how often things appear in the training data. The model could count it if the word was split into individual letters with spaces between them, because each would encode into a unique vector.
write words the way they sound
so the AI can say them easier
@@LarryFain-y9w Phonetic consistency in english language would be great news for all non-english native speakers.
Wouldn't work because english has too many different accents and dialects, unfortunately.
not quite, the model receives a stream of tokens which are not semantically meaningful. a model whose tokens mapped 1-1 with english characters would have no problem counting the number of r characters in strawberry. what you are referring to is a part of the model that converts chunks of the tokens stream into token embeddings
@@Pandora_The_Panda Only one accent and dialect would be standarad others would not. Or each country would have their own standard.
It's like a sci fi version of the old Mad Magazine 'fold-in' pictures, if anyone remembers them.
73 years old and remember them well 😊
I 'member!
4:30 Minor nit. I don’t think the token embedding is really embedding based on semantics. It’s embedding based on how humans have used tokens in our writing. Since we tend to use semantically similar tokens in linguistically similar ways, the embedding does tend to cluster semantically similar tokens near each other. But it will also cluster tokens that aren’t semantically similar, merely because they’re used in the same way linguistically. For example “the” and “his” will be near each other in the embedding space not because they’re similar in meaning, but because they’re interchangeable in many sentences.
What else is semantics then? The model is essentially doing what linguists do but using raw statistics instead of pattern recognition.
@@muschgathloosia5875 A purely semantic embedding would cluster tokens based only on similar *meaning*. Embeddings such as Word2Vec cluster tokens based on how the token is used in written English. So two tokens can be embedded near each other because they have similar meaning, *or* because they’re interchangeable in a sentence. “I ate his pie” vs “I ate that pie”. The words ‘his’ and ‘that’ don’t mean similar things, yet they’re still clustered near each other. The neural network is being trained on how words are used, not what they mean. It just so happens that words with similar meaning are also often interchangeable in a sentence.
@@muschgathloosia5875it’s not understanding the semantics because the way it arranges things has nothing to do with semantics and everything to do with frequency of use together, if for every string of words I had a dice I could roll that would give me a a word to write down then I could generate sentences, if that dice was weighted via analysis of how often words are used together then my writing would look human and because of how language works it would look like semantic understanding, but I don’t understand anything I’m just rolling a dice based on frequency of word use
@@marigold2257LLMs are not Markov chains. They capture very complex and subtle relations between words. An LLM works by analyzing it's training data and representing it numerically in a way so it can reuse it to satisfy prompts. But the training process forces the model to be efficient with its organisation. The model is unable to learn all word patterns, so it has to instead find and learn subtle higher order concepts that are simple to memorize but can be used to satisfy many prompts.
It's like getting a kid to solve a thousand exam questions. They can't possibly learn all answers, so they will be forced to pick up patterns in answers, allowing them to answer questions they haven't seen before. These patterns will be artificacts of the way the questions are framed, as well as real knowdlege about the subject of the test.
It's difficult to examine exactly what the model knows, but it's possible to show that at it organizes it's knowledge in a way that encodes concepts similar to our semantic concepts. For example age may be represented as a geometric direction where words further along that direction are semantically older. Does that mean the model "understands" age? That's a philosophical question. But it means the model can use the concept of age in ways similar to what we do.
People often take poor math abilities as an example that the LLM isn't actually reasoning like we are. I think that's mostly training artefacts. There is not enough pressure on the model to learn mathematical concepts, so it instead learns shortcuts to produce plausible answers. However, concepts like age, sex, size are quite well represented because they are very useful useful to answer the types of prompts the model was trained for.
@@marigold2257 I'm not claiming it has any 'understanding' I'm just saying that the vector of tokens created is probably relevant to semantics more than just happenstance. I'm not putting any merit on the output of a generative model just the intermediary organization of the data.
In the settings of automatic1111, you can enable a clipskip slider right up top next to your model, vae, etc. Very useful if you're playing around with CLIP, especially when you've got novel length prompts. Doesn't really help you understand how the vector spaces really work, but it does help you to pretend to understand how they work.
Oh the overlap with mundane cryptography could be interesting. The order of words could be scrambled between two outputs.
The idea of synthesizing sound that says different things if you understand different languages is kinda horrifying.
Or sounds which mean the same thing in multiple languages.
What a time to be alive!
That's already a real thing!
We could create infinite laurel/yanny prompts or images that have hidden details for color blind individuals
это будет проблема для английского языка, в русском языке заложена защита от такой глупости, дети умеют пользоваться этим в играх на русском языке
@@minhuang8848 просто ваш мозг вас разыгрывает
Just don't forget that "but it eorks either way" means actually that scientists have tried I would assume thousands of ideas regarding the network architectures, hyperparameters etc. and only some ideas have worked so well that they allowed for the next step. Showcasing results is one thing, developing the models another. It's hard work.
Dude, was deep and understandable. Thanks!
Steve! This video actually taught me how text-to-image AI works. I've seen many videos about it but it still seemed like magic to me. Now, I actually understand the underlying process. Thank you so much!!!
Salvador Dali has a painting which looks like a woman in a dress going through a door in some kind of cubic world. When you go to take a picture of it, it looks like a pixelated Abraham Lincoln
That is an example of a hybrid image
That's basically a highpass / lowpass image (close up you see the fine details, further away you only see the big blocks). They're not hard to make. There's one in this video at 14:36 (not pixelated, but it's still the same highpass / lowpass concept).
P.S. - I'm pretty sure the woman in Dali's Lincoln painting isn't in a dress, unless the fabric is incredibly thin. 😉
Canadian artist Bob Gonsalves used to do that kind of paintings. Search for his works on internet.
@@RFC3514 I agree with you but "they're not hard to make" is misleading. Some are hard to make. One example I like from Dali is The Hallucinogenic Toreador, where the same effect is used but with a smaller scale difference. I believe that's much harder to make, and that's without even considering the artistic aspect.
@user-gt5df8yt1v What painting is this?
Seeing the pair have 3 different images (maybe a 4th) depending on the other squares orientation absolutely Blew, my, mind.
And I would love to buy some.
The one where you combine the four transparencies together is a very cool new form of steganography. Excellent!
Wow! That was incredible! It went from the most mind bending optical puzzles, to such a fantastic explanation of the whole thing. This is what UA-cam is truly meant for.
The idea of generating images by removing noise is just as crazy as LLMs that generate text by predicting the next word (these are gross simplifications, but that's basically what it is).
так работает резчик по камню или дереву, что в этом особенного
AI is just mathematical magic. It's amazing.
@@vectoralphaSec математика основа мира, а для вас просто мусор видимо
It's even weirder than text prediction because the image model is trained to predict what noise was *added* to an image to make it noisier, and then by running that "backwards" on random noise you just happen to get an unreasonably efficient image generator.
@@nio804 это не предсказание а угадывание общих шаблонов загрязнения сигналов, уверен - работает только в заранее заданных условиях, обычный саморекламный трюк
I'm a software engineer and a midjourney user, and I've watched maybe 50 - 100 videos on LLM and generative AI.
In 17 minutes you managed to provide the best simple explanation for how generative AI works with LLMs to produce images from prompts.
Steve, you should teach a paid course on this stuff.
I was going to comment the same thing. Such a compact and simple yet comprehensive explanation. Well done.
ой ой ой
Yes, same. I’m a software engineer also. Exactly as Steve says, I feel satisfied with that explanation.
I'm saving this video for the next time someone calls generative AI a "collage tool cut-and-pasting other people's images."
I'm NOT a software engineer (I can barely string a string together), and yet it still made sense to me! I was left with one burning question though: Where can I buy these things to show other people?
Amazing stuff. A word about your video editing. You have to give viewers enough time to assimilate the starting image before progressing to the secondary. Probably an extra second would do. When editing, you know what you are looking at, but a viewer doesn't. Wanted to rewind and pause all the time.
I've been doing drawings that do this for years; this was really cool to see.
_This_ is what AI is meant to be used for. It's not gonna take over every human job, because humans will always find ways to use it that it couldn't think of on its own.
I saw both video thumbnails pop up in my feed, noting the similarities, and I loved the opportunity you had to collab with Stand-up Maths!
The topic puts me in mind of ambigrams. I've created a few and it's all about getting enough features to trigger the word recognition in one orientation without destroying the recognition in the other direction. And vice versa.
Which is what I hate, hate, Hate, Hate, HATE about AI. What used to be a clever thing is now something you can make just by writing the appropriate prompt.
@BrightBlueJim Well...doing normally time-consuming tasks extremely quickly is pretty much what computers were created for...
Whoa whoa @ 11:01 you just gonna gloss over that?! That was awesome! I wanna see more of that, that was wild!
Yea that was the part that really blew my mind and was only briefly mentioned. Just so much cool stuff that just was out of reach before
Love how this video describes generative ai images so well! Appreciate the video!
I suspect our own minds are filtering noise from those images to make sense of them. Then from another perspective that same noise becomes signal, yielding a different perceived image. Fascinating stuff reminiscent of Hofstadter's Godel Escher Bach.
I wonder how my cat sees the world. Sometimes i think very different from me since they don't have the higher level concepts to make sense of nearly all of the human artifacts around them; i.e. it doesn't fit into their umwelt. I think the closest i came to understanding what that was like was when i overdosed on edibles and tried using my smart phone but nothing on it made any sense (I was trying to google what to do if you overdose on edibles, but I couldn't tell the app icons apart from one another).
The cover of which is what I was reminded when looking at the 3D robot dog: the book cover art is a 3D figure that appears as a 'G' in one orientation, an 'E' in another, and a 'B' in another.
12:10 dear god, a jigsaw puzzle with multiple answers!!!
This is awesome. This is art. Something awe inspiring and flips how you look at things. Forces a new perspective. Nicely done!
But it's not art. The root word for "art" is the same as that for "artifact" and "artificial", which means (to me) that for something to be art, it must be man-made. Which makes the AI itself art, but not the picture. Sort of.
1:01 poor rabbit being called trash by Steve
😢
It's cute. Kind of looks like a stained glass window.
5:12 sports…..what??!??
...jewish people...?????
And again at 7:44
I saw that hidden matt parker at 2:52
Highlight of the video.
@@standupmaths It's him!
@standupmaths there better be a steve mould in your video somewhere 😉
@@standupmathswhy don't you have that tick?
This is the comment I was looking for
Steve, it takes extraordinary talent to break down complex ideas into digestible pieces. Respect! Fascinating stuff.
Great video! I consider this video to be mostly about creative visual hack that depends on human visual understanding but it also happens to be one of the best introductions to noise diffusion image generators, too.
You could have a puzzle that's a different picture no matter how you put it together
I actually made one of those once. Didn't even take too long to design, and each of the possible ways to assemble the puzzle resulted in a unique image. Granted, it was only a 1 piece puzzle. But hey, it's a proof of concept, right?
@@jimburton5592 nice! Not every arrangement has to work, you could even "seek" different solutions
In other words, a normal puzzle
@@bobbob0507 Well no because normal puzzles only make it possible for you to have one solution so you don’t get confused why your picture doesn’t look right. Basically the pieces only fit with certain pieces even if you try to jam them into a different one it will be slightly off size.
You'd probably need to limit the number of pictures to two, but it would still be considerably more challenging since you'd need to determine which picture the pieces you've assembled are intended for.
6:14 Heeeyyy… I thought we weren't using Lenna anymore?!
What do you mean, "we"?
@@BrightBlueJim people who respect the wishes of exploited women whose images were used without their consent is a pretty good stand in for the word "we" in this context
@@BrightBlueJimpeople who understand that there are a significant number of better test images, including those which are made and distributed with the permission of the subject of the photograph. Lenna was publicly fine with it for a while IIRC but now she thinks it's unnecessary for a variety of reasons
Your description of diffuser, large language and clip models, and how they relate/interact was the best I've heard so far.
I can only imagine the enlightening journey it took to explain this so succinctly.
Positively fascinating! I know it was simplified, but your explanation of generative AI was really great.
Too bad the cover image was edited. The left rabbit ear has basically disappeared on the duck image...
At 2:00 I have never understood generative AI more. I love this explanation.
0:27 can we get the link to that please?
Its github
I want the link too (._.)
Get this comment up
Thanks!
THANK YOU. This was really mindbreaking and inspiring. Love your channel, and loved this video. Love this kind of reflexion+ type where you take a very complex AI subject and decompose it bit by bit.
You need to get Vi Hart in on this action with a hexaflexagon that has actual images on each orientation
Wow, completely agree! That would be so cool!
I wish her brain hadn't melted years ago :(
Are we ignoring that the rotated first draft giraf at 11:38 was already the most stereotypical penguin image one would think of? :o
And the reverse penguin was also a giraffe
I think it's an editing mistake they swapped the 2 images without noticing because it was too blurry xD
0:40 Those are SKEWBITS!, by Make Anything! Well, the auxetic cube he first modeled that led to SKEWBITS. Your original 'Self-assembling material' video inspired him to try and make an auxetic cube that he could 3D print. He made the files available for download, someone else then printed them, used it for this purpose, and now they are in this video. UA-cam is amazing!
This is my Video Of The Year!
Excellent explanation of generative image AI with a pretty neat application too. Loved it! 💛
I learned so much from this video. I understood things about AI text to image, I never understood before. Thank you.
Stopping it half way is exactly how you would do it with physical media
Do a sketch, re orient, edit sketch, repeat
Right??? A real artist could have done it, but they were too lazy to
@@skilletborne An artist could do it too, but none of them did.
can you do a epilepsy warning for 2:46. i'm not particularly sensitive to rapid light changes but i know some that are.
That rotating set that created 3 or more images is interesting. Could ai generate a bunch of layers where rotated coukd show an animated scene. That could make a really interesting sign or clock with a mechanical animation.
I had the same idea, I’d love to make a rotating layer display.
I don't know how you do it, but every video of yours I see is fantastic.
What an excellent video that explains quite accurately (enough) how generative models work at a fundamental level.
Is it just me, or around 1:40 does it really look like the illusion is going to resolve into Yoda for a moment?
Nyoda Cat?
1:24 cool one, that is.
6:35 You've accidentally created a demon cat XD
Great explanation of diffusion models and how text prompts work! One of your better videos of late!
now thats a great explanation of how diffusion models work with the noise! Felt like i learned something new
2:46 epilepsy warning
Yes, PLEASE PIN this comment.
Thanks
Thank you! That gave me a freaking headache!
@SteveMould Would you mind editing in a 'flashing imagery warning' at the start of the video. UA-cam's editor should allow a text box to be input ahead of the section with flashing, and shouldn't require you to re-upload. Thanks
@maxlibz kudos for putting the warning up. UA-cam showed the comment just before the flashing began. Though I'm not epileptic flashing imagery can trigger or worsen my migraines. Your effort has made a difference already. Thank you!
@@oliparkhouse epilepsy is not a fashion accessory for you to wear to make yourself more interesting. Shut up
Thank you, this should be Pinned.
You mentioned not training with human data to eliminate bias, but I have seen mathematical arguments that bias is unavoidable.
There were several papers and videos, but the only one I remember was an episode of Nova discussing how use of AI in predictive law enforcement in Oakland, California led to heavy handed responses in one neighborhood while ignoring rising crime in another.
Admittedly, the math was way over my head, but it seemed pretty convincing.
The problem basically lies not in the training data itself, but in the selection of training data.
Something along the lines of having university students select a set of images of men. The students unconsciously biased the data set by selecting a majority of younger more attractive and apparently more affluent white men by 58%.
Another example was Google’s AI refusing to show any white men in images of the founding fathers of the USA. (Which is confusing because they were all old white men. Talk about bias!)
Trying to select the data completely randomly only proved that we can only generate pseudo random numbers, yielding pseudo random sets.
The bias can be minimized, but never completely eliminated.
In the end, any AI will be a reflection of us, both the good and the bad in all of us. That is what is scary about AI.
I think this overstates the severity of the problem.
Sometimes AI is thought of as a really sophisticated calculator, and indications that its answers might be incorrect are an existential threat.
But AI is maybe more like... Marketing. We get iteratively better at creating AI that will achieve our goals, and with time we will build more and more expertise at accelerating that process. The fact that AI in its current form is not capable of solving certain problems perfectly is scary in the sense that we can't cure cancer with medicine. It's unfortunate, but not necessarily unsolvable and certainly not intrinsic (except to specific approaches).
Your very first point is "It isn't a problem with the training data, it's just a problem with the training data"... Maybe think a bit longer on your argument.
The internet, like history, art, and pretty much any human cultural artifact, are all humanity's Caliban's mirror.
The google thing was likely instructional bias tbh rather than something trained into it. But that really just points into bias on both parts, what you put in and what is already inside of it.
@@Gabu_ Human training data is different, it means random quality datasets that humans have a 100% hand in creating, it matters what you put in but every dataset even if it isn't explicitly human consolidated is biased.
Even if a LLM were to create its own dataset it would still be human based as it inherited a human bias.
I remember, back in the 70's, there was a drawing of a "prom queen" with crown and all but when turned upside down was a picture of an old woman. It was a classic. Very simplistic compared to this but the same idea.
oh yeah, and if you put it on the side you can see the beatles and aleister crowley riding a whale on the pyramid of Tolotsin the ancient god of fire and the sun and if you fold it at 33 degrees you get the masonic token to unlock the next level
I remember that. Still used as an optical illusion example. Except that you didn't have to turn it around, did you? Just took a shift in perspectives to suddenly start seeing the other one.
@@ranjitkonkar9067 There are two commonly used optical illusions that show a young/old woman. One involves rotating the image (the one OP was talking about) and it often comes with text that says "before 6 beers/afer 6 beers". The other it the one you are probably remembering (you can see a profile of an old woman or a young woman looking away from the picture).
Steve, your clear explanation makes me want to try and make such a puzzle myself.
My idea is I could model something and animate it so I can easily switch between two different states and paint digitally.
Like painting on 4 separate cards while seeing them all juxtaposed.
It seems possible to do manually with digital painting.
Way, way harder to do with purely physical tools I guess.
I'd wager an artist could make these, maybe even better than the AI can.
The drawings that portray one thing, and then another thing when upside down, have been made by human artists already. The process you've described on how AI does it makes it seem to me like I could do it, even being mediocre at painting/drawing.
the upside thing is the classic but there's more complicated tricks you can do with Genai.
@@dibbidydoo4318 Could you tell me how, please? I have a friend who's obsessed with ducks, so anything that changes from a duck to something else and back would be amazing. I'd really like to make one for them
A wild unfa spotted
This was incredibly fascinating. Thank you Steve and Matt.
Sneaking in a Matt Parker pic in the images there 😂👌
In the storied traditions of computational neuroscience, this video is a competent procedural explanation for the process of visual imagination. I wrote about this in my Master's thesis because I have aphantasia, and wanted to understand what other people could do, that I struggle with. In most people, the brain can generate real visual images in the occipital lobe based on words from the temporal lobe, eyes closed, no visual data. This process is how people have visual hallucinations - the brain generating visual data based on low-quality information. This is also why hallucinations are more common in one's peripheral vision and low light.
People with aphantasia, including some hyperverbal autistic people, often require high quality visual data, so they can't imagine anything with their eyes closed, even picturing something that happened earlier that day, or their loved one's face. But the process of visual imagination works very much like diffusion. If a person pictures an apple, they may get a fuzzy red blob at first, and then the brain fills in more and more details based on previous experiences with apples. if I try this, I just think of the definition of an apple. Weirdly, I'm an abstract surrealist painter and art teacher - no visual imagination. I can't remember what my mom looks like.
You would have been a great case study for Oliver Saks or V. S. Ramachandran. Both have written fascinating books about neuroscience and the many divergent ways the brain functions in certain individuals.
May I ask, if you can’t “picture” your mother visually when you two are apart, with what cues do you rely on to establish that relationship? Do you “hear” or recall her voice? Are there behavioral mannerisms of hers that reinforce your relationship with her when you two are apart?
Thank you for sharing your experience. 🙏🏼
Fascinating!
Thumbnail is a bit misleading... the duck image was altered.
The rabbit image was altered too! Look at the duck beak
This content is always full of useful and practical knowledge.
Super interesting Steve! Thanks for explaining this.
Your sponsor sounds like insider trading with extra steps 😂
AI thoughts and comments aside, the angel-statue-to-Yoda transformation at 1:24 is absurdly clean and made me laugh out loud
What about a generative ai song that sounds legible and good played forward and in reverse?
JOIN THE NAVY!
YVA NETH NIAJ
it's quite similar to that sora video where you can choose the end frame of the video and the start frame of the video so you can create a loop.
I've seen this trick in Matt Parker's channel some time ago but I completly forgotted about it. Very interesting concept 👍
Thanks!
04:00 Sorry, Steve, but this is a very misleading explanation of Large Language Models (LLMs). LLMs do _not_ 'understand' text, and they _don't_ have semantic knowledge (e.g. that 'blue boat' means that the boat is blue). The model doesn't know what a boat is, or what blue is, or what it means for a boat to be blue. All it knows is that certain words (actually tokens, which might be words, parts of words, or combinations of words) go together at certain frequencies. LLMs do not have 'meanings', just probabilities of tokens occurring together.
@Singularity606 Unsure why you think I feel "so strongly about this". I just thought Steve, who generally likes to give accurate information, might want to, you know, give accurate information. He can't correct errors if no-one points them out.
Also unsure why you're giving misinformation about LLMs, which do _not_ have semantic knowledge. The fact that a prompt like 'blue boat' can be used to generate an image of a blue boat does not mean that either the LLM or the diffusion model has any semantic knowledge. No more than a checkout recognising a barcode as belonging to a banana and displaying a price means that the till knows what a 'banana' is or has any concept of either food or money.
@Singularity606 No, I'm talking about meaning not 'qualia' (which is a silly concept invented by a philosopher who doesn't understand cognitive neuroscience or psychology). You know what a boat is, what it does, how it works, where you're likely to find one, what it's used for, and so on. To you, 'boat' is not just a token that appears in some sentences, it _means_ something. LLMs don't have that. In an LLM 'boat' is just a token, that is statistically associated with other tokens.
@Singularity606 The word is literally just a token in the LLM's data set. The LLM has no understanding of meaning, it only (1) calculates statistical associations between tokens in training and then (2) uses them to generate output. This is not controversial, it's very basic, fundamental stuff about how LLMs work.
@@Grim_Beard
It's very basic, fundamental stuff about how LLMs are *trained.*
That does not necessarily tell us anything about how it actually performs that task internally within the model. AIs are often called a black box for this reason, and we are perpetually confused as to just *how* they perform so well. Perhaps the reason for this is that understanding is not so difficult to achieve as we'd expect.
If you ask the LLM what a boat is it will tell you.
If you ask the LLM what will happen if a broken boat is placed in water it will tell you.
If you ask the LLM what a good tool for moving items over seas is it will tell you (it's a boat).
These imply understanding of some form to me, even if it is not the exact same as the understanding we have. Yes internally it's "just a token." But it knows the relationship of that token to other tokens and how they can be put together to form coherent messages, and it can derive information about the world from these relationships. That is language, and (to me) that is understanding. Even if it is not a language any human speaks, being more numerical in nature, it remains a language with meaningful syntax and the ability to perform the task of any human language. The LLM understands this language, and we simply translate for it on either side of the process.
Words in the human brain are "just electrical signals" that we know the relationship of to other electrical signals and how they interact with each other to allow us to form coherent messages, and we can derive information about the world from these electrical signals. We have more types of data than the AI, but that doesn't inherently mean that we understand and they don't, just that they understand less or differently.
Ultimately the only way you can claim that AI doesn't understand (or does, my above statement that they do is just as subjective as your statement that they don't) is to first provide a solid definition of what you mean by "understanding." The word has no set definition, so unless you tell people what specifically you mean when you say that you are not communicating your thoughts in their full form. And in any case you cannot state this not understanding as being a known fact that others are incorrect about. They are simply using a different definition of this ill defined word to you. They are not wrong.
@@alansmithee419Thank you very much, that's exactly what I wanted to respond to this comment and yours saved me quite some time !
I find it weird that people will go and "correct" people like that, while being so horribly confident in their "knowledge", saying things like "this is basic knowledge/facts about LLMs". This guy even has 10 likes wtf, how can anyone not think a minute about defining what is "semantics", "understanding" or even "knowing" before arguing if current LLMs have such things.
Guys, please define the terms you are using before asking if LLMs have those !
With the Duck and Rabbit, I can see both and where both transforms in each form. But these overlays are crazy.
It's not really an illusion in my opinion. It's just a fancy way of putting images together creatively. An illusion would imply there is some sort of visual trickery involved to make you think what you're seeing is something else, or that it exploits the visual cortex to produce hallucinatory artifacts. This does not do either.
I agree with you completely. But what do we call it instead? I can't think of another word.
i think we conclude that all visual perception is an illusion because of our object recognition meat “software.” i don’t think it’s such a radical conclusion.
@@JimCa double image? Idk
You explain even the most difficult concepts so well.
Hey @SteveMould! Thank you for everything you do!
There's is something I'd like to know about. I've no idea if it's an area of research.
To start with an example, water is a good subject for this behaviour. I think that you already made some video around the subject I'm about to describe.
So water is flowing along a river and you put some kind module in the course of the water. The river will obviously show changes downstream but upstream as well (eg. making pond/lake, or taking another route altogether).
To give you another example and you may find a pattern there. I seem to remember seeing somewhere that somehow, a ray of light may take an entirely different path based on what is on its way.
The subject is not about taking a different path but more generally how something downstream can affect something upstream.
Hopefully it'll reach your eyes and you'll find it interesting enough to make a video about it.
12:04 You're welcome
I wonder what Plato would think of the fact that we are quite literally creating a Theory of Forms where abstract ideas are no longer merely figments of human imagination, but destinations in a multidimensional vector space that can be visited repeatedly and used in increasingly novel ways. I’m sure Aristotle would need to think on it for a while given his views on Plato’s theory.
why there is no "AI" in the title!! this is one of the best explainations of AI diffusion
Probably because AI and images is usually a bad thing and would minimize views, like putting nfts into the title or something like that
You know, I never want to watch these videos, but when I do, I’m mesmerized, fascinated and happier. Thank you!
This video made everything clearer and easier to grasp.
Hey! Just a heads up that this video uses the Lenna image at 6:14. This is a playboy centerfold that was used for decades as a test image in digital image processing, but it's generally frowned upon to use it now, because it's a vestige of misogyny from the 1970s in tech. Its use has also historically privileged lighter skin tones over darker ones.
It's worth going and reading about the history of this image and how it got into such wide use, and why folks consider it harmful in this day and age if you want to know more.
Bring back lenna
Which folks?
i'm here for the flippy image stuff, not this woke BS
@@morphentropicyou know, "folks". Same ones who don't mind your cat getting eaten.
Nobody asked
As long as all the data that gets scraped gets due credit or paid as necessary, not a problem.
Yes, that’s the biggest problem with GenAI in its current state. It’s created from mostly pirated copyrighted works or sensitive personal data.
Considering one of the illusions had Yoda something tells me even the most ethical proponents of the tech aren't interested in that.
Your old world ideas of ownership of visuals is long gone.
I made images that morph into each other using mirrors and anamorphism. As the viewer changes their position, the image morphs.
I feel like I had to take a similar approach to these robots.
You posted it somewhere?
@@VitorMiguell I have videos of prototypes. I was living outdoors when I was making these, and they ended up only lasting for a few days each time I made them because if the temperature or humidity or something changes then the little boxes I made these in changed shape just enough to mess it up. I plan on making a better one soon, though, as now I live in a house and I can potentially make one large enough to put your head inside of the get the proper morphing effect. The problem with looking at it from outside of something is that when you move, something blocks your view, so there is an interruption in the morph. Keep that in mind as you look at this prototype. ua-cam.com/video/-stSuKmsee8/v-deo.html
BANGER.
I LEARNED A LOT. THANK YOU!!!
Great video. Loved the explanation of diffusion models!
I'm a little bit annoyed that the thumbnail is so obviously edited. The duck on the left has part of the square erased to make it look like the tool was better than it actually was.
15:00 its about how its handled , if its handled by turning words into tokens it litterly cant see what the word is made of and will just rely on probability of what the input text thought it
This video recaptures the fascination I had for AI before the investment bubble killed it, one day the bubble will pop and we'll be back to this kind of applications
This stuff is like 1 year old. Nothing has changed, you can still use SD on your PC.
I hit the bell on your channel years ago and watch every video, but this one didn't show up in my notifications, nor was it recommended alongside other videos like your videos usually are to me. I'm glad this was a collab with Matt or I may have gone quite a while without seeing it.
The one @14:37 (room/pig) that works at different zoom levels also works (understandably) when squinting.