How AI 'Understands' Images (CLIP) - Computerphile

  • Published 24 Apr 2024
  • With the explosion of AI image generators, AI images are everywhere, but how do they 'know' how to turn text strings into plausible images? Dr Mike Pound expands on his explanation of Diffusion models.
    / computerphile
    / computer_phile
    This video was filmed and edited by Sean Riley.
    Computer Science at the University of Nottingham: bit.ly/nottscomputer
    Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharanblog.com
    Thank you to Jane Street for their support of this channel. Learn more: www.janestreet.com

COMMENTS • 212

  • @michaelpound9891
    @michaelpound9891 9 днів тому +227

    As people have correctly noted: When I talk about the way we train at 9:50, I should say we maximise the similarity on the diagonal, not the distance :) Brain failed me!

    • @adfaklsdjf
      @adfaklsdjf 9 днів тому +7

      we gotcha 💚

    • @harpersneil
      @harpersneil 8 днів тому +1

      Phew, for a second there I thought you were dramatically more intelligent than I am!

    • @ArquimedesOfficial
      @ArquimedesOfficial 6 днів тому +2

      Omg, I've been your fan since Spider-Man 😆, thanks for the lesson!
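
The contrastive objective the pinned correction describes can be written down in a few lines. This is an illustrative PyTorch sketch of the idea (the real CLIP training uses the same shape of loss, but the function and variable names here are just for this example):

    import torch
    import torch.nn.functional as F

    def clip_style_loss(image_embeddings, text_embeddings, temperature=0.07):
        # Normalise so the dot product between rows is cosine similarity
        img = F.normalize(image_embeddings, dim=-1)
        txt = F.normalize(text_embeddings, dim=-1)

        # N x N matrix: entry (i, j) is the similarity of image i and caption j
        logits = img @ txt.t() / temperature

        # The matching caption for image i is caption i, i.e. the diagonal
        targets = torch.arange(logits.shape[0], device=logits.device)

        # Cross-entropy pushes the diagonal similarities up and the rest down,
        # applied in both directions (image -> text and text -> image)
        loss_images = F.cross_entropy(logits, targets)
        loss_texts = F.cross_entropy(logits.t(), targets)
        return (loss_images + loss_texts) / 2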

  • @edoardogribaldo1058
    @edoardogribaldo1058 9 днів тому +123

    Dr. Pound's videos are on another level! He explains things with such passion and such clarity rarely found on the web! Cheers

    • @joker345172
      @joker345172 7 днів тому

      Dr Pound is just amazing. I love all his videos

  • @adfaklsdjf
    @adfaklsdjf 9 днів тому +46

    thank you for "if you want to unlock your face with a phone".. i needed that in my life

    • @alib8396
      @alib8396 7 днів тому +9

      Unlocking my face with my phone is the first thing I do when I wake up everyday.

  • @pyajudeme9245
    @pyajudeme9245 9 днів тому +46

    This guy is one of the best teachers I have ever seen.

  • @keanualves7977
    @keanualves7977 9 днів тому +258

    I'm a simple guy. I see a Mike Pound video, I click

    • @jamie_ar
      @jamie_ar 9 днів тому +13

      I pound the like button... ❤

    • @Afr0deeziac
      @Afr0deeziac 9 днів тому +1

      @@jamie_ar I see what you did there. But same here 🙂

    • @BooleanDisorder
      @BooleanDisorder 9 днів тому +4

      I like to see Mike pound videos too.

    • @kurdm1482
      @kurdm1482 9 днів тому +1

      Same

    • @MikeUnity
      @MikeUnity 9 днів тому +2

      We're all here for an intellectual pounding

  • @eholloway
    @eholloway 9 днів тому +61

    "There's a lot of stuff on the internet, not all of it good, I should add" - Dr Mike Pound, 2024

    • @rnts08
      @rnts08 9 днів тому +5

      Understatement of the century, even for a brit.

  • @aprilmeowmeow
    @aprilmeowmeow 9 днів тому +57

    Thanks for taking us to Pound town. Great explanation!

    • @pierro281279
      @pierro281279 9 днів тому +2

      Your profile picture reminds me of my cat ! It's so cute !

    • @pvanukoff
      @pvanukoff 9 днів тому +5

      pound town 😂

    • @rundown132
      @rundown132 9 днів тому +6

      pause

    • @aprilmeowmeow
      @aprilmeowmeow 8 днів тому +1

      ​@@pierro281279 that's my kitty! She's a ragdoll. That must mean your cat is pretty cute, too 😊

    • @BrandenBrashear
      @BrandenBrashear 3 дні тому

      Pound was hella sassy this day.

  • @skf957
    @skf957 9 днів тому +6

    These guys are so watchable, and somehow they make an inherently inaccessible subject interesting and easy to follow.

    • @letsburn00
      @letsburn00 8 днів тому

      YouTube is like getting the best teacher in school. The world only has hundreds or thousands of experts, and being able to explain things well is really hard to do too.

  • @MichalKottman
    @MichalKottman 9 днів тому +34

    9:45 - wasn't it supposed to be "minimize the distance on diagonal, maximize elsewhere"?

    • @michaelpound9891
      @michaelpound9891 9 днів тому +30

      Absolutely yes! I definitely should have added “the distance” or similar :)

    • @ScottiStudios
      @ScottiStudios 7 днів тому

      Yes it should have been *minimise* the diagonal, not maximise.

  • @TheRealWarrior0
    @TheRealWarrior0 8 днів тому +7

    A very important bit that was skipped over is how you get an LLM to talk about an image (multimodal LLM)!
    After you've got your embedding from the vision encoder, you train a simple projection layer that aligns the image embedding with the semantic space of the LLM. You train the projection layer so that the embedding from the vision encoder produces the desired text output describing the image (and/or executing the instructions in the image+prompt).
    You basically project the "thoughts" of the part that sees (the vision encoder) into the part that speaks (the massive LLM).

    • @or1on89
      @or1on89 5 днів тому +2

      That’s pretty much what he said after explaining how the LLM infers an image from written text. Did you watch the whole video?

    • @TheRealWarrior0
      @TheRealWarrior0 4 дні тому

      @@or1on89 What? Inferring an image from written text? Is this a typo? You mean image generation?
      Anyway, did he make the same point as me? I must have missed it. Could you point to roughly the minute where he says that? I don't think he ever said something like "projection layer" and/or talked about how multimodality in LLMs is "bolted on". It felt to me like he was talking about the actual CLIP paper rather than how CLIP is used in modern systems (like Copilot).

    • @exceptionaldifference392
      @exceptionaldifference392 3 дні тому

      I mean the whole video was about how to align the embeddings of the visual transformer with LLM embeddings of captions of the images.

    • @TheRealWarrior0
      @TheRealWarrior0 3 дні тому

      @@exceptionaldifference392 To me, the whole video seems to be about the CLIP paper, which is about "zero-shot labelling of images". But that is a prerequisite for making something like LLaVA, which is able to talk, ask questions about the image and execute instructions based on the image content! CLIP can't do that!
      I described the step of going from having a vision encoder and an LLM to having a multimodal LLM. That's it.

    • @TheRealWarrior0
      @TheRealWarrior0 3 дні тому

      @@exceptionaldifference392 To be exceedingly clear: the video is about how you create the "vision encoder" in the first place, (which does require you also train a "text encoder" for matching the image to the caption), not how to attach the vision encoder to the more general LLM.
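
A rough sketch of the projection step described in the comment above, in PyTorch. The dimensions and names are made up for illustration; real systems in this family (e.g. LLaVA) use the same basic idea of a small trainable layer between a frozen vision encoder and an LLM:

    import torch.nn as nn

    class VisionToLLMProjector(nn.Module):
        """Maps vision-encoder patch embeddings into the LLM's token-embedding space."""
        def __init__(self, vision_dim=1024, llm_dim=4096):
            super().__init__()
            self.proj = nn.Linear(vision_dim, llm_dim)

        def forward(self, patch_embeddings):
            # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
            return self.proj(patch_embeddings)

    # Training idea: freeze the vision encoder and the LLM, and train only the
    # projector so that the projected image tokens, prepended to the prompt,
    # lead the LLM to produce the desired caption or answer.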

  • @beardmonster8051
    @beardmonster8051 8 днів тому +7

    The biggest problem with unlocking a face with your phone is that you'll laugh too hard to hear the video for a minute or so.

    • @JohnMiller-mmuldoor
      @JohnMiller-mmuldoor 7 днів тому

      Been trying to unlock my face for 10:37 and it’s still not working!

  • @bluekeybo
    @bluekeybo 7 днів тому

    The man, the myth, the legend, Dr. Pound. The best lecturer on Computerphile.

  • @Shabazza84
    @Shabazza84 День тому

    Excellent. Could listen to him all day and even understand stuff.

  • @wouldntyaliktono
    @wouldntyaliktono 9 днів тому +1

    I love these encoder models. And I have seen these methods implemented in practice, usually as part of a recommender system handling unstructured freetext queries. Embeddings are so cool.

  • @musikdoktor
    @musikdoktor 9 днів тому +2

    Love seeing AI problems explained on fanfold paper. Classy!

  • @AZTECMAN
    @AZTECMAN 9 днів тому +5

    CLIP is fantastic.
    It can be used as a 'zero-shot' classifier.
    It's both effective and easy to use.
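
A short sketch of that zero-shot use, via the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint (the image file and candidate captions are just examples):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("some_photo.jpg")  # any image you want to label
    captions = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # One similarity score per caption; softmax turns them into relative "probabilities"
    probs = outputs.logits_per_image.softmax(dim=1)[0]
    print({c: round(p.item(), 3) for c, p in zip(captions, probs)})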

  • @RupertBruce
    @RupertBruce 8 днів тому +1

    One day, we'll give these models some high resolution images and comprehensive explanations and their minds will be blown! It's astonishing how good even a basic perceptron can be given 28x28 pixel images!

  • @rigbyb
    @rigbyb 9 днів тому +4

    6:09
    "There isn't red cats"
    Mike is hilarious and a great teacher lol

  • @orange-vlcybpd2
    @orange-vlcybpd2 День тому

    The legend has it that the series will only end when the last sheet of continuous printing paper has been written on.

  • @codegallant
    @codegallant 9 днів тому +2

    Computerphile and Dr. Pound ♥️✨ I've been learning AI myself these past few months so this is just wonderful. Thanks a ton! :)

  • @IOSARBX
    @IOSARBX 9 днів тому

    Computerphile, This is great! I liked it and subscribed!

  • @sebastianscharnagl3173
    @sebastianscharnagl3173 День тому

    Awesome explanation

  • @xersxo5460
    @xersxo5460 5 днів тому

    Just writing this to crystallize my understanding (and for others to check me for accuracy):
    So, instead of trying to instill "true" understanding (which is a hard incompatibility in this context, due to our semantics), at a high level it works with case-specific properties (like how a digital image is made of pixels, so only pixel-related properties matter, like colour and position) and filters against them, because in this case it happens to be easier to tell what something isn't than what it is (there are WAAAY more ways for a random group of pixels not to be an image of a cat, so your sample size for correction is also WAAY bigger). And if you control for the specific property that disqualifies the entity (in this case a property of the medium: the noise), as he described with the "predict the noise and subtract it to recreate a clean image" training, you can be even more efficient and effective by starting from already relevant cases. Once again, because a random smattering of colours is not a cat, it's easier to assume your training images will already be in some assortment of colours similar to a cat, versus the near-infinite combinations of random pixel colours.
    And then, in terms of the issue of accuracy through specificity versus scalability, it was just easier to use the huge sample size as a tool to approximate accuracy between the embedded images and texts, because as a sample size increases, precision also roughly increases given a rule (in crude terms). It's also a way to circumvent "mass hard coding" of associations to approximate "meaning", because the system doesn't even have to deal directly with the user inputs in the first place, just with their association values within the embedded bank.
    I think that's a clever use of the properties of a system, as limitations, to solve for our human "black box" results, because the two methods, organic and mathematical, converge due to a common factor:
    the fact that digital images, in terms of their relevance to people, are also just useful approximations. We literally only care about how close an "image" is to something we know, not whether it actually is that thing, which is why we don't get tripped up over individual pixels when determining the shape of a cat in the average Google search. So, in the same way, by relying on pixel resolution and accuracy as variables, you can quantify the properties so a computer can calculate a usable result. That's so cool!

  • @sukaina4978
    @sukaina4978 9 днів тому +9

    i just feel 10 times smarter after watching any computerphile video

  • @Stratelier
    @Stratelier 8 днів тому +1

    When they say "high dimensional" in the vector context, I like to imagine it like an RPG character stat sheet, as each independent stat on that sheet can be considered its own dimension.

  • @user-dv5gm2gc3u
    @user-dv5gm2gc3u 9 днів тому +4

    I'm an IT guy & programmer, but this is kinda hard to understand. Thanks for the video, it gives a little idea about the concepts!

    • @aspuzling
      @aspuzling 9 днів тому

      I'd definitely recommend the last two videos on GPT from 3blue1brown. He explains the concept of embeddings in a really nice way.

  • @sbzr5323
    @sbzr5323 9 днів тому

    The way he explains is very interesting.

  • @zxuiji
    @zxuiji 9 днів тому +1

    Personally I would have just done the colour comparison by putting the 24-bit RGB integer colour into a double (the 64-bit floating-point type) and dividing one by the other. If the result is greater than 0.01 or less than -0.01 then they're not close enough to deem the same overall colour and thus not part of the same facing of a shape.
    **Edit:** When searching for images it might be better to use simple line paths (both a 2D and a 3D one) matching the given text of what to search for, and compare the shapes identified in the images to those two paths. If at least 20% of the line path matches a shape in the image set, then it likely contains what was searched for.
    Similarly, when generating images the line paths should be traced for producing each image and then layered onto one image. Finally, for identifying shapes in a given image you just iterate through all stored line paths. I believe this is how our brains conceptualise shapes in the first place, given that our brains have nowhere to draw shapes to compare against. Instead they just have connections between... cells? neurons? Someone will correct me. Anyway, they just have connections between what are effectively physical functions that equate to something like this in C (with CHAR_BIT from <limits.h>):
    int neuron(float connections[CHAR_BIT * sizeof(unsigned int)]);
    Which tells me the same subshapes share neurons for comparisons, which means a bigger shape will likely just be an initial neuron to visit, how many neurons to visit, and what angle to direct the path at to identify the next neuron to visit. In other words, every subshape would be able to revisit a previous subshape's neuron/function. There might be an extra value or two, but I'm no neural expert, so a rough guess should be accurate enough to get the ball rolling.

  • @barrotem5627
    @barrotem5627 4 дні тому

    Brilliant, Mike!

  • @zzzaphod8507
    @zzzaphod8507 9 днів тому

    4:35 "There is a lot of stuff on the internet, not all of it good." Today I learned 😀
    6:05 I enjoyed that you mentioned the issues of red/black cats and the problem of cat-egorization
    Video was helpful, explained well, thanks

  • @stancooper5436
    @stancooper5436 7 днів тому

    Thanks Mike, nice clear explanation. You can still get that printer paper!? Haven't seen that since my Dad worked as a mainframe engineer for ICL in the 80s!

  • @VicenteSchmitt
    @VicenteSchmitt 8 днів тому

    Great video!

  • @Misiok89
    @Misiok89 5 днів тому

    6:30 If for an LLM you have nodes of meaning, then you could look for "nodes of meaning" in the description and make classes based on those "nodes". If you are able to represent every language with the same "nodes of meaning" (which is even better for translating text from one language to another than an average non-LLM translator), then you should be able to use it for classification as well.

  • @Funkymix18
    @Funkymix18 8 днів тому

    Mike is the best

  • @jonyleo500
    @jonyleo500 9 днів тому +5

    At 9:30, doesn't a distance of zero mean the image and caption have the same "meaning"? Therefore, shouldn't we want to minimize the diagonal and maximize the rest?

    • @michaelpound9891
      @michaelpound9891 9 днів тому +7

      Yes! We want to maximise the similarity measure on the diagonal - I forgot the word similarity!

    • @romanemul1
      @romanemul1 9 днів тому

      @@michaelpound9891 C'mon. It's Mike Pound!

  • @FilmFactry
    @FilmFactry 9 днів тому

    When will we see multimodal LLMs able to answer a question with a generated image? It could be "how do you wire an electric socket", and it would generate either a diagram or an illustration of the wire colours and positions. It should be able to do this, but it can't yet. Next would be a functional use of Sora: rendering a video of how you install a starter motor in a Honda.

  • @pickyourlane6431
    @pickyourlane6431 5 днів тому

    I was curious: when you are showing the paper from above, are you transforming the original footage?

  • @jonathan-._.-
    @jonathan-._.- 9 днів тому

    Approximately how many samples do I need when I just want to do image categorisation (but with multiple categories per image)?

  • @thestormtrooperwhocanaim496
    @thestormtrooperwhocanaim496 9 днів тому +14

    A good edging session (for my brain)

    • @brdane
      @brdane 9 днів тому +1

      Oop. 😳

  • @Foxxey
    @Foxxey 9 днів тому +3

    14:36 Why can't you just train a network that would decode the vector in the embedded space back into text (being either fixed sized or using a recurrent neural network)? Wouldn't it be as simple as training a decoder and encoder in parallel and using the text input of the encoder as the expected output in the decoder?

    • @or1on89
      @or1on89 5 днів тому

      Because that’s a whole different class of problem and would make the process highly inefficient. There are better ways just to do that using a different approach.

  • @IceMetalPunk
    @IceMetalPunk 9 днів тому +1

    For using CLIP as a classifier: couldn't you train a decoder network at the same time as you train CLIP, such that you now have a network that can take image embeddings and produce semantically similar text, i.e. captions? That way you don't have to guess-and-check every class one-by-one?
    Anyway, I can't believe CLIP has only existed for 3 years... despite the accelerating pace of AI progress, we really are still in the nascent stages of generalized generative AI, aren't we?

  • @GeoffryGifari
    @GeoffryGifari 9 днів тому +4

    Can AI say "I don't know what I'm looking at"? Is there a limit to how much it can recognize parts of an image?

    • @throttlekitty1
      @throttlekitty1 9 днів тому +1

      No, but it can certainly get it wrong! Remember that it's looking for a numerical similarity to things it does know, and by nature has to come to a conclusion.

    • @OsomPchic
      @OsomPchic 9 днів тому +3

      Well, in some way. It would say that the picture has these embeddings: cat: 0.3, rainy weather: 0.23, white limo: 0.1, with every number representing how "confident" it is. So with all of the labels below 0.5 you can say it has no idea what's in that picture.

    • @ERitMALT00123
      @ERitMALT00123 9 днів тому +1

      Monte-Carlo dropout can produce confidence estimations of a model. If the model doesn't know what it's looking at then the confidence should be low. CLIP natively doesn't have this though

    • @el_es
      @el_es 8 днів тому

      The "I don't know" answer is not received very well by users, and therefore an understandable aversion to it is embedded into the model ;) possibly because it also means more work for the programmers... Therefore it would rather hallucinate than say it doesn't know something.
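
A tiny follow-on to the zero-shot sketch earlier, reusing its probs and captions: one crude way to get an "I don't know" out of CLIP, along the lines of the replies above, is to refuse to label anything whose best score stays below a threshold. The threshold value here is arbitrary, and CLIP itself provides no calibrated notion of confidence:

    def label_or_abstain(probs, captions, threshold=0.5):
        # probs: softmax scores over the candidate captions for one image
        best_prob, best_idx = probs.max(dim=-1)
        if best_prob.item() < threshold:
            return "I don't know what I'm looking at"
        return captions[best_idx.item()]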

  • @MilesBellas
    @MilesBellas 9 днів тому

    Stable Diffusion 3 = potential topic
    Optimal workflow strategies using ControlNets, LoRAs, VAEs, etc.?

  • @el_es
    @el_es 8 днів тому

    @dr Pound: sorry if this is off topic here, but I wonder if the problem of hallucinations in AI comes from us treating the "I don't know what I'm looking at" answer from a model as a very negative outcome. If it were treated as a valid, neutral answer, could that reduce the rate of hallucinations?

  • @aleksszukovskis2074
    @aleksszukovskis2074 4 дні тому +1

    there is stray audio in the background that you can faintly hear at 0:05

  • @utkua
    @utkua 9 днів тому

    How do you go from embeddings to the text of something that has never been seen before?

  • @JT-hi1cs
    @JT-hi1cs 9 днів тому

    Awesome! I always wondered how the hell the AI "gets" that an image was made with a certain type of lens or film stock. Or how the hell AI generates scenes that were never filmed that way, say, The Matrix shot on fisheye and Panavision in the 1950s.

  • @lancemarchetti8673
    @lancemarchetti8673 4 дні тому

    Amazing. Imagine the day when AI is able to detect digital image steganography. Not by vision primarily, but by bit inspection.... iterating over the bytes and spitting out the hidden data. I think we're still years away from that though.

  • @zurc_bot
    @zurc_bot 9 днів тому +1

    Where did they get those images from? Any copyright infringement?

    • @quonxinquonyi8570
      @quonxinquonyi8570 2 дні тому

      The internet has been a huge public repository since its inception.

  • @StashOfCode
    @StashOfCode 5 днів тому

    There is a paper on The Gradient about inverting embeddings back to text ("Do text embeddings perfectly encode text?").

  • @j3r3miasmg
    @j3r3miasmg 7 днів тому

    I didn't read the cited paper, but if I understood correctly, the 5 billion images need to be labeled for the training step?

  • @genuinefreewilly5706
    @genuinefreewilly5706 9 днів тому

    Great explainer. Appreciated. I hope someone will cover AI music next

    • @suicidalbanananana
      @suicidalbanananana 9 днів тому +1

      In super short:
      Most "AI music stuff" is literally just running Stable Diffusion in the backend. They train a model on images of spectrograms of songs, then ask it to make an image like that, and then convert that spectrogram image back into sound.

    • @genuinefreewilly5706
      @genuinefreewilly5706 8 днів тому

      @@suicidalbanananana Yes, I can see that; however, AI music has made a sudden, marked departure in quality of late.
      It's pretty controversial among musicians.
      I can wrap my head around narrow AI applications in music, i.e. mastering, samples, etc. It's been a mixed bag of results until recently.

    • @or1on89
      @or1on89 5 днів тому +1

      It surely would be interesting…I can see a lot of people embracing it for pop/trap music and genres with “simple” compositions…my worry as a musician is that it would make the landscape more boring than boy bands in the 90s (and somewhat already is without AI being involved).
      As a software developer I would love instead to explore the tool to refine filters, corrections and sampling during the production process…
      It’s a bit of a mixed bag…the generative aspect is being marketed as the “real revolution” and that’s a bit scary…knowing more the tech and how ML can help improve our tools would be great…
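
A rough sketch of the spectrogram round trip the first reply in this thread describes, using the librosa and soundfile libraries. The diffusion model that would generate or edit the spectrogram "image" is omitted, and the filenames are made up:

    import librosa
    import soundfile as sf

    # Audio -> mel spectrogram (the "image" a diffusion model could be trained on)
    audio, sr = librosa.load("song.wav", sr=22050)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128)

    # ... a diffusion model would generate or modify `mel` here ...

    # Spectrogram -> audio again (approximate inversion via Griffin-Lim)
    reconstructed = librosa.feature.inverse.mel_to_audio(mel, sr=sr)
    sf.write("reconstructed.wav", reconstructed, sr)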

  • @LupinoArts
    @LupinoArts 8 днів тому

    3:55 As someone born in the former GDR, I find it cute to label a Trabi as "a car"...

  • @LukeTheB
    @LukeTheB 9 днів тому

    Quick question from someone outside computer science:
    Does the model actually instill "meaning" into the embedded space?
    What I mean is:
    Is the Angel between "black car" and "Red car" smaller than "black car" and "bus" and that is smaller than "black car" and "tree"?

    • @suicidalbanananana
      @suicidalbanananana 9 днів тому +1

      Yeah that's correct, "black car" and "red car" will be much closer to each other than "black car" and "bus" or "black car" and "tree" would be. It's just pretty hard to visualize this in our minds because we're talking about some strange sort of thousands-of-dimensions-space with billions of data points in it. But there's definitely discernable "groups of stuff" in this data.
      (Also, "Angle" not "Angel" but eh, we get what you mean ^^)

  • @nenharma82
    @nenharma82 9 днів тому +1

    This is as simple as it is ingenious, and it wouldn't be possible without the internet being what it is.

    • @IceMetalPunk
      @IceMetalPunk 9 днів тому

      True! Although it also requires Transformers to exist, as previous AI architectures would never be able to handle all the varying contexts, so it's a combination of the scale of the internet and the invention of the Transformer that made it all possible.

    • @Retrofire-47
      @Retrofire-47 7 днів тому

      @@IceMetalPunk the transformer, as someone who is ignorant, what is that? I only know a transformer as a means of converting electrical voltage from AC - DC

  • @NeinStein
    @NeinStein 9 днів тому

    Oh look, a Mike!

  • @ianburton9223
    @ianburton9223 9 днів тому

    Difficult to see how convergence can be ensured. Lots of very different functions can be closely mapped over certain controlled ranges, but then are wildly different outside those ranges. What I have missed in many AI discussions is these concepts of validity matching and range identities to ensure that there's some degree of controlled convergence. Maybe this is just a human fear of the unknown.

  • @GeoffryGifari
    @GeoffryGifari 9 днів тому +1

    How can AI determine the "importance" of parts of an image? Why would it output "people in front of boat" instead of "boat behind people" or "boat surrounded by people"?
    Or maybe the image is a grid of square white cells, and one cell then has its colour progressively darkened to black. Would the AI describe these transitioning images differently?

    • @michaelpound9891
      @michaelpound9891 9 днів тому +2

      Interesting question! This very much comes down to the training data in my experience. For the network to learn a concept such as "depth ordering", where something is in front of another, what we are really saying is it has learnt a way to extract features (numbers in grids) representing different objects, and then recognize that an object is obscured or some other signal that indicates this concept of being in front of. For this to happen in practice, we will need to see many examples of this in the training data, such that eventually such features occurring in an image lead to a predictable text response.

    • @GeoffryGifari
      @GeoffryGifari 9 днів тому

      @@michaelpound9891 The man himself! thank you for your time

    • @GeoffryGifari
      @GeoffryGifari 9 днів тому +1

      @@michaelpound9891 I picked that example because... maybe it's not just depth? Maybe there are a myriad of factors that the AI summarizes as "important".
      For example, the man is in front of the boat, but the boat is far enough behind that it looks somewhat small... Or maybe that small boat has a bright colour that contrasts with everything else (including the man in front).
      But your answer makes sense, that it's the training data.

  • @bennettzug
    @bennettzug 8 днів тому

    13:54 You actually probably can, at least to an extent.
    There's been some recent research on the idea of going backwards from embeddings to text; see the paper "Text Embeddings Reveal (Almost) As Much As Text" (Morris et al.).
    The same thing has been done with images from a CNN, see "Inverting Visual Representations with Convolutional Networks" (Dosovitskiy et al.).
    Neither of these are with CLIP models, so maybe future research? (Not that it'd produce better images than a diffusion model.)

    • @or1on89
      @or1on89 5 днів тому

      You can, using a different type of network/model. We need to remember that everything he said is in the context of a specific type of model and not in absolute terms; otherwise the lesson would very quickly go out of scope and become hard to follow.

    • @bennettzug
      @bennettzug 4 дні тому

      @@or1on89 i don’t see any specific reason why CLIP model embeddings would be especially intractable though

  • @eigd
    @eigd 9 днів тому

    9:48 Been a while since I did machine learning class... Anyone care to tell me why I'm thinking of PCA? What's the connection?

  • @donaldhobson8873
    @donaldhobson8873 9 днів тому

    Once you have a CLIP, can't you train a diffusion model on pure images, just by putting an image into CLIP and training the diffusion model to output the same image?

  • @charlesgalant8271
    @charlesgalant8271 9 днів тому +1

    The answer given for "we feed the embedding into the denoising process" still felt a little hand-wavy to me as someone who would like to understand better, but overall a good video.

    • @michaelpound9891
      @michaelpound9891 9 днів тому +3

      Yes I'm still skipping things :) The process this uses is called attention, which basically is a type of layer we use in modern deep networks. The layer allows features that are related to share information amongst themselves. Rob Miles covered attention a little in the video "AI Language Models & Transformers", but it may well be time to revisit this since attention has become quite a lot more mainstream now, being put in all kinds of networks.

    • @IceMetalPunk
      @IceMetalPunk 9 днів тому

      @@michaelpound9891 It is, after all, all you need 😁 Speaking of attention: do you think you could do a video (either on Computerphile or elsewhere) about the recent Infini-Attention paper? It sounds to me like it's a form of continual learning, which I think would be super important to getting large models to learn more like humans, but it's also a bit over my head so I feel like I could be totally wrong about that. I'd appreciate an overview/rundown of it, if you've got the time and desire, please 💗
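
For anyone wanting slightly more than the hand-wave: a bare-bones sketch of the cross-attention layer Dr Pound mentions, where features of the partly denoised image attend to the text embedding. Dimensions and names are illustrative only, not any particular model's internals:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossAttention(nn.Module):
        """Image features (queries) pull in information from text features (keys/values)."""
        def __init__(self, image_dim=320, text_dim=768, attn_dim=320):
            super().__init__()
            self.to_q = nn.Linear(image_dim, attn_dim)  # queries from image features
            self.to_k = nn.Linear(text_dim, attn_dim)   # keys from the text embedding
            self.to_v = nn.Linear(text_dim, attn_dim)   # values from the text embedding
            self.out = nn.Linear(attn_dim, image_dim)

        def forward(self, image_tokens, text_tokens):
            q, k, v = self.to_q(image_tokens), self.to_k(text_tokens), self.to_v(text_tokens)
            scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
            weights = F.softmax(scores, dim=-1)
            # Each image location mixes in the caption tokens most relevant to it
            return self.out(weights @ v)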

  • @unvergebeneid
    @unvergebeneid 9 днів тому

    It's confusing to say that you want to maximise the distances on the diagonal, though. Of course you can define things however you want, but usually you'd say you want to maximise the cosine similarity and thus minimise the cosine distance on the diagonal.
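
For reference: for unit-length embeddings, cosine distance = 1 - cosine similarity, so maximising the similarity on the diagonal is exactly the same as minimising the distance there.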

  • @hehotbros01
    @hehotbros01 6 днів тому

    Poundtown.. sweet...

  • @ginogarcia8730
    @ginogarcia8730 8 днів тому

    I wish I could hear Professor Brailsford's thoughts on AI these days man

  • @proc
    @proc 9 днів тому

    9:48 I didn't quite get how similar embeddings end up close to each other if we maximize the distances to all other embeddings in the batch. Wouldn't two images of dogs in the same batch be pulled further apart, just like an image of a dog and an image of a cat would? Explain it like Dr. Pound, please.

    • @drdca8263
      @drdca8263 9 днів тому

      First: I don’t know.
      Now I’m going to speculate:
      Not sure if this had a relevant impact, but: probably there are quite a few copies of the same image with different captions, and of the same caption for different images?
      Again, maybe that doesn’t have an appreciable effect, idk.
      Oh, also, maybe the number of image,caption pairs is large compared to the number of dimensions for the embedding vectors?
      Like, I know the embedding dimension is pretty high, but maybe the number of image,caption pairs is large enough that some need to be kinda close together?
      Also, presumably the mapping producing the embedding of the image, has to be continuous, so, images that are sufficiently close in pixel space (though not if only semantically similar) should have to have similar embeddings.
      Another thing they could do, if it doesn’t happen automatically, is to use random cropping and other small changes to the images, so that a variety of slightly different versions of the same image are encouraged to have similar embeddings to the embedding of the same prompt.

  • @klyanadkmorr
    @klyanadkmorr 9 днів тому +1

    Heyo, a Pound dogette here!

  • @EkShunya
    @EkShunya 8 днів тому

    I thought diffusion models had a VAE and not a ViT.
    Correct me if I'm wrong.

    • @quonxinquonyi8570
      @quonxinquonyi8570 2 дні тому

      A diffusion model is an upgraded version of a VAE, with a limitation in sampling speed.

  • @MikeKoss
    @MikeKoss 8 днів тому

    Can't you do something analogous to stable diffusion for text classification? Get the image embedding, and then start with random noisy text, and iteratively refine it in the direction of the image's embedding to get a progressively more accurate description of the image.

    • @quonxinquonyi8570
      @quonxinquonyi8570 2 дні тому

      Image manifolds are of huge dimension compared to text manifolds, so guided diffusion from a low-dimensional manifold to a very high-dimensional one would have less information and more noise. Basically, information-theoretic bounds still hold when you transform from a high-dimensional space to a low-dimensional embedding, but the other way around doesn't seem as intuitive; some prior must be taken into account. It is still a hard problem.

  • @bogdyee
    @bogdyee 9 днів тому

    I'm curious about a thing. If you have millions of photos of cats and dogs and they are also correctly labeled (with descriptions), but all these photos have the cats and dogs in the bottom half of the image, will the transformer be able to correctly classify them after training if they are put in the upper half of the image? (Or if the images are rotated, colours changed, filtered, etc.)

    • @Macieks300
      @Macieks300 9 днів тому

      Yes, it may learn it wrong. That's why scale is necessary for this. If you have a million photos of cats and dogs, it's very unlikely that all of them have the animal in the bottom half of the image.

    • @bogdyee
      @bogdyee 9 днів тому

      @@Macieks300 That's why, for me, it poses a philosophical question. Will these things actually solve intelligence at some point? If so, what exactly might be the difference between a human brain and an artificial one?

    • @IceMetalPunk
      @IceMetalPunk 9 днів тому

      @@bogdyee Well, think of it this way: humans learn very similarly. It may not seem like it, because the chances of a human only ever seeing cats in the bottom of their vision and never anywhere else is basically zero... but we do. The main difference between human learning and AI learning, with modern networks, is the training data: we're constantly learning and gathering tons of data through our senses and changing environments, while these networks learn in batches and only get to learn from the training data we curate, which tends to be relatively static. But give an existing AI model the ability to do online learning (i.e. continual learning, not "look up on the internet" 😅) and put it in a robot body that it can control? And you'll basically have a human brain, perhaps at a different scale. And embodied AIs are constantly being worked on now, and continual learning for large models... I'm not sure about. I think the recent Infini-Attention is similar, though, so we might be making progress on that as well.

    • @suicidalbanananana
      @suicidalbanananana 8 днів тому

      @@bogdyee Nah they won't solve intelligence at some point when going down this route they are currently going down, AI industry was working on actual "intelligence" for a while but all this hype about shoving insane amounts of training data into "AI" has reduced the field to really just writing overly complex search engines that sort of mix results together... 🤷‍♂
      It's not trying to think or understand anything at all at this stage (which is the actual goal of the AI field), it's really just trying to match patterns. "Ah, the user talked about dogs, my training data contains the following info about dog type a/b/c; oh, the user asks about trees, training data contains info about tree type a/b/c", etc.
      Actual AI (not even getting to the point of 'general ai' yet but certainly getting to somewhere much better than what we have now) would have little to no training data at all; instead it would start 'learning' as it's running, so you would talk to it about trees and it would go "idk what a tree is, please tell me more" and then later on it might have some basic understanding of "ah yes, tree, i have heard about them, person x explained them to me, they let you all breathe & exist in type a/b/c, right? please tell me more about trees"
      Where the weirdness lies is that the companies behind current "AI" are starting to tell the "AI" to respond in a similar smart manner, so they are starting to APPEAR smart, but they're not actually capable of learning. All the current AI's do not remember any conversation they have had outside of training, because that makes it super easy to turn Bing (or whatever) into yet another racist twitter bot (see microsoft's history with ai chatbots)

    • @suicidalbanananana
      @suicidalbanananana 8 днів тому

      @@IceMetalPunk The biggest difference is that we (or any other biological intelligence) don't need insanely large amounts of training data: show a baby some spoons and forks and how to use them, and that baby/person will recognize and be able to use 99.9% of spoons and forks correctly for the rest of its life. Current overhyped AIs would have to see thousands of spoons and forks to maybe get it right 75% of the time, and that's just recognizing it; we're not even close yet to 'understanding how to use'.
      Also worth noting is how we (and again, any other biological intelligence) are always gathering "training data" and are much more versatile when it comes to new things: if you train an AI to recognize spoons and forks and then show it a knife, it's just going to classify it as a fork or spoon, whereas we would go "well, that's something I've not seen before, so it's NOT a spoon and NOT a fork".

  • @nightwishlover8913
    @nightwishlover8913 8 днів тому

    5:02 Never seen a "boat wearing a red jumper" before lol

  • @MattMcT
    @MattMcT 8 днів тому

    Do any of you ever get this weird feeling that you need to buy Mike a beer? Or perhaps, a substantial yet unknown factor of beers?

  • @fredrik3685
    @fredrik3685 7 днів тому

    Question 🤚
    Up until recently, all images of a cat on the internet were photos of real cats, and the system could use them in training.
    But now more and more cat images are AI generated.
    If future systems use generated images in training, it will be like the blind leading the blind. More and more distortion will be added. Or? Can that be avoided?

    • @quonxinquonyi8570
      @quonxinquonyi8570 2 дні тому

      Distortion and perceptual quality are the tradeoff we make when we use generative AI.

  • @Rapand
    @Rapand 9 днів тому

    Each time I watch one of these videos, I might as well be watching Apocalypto without subtitles. My brain is not made for this 🤓

  • @MedEighty
    @MedEighty 5 днів тому

    10:37 "If you want to unlock a face with your phone". Ha ha ha!

  • @bryandraughn9830
    @bryandraughn9830 9 днів тому

    I wonder if every cat image has specific "cat" types of numerical curves, textures, eyes and so on, so that a completely numerical calculation would conclude that the image is of a cat.
    There's only so much variety of pixel arrangements at a given resolution; it seems like images could be reduced to pure math. I'm probably so wrong.
    Just curious.

    • @quonxinquonyi8570
      @quonxinquonyi8570 2 дні тому

      You are absolutely right. Images are of very high dimension, but the image manifold is still considered to fill only a very low-dimensional part of the whole image hyperspace. The only way to manipulate or tweak that image manifold is by adding noise, but noise is of very low dimension compared to that high-dimensional image manifold, so a perturbation or guidance of the image manifold in the form of noise disturbs it in one of its many inherent directions. This is similar to finding the slope of a curve (the manifold) by linearly approximating it with a line (the noise), the method you learn in high-school maths. If you want to discuss more, I will clarify it further.

  • @MuaddibIsMe
    @MuaddibIsMe 9 днів тому +2

    "a mike"

  • @creedolala6918
    @creedolala6918 9 днів тому

    'and we want an image of foggonstilz'
    me: wat
    'we want to pass the text of farngunstills'
    me: u wot m8

  • @CreachterZ
    @CreachterZ 9 днів тому +1

    How does he stay on top of all of this technology and still have time to teach? …and sleep?

  • @MilesBellas
    @MilesBellas 9 днів тому

    Stable Diffusion needs a CEO BTW
    ....just saying ...
    😅

  • @babasathyanarayanathota8564
    @babasathyanarayanathota8564 8 днів тому

    Me: added "AI expert" to my resume

  • @Ginto_O
    @Ginto_O 9 днів тому

    A yellow cat is called a red cat.

  • @RawrxDev
    @RawrxDev 9 днів тому +7

    Truly a marvel of human applications of mathematics and engineering, but boy do I think these tools have significantly more cons than pros in practical use.

    • @aprilmeowmeow
      @aprilmeowmeow 9 днів тому +3

      agreed. The sheer power required is an ethical concern

    • @suicidalbanananana
      @suicidalbanananana 9 днів тому +2

      We're currently experiencing an "AI bubble" that will pop within 2-3 years or less, no doubt about that at all. Companies are wasting money and resources trying to be the first to make something crappy appear less crappy than it actually is, but they don't fully realize yet that that's a harder task than it might seem & it's going to be extremely hard to monetize the end result.
      We need to move back to AI research trying to recreate a biological brain; somehow the field has suddenly been reduced to people trying to recreate a search engine that mixes results or something, which is just ridiculous & running in the opposite direction from where the AI field should be heading.

    • @RawrxDev
      @RawrxDev 8 днів тому

      @@suicidalbanananana That's my thought as well. I even recently watched a clip of Sam Altman saying they have no idea how to actually make money from AI without investors, and that he is just going to ask the AGI how to make a return once they achieve AGI, which to me seems..... optimistic.

  • @willhart2188
    @willhart2188 8 днів тому

    AI art is great.

  • @FLPhotoCatcher
    @FLPhotoCatcher 9 днів тому

    At 16:20 the 'cat' looks more like a shower head.

  • @djtomoy
    @djtomoy 8 днів тому

    Why is there always so much mess and clutter in the background of these videos? Do you film them in abandoned buildings?

  • @MagicPlants
    @MagicPlants 9 днів тому +2

    the gorilla camera moving around all the time is making me dizzy

  • @SkEiTaDEV
    @SkEiTaDEV 9 днів тому

    Isn't there an AI that fixes shaky video by now?

    • @creedolala6918
      @creedolala6918 9 днів тому

      Isn't that a problem that's been solved without AI already? Someone can ride on a mountain bike that's violently shaking, down a forest trail, with a GoPro on his helmet, and we get perfect smooth video of it somehow.

  • @grantc8353
    @grantc8353 9 днів тому

    I swear that P took longer to come up than the rest.

  • @JeiShian
    @JeiShian 9 днів тому

    The exchange at 6:50 made me laugh out loud and I had to show that part of the video to the people around me😆😆

  • @creedolala6918
    @creedolala6918 9 днів тому

    Normally this guy is great for explaining things and super clear, but in this case it feels like he's kind of assuming some prior knowledge or understanding, and not really giving us the 'explain it like I'm five years old' version. And for a subject this complicated, you kind of need that.

    • @quonxinquonyi8570
      @quonxinquonyi8570 2 дні тому

      Want to learn generative AI? Here is an SAT-style problem for you: adding one percent to an amount and then removing that same one percent will never give you back the original amount. That percentage is "noise", removing that percentage is "generation", and automating this process with computational power is called generative AI. Want to know more? I will teach you, like a fifth grader, about adding and removing noise in an image, because that is the only piece of art in generative AI.
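
A quick check of the arithmetic in the reply above: starting from 100, adding 1% gives 100 × 1.01 = 101, and then removing 1% of that gives 101 × 0.99 = 99.99, so you do not get the original 100 back.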

  • @YouTubeCertifiedCommenter
    @YouTubeCertifiedCommenter 5 днів тому

    This must have been the entire purpose of Google's Picasa.

  • @ysidrovasquez4591
    @ysidrovasquez4591 9 днів тому

    Have you noticed that these are a little group of geeks whose clutch is slipping... they make videos that don't say anything.

  • @artseiff
    @artseiff 2 дні тому

    Oh, no… All cats are missing a tail 😿

  • @RupertBruce
    @RupertBruce 8 днів тому

    Cat picture needs a tail!

  • @MichaelPetito
    @MichaelPetito 8 днів тому

    Your cat needs a tail!

  • @sogwatchman
    @sogwatchman 9 днів тому

    10:36 When you unlock a face with your phone... Umm what?

  • @planesrift
    @planesrift 4 дні тому

    AI now understands what I cannot understand

  • @justina208
    @justina208 3 дні тому

    Support request: My face is locked and I can’t find my phone. I need to be at work in an hour. Please respond ASAP.

    • @quonxinquonyi8570
      @quonxinquonyi8570 2 дні тому

      You needed to be at slavery within an hour… but your phone has set you free for a day. Thank your phone now.