I can never get enough of the Epic Tim intros! :D
❤
It was like a small literature review section in itself.
This is what YouTube is for. Clear explanations and a beautiful intro! Tim's intro is fundamental for understanding the rest.
Thanks!
Thanks, this episode is 🔥! You ask many questions I had in mind lately.
This was definitely one of the better episodes: it covered a lot of ground in good detail, with excellent content and strong guiding and follow-up questions.
You guys are so incredible. Thank you so much. We appreciate this every single second. ☺️☺️☺️
Thank you for acknowledging the serious problems of calling images from Instagram "random", as is claimed in the SEER paper!
Diving deep into this topic myself! So complex yet elegant… 🤔🤩
Such high quality content, so happy I found this channel!
Here from Lex Fridman's shout out in his latest interview with Ishan Misra.
❤
My gawd. I love this episode!!!
What a very interesting topic! It's amazing to know why these vision algorithms actually work!
Great episode and discussion! I think this discussion should also cover GAN latent-space discovery. Unsupervised learning is the nirvana for every data scientist in production. On a side note, modern GANs can potentially span multiple domains, though current work is mainly centred on single-domain datasets like faces or bedrooms. The latent variables or feature spaces are discovered by the networks in an unsupervised fashion, though much work remains on better encoder and generator/discriminator architectures. The current best models can reconstruct a scene with different view angles, different lighting, different colours, etc., BUT they still CANNOT conjure up a structurally meaningful texture/structure for the scene; e.g. a bed, table, or curtain gets contorted beyond being a bed or table. It will be interesting to see whether latent features discovered in GANs can help in unsupervised learning too.
GANs are unsupervised learning algorithms that use a supervised loss as part of the training :)
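To make that point concrete, here is a minimal sketch (assuming PyTorch; the toy networks and shapes are illustrative, not any particular GAN): the discriminator trains on an ordinary supervised binary cross-entropy loss, but the real/fake labels are produced by the training loop itself, so no human annotation is needed.

```python
import torch
import torch.nn as nn

# Toy discriminator and generator over hypothetical 2-D data points.
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))

bce = nn.BCEWithLogitsLoss()  # the "supervised" loss inside the GAN

real = torch.randn(16, 2)     # stand-in for real data samples
fake = G(torch.randn(16, 8))  # generated samples

# The labels (ones for real, zeros for fake) come from the loop itself,
# not from annotators -- that is the sense in which it stays unsupervised.
d_loss = bce(D(real), torch.ones(16, 1)) + bce(D(fake.detach()), torch.zeros(16, 1))
g_loss = bce(D(fake), torch.ones(16, 1))  # generator tries to fool D
```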
I was wondering if quantum computing will help with the latent variables mentioned at 1:24:54
Loved the episode. :)
23:00 The tendency of mass to clump together and increase spatial and temporal continuity...
An agent always has a goal. No matter how broad that goal is, the data samples the agent collects from the real world will be skewed towards it, so data collected by such an agent will also carry an inductive bias. The collection of data is therefore never completely disentangled from the task. Even if you mount a camera on a monkey or a snail, there will be a pattern (i.e. a bias) to the data that is collected.
Contrast this with taking completely random samples of images, say from a camera parameterised by its position in the world and its view direction, both driven by a random number generator; that data would have a very uniform distribution. But in that sense, is that even intelligence?
I think any form of intelligence ultimately embodies some sort of intrinsic bias. Human beings, being the most general intelligence machines, also collect visual data in a converging fashion with age, guided by goals that are themselves learnt over time. Though still very general, humans too have a direction.
PS: Excellent video. Thanks for picking this up.
Great discussion!
A follow-up question, about one thing I didn't quite understand (perhaps I'm missing something obvious):
With reference to 6:36: from what I heard/read in the video/paper, these attention masks were gathered from the last self-attention layer of a ViT. The DINO paper showed that one of the heads in the last self-attention layer attends to areas that correspond to actual objects in the original image. That seems weird; I'd have thought that by the time you reach the last few layers, the image representation would have been altered in ways that make the original image irrecoverable. Would it be accurate to say this implies the original image representation either makes it through to the last layer(s) or is somehow recovered?
It is recovered. The attention weights trace back which inputs trigger the most attention.
@@dmitryplatonov thanks.
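Roughly what that tracing looks like, as a toy sketch (not the DINO authors' code; the single attention head and the 14x14 patch grid are illustrative): the [CLS] token's attention weights over the patch tokens in the final layer are reshaped back into a spatial mask over the input patches.

```python
import torch

# Toy single-head self-attention over a [CLS] + 14*14 patch-token sequence.
n_patches, dim = 14 * 14, 64
tokens = torch.randn(1, 1 + n_patches, dim)  # hypothetical last-layer input

Wq, Wk = torch.randn(dim, dim), torch.randn(dim, dim)
q, k = tokens @ Wq, tokens @ Wk
attn = torch.softmax(q @ k.transpose(-2, -1) / dim ** 0.5, dim=-1)

# The [CLS] token's attention over the patch tokens, reshaped back into
# the 14x14 patch grid -- this is the kind of mask shown in the paper.
cls_attn = attn[0, 0, 1:].reshape(14, 14)
```

Each patch token sits at a fixed grid position, so the attention a head pays to it maps directly back onto the input image even after many layers of processing.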
What software are you using for annotating/presenting the papers?
Amazing video 😍
Hi, awesome episode! Can I ask which paper's is the figure in 1:15:51? It's supposed to be DINO but I can't find it in the DINO paper. Thanks in advance!
Page 2 of the DINO paper. Note "DINO" paper full title is "Emerging Properties in Self-Supervised Vision Transformers" arXiv:2104.14294v2
@@MachineLearningStreetTalk Thanks! I was looking at another DINO paper (arXiv:2102.09281).
44:23 Is there a paper somewhere that I can read on this?
You mean the statement from Ishan that you could randomly initialise a CNN and it would already know cats are more similar to each other than dogs? Hmm. The first paper which comes to mind is this arxiv.org/abs/2003.00152 but I think there must be something more fundamental. Can anyone think of a paper?
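One cheap way to probe that claim yourself, as a sketch (assuming PyTorch; the random tensors below stand in for real cat/dog images, so this only shows the mechanics): extract features from an untrained, randomly initialised CNN and compare cosine similarities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Untrained CNN -- weights are random, never fitted to any data.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

def feat(img):
    with torch.no_grad():
        return cnn(img.unsqueeze(0)).squeeze(0)

# With real images you would check whether cos(cat1, cat2) > cos(cat1, dog1)
# holds on average across many pairs.
cat1, cat2, dog1 = (torch.rand(3, 64, 64) for _ in range(3))
sims = {pair: F.cosine_similarity(feat(a), feat(b), dim=0).item()
        for pair, (a, b) in {"cat-cat": (cat1, cat2),
                             "cat-dog": (cat1, dog1)}.items()}
```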
Correction: at 13:29, you said BYOL stands for Bring Your Own Latent. It should actually be Bootstrap Your Own Latent (BYOL).
Yep sorry
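For anyone curious, the "bootstrap" in Bootstrap Your Own Latent can be sketched like this (heavily simplified, assuming PyTorch; real BYOL adds projection/prediction heads and two augmented views of the same image): an online network predicts a slowly moving target network's representation, and the target weights are an exponential moving average of the online ones.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

online = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
target = copy.deepcopy(online)  # target network starts as a copy
for p in target.parameters():
    p.requires_grad_(False)     # target receives no gradients

def byol_loss(x1, x2):
    # Online net predicts the target net's representation of the other view.
    p, z = online(x1), target(x2).detach()
    return 2 - 2 * F.cosine_similarity(p, z, dim=-1).mean()

def ema_update(tau=0.99):
    # Target weights are an exponential moving average of the online ones.
    for po, pt in zip(online.parameters(), target.parameters()):
        pt.data = tau * pt.data + (1 - tau) * po.data

x1, x2 = torch.randn(8, 128), torch.randn(8, 128)  # stand-ins for two views
loss = byol_loss(x1, x2)
ema_update()
```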
Splendid video!
Really like the intro music. Would anyone happen to know where to find the music used?
soundcloud.com/unseenmusic/sets/ambient-electronic-1
It doesn't matter when I am not around, i.e. what happens in 100 years. - Modified from Misra.
Are a "cartoon banana" and a "real banana" subtypes of the same category, namely "banana"? There's obviously some relation between the two, but Ishan Misra is absolutely right: a "cartoon banana" is a different category and is not a subtype of a "banana" (it cannot be eaten, it does not smell or taste like a banana, etc.). Interesting episode, as usual, Tim Scarfe.
What if you created a simulation of an early world (before technology, etc.), created an agent that learns about that environment, made the agent and the world's rules as close as possible to the real world, and then let it learn like Tesla's monster architecture, but unlabelled? It would be super hard to build, but I think that's the best approach to creating an Artificial General Intelligence. :v
It's interesting how useful simple edits like crop, rotation, contrast, edge and curve adjustments, plus the appearance of dirty pixels in intentionally low-resolution images, are when self-supervised learning is applied.
🍌🍌🍌😂So true 💓 the Map is not the territory.
Here from Lex.
The fake blur is very irritating. It hurts to see.
Lex gang
We are humbled to get the shout-out from Lex!
There's a lot of emphasis on these "us vs. them", "humans vs. the machine" themes in your introduction, which I think is excessive and biased. It's not man and machine. It's just us. They are us. We're them.
Radix sort O(n)
When k < log(n) it's fantastic.
For a cube root of bits in range a 6n FILO stack list sort time is indicated.
We meant that O(N log N) is the provably fastest comparison sort but great call out on Radix 😀
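To make the thread's complexity point concrete: an LSD radix sort runs in O(n·k) for n keys of k digits, which beats O(n log n) comparison sorts precisely when k < log n. A minimal sketch:

```python
def radix_sort(nums, base=10):
    """LSD radix sort for non-negative ints: O(n * k), k = digit count."""
    if not nums:
        return nums
    exp, largest = 1, max(nums)
    while exp <= largest:
        buckets = [[] for _ in range(base)]       # one bucket per digit value
        for x in nums:
            buckets[(x // exp) % base].append(x)  # stable distribution pass
        nums = [x for b in buckets for x in b]    # collect in digit order
        exp *= base
    return nums

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# → [2, 24, 45, 66, 75, 90, 170, 802]
```

It sidesteps the O(n log n) comparison-sort lower bound because it never compares two keys; it only inspects digits, which is why the bound applies to comparison sorts specifically.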
Machines are just an extension of nature, just like a tree, a beehive, or a baby.
For those who want to learn more from Ishan and more academic detail on the topics covered in the show today, Alfredo Canziani just released another show twitter.com/alfcnz/status/1409481710618693632 😎