Thank you to Stanford and to Prof. Manning for making these lectures available to everyone.
🎯 Key Takeaways for quick navigation:
00:05 🎓 This lecture introduces Stanford's CS224N course on NLP with deep learning, covering topics like word vectors, word2vec algorithm, optimization, and system building.
01:32 🤯 The surprising discovery that word meanings can be well represented by large vectors of real numbers challenges centuries of linguistic tradition.
02:29 📚 The course aims to teach deep understanding of modern NLP methods, provide insights into human language complexity, and impart PyTorch-based skills for solving NLP problems.
07:15 🗓️ Human language's evolution is relatively recent (100,000 - 1 million years ago), but it has led to significant communication power and adaptability.
10:59 🧠 GPT-3 is a powerful language model capable of diverse tasks due to its ability to predict and generate text based on context and examples.
14:52 🧩 Distributional semantics uses context words to represent word meaning as dense vectors, enabling similarity and relationships between words to be captured.
18:37 🏛️ Traditional NLP represented words as discrete symbols, lacking a natural notion of similarity; distributional semantics overcomes this by capturing meaning through context.
25:19 🔍 Word embeddings, or distributed representations, place words in high-dimensional vector spaces; they group similar words, forming clusters that capture meaning relationships.
27:15 🧠 Word2Vec is an algorithm introduced by Tomas Mikolov and colleagues in 2013 for learning word vectors from a text corpus.
28:11 📚 Word2Vec creates vector representations for words by predicting words' context in a text corpus using distributional similarity.
29:07 🔄 Word vectors are adjusted to maximize the probability of context words occurring around center words in the training text.
31:02 🎯 Word2Vec aims to predict context words within a fixed window size given a center word, optimizing for predictive accuracy.
32:56 📈 The optimization process involves calculating gradients using calculus to adjust word vectors for better context word predictions.
36:33 💡 Word2Vec employs the softmax function to convert dot products of word vectors into probability distributions for context-word prediction (a toy sketch of this step follows after this list).
38:51 ⚙️ The optimization process aims to minimize the loss function, maximizing the accuracy of context word predictions.
45:53 📝 The derivative of the log probability of context words involves using the chain rule and results in a formula similar to the softmax probability formula.
49:28 🔢 The gradient calculation involves adjusting word vectors to minimize the difference between observed and expected context word probabilities.
53:34 🔀 The derivative of the log probability formula simplifies into a form where the observed context word probability is subtracted from the expected probability.
58:57 📊 Word vectors for "bread" and "croissant" show similarity in dimensions, indicating they are related.
59:26 🌐 Word vectors reveal similar words to "croissant" (e.g., brioche, baguette), and analogies like "USA" to "Canada" can be inferred.
59:55 ➗ Word vector arithmetic allows analogy tasks, like "king - male + female = queen," and similar analogies can be formed for various words.
01:00:22 🤖 The analogy task shows the ability to perform vector arithmetic and retrieve similar words based on relationships.
01:01:23 🤔 Negative similarity and positive similarity together enable analogies and meaningful relationships among words.
01:03:17 💬 The model's knowledge is limited to the time it was built (2014), but it can still perform various linguistic analogies.
01:04:39 🧠 Word vectors capture multiple meanings and contexts for a single word, like "star" having astronomical or fame-related connotations.
01:05:36 🔄 Different vectors are used for a word as the center and as part of the context, contributing to the overall representation.
01:07:02 🧐 Using separate vectors for center and context words simplifies derivatives calculations and results in similar word representations.
01:11:26 ⚖️ The model struggles with capturing antonyms and sentiment-related relationships due to common contexts.
01:12:44 🎙️ The class primarily focuses on text analysis, with a separate speech class covering speech recognition and dialogue systems.
01:18:06 🗣️ Function words like "so" and "not" pose challenges due to occurring in diverse contexts, but advanced models consider structural information.
01:20:25 🧠 Word2Vec offers different algorithms within the framework; optimization details like negative sampling can significantly improve efficiency.
01:23:18 🔁 The process of constructing word vectors involves iterative updates using gradients, moving towards minimizing the loss function.
Made with HARPA AI
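Since the 36:33 item above mentions the softmax step, here is a minimal NumPy illustration of how that probability can be computed, using made-up toy vectors; the matrix names U (for the "outside"/context vectors) and V (for the center vectors) are my own, not the lecture's:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 4                       # toy sizes, not the lecture's
U = rng.normal(size=(vocab_size, dim))       # "outside"/context vectors u_w
V = rng.normal(size=(vocab_size, dim))       # center vectors v_w

def p_o_given_c(o, c):
    """Softmax probability P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ V[c]                        # dot product of every u_w with v_c
    scores = scores - scores.max()           # numerical stability; softmax unchanged
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(p_o_given_c(o=2, c=0))                                  # a probability in (0, 1)
print(sum(p_o_given_c(o, c=0) for o in range(vocab_size)))    # ~1.0
```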
I am so grateful that Stanford has given us all this great gift. Thanks to their great machine learning and AI video series, I am able to build a solid foundation of knowledge and have started my PhD based on that.
So many full courses in great quality, great lecturers AND with normal subtitles... Can someone PLEASE give Stanford University some kind of international prize for knowledge sharing?
99% of courses are not online and cost money. I would like them to add more.
Yes! TRUE indeed. Thank you Stanford. ❤❤❤❤❤
do you find this course easy to understand?
Do you have a tech background, or are you just a newbie in tech?
Thanks for everything Stanford University. As an AI master's student I have to state that having these lectures for free enables me to compare and broaden my ideas for NLP, resulting in deeper intuitive understanding of the subject.
Hi Teo, thanks very much for your comment and feedback! Happy to hear these lectures were so helpful to your studies.
Prof. Manning looks so happy explaining all the questions. That is so encouraging and heartwarming!
Amazing lecture it was; thanks to Stanford for making these lectures public.
Absolutely loved what the professor said at 43:47.
What a great lecturer, he feels students, puts himself in our place and explains material very nicely. This is literally my first piece of material about NLP I have ever seen, and I understood most of it. Thanks a lot
Awesome feedback, thanks for your comment!
Hello Stanford Online, I started to self-study machine learning because my university program does not teach AI in depth. I felt I had not reached my full potential, so I have been teaching myself AI for the past six months, covering machine learning, deep learning, and reinforcement learning. Thank you for this free lecture, I really appreciate it.
Oh my days I love his positive vibes! Also clear explanation of multiple topics. I really appreciate you providing us with such great lectures online for free!
I got exhausted, yet your enthusiasm is what made me stay here. Amazing session.
Math is not magic, but is as beautiful as magic.
Moved from the Coursera NLP Specialization to here. Definitely amazing to receive such detailed math explanations of all these concepts.
Is it better here?
Which should I do first, the specialization or CS224N?
The result at 55:45 is just beautiful!
Really liked the energy and simplicity of the presentation !
For some reason I am reminded of Grant from 3Blue1Brown. The way he speaks and the way he's excited about the subject, it's so intoxicating.
Can't expect more from a lesson! Thank you all for sharing the class towards all the people🤩
It is great to watch this and not have to do the homework.
Hahahaha
It's not entirely clear to me why we change the index, except for separating the sums at the end. Does anyone know more? Thanks!!
At 55:28, how did we get from the first line to the second, please?
Thank you Stanford and Professor for the excellent lecture!
At 32:46 it's like computing the entropy, but why? If anyone knows, please feel free to comment.
51:39 How do we get this? I don't understand.
At 51:45 he says "we need to change the index to x from w, or else we'll get into trouble" while taking the inner derivative of the exponential term.
How can he change the index, when the resulting denominator term would be exactly the same as the derivative of the exponential term, so they should cancel each other?
Changing the index seems to change the fundamental definition of P(o|c).
Is there something I am missing here?
What we are expecting is a sigma, i.e. a sum over a range. Now we have to find a way of expressing that sum; if we choose the index w, we'll confuse it with the other sigma notation, even though they are completely different. Hence we use a different index. The different index doesn't change the original definition of p(o|c), because it's just an index, a way of expressing the sum.
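To spell that out, here is my own rederivation of the step under discussion (worth checking against the slides), starting from p(o|c) = exp(u_o^T v_c) / Σ_{w=1}^V exp(u_w^T v_c):

```latex
\frac{\partial}{\partial v_c}\log p(o\mid c)
  = \frac{\partial}{\partial v_c}\left( u_o^\top v_c - \log\sum_{w=1}^{V}\exp(u_w^\top v_c) \right)
  = u_o - \frac{\sum_{x=1}^{V}\exp(u_x^\top v_c)\,u_x}{\sum_{w=1}^{V}\exp(u_w^\top v_c)}
  = u_o - \sum_{x=1}^{V} p(x\mid c)\,u_x
```

The two sums don't cancel because the inner one carries an extra factor of u_x; the fresh index x just keeps it notationally separate from the denominator's sum over w.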
Hopefully I will be proud after its completion.
00:56:55 Gensim word vectors example
01:05:16 Student Q&A
1:10:50 Why would you average both vectors together, wouldn't it be useful to keep both of the vectors depending on the different tasks that need to be done?
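For anyone wondering what the averaging at 1:10:50 amounts to in practice, a minimal sketch under my own assumption that the trained parameters are two matrices U (context vectors) and V (center vectors) of shape (vocab_size, dim):

```python
import numpy as np

vocab_size, dim = 5, 4                       # toy sizes
rng = np.random.default_rng(0)
U = rng.normal(size=(vocab_size, dim))       # stand-in for trained context vectors u_w
V = rng.normal(size=(vocab_size, dim))       # stand-in for trained center vectors v_w

# One common post-training choice: average the two vectors per word.
# Keeping U and V separate is also possible if a downstream task wants both.
word_vectors = (U + V) / 2
print(word_vectors.shape)                    # (5, 4)
```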
This might be silly, but after 55:00, when we take u_x out of the derivative, why do we lose the transpose operator?
Same doubt, did you figure it out by any chance?
@Ad-qv7ij I guess there is a little error there. If you try to derive it on your own, you will reach the right expression.
Thank you so much for providing these lectures.
Isn't w_t the center word rather than w_j on slide 23 (30:52)?
w_t is the center word.
Yes, w_t is the center word; j ranges from -m to m.
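For reference, the objective being discussed, as I recall it from the lecture (worth double-checking against slide 23): w_t is the center word at position t and j ranges over the window offsets,

```latex
L(\theta) = \prod_{t=1}^{T}\ \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j}\mid w_t;\theta),
\qquad
J(\theta) = -\frac{1}{T}\log L(\theta)
          = -\frac{1}{T}\sum_{t=1}^{T}\ \sum_{\substack{-m \le j \le m \\ j \neq 0}}\log P(w_{t+j}\mid w_t;\theta).
```

The 1/T in log space is the same thing as taking the T-th root of the double product.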
The best video on nlp
Never seen such a beautiful lecture before!
How are the initial probabilities of the context word vectors calculated? They are mentioned at 55:29 but not how they are determined.
Wow this vector idea is interesting. Have we tried getting models to emit nonsense text that nonetheless has a similar vectors to real words and seeing if human brains sort of subconsciously get that same meaning? Computers could be really good at writing poetry o.o
Like onomatopoeia and Lewis Carroll dialed up to 11
Calculus noob question: why don't the two sums (for w from 1 to V, over the exp(u_w^T v_c) terms) cancel out at 55:10?
Great content, excellent delivery.
When we take the chain-rule derivative, why do we lose the transpose operation? For example, at 53:06 there is just u_x, not u_x^T. Why?
We can treat that as a gradient. The dot product can be viewed as a multivariable function of the input (v_c1, ..., v_cd), so we can calculate its gradient with respect to each component of v_c. Since the gradient is the direction v_c should move in order to increase the value of the dot product, this gradient vector can be added to v_c, so they should have the same shape :)
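A quick numeric way to see the shape argument (toy vectors of my own, not from the lecture): the gradient of the scalar u_o^T v_c with respect to v_c is just the vector u_o, with the same shape as v_c, so no transpose appears.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
u_o, v_c = rng.normal(size=dim), rng.normal(size=dim)

f = lambda v: u_o @ v                        # the scalar dot product u_o . v

# Finite-difference gradient of f at v_c, one coordinate at a time.
eps = 1e-6
numeric_grad = np.array([(f(v_c + eps * e) - f(v_c - eps * e)) / (2 * eps)
                         for e in np.eye(dim)])

print(numeric_grad.shape)                    # (4,) -- same shape as v_c
print(np.allclose(numeric_grad, u_o))        # True: the gradient is u_o itself
```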
Do areas of sparsity in the high-dimensional word2vec space mean anything? For example, can you say that some word should exist here, but doesn't?
I wonder if words which don't have an equivalent in other languages fit here
The version of the SciPy library seems to be too new for the assignment to work properly; I can't import triu. If anyone knows a fix, please comment.
Why is the change of variable at 51:38 necessary? Does it not represent the same quantity whether we use u_w or u_x?
I want to know whether the course provides the homework answers.
It was a great lesson. Hope the sound quality will be better in the future.
Great again!
are these lecture slides available to us??
is there any way to get access to the notebooks shown through out the course? Thanks!
You could have explained the probability portion a little more, sir... The differentiation of the vectors is quite straightforward.
41:10 I don't understand the gradient. How do we get it? Can anyone reading the comments give me advice? 🤗🤗🤗
Check up on your knowledge of single-variable calculus (derivatives, differentiation, interpretation of a derivative, and applications of derivatives), and then just the basics of multivariable calculus (functions of several variables, partial derivatives). MIT 18.01SC and 18.02SC could be good (and free) resources for picking it up. That is, if you want to understand the math under the hood; I'd say that in parallel you could definitely practice with the higher-level applications, just like in this course.
Thanks @izumiasmr
This is so amazing. Thank you so much for the wonderful explanation
Which are his personal sentences ?
in every sub topic they share their learning experience...
This is amazing, thank you for uploading this online
Hi sir, is it possible to use neural networks to learn new dialects and translate new words that belong to unknown dialects of various languages?
I had a question about "observed - expected" around 55:48. Maybe I misunderstand, but isn't the summation of p(x|c)·u_x our prediction, therefore making it our observed?
Yes, it is our prediction, but because it's a prediction, that makes it the expected. The vector u_o (for the word we actually observe) is the observed; we subtract the sum of p(x|c)·u_x from it to obtain the margin of error. In a perfect case, they would subtract to 0, which he explains at 55:44.
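A small numeric check of the "observed minus expected" reading (toy numbers of my own, not the lecture's): the analytic gradient u_o − Σ_x p(x|c) u_x matches a finite-difference gradient of log p(o|c).

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, dim, o = 6, 4, 2
U = rng.normal(size=(vocab_size, dim))     # context ("outside") vectors u_w
v_c = rng.normal(size=dim)                 # center vector

def log_p(v):
    scores = U @ v
    return scores[o] - np.log(np.exp(scores).sum())   # log softmax at index o

probs = np.exp(U @ v_c) / np.exp(U @ v_c).sum()
analytic = U[o] - probs @ U                # "observed" u_o minus "expected" sum_x p(x|c) u_x

eps = 1e-6
numeric = np.array([(log_p(v_c + eps * e) - log_p(v_c - eps * e)) / (2 * eps)
                    for e in np.eye(dim)])
print(np.allclose(analytic, numeric))      # True
```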
Good content, but explanation-wise it is missing intuition at some points, especially when the formulas for the word vectors are being derived.
I am wondering how the two vectors (u_w, v_w) are determined for each word?
First of all, thank you so much for this amazing course. I have learned a lot from your lectures. Can I ask when this course will be updated?
Hi Raphael, thanks for your feedback and question! Our team is looking into adding new lectures for this course in the future :)
@stanfordonline Sounds like it won't be soon :)
On slide 23, the likelihood is missing a T-th root of the double product.
how does Christopher d manning papa think?
Great Lecture, will finish the entire series
I really liked this guy
I have to learn to listen to the professors like editors to your previous self
Reminds me of Sheldon for some reason
The objective function seeks to maximise the likelihood of the context word given the center word.
However, should it not also try to minimise the probability of incorrect context words given the center word?
I got the answer: the way the probabilities are calculated ensures this happens, via the denominator.
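Spelling that out (my own restatement of the point): because the probabilities come from a softmax over the whole vocabulary, the denominator already handles the incorrect context words,

```latex
-\log P(o\mid c) = -\,u_o^\top v_c + \log\sum_{w=1}^{V}\exp(u_w^\top v_c),
```

so minimizing the loss pushes u_o^T v_c up and pushes every other u_w^T v_c down; and since the probabilities sum to 1, raising P(o|c) necessarily lowers the mass on the incorrect words.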
I don't get what theta (the parameters) is here?
How can I get solutions for the assignments of this course? I'm looking for solutions for the Winter 2021 version.
github
very fruitful!!
How can I get the slides?
Where can I find a Chinese version? Chinese subtitles would also work.
bilibili
I loved it!!!
link of textbook?
I have little to no knowledge about machine learning... Can I still start this course? Is it beginner friendly?
Hi there, great question! If you are just beginning to learn about machine learning we recommend starting with this course: www.coursera.org/specializations/machine-learning-introduction
It seems this course is theory-based; where can I learn to code these concepts and algorithms?
coursera
27:51 - word2vec
Sir absolutely loved your explanation. Thank you very much
Is this course suitable for beginners?
So NICE!
Every time he says something important, the video stops. Great.
Watching at 1.5x speed smooths out the stuttering and is still understandable for the most part.
Great lecture.
34:49 What are u_o and v_c?
I think v_c is the vector representation of the center word and u_o is the vector representation of a context word.
thx for sharing
It is an amazing lecture
Thank you, great lecture!
lezz go !!
56:55
Thank you.
Great
"um"
35:31
13:55
Great knowledge it seems, but give this to an Indian youtuber, and he will make a 3 video series out of a single lecture that is easier to understand. #opinion
LOL TRUE
Day 1 .
14:10
What am I going to get out of this video? Let's see.
what is stopping you here?
What are the qualifications of this professor?
Christopher d manning papa
w0o0ord