The Biggest Misconception about Embeddings

  • Published May 7, 2023
  • The biggest misconception I had about embeddings!
    My Patreon : www.patreon.com/user?u=49277905
    Visuals Created Using Excalidraw:
    excalidraw.com/
    Icon References :
    Bird icons created by Mihimihi - Flaticon
    www.flaticon.com/free-icons/bird
    Whale icons created by Freepik - Flaticon
    www.flaticon.com/free-icons/w...
    Carrot icons created by Pixel perfect - Flaticon
    www.flaticon.com/free-icons/c...
    Kale icons created by Freepik - Flaticon
    www.flaticon.com/free-icons/kale
    Book icons created by Good Ware - Flaticon
    www.flaticon.com/free-icons/book
    Book icons created by Pixel perfect - Flaticon
    www.flaticon.com/free-icons/book
    Sparkles icons created by Aranagraphics - Flaticon
    www.flaticon.com/free-icons/s...
    Flower icons created by Freepik - Flaticon
    www.flaticon.com/free-icons/f...
    Feather icons created by Freepik - Flaticon
    www.flaticon.com/free-icons/f...
    Communication icons created by Freepik - Flaticon
    www.flaticon.com/free-icons/c...
    Student icons created by Freepik - Flaticon
    www.flaticon.com/free-icons/s...
    Lunch icons created by photo3idea_studio - Flaticon
    www.flaticon.com/free-icons/l...

COMMENTS • 46

  • @shoaibsh2872
    @shoaibsh2872 1 year ago +19

    It feels like the shorter your video is, the more informative it is 😅. You don't only explain what an embedding is but also how it can differ based on the problem statement, all in less than 5 minutes

    • @ritvikmath
      @ritvikmath 1 year ago +7

      Thanks! I’m trying to make shorter videos and learning that it can actually be more challenging than making a longer one

    • @johannestafelmaier616
      @johannestafelmaier616 1 year ago

      I'd say quality > quantity.
      Time is valuable, and that is probably one reason why shorter-form videos are becoming so successful.
      I'd also say that making shorter educational videos forces you to cut away everything that is not important, which should leave you with a clearer picture of the essence of the concept.

    • @xspydazx
      @xspydazx 10 months ago +1

      In reality: make a base model... highly tuned... and use this as your starting point for new models... preserve your base at all costs... online versions are often polluted...

  • @SierraSombrero
    @SierraSombrero 1 year ago +4

    I've never commented on any of your videos before but thought it was time to do so after this one.
    Thank you so much for all the great work!
    For me, you're the best at explaining data science and ML concepts on YouTube.
    I also love how broad your range of topics is. I've used your content to understand concepts in NLP and general data science, but also RL and Bayesian approaches to deep learning.
    Your real-life examples and intuitive explanations are really strong. Keep it up!

    • @ritvikmath
      @ritvikmath 1 year ago +2

      Hey I really really appreciate the kind words and would absolutely love more comments and feedback in the future

  • @adaoraenemuo4228
    @adaoraenemuo4228 1 year ago

    Love love your videos! Very clear with meaningful examples!

  • @polikalepotuaileva6006
    @polikalepotuaileva6006 4 days ago

    Excellent video. Thanks for taking the time to share.

  • @jfndfiunskj5299
    @jfndfiunskj5299 1 year ago +3

    Dude, your videos are so damn mind-opening.

  • @andreamorim6635
    @andreamorim6635 20 days ago

    Thanks for the explanation! Really easy to understand after watching this video!! keep up the good work

  • @baharrezaei5637
    @baharrezaei5637 5 months ago

    Best explanation of embeddings I have seen by far, thanks 🌻

  • @chupacadabra
    @chupacadabra 1 year ago +5

    There is also a misconception in the way you describe how embeddings are formed. It is not that words appearing in the same sentences are mapped to nearby embeddings, but rather words that share the same context, i.e. that appear separately with the same neighboring words.

    • @shachafporan8048
      @shachafporan8048 1 year ago

      Well yes, but actually no...
      Practically, in the common case, you are right, and this is how it is done in word2vec and other models as well: we build each word's embedding from its context.
      But if we take the message of the video and apply it here, you may also decide that this is how you want to define your word embeddings...
      I'm not sure what the benefits of this would be (it might be somehow reminiscent of LDA, for example), but we have the freedom to decide how we build our embeddings.

    • @chupacadabra
      @chupacadabra 1 year ago +1

      @@shachafporan8048 I agree that you can have different algorithmic flavors of how you derive embeddings, just as you can have different corpora you want to specialize the embeddings for (as pointed out in the video). Words can have different meanings/embeddings that are useful for different purposes. It's the old no-free-lunch theorem.
      However, in all these algorithms words end up with similar embeddings not because they co-occur, but because of the shared contexts they appear in. The words "Monday" and "Tuesday" rarely co-occur, but they end up with similar embeddings. This is true even for algorithms such as GloVe, which is based on co-occurrence but derives similar meanings/embeddings through the network effect. And it's not only word2vec; most transformers also use the same idea with masked word prediction.
      I love the video. It's just that co-occurrence is not at the heart of embeddings, and it's hard to understand other nice properties of embeddings later if you look at them that way.

    • @shachafporan8048
      @shachafporan8048 1 year ago

      @@chupacadabra nah man... I mean, we're not disagreeing... But my feeling was that his example of embeddings in a language model is just a specific manifestation of *embeddings*... If you embed an image, how would you relate it to a context? Also, if you embed words, e.g. in an explicit contrastive learning problem, you can have plenty of success without context.
      So again... in general, for language you are correct, but the world is too diverse to think of embeddings only in this specific manner
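
A quick illustration of the point debated in this thread: in a word2vec-style model, words that share contexts end up with similar embeddings even if they never co-occur. A minimal sketch, assuming gensim is installed (the toy corpus is invented for the example):

```python
# "monday" and "tuesday" never appear in the same sentence below, yet
# they end up with similar embeddings because they share contexts.
from gensim.models import Word2Vec

sentences = [
    ["the", "meeting", "is", "on", "monday", "morning"],
    ["the", "meeting", "is", "on", "tuesday", "morning"],
    ["i", "fly", "home", "on", "monday", "evening"],
    ["i", "fly", "home", "on", "tuesday", "evening"],
] * 100  # repeat the toy corpus so training has enough samples

model = Word2Vec(sentences, vector_size=16, window=3, min_count=1, sg=1, epochs=20)
print(model.wv.similarity("monday", "tuesday"))  # high despite zero co-occurrence
```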

  • @alexsischin2107
    @alexsischin2107 1 year ago +2

    Would love to see more about embeddings

    • @ritvikmath
      @ritvikmath 1 year ago +1

      Noted! Thanks for the feedback

  • @gordongoodwin6279
    @gordongoodwin6279 7 months ago

    This is a fantastic video. I found myself confused as to why NNs needed an embedding layer each time and why we didn't just import some universal embedding dictionary. This made it super simple! Parrots and carrots and kales and whales and cocks and rocks!

  • @MindLaboratory
    @MindLaboratory 1 year ago

    I'm working on embeddings for a very particular application inside a game. Lots of natural language, but also lots of game-specific language. I started by downloading GloVe, finding each word that appears both in my vocabulary and in GloVe, copying that vector into my model for the matching word, and using a random vector for words that do not appear in GloVe. Then I run an update function on a random sample of sentences each loop. Does this sound viable?
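
The initialization described here (pretrained GloVe rows where available, random vectors elsewhere) is a common pattern. A minimal sketch, where the GloVe filename and the vocabulary are placeholders:

```python
import numpy as np

EMBED_DIM = 100
vocab = ["attack", "mana", "parrot", "hello"]  # mixed game-specific + natural-language terms

# Parse a GloVe text file: each line is a word followed by EMBED_DIM floats.
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

rng = np.random.default_rng(0)
embeddings = np.stack([
    glove.get(w, rng.normal(scale=0.6, size=EMBED_DIM).astype(np.float32))
    for w in vocab
])  # row i initializes the vector for vocab[i]; fine-tune from here
```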

  • @Canna_Science_and_Technology
    @Canna_Science_and_Technology 6 months ago +1

    In a RAG-based Q&A system, the efficiency of query processing and the quality of the results are paramount. One key challenge is the system’s ability to handle vague or context-lacking user queries, which often leads to inaccurate results. To address this, we’ve implemented a fine-tuned LLM to reformat and enrich user queries with contextual information, ensuring more relevant results from the vector database. However, this adds complexity, latency, and cost, especially in systems without high-end GPUs.
    Improving algorithmic efficiency is crucial. Integrating techniques like LoRA into the LLM can streamline the process, allowing it to handle both context-aware query reformulation and vector searches. This could significantly reduce the need for separate embedding models, enhancing system responsiveness and user experience.
    Also, incorporating a feedback mechanism for continuous learning is vital. This would enable the system to adapt and improve over time based on user interactions, leading to progressively more accurate and reliable results. Such a system not only becomes more efficient but also becomes more attuned to the evolving needs and patterns of its users.
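
A schematic, self-contained sketch of the pipeline described in this comment; every function here is a hypothetical stand-in, not a specific library's API (a real system would plug in an actual LLM client, embedding model, and vector database):

```python
from typing import List

def llm_rewrite(query: str) -> str:
    # Stand-in for the fine-tuned LLM that enriches vague queries with context.
    return f"In the context of our product documentation: {query}"

def embed(text: str) -> List[float]:
    # Stand-in for the embedding model used by the vector database.
    return [float(len(w)) for w in text.split()][:8]

def vector_search(vec: List[float], top_k: int = 5) -> List[str]:
    # Stand-in for the vector-database similarity search.
    return ["retrieved snippet 1", "retrieved snippet 2"][:top_k]

def answer(query: str) -> str:
    enriched = llm_rewrite(query)              # 1. context-aware query reformulation
    docs = vector_search(embed(enriched))      # 2. retrieval from the vector store
    return f"(LLM answer grounded in {docs})"  # 3. stand-in for generation

print(answer("why is it slow?"))
```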

  • @randoff7916
    @randoff7916 1 year ago +3

    When the sample size is large, do the embeddings for individual words start to converge?

  • @jordiaguilar3640
    @jordiaguilar3640 11 months ago

    Great teaching.

  • @turkial-harbi2919
    @turkial-harbi2919 1 year ago

    Simply beautiful 🙏

  • @zeroheisenburg3480
    @zeroheisenburg3480 1 year ago +2

    One thing I don't understand is why these embeddings, learned through deep learning with non-linearities in between, can be compared using linear metrics such as the most commonly used cosine similarity. I can't find a good discussion anywhere.

    • @SierraSombrero
      @SierraSombrero 1 year ago +2

      Deep learning models are trained using non-linearities to capture non-linear relationships in the data. Hence, the function (= model architecture) you use to learn the embeddings has non-linearities.
      When we train a deep learning model to obtain an embedding, most of the time we have an embedding layer as the first layer in our model. We then train the model using a specific objective (goal) that is suitable for obtaining word embeddings. After having trained the model, we just take the embedding layer out of the full model and discard the rest. You can imagine the embedding layer as a matrix of size (vocab_size x embedding_dimension). That means each word/token in our vocabulary is represented by a vector with as many numbers as the embedding dimension. The matrix (embedding layer) itself has no non-linearities; it's just a matrix. Therefore, the vectors that represent the tokens can be compared with each other using linear metrics, as you said above.
      Hope it helps :)
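
A minimal PyTorch sketch of the idea above: after training, the embedding layer is just a matrix, and its rows can be compared with a linear metric (the values here are untrained and random, purely for illustration):

```python
import torch
import torch.nn.functional as F

vocab_size, embed_dim = 1000, 64
embedding = torch.nn.Embedding(vocab_size, embed_dim)  # the (vocab_size x embed_dim) matrix

word_a, word_b = 42, 137          # integer token ids
vec_a = embedding.weight[word_a]  # one row of the matrix
vec_b = embedding.weight[word_b]
print(F.cosine_similarity(vec_a, vec_b, dim=0))  # plain linear comparison of two rows
```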

    • @zeroheisenburg3480
      @zeroheisenburg3480 1 year ago

      @@SierraSombrero Appreciate the response. But I think there are some critical issues lingering.
      1. The input is a matrix. It goes through linear -> non-linear -> linear transformations. Back-propagation has to go through the same steps when updating the embedding layer's weights, so it carries non-linear information over to the embedding layer, thus breaking the linear properties, right?
      2. By "the matrix (embedding layer) itself has no non-linearities", do you mean I can extract any weights before the activation unit in a neuron and use them as embeddings?

    • @SierraSombrero
      @SierraSombrero 1 year ago +1

      ​@@zeroheisenburg3480
      I'll try to answer as best as I can. I'm not sure I'll be able to answer question 1 satisfactorily, though :)
      I'll start with question 2 because I can explain it better.
      2. An embedding layer is not the same as a linear layer. It does not represent neurons and does not output activations (but rather representations).
      In a linear layer you have an input x that you multiply with the weight w, and then you add a bias b. (I don't know of any case where weights have been used as embeddings.)
      An embedding layer can usually only be the first layer in a network. You don't multiply an input x with a weight w here.
      Instead, you have a number of input classes in the form of integers (that represent e.g. words) that you can feed your model (the number of integers is your vocab size). Each of these input integers is mapped to one row of your embedding layer (vocab_size x embed_dim). You can imagine it like a table where you look up which embedding belongs to which word.
      Once you have looked up the embedding for your current word, you use it as input to the next layer in your model.
      Now, before you have trained your model, the embedding is random, and the embedding layer is updated during training using backprop just like every other layer (though differently, because it is a different mathematical operation than a linear layer).
      After training the model, the embedding layer has been changed so that every one of your input words now has a meaningful representation in the embedding space (if your training was successful).
      Now you can take the lookup table (embedding layer) out of your model, feed it a word, and it will give you the meaningful embedding belonging to that word.
      I suggest you check out the difference between the Linear and Embedding layers in PyTorch :)
      Make sure to understand what kinds of inputs you feed them and what you get as outputs.
      pytorch.org/docs/stable/generated/torch.nn.Linear.html
      pytorch.org/docs/stable/generated/torch.nn.Embedding.html
      Maybe also try to find a good explanation of how the first static embeddings were trained (CBOW, skip-gram).
      I think this should give you the intuition.
      1. It's true that non-linear operations also take place during training via backpropagation.
      However, since you're discarding all non-linear parts of the model and only keeping the embedding layer, it is definitely possible in practice to apply linear operations to the embeddings.
      If there are theoretical mathematical issues lingering in the background, then I'm certainly the wrong person to answer your question.
      But since it works so well in practice, I would personally not worry too much about it :)
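
A small sketch contrasting the two PyTorch layers linked above: nn.Linear multiplies its input by a weight matrix, while nn.Embedding only looks up rows of its weight table by integer index:

```python
import torch

linear = torch.nn.Linear(in_features=4, out_features=3)
x = torch.randn(4)
print(linear(x))             # x @ W.T + b: an actual matrix multiplication

embed = torch.nn.Embedding(num_embeddings=10, embedding_dim=3)
ids = torch.tensor([7, 2])   # token ids, not feature vectors
print(embed(ids))            # rows 7 and 2 of the weight table
print(torch.equal(embed(ids)[0], embed.weight[7]))  # True: a pure lookup
```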

  • @BlayneOliver
    @BlayneOliver 3 months ago

    How do you introduce categorical embeddings into a seq2seq model which works on sequence input_layers?

  • @cirostrizzi3760
    @cirostrizzi3760 1 year ago

    Great video, very informative and clear. Can someone tell me the names of some modern embedding models (e.g. OpenAI's) and maybe give me some sources to search and understand more about them?

  • @garyboy7135
    @garyboy7135 9 months ago

    Maybe some topics around word2vec and other popular embedding methods. And how embeddings can extend beyond text.

  • @Tonkuz
    @Tonkuz 2 months ago

    What happens to the embeddings created for one LLM if I change the LLM?

    • @mojekonto9287
      @mojekonto9287 1 month ago

      Nothing. At least in the context of a RAG system, where you use the embeddings to search through a vector database to retrieve context for the LLM.

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 1 year ago

    Are there other options besides embeddings?

  • @Septumsempra8818
    @Septumsempra8818 1 year ago

    Time series embeddings? And encoders?

  • @xspydazx
    @xspydazx 10 months ago

    Really, if you train your embedding model with entity lists and topic-themed sentences, i.e. highly classified, entity-rich data paired with its associated topic or entity, then you will build the right model. This model should form your base model, and when performing tasks you "fine-tune the model" on the customized corpus so that it also updates the vocabulary from your new corpus, reassigning the terms closer together. To optimize, it would be necessary to retrain for a set of epochs (without overfitting the new data), since the pretrained model contains the data you want underneath, while the new model is polluted toward the new data corpus... hence keeping a base model unchanged gives your projects a jumpstart. Tune these models with new entity lists and topic lists, updating the new knowledge in the model, and even clean and prune the vocabulary of unwanted stop words, offensive words, and misassigned words. So a base model is your starting point: if you train a fresh model on the corpus, it will produce the results the video shows; it will essentially not be fit for purpose except the purpose it was trained for.
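
A minimal gensim sketch of the workflow described above: keep the saved base model untouched on disk, fine-tune a loaded copy on the new corpus, and update the vocabulary with the new terms (the filenames and corpus are placeholders):

```python
from gensim.models import Word2Vec

new_corpus = [["mana", "potion", "restores", "health"]] * 50  # placeholder corpus

tuned = Word2Vec.load("base_model.w2v")     # the file on disk remains the pristine base
tuned.build_vocab(new_corpus, update=True)  # add new terms to the existing vocabulary
tuned.train(new_corpus, total_examples=len(new_corpus), epochs=5)
tuned.save("tuned_model.w2v")               # save separately; never overwrite the base
```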

  • @EeshanMishra
    @EeshanMishra 1 year ago

    Google sends me to you when I am working on Llama embeddings :)

  • @micahdelaurentis6551
    @micahdelaurentis6551 1 year ago

    that was a fantastic example

  • @value_functions
    @value_functions 1 year ago +3

    I would like to point out an important distinction: The *concepts* described by the symbols in context of other symbols can have vastly different embeddings. The *symbols* themselves however need absolute/fixed embeddings. If you use multiple symbols in a sequence, like words in a sentence, you can use all the other symbols in order to give each other context.
    So the raw input embeddings are always the same. In that case, I would argue that the initial "common misconception" is actually accurate.
    Using a model like a transformer allows you to input a sequence of (fixed) symbol-embeddings and end up with contextualized embeddings in place of those symbols. The transformer then iteratively applies *transformations* on those embedding vectors depending on the *context* .
    The symbol "parrot" always starts as the same fixed embedding vector, no matter in which context it appears. But depending on the context, the repeated transformations done by the transformer eventually *map* that vector to another vector close to "parrot" if the context is a poem, or yet another vector close to "kale" if the context is a cooking recipe.
    This is why word2vec back then just was not enough. It only computed something similar to those input embeddings and then stopped there without doing those transformations.
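
This distinction can be checked directly. A sketch assuming the Hugging Face transformers library and a small BERT checkpoint, using "bank" (reliably a single token in the BERT vocabulary) in place of the parrot/kale example:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_states(sentence: str, word: str):
    enc = tok(sentence, return_tensors="pt")
    pos = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    with torch.no_grad():
        static = model.get_input_embeddings()(enc.input_ids)[0, pos]  # fixed lookup
        contextual = model(**enc).last_hidden_state[0, pos]           # context-dependent
    return static, contextual

s1, c1 = token_states("i sat on the river bank", "bank")
s2, c2 = token_states("i deposited cash at the bank", "bank")
print(torch.allclose(s1, s2))  # True: the raw input embedding is always the same
print(torch.allclose(c1, c2))  # False: the transformed, contextual vectors differ
```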

  • @meguellatiyounes8659
    @meguellatiyounes8659 1 year ago

    Embeddings are good tools for statistical discovery;
    they capture the statistical structure of how information is organized

  • @anishbhanushali
    @anishbhanushali 1 year ago

    In short: context matters