Making a Language Model From 0 (You Can Run It Too!)

  • Published 19 Dec 2024

COMMENTS • 143

  • @8AAFFF
    @8AAFFF  8 days ago +10

    Go to piavpn.com/8AAFFF to get 83% off Private
    Internet Access with 4 months free (and support me :D)!
    thanks for watching!
    also discord server now: discord.gg/MC4wTeb4

    • @thefcraft8763
      @thefcraft8763 6 days ago

      It's nice, but I think your architecture has some flaws. Suppose the text is "This is a ...", and there are several possible next-word predictions like "dog", "cow", "mountain". "Dog" and "cow" are nearby in the vocab embedding space, but "mountain" might be far away; if you train your model on such cases it will average out the result and might produce nonsense or hallucinate (basically it might output the midpoint vector of cow, dog, and mountain).
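
      (For illustration of the averaging issue above: a minimal sketch, assuming the network is trained with a mean-squared-error loss against next-word embeddings y_1, ..., y_k that all follow the same context x. The optimum is their mean, which need not lie near any real word vector:)

          \hat{y}(x) = \arg\min_{v} \sum_{i=1}^{k} \lVert v - y_i \rVert^2 = \frac{1}{k} \sum_{i=1}^{k} y_i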

    • @Myexpectationsarerealistic
      @Myexpectationsarerealistic 5 days ago

      I did something similar. Not touching Rust.

    • @AndroidFerret
      @AndroidFerret 5 days ago

      The production and information value of this video is insane. How long did it take you to edit this?? Fantastic

  • @scoffpickle9655
    @scoffpickle9655 8 days ago +76

    The reason the 160k-batch REAN was worse with the graphics card prompt is that the network is overfitting. I'd recommend using a test set with some prompts and choosing the model that performs best on that test set, instead of just running it with high batch counts.

    • @8AAFFF
      @8AAFFF  8 days ago +17

      you're right, it's most likely overfitted. the weird thing is that most other test prompts i was running were generally getting better with more batches, so idk

    • @scoffpickle9655
      @scoffpickle9655 8 days ago +5

      @8AAFFF It sounds like a data problem, then: too little or not-general-enough data would lead to worse curve fitting. I suppose there wasn't much data about graphics cards, so it freaked tf out and kept spamming "graphics"

    • @8AAFFF
      @8AAFFF  8 days ago +6

      maybe. it's also possible that the graphics card knowledge just got overshadowed because it was at the beginning of the dataset. i did some more tests today and basically it just seems to have some knowledge points that it tries to stick to no matter what the prompt is

    • @PaulanerStudios
      @PaulanerStudios 6 days ago +3

      @8AAFFF Are you using any sort of speculative decoding or temperature scaling? That wasn't mentioned in the video and does make quite a difference.

    • @NoSubsWithContent
      @NoSubsWithContent 6 days ago

      @@8AAFFF what if you used an existing super-efficient model, like the Granite MoE with 400M active parameters, to comb through a different dataset like FineWeb-Edu and produce a list of knowledge it could access during training via RAG or something?
      if you figure out a way to do that, I feel like it'd get much better performance, because it wouldn't have to spend so much of its weights on memorizing stuff; instead it could learn actual patterns, maybe even intelligence?

  • @mrpro7737
    @mrpro7737 6 days ago +23

    The editing skills in this video are harder than that new architecture 😂

  • @IceMetalPunk
    @IceMetalPunk 5 days ago +18

    I mean... your network only predicts the most likely next token, whereas GPT models predict the probability of all tokens and sample from there (they don't just choose the highest-probability token); and your tokens are just entire words from the corpus. So it's like a GPT model that (a) always has a temperature of 0, and (b) can't understand anything that's not a word present in the corpus. I think we can see from that why GPT and its similar models didn't go this route 😅
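
    (For illustration, a minimal sketch of the contrast described above, assuming a GPT-style head that outputs a vector of logits over the vocabulary; the function name is hypothetical:)

        import numpy as np

        def sample_next_token(logits: np.ndarray, temperature: float = 1.0) -> int:
            """GPT-style decoding: softmax over the whole vocabulary, then sample."""
            if temperature == 0.0:
                # Degenerate case: greedy argmax, analogous to always taking the single closest word.
                return int(np.argmax(logits))
            scaled = logits / temperature
            probs = np.exp(scaled - scaled.max())   # numerically stable softmax
            probs /= probs.sum()
            return int(np.random.choice(len(probs), p=probs))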

    • @jairjuliocc
      @jairjuliocc 5 days ago +2

      With respect to temperature: it's possible to find the k most similar neighbouring vectors and assign each a probability based on its similarity score. In this way you can mimic temperature.
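
      (For illustration, a rough sketch of that idea, assuming the predicted embedding and the word-vector table are plain numpy arrays; all names are hypothetical:)

          import numpy as np

          def sample_near_prediction(pred, word_vecs, words, k=10, temperature=1.0):
              """Pick among the k words closest to the predicted vector, weighted by cosine similarity."""
              sims = word_vecs @ pred / (np.linalg.norm(word_vecs, axis=1) * np.linalg.norm(pred) + 1e-9)
              top = np.argsort(-sims)[:k]                  # indices of the k most similar words
              weights = np.exp(sims[top] / temperature)    # softmax over similarity scores
              weights /= weights.sum()
              return words[np.random.choice(top, p=weights)]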

    • @IceMetalPunk
      @IceMetalPunk 4 days ago +7

      @jairjuliocc Yet that would result in a very different probability space. It'll be "the probability of words that are most related to the most likely next word" instead of "the probability of words that are likely to be the next word".

    • @pacoalsal
      @pacoalsal 3 days ago +1

      @jairjuliocc worth remembering that these embeddings can’t differentiate between senses of the same word. So “fly” the insect and “fly” the verb share the same point in the embedding space, e.g. somewhere in between an “animals” cluster and a “forms of locomotion” cluster. Sampling as you say, you’d get a mix of words closely related in one sense or another but you can’t distinguish which sense is relevant to the current context.

    • @IceMetalPunk
      @IceMetalPunk 3 days ago

      @@pacoalsal Well, no, not quite. That's what the attention heads are for: they push and pull the embedding away from its initial position -- for instance, that midpoint between "locomotion" and "animals" -- based on the context, therefore separating the resulting vectors for each meaning. So the resulting vector would definitely encode the context-specific meaning. The problem here isn't that it fails to disambiguate context; it's just that the probability space would be based on the one most likely output rather than all the different likely outputs.

  • @gilbertenevoldsen4469
    @gilbertenevoldsen4469 6 days ago +18

    This video is super well made and informative! But I'm a bit curious why you chose the architecture that you did. The reason this way of outputting words isn't typically used in large language models is that it's useful for the model to have multiple high-probability candidates for the next word that aren't necessarily close to each other in vector space.
    For example, let's say a sentence comes up in training like "My favorite hobby is...". There are a lot of possibilities for the next word, so the model would be optimised to output the average vector of those possible answers, which likely isn't a sensible continuation of the sentence.
    I would love to see what you could make by doing it the traditional way, and how good a model you can train as a single person.

    • @simonwillover4175
      @simonwillover4175 6 days ago +1

      or um maybe reward it for simply choosing any word close to any option rather than the average?

    • @WoolyCow
      @WoolyCow 5 days ago

      @@simonwillover4175 could just be weighted by distance as well, or even add in some error on purpose to get some more divergent responses

  • @jondoe6608
    @jondoe6608 6 days ago +11

    Out of curiosity, are you aware of the RWKV architecture? It's an LLM based on a type of RNN; its main advantage is removing the hard context limit, making longer contexts possible on weaker devices because it uses a constant amount of memory. Your idea of using embeddings as the input and output is really cool, especially since it further reduces VRAM requirements.

  • @salad_txt
    @salad_txt 8 days ago +30

    You are so underrated it is actually insane, keep it up dude. Great stuff.

  • @zaj007
    @zaj007 8 days ago +27

    18:25 Bro there has gyat to be a better way! I'm crying 😭😭 wtf is that timeline 💀💀

    • @8AAFFF
      @8AAFFF  8 days ago +21

      bro did the tower of babel editing technique ahh

  • @toofardoug2188
    @toofardoug2188 6 days ago +1

    This is so high quality it's nuts! The editing is excellent. The explanations are crisp. The relative context to the SOTA for each design choice is excellent. The origin and then evolution of concepts is extremely valuable, such as the beginning of tokenization and how it becomes embeddings.

  • @slowpoke101
    @slowpoke101 8 days ago +6

    Great video, these longer videos are always nice to see. Thank you for open-sourcing the code.

  • @OscarTheStrategist
    @OscarTheStrategist 4 days ago

    Excellent video. Thank you for explaining everything in such detail, despite the setbacks. Would love to see more of this architecture being refined if you deem it worthy of continued development.

  • @lionlight9514
    @lionlight9514 8 days ago +3

    This is so cool man! Please, keep going.

  • @jaythecoderx4623
    @jaythecoderx4623 6 days ago +5

    This should have millions of views what the hell this is epic, very well edited too

  • @newtral6303
    @newtral6303 20 hours ago +1

    Your implementation is basically learning the expectation over the whole corpus, whereas GPT-type models learn the whole probability distribution.
    Since they capture the whole distribution instead of only the expectation, their outputs are much richer.
    Here, even if you trained it for a lot longer, I don't think you would get any better results, because the model will collapse to the expectation over the corpus (for the next token).
    Sampling from a distribution and always using the mean/expectation are two very different things.
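
    (For illustration, the textbook result behind this point, assuming the model is fit with a squared-error loss on next-token embeddings: the loss-minimizing predictor is the conditional expectation,)

        \arg\min_{f}\ \mathbb{E}\left[\lVert f(X) - Y \rVert^{2}\right] \;\Rightarrow\; f^{*}(x) = \mathbb{E}[Y \mid X = x]

    (whereas a softmax head trained with cross-entropy models the full conditional distribution p(y | x), which can then be sampled.)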

  • @v0idbyt3
    @v0idbyt3 6 days ago +1

    damn you made davinci resolve go kaboom at the end
    btw cool video! i hope this architecture eventually gets a remake or succeeds, because this could be a way better alternative to GPT architecture.

  • @aamindehkordi
    @aamindehkordi 6 days ago +1

    Insane time spent and crazy W video. don't worry about compression or pacing this is gas and should blow up soon

  • @hasirciogli
    @hasirciogli 4 days ago

    MAN YOUR ANIMATIONS ARE SO PURFECT

  • @brams06
    @brams06 6 days ago +3

    I was shocked to see that this video has so few views. I feel so lucky to have come across this gem.

  • @Aragamin
    @Aragamin 6 days ago

    Dude, this is wonderful work.
    It's great to see that enthusiasm sometimes turns not just into a hobby but into serious development :)
    Keep going!
    P.S.: the video design is really cool, top marks.

  • @ClayShoaf
    @ClayShoaf 5 days ago

    Switching the end from a token list to a vector output is pretty genius. This will probably make some non-natural language things (like coding and markdown formatting) more spotty, but for keeping the final layer small, it seems like it's worth a shot.

  • @WoolyCow
    @WoolyCow 5 days ago +1

    was 784 a reference to mnist? loved the vid, well explained and beautifully edited :D dropped a sub

    • @8AAFFF
      @8AAFFF  5 days ago

      Nice XD someone saw it
      Thx for the sub :)

  • @rkpstam
    @rkpstam 5 days ago +2

    Good work, Oleg

  • @Pratikology
    @Pratikology 6 days ago +1

    Wtf, why isn't this at a million views? Keep it up bro, what a fantastic blend of knowledge and creativity 🔥

  • @kotway2
    @kotway2 8 days ago +1

    Very cool video and project man!

  • @TeamDman
    @TeamDman 6 days ago +1

    Your animations are awesome :o

  • @mrcavas
    @mrcavas 5 days ago

    Such a fire video! I love the style

  • @sandded7962
    @sandded7962 6 days ago +2

    Hi, can you elaborate on the part at 12:43 where the circular text says the following:
    “I’m edging, I’m edging, I’m edging, I’m edging”

  • @PseudoProphet
    @PseudoProphet 6 days ago +2

    Wow, it could work 😮😮
    You just need a better and more complete dataset.
    You should have also tried asking it questions that you knew were in its training data, to see its performance.

  • @Quozul
    @Quozul 5 days ago

    This is an amazing project! And I love the graphics and visuals of the video too!

  • @bedahtpro
    @bedahtpro 6 days ago

    Great quality video man!

  • @4.0.4
    @4.0.4 3 days ago

    The problem with efficiency optimizations like RWKV, Mamba, BitNet etc is that the ones making huge models are reluctant (understandably) to train a 7-70B on them.

  • @TheLucaz37
    @TheLucaz37 4 days ago

    This is amazing... I've always been fascinated by how AIs work

  • @A_Me_Amy
    @A_Me_Amy 5 days ago

    I like your ad, or rather your general artistic style. Also, for the model, I think the idea of the vocabulary space makes sense. There is research from Meta that came out today about LCMs as opposed to LLMs that could pair with this fairly well: it works on small sentences with limited tokens, and I could imagine translating any sentence into the 768 vocab space, or something like that... I'm not technically aware enough to contribute more than to say this. Perhaps a word2word2vec2word2word process, so that it can still speak and understand the full vocab list but processes the core essence in the smaller architecture. I do think figuring this out is the sort of future direction where a lot is possible.
    Oh, and the same person who covered this paper today also talked about other research from Princeton about slow and shorter training leading to more in-context learning (ICL), and how at some point while training the weights the model loses that ability. But yeah, the most capable reasoning model at the lowest possible size is the new extension of Moore's law: AI gets twice as smart every 2 years and half as large. I am quite sure this will be the trend, to be honest.

    • @w花b
      @w花b 5 days ago

      It's nice to have something that's not manim for once

  • @lewisbeith
    @lewisbeith 5 days ago

    Very good editing!

  • @devbuffer0112
    @devbuffer0112 6 days ago

    Creel style visuals, cool bgm and hot topics related to CS. You're gonna become huge

  • @ainet8415
    @ainet8415 3 days ago

    Instead of training a REAN model from scratch, try taking a model like Llama 3.2 1B and adding your idea (REAN) as a layer, then train just that layer. Basically, fine-tune it and use a high-quality dataset.

  • @vassa-vn7xz
    @vassa-vn7xz 6 days ago +3

    How is this even working? Why does it not collapse to a single embedding for all words?

  • @DallasMcMillan
    @DallasMcMillan 6 days ago

    Incredible project and so many insights into ai in a fun and digestible way!
    Thanks !!!! ❤❤❤

  • @juliansantos1900
    @juliansantos1900 5 days ago

    Crazy work, not to mention crazier animation. I know the concepts behind these AIs but I don't have the extensive knowledge to write it on my own without LM libs 😆

  • @that_guy1211
    @that_guy1211 4 days ago

    I remember watching a channel explain how ChatGPT became wicked because somebody multiplied a variable by -1. In that video they described a syntax teacher and a reason teacher: the AI gets points for using "good" words (i.e. being censored and not using "bad" words), and the syntax teacher gives a punishment if the grammar is off or if it repeats words too much. That's how ChatGPT got so good. Maybe you could implement something similar, but with the reason teacher being something other than a censor filter? IDK

  • @driss1227
    @driss1227 5 days ago

    The graphics are so great. Curious what you used to produce this video? Looks like expertly used manim.

  • @foreignconta
    @foreignconta 5 days ago

    It's just a transformer which uses pre-learnt embeddings from word2vec. The attention is still quadratic.
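
    (For illustration, a minimal PyTorch sketch of that setup: a frozen pre-learnt embedding table feeding a standard transformer block. The sizes and names are assumptions, not the video's exact code:)

        import torch
        import torch.nn as nn

        w2v_weights = torch.randn(340_000, 784)  # stand-in for the pre-learnt word2vec table

        embed = nn.Embedding.from_pretrained(w2v_weights, freeze=True)  # embeddings are not trained further
        block = nn.TransformerEncoderLayer(d_model=784, nhead=8, batch_first=True)

        tokens = torch.randint(0, 340_000, (1, 128))  # a batch with one 128-token sequence
        hidden = block(embed(tokens))                 # self-attention here is still O(n^2) in sequence length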

  • @hypercoder-gaming
    @hypercoder-gaming 6 days ago

    With more training, this could definitely be very powerful.

  • @AverusMuto
    @AverusMuto 5 days ago

    This is very useful. Thank you.

  • @alisyoung2741
    @alisyoung2741 6 days ago

    I have been working on one as well but have run into some issues! So exciting!

    • @8AAFFF
      @8AAFFF  6 days ago +1

      yooo gl man
      are you doing like a custom architecture?

    • @alisyoung2741
      @alisyoung2741 5 days ago

      Yes! I customized the standard U-Net architecture by rebuilding the bridge to process input using a Bi-LSTM, a memory system, and an attention mechanism before re-upsampling.

    • @alisyoung2741
      @alisyoung2741 5 days ago

      Your video actually inspired me to try to work on a kind of single-token tokenizer that will produce a single unique token for any given input of a certain size, hopefully really large.

  • @AllExistence
    @AllExistence 7 days ago +7

    You seem to have gone a weird route with training. Normally, networks are trained on plain text first, to learn normal language. Then they are fine-tuned with "human/assistant" data to actually answer questions instead of talking to themselves.

    • @8AAFFF
      @8AAFFF  6 days ago +2

      yeah that's true
      it's just that the higher-quality human/assistant dataset was so big that i didn't need to first train on raw text

  • @VioletGiraffe
    @VioletGiraffe 8 days ago

    Even your animations are cool, how did you make them? Or do you have another neural net to do that for you? :)

    • @8AAFFF
      @8AAFFF  8 days ago +1

      thanks :), basically with just images / clips in davinci resolve.
      I put the almost final timeline at the end 18:26

  • @kamertonaudiophileplayer847
    @kamertonaudiophileplayer847 5 days ago

    You need to patent your approach. It's very interesting, although I use a slightly modified one.

  • @xorcise
    @xorcise 5 days ago

    8:20 ah yes
    good birchrunville Cassiano

  • @Moshi74533
    @Moshi74533 6 days ago

    sick bro, absolutely sick

  • @toofardoug2188
    @toofardoug2188 6 days ago

    I wonder if there's a better sampling mechanism when you're using the word2vec model? If you watch Andrej Karpathy's GPT-from-scratch video, you'll see that he doesn't just take the highest-predicted token; he sometimes samples from the top few values.
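
    (For illustration, a small sketch of that kind of top-k sampling, assuming a vector of scores for the next token; in the word2vec setup the scores could be similarities instead of logits:)

        import numpy as np

        def top_k_sample(scores: np.ndarray, k: int = 3) -> int:
            """Keep only the k highest-scoring tokens and sample among them."""
            top = np.argsort(-scores)[:k]
            probs = np.exp(scores[top] - scores[top].max())
            probs /= probs.sum()
            return int(np.random.choice(top, p=probs))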

  • @60pluscrazy
    @60pluscrazy 6 days ago

    Amazing..how did you animate 👌🎉🎉🎉

    • @8AAFFF
      @8AAFFF  6 days ago

      thanks :) all the animations are pretty much fully made up of davinci resolve images and clips and stuff
      i put the timeline at 18:26 if you want to see

  • @MrNootka
    @MrNootka 6 days ago

    Hello! Nice video,
    In the "Final word2vec Results" section, i.e. at 11:14 and 11:28, you had a space inside the argument of similar_by_word in one and not in the other... I wonder if the space changes the results

    • @v0idbyt3
      @v0idbyt3 6 days ago

      a space in the code would make the compiler or interpreter think that it's something else, so it would make an error (which is a difference)

    • @8AAFFF
      @8AAFFF  6 days ago

      thanks :)
      also well done for noticing, the space does change the results because it's a slightly different token in the word2vec (but they are really close to each other). i don't know why it's there, it's probably an accident, but if you're curious this is the output for "2021" with no space:
      [('2020', 0.7180283069610596),
      ('2022', 0.6416824460029602),
      ('2021 ', 0.6249533295631409),
      ('2019', 0.6035624742507935),
      ('October ', 0.5840676426887512),
      ('october ', 0.5773099660873413),
      ('January ', 0.5399696230888367),
      ('2020 ', 0.5389090776443481),
      ('2018', 0.5194795727729797),
      ('July ', 0.5182425379753113)]
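
      (For illustration, a hedged sketch of how a list like the one above can be produced with gensim's KeyedVectors; the file name is hypothetical:)

          from gensim.models import KeyedVectors

          wv = KeyedVectors.load("word2vec_vocab.kv")   # hypothetical path to the trained vectors
          print(wv.similar_by_word("2021", topn=10))    # e.g. [('2020', 0.718...), ('2022', 0.641...), ...]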

    • @MrNootka
      @MrNootka 6 days ago

      @@8AAFFF Thanks for the reply, I mainly asked because of your "tokenization" approach. Anyway, I believe what you have cooked here has some serious potential! When I found your channel yesterday I binge-watched most of your videos, and this one and the dimensions simulator are my top favorites 😆. I am working on something similar, keep up the good work!

  • @Kwenen
    @Kwenen 5 days ago

    4:00 It's intuitive to do it this way, but I'm surprised that big companies still choose token outputs over regression

    • @ЕгорКолов-ч5с
      @ЕгорКолов-ч5с 5 days ago

      Because you need the model to be able to produce different outputs. For example if you have "Cat sits on a tree" and "Cat sits on a sofa" in the training data, this trained model will always predict (tree + sofa) / 2 when given "Cat sits on a" as a prompt, and there is no remedy for this issue

    • @Kwenen
      @Kwenen 5 days ago

      @@ЕгорКолов-ч5с
      I don't think it matters, because what we usually want is an output, and we don't care about the exact value of a specific token (for example, 0~1 for emotion recognition). Current models also run into situations where two tokens are both at 0.5, and that's handled by throwing a weighted die when an output is needed.
      The vector used in the video, (tree + sofa) / 2, also shows that this sentence can be followed by two words.
      So I think the model can still learn the usage of language very well: when computing similarity with the output, if both are at 0.5, just roll a die and everything is fine.
      I guess that in the video the maximum value is always chosen, whereas there should be a chance of outputting other words when the probabilities are about half and half. It's like using a Markov chain, but letting the maximum value determine the transition.
      :)

    • @ЕгорКолов-ч5с
      @ЕгорКолов-ч5с 5 days ago

      @@Kwenen I don't really understand what you are referring to. I'm just relaying a simple fact: the output of this model is a 784-value embedding, which corresponds to a vector in word2vec space and is not as powerful as a probability distribution over tokens. Generating the next word is just taking the closest word in word2vec space to the generated embedding. Because of the way word2vec works, 1) embeddings of contextually close words will be closer in word2vec space, so even if you randomly choose 1 of the 10 closest words to the embedding you will get synonyms at best and gibberish at worst, and 2) because word2vec doesn't care about word order, a model trying to predict the next token will always produce chains of the same words over and over. The main reason nobody uses this method is that it fundamentally doesn't work.

    • @Kwenen
      @Kwenen 5 days ago

      @@ЕгорКолов-ч5с
      If the correction word2vec brings to the model isn't strong enough and training is too slow, that would really make me give up on this path.
      Maybe what I said was a bit out of focus. I mean, we only need the language model's output to carry the right meaning, even if it's a synonym. So I thought that outputting a vector (a fuzzy meaning) and then selecting from similar words should be enough to support communication, and that kind of model might put a lighter load on the graphics card.
      Of course, if the model gives meaningless vectors, then choosing among them will only produce gibberish, which would be very sad.
      And I naively thought that the positional encoding of the input and self-attention were enough to make the output position-sensitive.
      So the idea of predicting the next token as a vector doesn't really work? It just feels intuitive to me: it's easy to imagine a sequence of arrows in hyperspace, pointing to multiple different words in sequence.
      As you point out, this seems to be inefficient at the learning level. After all, words are too discrete. Even if the output layer is easier, it doesn't mean the overall problem has become easier, right?

    • @ЕгорКолов-ч5с
      @ЕгорКолов-ч5с 5 days ago

      @@Kwenen Maybe this could work, but there is no such thing as a free lunch in ML. For this approach to be viable, you would probably need to operate in an embedding space as large as a common vocabulary (50,000+ tokens for GPT-3) instead of 784 dimensions; then there would probably be enough redundancy to make it competitive with next-token prediction, but then the approach loses its memory efficiency (and training such a big word2vec model also becomes too hard).

  • @KristoferPettersson
    @KristoferPettersson 6 days ago

    Aren't you using any sampling when picking the next token?

    • @ЕгорКолов-ч5с
      @ЕгорКолов-ч5с 5 days ago

      His architecture doesn't output a probability distribution, so where is he supposed to sample from?

  • @yeyito7040
    @yeyito7040 4 days ago

    2:51 MNIST MENTIONED!!!!!!!!

  • @mtlr3803
    @mtlr3803 6 days ago

    crazy video!!

  • @FoxGoesBrrr
    @FoxGoesBrrr 3 days ago

    content quality so high i think i need to pay for ts 😭🙏

  • @takeraparterer
    @takeraparterer 8 days ago +4

    ua-cam.com/video/_B2RImihdUI/v-deo.html That's not correct: GPT models predict every "next word" from a sequence at the same time

    • @8AAFFF
      @8AAFFF  8 days ago +1

      yeah 100% correct
      i just lied about it in the beginning for the explanation to be easier, but i do later correct myself
      well done for noticing :)

  • @Leb369
    @Leb369 8 days ago

    Very good video; the only flaw is the sound quality.

  • @yyhhttcccyyhhttccc6694
    @yyhhttcccyyhhttccc6694 3 days ago

    what if you just made it output letters and looped it a bunch of times to spell words?

  • @rikhendrix261
    @rikhendrix261 6 days ago

    3:11 I thought GPT-3 had a 12,288 embedding size. You are saying as high as 4096.

    • @8AAFFF
      @8AAFFF  6 days ago +1

      tbh i asked chatgpt what its embedding dim is XD so idk if it's correct
      i looked it up again and you're right, the biggest version of gpt3 has a 12k embedding dim, and openai is really secretive about gpt4 so i'm probably completely wrong on that. thanks for pointing it out :)

    • @rikhendrix261
      @rikhendrix261 6 days ago

      @@8AAFFF It's okay, I thought I might have been wrong. At 4:57 you say that you are going to compare the vector of the word it wants to predict to all the words in your database via their vector representations (RAG with cosine similarity). But a word like "mole", for example, can be an animal, something on your cheek or arm, or something to do with the number of molecules, 6.02 x 10^23. Does this mean that your word database has these words written down multiple times?
      And at some point you said you had 340,000 words in your database?? Instead of the 40,000 from OpenAI?
      I'm also interested to know what the most important thing you learned during this time was. I have only been learning about AI recently, so I'm all ears.

    • @8AAFFF
      @8AAFFF  6 days ago +1

      ah i get what you're saying. basically the word2vec model has a bunch of tokens in its vocabulary. every token can appear only once, but usually there are some accidental half-duplicates like "mole", "mole ", " mole" etc...
      usually the duplicates have pretty much the same meaning as the "true" word just because they appear in exactly the same contexts.
      because the words are not written down multiple times, there are some misplaced words that have meanings in two different areas, so they are just awkwardly put in the middle of both "meaning areas".
      this doesn't really hurt the neural net because i'm relying on it understanding that even if a word's vector doesn't 100% match the situation, it's still good because of the context of whatever it's typing.
      as for the vocab size being 340k instead of something smaller like 40k, that's due to me using a tokenizer that splits the text into bigger tokens, usually the size of full words, instead of smaller half-words like what openai uses.
      so for me "hello, world!" would be split into something like: "hello" "," " world" "!"
      and for openai the same thing would be split into something like: "hel" "lo" ", " "wo" "rld" "!"
      so openai needs fewer of these tokens to fully encode english
      and probably the biggest thing i learned with this was how to properly use tensorboard. it's a library that lets you track stuff like the loss graphs in real time, compare how two different models trained, and more stuff like that.
      the best way to learn about ai stuff is to do big projects like this, because you actually encounter problems in the wild and then solve them, instead of just learning about solutions to problems you never had
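
      (For illustration, a rough sketch of the whole-word splitting described above; this is an assumption about the general idea, not the video's actual tokenizer:)

          import re

          def word_level_tokenize(text: str) -> list[str]:
              """Split text into word-sized tokens, keeping punctuation as separate tokens."""
              return re.findall(r" ?[A-Za-z0-9]+|[^A-Za-z0-9\s]", text)

          print(word_level_tokenize("hello, world!"))
          # -> ['hello', ',', ' world', '!']  (a subword tokenizer would split further, e.g. "hel" "lo" ...)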

    • @rikhendrix261
      @rikhendrix261 6 days ago

      @@8AAFFF Wow, very interesting! Yes, I now understand why your vocabulary count was higher. This would also mean that for you a simple "normal" English sentence would consist of fewer total tokens than for OpenAI, which would save on compute.
      Do you by chance follow "Discover AI"? He has some very interesting videos on the new test-time compute and test-time training, which according to the literature save a lot of time and give great results, but my level of AI understanding isn't at that point yet. Maybe you would be able to combine that tactic with your own?
      I'll follow you and see what more you post.

  • @TimeLordRaps
    @TimeLordRaps 6 days ago

    bros cracked. thank you fellow.

  • @user-qw1rx1dq6n
    @user-qw1rx1dq6n 6 days ago

    you should probably use a cosine similarity loss
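
    (For illustration, a minimal PyTorch sketch of training against the target word vector with a cosine-based loss instead of plain MSE; the shapes and names are assumptions:)

        import torch
        import torch.nn.functional as F

        def cosine_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
            """1 - cos(pred, target), averaged over the batch; 0 when the directions match exactly."""
            return (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()

        pred = torch.randn(8, 784, requires_grad=True)  # stand-in for model outputs
        target = torch.randn(8, 784)                    # word2vec vectors of the true next words
        cosine_loss(pred, target).backward()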

  • @TheTruthOfAI
    @TheTruthOfAI 6 days ago

    Hahaha funny guy.. it's like reading a long gpt4 hallucination

  • @_XoR_
    @_XoR_ 4 days ago

    Add Byte-Pair Encoding to it :P

  • @ВладЧорний-ч4и
    @ВладЧорний-ч4и 6 days ago

    This is fire

  • @Vine_Zx
    @Vine_Zx 6 days ago

    Remember me when you make it big!

  • @tevincoburn1469
    @tevincoburn1469 5 days ago

    Dude. Great video but like... bump up your audio by like 4 dB. You're so quiet I have to max out my speakers.

    • @8AAFFF
      @8AAFFF  5 days ago

      Thanks! A lot of people said that. Was the general audio too quiet or just the voiceover?

  • @callowaysutton
    @callowaysutton 5 days ago

    I think you're running into issues at the front of the pipeline. When "translating" from the vocabulary to the tokens, try just blacklisting tokens already mentioned in the past 3 tokens up to the point you're translating at (a rough sketch of this appears below the thread).

    • @8AAFFF
      @8AAFFF  5 days ago +1

      To be honest i didn't think of that
      This could also work as some sort of temperature like GPTs have

    • @callowaysutton
      @callowaysutton 5 days ago

      @@8AAFFF Temperature would be more equivalent to a left-tailed random distribution over the list of tokens for the given category; this would just be a repeat penalty
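
      (For illustration, a rough sketch of the blacklisting idea from this thread, assuming nearest-neighbour decoding over a word-vector table; all names are hypothetical:)

          import numpy as np

          def decode_with_repeat_block(pred, word_vecs, words, recent, window=3):
              """Return the word closest to the predicted vector, skipping words emitted in the last `window` steps."""
              banned = set(recent[-window:])
              order = np.argsort(-(word_vecs @ pred))   # most similar first (dot-product similarity)
              for idx in order:
                  if words[idx] not in banned:
                      return words[idx]
              return words[order[0]]                    # fallback if everything nearby was recently used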

  • @averesenso
    @averesenso 8 days ago +3

    Your voice is quiet on my speakers

  • @yyhhttcccyyhhttccc6694
    @yyhhttcccyyhhttccc6694 3 days ago

    tutorial?

  • @CC1.unposted
    @CC1.unposted 4 days ago

    The reason GPT and other transformers output every token instead is that it solves a real problem, so there are no weird outputs (like the blurring effect you get in image generation when a model is trained on multiple similar outputs for the same input, etc.).
    You could just use renderable ASCII. Why use word2vec? You're using word2vec to solve this problem, and yet you're still making a worse transformer: the model will still need to memorize every word vector, because now it needs to return a new vector! Token outputs give it extra capacity by letting it not learn the vector embeddings exactly, only understand them as much as it needs to.
    You're just training a worse transformer model.
    The current challenge of AI is generalisation, for which you need something like a time-dependent architecture, not this universal vector-to-vector approximator.
    Human brains do this; in fact meta-learning is also this, but it's just too computationally intensive, like you don't want to store so much specific data per user, it would be infeasible.
    You should just abandon this project! You could try making some code which mutates or trains a JS string, so users could write test cases and get a function back, which is far better. In fact I also tried it using basic mutation in Node.js, but it was painfully slow because I didn't make it mutate regressively (instead of changing random chars, changing keywords or syntax-friendly chars).

  • @epicman9105
    @epicman9105 4 days ago

    ur sick dude

  • @MommysGoodPuppy
    @MommysGoodPuppy 6 days ago

    holy GOATED

  • @fortaber
    @fortaber 8 days ago +2

    The editing of the video is just amazing!!

  • @lobiqpidol818
    @lobiqpidol818 6 days ago

    🤓 well AksUaLly each embedding vector takes up space on the device. So while you save space by vector quantizing the output embeddings the vocabulary is still limited by GPU space. Also you lose the ability to do some calculations on the output like temperature. Good video

    • @yoavco99
      @yoavco99 6 days ago

      You can probably just have it not be on the gpu, and just check the closest token on like the CPU or whatever.
      Also can't you just easily recreate temperature with this?

    • @8AAFFF
      @8AAFFF  6 days ago +1

      yeah that's exactly what i'm doing
      the word2vec weight is stored in regular RAM and is only used to translate tokens back and forth.
      so the only stuff in GPU VRAM is the network and the already-translated vectors.
      it's true that i don't really have regular temperature like other GPT models, but i can sort of recreate it by either adding noise to the input or selecting the 2nd/3rd closest word to the network output instead of the 1st :)

    • @user-qw1rx1dq6n
      @user-qw1rx1dq6n 6 days ago

      You can absolutely recreate temperature if you just train the embedding model differently

    • @lobiqpidol818
      @lobiqpidol818 6 days ago

      @@8AAFFF What I've been thinking about: what if you used very small embedding vectors, only 2 dims for example, to represent words, then immediately expanded them to more dimensions with linear layers inside the model? Does the model see this as the same thing or as something completely different?

  • @sandded7962
    @sandded7962 6 days ago

    That’s crazyyyyy

    • @8AAFFF
      @8AAFFF  6 days ago

      cdn.discordapp.com/attachments/888851399490797620/1242133470235594853/attachment.gif?ex=675e49b1&is=675cf831&hm=dc928ebc5d6bb49010b1d0ce10dd3a420fbc86c69d8aeed38906f4dc3a526d0a&

  • @RasmusSchultz
    @RasmusSchultz 6 days ago

    interesting idea, except... it doesn't seem to work? 😅

  • @Myexpectationsarerealistic
    @Myexpectationsarerealistic 5 days ago

    I did the same thing.

  • @absentmindhere
    @absentmindhere 5 days ago

    nice

  • @raihanhossain3423
    @raihanhossain3423 6 days ago

    Is that your real credit card number? 🙃

    • @8AAFFF
      @8AAFFF  6 days ago

      one way to find out

  • @piglava
    @piglava 4 days ago

    I am writing this comment to comment a comment comment comment comment comment comment comment, and comment comment, comment...

  • @idf3da
    @idf3da 8 days ago

    top!

  • @that_guy1211
    @that_guy1211 6 days ago

    Bro, not trynna be mean or anything, but your AI looks dumb as hell.... Keep working on it mah dude, would love to see this architecture get better with an actual decent LLM on it bruv!

  • @Tenraiden
    @Tenraiden 6 days ago +2

    Speak louder!!