why llama-3-8B is 8 billion parameters instead of 7?
- Published 20 Apr 2024
- llama-3 has ditched its tokenizer and has instead opted to use the same tokenizer as GPT-4 (tiktoken, created by OpenAI); it's even using the same first 100K-token vocabulary.
In this video Chris walks through why Meta has switched tokenizers and the implications for model size, the embeddings layer, and multilingual tokenization.
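The title question can be sketched with back-of-envelope arithmetic. The figures below are assumptions for illustration (Llama-2 vocabulary of 32,000, Llama-3 vocabulary of 128,256, hidden size of 4096, untied input embedding and output head), not numbers confirmed in the video:

```python
# Assumed figures: Llama-2 vocab 32,000; Llama-3 vocab 128,256;
# hidden size 4096; untied input embedding + output head.
hidden = 4096
vocab_llama2 = 32_000
vocab_llama3 = 128_256

def embedding_params(vocab, hidden, tied=False):
    """Parameters in the input embedding plus the output head, each vocab x hidden."""
    matrices = 1 if tied else 2
    return matrices * vocab * hidden

extra = embedding_params(vocab_llama3, hidden) - embedding_params(vocab_llama2, hidden)
print(f"{extra / 1e9:.2f}B extra parameters from the larger vocabulary")
```

Under these assumptions the vocabulary jump alone adds roughly 0.79B parameters, which accounts for a large share of the gap between a "7B" and an "8B" model even before any other architectural changes.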
He also runs his tokenizer benchmark and shows how it's more efficient in languages such as Japanese.
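One reason vocabulary coverage matters so much for Japanese: when a byte-level BPE has no Japanese merges, it falls back to raw UTF-8 bytes, and Japanese costs three bytes per character while ASCII English costs one. A self-contained toy illustration (not the benchmark from the video):

```python
# Toy illustration: a byte-level tokenizer with no Japanese merges pays at
# least ~3 tokens per character for Japanese, since UTF-8 encodes kana and
# kanji as 3 bytes each, while ASCII English stays at 1 byte per character.
def bytes_per_char(text):
    return len(text.encode("utf-8")) / len(text)

print(bytes_per_char("hello world"))    # 1.0
print(bytes_per_char("こんにちは世界"))  # 3.0
```

A vocabulary that contains learned Japanese merges can instead emit roughly one token per character (or better), which is the efficiency gain the benchmark measures.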
repos
------
github.com/chrishayuk/embeddings
github.com/chrishayuk/tokeniz...
Excellent demonstration, Chris, thanks for sharing!
Great stuff.. no-nonsense presentation style, clear and technical, as it should be 😅.. question: is there a reason why it's not better to have common English syllables in the vocabulary? I understand “lov” being there, but I can’t imagine that “el” is a very useful token as part of “Lovelace”.. intuitively, I would think that it should simply be tokenized as “love” and “lace”
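Whether a fragment like “el” ends up in the vocabulary is decided by pair frequency in the training corpus, not by linguistic intuition: BPE greedily merges whichever adjacent pair occurs most often. A minimal toy trainer (an illustrative sketch, not tiktoken's actual implementation) shows the mechanism:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges from a toy corpus by greedy pair frequency."""
    vocab = Counter(tuple(w) for w in words)  # word -> frequency, as symbol tuples
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins the merge
        merges.append(best)
        # Apply the merge everywhere before counting again.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# "el" wins the first merge here simply because it is the most frequent pair.
print(bpe_merges(["el", "el", "love"], 1))
```

So if “el” appears often enough across the corpus (in “element”, “level”, “model”, …), it earns a merge early, even if it looks unhelpful inside any one word like “Lovelace”.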
Ok, that is all very concrete! Awesome, thanks for this. This seems like a lot of quick wins that are easy to discover, or does it only seem that way in hindsight because you explain it so clearly? Anyway, it's all a bit new to me. Perhaps a country like Norway would be wise to run this with their own tokeniser? Or is that too simplistic thinking?
I'm super excited to see the `llama.cpp`, `llama2.c`, etc. category be implemented for llama3!
Agree
llama.cpp already supports Llama3
great video, thank you
Thank you, glad it was useful
What are your thoughts on including spaces in the tokenizer? I tried it once and the LLM was optimising to predict spaces, as those were easy wins for it. But I like the way tiktoken keeps the space attached to a word rather than making the space a token on its own...
I’m okay with it. If you watch my video on visualizing the embeddings layer, you’ll see that words with and without leading spaces are so closely correlated in the initial embeddings layer that it’s basically a non-issue. The cost, however, is the size of the vocabulary and therefore of the embeddings layer. On the other hand, not handling spaces separately makes the model much more efficient, so having a word with its leading space as a single token makes a lot of sense.
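The behaviour described here can be illustrated with a simplified stand-in for the GPT-style pre-tokenization split (the regex below is a toy approximation, not tiktoken's actual pattern): each word carries its leading space, so the space never becomes a token of its own.

```python
import re

# Toy approximation of GPT-style pre-tokenization: a word absorbs the
# single space before it, so " Lovelace" and "Lovelace" are different
# pieces and a lone space token never needs to be predicted.
PRETOKEN_RE = re.compile(r" ?\w+| ?[^\w\s]+|\s+")

def pretokenize(text):
    return PRETOKEN_RE.findall(text)

print(pretokenize("Ada Lovelace wrote programs."))
```

Each piece would then be split further by the BPE merges; the point is just that spaces travel with the word that follows them.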
Why is there some PyTorch in there? Do finetuned or merged versions need it?