How the GPT-4 tokenizer (tiktoken) works and why it can't reverse strings
- Published Jan 16, 2024
- Chris breaks down the ChatGPT (GPT-4) tokenizer and shows why large language models such as GPT, Llama-2, and Mistral struggle to reverse words. Chris looks at how words, programming languages, other natural languages, and even Morse code are tokenized, and shows how tokenizers tend to be biased towards English and programming languages.
- Science & Technology
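The reversal problem the video describes comes down to how subword tokenizers split text: a common word maps to a few learned pieces, while its reversed spelling is rare and shatters into many single characters, so the model never "sees" the letters in order. A minimal sketch of this effect, using a toy greedy longest-match tokenizer with a hypothetical vocabulary (not tiktoken's actual BPE merges):

```python
# Toy greedy longest-match tokenizer — a sketch of subword splitting,
# NOT tiktoken's real BPE algorithm or vocabulary.
VOCAB = {"token", "izer", "ing", "re", "the", "er", "to"}  # hypothetical vocab

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        # Greedily take the longest vocab match starting at i,
        # falling back to a single character (like byte fallback in BPE).
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("tokenizer"))        # the familiar word: few subword pieces
print(tokenize("tokenizer"[::-1]))  # reversed: fragments into many pieces
```

With a real encoding you'd see the same pattern via `tiktoken.get_encoding("cl100k_base").encode(...)`: the forward word costs a couple of tokens, the reversed string costs many more, and the model has to reason about characters it was never given as separate units.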
The funny thing is that the more complete the vocabulary, the less pressure on the upper layers, so it's not only cheaper because of fewer tokens but also in processing. I wonder if somebody has prepared a semi-handcrafted tokenizer where, say, the first 30K tokens come from a dictionary and the rest are generated.
exactly. tbh, i wouldn't be surprised if someone goes in that direction
It is very helpful to understand how tokenization works. Thanks! Do you think data that we encode using tiktoken will be sent to the AI?
definitely not, it's all local
Thanks, very interesting information
glad it was useful