how the tokenizer for gpt-4 (tiktoken) works and why it can't reverse strings

  • Published 16 Jan 2024
  • chris breaks down the chatgpt (gpt-4) tokenizer and shows why large language models such as gpt, llama-2 and mistral struggle to reverse words. he looks at how words, programming languages, other natural languages and even morse code are tokenized, and shows how tokenizers tend to be biased towards english and programming languages (a quick sketch of the reversal problem follows below).
  • Science & Technology
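
a rough sketch of the failure mode the video demonstrates, using the tiktoken library (the example word and printed output are illustrative, not taken from the video): gpt-4's tokenizer maps a word to a few multi-character chunks, so the model never sees individual letters, and reversing the token sequence does not reverse the string.

    import tiktoken

    # load the tokenizer used by gpt-4 (cl100k_base under the hood)
    enc = tiktoken.encoding_for_model("gpt-4")

    word = "strawberry"                      # illustrative example word
    token_ids = enc.encode(word)

    # the model sees a handful of integer ids, not ten separate letters
    print(token_ids)
    print([enc.decode([t]) for t in token_ids])   # multi-character chunks; exact split may vary

    # reversing the token sequence does not reverse the characters
    print(enc.decode(list(reversed(token_ids))))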

COMMENTS • 6

  • @ernestuz
    @ernestuz A month ago +1

    The funny thing is that the more complete the vocabulary, the less pressure on the upper layers, so it's not only cheaper because of fewer tokens but also cheaper in processing. I wonder if somebody has prepared a semi-handcrafted tokenizer where, let's say, the first 30K tokens come from a dictionary and the rest are generated (see the sketch after this thread).

    • @chrishayuk
      @chrishayuk  10 days ago

      exactly. tbh, i wouldn't be surprised if someone goes that direction
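
a minimal sketch of that idea, assuming a hypothetical HybridTokenizer (not an existing library): whole words from a hand-picked dictionary take the low ids, and anything outside the dictionary falls back to a learned bpe (tiktoken here) with its ids shifted above the dictionary range.

    import tiktoken

    class HybridTokenizer:
        # hypothetical: dictionary words -> ids 0..len(dictionary)-1,
        # everything else -> bpe ids offset by the dictionary size
        def __init__(self, dictionary, bpe_name="cl100k_base"):
            self.word_to_id = {w: i for i, w in enumerate(dictionary)}
            self.id_to_word = dict(enumerate(dictionary))
            self.bpe = tiktoken.get_encoding(bpe_name)
            self.offset = len(dictionary)    # e.g. 30_000 in the comment above

        def encode(self, text):
            ids = []
            for word in text.split():        # naive whitespace split, enough for a sketch
                if word in self.word_to_id:
                    ids.append(self.word_to_id[word])
                else:
                    ids.extend(t + self.offset for t in self.bpe.encode(" " + word))
            return ids

        def decode(self, ids):
            pieces = []
            for i in ids:
                if i < self.offset:
                    pieces.append(" " + self.id_to_word[i])
                else:
                    pieces.append(self.bpe.decode([i - self.offset]))
            return "".join(pieces).lstrip()

    tok = HybridTokenizer(["the", "cat", "sat"])
    print(tok.encode("the cat sat quietly"))              # three dictionary ids + bpe ids for "quietly"
    print(tok.decode(tok.encode("the cat sat quietly")))  # round-trips back to the input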

  • @feniyuli
    @feniyuli 2 months ago +1

    It is very helpful to understand how tokenization works. Thanks! Do you think the data we encode using tiktoken will be sent to the AI?

    • @chrishayuk
      @chrishayuk  10 days ago

      definitely not, it's all local (see the note below)
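
this is easy to verify yourself: tiktoken encodes entirely in-process, with no api key and no request carrying your text (it may fetch and cache the public vocabulary file once on first use).

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    secret = "this text never leaves the machine"
    ids = enc.encode(secret)                 # plain integers computed locally
    assert enc.decode(ids) == secret         # round-trips without sending the text anywhere
    print(ids)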

  • @ilyanemihin6029
    @ilyanemihin6029 3 months ago +2

    Thanks, very interesting information