💡 There is a SMARTER way to split your documents for GenAI apps

  • Published 12 Jun 2024
  • Learn semantic splitting in this hands-on tutorial to improve your language model's performance on document-processing tasks.
    We dive into a practical Python implementation for finding optimal segmentation points by meaning, essential for retrieval-augmented generation; a minimal sketch of the idea appears below the description.
    Code along with me using the GitHub-hosted notebook and elevate your app's efficiency with this smart splitting strategy.
    GitHub Repo: github.com/bitswired/semantic...
    🌐 Visit my blog at: www.bitswired.com
    📩 Subscribe to the newsletter: newsletter.bitswired.com/
    🔗 Socials:
    LinkedIn: / jimi-vaubien
    Twitter: / bitswired
    Instagram: / bitswired
    TikTok: / bitswired
    00:00 Why Do We Split Documents?
    02:02 Semantic Splitting: The Theory
    05:06 Semantic Splitting: The Practice
    11:28 Takeaways
  • Science & Technology
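
A minimal sketch of the semantic-splitting idea covered in the video (not the notebook's exact code; the regex sentence splitter, the all-MiniLM-L6-v2 model, and the 0.6 threshold are illustrative assumptions): embed each sentence, compare neighbouring sentences with cosine similarity, and start a new chunk wherever similarity drops, so that chunk boundaries follow shifts in meaning rather than a fixed character count.

```python
# Illustrative sketch of semantic splitting (assumes: pip install sentence-transformers numpy).
import re
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_split(text: str, threshold: float = 0.6) -> list[str]:
    """Group consecutive sentences into chunks, breaking where adjacent sentences diverge in meaning."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    model = SentenceTransformer("all-MiniLM-L6-v2")           # assumed model choice
    embeddings = model.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]
    for prev_vec, next_vec, sentence in zip(embeddings[:-1], embeddings[1:], sentences[1:]):
        similarity = float(np.dot(prev_vec, next_vec))        # cosine similarity (vectors are unit-normalized)
        if similarity < threshold:                            # meaning shifts -> close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

The notebook may well use a different sentence splitter, embedding model, or breakpoint rule (for example a percentile of the similarity distribution instead of a fixed threshold); the values here are placeholders to show the mechanism.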

COMMENTS • 15

  • @HassanAllaham · 27 days ago · +1

    This is one of the most powerful AI-related videos I have ever seen. Very clear, very informative, and very useful. Thanks for the good content 🌹🌹🌹

    • @bitswired · 27 days ago · +1

      Thank you very much for your kind words!
      It means a lot to hear that the video had such a positive impact on you; it makes all the effort worth it.
      Thanks again for watching and for taking the time to leave such a thoughtful comment 👍🏽

  • @natevaub · a month ago · +2

    Great video bro, keep going with these fire topics!

    • @bitswired · a month ago

      Thanks bro 💪🏽
      Let’s gooooo!
      Let’s make it work and play Elden Ring soon ahah

  • @cyberpunkdarren · 7 days ago · +1

    Once all the vectors are loaded into the vector database, the text splitting no longer matters. As long as you don't split on a compound word or phrase, it doesn't really affect the vector space.

    • @bitswired · 6 days ago

      Hey :)
      I see your point, but in practice that's not the case.
      For instance, if you embed an entire page versus multiple smaller paragraphs, the resulting vectors will be different even though you've indexed the same text.
      And that affects the similarity search.
      That's why pyramidal embeddings are a way to improve RAG performance: index the data at different precision levels and use multiple indexes to answer queries.
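
To illustrate the reply above, here is a minimal sketch (assuming the sentence-transformers library and an arbitrary model and texts; not code from the video) showing that embedding a whole page versus its individual paragraphs yields different vectors, and therefore different similarity scores against the same query:

```python
# Illustrative sketch (assumes: pip install sentence-transformers; model and texts are arbitrary).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

paragraphs = [
    "Semantic splitting groups sentences by meaning before indexing them for retrieval.",
    "The museum's new exhibition opens next month and features baroque paintings.",
]
page = " ".join(paragraphs)  # the same text, indexed as one big chunk

query = "How should I chunk documents for retrieval-augmented generation?"
query_vec = model.encode(query)

# A single page-level vector blends relevant and irrelevant content together...
print(util.cos_sim(query_vec, model.encode(page)))

# ...while per-paragraph vectors tend to keep the relevant paragraph much closer to the query.
print(util.cos_sim(query_vec, model.encode(paragraphs)))
```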

  • @vogendo7377 · a month ago · +2

    Very interesting

    • @bitswired · a month ago

      Thanks big boss ❤️

  • @mariegautier3765 · a month ago · +1

    Love it ❤ You know how to convey your passion, congrats 😍🦍🔥

    • @bitswired · a month ago · +1

      Thanks Bella ❤️🦍🐆
      EKIP to the max!

  • @oryxchannel · a month ago · +1

    Good presentation, but I do not understand how it's different from document AIs that can do this automatically. Why do this manually?

    • @bitswired · a month ago · +2

      Hey :)
      You're right, there are libraries that do it for you.
      However, the purpose of the video was to understand how it works in depth, so I proposed a simple implementation from scratch.
      The goal was to help people grasp the concept.
      I hope you still enjoyed the video 😁

  • @MichaelScharf · 4 days ago

    Great video! But totally annoying music.

    • @MichaelScharf · 4 days ago

      It makes it hard to understand you, and it distracts from your great work.

    • @MichaelScharf · 4 days ago

      If your video content were not so great, I would have stopped watching.