Real-world Data Prep for LLMs: Challenges and Solutions

Published Sep 8, 2024
    Building LLM applications? One of the biggest problems you'll face is presenting the LLM with good input data, because good LLM responses depend on it. The clean, native-text PDFs used in explainer articles and example code are rarely what you'll encounter in production. Real-world data is wild, to say the least!

    Here are some challenges you'll face:
    - Scanned PDFs
    - Scans with non-standard orientations
    - PDF forms with checkboxes and radio buttons
    - Handwritten forms
    - Documents photographed with a smartphone
    - Complex tables
    - Tables that span pages

    In this practical workshop, we'll compare the libraries and techniques at our disposal, looking at their strengths and limitations. The talk aims to equip you to extract raw text from real-world documents and send that text to Large Language Models, which can then structure the data for easy downstream processing.
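
As a rough illustration of why scanned PDFs complicate raw text extraction, here is a minimal Python sketch of one common routing idea: keep the embedded text of a page when it looks usable, and send the page to OCR otherwise. It is not taken from the workshop. The names `PageText`, `needs_ocr`, and `route_pages` and both thresholds are hypothetical, and the page text is assumed to come from whatever native-text extractor you already use.

```python
from dataclasses import dataclass


@dataclass
class PageText:
    number: int
    text: str  # text a native-text extractor returned for this page (assumed input)


def needs_ocr(page: PageText, min_chars: int = 40, min_alnum_ratio: float = 0.5) -> bool:
    """Guess whether a page is a scan: little or mostly garbled embedded text.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    stripped = page.text.strip()
    if len(stripped) < min_chars:
        return True  # almost no embedded text, likely an image-only page
    readable = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return readable / len(stripped) < min_alnum_ratio


def route_pages(pages):
    """Split page numbers into (use native text, send to OCR)."""
    native, ocr = [], []
    for page in pages:
        (ocr if needs_ocr(page) else native).append(page.number)
    return native, ocr


pages = [
    PageText(1, "Invoice number 12345 issued to Acme Corporation on 2024-09-08."),
    PageText(2, ""),               # scanned page: no embedded text at all
    PageText(3, "@@##%%&&" * 10),  # garbled text layer
]
native_pages, ocr_pages = route_pages(pages)
```

A heuristic like this only handles the first challenge in the list. Rotated scans, handwriting, checkboxes, and multi-page tables need layout-aware extraction, not just a choice between two text sources.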

    Your speaker, Shuveb Hussain, is the co-founder and CEO of Unstract, an open-source startup building an LLM-powered platform that extracts data from unstructured documents to help automate critical business processes. Unstract currently extracts and structures millions of pages of real-world data every month. It offers two products: LLMWhisperer, a raw text extraction API, and Unstract, an LLM-powered data structuring platform.
