Real-world Data Prep for LLMs: Challenges and Solutions

Yujian Tang

1,033 views

Published on Sep 8, 2024
Building LLM applications? One of the biggest problems you'll face is giving the LLM good input data, because good LLM responses depend on it. The clean, native-text PDFs used in explainer articles and example code are rarely what you'll encounter in production. Real-world data is wild, to say the least!

Here are some challenges you'll face:
- Scanned PDFs
- Scans with non-standard orientations
- PDF forms with checkboxes and radio buttons
- Handwritten forms
- Documents photographed with a smartphone
- Complex tables
- Tables that span pages
In this practical workshop, we'll compare the libraries and techniques at our disposal, looking at their strengths and limitations. The talk aims to arm you with the knowledge to extract raw text from real-world documents, so that the text can be sent to Large Language Models and structured for easy downstream processing.
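
As a rough illustration of why scanned and photographed documents need different handling from native-text PDFs, here is a minimal sketch of a per-page routing step. It assumes you have already pulled whatever embedded text each page has, using a PDF library of your choice. The names `PageRoute`, `route_page` and `route_document`, and both thresholds, are hypothetical and chosen for illustration. They do not come from the talk or from any Unstract product.

```python
from dataclasses import dataclass


@dataclass
class PageRoute:
    page_number: int
    route: str   # "native_text" or "ocr"
    reason: str


def route_page(page_number: int, extracted_text: str,
               min_chars: int = 50,
               min_alnum_ratio: float = 0.6) -> PageRoute:
    """Decide whether a page's embedded text layer looks usable.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    text = extracted_text.strip()
    if not text or len(text) < min_chars:
        # Scanned or photographed pages usually carry little or no text layer.
        return PageRoute(page_number, "ocr", "too little embedded text")

    non_space = [c for c in text if not c.isspace()]
    alnum_ratio = sum(c.isalnum() for c in non_space) / len(non_space)
    if alnum_ratio < min_alnum_ratio:
        # A text layer made mostly of symbols is likely broken.
        return PageRoute(page_number, "ocr", "text layer looks garbled")

    return PageRoute(page_number, "native_text", "usable text layer")


def route_document(page_texts: list[str]) -> list[PageRoute]:
    """Route every page of a document; page numbers start at 1."""
    return [route_page(i, t) for i, t in enumerate(page_texts, start=1)]
```

Pages routed to `"ocr"` would then go through an OCR step before their text is sent to the LLM. A heuristic like this does nothing for rotated scans, checkboxes, handwriting or tables that span pages, which need their own handling.
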
Your speaker, Shuveb Hussain, is the co-founder and CEO of Unstract, an open-source startup building an LLM-powered platform that extracts data from unstructured documents, helping automate critical business processes. Unstract currently extracts and structures millions of pages of real-world data every month. The two products they offer are LLMWhisperer, a raw text extraction API, and Unstract, an LLM-powered data structuring platform.
