LlamaOCR - Building your Own Private OCR System

Sam Witteveen

Додати в
- Мій плейлист
- Переглянути пізніше
Поділитися

Поділитися

Вставка

Розмір відео:

Показувати елементи керування програвачем

Автоматичне відтворення

Автоповтор

Опубліковано 26 гру 2024

КОМЕНТАРІ • 57

@jameswagstaff1962 Місяць тому ⁺²
I just tried this, it is very simple to use but it is basically just a wrapper for the together-ai package. All this is doing is restricting configurability! But thank you very much for the video and pointing me to this project. I was surprised at how accurate it is
@Charles-Darwin Місяць тому ⁺¹
Vision models be mysterious wizardry. They make me the most excited out of all bc I firmly believe a future conscious 'model' could be iterated from vision models (not new, but not mentioned enough i think). If there were a way to keep the vision model exclusively in virtual space... a whole wealth of experimentation could open up with visualizing things, it might even turn hallucinations into useful features.
@WhyitHappens-911 Місяць тому ⁺⁴
Nice! Any difference with docling or llamaparse solutions?
@TheRealChrisVeal Місяць тому ⁺¹
exciting!
@gotonethatcansee 26 днів тому
there used to be a chrome extension that made any img text editable , where is it
@bzmrgonz Місяць тому ⁺¹
I'm gonna suggest this video to PAPERLESS-NGX, I think this needs to be a MUST feature on that project.
@murattosundan Місяць тому ⁺¹
Can it recognize license plates in non latin alphabets?
@Piotr_Sikora Місяць тому ⁺⁵
Doing simple OCR via LLM is shut fly using bazooka.
@_PataNahi Місяць тому ⁺²
I think they have the capability to understand the context of the information of the input. If there is any mistakes like simple letter mistakes, there maybe could be a feature to automatically correct those. There could also be a slider to adjust between more most original to most sensible. Without any of these, its just like any other model I guess.
@IoT_ Місяць тому ⁺¹
Actually, it can be even worse than the specialized models like YOLO, Tesseract ,Paddle ,etc.
For instance if you have custom ASCII symbols no LLM can provide a good recognition pattern like fine-tuned OCR library can
@bzmrgonz Місяць тому
Question @Sam, so would the design of forms, documents etc to assist OCR help? For example delimiting label:data with a colon(:). Assuming colons have no reason to exist in text. In your opinon what works best? delimiters, color contrast?
@darkreader01 Місяць тому ⁺³
does it work with handwritten text?
@gurupartapkhalsa6565 Місяць тому
No, but you can train your own to work on your own handwriting specifically, without too much difficulty.
@seadude 29 днів тому
GPT-4o is surprisingly good at handwriting OCR, but as with all GenAI output, you must validate before using it for anything critical.
@victorkarlsson5183 Місяць тому ⁺³
I'd be super interested in knowing the process of training on object detection / region of interest. Anyone have pointers where I can read up on this?
@KEKW-lc4xi Місяць тому ⁺³
I've done it before using YOLOv7 (don't use v8 that requires you to use some cringe website)
And then for labeling images I used CVAT. CVAT will let you label and store your images and then save to yolo format and then it's a matter of piping it through to YOLOv7 framework for training.
@seadude 29 днів тому
Hm…I’d rather use Python to crop the image to a given region, then feed the entire cropped image to the vision model. Not sure why / if you can train a “general vision model” to only look at certain regions of an image…could be interesting but doesn’t that turn the model into a more traditional supervised model at that point?
@ifeanyinnaemego Місяць тому ⁺¹
Can it capture handwritten text perfectly
@KleiAliaj Місяць тому
Is it possible to do it in javascript ?
@SDAravind Місяць тому
Can we get Bounding boxes using this model?
@itsbhardwaj1677 Місяць тому
when you are integrating it with Agents ?
@beingalien6394 18 днів тому
How can i convert op to required op as json
@minhsenma Місяць тому
How many languages supposed?
@nirmesh44 Місяць тому
already fan of your videos the way you explain. Can you Please tell only for pdf document which llm model is good? i want to use locally. unstructured didn't help. even after pdf to image pixtral also didnt work. i want perfect accuracy.
@seadude 29 днів тому
Use a dedicated OCR model like tesseract or Azure Document Intel if you want to increase accuracy. Vision models should not be used for OCR at this point in the technology, at least not where accuracy matters.
@el_arte Місяць тому
What are the benefits of using a giant LLM for something as simple as OCR?
@samwitteveenai Місяць тому
They can get better results than things like Tesseract. You don't have to use a huge model like the 90b you can often get very good results as a much smaller model
@el_arte Місяць тому
@ Does it help with extracting content from complex layouts? At a semantic level.
@hqcart1 Місяць тому
after downloading tons of agents, i found out the hardwaym if you are using chatgpt or claud, agents are 100% useless and will give you worse results in real life applications, it's too early to adapt them.
i think agents should actually be an LLM but in a very specific field, for example, an agent just know how to do math, or codes just in js, beats o1 model by a margine, and doesn't know anything else.
@daarrrkko Місяць тому ⁺¹
OCR is not simple, and quality can be really bad. It also doesn't preserve original layout since it really just looks at characters in isolation.
@el_arte Місяць тому ⁺¹
@ You can get way above 90% accuracy from models with less than 25 million parameters. As for extracting from arbitrary layouts, that remains hard, hence my follow up question.
Місяць тому
how to get rid of hallucination especially in this kind of project? i json a good ouptu format?
@ivan007230 Місяць тому ⁺²
I would say that on its own json output alone won’t help. It is only helpful if you know the structure of the data that is to be extracted (say, every document has a title, table with certain columns, etc). Then specifying json schema (expected output format) should help
@coredog64 Місяць тому ⁺³
A few things that have helped me: Use a temperature at/near zero. If you have the potential for empty data, prompt to leave it out rather than give empty values.
@sandorkonya Місяць тому
@@coredog64 leaving out is def. a good strategy. it even saves tokens.
@samwitteveenai Місяць тому ⁺⁴
Another trick if latency isn't an issue is the sample multiple times and use an LLM as a judge to look for what is consistent and what just gets hallucinated occasionally
@staticalmo Місяць тому
did someone try to integrate it in n8n?
@alogghe Місяць тому
This seems objectively bad at the job.
The Walmart receipt just flat out ignored the whole central column of numbers.
Reordering sections of text...
Not seeing its usefulness at this level of error and garbling things.
What about a mixed tesseract + LLM to correct it?
@samwitteveenai Місяць тому
yes this is why I talked about the Regions of Interests concept but I personally wouldn't use Tesseract for this. Also fine tuning the model for the kind of OCR that you want will halp it get much better as well.
@daarrrkko Місяць тому
@@samwitteveenaiis there a way to generate synthetic scans at scale based on a certain structure? I think you mentioned using a tool to create the scan.
@OnePlusky Місяць тому ⁺³
Submitting your data to 3rd party is not PRIVATE !
@samwitteveenai Місяць тому ⁺²
All the models that I showed here can be run locally, most people wont have the GPUs to do it for the 90b though
@viky2002 Місяць тому ⁺³
Qwen vl is better than llama 3.2 on ocr
@choiswimmer Місяць тому ⁺¹
Besides the huggingface leaderboards, do you have a live production example proving this?
@zmeta8 Місяць тому
try the space of it on hf
@murattosundan Місяць тому
Its not better for thai license plates, i tested it.
@seadude 29 днів тому
Using a vision model for OCR is way too prone to hallucinations for anything critical. There are dedicated OCR tools that provide way more accuracy. At this point in the technology, I’d only use vision models for describing images, and only if they were not critical.
@murattosundan 29 днів тому ⁺¹
@ I don’t plan to use it in production. Unfortunately, of all the free ocrs available to python, none of them worked well enough for license plate reading even with post processing.
@wangbei9 Місяць тому
If the model can return the coordinates, then it will be great and no point to use the OCR service from Microsoft and google anymore.
@ShresthShukla-h9n Місяць тому
👀👀
@orangehatmusic225 Місяць тому
What a weird wrapper project. Just use llama vision and say :
`Convert the provided image into Markdown format. Ensure that all content from the page is included, such as headers, footers, subtexts, images (with alt text if possible), tables, and any other elements.
Requirements:
- Output Only Markdown: Return solely the Markdown content without any additional explanations or comments.
- No Delimiters: Do not use code fences or delimiters like \`\`\`markdown.
- Complete Content: Do not omit any part of the page, including headers, footers, and subtext.
`;
cause literally that's all this project is doing.
@orangehatmusic225 Місяць тому ⁺¹
PS you need 64gb ram to run this version... not a very good script.
@suryakantbrewr Місяць тому
@@orangehatmusic225use google colab
@nikosterizakis Місяць тому
Not sure of the usefulness of this. You can always use Lens and runs on a mobile phone ;)
@greendsnow Місяць тому
There is Tika for that. Stop showing AI as the address to solved problems
@erniea5843 Місяць тому ⁺³
You do realize Tika uses deep learning… which is what fundamentally makes LLMs.

Наступне

Автоматичне відтворення

Multi-Agent AI EXPLAINED: How Magentic-One Works