Vision models are mysterious wizardry. They make me the most excited of all because I firmly believe a future conscious 'model' could be iterated from vision models (not a new idea, but not mentioned enough, I think). If there were a way to keep the vision model exclusively in virtual space... a whole wealth of experimentation could open up with visualizing things; it might even turn hallucinations into useful features.
I'm gonna suggest this video to PAPERLESS-NGX; I think this needs to be a MUST-have feature on that project.
Nice! Any difference compared to Docling or LlamaParse?
Question @Sam: would designing forms, documents, etc. to assist OCR help? For example, delimiting label:data with a colon (:), assuming colons have no other reason to exist in the text. In your opinion, what works best? Delimiters, color contrast?
I'd be super interested in knowing the process of training on object detection / region of interest. Anyone have pointers where I can read up on this?
I've done it before using YOLOv7 (don't use v8, which requires you to use some cringe website).
For labeling images I used CVAT. CVAT lets you label and store your images and save them in YOLO format, and then it's a matter of piping that through the YOLOv7 framework for training; a rough sketch of the data side is below.
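A minimal sketch of what that looks like on the data side, assuming CVAT's YOLO export (one .txt per image with normalized boxes); the paths and class names here are just placeholders, not from the video:

```python
from pathlib import Path

# Placeholder paths/classes -- point these at wherever CVAT exported your dataset.
LABELS_DIR = Path("dataset/obj_train_data")
CLASS_NAMES = ["field_label", "field_value"]

def load_yolo_labels(txt_path: Path):
    """Parse one YOLO-format label file: `class x_center y_center width height`, all normalized to 0-1."""
    boxes = []
    for line in txt_path.read_text().splitlines():
        cls, xc, yc, w, h = line.split()
        boxes.append((CLASS_NAMES[int(cls)], float(xc), float(yc), float(w), float(h)))
    return boxes

# Quick sanity check before handing the dataset to YOLOv7's training script.
for txt in sorted(LABELS_DIR.glob("*.txt")):
    for name, xc, yc, w, h in load_yolo_labels(txt):
        assert 0.0 <= xc <= 1.0 and 0.0 <= yc <= 1.0, f"bad box in {txt.name}"
    print(txt.name, "ok")
```

From there, training is mostly a matter of pointing the YOLOv7 repo's training script at a small config that lists those image/label folders and class names.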
exciting!
Can we get bounding boxes using this model?
Can it capture handwritten text perfectly?
When are you integrating it with agents?
Can it recognize license plates in non-Latin alphabets?
Doing simple OCR via an LLM is like shooting a fly with a bazooka.
I think they have the capability to understand the context of the input. If there are any mistakes, like simple letter mistakes, there could be a feature to automatically correct those. There could also be a slider to adjust between most original and most sensible. Without any of these, it's just like any other model, I guess.
Actually, it can be even worse than specialized models like YOLO, Tesseract, Paddle, etc.
For instance, if you have custom ASCII symbols, no LLM can provide as good a recognition pattern as a fine-tuned OCR library can.
Does it work with handwritten text?
No, but you can train your own to work on your own handwriting specifically, without too much difficulty.
What are the benefits of using a giant LLM for something as simple as OCR?
They can get better results than things like Tesseract. You don't have to use a huge model like the 90B; you can often get very good results with a much smaller model.
@ Does it help with extracting content from complex layouts? At a semantic level.
After downloading tons of agents, I found out the hard way that if you are using ChatGPT or Claude, agents are 100% useless and will give you worse results in real-life applications; it's too early to adopt them.
I think an agent should actually be an LLM in a very specific field. For example, an agent that only knows how to do math, or only codes in JS, beats the o1 model by a margin, and doesn't know anything else.
OCR is not simple, and quality can be really bad. It also doesn't preserve original layout since it really just looks at characters in isolation.
@ You can get way above 90% accuracy from models with fewer than 25 million parameters. As for extracting from arbitrary layouts, that remains hard, hence my follow-up question.
How do you get rid of hallucination, especially in this kind of project? Is JSON a good output format?
I would say that JSON output alone won't help on its own. It is only helpful if you know the structure of the data to be extracted (say, every document has a title, a table with certain columns, etc.). Then specifying a JSON schema (the expected output format) should help, as in the sketch below.
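A minimal sketch of what "specify the schema" can look like for a receipt-style document; the field names are illustrative assumptions, not something from the video:

```python
import json

# Illustrative schema for a receipt -- adapt the fields to your own documents.
receipt_schema = {
    "type": "object",
    "properties": {
        "store_name": {"type": "string"},
        "date": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "price": {"type": "number"},
                },
                "required": ["description", "price"],
            },
        },
        "total": {"type": "number"},
    },
    "required": ["store_name", "line_items", "total"],
}

# Embed the schema in the prompt so the model knows exactly what shape to return.
prompt = (
    "Extract this receipt as JSON that matches the following schema. "
    "Return only the JSON, no commentary.\n"
    + json.dumps(receipt_schema, indent=2)
)
```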
A few things that have helped me: use a temperature at or near zero, and if you have the potential for empty data, prompt the model to leave it out rather than return empty values. Roughly like the sketch below.
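A sketch of that, assuming the ollama Python client and a locally pulled llama3.2-vision model; the file name is a placeholder:

```python
import ollama  # assumes `pip install ollama` and `ollama pull llama3.2-vision`

prompt = (
    "Extract the store name, date, line items and total from this receipt as JSON. "
    "If a field is missing or unreadable, leave it out entirely -- do not return empty or guessed values."
)

resp = ollama.chat(
    model="llama3.2-vision",
    messages=[{"role": "user", "content": prompt, "images": ["receipt.png"]}],
    options={"temperature": 0},  # greedy decoding: fewer creative liberties with the numbers
)
print(resp["message"]["content"])
```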
@coredog64 Leaving it out is definitely a good strategy; it even saves tokens.
Another trick, if latency isn't an issue, is to sample multiple times and use an LLM as a judge to look for what is consistent and what just gets hallucinated occasionally; a rough sketch is below.
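A minimal, model-agnostic sketch of that idea; `call_model` stands in for whatever function sends your image and prompt to the vision model and returns its text:

```python
from collections import Counter
from typing import Callable

def extract_by_vote(call_model: Callable[[], str], n: int = 5) -> tuple[str, list[str]]:
    """Run the same extraction n times. A value that shows up in most samples is
    probably really on the page; one that appears only once is probably a hallucination."""
    samples = [call_model() for _ in range(n)]
    best, count = Counter(samples).most_common(1)[0]
    if count <= n // 2:
        # No clear majority -- this is where you would hand `samples` to a second
        # LLM-as-judge call and ask it which values are consistent across runs.
        return "", samples
    return best, samples
```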
Did anyone try to integrate it with n8n?
👀👀
If the model can return coordinates, that would be great, and there would be no point in using the OCR services from Microsoft and Google anymore.
This seems objectively bad at the job.
The Walmart receipt just flat out ignored the whole central column of numbers.
Reordering sections of text...
Not seeing its usefulness at this level of error and garbling.
What about a mixed Tesseract + LLM pipeline to correct it?
Yes, this is why I talked about the Regions of Interest concept, but I personally wouldn't use Tesseract for this. Also, fine-tuning the model for the kind of OCR that you want will help it get much better as well.
@samwitteveenai Is there a way to generate synthetic scans at scale based on a certain structure? I think you mentioned using a tool to create the scan.
Qwen-VL is better than Llama 3.2 at OCR.
Besides the Hugging Face leaderboards, do you have a live production example proving this?
What a weird wrapper project. Just use Llama vision and say:
`Convert the provided image into Markdown format. Ensure that all content from the page is included, such as headers, footers, subtexts, images (with alt text if possible), tables, and any other elements.
Requirements:
- Output Only Markdown: Return solely the Markdown content without any additional explanations or comments.
- No Delimiters: Do not use code fences or delimiters like \`\`\`markdown.
- Complete Content: Do not omit any part of the page, including headers, footers, and subtext.
`;
because literally that's all this project is doing.
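For anyone who wants to try exactly that locally, a rough sketch, assuming the ollama Python client and a pulled llama3.2-vision model, with a placeholder file name:

```python
import ollama  # assumes `pip install ollama` and `ollama pull llama3.2-vision`

markdown_prompt = (
    "Convert the provided image into Markdown format. Ensure that all content from the page "
    "is included, such as headers, footers, subtexts, images (with alt text if possible), "
    "tables, and any other elements. Output only the Markdown, with no code fences, "
    "delimiters, or extra commentary."
)

resp = ollama.chat(
    model="llama3.2-vision",
    messages=[{"role": "user", "content": markdown_prompt, "images": ["page.png"]}],
)
print(resp["message"]["content"])
```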
PS: you need 64GB of RAM to run this version... not a very good script.
@orangehatmusic225 Use Google Colab.
There is Tika for that. Stop presenting AI as the answer to already-solved problems.
You do realize Tika uses deep learning… which is what LLMs are fundamentally built on.