LlamaOCR - Building your Own Private OCR System

  • Published 21 Nov 2024

COMMENTS • 41

  • @Charles-Darwin
    @Charles-Darwin 2 days ago +1

    Vision models be mysterious wizardry. They make me the most excited out of all bc I firmly believe a future conscious 'model' could be iterated from vision models (not new, but not mentioned enough i think). If there were a way to keep the vision model exclusively in virtual space... a whole wealth of experimentation could open up with visualizing things, it might even turn hallucinations into useful features.

  • @bzmrgonz
    @bzmrgonz 1 day ago

    I'm gonna suggest this video to PAPERLESS-NGX, I think this needs to be a MUST feature on that project.

  • @WhyitHappens-911
    @WhyitHappens-911 2 days ago +2

    Nice! Any difference from the Docling or LlamaParse solutions?

  • @bzmrgonz
    @bzmrgonz 1 day ago

    Question @Sam: would designing forms, documents, etc. to assist OCR help? For example, delimiting label:data with a colon (:), assuming colons have no other reason to exist in the text. In your opinion, what works best? Delimiters, color contrast?

  • @victorkarlsson5183
    @victorkarlsson5183 3 days ago +3

    I'd be super interested in knowing the process of training on object detection / region of interest. Anyone have pointers where I can read up on this?

    • @KEKW-lc4xi
      @KEKW-lc4xi 1 day ago +1

      I've done it before using YOLOv7 (don't use v8, that requires you to use some cringe website).
      And then for labeling images I used CVAT. CVAT will let you label and store your images and then export to YOLO format, and then it's a matter of piping that through to the YOLOv7 framework for training.
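For readers unfamiliar with the YOLO label format mentioned above: each object is one text line holding a class index followed by the box centre and size, all normalised by the image dimensions. The sketch below shows that conversion for a pixel-space box. The function name `to_yolo_line` and the sample numbers are made up for illustration; CVAT's exporter does this for you.

```python
def to_yolo_line(class_id, box, img_w, img_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to one YOLO label line.

    Hypothetical helper for illustration only.
    """
    x_min, y_min, x_max, y_max = box
    # YOLO stores the box centre and size as fractions of the image size.
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Example: a 200x200 px region of interest in a 1000x800 px scan.
print(to_yolo_line(0, (100, 200, 300, 400), 1000, 800))
```

One such line per labelled region goes into a `.txt` file that shares its base name with the image.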

  • @TheRealChrisVeal
    @TheRealChrisVeal 2 days ago +1

    exciting!

  • @SDAravind
    @SDAravind 2 days ago

    Can we get Bounding boxes using this model?

  • @ifeanyinnaemego
    @ifeanyinnaemego 1 day ago

    Can it capture handwritten text perfectly?

  • @itsbhardwaj1677
    @itsbhardwaj1677 3 days ago

    When are you integrating it with agents?

  • @murattosundan
    @murattosundan 14 hours ago

    Can it recognize license plates in non-Latin alphabets?

  • @Piotr_Sikora
    @Piotr_Sikora 2 days ago +4

    Doing simple OCR via an LLM is like shooting a fly with a bazooka.

    • @_PataNahi
      @_PataNahi 2 days ago

      I think they have the capability to understand the context of the information in the input. If there are any mistakes, like simple letter mistakes, there could maybe be a feature to correct those automatically. There could also be a slider to adjust from most original to most sensible. Without any of these, it's just like any other model, I guess.

    • @IoT_
      @IoT_ 1 day ago

      Actually, it can be even worse than specialized models like YOLO, Tesseract, Paddle, etc.
      For instance, if you have custom ASCII symbols, no LLM can provide as good a recognition pattern as a fine-tuned OCR library can.

  • @darkreader01
    @darkreader01 2 days ago +2

    does it work with handwritten text?

    • @gurupartapkhalsa6565
      @gurupartapkhalsa6565 1 day ago

      No, but you can train your own to work on your own handwriting specifically, without too much difficulty.

  • @el_arte
    @el_arte 3 days ago

    What are the benefits of using a giant LLM for something as simple as OCR?

    • @samwitteveenai
      @samwitteveenai  3 days ago

      They can get better results than things like Tesseract. You don't have to use a huge model like the 90B; you can often get very good results with a much smaller model.

    • @el_arte
      @el_arte 3 days ago

      @ Does it help with extracting content from complex layouts? At a semantic level.

    • @hqcart1
      @hqcart1 2 days ago

      After downloading tons of agents, I found out the hard way: if you are using ChatGPT or Claude, agents are 100% useless and will give you worse results in real-life applications. It's too early to adopt them.
      I think agents should actually be an LLM, but in a very specific field. For example, an agent that just knows how to do math, or codes just in JS, beats the o1 model by a margin, and doesn't know anything else.

    • @daarrrkko
      @daarrrkko 1 day ago

      OCR is not simple, and quality can be really bad. It also doesn't preserve original layout since it really just looks at characters in isolation.

    • @el_arte
      @el_arte 1 day ago

      @ You can get way above 90% accuracy from models with fewer than 25 million parameters. As for extracting from arbitrary layouts, that remains hard, hence my follow-up question.

  •  2 days ago

    How do you get rid of hallucination, especially in this kind of project? Is JSON a good output format?

    • @ivan007230
      @ivan007230 2 days ago +1

      I would say that JSON output on its own won’t help. It is only helpful if you know the structure of the data that is to be extracted (say, every document has a title, a table with certain columns, etc.). Then specifying a JSON schema (the expected output format) should help.
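A rough sketch of what "knowing the structure" buys you: once the expected fields are fixed, the model's output can be checked mechanically before anything downstream trusts it. The field names in `EXPECTED_FIELDS` and the helper `check_extraction` are hypothetical, and this uses only the standard library instead of a full JSON Schema validator.

```python
import json

# Hypothetical expected structure: field name -> accepted Python type(s).
EXPECTED_FIELDS = {"title": str, "total": (int, float), "line_items": list}


def check_extraction(raw_output, expected=EXPECTED_FIELDS):
    """Return a list of problems in the model's JSON output (empty list if none)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for name, expected_type in expected.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    # Fields the schema never asked for are a common sign of hallucination.
    for name in data:
        if name not in expected:
            problems.append(f"unexpected field: {name}")
    return problems


print(check_extraction('{"title": "Receipt", "total": "12.5"}'))
```

A non-empty result is a signal to retry the extraction or flag the document for review. It says nothing about whether the values themselves were read correctly.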

    • @coredog64
      @coredog64 2 days ago +2

      A few things that have helped me: Use a temperature at/near zero. If you have the potential for empty data, prompt to leave it out rather than give empty values.

    • @sandorkonya
      @sandorkonya 2 days ago

      @@coredog64 leaving out is def. a good strategy. it even saves tokens.

    • @samwitteveenai
      @samwitteveenai  2 days ago +2

      Another trick, if latency isn't an issue, is to sample multiple times and use an LLM as a judge to look for what is consistent and what just gets hallucinated occasionally.
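The multi-sample idea can be approximated without a judge model. The sketch below swaps the LLM judge for a plain majority vote per field, which is a simplification and not what the comment describes exactly. It assumes each sample has already been parsed into a flat dict of hashable values; the name `consistent_fields` and the 0.6 threshold are illustrative.

```python
from collections import Counter


def consistent_fields(samples, min_agreement=0.6):
    """Keep a field only if one value shows up in at least min_agreement of samples.

    Assumes flat dicts with hashable values (strings, numbers).
    """
    kept = {}
    all_keys = {key for sample in samples for key in sample}
    for key in sorted(all_keys):
        # Count how often each value was extracted for this field.
        counts = Counter(sample[key] for sample in samples if key in sample)
        value, votes = counts.most_common(1)[0]
        # Divide by the total number of samples so that a field which
        # appears in only one run cannot win by default.
        if votes / len(samples) >= min_agreement:
            kept[key] = value
    return kept


# Three hypothetical extraction runs over the same receipt.
runs = [
    {"store": "Walmart", "total": "12.50"},
    {"store": "Walmart", "total": "12.50", "cashier": "Bob"},
    {"store": "Walmart", "total": "12.80"},
]
print(consistent_fields(runs))
```

Fields that only appear occasionally (here `cashier`) are dropped, which pairs well with prompting the model to omit empty values instead of inventing them.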

  • @staticalmo
    @staticalmo 2 days ago

    did someone try to integrate it in n8n?

  • @ShresthShukla-h9n
    @ShresthShukla-h9n 3 days ago

    👀👀

  • @wangbei9
    @wangbei9 18 hours ago

    If the model can return the coordinates, then it will be great, and there will be no point in using the OCR services from Microsoft and Google anymore.

  • @alogghe
    @alogghe 2 days ago

    This seems objectively bad at the job.
    The Walmart receipt just flat out ignored the whole central column of numbers.
    Reordering sections of text...
    Not seeing its usefulness at this level of error and garbling things.
    What about a mixed Tesseract + LLM approach to correct it?

    • @samwitteveenai
      @samwitteveenai  2 days ago

      Yes, this is why I talked about the Regions of Interest concept, but I personally wouldn't use Tesseract for this. Also, fine-tuning the model for the kind of OCR that you want will help it get much better as well.

    • @daarrrkko
      @daarrrkko 1 day ago

      @@samwitteveenai is there a way to generate synthetic scans at scale based on a certain structure? I think you mentioned using a tool to create the scan.

  • @viky2002
    @viky2002 2 days ago +1

    Qwen VL is better than Llama 3.2 at OCR.

    • @choiswimmer
      @choiswimmer 2 days ago +1

      Besides the Hugging Face leaderboards, do you have a live production example proving this?

  • @orangehatmusic225
    @orangehatmusic225 2 days ago

    What a weird wrapper project. Just use llama vision and say :
    `Convert the provided image into Markdown format. Ensure that all content from the page is included, such as headers, footers, subtexts, images (with alt text if possible), tables, and any other elements.
    Requirements:
    - Output Only Markdown: Return solely the Markdown content without any additional explanations or comments.
    - No Delimiters: Do not use code fences or delimiters like \`\`\`markdown.
    - Complete Content: Do not omit any part of the page, including headers, footers, and subtext.
    `;
    cause literally that's all this project is doing.

    • @orangehatmusic225
      @orangehatmusic225 2 days ago

      PS: you need 64 GB of RAM to run this version... not a very good script.

    • @suryakantbrewr
      @suryakantbrewr 2 days ago

      @@orangehatmusic225 use Google Colab

  • @greendsnow
    @greendsnow 2 days ago

    There is Tika for that. Stop presenting AI as the answer to already-solved problems.

    • @erniea5843
      @erniea5843 2 days ago +2

      You do realize Tika uses deep learning… which is what fundamentally makes LLMs.