Extract text from images with Tesseract OCR on Windows

  • Published 4 Dec 2024

COMMENTS • 112

  • @josephc3080
    @josephc3080 4 years ago +11

    This is a really good tutorial. I appreciate the care you took in going step by step, especially through altering the path.

  • @GNS216
    @GNS216 6 years ago +9

    This is the most helpful tutorial on Tesseract that I've found. Thank you.

  • @hkim644
    @hkim644 6 years ago +5

    OMG. I was watching your video to install Tesseract. Meanwhile, I was amazed that you can read Korean. I thought you chose a random non-English language to prove Tesseract works with different languages. Amazed, as a Korean.
    I am trying to learn how OCR works because I want to make an app that requires OCR. But since I have no coding experience or anything even close to digital languages, I am having some difficulties. At least I was able to use Tesseract after watching this video. Thank you so much!

  • @TheJoinckim
    @TheJoinckim 6 years ago +2

    Very, very good tutorial on Tesseract for Koreans, with clear pronunciation. Thank you.

  • 4 years ago

    Thanks for this tutorial. I have had trouble converting text in a Mayan language here in Guatemala; I followed your steps and voilà!
    The next step for me is to figure out how to train a recognition set for our local Mayan alphabets.
    Thanks a lot.

    • @iancardenas-spanishbutcomp4074
      @iancardenas-spanishbutcomp4074 3 years ago

      Did you get to train it for a different alphabet? Can you help me? I'm trying to get OCR working for IPA character recognition.

  • @emmanuelvelasco8753
    @emmanuelvelasco8753 6 years ago +1

    keep making these videos man! interesting content

  • @TzKet4m
    @TzKet4m 5 years ago

    Your voice makes me happy to browse youtube, so clear fuark

  • @seung-wanson9447
    @seung-wanson9447 6 years ago +1

    FYI, if you have never added anything to PATH beyond the default, the edit selection box will not pop up.
    So, unlike in your video, I had to add the entry manually, separating the new one with a ";" (semicolon).
    Afterwards, when I click the Edit button, I get the same pop-up edit box.
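
A minimal sketch of the manual edit described in this comment, written in Python for illustration. On Windows, PATH is a single string of directories separated by semicolons, so adding an entry by hand amounts to appending the new directory after a ";". The helper name `append_to_path` and the install directory below are assumptions, not something shown in the video; use whatever directory Tesseract was actually installed to.

```python
import os


def append_to_path(path_value: str, new_dir: str, sep: str = ";") -> str:
    """Return path_value with new_dir appended, unless it is already listed."""
    entries = [entry for entry in path_value.split(sep) if entry]

    def normalize(directory: str) -> str:
        # Windows paths are case-insensitive; also ignore a trailing backslash.
        return directory.rstrip("\\").lower()

    if normalize(new_dir) not in {normalize(entry) for entry in entries}:
        entries.append(new_dir)
    return sep.join(entries)


if __name__ == "__main__":
    # Hypothetical install directory, used only as an example.
    tesseract_dir = r"C:\Program Files\Tesseract-OCR"
    print(append_to_path(os.environ.get("PATH", ""), tesseract_dir))
```

The function only computes the new string; it does not change the system setting, which still has to be saved through the Environment Variables dialog.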

  • @deepak223098
    @deepak223098 4 years ago +2

    Can you tell us how to train on our own dataset?

  • @R.t.a.s
    @R.t.a.s 4 years ago +1

    Thanks a lot for this, but can I use this for manuscripts as well? And if so, please tell me how :)

  • @philglanville3974
    @philglanville3974 3 years ago

    Hi, a very good tutorial, but regarding the batch folder/file processing that you and another commenter mentioned, I cannot find any uploaded tutorial video.

  • @opheliafromlcf9509
    @opheliafromlcf9509 3 years ago +1

    How did you turn each page of the pdf into pngs? Thank you for this high-quality video.

    • @opheliafromlcf9509
      @opheliafromlcf9509 3 years ago

      Alright, alright, I got that to work. Now I am wondering how you write the code to make it run all the pngs at once instead of having to do each one line by line, one at a time.

    • @harmindersinghnijjar
      @harmindersinghnijjar 3 years ago

      Hey there, you can use Snip & Sketch on Windows. I'm making a guide on just that currently.

  • @rezkiy95
    @rezkiy95 3 years ago

    Thanks for the no-BS tutorial!

  • @allirashna2072
    @allirashna2072 4 years ago

    I'm kind of skeptical of allowing changes to hardware. Is it completely safe?

  • @pixelvader2451
    @pixelvader2451 5 years ago

    So, should I do it one by one? I have complete books, is there no way to do this for several images?
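
One possible way to run Tesseract over every image in a folder, sketched in Python rather than the PowerShell shown in the video. It assumes `tesseract` is already on PATH and uses the same `tesseract <image> <output base> -l <lang>` form that appears elsewhere in these comments. The function names and the list of image extensions are hypothetical choices for illustration.

```python
import subprocess
from pathlib import Path

# Extensions treated as images; an assumption, extend as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_commands(folder, lang="eng"):
    """Build one tesseract command per image in folder, sorted by file name."""
    commands = []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES:
            # Tesseract adds ".txt" to the output base on its own.
            output_base = image.with_suffix("")
            commands.append(
                ["tesseract", str(image), str(output_base), "-l", lang]
            )
    return commands


def ocr_folder(folder, lang="eng"):
    """Run tesseract once per image; raises if any run fails."""
    for command in build_commands(folder, lang):
        subprocess.run(command, check=True)
```

Calling `ocr_folder(r"C:\scans\book", lang="eng")` (a made-up path) would leave one `.txt` file next to each image.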

  • @itsdannyftw
    @itsdannyftw 5 years ago

    What mic are you using? Great video, thanks!

  • @ahmedfarouk8197
    @ahmedfarouk8197 6 years ago

    You can convert your PDF to a single TIFF file instead of converting it to several PNG files.

  • @saikushalmandala6438
    @saikushalmandala6438 6 years ago +1

    That's a good video,
    but how do I preprocess the input image before passing it to Tesseract?
    Can you please help with this ASAP?

  • @jennyf.2124
    @jennyf.2124 4 years ago

    Have you maybe tried out whether it also works with handwritten text?

    • @DFIRScience
      @DFIRScience 4 years ago +1

      Hand-written text (block letters) will work, but not be very accurate. Ideally, Tesseract should be re-trained on whatever font you are focused on.

    • @jennyf.2124
      @jennyf.2124 4 years ago +1

      @DFIRScience I see, thank you very much!

  • @prateekgupta2916
    @prateekgupta2916 4 years ago

    Hi sir,
    Much-needed video.
    Can you tell me how to train Tesseract to identify a specific font?

  • @kevinsanti4091
    @kevinsanti4091 6 years ago

    A video with tips on how to train Tesseract would be great! Anyway, thanks a lot for this video. It was helpful for my first steps and really appreciated!
    I'm wondering whether someone has already built the following, as something closer to an end-user application than a tool for programmers (or, failing that, how to do it): 1) an overlay of the pictured document and the OCR result, so that the original document stays displayed as it is but becomes "highlightable"; or 2) a parallel OCR document that keeps the letter positioning and page layout of the original picture and, where recognition is difficult or confidence is low (for example on graphs, pictures, and drawings), keeps the original cropped image instead.

  • @beastmonsterthing3
    @beastmonsterthing3 5 years ago

    thanks so much. easy to understand and so helpful. you're a legend

  • @fabarchimilku4073
    @fabarchimilku4073 3 years ago

    Hi, how do I find the link to the batch folder converting thingy?

  • @mrmikearmstrong
    @mrmikearmstrong 6 years ago

    Nice tutorial, makes everything nice and simple to handle - On another note, I want to call the tesseract.exe file from a .NET application that has just taken an image of some text, is there a way to get the output of the OCR as a string in the console? Or would I have to wait until the character recognition has completed, then go and read that text file at a later time?

    • @DFIRScience
      @DFIRScience 6 years ago

      Yeah, I'm pretty sure you have to read the file after. I'll check if you can output to pipe.
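
On the question of getting the text as a string: the Tesseract command line accepts the word `stdout` in place of the output base, in which case the recognized text is printed instead of written to a file. A minimal sketch in Python, assuming `tesseract` is on PATH; the function names are hypothetical. The same command line should be usable from any program that can capture a child process's standard output, though this sketch has not been checked against the 3.05 build linked elsewhere in these comments.

```python
import subprocess


def build_stdout_command(image_path, lang="eng"):
    """Command that makes tesseract print the text instead of writing a file."""
    # "stdout" takes the place of the usual output base name.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_to_string(image_path, lang="eng"):
    """Run tesseract and return the recognized text as a Python string."""
    completed = subprocess.run(
        build_stdout_command(image_path, lang),
        capture_output=True,
        text=True,
        encoding="utf-8",  # Tesseract writes its text output as UTF-8.
        check=True,
    )
    return completed.stdout
```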

  • @punnarajeev867
    @punnarajeev867 4 years ago

    Can we convert a CAPTCHA image into text?

  • @hyperventilate7318
    @hyperventilate7318 3 years ago

    I have photographs of people with the date printed below, can this solution extract the date? I need to do this for 1000s of photos. (batch)

  • @jarongaus
    @jarongaus 3 years ago

    Your instructions are phenomenal. You are amazing at explaining computer commands and tricks. The only problem is that this program sucks and it is a nightmare to use.
    It's not your fault. Thanks so much for teaching so many tricks.

  • @aradsoltani4646
    @aradsoltani4646 3 years ago

    Thank you, that was very helpful :-D

  • @knowsmynametoonobody9191
    @knowsmynametoonobody9191 5 years ago +1

    Nice video, it's what I was looking for. So, thank you very much! 😀

  • @danielveraec
    @danielveraec 5 years ago

    Thanks for the information.
    How can I install languages in addition to the ones you demonstrated? Maybe you already said it, but my English is not very good and I didn't catch it.

  • @cohas3424
    @cohas3424 6 years ago +1

    This is the video I was looking for. Thank you. ^^

  • @epochseven4197
    @epochseven4197 2 years ago

    Interestingly enough, the default install path for the Windows x64 version is:
    C:\Users\username\AppData\Local\Programs\Tesseract-OCR

  • @luisguevara9292
    @luisguevara9292 5 years ago

    It helped me a lot. Thank you very much

  • @a2zGodz
    @a2zGodz 6 years ago +1

    How do you train Tesseract? Can you point me in the right direction with something I can use?

    • @DFIRScience
      @DFIRScience 6 years ago

      I'll try to do a video about that shortly. Until then you can check the documentation here: github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract

    • @prateekgupta2916
      @prateekgupta2916 4 years ago +1

      Can you please help me with training Tesseract, for the sake of helping the public? I will be very thankful to you.

  • @simunyugashakti5373
    @simunyugashakti5373 6 years ago

    Hi. Please guide me on how I can retrieve the coordinate positions of the words that I retrieved from the image.
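
A hedged sketch of one way to get word coordinates: recent Tesseract builds ship a `tsv` config, so a command such as `tesseract page.png stdout tsv` prints one tab-separated row per layout element, with `left`, `top`, `width` and `height` columns and a `level` of 5 for individual words. The parser below assumes that column layout; the function name `parse_word_boxes` and the file name `page.png` are made up for illustration, and older builds may not include the `tsv` config.

```python
import csv
import io


def parse_word_boxes(tsv_text):
    """Extract word-level bounding boxes from Tesseract TSV output."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    boxes = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows describe single words; skip empty entries.
        if row.get("level") == "5" and text:
            boxes.append(
                {
                    "text": text,
                    "left": int(row["left"]),
                    "top": int(row["top"]),
                    "width": int(row["width"]),
                    "height": int(row["height"]),
                }
            )
    return boxes
```

Coordinates are in pixels, measured from the top-left corner of the input image.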

  • @KhalilYasser
    @KhalilYasser 5 years ago

    Thanks a lot. How can I add a new language after the installation?

  • @davidpimental6704
    @davidpimental6704 5 years ago

    I need help with mixed language pdfs - English and Ancient Greek. Also, I would like to target positions within the image taken from a pdf file.

  • @Barklo69
    @Barklo69 3 years ago +1

    What happened to the tutorial on making your own training data? :(

  • @Bismillah_bismillah_bb
    @Bismillah_bismillah_bb 6 years ago

    I usually play trivia games and I want to use it there. Can you please try to make a video on that?

  • @GermanPowershell
    @GermanPowershell 5 years ago

    Basically a nice video. But why do you open and use PowerShell ISE, and then not use anything from PowerShell?

  • @mrcb1698
    @mrcb1698 6 years ago

    Not sure if you will answer this, but I'd love it if you could help me with the PowerShell/batch code you spoke about at the end to make it work on a whole file. I'm currently trying, but no success yet. Good video, by the way!

    • @DFIRScience
      @DFIRScience 6 years ago +1

      Hey there. Sure, I can help with that. I'll post back after recording.

    • @iancardenas-spanishbutcomp4074
      @iancardenas-spanishbutcomp4074 3 years ago

      @DFIRScience Did you make a tutorial for training the OCR on another alphabet? I'm trying to get it to work with IPA.

  • @etil2jz
    @etil2jz 6 years ago

    Really good tutorial, clear.

  • @thesocialtalk1853
    @thesocialtalk1853 1 year ago

    Hello, I want to use another language in Tesseract.

  • @jennilthiyam1261
    @jennilthiyam1261 6 years ago

    How do I train a new language that is not in the language list?

  • @sunnyraven4563
    @sunnyraven4563 5 years ago

    Can you please do the batch file video?

  • @yllamaecataylo9282
    @yllamaecataylo9282 6 years ago

    Can I actually use this to sort files into different folders? By the way, I'm using PHP, so I don't know if it will work.

  • @sayankumardey6826
    @sayankumardey6826 3 years ago

    Hi, please share this PDF file for download.

  • @adoniskomplex91
    @adoniskomplex91 5 years ago

    How can I increase the accuracy?

    • @DFIRScience
      @DFIRScience 5 years ago +1

      You will need to retrain the model based on your specific problem. I'm working on a video for training tesseract.

  • @rodrigogutierrez7775
    @rodrigogutierrez7775 6 years ago

    Can you do this with a CAPTCHA image?

  • @finestanime5878
    @finestanime5878 6 years ago

    Thanks bro, it is really helpful.

    • @DFIRScience
      @DFIRScience 6 years ago +1

      Thanks a lot! I appreciate it.

  • @danperryy
    @danperryy 4 years ago

    What a great job.

  • @sebastienjurkowski
    @sebastienjurkowski 6 years ago

    Hi, we are looking for someone knowledgeable about OCR, specifically for text from a video feed. The text most often appears distorted, non-horizontal, and sometimes wrapped or partially wrapped. The text to be read is strictly a short sequence of numbers and/or letters. There can be multiple variations of those sequences in the same image. Contact me if that rings a bell :)

  • @dipsikhaphukan5563
    @dipsikhaphukan5563 4 years ago

    I want to do this same thing using Java. Please help!

  • @atharvagupta9355
    @atharvagupta9355 4 years ago

    hey, does anyone know how to scan multiple pictures in one go and measure the amount of time taken for the same?
    Thanks for the great video
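
For the timing part of this question, a small sketch using Python's standard `time.perf_counter`. The helper name `timed` is hypothetical; it wraps any function call, so it could be placed around whatever loop or script scans the pictures.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # Stand-in workload; replace with the call that runs OCR on the pictures.
    total, seconds = timed(sum, range(1_000_000))
    print(f"finished in {seconds:.3f} s")
```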

  • @gabrielbessa2575
    @gabrielbessa2575 5 years ago

    Great tutorial! thx

  • @venkateshdhande6318
    @venkateshdhande6318 6 years ago +2

    First, how do you convert a PDF to images?

  • @selvas7502
    @selvas7502 4 years ago

    How do I convert multiple images from a folder without giving the image names one by one?
    Is there any command to do it?

    • @harmindersinghnijjar
      @harmindersinghnijjar 3 years ago

      Hey there, you can use Snip & Sketch on Windows. I'm making a guide on just that currently.

  • @mattchew2203
    @mattchew2203 6 years ago

    How did you manage to get such fast results? It is taking me at least 15 seconds to OCR a full page...

    • @DFIRScience
      @DFIRScience 6 years ago

      The quality of your image will make a difference. Try around 300dpi. That will give you good recognition but should reduce processing time.

  • @jennilthiyam980
    @jennilthiyam980 6 years ago

    lstm_recognizer_->DeSerialize(&fp):Error:Assert failed:in file ../../../../ccmain/tessedit.cpp, line 193
    I got the above error when trying to run
    tesseract.exe 3.jpeg ..\out1.txt -l ben
    Please help me out.

    • @gabrielbessa2575
      @gabrielbessa2575 5 years ago

      Try completely uninstalling and downloading an updated version :v
      Hope it helps.

  • @sangjunlee391
    @sangjunlee391 5 years ago +1

    Thank you, big brother.

  • @tkinter3160
    @tkinter3160 5 years ago

    Sir, can OCR extract text from video?

    • @gabrielbessa2575
      @gabrielbessa2575 5 years ago +1

      unfortunately no, but if you extract the frames and turn them into individual pictures, you can then execute the program and get the .txt files :3

  • @aokaf
    @aokaf 6 years ago

    Please help me find out how I can use it on a Mac.
    Pleeeeease.

  • @tobiaskarl4939
    @tobiaskarl4939 4 years ago

    Also, one has to set TESSDATA_PREFIX to "installdir\tessdata".

  • @mahmoodal-imam2892
    @mahmoodal-imam2892 6 years ago

    Thanks a lot, brother

  • @mydulislam4218
    @mydulislam4218 6 years ago

    Thank you very much for your nice tutorial. But I would like your help with how to use Tesseract OCR without PowerShell. Is there an easier way to take a PNG or other image and run Tesseract on it without any complexity? After installing it and the language data, how can I get the text from an image in a very easy way?

  • @randomvideosshideos8508
    @randomvideosshideos8508 5 years ago

    but this is not detecting text from product images

    • @DFIRScience
      @DFIRScience 5 years ago

      Yes, there are a lot of situations where the current training will not work. You may need to create a training set based on the problems you are working on, and retrain tesseract with your problem set. I'm working on a video to make custom training sets for tesseract.

  • @cezarmoniz6579
    @cezarmoniz6579 6 years ago

    Congratulations on the video. I'm from Rio de Janeiro - Brazil. Great accent in English! Can we work with tesseract with PHP?
    By the way what's your name?

  • @19perception83
    @19perception83 6 years ago

    Excellent video, however, my output was dreadful.
    English, clear to see and it rendered about 90% fine, however, there are wingding style artefacts all over the place. A bit pants really.
    Can also render as different file formats with some more easily readable formatting (.odt) etc etc
    Will look for an alternative to compare against

    • @DFIRScience
      @DFIRScience 6 years ago

      If you'll be using the same types of input, you may want to train a new classifier on your specific dataset. For a random image 90% is not bad. I would make a filter script to clean the text and remove wingdings, etc.
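
A rough sketch of the kind of filter script suggested in this reply, assuming the documents are English and that anything outside ASCII letters, digits, whitespace, and common punctuation is an OCR artefact. That assumption would strip legitimate characters from other languages, so the allowed set needs adjusting for non-English text. The function name `clean_ocr_text` is hypothetical.

```python
import re

# Characters to remove: everything except ASCII letters, digits, whitespace,
# and a small set of punctuation. This allow-list is an assumption.
_UNWANTED = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()\-/%&@#$]")


def clean_ocr_text(text):
    """Drop stray symbols from OCR output and tidy the leftover spacing."""
    cleaned = _UNWANTED.sub("", text)
    # Collapse runs of spaces or tabs left behind by removed symbols.
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)
    return cleaned.strip()
```

For example, `clean_ocr_text("Hello ✦ world ❖!")` returns `"Hello world !"`; line breaks are preserved.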

  • @AliMurtaza-hs2ct
    @AliMurtaza-hs2ct 6 years ago

    "Warning. Invalid resolution 0 dpi. Using 70 instead." and the output text is blank. Please help.

    • @DFIRScience
      @DFIRScience 6 years ago

      What is your input file? JPEG? PNG?

    • @AliMurtaza-hs2ct
      @AliMurtaza-hs2ct 6 years ago

      PNG

    • @DFIRScience
      @DFIRScience 6 years ago

      You might try the solution here: stackoverflow.com/questions/42990139/tesseract-ocr-how-do-i-improve-result

    • @AliMurtaza-hs2ct
      @AliMurtaza-hs2ct 6 years ago +1

      Thanks. It worked.

  • @xlnyc77
    @xlnyc77 6 years ago

    Using PowerShell? So it's not really for Windows? This is DOS.
    Did you ever make a PowerShell script?

  • @rachelludmir7169
    @rachelludmir7169 6 years ago

    Great video, very clear.
    Do you have a video on how to train Tesseract?
    It would be very useful for me, please.

  • @jaiksah
    @jaiksah 6 years ago

    The moment I type tesseract.exe --help, it opens the exe for installation. I don't know why.

    • @DFIRScience
      @DFIRScience 6 years ago +1

      Try uninstalling, and downloading the installer from here: digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.01.exe

  • @nikhilgjog
    @nikhilgjog 5 years ago

    Good info, but it would be much better if the author could make a condensed video. He repeated the same info or provided unnecessary info in multiple places.

  • @adoniskomplex91
    @adoniskomplex91 5 years ago +1

    I've used pdftoppm.exe from poppler. Works very well.
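
A minimal sketch of calling `pdftoppm` from Python to turn each PDF page into a PNG, assuming the Poppler tools are installed and on PATH. The `-png` and `-r` (resolution in DPI) options are standard `pdftoppm` flags; the 300 DPI default follows the resolution suggested elsewhere in this thread, and the function names and file names are hypothetical.

```python
import subprocess


def build_pdftoppm_command(pdf_path, output_prefix, dpi=300):
    """Command that renders every page of pdf_path to a PNG file."""
    # -png selects PNG output; -r sets the rendering resolution in DPI.
    return ["pdftoppm", "-png", "-r", str(dpi), pdf_path, output_prefix]


def pdf_to_pngs(pdf_path, output_prefix, dpi=300):
    """Run pdftoppm; output files are named from output_prefix plus a page number."""
    subprocess.run(build_pdftoppm_command(pdf_path, output_prefix, dpi), check=True)
```

Calling `pdf_to_pngs("book.pdf", "page")` would write numbered files such as `page-1.png`; the exact zero padding depends on the page count.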

  • @hitachimonsta9553
    @hitachimonsta9553 5 years ago

    Thanks!

  • @christianrazvan
    @christianrazvan 2 years ago

    It doesn't appear that tesseract is any good

    • @DFIRScience
      @DFIRScience 2 years ago +1

      Default models are so-so. You'll definitely need to train on your specific problem. I've used default models for general ocr where high error wasn't a problem.

  • @bj16162
    @bj16162 7 months ago

    Btw, the default Windows OCR is better than Tesseract in my language.

  • @zardashtshwany3784
    @zardashtshwany3784 4 years ago

    tnx a lot

  • @송승협-b9g
    @송승협-b9g 4 years ago

    Korean?

  • @massivefins2597
    @massivefins2597 5 years ago

    Tesseract is crud... Use Tabula and PDFs... You can select your tables also...

  • @tasmia5243
    @tasmia5243 3 years ago

    So it is easy to use for everyone, and I am the only one who is freaking out?!

  • @silviotadeu607
    @silviotadeu607 6 years ago

    Wonderful Dad!!..lol

  • @mauroamorso
    @mauroamorso 3 years ago

    tesseract 0001.jpg -l eng

  • @fabulusinvictus2198
    @fabulusinvictus2198 6 years ago

    Suzy!!!!

  • @proxy7362
    @proxy7362 5 years ago

    Tesseract OCR is terrible.

  • @mohamedseddig5878
    @mohamedseddig5878 3 months ago

    How do I run it on a whole directory with one click?