Extract text from images with Tesseract OCR on Windows

DFIRScience

Published 31 Jan 2025

COMMENTS • 112

@josephc3080 5 years ago ⁺¹¹
This is a really good tutorial. I appreciate the care you took in going step by step, especially through altering the path.
@GNS216 6 years ago ⁺⁹
This is the most helpful tutorial on Tesseract that I've found. Thank you.
@hkim644 6 years ago ⁺⁵
omg. I was watching your video to install Tesseract. Meanwhile, I was amazed that you can read Korean. I thought you chose a random non-English language to prove Tesseract works with different languages. Amazed as a Korean.
I am trying to learn how OCR works because I want to make an app that requires OCR. But since I have no coding experience or anything even close to digital languages, I am having some difficulties. At least I was able to use Tesseract after watching this video. Thank you so much!
4 years ago
Thanks for this tutorial: I have had trouble with converting text in a Mayan language here in Guatemala. I followed your steps and voila!
The next step for me is to figure out how to train a recognition set for our local Mayan alphabets.
Thanks a lot.
@iancardenas-spanishbutcomp4074 3 years ago
Did you get to train it for a different alphabet? Can you help me? I'm trying to get OCR working for IPA character recognition.
@TheJoinckim 7 years ago ⁺²
Very very good tutorial on Tesseract for Koreans, and clear pronunciation. Thank you.
@TzKet4m 5 years ago
Your voice makes me happy to browse YouTube, so clear fuark
@emmanuelvelasco8753 7 years ago ⁺¹
Keep making these videos man! Interesting content.
@seung-wanson9447 6 years ago ⁺¹
FYI, if you have never added anything to PATH beyond the default entry, the edit selection box will not pop up.
So, going by your video, I need to make the entry manually, separating the new one with ";" (semicolon).
Afterwards, if I click the Edit button, I get the same pop-up edit box.
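The semicolon-separated PATH handling described in this comment can be sketched in Python. This is only an illustration: it computes the new PATH string and does not change any system setting, and the function names are hypothetical helpers, not part of Tesseract or Windows.

```python
import os

def path_entries(path_value, sep=os.pathsep):
    """Split a PATH-style string into its non-empty entries."""
    return [entry for entry in path_value.split(sep) if entry]

def is_on_path(directory, path_value, sep=os.pathsep):
    """Return True if directory is already one of the PATH entries.

    The comparison ignores case and trailing slashes, as Windows does.
    """
    wanted = directory.rstrip("\\/").lower()
    return any(entry.rstrip("\\/").lower() == wanted
               for entry in path_entries(path_value, sep))

def append_to_path(directory, path_value, sep=os.pathsep):
    """Return a PATH string with directory appended, unless already present."""
    if is_on_path(directory, path_value, sep):
        return path_value
    if not path_value:
        return directory
    # Avoid a doubled separator if the old value already ends with one.
    return path_value.rstrip(sep) + sep + directory
```

On Windows `os.pathsep` is `;`, so `append_to_path(install_dir, os.environ["PATH"])` would give the value to paste into the environment variable dialog, where `install_dir` is wherever Tesseract was installed on your machine.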
@philglanville3974 4 years ago
Hi, a very good tutorial, but as mentioned by yourself and in a comment by another viewer regarding batch folder/file processing, I cannot find any uploaded tutorial video?
@beastmonsterthing3 5 years ago
Thanks so much. Easy to understand and so helpful. You're a legend.
@opheliafromlcf9509 3 years ago ⁺¹
How did you turn each page of the PDF into PNGs? Thank you for this high-quality video.
@opheliafromlcf9509 3 years ago
Alright, alright, I got that to work. Now I am wondering how you write the code to make it run on all the PNGs at once instead of having to do each one line by line, one at a time.
@harmindersinghnijjar 3 years ago
Hey there, you can use Snip & Sketch on Windows. I'm making a guide on just that currently.
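For the question about running every PNG at once, one possible approach is a short Python loop that calls the same `tesseract <image> <outputbase> -l <lang>` command once per file. This is a minimal sketch, assuming `tesseract` is on PATH; the helper names are hypothetical.

```python
import os
import subprocess

def build_batch_commands(image_paths, lang="eng"):
    """Return one tesseract argument list per image, in sorted order."""
    commands = []
    for image in sorted(str(p) for p in image_paths):
        # The output base is the image path without its extension;
        # tesseract appends ".txt" to it.
        out_base = os.path.splitext(image)[0]
        commands.append(["tesseract", image, out_base, "-l", lang])
    return commands

def run_all(commands):
    """Run each command in turn and stop with an error if one fails."""
    for command in commands:
        subprocess.run(command, check=True)
```

With a folder of page images, `run_all(build_batch_commands(pathlib.Path("pages").glob("*.png")))` would leave one `.txt` file next to each PNG; `pages` is a placeholder folder name.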
@rezkiy95 3 years ago
Thanks for the no-bs tutorial!
@deepak223098 4 years ago ⁺²
Can you tell us how to train on our own dataset?
@aradsoltani4646 3 years ago
Thank you, that was very helpful :-D
@DFIRScience 3 years ago
Glad it helped!
@knowsmynametoonobody9191 5 years ago ⁺¹
Nice video, it's what I was looking for. So, thank you very much! 😀
@ahmedfarouk8197 6 years ago
You can convert your PDF to a single TIFF file instead of converting it to several PNG files.
@R.t.a.s 4 years ago ⁺¹
Thanks a lot for this, but can I use this for manuscripts as well? And if so plz tell me how :)
@epochseven4197 3 years ago
Interestingly enough, the default install path for the Windows x64 version is:
C:\Users\username\AppData\Local\Programs\Tesseract-OCR
@luisguevara9292 6 years ago
It helped me a lot. Thank you very much.
@jarongaus 3 years ago
Your instructions are phenomenal. You are amazing at explaining computer commands and tricks. The only problem is that this program sucks and it is a nightmare to use.
It's not your fault. Thanks so much for teaching so many tricks.
@jennyf.2124 4 years ago
Have you maybe tried out whether it also works with handwritten texts?
@DFIRScience 4 years ago ⁺¹
Hand-written text (block letters) will work, but not be very accurate. Ideally, Tesseract should be re-trained on whatever font you are focused on.
@jennyf.2124 4 years ago ⁺¹
@DFIRScience I see, thank you very much!
@etil2jz 6 years ago
Really good tutorial, clear.
@kevinsanti4091 6 years ago
A video with tips on how to train Tesseract would be great! Anyway, thanks a lot for this video so far! Helpful for my first steps and really appreciated!
I'm wondering if someone has already built, as something closer to an end-user application than a tool for programmers (or, failing that, how to do it): 1) an overlay of the pictured document and the OCR result, so that the original document stays displayed as it is but becomes "highlight-able", or 2) a parallel OCR document that keeps the letter positioning and page layout of the original picture and, where recognition is difficult or has a low confidence level, keeps the cropped original picture instead, for example for graphs, pictures, and drawings.
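Part of what this comment asks for may already be covered by Tesseract's output configs. As far as I know, adding `pdf` after the output base produces a PDF that shows the original image with a selectable text layer, and `hocr` produces HTML-like output with word positions and confidences. A minimal sketch under that assumption; the helper name is hypothetical.

```python
def build_output_command(image, out_base, fmt="pdf", lang="eng"):
    """Build a tesseract command that writes a searchable PDF or hOCR file.

    fmt is passed as a config name after the output base, e.g.
    `tesseract page.png page -l eng pdf` writes page.pdf.
    """
    if fmt not in ("pdf", "hocr"):
        raise ValueError("this sketch only covers the 'pdf' and 'hocr' configs")
    return ["tesseract", image, out_base, "-l", lang, fmt]
```

The returned list can be handed to `subprocess.run(..., check=True)`. Whether the layout fidelity is good enough for graphs and drawings would need checking on your own documents.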
@itsdannyftw 5 years ago
What mic are you using? Great video, thanks!
@cohas3424 6 years ago ⁺¹
This is the video I was looking for. Thank you. ^^
@prateekgupta2916 4 years ago
Hi sir
Much needed video.
Can u tell me how to train Tesseract to identify a specific font?
@finestanime5878 7 years ago
Thanks bro, it is really helpful.
@DFIRScience 6 years ago ⁺¹
Thanks a lot! I appreciate it.
@saikushalmandala6438 6 years ago ⁺¹
That's a good video,
but how do I preprocess the input image and then pass it through Tesseract?
Can u please help with it ASAP?
@gabrielbessa2575 5 years ago
Great tutorial! thx
@danielveraec 5 years ago
Thanks for the information.
How can I install additional languages beyond the ones you showed? Maybe you already said it, but my English is not very good and I didn't catch it.
@davidpimental6704 5 years ago
I need help with mixed-language PDFs: English and Ancient Greek. Also, I would like to target positions within the image taken from a PDF file.
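For mixed-language pages, the `-l` option accepts several language codes joined with `+`. A minimal sketch, assuming the `eng` and `grc` (Ancient Greek) traineddata files are installed; the helper names are hypothetical.

```python
def language_argument(languages):
    """Join language codes into the value expected by tesseract's -l option."""
    cleaned = [code.strip() for code in languages if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)

def build_multilang_command(image, out_base, languages):
    """Build a tesseract command that recognises several languages at once."""
    return ["tesseract", image, out_base, "-l", language_argument(languages)]
```

For example, `build_multilang_command("page.png", "page", ["eng", "grc"])` yields the equivalent of `tesseract page.png page -l eng+grc`. The order of the codes may influence results, so it is worth trying both orders on a sample page.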
@simunyugashakti5373 6 years ago
Hi. Please guide me on how I can retrieve the coordinate positions of the words that I retrieved from the image.
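Word coordinates can be obtained by asking Tesseract for TSV output (`tesseract image outputbase tsv`) and reading the `left`, `top`, `width`, and `height` columns. The sketch below assumes the column names and the convention that `level` 5 marks word rows, which is how I understand the TSV output of recent versions; check a file from your own install before relying on it.

```python
import csv
import io

def build_tsv_command(image, out_base, lang="eng"):
    """Build a tesseract command that writes <out_base>.tsv."""
    return ["tesseract", image, out_base, "-l", lang, "tsv"]

def parse_word_boxes(tsv_text, min_conf=0.0):
    """Return a list of word boxes from tesseract TSV output.

    Each box is a dict with text, left, top, width, height and conf.
    Rows that are not words (level != 5) or have empty text are skipped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    boxes = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if row.get("level") != "5" or not text:
            continue
        conf = float(row["conf"])
        if conf < min_conf:
            continue
        boxes.append({
            "text": text,
            "left": int(row["left"]),
            "top": int(row["top"]),
            "width": int(row["width"]),
            "height": int(row["height"]),
            "conf": conf,
        })
    return boxes
```

Coordinates are in pixels of the input image, measured from the top-left corner, so they can be used to crop or highlight a word in the original picture.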
@danperryy 5 years ago
What a great job.
@hyperventilate7318 4 years ago
I have photographs of people with the date printed below. Can this solution extract the date? I need to do this for 1000s of photos (batch).
@mrcb1698 6 years ago
Not sure if you will answer this, but I'd love it if you could help me with the PowerShell/batch code you spoke about at the end to make it work on a whole file. I'm currently trying but no success yet. Good video btw!
@DFIRScience 6 years ago ⁺¹
Hey there. Sure, I can help with that. I'll post back after recording.
@iancardenas-spanishbutcomp4074 3 years ago
@DFIRScience did you make a tutorial for training the OCR on another alphabet? I'm trying to get it to work with IPA.
@Bismillah_bismillah_bb 6 years ago
I usually play trivia games and I want to use it there. Can u plz try to make a video on that?
@pixelvader2451 5 years ago
So, should I do it one by one? I have complete books, is there no way to do this for several images?
@GermanPowershell 6 years ago
Basically a nice video. But why do you open and use PowerShell ISE and then not use anything from PowerShell?
@a2zGodz 6 years ago ⁺¹
How do u train Tesseract? Can u point me in the right direction with something I can use?
@DFIRScience 6 years ago
I'll try to do a video about that shortly. Until then you can check the documentation here: github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract
@prateekgupta2916 4 years ago ⁺¹
Can u pls help me with training Tesseract, for the sake of public help? I will be very thankful to you.
@allirashna2072 5 years ago
I'm kind of skeptical about allowing changes to hardware. Is it completely safe?
@mrmikearmstrong 6 years ago
Nice tutorial, makes everything nice and simple to handle. On another note, I want to call the tesseract.exe file from a .NET application that has just taken an image of some text. Is there a way to get the output of the OCR as a string in the console? Or would I have to wait until the character recognition has completed, then go and read that text file at a later time?
@DFIRScience 6 years ago
Yeah, I'm pretty sure you have to read the file after. I'll check if you can output to pipe.
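On the pipe question: as far as I know, the command line accepts the literal word `stdout` as the output base, in which case the recognised text is written to standard output instead of a file. The asker uses .NET; the sketch below shows the same idea in Python, and the helper names are hypothetical.

```python
import subprocess

def build_stdout_command(image, lang="eng"):
    """Build a tesseract command that prints the text instead of writing a file."""
    return ["tesseract", image, "stdout", "-l", lang]

def ocr_to_string(image, lang="eng"):
    """Run tesseract and return the recognised text as a string.

    Assumes tesseract is on PATH and emits UTF-8 text.
    """
    result = subprocess.run(build_stdout_command(image, lang),
                            capture_output=True, encoding="utf-8",
                            check=True)
    return result.stdout
```

Any language that can start a process and capture its standard output, .NET included, should be able to use the same command line.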
@KhalilYasser 5 years ago
Thanks a lot. How can I add a new language after the installation?
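To my knowledge, a language is added after installation by placing its `<code>.traineddata` file in the `tessdata` folder of the install, and `tesseract --list-langs` then reports which codes are available. The sketch below only helps compare what you want against what is installed; the header-line format it skips is an assumption and may differ between versions, and the function names are hypothetical.

```python
def list_langs_command():
    """Command that asks tesseract which language codes are installed."""
    return ["tesseract", "--list-langs"]

def parse_list_langs(output):
    """Extract language codes from the text printed by --list-langs.

    The first line is assumed to be a header starting with
    'List of available languages'; every other non-empty line is a code.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return [line for line in lines
            if not line.lower().startswith("list of available languages")]

def missing_languages(wanted, installed):
    """Return the wanted codes that are not installed, keeping their order."""
    available = set(installed)
    return [code for code in wanted if code not in available]
```

If `missing_languages(["eng", "ben"], installed)` returns `["ben"]`, the `ben.traineddata` file still needs to be downloaded and copied into `tessdata`.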
@cezarmoniz6579 6 years ago
Congratulations on the video. I'm from Rio de Janeiro, Brazil. Great accent in English! Can we use Tesseract with PHP?
By the way, what's your name?
@sebastienjurkowski 6 years ago
Hi, we are looking for someone knowledgeable about OCR, specifically for text from a video feed. The text would most often appear distorted, non-horizontal, and sometimes wrapped or partially wrapped. The text to be read is strictly a short sequence of numbers and/or letters. There can be multiple variations of those sequences in the same image. Contact me if that rings a bell :)
@Barklo69 4 years ago ⁺¹
What happened to the tutorial on making your own data trainer? :(
@fabarchimilku4073 3 years ago
Hi, do you have a link to the batch folder converting thingy?
@jennilthiyam1261 6 years ago
How do I train a new language that is not in the language list?
@mahmoodal-imam2892 6 years ago
Thanks a lot, brother
@mattchew2203 6 years ago
How did you manage to get such fast results? It is taking me at least 15 seconds to OCR a full page...
@DFIRScience 6 years ago
The quality of your image will make a difference. Try around 300dpi. That will give you good recognition but should reduce processing time.
@sunnyraven4563 5 years ago
Can you please do the batch file video?
@yllamaecataylo9282 6 years ago
Can I actually use this to categorize files into different folders? Btw, I'm using PHP so I don't know if it will work.
@atharvagupta9355 4 years ago
Hey, does anyone know how to scan multiple pictures in one go and measure the amount of time taken for the same?
Thanks for the great video
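Measuring the time per picture can be done by wrapping each OCR call in a timer. A minimal sketch; `time_each` is a hypothetical helper, and the `process` argument stands for whatever function runs Tesseract on one image.

```python
import time

def time_each(items, process):
    """Call process(item) for every item and record how long each call took.

    Returns a list of (item, seconds) pairs in the original order.
    """
    timings = []
    for item in items:
        start = time.perf_counter()
        process(item)
        timings.append((item, time.perf_counter() - start))
    return timings

def total_seconds(timings):
    """Sum the per-item durations returned by time_each."""
    return sum(seconds for _, seconds in timings)
```

For example, `process` could be `lambda image: subprocess.run(["tesseract", image, image + "_out"], check=True)`, with `subprocess` imported and `tesseract` on PATH.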
@mydulislam4218 6 years ago
Thank you very much for your nice tutorial. But I would like your help with how to use Tesseract OCR without PowerShell. Is there a very easy way to take a PNG or other image and get the text from it with Tesseract, without any complexity, once it and the language data are installed?
@jennilthiyam980 6 years ago
lstm_recognizer_->DeSerialize(&fp):Error:Assert failed:in file ../../../../ccmain/tessedit.cpp, line 193
I got the above error when trying to run
tesseract.exe 3.jpeg ..\out1.txt -l ben
plz help me out
@gabrielbessa2575 5 years ago
Try completely uninstalling and downloading an updated version :v
hope it helps
@thesocialtalk1853 1 year ago
Hello, I want to use another language in Tesseract.
@nikhilgjog 5 years ago
Good info, but it would be much better if the author could make a condensed video. He has repeated the same info or provided unnecessary info in multiple places.
@rachelludmir7169 6 years ago
Great video, very clear.
Do you have a video on how to train Tesseract?
Please, it would be very useful for me.
@xlnyc77 6 years ago
Using PowerShell? So it's not really for Windows? This is DOS.
Did you ever make a PowerShell script?
@19perception83 6 years ago
Excellent video, however, my output was dreadful.
English, clear to see, and it rendered about 90% fine; however, there are wingding-style artefacts all over the place. A bit pants really.
It can also render to different file formats with some more easily readable formatting (.odt) etc.
Will look for an alternative to compare against.
@DFIRScience 6 years ago
If you'll be using the same types of input, you may want to train a new classifier on your specific dataset. For a random image 90% is not bad. I would make a filter script to clean the text and remove wingdings, etc.
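The filter script suggested in this reply could look something like the sketch below. It is one possible heuristic, not the author's script: it keeps letters, digits, punctuation, and spaces in any language and drops characters that Unicode classifies as symbols or control codes, which is where dingbat-style artefacts usually fall. Note that this also drops legitimate symbols such as `+` or `$` unless they are listed in `keep`.

```python
import re
import unicodedata

def clean_ocr_text(text, keep=""):
    """Remove symbol and control characters from OCR output.

    Letters (L*), numbers (N*), punctuation (P*), spaces, newlines and tabs
    are kept, along with any characters listed in keep.
    """
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if (ch in "\n\t" or ch in keep
                or category[0] in "LNP"   # letters, numbers, punctuation
                or category == "Zs"):     # ordinary space separators
            kept.append(ch)
    # Collapse runs of spaces left behind by removed characters.
    return re.sub(r" {2,}", " ", "".join(kept))
```

It can be applied to each output file after OCR, for example by reading the `.txt` file, cleaning it, and writing it back.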
@hitachimonsta9553 5 years ago
Thanks!
@sayankumardey6826 3 years ago
Hi, please share this PDF file for download.
@punnarajeev867 4 years ago
Can we convert a captcha image into text?
@sangjunlee391 5 years ago ⁺¹
Thank you, brother.
@selvas7502 4 years ago
How do I convert multiple images from a folder without giving the image names one by one?
Is there any command to do it?
@harmindersinghnijjar 3 years ago
Hey there, you can use Snip & Sketch on Windows. I'm making a guide on just that currently.
@AliMurtaza-hs2ct 6 years ago
"Warning. Invalid resolution 0 dpi. Using 70 instead" and the text comes out blank. Please help.
@DFIRScience 6 years ago
What is your input file? JPEG? PNG?
@AliMurtaza-hs2ct 6 years ago
PNG
@DFIRScience 6 years ago
You might try the solution here: stackoverflow.com/questions/42990139/tesseract-ocr-how-do-i-improve-result
@AliMurtaza-hs2ct 6 years ago ⁺¹
Thanks. It worked.
@jaiksah 6 years ago
The moment I type tesseract.exe --help, it opens the exe for installation, don't know why.
@DFIRScience 6 years ago ⁺¹
Try uninstalling, and downloading the installer from here: digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.01.exe
@dipsikhaphukan5563 4 years ago
Wanted this same thing using Java. Please help!
@randomvideosshideos8508 5 years ago
But this is not detecting text from product images.
@DFIRScience 5 years ago
Yes, there are a lot of situations where the current training will not work. You may need to create a training set based on the problems you are working on, and retrain tesseract with your problem set. I'm working on a video to make custom training sets for tesseract.
@aokaf 6 years ago
Please help me find out how I can use it on a Mac
pleeeeease
@rodrigogutierrez7775 6 years ago
Can this be done with a captcha image?
@venkateshdhande6318 6 years ago ⁺²
First, how do you convert a PDF to images?
@adoniskomplex91 5 years ago
How can I increase the accuracy?
@DFIRScience 5 years ago ⁺¹
You will need to retrain the model based on your specific problem. I'm working on a video for training tesseract.
@zardashtshwany3784 4 years ago
tnx a lot
@tobiaskarl4939 5 years ago
Also, one has to set TESSDATA_PREFIX to "installdir\tessdata".
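When Tesseract is started from a script, `TESSDATA_PREFIX` can be set just for that one process rather than system-wide. A minimal sketch; the helper name is hypothetical, and whether the variable should point at the `tessdata` folder itself (as this comment says) or at its parent appears to depend on the Tesseract version, so treat the exact value as an assumption to verify.

```python
import os

def env_with_tessdata(tessdata_dir, base_env=None):
    """Return a copy of the environment with TESSDATA_PREFIX set.

    base_env defaults to the current process environment; the original
    mapping is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TESSDATA_PREFIX"] = tessdata_dir
    return env
```

The result is passed as `env=` to `subprocess.run`, for example `subprocess.run(["tesseract", "page.png", "page"], env=env_with_tessdata(tessdata_dir), check=True)`, where `tessdata_dir` is the folder from your own install.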
@adoniskomplex91 5 years ago ⁺¹
I've used pdftoppm.exe from poppler. Works very well.
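For anyone wanting to script the pdftoppm step mentioned here, a minimal Python sketch follows. It assumes poppler's `pdftoppm` is on PATH and uses its `-png` and `-r` (resolution in DPI) options; the helper name and the 300 DPI default are illustrative choices.

```python
def build_pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build a pdftoppm command that renders each PDF page to a PNG.

    Output files are named from out_prefix plus a page number.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return ["pdftoppm", "-png", "-r", str(dpi), pdf_path, out_prefix]
```

The list can be run with `subprocess.run(..., check=True)`, and the resulting PNGs can then be fed to Tesseract one by one.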
@tkinter3160 5 years ago
Sir, can OCR extract text from video?
@gabrielbessa2575 5 years ago ⁺¹
Unfortunately no, but if you extract the frames and turn them into individual pictures, you can then run the program and get the .txt files :3
@christianrazvan 2 years ago
It doesn't appear that Tesseract is any good.
@DFIRScience 2 years ago ⁺¹
Default models are so-so. You'll definitely need to train on your specific problem. I've used default models for general ocr where high error wasn't a problem.
@bj16162 9 months ago
btw the default Windows OCR is better than Tesseract in my language
@송승협-b9g 4 years ago
Korean?
@massivefins2597 5 years ago
Tesseract is crud... Use Tabula and PDFs... You can select your tables also...
@silviotadeu607 6 years ago
Wonderful Dad!!..lol
@tasmia5243 3 years ago
So it is easy to use for everyone and I am the only one who is freaking out?!
@fabulusinvictus2198 6 years ago
Suzy!!!!
@mauroamorso 3 years ago
tesseract 0001.jpg -l eng
@proxy7362 5 years ago
Tesseract OCR is terrible.
@mohamedseddig5878 5 months ago
How do I run it on a whole directory with one click?
