Excellent videos! As a second-language speaker, I appreciate your accurate spoken English a lot. Thanks!
I'm receiving "Empty page!! Empty page!!" What could the problem be?
For more than one language you could use the + sign to concatenate the 3-character ISO 639-2 language codes (see the man page),
e.g.
tesseract out.tiff multi -l eng+kor
(the output base name comes before the options, so this writes multi.txt)
Thank you so much! This is the simplest tutorial I could think of, that explains tesseract in depth.
Glad it was helpful!
Finally a Native English speaker tutorial for this. Thank you very much.
Not sure this was possible when this video came out, but a quick Google search just showed me that it seems to be possible to hand over several languages as parameters (using "+") at the same time.
Very detailed tutorial, can you show how to use PaddleOCR next time? It includes more languages
Thank you. Yeah, I can show PaddleOCR. Stay tuned!
Thank you so much, it helps me dive into ocr really quickly.
Thanks so much, it's very clear for not native English speaker too.
Thanks!! Hard to come across a tutorial as well explained as this one
Wow super useful man, thanks!
clear and concise! can't help but subscribe. Thanks buddy!
Thanks a million for your time!!!!
2x playback speed really improves the pacing.
Is ImageMagick required to get the tiff file?
Need your opinion. I'm researching how to take a JPEG photograph of a receipt and run a Java app to get the text from it. Would Tesseract be the best solution?
Hi. Why the "Key words :" were NOT extracted from the document? See on 6.43.
What if tesseract is unable to recognize the English font "Ford's folly italic and ladylike BB font"? How do we embed the font into tesseract for recognizing the characters in the PDF?
Good video. Very informative. Thanks.
How can we apply the ImageToString function to a live feed of cv2 frames?
How to install tessdata from GitHub? I mean, where do I extract it?
I liked all your videos which are very informative. you should produce more videos often. thanks
Thanks a lot. I appreciate it.
Did you ever find a way to combine the text from 2 languages? I have a 270 page pdf in Simplified Chinese with around 1/3rd in English....such a nightmare to translate.
Thank you. Excellent video! How do I install textract on Windows 7 x64?
Great. Does this work to extract handwritten text from snapshots taken from mobile?
You're going to have a very difficult time with recognizing handwritten text, as in almost impossible. Just because it will "read" handwritten text doesn't mean the resulting text file will be useful due to all of the errors.
How can we detect and extract text in a file even when it has a noisy background?
does it require OpenCV?
Hi! What did you do? I have the same question.
@@MARQUITOSGUALACBA I found that the EAST text detector does this job well.
@@jram8961 thanks! I will try.
Is that rock music being played in the background?
It is, why?
Thanks mate for the video.
Q: Is it possible to extract tables and index and is it possible to keep the formating of the tables, index and titles or subtitles?
Default Tesseract-OCR will extract text from tables and indexes, but it will NOT keep the table formatting.
@@DFIRScience Noted, then we will have to create an ML/AI algorithm to keep the formatting.
@@MedoHamdani what do you mean by ML? Machine learning?
@@awerqga Yes
Joshua is there a way we can know if pdf contains graphical data (table ,charts , graph , etc)?
I've been looking for a neatreceipts replacement for a very long time to store and keep my receipts like I used to when I had Neat receipts for either mac or windows. I have been trying to get fully far away from both platforms but I still need an easier way to scan and store my receipts. Do you think this is a good alternative?
By itself, Tesseract would not be a good replacement. If you combined it with some sort of management and search back-end, it may do what you need.
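For example, the simplest possible management-and-search back-end is just batch OCR plus grep. A rough sketch (the receipts/ folder and the keyword are made-up examples):
for f in receipts/*.jpg; do tesseract "$f" "${f%.jpg}" -l eng; done
grep -il "coffee" receipts/*.txt
The first line writes one .txt file next to each JPEG; the second lists every receipt that mentions the keyword.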
Since Tesseract version 3.03 image files can be directly converted into PDF.
example: tesseract myimage.tiff out pdf
Could you explain?
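The trailing "pdf" is one of Tesseract's built-in output configs: instead of writing out.txt, it writes out.pdf, the original image with an invisible, searchable text layer. A sketch (file names are only examples):
tesseract myimage.tiff out -l eng pdf
That should produce out.pdf; on recent versions you can swap "pdf" for "hocr" or "tsv" to get HTML or tab-separated output instead.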
For the following error related to ImageMagick conversion:
convert-im6.q16: not authorized `characterization.pdf' @ error/constitute.c/ReadImage/412.
convert-im6.q16: no images defined `out.tiff' @ error/convert.c/ConvertImageCommand/3258.
change the rights in /etc/ImageMagick-6/policy.xml from:
<policy domain="coder" rights="none" pattern="PDF" />
to:
<policy domain="coder" rights="read|write" pattern="PDF" />
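If you would rather not edit the file by hand, a one-liner along these lines should flip just the PDF rule in place (it keeps a .bak backup; the path can differ between distros):
sudo sed -i.bak 's/rights="none" pattern="PDF"/rights="read|write" pattern="PDF"/' /etc/ImageMagick-6/policy.xml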
How well does it handle italicized English text?
Man, is this useful for handwritten images, or does EasyOCR give a much better result for handwritten images?
Can it find text in an image with colors?
Excellent video joshua
Why would you do all that work for a typed document?
Sometimes things like PDFs are saved as a kind of image, which means text is not searchable. This method can extract text that you can index and make searchable. Similarly, we can use this technique on any type of image and even video to extract text and make it searchable.
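Putting the two tools together, a minimal sketch of that workflow (scanned.pdf is a placeholder name):
convert -density 300 scanned.pdf -depth 8 -strip -background white -alpha off out.tiff
tesseract out.tiff searchable -l eng pdf
The first command rasterizes the PDF to a high-resolution TIFF; the second should produce searchable.pdf, the same pages with an indexable text layer.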
Nice video. How does this work on Windows? I couldn't get tesseract-ocr to work in Windows. Any ideas?
It's easy, Jack: just download the tesseract exe, install it, and add it to the PATH variable.
webhtg thank you! I read the documentation and downloaded tesseract.exe and then added it to PATH. Then I followed some steps mentioned in a Stack Overflow answer and it worked. I am using Anaconda Python 3.5.
great
I see you already found an answer. Posting this for others just in case: github.com/tesseract-ocr/tesseract/wiki#windows
How to do this in Anaconda on Windows 64-bit?
This is simply ossooooom intro. Big THX!
Can I do that with Arabic words?
Thanks a lot for this video! On the Windows 10 command prompt, the line "convert -density 300..." only runs if I put the word "magick" first: "magick convert -density 300...". After this, the computer stays extremely slow. Does anybody else have the same situation?
Amazing videos, but I have one question: can I use the English language in a cursive font? Please reply if you know.
Hello. Technically, yes you can. You will need to train tesseract on the cursive font that you want to detect. The problem, of course, is that cursive handwriting is quite unique between people. If you want a general cursive extractor, you will have to have a huge corpus of samples to train on, and the results will likely not be great. If you are trying to detect cursive fonts in office documents, the problem is much easier. Instructions to train tesseract are here: github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
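Once a model is trained, using it is the easy part. A sketch, assuming training produced a file called my_cursive.traineddata and a stock Ubuntu install of Tesseract 4:
sudo cp my_cursive.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
tesseract sample.tiff out -l my_cursive
The tessdata path varies by version and distro; you can also point the TESSDATA_PREFIX environment variable at your own model directory.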
Hello, great video. I can see the data is getting extracted nicely from the PDF, but I have a question. The actual data in the PDF has different font sizes, e.g. the header line has a larger font size than the body text, yet the extracted text came out with the same font for both of them.
Is it possible to find the actual font size from the data that has been extracted using OCR? Please enlighten us :)
Hello Shahid, the text was extracted from high-res images, not the PDF directly. As far as I know tesseract does not have the ability to detect font sizes; it is just character recognition. You could potentially make another utility that guesses the font/font sizes of each line, and then apply it to the already extracted text. I don't know of a utility like that, but I will look.
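One rough workaround (not something shown in the video): Tesseract's TSV output includes a bounding box for every recognized word, and the box heights can serve as a crude relative font-size estimate. A sketch:
tesseract out.tiff out -l eng tsv
The resulting out.tsv has left/top/width/height columns per word, so lines whose word heights are clearly larger than the page median are probably headers.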
@@DFIRScience do sites like whatthefont have an API?
It is interesting. Though there are some errors in the case of Korean language conversion, it is so cool.
SANGJO PARK Yeah. The English is also not perfect, but a good start. A better model could probably be trained.
I tried a simple file with some sort of logo on the top of the first page and it blacked out the first page entirely.
Well that is not strictly "from images", it's from a pdf where the text is already rendered.
Show us how to create traineddata for handwriting ;)
I would like to know too :)
You are a hero
Lol I didn't see that I already commented on this. At least I'm consistent.
Hello, Thanks a lot for the tutorial. Very helpful. I had one problem which I worked around but still you may have a better solution.
After scanning I saved the document as .pdf. When I ran "convert" I got an error message:
convert-im6.q16: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/408
Do you have any suggestions?
Thank you very much. Christophe
Clear explanation! THX
Please do a video on how to train tesseract-ocr on a new script.
What's up with all the flickering?
Wait a second, I've just installed tesseract, but it won't work unless I change my terminal's directory to the one with the Tesseract packages, code, and other stuff like that!
But the file I wanna convert to a text file is in another directory, not in tesseract's. What do I do?
You can run tesseract from the folder it's installed in and give the file's full path on the command line. Like: tesseract C:\Users\Test\data.tiff outfile -l eng
@@DFIRScience Wait, it's now saying that I can't use the scripts I've downloaded because permission was denied!
@@AlexChen0905 are you using Windows or Linux? If it's Linux, make sure you chmod +x [script name]
@@DFIRScience It's windows.
@@AlexChen0905 You can get the newest Tesseract-OCR Windows installer from here: github.com/UB-Mannheim/tesseract/wiki
I see one major flaw here. There are a lot of languages, and I can't identify all of them by looking at them.
Therefore,
To identify languages, I use a language detection library, which takes text as input.
To get the text when all I have is an image, I have to use OCR.
This OCR library then wants me to tell it what language it's looking at, completing the infinite loop.
So basically, this OCR tool is useless, because it hasn't done the one job I expected it to do - recognise what characters it was looking at.
trejkaz hello. OCR is not language detection, it is character recognition. It is trained what a symbol (picture) looks like and then gives a corresponding text version based on the trained symbol. You can apply language detection after OCR.
In theory, you can. In practice, it isn't possible with Tesseract because it has forced us to provide the language ourselves.
trejkaz take a look at Apache Tika. They use tesseract for multi-language extraction. tika.apache.org You could also potentially train on multiple languages at the same time; that would remove language selection. Not sure how that would affect accuracy.
Tika works the same as Abbyy FineReader - you specify the combination of languages you already know are in the document, so presumably it works by merging those models. I know from Abbyy's stuff that you can't even get decent results with 2 languages (assuming the results you get with 1 can even be called decent with that piece of crap), but I'd have to specify all the supported languages in order to do identification after the fact.
The real problem, though, is that OCR developers are separating the two parts of what is supposed to be a single process.
Correctly-implemented OCR should include some kind of language recognition as part of it. You can't distinguish a Cyrillic A from a Latin A without seeing the context around it, but the OCR software is expected to emit different code points for each. If you do it character by character, you'll most likely get a Latin A, and then when you try to recognise the language after the fact, it's too late, because you're already looking at mangled data.
That last paragraph is a perfect problem description. I'll see if anything exists, or maybe try making something.
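For what it's worth, Tesseract does ship a limited form of this: orientation and script detection (OSD). It will not tell you "this is Russian", but it can guess the writing system before you commit to a language model. A sketch, assuming Tesseract 4+ with the osd traineddata installed:
tesseract out.tiff stdout --psm 0
The output includes lines like "Script: Cyrillic" plus a confidence, which you could use to pick the -l value for a second, full recognition pass.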
Hi, we are looking for someone knowledgeable with OCR, specifically for text from a video feed. The text would appear most often distorted, non-horizontal, and sometimes wrapped or partially wrapped. The text to be read is strictly a short sequence of numbers and/or letters. There can be multiple variations of those sequences in the same image. Contact me if that rings a bell :)
Great post!
One question though: how has your experience been using this procedure with tables?
If the tables are simple, like one text line per cell, then it seems to work ok. The more complicated the table, or text in the table, the more odd formatting or cell content mixing you'll have.
@@DFIRScience If a person knows the format of the table, specific location in the document, you could use imagemagick to crop multiple smaller images from the page image, order those images as necessary, and extract the text from the series of smaller images so that you are forcing the correct text recognition. Additionally, with things such as technical texts, they often have cluttered footers which create issues between pages. You could crop those out so you don't have that text between the text from the pages. The rule is the more you know about your PDF, the better you can tailor your solution. Pretty much true with any project.
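Building on that, a minimal sketch of the crop-then-recognize idea (the geometry numbers are placeholders you would measure from your own page image):
convert out.tiff -crop 1200x300+150+900 +repage cell.tiff
tesseract cell.tiff cell -l eng --psm 6
-crop WIDTHxHEIGHT+X+Y pulls one region out of the page, and --psm 6 tells Tesseract to treat it as a single uniform block of text, which tends to behave better on isolated table cells than the default full-page segmentation.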
2:29 after installing Tesseract
Good video, thanks. I have a question: I want to use it for two languages, kor and eng. Is there any way? Also, are you living in Chuncheon?
How do I solve this error?
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Error in pixReadStreamGif: Can't use giflib-5.1.2; suggest 5.1.1 or earlier
Error in pixReadStream: gif: no pix returned
Error in pixRead: pix not read
Error in pixReadMemGif: pix not read
Error in pixReadMem: gif: no pix returned
Error during processing.
Sure wish all FOSS used something NOT owned by Microsoft, like GitHub is now. Maybe they should use GitLab.
sudo apt install tesseract-ocr
sudo apt install imagemagick
convert -density 300 __.pdf -depth 8 -strip -background white -alpha off out.tiff
tesseract out.tiff text -l eng
Hello Sir,
When I tried to: convert -density 300 abc.pdf -depth 8 -strip -background white -alpha off out.tiff
It showed
Error: /undefined in F,NP
Operand stack:
Execution stack:
%interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 2045 1 3 %oparray_pop 2044 1 3 %oparray_pop 2025 1 3 %oparray_pop 1884 1 3 %oparray_pop --nostringval-- %errorexec_pop .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval--
Dictionary stack:
--dict:964/1684(ro)(G)-- --dict:0/20(G)-- --dict:77/200(L)--
Current allocation mode is local
Current file position is 4
GPL Ghostscript 9.26: Unrecoverable error, exit code 1
convert-im6.q16: no images defined `out.tiff' @ error/convert.c/ConvertImageCommand/3258.
Could you please point out what I did wrong?
Best Regards
Do you have the tiff image named out.tiff in the same directory that you are running the command?
Yes, I have the tiff image named out.tiff in the same directory. Thank you very much for the reply!
@@jonessmith8670 Are you able to convert any other PDF with the same command? Is the error same for all PDFs?
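One thing worth checking when Ghostscript dies that early ("Current file position is 4") is whether the file really is a PDF and not, say, an HTML error page saved with a .pdf extension. A quick sanity check, using the file name from the command above:
file abc.pdf
head -c 8 abc.pdf
A genuine PDF should be reported as a "PDF document" and start with %PDF-1.x; if it does not, re-download or re-export the file before running convert again.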
Thanks, ... but have a question: would you be inclined to make a new video with a Graphical User Interface (GUI)? To install Tesseract is already quite a hurdle for us normal human beings..... but the way you use TERMINAL to actually grab a picture with Writing/Letters on it,... is not understandable at all, ....unless one knows Code.... making this video here only interesting to one in 10'000 people. Wouldn't it be smarter to make this video more accessible to all the other people out there, the 9'999 others? ...By simply showing in a new video, how it can be used with a GUI? By those rest of us, the 9'999 people from the 10'000 who are actually very curious about it?
Food for Thought?...BR,..A.Simon
Yeah, I understand. I'm working on some new videos, and will include GUI-based tutorials. Thanks for the suggestion!
You are amaaaazzziiinnggg !!! Thanks sooo much !!!
They need to make tesseract-ocr like ABBYY FineReader, but they should keep the terminal commands. They need to make a Tesseract GUI.
They need to use code from www.abbyy.com/en-eu/ because its character recognition is better. Or maybe ABBYY can make a free version for Linux.
If they don't do it, you can resort to piracy: download the cracked program and install it in a Windows XP or Windows 7 virtual machine.
Take a look at gImageReader. It's available for Windows as well.
How to detect if the text in an image is upside down or right side up?
Thanks.
This guy has such Sal Khan vibes
Pipe through browser and use Google translate to translate
Yeah, Google OCR and translation API is good. We used Tesseract-OCR because the lab is offline with no network connection possible.
Would you be willing to do one for MAC users? one "without" Github / coding and terminal? one tutorial which is for normal human beings, who are not a relative of Albert Einstein? And if we are at it: one that covers OCR recognition for old scripture like german gothic fonts ( Fraktur Schrift ), or Arabian and Chinese and Indian fonts etc? ;):):)
Akos Simon Haha! I'll try to find a Mac I can borrow... The problem is that you would still have to use the terminal. I don't know of a point and click interface for OCR. Can you point me to a pdf/image with the text sample you want to analyze?
COOL !!! ....ok, ...here then,... for example, Gothaischer Hofkalender scans,... at the Boston Library, and on google books, all free and all high res canon 5dmark 2 repros images, saved as pdf, also as jp2, also as dejavue, and other archival format methods.
here a direct link to a couple of archival historical books, some even reproduced in raw photos (.CR2 ) canon 5dmark2, then as jpgs saved.... all these can be downloaded in original full resolution and size, usually around 500MB to 900MB, for the whole searchable book, but often the text recognition these archive librarian use, is simply horrible ( as in Abbey!!), and CAN NOT! learn from mistakes, which is needed with Fraktur Schrift, ... which is needed, and which one of the versions of tesseract seem to be able to do, ....as much as the Sunny software (only windows) seems to be able to do.
anyway, here a few pages of a Family Called Pallfy :
archive.org/stream/gothaischerhofka1917gothuoft#page/402/mode/2up/search/palffy
here another one:
seite 408, archive.org/stream/almanachdegotha01unse_0#page/n869/mode/2up ,
and here also a typical one :
archive.org/details/gothaischerhofka1917gothuoft
I really wonder why NO ONE has made a UI yet .... and why the one who attempted it, made it for windows only, a commercial suicide decision... and not for MAC's, where all professions who deal with writing, live exclusively on.
doesn't take a genius to be a script kiddie.
Point and Click is called Acrobat Pro.
Like most of these 'tutorials', the very first command you try to duplicate FAILS. Not that I cannot figure it out on my own (I will because I always have to if I really need to this to work), it just makes me wonder: why make these tutorials at all (and you can tell by most of the comments that nobody is trying the commands)? But for the disappointed, here is the error EVERYONE will get:
convert-im6.q16: not authorized `
BTW, I only 'care' because I fully support the IDEA of FOSS and Linux. But I condemn the execution of it. My tinfoil hat is telling me these are paid disinformation courtesy of the commercial software industry (they've got a LOT of spare cash lying around to invest in FUDD and sabotage).
Okay, for those who are interested in applying this tutorial (which I am sure is exactly ONE of us):
In order to complete the first step (using ImageMagick's 'convert' command), you must change the 'policy.xml' file located in (in most cases probably) '/etc/ImageMagick-6/'.
Find the lines:
<policy domain="coder" rights="none" pattern="PS" />
<policy domain="coder" rights="none" pattern="EPS" />
<policy domain="coder" rights="none" pattern="PDF" />
<policy domain="coder" rights="none" pattern="XPS" />
In each case (for our purposes here, we especially want the 'PDF' right), change the word 'none' to 'read | write'. This will allow you to convert these file types. YOU'RE WELCOME!
For the following error
convert-im6.q16: not authorized `characterization.pdf' @ error/constitute.c/ReadImage/412.
convert-im6.q16: no images defined `out.tiff' @ error/convert.c/ConvertImageCommand/3258.
change the rights in /etc/ImageMagick-6/policy.xml
from:
<policy domain="coder" rights="none" pattern="PDF" />
to:
<policy domain="coder" rights="read|write" pattern="PDF" />
thanx ! greetings ! (^ ^)
music ---> thumbs down
Yup. Failed experiment.
This is a "most advanced OCR program" for Linux ... what a joke.
A lot has changed in 7 years. It's still suitable for some applications, but there are other, friendlier options now too.
Imagemagick sucks, use pdftoppm instead.
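For anyone who wants to try that route: pdftoppm ships with poppler-utils and can replace the ImageMagick step entirely. A sketch (input.pdf and the page prefix are placeholder names):
sudo apt install poppler-utils
pdftoppm -r 300 -tiff input.pdf page
tesseract page-1.tif out -l eng
This writes page-1.tif, page-2.tif, ... (numbering may be zero-padded) at 300 DPI, and it sidesteps the ImageMagick PDF policy restriction.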