Tesseract OCR - Lesson 2: Training Tesseract for new font
- Published 19 Sep 2024
- jTessBox Editor: sourceforge.ne...
Step 1: Make box files for the images that we want to train on
Syntax: tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] batch.nochop makebox
Eg: tesseract train.my.exp0.tif train.my.exp0 batch.nochop makebox
{*Note: After making the box files, we have to correct any wrongly identified characters in them.}
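The correction mentioned in the note can also be scripted when the same mistake repeats. A minimal Python sketch, assuming the usual box file layout of one symbol per line followed by left, bottom, right, top and page number; the helper names and the sample coordinates are made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Split one box file line into the symbol and its five integer fields."""
    # Split from the right so the symbol itself is kept whole.
    char, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


def format_box_line(box: Box) -> str:
    """Turn a Box back into the text form used in .box files."""
    return f"{box.char} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


def relabel(lines, fixes):
    """Return the lines with corrected symbols; fixes maps 0-based line index to symbol."""
    result = []
    for index, line in enumerate(lines):
        box = parse_box_line(line)
        if index in fixes:
            box = box._replace(char=fixes[index])
        result.append(format_box_line(box))
    return result


# Hypothetical sample: the second symbol was recognised as "l" but should be "I".
sample = ["T 10 20 30 50 0", "l 35 20 40 50 0"]
print(relabel(sample, {1: "I"}))
```

Only the symbol is changed here; moving or resizing boxes is easier in a visual editor.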
Step 2: Create the .tr file (combining the image file and the box file)
Syntax: tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] box.train
Eg: tesseract train.my.exp0.tif train.my.exp0 box.train
Step 3: Extract the charset from the box files (the output of this command is the unicharset file)
Syntax: unicharset_extractor [langname].[fontname].[expN].box
Eg: unicharset_extractor train.my.exp0.box
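Step 4: Create a font_properties file based on our needs.
Syntax: echo "[fontname] [italic (0 or 1)] [bold (0 or 1)] [monospace (0 or 1)] [serif (0 or 1)] [fraktur (0 or 1)]" > font_properties
Eg: echo "arial 0 0 1 0 0" > font_properties
{*Note: In the Windows command prompt, leave out the double quotes, because echo writes them into the file there. The [fontname] should match the font name used in the training file names.}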
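As an alternative to echo, the font_properties line can be written from a small script, which sidesteps shell quoting differences. A minimal Python sketch; the function names are made up for illustration, and the field order simply follows the syntax line of this step.

```python
import tempfile
from pathlib import Path


def font_properties_line(fontname, italic=False, bold=False,
                         monospace=False, serif=False, fraktur=False):
    """Return one font_properties line: the font name followed by five 0/1 flags."""
    flags = (italic, bold, monospace, serif, fraktur)
    return " ".join([fontname] + ["1" if flag else "0" for flag in flags])


def write_font_properties(path, lines):
    """Write one line per font, each terminated by a newline."""
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")


# Demo in a temporary directory so nothing in the working folder is touched.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "font_properties"
    write_font_properties(target, [font_properties_line("arial", monospace=True)])
    print(target.read_text(encoding="utf-8"), end="")  # arial 0 0 1 0 0
```

For real use, point write_font_properties at the folder that holds the .tr files.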
Step 5: Training the data.
Syntax: mftraining -F font_properties -U unicharset -O [langname].unicharset [langname].[fontname].[expN].tr
Eg: mftraining -F font_properties -U unicharset -O train.unicharset train.my.exp0.tr
Step 6: Character normalization training (creates the normproto file).
Syntax: cntraining [langname].[fontname].[expN].tr
Eg: cntraining train.my.exp0.tr
{*Note: After step 5 and step 6, four files have been created (shapetable, inttemp, pffmtable, normproto).}
Step 7: Rename the four files (shapetable, inttemp, pffmtable, normproto) to ([langname].shapetable, [langname].inttemp, [langname].pffmtable, [langname].normproto)
Syntax: rename filename1 filename2
Eg:
rename shapetable train.shapetable
rename inttemp train.inttemp
rename pffmtable train.pffmtable
rename normproto train.normproto
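The rename command above is Windows-specific. A minimal Python sketch of the same step that should behave the same on Linux; the function name add_lang_prefix is hypothetical, and it deliberately stops if any of the four files is missing instead of renaming only some of them.

```python
import tempfile
from pathlib import Path

# The four files named in Step 7.
TRAINING_OUTPUTS = ("shapetable", "inttemp", "pffmtable", "normproto")


def add_lang_prefix(workdir, langname):
    """Rename each training output to [langname].[name] and return the new names."""
    workdir = Path(workdir)
    missing = [name for name in TRAINING_OUTPUTS if not (workdir / name).is_file()]
    if missing:
        raise FileNotFoundError("missing training output(s): " + ", ".join(missing))
    renamed = []
    for name in TRAINING_OUTPUTS:
        target = workdir / f"{langname}.{name}"
        (workdir / name).replace(target)  # overwrites an older file of the same name
        renamed.append(target.name)
    return renamed


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in TRAINING_OUTPUTS:
        (Path(tmp) / name).touch()
    print(add_lang_prefix(tmp, "train"))
```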
Step 8: Create .traineddata file
Syntax: combine_tessdata [langname].
Eg: combine_tessdata train.
Move the .traineddata file to the tessdata directory of the Tesseract installation
C:\Program Files\Tesseract-OCR\tessdata
Run tesseract with the trained font
tesseract Test2.png stdout -l train
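The command lines from the steps above can be generated from the three name parts, which helps avoid typos in the long file names. A minimal Python sketch; training_commands is a hypothetical helper, it only builds and prints the commands and does not run anything.

```python
def training_commands(langname, fontname, exp=0, ext="tif"):
    """Return the command lines of the steps above as argument lists."""
    base = f"{langname}.{fontname}.exp{exp}"
    return [
        # Step 1: make the box file (correct it by hand before continuing).
        ["tesseract", f"{base}.{ext}", base, "batch.nochop", "makebox"],
        # Step 2: create the .tr file from the image and the box file.
        ["tesseract", f"{base}.{ext}", base, "box.train"],
        # Step 3: extract the unicharset.
        ["unicharset_extractor", f"{base}.box"],
        # Step 5: needs the font_properties file from Step 4.
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{langname}.unicharset", f"{base}.tr"],
        # Step 6.
        ["cntraining", f"{base}.tr"],
        # Step 8: run only after the Step 7 renames.
        ["combine_tessdata", f"{langname}."],
    ]


for command in training_commands("train", "my"):
    print(" ".join(command))
```

Each list could be passed to subprocess.run one at a time, with the manual work (box editing, font_properties, renaming) done in between.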
This worked perfectly for me! I trained a model to decipher text from the Gravity Falls ARG (I didn't want to do the soul contract by hand). It needs a little fine tuning, but in the end, it gave me the majority of the text correctly! Thank you!
Thank you! Finally, I found somebody that explains this for beginners!
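Note: if the shapetable file wasn't created, you need to run the shapeclustering command to generate it.
example:
shapeclustering -F font_properties -U unicharset [langname].[fontname].[expN].tr
or, in Windows
shapeclustering.exe -F font_properties -U unicharset [langname].[fontname].[expN].tr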
Hey, thanks for your contribution! I still haven't been able to finish the process because, even after running your command, shapetable doesn't seem to generate. It's only generated after I run the next command (step 5), but the other two files in the video are not created. When I try to run the command again, I get an error saying "Failed to read shape table shapetable" Do you know why this may be?
This is super helpful, the Tesseract docs are a mess. I don't know whether you're Indian or not, but Indian YouTubers make things so much easier than the original docs.
This helped a lot in understanding the generation process of traineddata. Thank you!
Very good video. Please continue your channel and make more such videos please.
Thanks a lot for the video!
Gave up making part 3?!
You should do it!
Congratulations!
You saved like a week load of work for me!
you saved my code & my day ... thanks ( stdout is a masterpiece )
I followed the exact same steps but when I open the tiff file in box editor I don't see anything to edit on the left side
i run mftraining command and it only says no shape table file, and then nothing happens
I'm facing the same issue.
Thank you for the video. What if I want to train on multiple images and get a single trained file as the result?
Hello can you please upload part 2 how to prepare images for better accuracy.
Thanks a lot! Very useful tutorial, and thanks for the material too!
thk!, please upload part 3
great video, waiting for Lesson 3
Thanks for this. I was able to duplicate the process in Linux. However, there was zero improvement in the recognition of my handwriting. I don't know if I did something wrong or Tesseract is that bad lol. Thanks again.
Good tutorial, one of the best, thanks!
Nice explanation, I easily understood the steps. Can you share content or a video on how to train and use GD&T (mechanical characters)?
Hi, did you find some good examples with GD&T?
Hi Man, awesome tutorial.
Quick question: I'm struggling with step 5. My tesseract creates only one file (train.unicharset) instead of four as in your tutorial (missing: inttemp, pffmtable, normproto), and I get this in cmd:
Warning: No shape table file present: shapetable
Reading train.my.exp0.tr ...
Flat shape table summary: Number of shapes = 11 max unichars = 1 number with multiple unichars = 0
At 04:41 I can see that you get 3 more lines in cmd. Maybe you can give me some advice?
The issue occurred on Tesseract 5.x. After installing Tesseract 4.1 the issue is gone.
@adamchochowski5357 Thank you so much for following up with the solution! MVP
good job bro (y)
For multiple images, should I create multiple traineddata files or only a single one? If a single one, how do I train on multiple images?
It appears that you need Tesseract 4.1 for this tutorial, as with 5.0-alpha I couldn't get past the last steps
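One possible approach, as far as I understand the legacy training tools: run steps 1 and 2 for every image (exp0, exp1, ...), then pass all box files and all .tr files together to steps 3, 5 and 6, which gives a single traineddata file at the end. That these tools accept several input files is an assumption to check against your Tesseract version; the helper name below is made up. A minimal Python sketch that builds the combined command lines:

```python
def multi_image_commands(langname, fontname, exp_numbers):
    """Build the step 3, 5 and 6 commands covering several training images."""
    bases = [f"{langname}.{fontname}.exp{number}" for number in exp_numbers]
    box_files = [base + ".box" for base in bases]
    tr_files = [base + ".tr" for base in bases]
    return [
        # One unicharset covering the symbols of all box files.
        ["unicharset_extractor", *box_files],
        # One mftraining run over all .tr files.
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{langname}.unicharset", *tr_files],
        # One cntraining run over all .tr files.
        ["cntraining", *tr_files],
    ]


for command in multi_image_commands("train", "my", range(3)):
    print(" ".join(command))
```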
that's true
@Devdevdevdev idk, the probably can, but you will need a lot of samples to train that thing
@Devdevdevdev how many pages do you train with
@Devdevdevdev yes you can train more, and you probably should
@Devdevdevdev i didn't post any kind of script, i think you are mistaking me with someone, you should watch some kind of tutorial how to generate the training data, first of all, you should have a font.
If you don't have a font, which is obvious in the case of hand written stuff, then the only way to generate 5, 10, or 50+ pages would be to make a software, that can cut the predefined rectangle positions, and then generate a page containing randomly spread letters with predefined rectangles containing data which letter it is, if you can program that shouldn't be hard, then generate many pages containing the letters.
Thanks for the tutorial. How do I train data for Urdu and Arabic Languages. What would be the font properties. I have an urdu font and lots of 100s of urdu data in jpg format. No clue where to start how to start.
Thank you, hope have lesson 3~~
I'm trying to follow this video, and in step 5 I get the error: "Warning: No shape table file present: shapetable"
What is going wrong?
Hey, did you ever figure it out? I'm getting the same error message.
@samuelbastias3752 I think running the commands with administrator permissions and deleting the older files will fix your issues
I get an error at the last step, when using it to read the image. It says: error opening data file, make sure the TESSDATA_PREFIX environment variable is set to the tessdata directory. But I already put program file\Tesseract-OCR into my path environment variable. Can you help with this?
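Note that the error message names TESSDATA_PREFIX, which is a separate variable from PATH. A minimal Python sketch that checks the model file is where Tesseract will look and sets the variable for one run only; tesseract_env is a hypothetical helper, and pointing the variable at the tessdata folder itself follows the wording of the error message, so treat that as an assumption for other versions.

```python
import os
import tempfile
from pathlib import Path


def tesseract_env(tessdata_dir, lang):
    """Return a copy of the environment with TESSDATA_PREFIX set to tessdata_dir."""
    tessdata_dir = Path(tessdata_dir)
    model = tessdata_dir / f"{lang}.traineddata"
    if not model.is_file():
        raise FileNotFoundError(f"{model} not found")
    env = dict(os.environ)
    env["TESSDATA_PREFIX"] = str(tessdata_dir)
    return env


# Demo with a placeholder model file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.traineddata").touch()
    env = tesseract_env(tmp, "train")
    print(env["TESSDATA_PREFIX"] == tmp)

# Real use would look like this (not executed here):
# subprocess.run(["tesseract", "Test2.png", "stdout", "-l", "train"],
#                env=tesseract_env(r"C:\Program Files\Tesseract-OCR\tessdata", "train"))
```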
Thanks @ The Code... not all files were generated! What could be the issue?
I am trying to train tesseract in a Linux machine, I am getting segmentation fault in Step 5??
Thanks for your tutorial, I want to capture an email from an image but Tesseract does not recognize the @ symbol, how can I solve it?
Hey, How can I combine two traineddata files into single traineddata file
Hi, I am getting error while training the data. Could you please tell which tesseract version you are using?
it's in the movie, it's 4.0
Thank you for your video. It was very much useful. Can you please share the next part too?
Hey! Have you done your work on tesseract or doing?
facing error
'tesseract' is not recognized as an internal or external command,
operable program or batch file.
Set the path correctly: search for "path" in the Windows search, then in the environment variables open the Path variable and create a new entry (eg: c:/programfiles/tesseractocr)
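To check whether the path is set up, the programs used in this tutorial can be looked up on PATH before starting. A minimal Python sketch using shutil.which; the tool list is taken from the steps in the description and the function name is made up.

```python
import shutil

# Programs called in the steps of the description.
TOOLS = ("tesseract", "unicharset_extractor", "mftraining",
         "cntraining", "combine_tessdata")


def missing_tools(tools=TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


not_found = missing_tools()
if not_found:
    print("Not on PATH:", ", ".join(not_found))
else:
    print("All training tools were found on PATH.")
```

After changing the PATH variable, open a new command prompt window, since already open windows keep the old value.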
How can we train the model with some specific user's handwritten data?
Where is part 3 ?
cannot find letters on geometric shapes. how can i solve this?
Hi, I'm getting the error "APPLY_BOXES: boxfile line 6/25 ((421,1325),(494,1378)): FAILURE! Couldn't find a matching blob" while creating the .tr file. If anyone knows how to solve it, please provide a solution
do you have an answer to it?
How can I use this custom trained tesseract model and use it with YOLOv8 to recognize license plate number?????
Pls Help
did you find the solution???
@dalinsixtus6752 No Sir
thanks for the tutorial, can you help me? after doing step 4, there is no font_properties file. i run this on raspbian
Same on windows 11
@DammIhateThisName the description says echo "arial 0 0 1 0 0" [angled bracket] font_properties
you need to use echo "arial 0 0 1 0 0" > font_properties
Thank you very much
Why is my Tesseract just reading the .tr file but not writing pffmtable, inttemp, and normproto?
have u found the solution bro?
i'm having the same problem
Yes, I use Tesseract v4.0.0 and work fine
use tesseract v4.0.0 and ensure eng.traineddata file present in tessdata folder.
I tried running mftraining but it never ends? Any fix for this?
Thanks!
What is your Tesseract version
4.0
👍👍👍
Is this some sort of joke? You downloaded jTessBoxEditor and then did the whole process in a command line. What the hell is the purpose of jTessBoxEditor then??
To edit the bounding boxes. You can add bounding boxes wherever necessary when training for new languages.
You need jTessBoxEditor to correct the data, because if you train before correcting it, it will give you failures
Excellent, thank you.
At 1:16, an incidental note on pronunciation, the “v” in “converting” is a voiced “f” sound, rather than any “w” related sounds.
“v” is positioned next to “w” but that's misleading; they don't sound alike. Their sound production is different.
“v” is more closely related to “f”. Say the word “fee.” Make and hold the “f” sound. Then, while holding the “f” sound, hum while making the “f” sound.
“v” is a vibrating “f”.
Regards
This is the old way, pre Tesseract 4, not for the LSTM network.
Classical Indian youtuber
When I copy and paste this command into cmd, tesseract train.my.exp0.tif train.my.exp0 batch.nochop makebox, it says that it doesn't recognize it