Tesseract OCR - Lesson 2: Training Tesseract for new font

  • Published 19 Sep 2024
  • jTessBox Editor: sourceforge.ne...
    Step 1: Make box files for the images that we want to train on
    Syntax: tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] batch.nochop makebox
    Eg: tesseract train.my.exp0.tif train.my.exp0 batch.nochop makebox
    {*Note: After making the box files, open them and correct any wrongly identified characters.}
    Step 2: Create the .tr file (combining the image file and the box file)
    Syntax: tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] box.train
    Eg: tesseract train.my.exp0.tif train.my.exp0 box.train
    Step 3: Extract the charset from the box files (the output of this command is the unicharset file)
    Syntax: unicharset_extractor [langname].[fontname].[expN].box
    Eg: unicharset_extractor train.my.exp0.box
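    Step 4: Create a font_properties file based on our needs.
    Syntax: echo "[fontname] [italic (0 or 1)] [bold (0 or 1)] [monospace (0 or 1)] [serif (0 or 1)] [fraktur (0 or 1)]" > font_properties
    Eg: echo "arial 0 0 1 0 0" > font_properties
    {*Note: [fontname] is the same placeholder used in the training file names, so it should normally match them ("my" in the examples above). In Windows cmd, echo writes the quotation marks into the file, so leave them out there.}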
    Step 5: Training the data.
    Syntax: mftraining -F font_properties -U unicharset -O [langname].unicharset [langname].[fontname].[expN].tr
    Eg: mftraining -F font_properties -U unicharset -O train.unicharset train.my.exp0.tr
    Step 6: Run cntraining on the .tr file (this creates the normproto file).
    Syntax: cntraining [langname].[fontname].[expN].tr
    Eg: cntraining train.my.exp0.tr
    {*Note: After step 5 and step 6, four files have been created (shapetable, inttemp, pffmtable, normproto).}
    Step 7: Rename the four files (shapetable, inttemp, pffmtable, normproto) to ([langname].shapetable, [langname].inttemp, [langname].pffmtable, [langname].normproto)
    Syntax: rename filename1 filename2
    Eg:
    rename shapetable train.shapetable
    rename inttemp train.inttemp
    rename pffmtable train.pffmtable
    rename normproto train.normproto
    Step 8: Create .traineddata file
    Syntax: combine_tessdata [langname].
    Eg: combine_tessdata train.
    Move the .traineddata file to the tessdata directory of the Tesseract installation
    C:\Program Files\Tesseract-OCR\tessdata
    Run tesseract with the trained font
    tesseract Test2.png stdout -l train
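
The command-line steps above can be collected into one list for a given language name, font name, and page number. The following is a minimal Python sketch, not part of the video: `build_training_commands` is a hypothetical helper that only builds the argument lists from the syntax lines in the description and does not run anything. Step 4 (font_properties) and step 7 (renaming) are file operations, so they are left out.

```python
def build_training_commands(lang, font, exp=0, ext="tif"):
    """Return the tool invocations from the description as argument lists.

    Nothing is executed here; the names follow the
    [langname].[fontname].exp[N] convention used in the description.
    """
    base = f"{lang}.{font}.exp{exp}"
    image = f"{base}.{ext}"
    return [
        # Step 1: make the box file
        ["tesseract", image, base, "batch.nochop", "makebox"],
        # Step 2: create the .tr file from image + corrected box file
        ["tesseract", image, base, "box.train"],
        # Step 3: extract the character set
        ["unicharset_extractor", f"{base}.box"],
        # Step 5: shape/feature training
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        # Step 6: character normalization training
        ["cntraining", f"{base}.tr"],
        # Step 8: combine everything into [langname].traineddata
        ["combine_tessdata", f"{lang}."],
    ]


for cmd in build_training_commands("train", "my"):
    print(" ".join(cmd))
```

Each list could be handed to `subprocess.run(cmd, check=True)` once the tools are on the PATH. Several commenters below report that the later steps only worked for them on Tesseract 4.x.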

COMMENTS • 85

  • @jessd.294 27 days ago

    This worked perfectly for me! I trained a model to decipher text from the Gravity Falls ARG (I didn't want to do the soul contract by hand). It needs a little fine tuning, but in the end, it gave me the majority of the text correctly! Thank you!

  • @meve404 2 years ago +6

    Thank you! Finally, I found somebody that explains this for beginners!

  • @meggiotto 1 year ago +3

    Note: if the shapetable file was not created, you need to run the shapeclustering command to generate it.
    Example:
    shapeclustering -F -U
    or, on Windows:
    shapeclustering.exe -F -U

    • @samuelbastias3752 1 year ago

      Hey, thanks for your contribution! I still haven't been able to finish the process because, even after running your command, shapetable doesn't seem to be generated. It's only generated after I run the next command (step 5), but the other two files in the video are not created. When I try to run the command again, I get an error saying "Failed to read shape table shapetable". Do you know why this may be?

  • @ronaldaug8504 3 years ago +2

    This is super helpful; the Tesseract docs are a mess. I don't know whether you're Indian or not, but Indian YouTubers make things so much easier than the original docs.

  • @logeshpaul 1 year ago

    This helped a lot in understanding the generation process of traineddata. Thank you!

  • @madhurgoel66 5 months ago +1

    Very good video. Please continue your channel and make more such videos please.

  • @saviomilbratz 1 year ago +1

    Thanks a lot for the video!
    Gave up making part 3?!
    You should do it!
    Congratulations!

  • @adityanjsg99 1 year ago

    You saved me about a week's worth of work!

  • @HocineFerradj 1 year ago

    You saved my code and my day... thanks (stdout is a masterpiece)

  • @IamTheGreatCornholioo 2 years ago +2

    I followed the exact same steps, but when I open the TIFF file in the box editor I don't see anything to edit on the left side.

  • @rockntt783 6 months ago +1

    I run the mftraining command and it only says there is no shape table file, and then nothing happens.

    • @banguncool 5 months ago

      I'm facing the same issue.

  • @MohammadJAbuNasserMAAN 5 months ago +1

    Thank you for the video. What if I want to train on multiple images and get a single trained file as the result?
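
One way to approach the multi-image question, sketched under assumptions: each image gets its own exp number (exp0, exp1, ...), steps 1 and 2 from the description are run once per image, and then all the resulting .box and .tr files are passed together to the tools from steps 3, 5, and 6, which accept several files on one command line. `multi_page_commands` is a hypothetical helper that only builds the argument lists.

```python
def multi_page_commands(lang, font, exp_numbers):
    """Build the step 3, 5 and 6 commands for several training pages.

    Assumes files named [lang].[font].exp[N].box / .tr already exist
    for every N in exp_numbers (steps 1 and 2 run once per image).
    """
    bases = [f"{lang}.{font}.exp{n}" for n in exp_numbers]
    boxes = [base + ".box" for base in bases]
    trs = [base + ".tr" for base in bases]
    return {
        "unicharset_extractor": ["unicharset_extractor", *boxes],
        "mftraining": ["mftraining", "-F", "font_properties",
                       "-U", "unicharset",
                       "-O", f"{lang}.unicharset", *trs],
        "cntraining": ["cntraining", *trs],
    }


for name, cmd in multi_page_commands("train", "my", [0, 1, 2]).items():
    print(" ".join(cmd))
```

The remaining steps (renaming the output files and running combine_tessdata) stay the same, so the result is still a single [langname].traineddata file.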

  • @Dailythingsx 2 years ago +1

    Hello, can you please upload part 2 on how to prepare images for better accuracy?

  • @lucaavitabile7687 2 years ago

    Thanks a lot! Very useful tutorial, and thanks for the material too!

  • @davidwebchile 5 months ago +1

    Thanks! Please upload part 3.

  • @TuanLe-ve7lm 1 year ago +1

    Great video, waiting for Lesson 3.

  • @EvonOSmith1 2 years ago

    Thanks for this. I was able to duplicate the process on Linux. However, there was zero improvement in the recognition of my handwriting. I don't know if I did something wrong or Tesseract is that bad lol. Thanks again.

  • @MishaDisable 1 year ago

    Good tutorial, one of the best, thanks!

  • @vijayk7819 2 years ago +1

    Nice explanation, I easily understood the steps. Can you share content or a video on how to train and use GD&T (mechanical characters)?

    • @viteralex 2 years ago

      Hi, did you find any good examples with GD&T?

  • @adamchochowski5357 2 years ago +2

    Hi man, awesome tutorial.
    Quick question: I'm struggling with step 5. My tesseract creates only one file (train.unicharset) instead of four as in your tutorial (missing: inttemp, pffmtable, normproto), so I get this in cmd:
    Warning: No shape table file present: shapetable
    Reading train.my.exp0.tr ...
    Flat shape table summary: Number of shapes = 11 max unichars = 1 number with multiple unichars = 0
    At 04:41 I can see that you get 3 more lines in cmd. Maybe you can give me some advice?

    • @adamchochowski5357 2 years ago +3

      The issue occurred on Tesseract 5.X. After installing Tesseract 4.1 the issue is not present.

    • @samuelbastias3752 1 year ago

      @adamchochowski5357 Thank you so much for following up with the solution! MVP
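
Because several commenters report that these steps behaved differently for them on Tesseract 5.x than on 4.x, it may help to check the installed version before starting. This is a minimal sketch, assuming the first line of the `tesseract --version` output looks like `tesseract 4.1.1` or `tesseract v5.0.0-alpha...`; the sample strings in the code are illustrative, and `parse_major_version` is a hypothetical helper.

```python
import re


def parse_major_version(version_output):
    """Return the major version number from `tesseract --version` text.

    Only the first non-empty line is inspected; returns None when it
    does not look like "tesseract <version>".
    """
    lines = version_output.strip().splitlines()
    first_line = lines[0] if lines else ""
    match = re.search(r"tesseract\s+v?(\d+)", first_line)
    return int(match.group(1)) if match else None


# Illustrative sample strings (assumed format, not captured from a real run)
for sample in ["tesseract 4.1.1\n leptonica-1.79.0",
               "tesseract v5.0.0-alpha.20200328"]:
    print(parse_major_version(sample))
```

To check a real installation, the text could come from `subprocess.run(["tesseract", "--version"], capture_output=True, text=True)`; concatenating `stdout` and `stderr` is a safe choice, since the stream used may depend on the version.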

  • @CookWithKuroOfficial 1 month ago

    Good job bro (y)

  • @roshanacharya8054 2 years ago +1

    For multiple images, should I create multiple traineddata files or only a single one? If a single one, how do I train on multiple images?

  • @marsmediainfo 2 years ago +3

    It appears that you need Tesseract 4.1 for this tutorial; with 5.0-alpha I couldn't get past the last steps.

    • @professionalgambling6783 2 years ago

      That's true.

    • @professionalgambling6783 2 years ago

      @Devdevdevdev idk, it probably can, but you will need a lot of samples to train that thing

    • @professionalgambling6783 2 years ago

      @Devdevdevdev How many pages do you train with?

    • @professionalgambling6783 2 years ago

      @Devdevdevdev Yes, you can train more, and you probably should.

    • @professionalgambling6783 2 years ago

      @Devdevdevdev I didn't post any kind of script; I think you are mistaking me for someone else. You should watch a tutorial on how to generate the training data. First of all, you should have a font.
      If you don't have a font, which is obviously the case with handwritten material, then the only way to generate 5, 10, or 50+ pages would be to write software that cuts out the letters at predefined rectangle positions and then generates pages of randomly spread letters, each with a predefined rectangle recording which letter it is. If you can program, that shouldn't be hard. Then generate many pages containing the letters.

  • @techgalaxy100 2 years ago

    Thanks for the tutorial. How do I train data for the Urdu and Arabic languages? What would the font properties be? I have an Urdu font and hundreds of Urdu samples in JPG format. No clue where or how to start.

  • @人榜鄭 1 year ago

    Thank you, I hope there will be a lesson 3~~

  • @promaster6310 2 years ago +1

    I'm trying to follow this video, and in step 5 I get this error: "Warning: No shape table file present: shapetable"
    What is happening here?

    • @samuelbastias3752 1 year ago

      Hey, did you ever figure it out? I'm getting the same error message.

    • @Faruk_ck 14 days ago

      @samuelbastias3752 I think running the commands with administrator permissions and deleting the older files will fix your issues.

  • @waynewu7763 1 year ago

    I get an error at the last step, when using it to read the image. It says: error opening data file, make sure the TESSDATA_PREFIX environment variable is set to the tessdata directory. But I already added Program Files\Tesseract-OCR to my PATH environment variable. Can you help with this?
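
The error message in this comment names TESSDATA_PREFIX, which is a different variable from PATH: PATH lets the shell find the tesseract executable, while TESSDATA_PREFIX tells Tesseract where to look for .traineddata files. The sketch below, with hypothetical helper names, checks that the trained file is actually in the chosen directory and builds an environment with the variable set. Whether the variable should point at the tessdata directory itself or at its parent has differed between Tesseract versions, so treat the value used here as an assumption to verify.

```python
import os
from pathlib import Path


def traineddata_path(lang, tessdata_dir):
    """Return the path of [lang].traineddata in tessdata_dir, or None if absent."""
    candidate = Path(tessdata_dir) / f"{lang}.traineddata"
    return candidate if candidate.is_file() else None


def env_with_tessdata(tessdata_dir, base_env=None):
    """Copy an environment mapping and set TESSDATA_PREFIX in the copy."""
    env = dict(os.environ if base_env is None else base_env)
    env["TESSDATA_PREFIX"] = str(tessdata_dir)
    return env


# Directory taken from the video description; adjust for your installation.
tessdata = r"C:\Program Files\Tesseract-OCR\tessdata"
print(traineddata_path("train", tessdata))
```

The returned mapping can be passed as `env=` to `subprocess.run(["tesseract", "Test2.png", "stdout", "-l", "train"], env=...)`.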

  • @hasanaqbayli149 2 years ago

    Thanks @ The Code... Not all files were generated! What could be the issue?

  • @gokulkumar8232 1 year ago

    I am trying to train Tesseract on a Linux machine, and I am getting a segmentation fault in step 5.

  • @javierpachon4424 2 years ago

    Thanks for your tutorial. I want to capture an email address from an image, but Tesseract does not recognize the @ symbol. How can I solve this?

  • @smklearn-hy9me 8 months ago

    Hey, how can I combine two traineddata files into a single traineddata file?

  • @devartimahakalkar8822 2 years ago +1

    Hi, I am getting an error while training the data. Could you please tell me which Tesseract version you are using?

  • @shrutisalunkhe1873 1 year ago

    Thank you for your video. It was very useful. Can you please share the next part too?

    • @hritiktyagi7043 1 year ago

      Hey! Have you finished your work with Tesseract, or are you still working on it?

  • @youtubehodol3386 8 months ago

    I'm facing this error:
    'tesseract' is not recognized as an internal or external command,
    operable program or batch file.

    • @dalinsixtus6752 7 months ago

      Set the path correctly: search for "path" in the Windows search, open the environment variables, edit the Path variable, and add a new entry (e.g. c:/programfiles/tesseractocr).
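
A quick way to confirm that the PATH change took effect is to ask which of the tools used in the description can be found. This is a minimal sketch using Python's standard `shutil.which`; `missing_tools` is a hypothetical helper, and the tool list is simply the commands named in the description.

```python
import shutil

# Executables used in the steps of the video description
TOOLS = ["tesseract", "unicharset_extractor", "mftraining",
         "cntraining", "combine_tessdata"]


def missing_tools(tools=TOOLS):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


not_found = missing_tools()
if not_found:
    print("Not on PATH:", ", ".join(not_found))
else:
    print("All training tools were found on PATH.")
```

If a name is reported as missing, open a new terminal after editing the environment variable, since already open windows keep the old PATH.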

  • @shikhugupta2703 1 year ago

    How can we train the model with some specific user's handwritten data?

  • @Rocketos 1 year ago

    Where is part 3?

  • @oguzhanylmaz4586 2 years ago

    It cannot find letters on geometric shapes. How can I solve this?

  • @subramanyagopalbellary9845 1 year ago

    Hi, I'm getting this error while creating the .tr file: "APPLY_BOXES: boxfile line 6/25 ((421,1325),(494,1378)): FAILURE! Couldn't find a matching blob". If anyone knows how to solve it, please share the solution.
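
The error above points at one line of the box file, so it can help to look at that line's coordinates directly. Each box file line has the form `symbol left bottom right top page`, with coordinates measured from the bottom-left corner of the image; the two pairs in the error message appear to be (left, bottom) and (right, top). The parser below is a minimal sketch, and the symbol "T" in the sample line is made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one 'symbol left bottom right top page' line of a box file."""
    parts = line.rstrip().rsplit(None, 5)
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    symbol, *numbers = parts
    return Box(symbol, *(int(n) for n in numbers))


# Coordinates from the error message in the comment; the symbol is hypothetical.
box = parse_box_line("T 421 1325 494 1378 0")
print(box, "width:", box.right - box.left, "height:", box.top - box.bottom)
```

Comparing the parsed rectangle with what jTessBoxEditor shows for the same line may reveal a box that sits on empty space or spans more than one character, which is the kind of mismatch this message usually points to.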

  • @Leo-hk7kk 11 months ago

    How can I use this custom-trained Tesseract model with YOLOv8 to recognize license plate numbers?
    Please help.

    • @dalinsixtus6752 7 months ago

      Did you find the solution?

    • @Leo-hk7kk 7 months ago

      @dalinsixtus6752 No, sir.

  • @muhammadfikriassegaf9147 2 years ago

    Thanks for the tutorial. Can you help me? After doing step 4, there is no font_properties file. I'm running this on Raspbian.

    • @DammIhateThisName 2 years ago

      Same on Windows 11.

    • @Someone-On-The-Internet 2 years ago +1

      @DammIhateThisName The description says echo "arial 0 0 1 0 0" [angled bracket] font_properties
      You need to use echo "arial 0 0 1 0 0" > font_properties
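
Since the shell redirection and quoting in step 4 trip up several people, the font_properties file can also be written without `echo`. This is a minimal Python sketch with hypothetical helper names; it follows the field order given in the description (fontname, italic, bold, monospace, serif, fraktur) and writes one line per font.

```python
def font_properties_line(fontname, italic=0, bold=0, monospace=0, serif=0, fraktur=0):
    """Build one font_properties line: 'fontname italic bold monospace serif fraktur'."""
    if not fontname or any(ch.isspace() for ch in fontname):
        raise ValueError("fontname must be non-empty and contain no whitespace")
    flags = [italic, bold, monospace, serif, fraktur]
    if any(flag not in (0, 1) for flag in flags):
        raise ValueError("every flag must be 0 or 1")
    return " ".join([fontname] + [str(int(flag)) for flag in flags])


def write_font_properties(path, lines):
    """Write the given lines to `path`, one per line, with no quotes added."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write("\n".join(lines) + "\n")


# Same values as the example in the description
print(font_properties_line("arial", monospace=1))
```

Calling `write_font_properties("font_properties", [font_properties_line("arial", monospace=1)])` produces the same file content as the description's example on any operating system.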

  • @cuongo1094 9 months ago

    Thank you very much.

  • @nurahmadmiftahudin8314 2 years ago +3

    Why does my Tesseract only read the .tr file but not write the pffmtable, inttemp, and normproto files?

    • @haziqidrose4336 2 years ago

      Have you found the solution, bro?

    • @haziqidrose4336 2 years ago

      I'm having the same problem.

    • @nurahmadmiftahudin8314 2 years ago +1

      Yes, I used Tesseract v4.0.0 and it works fine.

    • @aiesewss6659 2 years ago +1

      Use Tesseract v4.0.0 and make sure the eng.traineddata file is present in the tessdata folder.

    • @DiscworldZA 2 years ago +1

      I tried running mftraining but it never finishes. Any fix for this?

  • @burgir180 3 years ago

    Thanks!

  • @eltradermexicano 2 years ago

    What is your Tesseract version?

  • @dalipsingh8583 3 years ago

    👍👍👍

  • @Kenbreg 1 year ago +1

    Is this some sort of joke? You downloaded jTessBoxEditor and then did the whole process in a command line. What the hell is the purpose of jTessBoxEditor then??

    • @jagajitrabha1276 8 months ago +1

      To edit the bounding boxes. You can add bounding boxes wherever necessary when training for new languages.

    • @larabassabah202 7 months ago +1

      You need jTessBoxEditor to correct the box data, because if you train before correcting it, the training will fail.

  • @d0ugparker 2 years ago +1

    Excellent, thank you.
    At 1:16, an incidental note on pronunciation: the "v" in "converting" is a voiced "f" sound, rather than any "w"-related sound.
    "v" is positioned next to "w", but that's misleading; they don't sound alike. Their sound production is different.
    "v" is more closely related to "f". Say the word "fee." Make and hold the "f" sound. Then, while holding the "f" sound, hum.
    "v" is a vibrating "f".
    Regards

  • @siux94 2 years ago

    This is the old way, pre-Tesseract 4, not for the LSTM network.
    Classic Indian YouTuber.

  • @jehangirkhankhattak8002 1 year ago

    When I copy and paste this command into cmd: tesseract train.my.exp0.tif train.my.exp0 batch.nochop makebox, it says that it doesn't recognize it.