Training Tesseract 5 for a New Font

  • Published 4 Dec 2024

COMMENTS • 177

  • @taylorbarnes6151
    @taylorbarnes6151 1 year ago +16

    God I love you. I just recently started messing with OCR's, specifically Tesseract, and I was reading through some documentation on the steps and after a few hours just wanted to end my life hahahaha. Thank you for this, this is extremely encouraging. I can't wait to try this!

  • @AchievementHuntGuru
    @AchievementHuntGuru 4 months ago +1

    This video is the only source on training that actually gets you results if you follow it! Many thanks for this video!

  • @buny0n
    @buny0n 9 months ago +18

    Tesseract's documentation is abysmal.

    • @nikolaikrot8516
      @nikolaikrot8516 8 months ago +1

      I tend to think about tesseract documentation as the Augean Stables

  • @donjuanpond1
    @donjuanpond1 4 months ago +1

    thank you so much man. I've been looking everywhere for a tesseract tutorial, it all just points to the shitty unreadable docs. Without you I don't know where I'd be

  • @fivalt126
    @fivalt126 7 months ago +1

    I was racking my brain trying to understand the official tutorial, and you explain it in a simple way. I'm your subscriber number 666. Thank you very much.

  • @yichenyao5927
    @yichenyao5927 8 months ago +2

    I think the word error rate is high because the font doesn't distinguish uppercase from lowercase (it's all uppercase), but the ground truth labels do distinguish between the two.
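
To make the point about case concrete, here is a minimal sketch that compares a case-sensitive and a case-insensitive word error rate. The sample strings and the function names are made up for illustration; this is not how Tesseract itself computes BWER during training, only a plain word-level edit distance.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words here)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def word_error_rate(truth, ocr, ignore_case=False):
    """Word-level edit distance divided by the number of reference words."""
    if ignore_case:
        truth, ocr = truth.casefold(), ocr.casefold()
    ref, hyp = truth.split(), ocr.split()
    if not ref:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical sample: mixed-case ground truth, all-caps OCR output.
truth = "Kill Leader has been eliminated"
ocr = "KILL LEADER HAS BEEN ELIMINATED"

strict_wer = word_error_rate(truth, ocr)
relaxed_wer = word_error_rate(truth, ocr, ignore_case=True)
print(f"case-sensitive WER:   {strict_wer:.2f}")
print(f"case-insensitive WER: {relaxed_wer:.2f}")
```

If the two numbers differ a lot on your own data, uppercasing the ground truth text before training may be worth trying.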

  • @45545videos
    @45545videos 1 year ago +2

    Haven't watched the video yet, but if this works, you'll have my eternal gratitude

  • @wojd_
    @wojd_ 1 year ago

    Great tutorial. Using WSL I was constantly getting new errors. Switching to an OS installed in VirtualBox solved it. I was able to train on my dataset; it's surprisingly easy.

    • @heetshah9394
      @heetshah9394 1 year ago

      Could you help me with the directory structure? I am a bit confused about how it is set up.

  • @ConfusedProgrammer
    @ConfusedProgrammer 10 months ago +2

    I've been experimenting with this tutorial for three days. The file structure and the GitHub repo don't quite match; can you please update the repo if possible? I am running into too many folder inconsistencies when trying to connect the dots here, as that part was brushed over really quickly. Thank you :)

  • @madhavpandey30
    @madhavpandey30 1 year ago +3

    Hey Gabriel, I am following your steps to train my model on handwritten text, but it always fails with this error:
    unicharset_extractor --output_unicharset "data/Apex/my.unicharset" --norm_mode 2 "data/Apex/all-gt"
    Failed to read data from: data/Apex/all-gt
    Wrote unicharset file data/Apex/my.unicharset
    Can you please help me here? I am stuck. Thanks!

  • @ganeshrajv130
    @ganeshrajv130 1 year ago +1

    The title says new font; can I take it to apply to a new language too? I'm using TIFF.

  • @ganeshrajv130
    @ganeshrajv130 1 year ago +2

    One last question: I guess Tesseract is not trained on handwritten text, and is instead trained on lines of system (rendered) text that are converted to images line by line. Is my assumption true?

    • @dhirazz
      @dhirazz 1 year ago

      Hey, it seems like you were also looking to train Tesseract with handwritten text. Did you do it? If so, please shed some light, I am so lost.

    • @ganeshrajv130
      @ganeshrajv130 1 year ago

      @dhirazz Training is not an easy thing, as you need a huge amount of data, and they (Google) also clearly said that training is not going to make any sense. Hence, if you want to try adjusting the parameters, deep dive into the C++ code.

  • @Ayaangaddam
    @Ayaangaddam 9 months ago

    Thank you for doing this tutorial. Can I use the text2image approach to generate box files and tif files to train a new font for Tesseract 4.0?

  • @aayushjain7793
    @aayushjain7793 2 years ago +3

    While running the script 'split_training_text.py', I am getting the following error:
    Fontconfig warning: "/tmp/fonts.conf", line 4: empty font directory name ignored
    Could you help me how to resolve this?

    • @jayrigger7508
      @jayrigger7508 2 years ago

      I am also getting this. Running as sudo helped a bit, but I am still getting this: "Unable to open '../tmp/fonts.conf' for writing: No such file or directory"

    • @jayrigger7508
      @jayrigger7508 2 years ago

      Just to add: I am getting eng_XX.box, eng_XX.tiff and eng_xx.gt.txt

    • @aayushjain7793
      @aayushjain7793 1 year ago

      @jayrigger7508 I have resolved the issue by just changing the --font flag to /usr/share/fonts

    • @goksel9908
      @goksel9908 3 months ago

      @aayushjain7793 you mean,
      '--font= Apex',
      you changed this to
      '--font= /usr/share/fonts/Apex',
      this?

  • @ganeshrajv130
    @ganeshrajv130 1 year ago +1

    I tried with this font for the Hindi language (Kruti Dev 010), and even tried with Kruti Dev 016, but it's showing: Error: Call PrepareToWrite before WriteTesseractBoxFile!!

  • @shadyas.1571
    @shadyas.1571 1 year ago +2

    Hi Gabriel.
    Thank you for this tutorial.
    I was trying to run the code but I'm receiving this error:
    Fontconfig error: Cannot load default config file: No such file: (null)
    This error appears to be font-related. I've experimented with several fonts but I'm unable to resolve this issue.
    Could you help me please?

    • @kavachek2
      @kavachek2 1 year ago

      Same problem.

    • @pauliusliaudenskas9269
      @pauliusliaudenskas9269 10 months ago

      Have you been able to figure it out? I'm having the same problem

    • @kavachek2
      @kavachek2 10 months ago

      @pauliusliaudenskas9269 Unfortunately, I couldn't. I don't understand how to do it.

  • @adityanjsg99
    @adityanjsg99 1 year ago +1

    So far this is the only tutorial on Tesseract 5; the old bash-script training method has been abandoned since December 2022.

    • @faint.2396
      @faint.2396 1 year ago

      So, are you saying this video is now not useful at all?

  • @DalvinderKaur-iz5sn
    @DalvinderKaur-iz5sn 1 year ago +1

    The .lstmf files are missing. Please help me find where I went wrong.

  • @wonkduck4759
    @wonkduck4759 1 year ago

    Hi Gabriel! Thank you so much for the video. One question: where did you put your Apex Legends ttf file in the code directory, i.e. where should it be placed? I have a custom font ttf file that I want to train on.

  • @ganeshrajv130
    @ganeshrajv130 1 year ago +1

    If I have line-wise handwritten images for any language, with bounding boxes and the words, can I train this LSTM network on them? Will it work? And could you share your thoughts on the backbone of the LSTM architecture, with a flow diagram showing how fonts help with the training data?

  • @akshatjain2925
    @akshatjain2925 10 months ago +1

    Hi, when you say we are using text2image and nothing AI, text2image must also be some kind of model, right?

  • @listentomusicfeellikehome
    @listentomusicfeellikehome 6 months ago +1

    Hi. I tried this on Colab. I installed Tesseract and went on to run split_training_text.py, and got this error: FileNotFoundError: [Errno 2] No such file or directory: 'text2image'. Is there a solution?

  • @umandadikwatta178
    @umandadikwatta178 2 years ago +1

    Thank you very much for this. One question. Can we train Tesseract with non unicode fonts using the same process?

    • @AstuteJoe
      @AstuteJoe 2 years ago

      I'm pretty sure, as long as text2image works correctly. If text2image doesn't work correctly, you can either come up with another clever way (like Python scripts) of automatically generating the ground truth data (.gt.txt, .box and .tif files) or, worst case, create them manually.

  • @3ombieautopilot
    @3ombieautopilot 2 years ago +1

    Thank you for making this video. But I can't wrap my head around where to put all those data files. I'm trying to fine-tune variations of letters with accents, and I'm helpless.

  • @ManuthVANN
    @ManuthVANN 10 months ago

    Thank you so much, sir, for your clear explanation and code.

  • @Leo-hk7kk
    @Leo-hk7kk 1 year ago

    I want to custom-train Tesseract 5 to read the license plates of cars that are detected using a YOLO model. How can I do this, given that I have a couple of thousand images? Help.
    What are the steps I need to follow?

  • @ivanmongebadilla9454
    @ivanmongebadilla9454 2 years ago +1

    Thanks for the tutorial, Gabriel. I wanted to ask how I could do this process if I already have images of the text? I guess I need to create the .txt file and the .box file and then just run the training command.
    Do you know any software that I could use to create the .box file from the images I have?
    Thanks in advance!

    • @AstuteJoe
      @AstuteJoe 2 years ago

      I have seen people use the jTessBoxEditor: vietocr.sourceforge.net/training.html

    • @ivanmongebadilla9454
      @ivanmongebadilla9454 2 years ago

      @AstuteJoe One more question: how would you use the newly trained model in Python?
      Thank you

    • @AstuteJoe
      @AstuteJoe 2 years ago +1

      @ivanmongebadilla9454 I think just a parameter lang='your_new_model_name', as long as the new model is in the tessdata folder
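
The `lang=` parameter mentioned above presumably refers to a wrapper such as pytesseract (`image_to_string(image, lang="...")`). As a dependency-free alternative, here is a rough sketch that calls the `tesseract` command line from Python instead. The image name, model name and tessdata path are placeholders; `-l`, `--tessdata-dir` and `--psm` are standard tesseract options, and the helper function names are made up.

```python
import subprocess


def build_ocr_command(image_path, model_name, tessdata_dir=None, psm=None):
    """Build a tesseract CLI call that prints the recognized text to stdout."""
    cmd = ["tesseract", image_path, "stdout", "-l", model_name]
    if tessdata_dir is not None:
        # Folder that contains <model_name>.traineddata
        cmd += ["--tessdata-dir", tessdata_dir]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def run_ocr(image_path, model_name, **options):
    """Run tesseract and return its text output (needs tesseract on PATH)."""
    cmd = build_ocr_command(image_path, model_name, **options)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical file, model and folder names.
command = build_ocr_command(
    "screenshot.png", "Apex", tessdata_dir="../tesseract/tessdata", psm=7
)
print(" ".join(command))
```

Calling `run_ocr("screenshot.png", "Apex")` only works once `Apex.traineddata` sits in a tessdata folder that tesseract can find.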

    • @heetshah9394
      @heetshah9394 1 year ago

      Is it necessary for the box_file to be for each character or is it okay for it to be one word per bounding box?

  • @DalvinderKaur-iz5sn
    @DalvinderKaur-iz5sn 1 year ago

    When Tesseract training starts, it shows the warning below:
    Can't encode transcription: 'पिए वई। ज़ख़मनि जो सूर वधंदो वियो हू चीखन्दो for Sindhi
    How can I handle this problem?

  • @Bengeljo
    @Bengeljo 9 months ago +1

    I always get an error when I want to use a font. It is installed, Windows can find it, and even looking it up works perfectly. When I run split_training_text.py I get the following error:
    Fontconfig error: Cannot load default config file: No such file: (null)
    Fontconfig error: Cannot load default config file: No such file: (null)
    Could not find font named 'Quadrant'.
    Pango suggested font 'Cascadia Code'.
    Please correct --font arg.
    I want to train the model on Quadrat-Serial-Regular.ttf, but it just won't recognize it. I tried to look it up but can't find it. Modifying the font flag doesn't help, since it wants a name but can't find it even though it is there; to be honest, I don't know where it is searching for the fonts.
    The folder is located on the SSD (E:) and the operating system is on C:, but Tesseract and Python are on the PATH on C:, so they should have access to it. Please help.

    • @TheComputerChip
      @TheComputerChip 8 months ago +1

      Having the same problem. Still trying to understand what it is looking for...

    • @Bengeljo
      @Bengeljo 8 months ago +1

      @TheComputerChip I gave up and looked at another method that uses Google Colab to create my own model; there it works pretty well. I don't remember which video it was, because between then and now I have probably watched approximately 250 videos. Not kidding, I don't have a life.

    • @TheComputerChip
      @TheComputerChip 8 months ago +2

      @Bengeljo hahaha no worries. I actually ended up getting this to work. The error doesn’t seem to affect the output oddly enough. As long as it finds the font everything still runs. Currently waiting as my PC generates the images and then I’ll sleep as it trains. On video #3 since starting the image creation! lol

    • @ROHIT_S_Patil
      @ROHIT_S_Patil 5 months ago

      @Bengeljo Can you share the Google Colab workflow you followed to create your model?

  • @YashhBhushan
    @YashhBhushan 5 months ago

    Buddy, I need help. I need to learn this software, but I'm absolutely clueless. Are there any sources, tutorials or videos I can watch?

  • @saviomilbratz
    @saviomilbratz 4 months ago +1

    Training Tesseract is almost an impossible task.
    There could be an easier way, just using Python or something simpler.
    For a regular Windows user like me, this task is almost impossible.

  • @AmphibianDev
    @AmphibianDev 1 year ago +1

    Hi, I am having issues with the last make training command. It throws an error: "No module named 'PIL'".
    I have the Pillow library installed, but the error is still there. I have been trying to solve this issue for a long, long time.
    If you know something, I would appreciate the help. I wanted to link to my GitHub issue, but I am afraid YouTube doesn't allow links.

    • @mohammadmn7364
      @mohammadmn7364 10 months ago

      Hey, a long time has passed, but for others having the same issue: creating a virtual env and then installing requirements.txt (from the tesstrain repo) in it may fix the issue; at least for me it worked! Also check whether every txt file has a matching box file.
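
The last check (every text file having its companion files) is easy to automate. Below is a minimal sketch, assuming the ground truth lives in one folder and uses the `.gt.txt` / `.box` / `.tif` naming seen elsewhere in this thread; the function name and the folder path are hypothetical.

```python
from pathlib import Path

GT_SUFFIX = ".gt.txt"


def find_incomplete_samples(ground_truth_dir, required=(".box", ".tif")):
    """Map each sample that lacks a companion file to its missing suffixes."""
    folder = Path(ground_truth_dir)
    incomplete = {}
    for gt_file in sorted(folder.glob("*" + GT_SUFFIX)):
        base = gt_file.name[: -len(GT_SUFFIX)]  # "eng_0.gt.txt" -> "eng_0"
        missing = [s for s in required if not (folder / (base + s)).exists()]
        if missing:
            incomplete[base] = missing
    return incomplete


if __name__ == "__main__":
    # Hypothetical folder following the MODEL_NAME-ground-truth pattern.
    for base, missing in find_incomplete_samples("data/Apex-ground-truth").items():
        print(f"{base}: missing {', '.join(missing)}")
```

An empty result means every `.gt.txt` file has its `.box` and `.tif` neighbours; it says nothing about whether their contents are valid.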

  • @gyeongwango5434
    @gyeongwango5434 1 year ago

    I want to train tesseract with an image file I have (consisting of several lines of text), but I'm not sure how to go about it, starting with creating the train data. I'd really appreciate your tips (URLs for reference, etc).

  • @ganeshrajv130
    @ganeshrajv130 1 year ago

    I tried with your font but it didn't work. It throws: Could not find font named 'Arial Unicode MS Regular'.
    Pango suggested font 'Liberation Mono'. I tried with Arial too, but it didn't work.

  • @Bobo-wl6bs
    @Bobo-wl6bs 1 year ago

    Hi Gabriel. I came across Tesseract today. I'm curious: will I be able to train it to learn an Arabic font? I have a bunch of PDFs which are written in an indigenous language. The idea here is to train it on some sample pages so that it will be able to read them. The text includes diacritics, so I'm not sure if it will work.

    • @AstuteJoe
      @AstuteJoe 1 year ago

      Check the comments, a bunch of people train it for this exact intent

  • @azadehpedram7215
    @azadehpedram7215 9 months ago

    I have a bunch of plates with some text on them, and the goal is to convert the images to text. I trained a special font, but it is not effective. How can I train for a better result? Thanks for the help.

  • @ganeshrajv130
    @ganeshrajv130 1 year ago

    Can we train Tesseract without any font? If not, why can't we?

  • @PsychologicalHeat
    @PsychologicalHeat 2 years ago +1

    I am receiving this error when I try to run your command:
    Failed to read boxes from data/myFont-ground-truth/eng_45.tif
    Error during processing.
    make: *** [data/myFont-ground-truth/eng_45.lstmf] Error 1
    TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=myFont START_MODEL= eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
    I have added eng.traineddata to tessdata. Can you help me fix it, please?

    • @AstuteJoe
      @AstuteJoe 2 years ago +1

      Did you generate the .box files successfully?

    • @PsychologicalHeat
      @PsychologicalHeat 2 years ago

      @AstuteJoe I cleaned the box files, but now I get a different error.
      Here is my output:
      + tesseract data/myFont-ground-truth/eng_2.tif data/myFont-ground-truth/eng_2 --psm 13 lstm.train
      read_params_file: Can't open lstm.train
      + tesseract data/myFont-ground-truth/eng_0.tif data/myFont-ground-truth/eng_0 --psm 13 lstm.train
      read_params_file: Can't open lstm.train
      + tesseract data/myFont-ground-truth/eng_5.tif data/myFont-ground-truth/eng_5 --psm 13 lstm.train
      read_params_file: Can't open lstm.train
      + tesseract data/myFont-ground-truth/eng_7.tif data/myFont-ground-truth/eng_7 --psm 13 lstm.train
      read_params_file: Can't open lstm.train
      + tesseract data/myFont-ground-truth/eng_3.tif data/myFont-ground-truth/eng_3 --psm 13 lstm.train
      read_params_file: Can't open lstm.train
      + tesseract data/myFont-ground-truth/eng_1.tif data/myFont-ground-truth/eng_1 --psm 13 lstm.train
      read_params_file: Can't open lstm.train
      find -L data/myFont-ground-truth -name '*.lstmf' | python3 shuffle.py 0 > "data/myFont/all-lstmf"
      Error: missing ground truth for training
      make: *** [data/myFont/list.train] Error 1
      Your help will be very appreciated 🙂

    • @AstuteJoe
      @AstuteJoe 2 years ago

      @PsychologicalHeat Did you generate the .gt.txt files? Those are text files with the actual text in them

    • @PsychologicalHeat
      @PsychologicalHeat 2 years ago

      @AstuteJoe Yes, I have all the gt.txt, .box, and .tiff files
      I think the problem is that I want the ocr to read only uppercase letters?
      I have made a custom training_text file and it only has numbers, '-' and uppercase letters.
      I played around with it and now this is the output:
      find -L data/myFont-ground-truth -name '*.gt.txt' | xargs paste -s > "data/myFont/all-gt"
      unicharset_extractor --output_unicharset "data/myFont/unicharset" --norm_mode 2 "data/myFont/all-gt"
      Bad box coordinates in boxfile string! 36-XR-34928-PN-54460-TN-50758-XB-02919-JP-10263-DG-99350-MF-07358-PK-31144-MB-35731-ZX-758
      Extracting unicharset from plain text file data/myFont/all-gt
      Other case x of X is not in unicharset
      Other case r of R is not in unicharset
      Other case p of P is not in unicharset
      Other case n of N is not in unicharset
      Other case t of T is not in unicharset
      Other case b of B is not in unicharset
      Other case j of J is not in unicharset
      Other case d of D is not in unicharset
      Other case g of G is not in unicharset
      Other case m of M is not in unicharset
      Other case f of F is not in unicharset
      Other case k of K is not in unicharset
      Other case z of Z is not in unicharset
      Wrote unicharset file data/myFont/unicharset
      make: *** No rule to make target `data/myFont-ground-truth/myFont_1.lstmf', needed by `data/myFont/all-lstmf'. Stop.

  • @kallemyllynen9571
    @kallemyllynen9571 10 months ago

    Running this on Windows I had to modify the Makefile to make it work

  • @Ethiopic
    @Ethiopic 1 year ago

    Thank you for this video. I am now able to train Tesseract to OCR my language data on the Mac. This is working great on both Linux and the Mac. (But I am unable to do so on Windows, because I get the error "tessdata_prefix not recognized".)

    • @wonkduck4759
      @wonkduck4759 1 year ago

      Hello, I am currently stuck. Where did you put your new font ttf file in the code directory, i.e. where should it be placed? I have a custom font ttf file that I want to train on.

    • @alirezanadafy9267
      @alirezanadafy9267 1 year ago

      Hi
      Just run:
      set TESSDATA_PREFIX="../tesseract/tessdata"
      and then run the text2image....
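
If the tools are launched from a Python script (as split_training_text.py does for text2image), the variable can also be set there instead of in the shell. This is only a sketch: `training_env` is a made-up helper, and the relative path is the one used in this thread. One thing to watch in cmd.exe is that `set NAME="value"` stores the quotes as part of the value, which may matter here.

```python
import os
from pathlib import Path


def training_env(tessdata_dir, base_env=None):
    """Return an environment copy with TESSDATA_PREFIX as an absolute path."""
    env = dict(os.environ if base_env is None else base_env)
    env["TESSDATA_PREFIX"] = str(Path(tessdata_dir).resolve())
    return env


env = training_env("../tesseract/tessdata")
print(env["TESSDATA_PREFIX"])
# Hand it to the tool you launch, e.g. subprocess.run(cmd, env=env).
```

Resolving to an absolute path avoids surprises when the script is started from a different working directory.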

  • @umandadikwatta178
    @umandadikwatta178 1 year ago

    Hello, Can you please explain how to debug the Tesseract code, to get an idea on how the code works ?

    • @AstuteJoe
      @AstuteJoe 1 year ago

      Honestly, I think your best bet is cloning the GitHub repo, reading the docs and then delving into the code, just reading it. Eventually you'll get better at knowing where to look, and after trying hard you might become comfortable with it and understand it. I'm pretty sure the docs describe how to dump and inspect debug files for some intermediate steps. Finally, be sure to run it in verbose mode, probably -v. Ah, and you can compile it with debugging symbols too, which should help if you want to set breakpoints etc.

  • @NotFlashYT
    @NotFlashYT 1 year ago

    How do you get suggestions in your terminal for auto-completion of commands?

  • @hoangcuong9521
    @hoangcuong9521 9 months ago

    Thank you for making this video. It helps me a lot. But I have a problem: when I copy the script and replace the paths for the save dir or the language_code..training_text file, all of the generated images come out as blank white images. Please help me out with this.

  • @mukilanru
    @mukilanru 4 months ago

    I want to be able to OCR '±' which is being detected as '+'.
    tesseract 5.4.0.20240606
    pytesseract 0.3.10
    python 3.12

  • @nilor7550
    @nilor7550 1 year ago

    I didn't understand how to run the training command after downloading the two folders from github. I have Windows system

    • @physicfor
      @physicfor 4 months ago

      It will never work on Windows

  • @farazsoftinfo
    @farazsoftinfo 2 years ago +1

    Hi Gabriel,
    Thanks for making this tutorial, I was waiting for it.
    I will start training my model soon. 😍
    But how can we fine-tune a model?
    Can you please show me how I can combine this newly trained file with another model?

    • @AstuteJoe
      @AstuteJoe 2 years ago

      Glad you liked it! In this tutorial you can see I actually fine-tuned, I started on the eng.traineddata file from Tesseract and trained it further on a new font, this should be enough for most cases.

    • @farazsoftinfo
      @farazsoftinfo 2 years ago

      @AstuteJoe Hi Gabriel, when I fine-tune I get a very bad result. I just want to add some new words and some characters, but the final file that I get is worse than the main traineddata file.
      I'm trying to fine-tune an RTL language.
      Thanks a lot.

    • @AstuteJoe
      @AstuteJoe 2 years ago

      @farazsoftinfo That's a very different rabbit hole; that's ML technique. You might be overfitting (training too much) or underfitting (training too little) your model. Have you tried generating all the 193k PDFs to train on and leaving it to train for a bit?

    • @gabriel2011gabriel
      @gabriel2011gabriel 1 year ago

      @farazsoftinfo I'm trying to do the same thing and the result is a bunch of "mmmoooomom...". Is yours the same?

    • @farazsoftinfo
      @farazsoftinfo 1 year ago +1

      @gabriel2011gabriel I tried it for Persian, but I couldn't get a good result. The main models are still better than what I got. When I try to add some new words and fonts I get a worse model. Maybe I should check it more to figure out the best settings that work for the RTL languages.

  • @KINGERTADC_yay
    @KINGERTADC_yay 1 year ago

    Hey Gabriel, nice vid. I am actually using it to train Tesseract on the Aurebesh font/language from Star Wars (look it up, it would explain a lot); each letter has a corresponding English letter. I have collected roughly 100,000 sentences using your program and trained with the command you provided, but when I run it on a 6-letter word it completely melts down and just outputs an incorrect answer. I have tried both small and large iteration counts, but no luck. I am wondering if you can help me or point me in the right direction. Thanks a lot.

    • @ganeshrajv130
      @ganeshrajv130 1 year ago +1

      Hey, you collected the font, but what is the training text data? Is it Aurebesh?

    • @kinderpinguiin7064
      @kinderpinguiin7064 1 year ago

      Hi! I don't know if you have resolved your issue in the past month, but don't forget to set a large MAX_ITERATIONS for make training. I personally set it to 10000 and it was quite a bit better; it might well be enough for you if you have 100000 sentences. If you want to follow the result, check the log while the model is training, for example:
      At iteration 7800/7800/7800, Mean rms=5.642000%, delta=49.022000%, BCER train=97.817000%,
      BWER train=100.000000%, skip ratio=0.000000%, New best BCER = 97.817000 wrote checkpoint.
      BCER is the error rate for characters and BWER the error rate for words, you can see that at iteration 7800 it was higher than 95% and after the 9500th iteration I got several improvements.
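
To follow these numbers over a long run without reading the log by eye, something like the sketch below could pull them out. It assumes log lines shaped like the sample quoted above; the regular expression and the function name are made up and may need adjusting for other Tesseract versions.

```python
import re

LOG_PATTERN = re.compile(
    r"At iteration (?P<iteration>\d+)/\d+/\d+,"
    r".*?BCER train=(?P<bcer>[\d.]+)%,"
    r"\s*BWER train=(?P<bwer>[\d.]+)%",
    re.DOTALL,  # tolerate entries that wrap onto a second line
)


def parse_progress(log_text):
    """Return (iteration, BCER %, BWER %) tuples found in the training log."""
    return [
        (int(m["iteration"]), float(m["bcer"]), float(m["bwer"]))
        for m in LOG_PATTERN.finditer(log_text)
    ]


# The log excerpt quoted in the comment above.
sample = (
    "At iteration 7800/7800/7800, Mean rms=5.642000%, delta=49.022000%, "
    "BCER train=97.817000%, BWER train=100.000000%, skip ratio=0.000000%, "
    "New best BCER = 97.817000 wrote checkpoint."
)
progress = parse_progress(sample)
print(progress)
```

A BCER that stays near 100% for thousands of iterations, as in this excerpt, means the model is still getting almost every character wrong.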

  • @insidethoughts502
    @insidethoughts502 2 years ago

    Can Tesseract 5 help with detecting only bold text in images?

    • @AstuteJoe
      @AstuteJoe 2 years ago

      Only experimentation will tell, but Tesseract 5 does perform better sometimes

  • @cryptoplusone3850
    @cryptoplusone3850 2 years ago

    does this also work on windows or do i have to use a different method?

    • @AstuteJoe
      @AstuteJoe 2 years ago

      I believe it works, but definitely not with every step exactly as in the video. As far as I remember, the Tesseract maintainers highly recommend Linux instead.

    • @focusofLandD
      @focusofLandD 2 years ago

      I tried on Windows, not working very well, pls let me know if you are able to solve it

  • @DalvinderKaur-iz5sn
    @DalvinderKaur-iz5sn 1 year ago

    When I run the training command, it gives me the error below:
    Segmentation fault (core dumped) tesseract "data/Apex-ground-truth/eng_62.tif" data/Apex-ground-truth/eng_62 --psm 13 lstm.train
    Makefile:262: recipe for target 'data/Apex-ground-truth/eng_62.lstmf' failed
    make: *** [data/Apex-ground-truth/eng_62.lstmf] Error 139
    Can you help me to fix this?

    • @xzerozdead
      @xzerozdead 1 year ago

      Your folder was probably named "Apex" and not "Apex-ground-truth"

  • @IshaqKhan010
    @IshaqKhan010 1 year ago

    Brother, can you train for the Urdu Nastaliq font? There is no accurate trained data on the net, please.

  • @DalvinderKaur-iz5sn
    @DalvinderKaur-iz5sn 2 years ago

    Thanks for the tutorial, sir. I get an error after running the training command: TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000. The error is:
    "CMakefile:325: recipe for target 'data/foo/checkpoints/foo_checkpoint' failed", and also "Encoding of string failed! Failure bytes.... ..Can't encode transcription: .....". Can you please help me with these issues?

  • @Schwartz999
    @Schwartz999 8 months ago

    When running your python script, an error occurs:
    Fontconfig error: Cannot load default config file
    Fontconfig error: Cannot load default config file
    Could not find font named 'Waukegan LDO Bold'.
    Please correct --font arg.
    How can I solve this error? I need to use my unique font "Waukegan LDO Bold.ttf"
    I hope you can help me to solve this problem, thank you in advance.

  • @legendevent3911
    @legendevent3911 2 years ago

    Hey Gabriel, I have a training_text file with just digits like 1,234,567 in a variety of combinations. The problem is that when I try to start your script I get the following error message:
    python3 split_training_text.py
    Traceback (most recent call last):
    File "split_training_text.py", line 12, in
    for line in input_file.readlines():
    File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
    Could you help me to resolve this? Im a newbie in python.
    The tutorial was great!
    Edit: When im changing the script to: with open(training_text_file, 'rb') I get a new error TypeError: write() argument must be str, not bytes

    • @AstuteJoe
      @AstuteJoe 2 years ago

      Can you send me the whole file? Pastebin or GitHub does it, I believe I know exactly how to fix but I need the whole file to send you the fixed version

    • @abdeldjalilchougui
      @abdeldjalilchougui 1 year ago

      Did you solve the problem ? if yes could you share it with me please ?

    • @abdeldjalilchougui
      @abdeldjalilchougui 1 year ago

      @AstuteJoe Did you solve the problem? If yes, could you share it with me, please?

    • @sebastianorzechowski4613
      @sebastianorzechowski4613 8 months ago

      I think you have to pass encoding='utf-8' inside the open function:
      with open(training_text_file,'r',encoding='utf-8') as input_file:
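
One caveat: the traceback above already comes from the utf-8 codec, and a first byte of 0xff is consistent with a UTF-16 (or UTF-32) byte order mark, for example from a file saved as "Unicode" on Windows. Under that assumption, forcing utf-8 would not help. Here is a small sketch that picks the codec from the BOM instead; the function names are made up for illustration.

```python
import codecs

# Order matters: the UTF-32 LE mark starts with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding(raw, default="utf-8"):
    """Pick a codec from the byte order mark, falling back to UTF-8."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return encoding
    return default


def read_training_lines(path):
    """Read a training text file as a list of str lines, whatever its BOM."""
    with open(path, "rb") as handle:
        raw = handle.read()
    return raw.decode(detect_encoding(raw)).splitlines()


# A UTF-16 LE sample whose first byte is 0xff, as in the traceback above.
raw = codecs.BOM_UTF16_LE + "1,234,567\n".encode("utf-16-le")
print(detect_encoding(raw))
print(raw.decode(detect_encoding(raw)).splitlines())
```

Because the lines come back as `str`, this also avoids the `write() argument must be str, not bytes` error that shows up when the file is simply opened in `'rb'` mode.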

  • @snoopi6243
    @snoopi6243 1 year ago

    Is there any way to perform RTL languages/fonts fine tuning in windows just like this?

    • @physicfor
      @physicfor 4 months ago

      On Windows, text2image will never find the font name, so better install a Linux virtual machine

  • @prakashchavda2813
    @prakashchavda2813 15 days ago

    I guess a Linux machine is a must for training Tesseract 5, because it's not working on Windows.

  • @eusebiosouza2252
    @eusebiosouza2252 1 year ago

    Great video!
    I'm getting this error when I try to run the training command:
    "Failed to read boxes from data/FE_Font-ground-truth/eng_16.tif"
    The file eng_16.tif does not seem to be empty, and it's very similar to all the other training files. I'm running with MAX_ITERATIONS=100, and when I delete the file that seems to be the problem, Tesseract throws the same error but with a different file. Could anyone please help me?

  • @asiburrahman3623
    @asiburrahman3623 2 years ago

    I didn't get the font part. Where did you put the font?

    • @AstuteJoe
      @AstuteJoe 2 years ago

      It has to be installed on your system, each OS will have a different way of doing it

    • @asiburrahman3623
      @asiburrahman3623 2 years ago +1

      @AstuteJoe I'm using Ubuntu. Is there any way to specify the directory?

    • @AstuteJoe
      @AstuteJoe 2 years ago

      @asiburrahman3623 askubuntu.com/questions/3697/how-do-i-install-fonts

    • @asiburrahman3623
      @asiburrahman3623 2 years ago +2

      @AstuteJoe I have installed the font but this error still shows:
      Fontconfig warning: "/tmp/fonts.conf", line 4: empty font directory name ignored
      Fontconfig warning: "/tmp/fonts.conf", line 4: empty font directory name ignored
      Could not find font named 'Apex'.

    • @kannapatudompant8535
      @kannapatudompant8535 2 years ago

      @asiburrahman3623 I also have the same problem.
      I tried to add '--fontconfig_tmpdir={fontconf_dir}'. >> the default is /tmp which doesn't have our font directory in it.
      fonts.conf is usually located in etc/share/fonts.
      Now, I could create .box and .tif files.
      Hope this solution could solve your issue too.
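
For anyone wiring this into the generation script, here is a rough sketch of a text2image call with explicit font locations. `--font`, `--text`, `--outputbase`, `--fonts_dir`, `--fontconfig_tmpdir` and `--list_available_fonts` are text2image options; all paths and the two helper names are placeholders, and the real split_training_text.py passes more options than shown here.

```python
def build_text2image_command(font, text_file, output_base, fonts_dir, fontconfig_tmpdir):
    """Assemble a text2image call that points at an explicit font folder."""
    return [
        "text2image",
        f"--font={font}",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        f"--fonts_dir={fonts_dir}",                  # where the .ttf/.otf files live
        f"--fontconfig_tmpdir={fontconfig_tmpdir}",  # writable scratch folder
    ]


def build_list_fonts_command(fonts_dir):
    """Assemble a call that prints the font names text2image can see."""
    return ["text2image", "--list_available_fonts", f"--fonts_dir={fonts_dir}"]


# Hypothetical paths; adjust them to your own layout.
command = build_text2image_command(
    font="Apex",
    text_file="data/Apex-ground-truth/eng_0.gt.txt",
    output_base="data/Apex-ground-truth/eng_0",
    fonts_dir="/usr/share/fonts",
    fontconfig_tmpdir="/tmp/fontconfig",
)
print(" ".join(command))
print(" ".join(build_list_fonts_command("/usr/share/fonts")))
```

Running the second command first shows the exact name to pass to `--font`, which should help with the "Could not find font named ..." errors reported in this thread.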

  • @datarkmveri2228
    @datarkmveri2228 2 years ago +1

    Hi,
    When I try to run the training command it gives an error. Can you please help me? ------->
    Config file is optional, continuing...
    Failed to read data from: data/langdata/Apex/Apex.config
    Failed to read data from: data/langdata/radical-stroke.txt
    Error reading radical code table data/langdata/radical-stroke.txt
    make: *** [Makefile:293: data/Apex/Apex.traineddata] Error 1

    • @datarkmveri2228
      @datarkmveri2228 2 years ago +2

      command : TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
      combine_tessdata -u ../tesseract/tessdata/eng.traineddata data/eng/Apex

    • @datarkmveri2228
      @datarkmveri2228 2 years ago

      tesseract "data/Apex-ground-truth/eng_44.tif" data/Apex-ground-truth/eng_44 --psm 13 lstm.train
      + tesseract data/Apex-ground-truth/eng_44.tif data/Apex-ground-truth/eng_44 --psm 13 lstm.train
      python3 shuffle.py 0 "data/Apex/all-lstmf"
      + head -n 90 data/Apex/all-lstmf
      + tail -n 10 data/Apex/all-lstmf
      combine_lang_model \
      --input_unicharset data/Apex/unicharset \
      --script_dir data/langdata \
      --numbers data/Apex/Apex.numbers \
      --puncs data/Apex/Apex.punc \
      --words data/Apex/Apex.wordlist \
      --output_dir data \
      \
      --lang Apex
      Failed to read data from: data/Apex/Apex.wordlist
      Failed to read data from: data/Apex/Apex.punc
      Failed to read data from: data/Apex/Apex.numbers
      Loaded unicharset of size 113 from file data/Apex/unicharset
      Setting unichar properties
      Other case É of é is not in unicharset
      Other case FI of fi is not in unicharset
      Setting script properties
      Failed to load script unicharset from:data/langdata/Latin.unicharset
      Warning: properties incomplete for index 3 = C
      Warning: properties incomplete for index 4 = H
      Warning: properties incomplete for index 5 = E
      Warning: properties incomplete for index 6 = S
      Warning: properties incomplete for index 7 = -
      Warning: properties incomplete for index 8 = R
      Warning: properties incomplete for index 9 = I
      Warning: properties incomplete for index 10 = K
      Warning: properties incomplete for index 11 = N
      Warning: properties incomplete for index 12 = G
      Warning: properties incomplete for index 13 = B
      Warning: properties incomplete for index 14 = 8
      Warning: properties incomplete for index 15 = 5

    • @АлексейПетров-ч1и5д
      @АлексейПетров-ч1и5д 1 year ago

      @datarkmveri2228 Solved it: you need to run these in the tesstrain folder:
      make leptonica tesseract
      make tesseract-langdata
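
A quick way to confirm that the langdata step took effect is to check for the files the error logs in this thread complain about. This is only a minimal sketch: the file names come from those logs, while the `data/langdata` location and the helper name `missing_langdata` are assumptions made for illustration.

```python
from pathlib import Path

# File names taken from the error logs in this thread; the directory layout
# (data/langdata inside the tesstrain checkout) is an assumption.
REQUIRED_LANGDATA = ("radical-stroke.txt", "Latin.unicharset")


def missing_langdata(langdata_dir, required=REQUIRED_LANGDATA):
    """Return the required file names that are absent from langdata_dir."""
    base = Path(langdata_dir)
    return [name for name in required if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_langdata("data/langdata")
    if missing:
        print("Missing langdata files:", ", ".join(missing))
        print("Try running `make tesseract-langdata` in the tesstrain folder.")
    else:
        print("All expected langdata files are present.")
```

Run it from the tesstrain folder before starting `make training`; an empty result means the two files named in the logs are in place.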

  • @ikedoriens6149
    @ikedoriens6149 2 years ago

    Jesus. Isn't there just a command-line option like in Tesseract 4.0?
    This seems a bit complicated for someone who's not into programming.

  • @sebastianorzechowski4613
    @sebastianorzechowski4613 8 months ago

    Hello, has anyone here tried to teach Tesseract Polish characters? I adapted split_training_text for Tesseract 5.0 to create lines from a Polish text set and then train Tesseract. I think the problem is with the font type, because it should know how to recognize those special characters:
    Stripped 4 unrenderable word(s): 'unieważnienie SZKOŁAMI NADZIEJĘ, | '
    I can share my adapted line-generation script with you if you want. I will try with another font. So far I tried HvDTrial Fabrikat Mono.

  • @TuanLe-ve7lm
    @TuanLe-ve7lm 1 year ago

    Hi Gabo, may I please see your fonts.conf file?

    • @AstuteJoe
      @AstuteJoe 1 year ago

      Not even sure what this file is now, but here you go. This one is in my home folder:
      /home/gabri/tesseract_training/apex_legends.otf

    • @AstuteJoe
      @AstuteJoe 1 year ago

      This one is in the tesseract project folder:

    • @TuanLe-ve7lm
      @TuanLe-ve7lm 1 year ago

      I made good progress today: I am able to train the Apex font. However, when I switch to another font, Noto Sans, it generates the box and tif files but shows an error while training: "Makefile:219: *** found no data/Noto Sans-ground-truth/*.gt.txt for Sans/all-gt. Stop." It seems it does not accept a font name with a space in the middle.

    • @AstuteJoe
      @AstuteJoe 1 year ago

      @TuanLe-ve7lm That could definitely be it, spaces and Linux (or Windows) don't mix well
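
One possible workaround for the space problem is to derive a space-free model name from the font name and use it for `MODEL_NAME` and the ground-truth folder. This is a sketch only: it assumes the Makefile splits the name on whitespace, which is consistent with the "for Sans/all-gt" part of the error above but is not confirmed here, and `safe_model_name` is a hypothetical helper.

```python
import re


def safe_model_name(font_name):
    """Replace runs of whitespace with underscores so the name has no spaces."""
    return re.sub(r"\s+", "_", font_name.strip())


print(safe_model_name("Noto Sans"))  # Noto_Sans
```

With this, the ground-truth folder would be named `data/Noto_Sans-ground-truth`, while the real font name (with the space) is still what the font renderer needs to see.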

  • @blndazeez1973
    @blndazeez1973 2 years ago

    Hi Gabriel,
    Great video! One question: when I try to retrain the Arabic model using this command
    "TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=ara TESSDATA=../tesseract/tessdata MAX_ITERATIONS=200"
    it gives me the error below:
    "Error opening data file ../tesseract/tessdata/eng.traineddata"
    The problem is that I am not using the English model.
    Thanks for the video again!

    • @AstuteJoe
      @AstuteJoe 2 years ago

      That's really odd, I see you changed the START_MODEL so it should work, not super sure now

    • @AstuteJoe
      @AstuteJoe 2 years ago

      Do you have ara.traineddata in the tessdata folder?

    • @blndazeez1973
      @blndazeez1973 2 years ago

      @AstuteJoe Yes, I have, and I made sure of it a couple of times

    • @AstuteJoe
      @AstuteJoe 2 years ago

      @blndazeez1973 Maybe it's because the Apex model was already created when you were trying it out? And it's already on top of the eng trained data?

    • @blndazeez1973
      @blndazeez1973 2 years ago +1

      @AstuteJoe I redid the steps with a different model name, but it gives me the same error. That is strange.

  • @kurobane_sama
    @kurobane_sama 4 months ago

    Impossible to use a language other than English :(

  • @PratibhaVaradkar
    @PratibhaVaradkar 1 year ago

    Hi Gabriel (@AstuteJoe), thank you for the elaborate tutorial.
    I have a question, though. I followed the tutorial and generated the tif, gt.txt and .box files manually. My training quits with a zero error rate before reaching the max iterations. But when I use the generated traineddata file, it gives the error "Error: Tesseract (legacy) engine requested, but components are not present in /usr/share/tesseract-ocr/5/tessdata/lang_name.traineddata!! Failed loading language 'lang_name' Tesseract couldn't load any languages! Could not initialize tesseract."
    Can you please suggest what I missed?

  • @3ombieautopilot
    @3ombieautopilot 2 years ago +1

    Hello! Can you make a video about how to make Tesseract recognize a character that is not in eng.traineddata, like ± or Ó, mixed with some English text?

  • @ahmetfatih4121
    @ahmetfatih4121 1 month ago +1

    I can feel your pain bro, my heart breaks every time your voice breaks :( Dealing with all those endless instructions, terminal commands designed by some d*ck head to make life miserable for all of us, and just all kinds of bullshit. You have my sympathy.

  • @rcraftg4mer42
    @rcraftg4mer42 11 months ago

    i love you

    • @AstuteJoe
      @AstuteJoe 11 months ago

      lol i love you too

  • @datarkmveri2228
    @datarkmveri2228 2 years ago

    please help

  • @_nom_
    @_nom_ 1 year ago

    No rule to make target 'data/eng-ground-truth/eng.training_text.lstmf'

  • @АлексейПетров-ч1и5д

    Hello, how to fix it?
    Failed to read data from: data/langdata/Apex/Apex.config
    Failed to read data from: data/langdata/radical-stroke.txt
    Error reading radical code table data/langdata/radical-stroke.txt
    make: *** [Makefile:293: data/Apex/Apex.traineddata] Error 1

  • @focusofLandD
    @focusofLandD 1 year ago

    Hi Gabriel, I am getting this error at the last training step when trying to train a new font called Bender:
    Failed to read data from : data/bender/bender.worldlist
    Failed to read data from : data/bender/bender.punc
    Failed to read data from : data/bender/bender.numbers
    Failed to read data from : data/bender/bender.config
    Invalid format in radical table at line 0: 19886 3 23 6 3

    • @notAvn
      @notAvn 1 year ago

      did you manage to train tesseract for bender yet?

  • @Kronzplayz.
    @Kronzplayz. 1 year ago

    Kindly help, I'm getting an error while training, please @AstuteJoe
    Failed to read data from: data/OCRA/OCRA.wordlist
    Failed to read data from: data/OCRA/OCRA.punc
    Failed to read data from: data/OCRA/OCRA.numbers
    Loaded unicharset of size 112 from file data/OCRA/unicharset
    Setting unichar properties
    Other case É of é is not in unicharset
    Setting script properties
    Failed to load script unicharset from:data/langdata/Latin.unicharset
    Config file is optional, continuing...
    Failed to read data from: data/langdata/OCRA/OCRA.config
    Failed to read data from: data/langdata/radical-stroke.txt
    Error reading radical code table data/langdata/radical-stroke.txt
    make: *** [Makefile:293: data/OCRA/OCRA.traineddata] Error 1

  • @ganeshrajv130
    @ganeshrajv130 1 year ago

    read_params_file: Can't open make
    read_params_file: Can't open training
    read_params_file: Can't open MODEL_NAME=nakula_hin
    read_params_file: Can't open START_MODEL=hin
    read_params_file: Can't open TESSDATA=/usr/local/share/tessdata/
    read_params_file: Can't open MAX_ITERATIONS=10
    Error, cannot read input file TESSDATA_PREFIX: No such file or directory
    Error during processing.
    This is the error I get even though I followed your steps.

  • @faint.2396
    @faint.2396 2 years ago

    Hi I'm getting this error:
    Traceback (most recent call last):
    File "C:\Users\HAVASIZ\Desktop\tesseract_tutorial\split_training_text.py", line 34, in <module>
    subprocess.run([
    File "C:\Users\HAVASIZ\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 501, in run
    with Popen(*popenargs, **kwargs) as process:
    File "C:\Users\HAVASIZ\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 969, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
    File "C:\Users\HAVASIZ\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 1438, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
    FileNotFoundError: [WinError 2]

    • @TuanLe-ve7lm
      @TuanLe-ve7lm 1 year ago

      Same for me. Have you found a solution yet?

    • @faint.2396
      @faint.2396 1 year ago

      @TuanLe-ve7lm No, sadly I gave up on training Tesseract 5. I'm going to try to learn how to train Tesseract 4 because there are a lot more videos on YouTube.

    • @faint.2396
      @faint.2396 1 year ago

      @TuanLe-ve7lm I actually fixed the issue by using Linux. But now I get other errors lol

    • @abdeldjalilchougui
      @abdeldjalilchougui 1 year ago

      @faint.2396 Did you fix your problem?

    • @sebastianorzechowski4613
      @sebastianorzechowski4613 7 months ago

      I think it could be related to text2image itself. You have to provide the path to text2image.exe, which is generally located in the Tesseract installation folder.
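
If the `FileNotFoundError` does come from `text2image` not being on `PATH`, one way to make the failure clearer is to resolve the executable before handing it to `subprocess.run`. The sketch below uses the standard-library `shutil.which`; the helper name `find_tool` and the Windows install folder are hypothetical and would need adjusting to the actual setup.

```python
import shutil


def find_tool(name, extra_dirs=()):
    """Return the full path to an executable, or None if it cannot be found."""
    # First look on the regular PATH.
    found = shutil.which(name)
    if found:
        return found
    # Then try each extra directory in turn.
    for directory in extra_dirs:
        found = shutil.which(name, path=str(directory))
        if found:
            return found
    return None


if __name__ == "__main__":
    # Hypothetical install location; adjust to wherever Tesseract is installed.
    tool = find_tool("text2image", [r"C:\Program Files\Tesseract-OCR"])
    if tool is None:
        print("text2image not found; add the Tesseract folder to PATH")
    else:
        print("Using", tool)
```

The returned path can then be used as the first element of the argument list passed to `subprocess.run` in place of the bare `text2image` name.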

  • @utkarshmishra6194
    @utkarshmishra6194 1 year ago

    Hi Gabriel, hope you're doing well.
    I ran this command:
    TESSDATA_PREFIX=/mnt/c/Users/Asus/PycharmProjects/tesseract_tutorial/tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=eng TESSDATA=/mnt/c/Users/Asus/PycharmProjects/tesseract_tutorial/tesseract/tessdata MAX_ITERATIONS=400
    But I am getting this error:
    Failed to read data from: data/Apex/Apex.wordlist
    Failed to read data from: data/Apex/Apex.punc
    Failed to read data from: data/Apex/Apex.numbers
    Failed to read data from: data/langdata/Apex/Apex.config
    Null char=2
    lstmtraining \
    --debug_interval 0 \
    --traineddata data/Apex/Apex.traineddata \
    --old_traineddata /mnt/c/Users/Asus/PycharmProjects/tesseract_tutorial/tesseract/tessdata/eng.traineddata \
    --continue_from data/eng/Apex.lstm \
    --learning_rate 0.0001 \
    --model_output data/Apex/checkpoints/Apex \
    --train_listfile data/Apex/list.train \
    --eval_listfile data/Apex/list.eval \
    --max_iterations 1000 \
    --target_error_rate 0.01
    Failed to load list of training filenames from data/Apex/list.train
    make: *** [Makefile:319: data/Apex/checkpoints/Apex_checkpoint] Error 1

  • @athosmba1766
    @athosmba1766 1 year ago

    When I use the command TESSDATA_PREFIX=.../tesseract/tessdata make training model_NAME=Apex Start_MODEL=eng TESSDATA=.../tesseract/tessdata MAX_INTERATION=100 it doesn't work; it gives an error about the command TESSDATA=........

    • @athosmba1766
      @athosmba1766 1 year ago

      Can someone help me?

    • @Ethiopic
      @Ethiopic 1 year ago

      Are you getting a "not recognized" error? I am getting the same error on Windows. The exact command works fine on the Mac. Very strange. Did you find a solution?
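
The `VAR=value command` prefix is POSIX shell syntax, which the Windows command prompt does not understand, and that may be what produces the "not recognized" message. A shell-independent way to pass `TESSDATA_PREFIX` is to set it in the environment from Python. This is a sketch only: the variable names are spelled as in the working commands elsewhere in this thread (make variables are case-sensitive), `make_training_command` is a hypothetical helper, and it is not verified here that the rest of the tesstrain Makefile runs natively on Windows.

```python
import os


def make_training_command(model_name, start_model, tessdata, max_iterations):
    """Build the argument list and environment for `make training`."""
    args = [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata}",
        f"MAX_ITERATIONS={max_iterations}",
    ]
    # Copy the current environment and add TESSDATA_PREFIX to it.
    env = dict(os.environ, TESSDATA_PREFIX=str(tessdata))
    return args, env


args, env = make_training_command("Apex", "eng", "../tesseract/tessdata", 100)
print(" ".join(args))
# To actually run it from the tesstrain folder (requires make to be installed):
# import subprocess; subprocess.run(args, env=env, check=True)
```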

  • @vishnubalaji9500
    @vishnubalaji9500 2 years ago +2

    Understood jack shit from this video; it needs more dumbing down.

    • @faint.2396
      @faint.2396 1 year ago +5

      fr, and I did every step the same and I'm getting errors. Why isn't training Tesseract 5 as simple as Tesseract 4? And the thing is there's only ONE video on how to train Tesseract 5 and it's this one.

  • @sayantanbiswas9702
    @sayantanbiswas9702 7 months ago

    tesseract data/coc-ground-truth/eng_2.tif stdout --tessdata-dir /home/godmode2/tesseract_tutorial/tesstrain/data --psm 7 -l coc --loglevel ALL

  • @sayantanbiswas9702
    @sayantanbiswas9702 7 months ago

    TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=coc START_MODEL= eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000