Python Extract Text from Scanned PDF | Python Extract Text from Image | Python Tesseract OCR Setup

  • Published 4 Dec 2024

COMMENTS • 48

  • @anthonyavellanedapaitan7081
    @anthonyavellanedapaitan7081 4 months ago +4

    Could you please share the poppler file that you downloaded on Google Drive? The link you provide in the description is broken.

  • @Anarky35
    @Anarky35 1 year ago +1

    Great, thanks a lot, very helpful

  • @reetanimesh4903
    @reetanimesh4903 2 years ago

    Good Detailed Explanation

  • @khaledibrahim1065
    @khaledibrahim1065 10 months ago

    First of all, thank you for this excellent explanation. If I want to extract a complicated table from a scanned PDF file into a CSV file, do you have any suggestions?

    • @Python2020
      @Python2020 10 months ago

      You have to use regex and then map the values to fields for each line, then append a new line. Going directly from image to CSV might not be possible using these libraries.
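
The regex-and-map idea in the reply above could look roughly like the sketch below. It assumes the OCR text has one table row per line in a hypothetical "date description amount" layout; `ROW_PATTERN`, the field names, and both helper functions are illustrative and would need adjusting to the real table.

```python
import csv
import io
import re

# Hypothetical row layout: "2024-01-05 Coffee shop 4.50"
ROW_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<amount>-?\d+(?:\.\d{2})?)$"
)


def ocr_text_to_rows(ocr_text):
    """Map each OCR line that matches the pattern to a dict of fields."""
    rows = []
    for line in ocr_text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:  # lines that do not look like a table row are skipped
            rows.append(match.groupdict())
    return rows


def rows_to_csv(rows, fieldnames=("date", "description", "amount")):
    """Serialize the mapped rows as CSV text, one line per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

In practice the string passed to `ocr_text_to_rows` would be whatever `pytesseract.image_to_string` returned for a page; tables whose cells wrap across lines would likely need more than a single-line pattern.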

  • @pradeepreddychejarla3753
    @pradeepreddychejarla3753 11 months ago

    Hi. Thank you for this video. I followed the exact steps. For images, the code is working and the output is as expected, but for scanned PDFs the output contains random letters and is missing a lot of text. Why is that? What is the solution for this issue?

  • @manishtripathi1778
    @manishtripathi1778 2 years ago

    Thanks for the video. I need some help with cropping the header and then extracting the text from the file. Can you suggest how I can do that? My PDF files are a mix of scanned PDFs and text PDFs.

    • @Python2020
      @Python2020 2 years ago

      When you read a scanned PDF with PyPDF2, it should throw an error; inside the except block you can write the code for reading the scanned PDF. That is my way.
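
A minimal sketch of the "try the text layer first, fall back to OCR" approach from the reply above. It assumes PyPDF2 (2.x or later, with `PdfReader`), pdf2image, and pytesseract are installed; the function names and the `min_chars` threshold are hypothetical. Because a scanned PDF may also yield empty text rather than an exception, the sketch treats both cases as a signal to run OCR.

```python
def read_text_layer(pdf_path):
    """Read the embedded text layer, if any, with PyPDF2."""
    from PyPDF2 import PdfReader  # imported lazily so the module loads without it

    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def read_with_ocr(pdf_path, dpi=300):
    """Render each page to an image and run Tesseract on it."""
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi)
    return "\n".join(pytesseract.image_to_string(page, lang="eng") for page in pages)


def extract_text_with_fallback(pdf_path, text_reader=read_text_layer,
                               ocr_reader=read_with_ocr, min_chars=20):
    """Use the text layer when it looks usable, otherwise fall back to OCR."""
    try:
        text = text_reader(pdf_path)
    except Exception:
        text = ""  # treat a read failure like a missing text layer
    if len(text.strip()) >= min_chars:  # min_chars is an arbitrary cut-off
        return text
    return ocr_reader(pdf_path)
```

Cropping a header before OCR is not covered here; one option would be to crop each page image before passing it to `image_to_string`.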

  • @hanifadhithan
    @hanifadhithan 2 years ago +1

    Is there a way to extract the text without using the Tesseract model, using Python libraries only?

    • @Python2020
      @Python2020 2 years ago +2

      There is a video on my UA-cam channel, but it won't work for scanned PDFs or image PDFs, only for text PDFs: ua-cam.com/video/7f0Bc2ateqQ/v-deo.html

    • @hanifadhithan
      @hanifadhithan 2 years ago

      @@Python2020 Thanks for the reply, but do you know whether it is possible to do this with libraries only?

  • @CSE-AshikAhamedP
    @CSE-AshikAhamedP 1 year ago

    But where did you train on the images in the Tesseract OCR tool?

    • @Python2020
      @Python2020 1 year ago

      We don't have to train the OCR engine. I think you are referring to training ML models to identify the right labels.

  • @Ramya-v6q
    @Ramya-v6q 8 months ago +1

    Sir, thanks a lot for the info. Can you provide me with a Python script for extracting company details such as company name, address, and so on from multiple images that are in PDF format? I did a lot of research but didn't find one. Please help me out.

    • @Python2020
      @Python2020 8 months ago

      That would be complex. There is one OCR video on this channel that might help.

    • @Ramya-v6q
      @Ramya-v6q 8 months ago +1

      Thank you so much, sir @@Python2020

  • @DhanasekaranEsfita
    @DhanasekaranEsfita 5 months ago

    Hi Sir, I'm working with scanned PDFs (bank statements). The PDF quality is low, so I can't extract the text accurately. Are there any other ways to extract the text accurately?

  • @beatsbyharman7882
    @beatsbyharman7882 1 year ago

    Hi, if I have a PDF of 10 pages and I want to extract the data for only the 7th page, how can I specify that? Where do I need to make changes? Thanks! Waiting for the reply.

    • @Python2020
      @Python2020 1 year ago

      Split the PDF and save the splits in a different folder. Give each split a name ending with its page index, and use the file whose name ends with the required index. You still have your original file.

    • @beatsbyharman7882
      @beatsbyharman7882 1 year ago

      @@Python2020 In my PDF I have 10 pages and I want to extract only the 7th page. How can I do that?
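
As an alternative to splitting the file first, pdf2image's `convert_from_path` accepts `first_page` and `last_page` arguments (1-based), so a single page can be rendered directly. The sketch below is not the approach from the reply above; `ocr_single_page` and its `convert`/`ocr` parameters are hypothetical names, added only so the function can be exercised without the libraries installed.

```python
def ocr_single_page(pdf_path, page_number, dpi=300, convert=None, ocr=None):
    """OCR one page of a PDF; page_number is 1-based (7 means the 7th page)."""
    if page_number < 1:
        raise ValueError("page_number is 1-based and must be at least 1")
    if convert is None:
        from pdf2image import convert_from_path as convert
    if ocr is None:
        from pytesseract import image_to_string as ocr

    # Render only the requested page instead of the whole document.
    images = convert(pdf_path, dpi, first_page=page_number, last_page=page_number)
    if not images:
        return ""  # the page number was past the end of the document
    return ocr(images[0], lang="eng")
```

A call such as `ocr_single_page("statement.pdf", 7)` would then return the text of the 7th page only (the file name is a placeholder).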

  • @abdullahfahad6388
    @abdullahfahad6388 11 months ago

    Sir, will it work with PyCharm Community Edition?

    • @Python2020
      @Python2020 11 months ago +1

      Yes. PyCharm is just a code editor. I have also used the Community version.

    • @abdullahfahad6388
      @abdullahfahad6388 11 months ago

      @@Python2020 thank you

  • @user-fy6gi2vx8g
    @user-fy6gi2vx8g 1 year ago

    I am also getting blank .txt files for PDFs. The code is working for images. Would you mind giving a more detailed suggestion for what I should do? Thanks so much!

    • @Python2020
      @Python2020 1 year ago

      Try printing the text before writing the text file, and see whether data is coming into the variable or not.

    • @user-fy6gi2vx8g
      @user-fy6gi2vx8g 1 year ago

      @@Python2020 Thank you so much for your prompt reply! I printed the variable, and it shows up blank for all the PDF files. In one of the PDF files, I put in an image I produced by typing something in Word and then turning it into an image using Snipping Tool, so I don't think image quality is causing the problem.

    • @user-fy6gi2vx8g
      @user-fy6gi2vx8g 1 year ago

      @@Python2020 Never mind. I discovered that I accidentally put in "5" as the page number instead of "500". I changed it to 500 and now it is working! I find this very interesting, though, because most of my PDFs are 5 pages long or shorter. Thanks so much for your help!
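
In pdf2image, the second positional argument of `convert_from_path` is `dpi` (the rendering resolution), not a page count, which would explain why 5 produced blank output even for short PDFs. The sketch below passes `dpi` by keyword and prints a note for any page that comes back empty, in the spirit of the "print before writing" advice above. The function name and the `convert`/`ocr` parameters are hypothetical.

```python
def ocr_pdf_pages(pdf_path, dpi=500, convert=None, ocr=None):
    """Return a list of (page_number, text) pairs, rendering pages at `dpi`."""
    if convert is None:
        from pdf2image import convert_from_path as convert
    if ocr is None:
        from pytesseract import image_to_string as ocr

    # Naming the argument makes it harder to mistake the resolution for a page count.
    images = convert(pdf_path, dpi=dpi)

    results = []
    for page_number, image in enumerate(images, start=1):
        text = ocr(image, lang="eng")
        if not text.strip():
            # Surface blank pages before anything is written to disk.
            print(f"page {page_number}: OCR returned no text at {dpi} DPI")
        results.append((page_number, text))
    return results
```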

  • @robertcenusa8636
    @robertcenusa8636 2 years ago

    12:35 My .txt files are empty (for scanned PDFs, and even for text PDFs). For images it is working.

  • @Android_19
    @Android_19 2 years ago

    How do I do image preprocessing if the image quality is not good?

    • @Python2020
      @Python2020 2 years ago

      Use OpenCV to enhance the resolution, or there are a few paid services such as Google Vision, Textract, and FlexiCapture.
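
One possible reading of "enhance with OpenCV" is sketched below: convert to grayscale, upscale, lightly denoise, then binarize with Otsu's threshold. These particular steps and the `scale` default are assumptions, not something the reply specifies, and the best combination depends on the scans. It assumes `opencv-python` is installed and that the input is a BGR image array such as `cv2.imread` returns.

```python
def preprocess_for_ocr(image_bgr, scale=2.0):
    """Return a cleaned-up, binarized version of a BGR image for OCR."""
    import cv2  # imported lazily so the module loads without OpenCV

    # Tesseract works on intensity, so colour is dropped first.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)

    # Upscaling small text often helps recognition; cubic interpolation keeps edges smooth.
    enlarged = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)

    # A small median filter removes isolated speckles from the scan.
    denoised = cv2.medianBlur(enlarged, 3)

    # Otsu's method picks the black/white cut-off automatically.
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```

The returned array can be passed straight to `pytesseract.image_to_string`, which accepts NumPy arrays as well as PIL images.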

  • @Hgrewssauujdkhvcjjipp
    @Hgrewssauujdkhvcjjipp 2 years ago

    Cool 👍

  • @satyaprasadmohanty9093
    @satyaprasadmohanty9093 2 years ago

    How can I store all the text extracted from the different pages of a PDF in a single text file?

    • @Python2020
      @Python2020 2 years ago

      Use a single variable if there are only about 10 to 20 pages; once the code exits the loop, write the file there.

    • @satyaprasadmohanty9093
      @satyaprasadmohanty9093 2 years ago

      @@Python2020 Thanks for your response. Will you tell me exactly where I need to change the code?

      resume_pdfs=glob.glob(r"/content/drive/MyDrive/New folder (2)")
      for pdf_path in resume_pdfs:
          pages=convert_from_path(pdf_path,500)
          for pageNum, imgBlob in enumerate(pages):
              text=pytesseract.image_to_string(imgBlob,lang='eng')
              with open(f'{pdf_path[:-4]}_page_{pageNum}.txt','w') as the_file:
                  the_file.write(text)
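
A sketch of the single-variable suggestion applied to the code above: the page texts are collected inside the loop and the file is written once, after the loop finishes. It assumes the folder should be searched with a `*.pdf` pattern, since `glob.glob` on a bare folder path returns only the folder itself. The function names and the `convert`/`ocr` parameters are hypothetical.

```python
import glob
import os


def ocr_pdf_to_single_file(pdf_path, dpi=500, convert=None, ocr=None):
    """OCR every page of one PDF and write all the text to a single .txt file."""
    if convert is None:
        from pdf2image import convert_from_path as convert
    if ocr is None:
        from pytesseract import image_to_string as ocr

    pages = convert(pdf_path, dpi)

    # Collect the text of every page first ...
    page_texts = [ocr(image, lang="eng") for image in pages]

    # ... and write once, after the loop, so there is one file per PDF.
    out_path = os.path.splitext(pdf_path)[0] + ".txt"
    with open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write("\n\n".join(page_texts))
    return out_path


def ocr_folder(folder, **kwargs):
    """Run ocr_pdf_to_single_file on every .pdf directly inside `folder`."""
    pdf_paths = sorted(glob.glob(os.path.join(folder, "*.pdf")))
    return [ocr_pdf_to_single_file(path, **kwargs) for path in pdf_paths]
```

Holding all pages in memory is fine for documents of a few dozen pages; for very long PDFs, appending to the file page by page would use less memory.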

  • @sushmahs7840
    @sushmahs7840 2 years ago

    How do I apply this method in Anaconda for text extraction?

    • @Python2020
      @Python2020 2 years ago

      The code will remain the same in any environment.

  • @arunkumar-nb7be
    @arunkumar-nb7be 2 years ago

    Is there any way I can install the libraries with pip install ocrmypdf? On my company laptop I can't run .exe files, but I can run libraries through Python.

    • @Python2020
      @Python2020 2 years ago

      There is a CMD way; you have to download the library file.

    • @Python2020
      @Python2020 2 years ago

      stackoverflow.com/questions/11091623/how-to-install-packages-offline/14447068#14447068

  • @tapanpati9452
    @tapanpati9452 9 months ago

    The 2nd link is not working

    • @vthelagan2198
      @vthelagan2198 4 months ago

      Yes, for me too

    • @Python2020
      @Python2020 4 months ago

      There might be some change, as it is an old video.

  • @ROKKor-hs8tg
    @ROKKor-hs8tg 1 year ago

    Where is the code?