Extracting text from a PDF file using the PyMuPDF library.

PyMuPDF is a Python library that supports file formats such as PDF, XPS, EPUB, and CBZ. This section concentrates on PDF (Portable Document Format) files.

Installation

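pip install pymupdf

Note: Do not run pip install fitz. The fitz package on PyPI is an unrelated project; the fitz module imported below is provided by PyMuPDF itself.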
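To confirm that the installation succeeded, one option is to ask Python's package metadata for the installed version. The following is a minimal sketch, not an official check: the helper name installed_version is hypothetical, and it assumes the distribution is registered under the name PyMuPDF.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


if __name__ == "__main__":
    found = installed_version("PyMuPDF")
    print("PyMuPDF version:", found if found else "not installed")
```

If the helper reports "not installed", run the pip command again in the same environment that runs your script.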

To extract text from a PDF, follow these steps:

Importing the library
Opening document
Extracting text

Note: The examples use a file named sample.pdf, which is available from the link below.

sample.pdf – Link

1. Importing the library

Python3

import fitz

2. Opening document

Python3

doc = fitz.open('sample.pdf')

Here we create a document object named doc. The filename is passed as a Python string.
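Opening can fail when the path is wrong, so it may help to check the file first and to close the document when finished. The sketch below shows one possible way to do that. The helper open_pdf is hypothetical, and the sketch assumes the document object can be used in a with statement and exposes a page_count attribute, as current PyMuPDF releases do.

```python
from pathlib import Path


def open_pdf(path):
    """Open a PDF with PyMuPDF after checking that the file exists."""
    pdf_path = Path(path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No such PDF file: {pdf_path}")
    import fitz  # PyMuPDF; imported here so the path check runs first
    return fitz.open(str(pdf_path))


if __name__ == "__main__":
    try:
        # The with statement closes the document automatically.
        with open_pdf("sample.pdf") as doc:
            print("Pages:", doc.page_count)
    except FileNotFoundError as err:
        print(err)
```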

3. Extracting text

Python3

for page in doc:
    text = page.get_text()  # plain text of the current page
    print(text)

Here, we iterate over the pages of the PDF and call the get_text() method on each page to extract its text.
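When a document has many pages, it can be useful to see where each page's text begins. The following sketch is an illustration rather than part of PyMuPDF: label_pages and extract_page_texts are hypothetical helper names, and the "--- Page n ---" header format is an arbitrary choice.

```python
def label_pages(page_texts):
    """Prefix each page's text with a 1-based page header."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(parts)


def extract_page_texts(path):
    """Return a list with the plain text of every page in the PDF."""
    import fitz  # PyMuPDF; imported lazily so label_pages works without it
    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


if __name__ == "__main__":
    # Demonstration with literal strings instead of a real PDF.
    print(label_pages(["First page text\n", "Second page text\n"]))
```

With a real file, the two helpers can be combined as label_pages(extract_page_texts('sample.pdf')).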

Complete code to extract the text

Python3

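import fitz  # PyMuPDF

doc = fitz.open('sample.pdf')  # open the PDF document
text = ""
for page in doc:  # iterate over the pages in order
    text += page.get_text()  # append this page's plain text
doc.close()  # release the file once all pages are read
print(text)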
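Printing works for a quick look, but the extracted text is often wanted in a file. As a hedged sketch of that follow-up step, the code below wraps the extraction in a function and writes the result as UTF-8. The names pdf_to_text and save_text, and the choice of a .txt file next to the PDF, are assumptions made for illustration.

```python
from pathlib import Path


def pdf_to_text(pdf_path):
    """Return the plain text of all pages, concatenated in page order."""
    import fitz  # PyMuPDF; imported lazily so save_text works without it
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)


def save_text(text, out_path):
    """Write text to out_path as UTF-8; return the character count written."""
    return Path(out_path).write_text(text, encoding="utf-8")


if __name__ == "__main__":
    source = Path("sample.pdf")
    if source.is_file():
        count = save_text(pdf_to_text(str(source)), source.with_suffix(".txt"))
        print(f"Wrote {count} characters")
    else:
        print("sample.pdf not found")
```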

Conclusion

We have seen two Python libraries, pypdf and PyMuPDF, that can extract text from a PDF file. Let us know in the comments which of the two you prefer.

Extract text from PDF File using Python

Most readers will be familiar with PDFs, one of the most widely used digital document formats. PDF stands for Portable Document Format and uses the .pdf extension. It is used to present and exchange documents reliably, independent of software, hardware, or operating system.

In this article, we extract text from PDF files using two Python libraries, pypdf and PyMuPDF.
