Read a Particular Page from a PDF File in Python

Document processing is one of the most common use cases for Python. The language and its libraries can work with many kinds of files, such as database files, multimedia files and encrypted files, to name a few. This article shows how to read a particular page from a PDF (Portable Document Format) file in Python.

Method 1: Using the PyMuPDF library to read a page in Python

This method uses the PyMuPDF library to process the PDF and PIL (Python Imaging Library) to display the result. PIL is provided by the Pillow package, which must also be installed. To install PyMuPDF, run the following command in a terminal or command prompt:

pip install pymupdf

Note: The PyMuPDF library is imported under the name fitz, using the following statement.

import fitz

Reading a page from a PDF file requires loading the file and then displaying the contents of only one of its pages. A single rendered page is essentially an image, so in this method the page is read from the PDF file and displayed as an image.

The following example demonstrates the above process:

Python3

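import fitz  # PyMuPDF
from PIL import Image

input_file = r"test.pdf"

# Open the PDF and load its first page (page indices start at 0)
file_handle = fitz.open(input_file)
page = file_handle[0]

# Render the page to a pixel map and save it as a PNG image
page_img = page.get_pixmap()
page_img.save('PDF_page.png')
file_handle.close()

# Open the saved image with PIL and display it
img = Image.open('PDF_page.png')
img.show()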

Output:

(Image: the first page of test.pdf, rendered as a PNG and shown in the default image viewer.)

Explanation:

First, the PDF file is opened and the resulting document object is stored. The first page of the PDF (at index 0) is then loaded using list-style indexing. The page's pixel map (pixel array) is obtained with the get_pixmap method and stored in a variable, and that pixel map is saved as a PNG image file. The PNG file is then opened with the open function from the Image module of PIL, and finally the image is displayed with the show method.

Note: The first open function (fitz.open) opens the PDF file, and the second one (Image.open) opens the PNG image file. The two functions belong to different libraries and serve different purposes.
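The example above hard-codes page index 0 and writes a temporary PNG file. As a minimal sketch, assuming a recent PyMuPDF release and Pillow are installed, the helper below accepts a human-style page number (starting at 1), validates it against the page count, and builds the PIL image in memory instead. The names to_page_index and render_page are hypothetical and chosen only for illustration.

```python
def to_page_index(page_number, page_count):
    """Convert a 1-based page number to the 0-based index PyMuPDF expects."""
    if not 1 <= page_number <= page_count:
        raise ValueError(
            f"page {page_number} is out of range (document has {page_count} pages)"
        )
    return page_number - 1


def render_page(pdf_path, page_number):
    """Return the requested page as a PIL image without writing a PNG to disk."""
    # Imported here so to_page_index can be used without the third-party packages.
    import fitz  # PyMuPDF
    from PIL import Image

    with fitz.open(pdf_path) as doc:
        index = to_page_index(page_number, doc.page_count)
        pix = doc[index].get_pixmap()  # RGB pixel map, no alpha channel by default
        # Build the PIL image directly from the raw pixel bytes
        return Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
```

Assuming test.pdf exists in the working directory, render_page("test.pdf", 1).show() would display the same first page as the example above.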

Method 2: Reading a particular page from a PDF using PyPDF2

The second example uses the PyPDF2 library, which can be installed by running the following command:

pip install PyPDF2

PyPDF2 supports various operations on PDF files, such as reading, writing and creating them. For the task at hand, the page's text-extraction method (extract_text in current releases) is used to obtain the text of one page and display it. The code for this is as follows:

Python3

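import PyPDF2

input_file = r"test.pdf"

# Zero-based page index: 4 selects the fifth page of the PDF
page = 4

pdfFileObj = open(input_file, 'rb')

# PdfReader, pages[...] and extract_text() replace PdfFileReader, getPage()
# and extractText(), which were removed in PyPDF2 3.0
pdfReader = PyPDF2.PdfReader(pdfFileObj)
pageObj = pdfReader.pages[page]
data = pageObj.extract_text()

pdfFileObj.close()

print(data)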

Output:

He started this Journey with just one thought- every geek should have access to a never ending range of academic resources and with a lot of hardwork and determination, GeeksforGeeks was born.
Through this platform, he has successfully enriched the minds of students with knowledge which has led to a boost in their careers. But most importantly, GeeksforGeeks will always help students stay in touch with their Geeky side!
EXPERT ADVICE
CEO and Founder of GeeksforGeeks I understand that many students who come to us are either fans of the sciences or have been pushed into this feild by their parents.
And I just want you to know that no matter where life takes you, we at GeeksforGeeks hope to have made this journey easier for you.Mr. Sandeep Jain
3

Explanation:

First, the path to the input PDF and the page index are defined in separate variables. Page indices start at 0, so a value of 4 selects the fifth page. The PDF file is then opened in binary mode and its file object is stored in a variable. This file object is passed to the PyPDF2 reader class (PdfReader in current releases, PdfFileReader in releases before 3.0), which creates a PDF reader object from it. The page at the index held in the page variable is then fetched from the reader and stored in a variable. Next, the text is extracted from that page and the file object is closed. Finally, the extracted text is printed.
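The steps above can also be wrapped in a small function that closes the file automatically and rejects an out-of-range page index with a clear error. This is only a sketch, assuming PyPDF2 3.x is installed; the names text_of_page and read_page_text are hypothetical and not part of PyPDF2.

```python
def text_of_page(pages, page_index):
    """Return the text of pages[page_index], checking the index first."""
    if not 0 <= page_index < len(pages):
        raise IndexError(
            f"page index {page_index} is out of range (0 to {len(pages) - 1})"
        )
    # Fall back to an empty string if the page yields no text
    return pages[page_index].extract_text() or ""


def read_page_text(pdf_path, page_index):
    """Open a PDF with PyPDF2 and return the text of one page (0-based index)."""
    # Imported here so text_of_page can be used without PyPDF2 installed.
    import PyPDF2

    # The with statement closes the file even if text extraction fails
    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        return text_of_page(reader.pages, page_index)
```

Assuming test.pdf has at least five pages, print(read_page_text("test.pdf", 4)) would print the same text as the example above.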