Converting PDF to Image in Python(the Right Way)
If you use your favorite search engine to learn how to “convert pdf to image” in Python, you will find many websites like GeeksforGeeks introducing pdf2image to accomplish this goal.
As you can in pdf2image’s PyPI page, it is just a wrapper around pdftoppm while
pdftoppm itself is part of poppler. Basically, pdf2image module invokes
pdftoppm with subprocess Python standard module which is for invoking subprocesses, controling them and capturing their output all from Python.
So I was wondering, why not use the poppler C++ library itself to convert PDF file to Image?
After some searching in PyPI I’ve found python-poppler which is a wrapper around poppler C++ library which is for reading PDF files, extracting text from them and converting their pages to image file formats.
Here I want to teach you how to use poppler from Python to convert the first page of a PDF to Image and save it somewhere in your
The zeroth step as always is installing the library:
pip3 install python-poppler --user
python-poppler relies on poppler C++ library to function and is just a wrapper for that so you need to install it as well. Here I assume you already have it installed on your computer. But in the case it isn’t, you will find pip3 cursing you with errors!
The first step is, of course, firing up the Python(3) thing and importing
Python 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> from poppler import load_from_file, PageRenderer >>>
load_from_file is used to load a PDF file from filesystem as it sounds.
PageRenderer is used to render a page of some PDF document into an image.
In the case you need to load the PDF from bytes(e.g. when you are receiving it from network such as a web API which receives data from its clients) you must go with
load_from_data, instead(See here).
Next you must load the PDF file with its name:
>>> document = load_from_file("probability_cheatsheet.pdf")
document is a Document object with which you can read its title or number of pages and many other properties:
>>> document.pages 10 >>> document.page_layout <page_layout_enum.no_layout: 0> >>> document.page_mode <page_mode_enum.use_outlines: 1> >>> document.producer 'pdfTeX-1.40.14'
Next, you can create an instance of
PageRenderer and create an image of the first page:
>>> renderer = PageRenderer() >>> page1 = document.create_page(0) >>> renderer. >>> img = renderer.render_page(page1) >>> img <poppler.image.Image object at 0x7f3652f44310>
Now it is time to save the image as PNG:
>>> img.save("/tmp/pc.png", out_format="png") True
And voila! You’ve done that!