The Python Oracle

Extract a page from a pdf as a jpeg

--------------------------------------------------
Hire the world's top talent on demand or became one of them at Toptal: https://topt.al/25cXVn
and get $2,000 discount on your first invoice
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Hypnotic Puzzle2

--

Chapters
00:00 Extract A Page From A Pdf As A Jpeg
00:27 Accepted Answer Score 234
01:39 Answer 2 Score 30
02:06 Answer 3 Score 17
02:21 Answer 4 Score 155
03:00 Thank you

--

Full question
https://stackoverflow.com/questions/4618...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #image #pdf

#avk47



ACCEPTED ANSWER

Score 234


The pdf2image library can be used.

You can install it simply using,

pip install pdf2image

Once installed you can use following code to get images.

from pdf2image import convert_from_path
pages = convert_from_path('pdf_file', 500)

Saving pages in jpeg format

for count, page in enumerate(pages):
    page.save(f'out{count}.jpg', 'JPEG')

Edit: the Github repo pdf2image also mentions that it uses pdftoppm and that it requires other installations:

pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler. Windows users will have to install poppler for Windows. Mac users will have to install poppler for Mac. Linux users will have pdftoppm pre-installed with the distro (Tested on Ubuntu and Archlinux) if it's not, run sudo apt install poppler-utils.

You can install the latest version under Windows using anaconda by doing:

conda install -c conda-forge poppler

note: Windows versions upto 0.67 are available at http://blog.alivate.com.au/poppler-windows/ but note that 0.68 was released in Aug 2018 so you'll not be getting the latest features or bug fixes.




ANSWER 2

Score 155


I found this simple solution, PyMuPDF, output to png file. Note the library is imported as "fitz", a historical name for the rendering engine it uses.

import fitz

pdffile = "infile.pdf"
doc = fitz.open(pdffile)
page = doc.load_page(0)  # number of page
pix = page.get_pixmap()
output = "outfile.png"
pix.save(output)
doc.close()

Note: The library changed from using "camelCase" to "snake_cased". If you run into an error that a function does not exist, have a look under deprecated names. The functions in the example above have been updated accordingly.

The fitz.Document class supports a context manager initialization:

with fitz.open(pdffile) as doc:
   ...



ANSWER 3

Score 30


The Python library pdf2image (used in the other answer) in fact doesn't do much more than just launching pdttoppm with subprocess.Popen, so here is a short version doing it directly:

PDFTOPPMPATH = r"D:\Documents\software\____PORTABLE\poppler-0.51\bin\pdftoppm.exe"
PDFFILE = "SKM_28718052212190.pdf"

import subprocess
subprocess.Popen('"%s" -png "%s" out' % (PDFTOPPMPATH, PDFFILE))

Here is the Windows installation link for pdftoppm (contained in a package named poppler): http://blog.alivate.com.au/poppler-windows/.




ANSWER 4

Score 17


There is no need to install Poppler on your OS. This will work:

pip install Wand

from wand.image import Image

f = "somefile.pdf"
with(Image(filename=f, resolution=120)) as source: 
    for i, image in enumerate(source.sequence):
        newfilename = f.removesuffix(".pdf") + str(i + 1) + '.jpeg'
        Image(image).save(filename=newfilename)