How to Use Fitz (PyMuPDF) for PDF Handling in Python

PyMuPDF is a Python library for working with PDF documents; fitz is the module name it has traditionally been imported under. It lets you open, modify, and extract information from PDF files. This article walks through the most common operations with practical examples.

Key Takeaways

Fitz (PyMuPDF) simplifies PDF manipulation in Python, including text extraction, merging, and editing.
The library provides intuitive methods for extracting images and metadata from PDF documents.
Fitz allows modification and creation of PDFs with minimal code.

Installation

Before using Fitz, ensure that the PyMuPDF library is installed:

pip install pymupdf

Importing the Library

Begin by importing the library:

import fitz  # PyMuPDF

Opening a PDF Document

To open a PDF file, use the fitz.open() function:

pdf_document = fitz.open('example.pdf')

Getting Document Information

The document object exposes its page count and a metadata dictionary:

# Number of pages
num_pages = pdf_document.page_count
print(f'The document has {num_pages} pages.')

# Metadata
metadata = pdf_document.metadata
print('Metadata:', metadata)

Extracting Text from Pages

To extract text from a specific page:

# Load a specific page (0-based index)
page_number = 0
page = pdf_document.load_page(page_number)

# Extract text
text = page.get_text()
print(f'Text on page {page_number + 1}:\n{text}')

To extract text from all pages:

for page_num in range(pdf_document.page_count):
    page = pdf_document.load_page(page_num)
    text = page.get_text()
    print(f'Text on page {page_num + 1}:\n{text}')

Extracting Images from Pages

To extract images from a specific page:

# Load the page
page = pdf_document.load_page(page_number)

# Get images on the page
image_list = page.get_images(full=True)
print(f'Found {len(image_list)} images on page {page_number + 1}.')

# Process each image
for img_index, img in enumerate(image_list, start=1):
    xref = img[0]  # cross-reference number of the image object
    base_image = pdf_document.extract_image(xref)
    image_bytes = base_image["image"]
    image_ext = base_image["ext"]  # actual format, e.g. 'png' or 'jpeg'

    # Save the image with its real extension
    with open(f'image_page{page_number + 1}_{img_index}.{image_ext}', 'wb') as image_file:
        image_file.write(image_bytes)

Adding Text to a Page

To add text to a specific page:

# Load the page
page = pdf_document.load_page(page_number)

# Define the text and position
text = "Hello, PyMuPDF!"
position = fitz.Point(100, 100)

# Add text to the page
page.insert_text(position, text, fontsize=12, color=(0, 0, 0))

# Save the changes
pdf_document.save('modified_example.pdf')

Merging PDF Documents

To merge two PDF documents:

# Open the documents
pdf1 = fitz.open('document1.pdf')
pdf2 = fitz.open('document2.pdf')

# Append all pages of pdf2 to the end of pdf1
pdf1.insert_pdf(pdf2)

# Save the merged document
pdf1.save('merged_document.pdf')

Splitting a PDF Document

To extract specific pages into a new PDF:

# Open the original document
pdf_document = fitz.open('example.pdf')

# Create a new PDF for the extracted pages
new_pdf = fitz.open()

# Define the page range to extract (e.g., pages 2 to 4)
start_page = 1  # 0-based index of the first page to copy
end_page = 3    # 0-based index of the last page to copy (inclusive)

# Insert the specified pages into the new PDF
new_pdf.insert_pdf(pdf_document, from_page=start_page, to_page=end_page)

# Save the new PDF
new_pdf.save('extracted_pages.pdf')

Closing the Document

After completing operations on the PDF, close the document to free resources:

pdf_document.close()

FAQs

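How do I extract all text from a PDF using Fitz?

Loop through pages using load_page() and call get_text() on each page.

Can Fitz be used to merge two PDF documents?

Yes, use insert_pdf() to append the pages of one PDF to another.

Is it possible to add text or annotations to a PDF?

Yes, use insert_text() or annotation methods to modify the document.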

Conclusion

Fitz (PyMuPDF) offers a comprehensive set of tools for PDF manipulation in Python, making it a valuable resource for tasks such as text extraction, content modification, and document merging or splitting. For more detailed information and advanced functionalities, refer to the PyMuPDF documentation.

