How to Use Fitz (PyMuPDF) for PDF Handling in Python
Grace Collins
Solutions Engineer · Leapcell
PyMuPDF, traditionally imported under the module name fitz, is a Python library for working with PDF documents. It lets you open PDF files, extract text, images, and metadata from them, and modify, merge, or split them. This article walks through each of those tasks with practical examples.
Key Takeaways
- Fitz (PyMuPDF) simplifies PDF manipulation in Python, including text extraction, merging, and editing.
- The library provides intuitive methods for extracting images and metadata from PDF documents.
- Fitz allows modification and creation of PDFs with minimal code.
Installation
Before using Fitz, ensure that the PyMuPDF library is installed:
pip install pymupdf
Importing the Library
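Begin by importing the library. The module is named fitz for historical reasons; recent PyMuPDF releases can also be imported as pymupdf: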
import fitz # PyMuPDF
Opening a PDF Document
To open a PDF file, use the fitz.open() function:
pdf_document = fitz.open('example.pdf')
Getting Document Information
You can retrieve various metadata from the document:
# Number of pages
num_pages = pdf_document.page_count
print(f'The document has {num_pages} pages.')

# Metadata (a dictionary with keys such as 'title' and 'author')
metadata = pdf_document.metadata
print('Metadata:', metadata)
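The metadata attribute is a plain dictionary, and fields the PDF does not set typically come back empty. As a minimal sketch, the helper below keeps only the populated fields so the printout is easier to read. The name non_empty_metadata is illustrative and not part of PyMuPDF, and the sample dictionary is hand-written for demonstration.

```python
def non_empty_metadata(metadata):
    """Return only the metadata fields that actually hold a value."""
    # Unset fields typically come back as None or an empty string
    return {key: value for key, value in (metadata or {}).items() if value}


# Hand-written dictionary shaped like pdf_document.metadata
sample = {"title": "Report", "author": "", "subject": None, "format": "PDF 1.7"}
print(non_empty_metadata(sample))
```

With an open document, the call would be non_empty_metadata(pdf_document.metadata).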
Extracting Text from Pages
To extract text from a specific page:
# Load a specific page (0-based index)
page_number = 0
page = pdf_document.load_page(page_number)

# Extract text
text = page.get_text()
print(f'Text on page {page_number + 1}:\n{text}')
To extract text from all pages:
for page_num in range(pdf_document.page_count):
    page = pdf_document.load_page(page_num)
    text = page.get_text()
    print(f'Text on page {page_num + 1}:\n{text}')
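Printing is fine for a quick look, but the extracted text is usually needed elsewhere. The sketch below collects the text per page and writes it to a file. It relies on the fact that iterating over a PyMuPDF document yields its pages in order. The function names and the page marker format are illustrative choices, not part of the library.

```python
def collect_page_text(document):
    """Map 1-based page numbers to the text extracted from each page."""
    pages = {}
    # Iterating over a PyMuPDF document yields its pages in order
    for page_number, page in enumerate(document, start=1):
        pages[page_number] = page.get_text()
    return pages


def save_text(document, output_path):
    """Write all page text to one UTF-8 file, with a marker before each page."""
    with open(output_path, "w", encoding="utf-8") as out:
        for page_number, text in collect_page_text(document).items():
            out.write(f"--- Page {page_number} ---\n{text}\n")
```

For example, save_text(pdf_document, 'example.txt') would write the whole document's text to example.txt.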
Extracting Images from Pages
To extract images from a specific page:
# Load the page
page = pdf_document.load_page(page_number)

# Get images on the page
image_list = page.get_images(full=True)
print(f'Found {len(image_list)} images on page {page_number + 1}.')

# Process each image
for img_index, img in enumerate(image_list, start=1):
    xref = img[0]  # cross-reference number of the image object
    base_image = pdf_document.extract_image(xref)
    image_bytes = base_image["image"]
    image_ext = base_image["ext"]  # actual format, e.g. 'png' or 'jpeg'

    # Save the image using the extension that matches its real format
    with open(f'image_page{page_number + 1}_{img_index}.{image_ext}', 'wb') as image_file:
        image_file.write(image_bytes)
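The same image (a logo, for instance) can be referenced from several pages, so looping page by page may save it more than once. As a sketch, the helpers below gather each image xref only once across the whole document and build a filename from the format that extract_image() reports. The function names and the filename pattern are illustrative, not part of PyMuPDF.

```python
def unique_image_xrefs(document):
    """Return the xref of each distinct image in the document, in first-seen order."""
    seen = set()
    ordered = []
    for page in document:
        for img in page.get_images(full=True):
            xref = img[0]  # first item of each entry is the image's xref
            if xref not in seen:
                seen.add(xref)
                ordered.append(xref)
    return ordered


def image_filename(xref, base_image):
    """Build a filename from the format reported by extract_image()."""
    return f"image_{xref}.{base_image['ext']}"
```

Each xref returned by unique_image_xrefs(pdf_document) can then be passed to pdf_document.extract_image(xref), and the resulting bytes written to image_filename(xref, base_image).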
Adding Text to a Page
To add text to a specific page:
# Load the page
page = pdf_document.load_page(page_number)

# Define the text and position
text = "Hello, PyMuPDF!"
position = fitz.Point(100, 100)

# Add text to the page
page.insert_text(position, text, fontsize=12, color=(0, 0, 0))

# Save the changes to a new file
pdf_document.save('modified_example.pdf')
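The position passed to insert_text() is given in points (1/72 inch), measured from the top-left corner of the page, and it marks where the first character starts. If a layout is specified in millimetres, a small conversion helps. The helpers below are a sketch with illustrative names, not part of PyMuPDF.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def mm_to_points(mm):
    """Convert millimetres to PDF points (1 point = 1/72 inch)."""
    return mm * POINTS_PER_INCH / MM_PER_INCH


def text_origin(left_mm, top_mm):
    """Return (x, y) in points for a spot left_mm and top_mm from the top-left corner."""
    return (mm_to_points(left_mm), mm_to_points(top_mm))
```

The result can be unpacked into a point, for example fitz.Point(*text_origin(20, 30)) for a position 20 mm from the left edge and 30 mm from the top.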
Merging PDF Documents
To merge two PDF documents:
# Open the documents
pdf1 = fitz.open('document1.pdf')
pdf2 = fitz.open('document2.pdf')

# Append all pages of pdf2 to the end of pdf1
pdf1.insert_pdf(pdf2)

# Save the merged document
pdf1.save('merged_document.pdf')
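The same call extends to any number of files. The sketch below appends several documents to a new, empty PDF so that none of the inputs is modified. The function names merge_into and merge_files are illustrative, and fitz is imported inside merge_files only so that merge_into can be used on its own.

```python
def merge_into(target, sources):
    """Append every page of each source document to target, in order."""
    for source in sources:
        target.insert_pdf(source)
    return target


def merge_files(paths, output_path):
    """Merge the PDFs at the given paths into one file at output_path."""
    import fitz  # PyMuPDF

    merged = fitz.open()  # calling open() with no arguments creates an empty PDF
    sources = [fitz.open(path) for path in paths]
    try:
        merge_into(merged, sources)
        merged.save(output_path)
    finally:
        # Close everything even if saving fails
        for document in sources:
            document.close()
        merged.close()
```

For example, merge_files(['document1.pdf', 'document2.pdf'], 'merged_document.pdf') would produce the same result as the snippet above without altering document1.pdf in memory.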
Splitting a PDF Document
To extract specific pages into a new PDF:
# Open the original document
pdf_document = fitz.open('example.pdf')

# Create a new, empty PDF for the extracted pages
new_pdf = fitz.open()

# Define the page range to extract (e.g., pages 2 to 4)
start_page = 1  # 0-based index of the first page to copy
end_page = 3    # 0-based index of the last page to copy (inclusive)

# Insert the specified pages into the new PDF
new_pdf.insert_pdf(pdf_document, from_page=start_page, to_page=end_page)

# Save the new PDF
new_pdf.save('extracted_pages.pdf')
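Page ranges are an easy place for off-by-one mistakes: insert_pdf() takes 0-based indices, and the page at to_page is itself copied. As a sketch, the helper below converts the 1-based, inclusive range a reader would naturally write into those arguments. The name page_range_args is illustrative, not part of PyMuPDF.

```python
def page_range_args(first_page, last_page):
    """Convert a 1-based, inclusive page range to insert_pdf() keyword arguments."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    # insert_pdf() takes 0-based indices, and to_page is itself included
    return {"from_page": first_page - 1, "to_page": last_page - 1}
```

Copying pages 2 to 4 then becomes new_pdf.insert_pdf(pdf_document, **page_range_args(2, 4)).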
Closing the Document
After completing operations on the PDF, close the document to free resources:
pdf_document.close()
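A document can also be opened in a with statement, which closes it automatically even if an error occurs partway through. The sketch below shows the pattern. The open_pdf parameter is a hypothetical hook added here so the function can be exercised without a real file; by default it falls back to fitz.open.

```python
def count_pages(path, open_pdf=None):
    """Return the page count of the PDF at path, closing the file afterwards."""
    if open_pdf is None:
        import fitz  # PyMuPDF

        open_pdf = fitz.open
    # The document is closed when the with block exits, even on errors
    with open_pdf(path) as document:
        return document.page_count
```

For example, count_pages('example.pdf') returns the number of pages and leaves no open file handle behind.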
FAQs
To extract text from every page, loop through the pages using load_page() and call get_text() on each one.
To merge PDFs, use insert_pdf() to insert pages from one document into another.
To edit an existing PDF, use insert_text() or the annotation methods to modify the document.
Conclusion
Fitz (PyMuPDF) offers a comprehensive set of tools for PDF manipulation in Python, making it a valuable resource for tasks such as text extraction, content modification, and document merging or splitting. For more detailed information and advanced functionalities, refer to the PyMuPDF documentation.