Deep Dive into Microsoft MarkItDown
Daniel Hayes
Full-Stack Engineer · Leapcell
What is MarkItDown?
MarkItDown is a Python library developed by Microsoft that converts various file formats into Markdown.
Since its release, it has quickly gained popularity, amassing over 25k stars on GitHub in only 2 weeks! 🤯
Why is MarkItDown So Popular?
MarkItDown supports an impressive range of file types, including:
- Office documents: PowerPoint, Word, Excel
- Rich media files: Images (with EXIF and image description), Audio (with transcription)
- Web and structured data: HTML, CSV, JSON, XML
- Archives: ZIP files
Not only can it handle common formats like Word and Excel, but it also supports multi-modal files by leveraging OCR and speech recognition for extracting content.
The ability to convert anything into Markdown makes MarkItDown a powerful tool for LLM training. By processing domain-specific documents, it provides rich context for generating more accurate and relevant responses in LLM-powered applications.
How to Use MarkItDown
Using MarkItDown is remarkably simple - just 4 lines of code:
from markitdown import MarkItDown md = MarkItDown() result = md.convert("test.xlsx") print(result.text_content)
Here's some use cases of MarkItDown.
Converting an Word file yields accurate Markdown output:
Multi-sheet Excel files are pieces of cake, of course:
It can also process ZIP files, all the contents within ZIP archives are parsed recursively:
Trying to extract content from images will return nothing:
Why it returns nothing? MarkItDown is designed to support image files!
The issue lies in the need for an LLM to extract image descriptions. Integrate a compatible LLM client with MarkItDown like below:
from openai import OpenAI client = OpenAI(api_key="i-am-not-an-api-key") md = MarkItDown(llm_client=client, llm_model="gpt-4o")
Once configured, the image content will be successfully converted:
Note: LLM only works for images. To extract content from PDFs, ensure the PDF has been pre-processed with OCR.
The text extracted from the PDF, however, will lose all its formatting, making no distinction between headings and regular text.
Limitations of MarkItDown
As is shown above, MarkItDown has some limitations:
- Non-OCRed PDFs cannot be processed.
- Formatting is not available when extracting from PDFs.
Since that MarkItDown is an open-source tool, it's highly extensible. Its neat codebase allows developers to add new features easily.
How Does MarkItDown Work?
MarkItDown’s architecture is simple and clean, with its core implementation contained in just one file.
In the codebase, a DocumentConverter
class is defined with a convert()
method:
class DocumentConverter: """Abstract superclass of all DocumentConverters.""" def convert( self, local_path: str, **kwargs: Any ) -> Union[None, DocumentConverterResult]: raise NotImplementedError()
Different converters inherit from this base class and are registered during initialization:
self.register_page_converter(PlainTextConverter()) self.register_page_converter(HtmlConverter()) self.register_page_converter(DocxConverter()) self.register_page_converter(XlsxConverter()) self.register_page_converter(Mp3Converter()) self.register_page_converter(ImageConverter()) # ...
This modular design makes MarkItDown highly extensible, allowing developers to create their own converters as needed.
How Different File Types Are Converted
Office Files
Office files are first converted to HTML using libraries like mammoth
, pandas
, and pptx
, then parsed into Markdown using BeautifulSoup
.
Audio Files
Audio files are processed using the speech_recognition
library, which leverages Google’s API for transcription.
(Microsoft, you still loyal to Azure, …right? 💔)
Images
Images are processed by calling an LLM with the prompt:
"Write a detailed caption for this image."
PDFs
PDFs are parsed using the pdfminer
library. However, there’s no built-in OCR, so you must ensure the PDF content is extractable beforehand.
Using MarkItDown as an API (and Host It at No Cost)
MarkItDown can run locally, but hosting it as an API unlocks additional flexibility, making it easy to integrate into workflows like Zapier, n8n, or even your own website that provides file conversion services.
Here’s a simple example of how to host MarkItDown as an API using FastAPI
:
import shutil from markitdown import MarkItDown from fastapi import FastAPI, UploadFile from uuid import uuid4 md = MarkItDown() app = FastAPI() @app.post("/convert") async def convert_markdown(file: UploadFile): hash = uuid4() folder_path = f"./tmp/{hash}" shutil.os.makedirs(folder_path, exist_ok=True) file_path = f"{folder_path}/{file.filename}" with open(file_path, "wb") as f: shutil.copyfileobj(file.file, f) result = md.convert(file_path) text = result.text_content shutil.rmtree(folder_path) return {"result": text}
You can call the API like this:
const formData = new FormData(); formData.append('file', file); const response = await fetch('http://localhost:8000/convert', { method: 'POST', body: formData, });
Hosting the API at No Cost
Hosting Python APIs can be tricky. Traditional services like AWS EC2 or DigitalOcean require renting an entire server, which is always costly.
But now, you can use Leapcell.
It's a platform which can host Python codebase in the serverless way - it charges only per API call, with a generous free-tier usage.
Just connect your GitHub repository, define build and start commands, and you’re all set:
Now you have a MarkItDown API that’s hosted in the cloud, ready for integration into your workflow, and most importantly, only charges when it's really called.
Start building your own MarkItDown API on Leapcell today! 😎