Document Structure Extraction with Kreuzberg

Source: DEV Community
Extracting structured data from PDFs is one of the hardest problems in AI infrastructure. Most tools give you a text dump with no headings, no table boundaries, and no distinction between a caption and a footnote.

When Docling launched, it changed the game with a genuinely good layout model. We want to be clear: Docling is a great project, and we have the greatest respect for the team at IBM for putting it out there. It’s also fully open source under the permissive Apache-2.0 license.

We integrated their model into Kreuzberg and embedded it in a Rust-native pipeline. The result currently runs 2.8× faster with a fraction of the memory footprint. This post covers the behind-the-scenes part: what we used, what we rebuilt from scratch, and where the speed comes from.

Why Document Structure Matters for AI and RAG Pipelines

If you’re building AI infrastructure such as RAG pipelines, document processing workflows, or any AI application that ingests PDFs at scale, flat text extraction isn’t enough anymore. C