Beyond OCR: AI Unlocks Document Structure and Vision

Practical AI · December 02, 2025

Original Title: Technical advances in document understanding

Resources

Tesseract - A classic OCR tool mentioned as an example of typical OCR models.
PaddleOCR - Another example of a typical OCR model.
Docling (IBM) - A toolkit built around document structure models, which predict the structure of a document rather than just extracting its text.
MarkItDown (Microsoft) - A toolkit similar to Docling, used in retrieval-augmented generation (RAG) systems to preserve document structure.
Qwen2.5-VL - A specific vision language model that the speakers have used and describe as a good model.
DeepSeek-OCR - A newer model that processes a document by splitting it into high-resolution image tokens and combining them with a global full-resolution view, aiming to preserve more detail than traditional vision language models.

IBM - Developed the Docling toolkit for document processing.
Hugging Face - Released a smaller Docling model suitable for constrained environments.
Microsoft - Developed the MarkItDown toolkit, used in RAG systems.

practicalai.fm - The podcast's official website.
LinkedIn - A platform to connect with the podcast for updates and insights.
X - A platform to connect with the podcast for updates and insights.
Bluesky - A platform to connect with the podcast for updates and insights.
predictionguard.com - Website for Prediction Guard, a partner providing operational support for the show.