Researchers from Alibaba have developed a powerful new model called mPLUG-DocOwl 1.5 that pushes the state-of-the-art in OCR-free document understanding.

The model leverages unified structure learning and a novel vision-to-text module to achieve impressive results across 10 benchmarks covering documents, tables, charts, webpages, and natural images – all without relying on optical character recognition (OCR).

Link to the research

What is OCR-free document understanding?

OCR-free document understanding aims to comprehend the information in images of documents, tables, webpages, etc. without explicitly recognizing and extracting the text via OCR first. This is a challenging problem because the model needs to understand both the visual layout and structure as well as the textual content and semantics.

Most existing approaches rely on first running OCR to extract the text, then analyzing that text to understand the document.

In contrast, OCR-free methods like mPLUG-DocOwl 1.5 can directly understand document images end-to-end.

DocOwl 1.5 achieves state-of-the-art OCR-free performance

Compared with similar-size generalist models, DocOwl 1.5 achieves state-of-the-art OCR-free performance on 10 Visual Document Understanding benchmarks.

Check out the video I made about this approach:

Key innovations in mPLUG-DocOwl 1.5

Unified structure learning

A core contribution of this work is unified structure learning, which trains the model to parse the structure of documents across 5 domains:

  • Documents – learns to use spaces and line breaks to represent the layout
  • Tables – parses tables into structured markdown format
  • Charts – understands legends, axes, and values to parse charts into data tables
  • Webpages – handles webpage-specific elements like navbars, headers, etc.
  • Natural images – describes the image content in addition to reading scene text
Illustrations of the importance of structure information in Visual Document Understanding on documents (a), tables (b), webpages (c), infographics (d), and charts (e-f).

By learning these structure parsing tasks, the model gains a deep understanding of document layout that transfers well to downstream tasks like question answering.
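For tables, the structure-aware target sequence is essentially a markdown rendering of the table's content, so that row and column boundaries become explicit tokens the model can learn. A minimal sketch of how such a target might be built from annotated cell text (the function and sample data are illustrative, not the authors' actual pipeline):

```python
def table_to_markdown(header, rows):
    """Render annotated table cells as a markdown target sequence.

    The pipe separators and the `---` delimiter row make the table's
    structure explicit in the text the model is trained to emit.
    """
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)

print(table_to_markdown(["Name", "Score"], [["A", "1"], ["B", "2"]]))
```

Parsing a chart works analogously, except the data table must first be recovered from the legend, axes, and plotted values before it can be serialized.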

Multi-grained text localization is also a key part of the structure learning. The model learns both to recognize text at the word, phrase, line, and block levels and to localize a given text span in the image. This teaches fine-grained grounding of text to image regions.
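A rough sketch of what a text-localization training sample might look like, assuming bounding boxes are serialized as plain tokens in the target text (the `<bbox>` tag format and the 0–999 coordinate scale here are assumptions for illustration, not the paper's exact encoding):

```python
def bbox_to_token(bbox, size=999):
    # Serialize a normalized (x1, y1, x2, y2) box as plain text tokens,
    # so the language model can read and emit locations directly.
    x1, y1, x2, y2 = (int(round(v * size)) for v in bbox)
    return f"<bbox>{x1},{y1},{x2},{y2}</bbox>"

def grounding_sample(text, bbox):
    # The "localize this span" direction of the multi-grained task;
    # the reverse direction asks the model to read the text inside a box.
    return {"prompt": f'Locate the text "{text}" in the image.',
            "target": bbox_to_token(bbox)}

s = grounding_sample("Total: $42.00", (0.12, 0.80, 0.45, 0.85))
print(s["target"])  # <bbox>120,799,450,849</bbox>
```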

H-Reducer vision-to-text module

Another key innovation is the H-Reducer module, which maps visual features into the textual feature space of the language model. It uses a convolution layer to merge horizontally adjacent visual features, preserving layout information while reducing the length of the visual sequence. This outperforms other common vision-to-text approaches.
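A minimal NumPy sketch of the horizontal-merging idea: a 1×4 convolution with stride 4 is equivalent to a linear projection applied to each group of 4 horizontally adjacent features. Random weights stand in for learned ones, and the shapes are illustrative rather than the model's actual dimensions:

```python
import numpy as np

def h_reducer(features, ratio=4, seed=0):
    # features: (H, W, C) grid of patch features from the vision encoder
    H, W, C = features.shape
    assert W % ratio == 0
    # Group `ratio` horizontally adjacent features: (H, W/ratio, ratio*C).
    # Row-major reshape keeps neighbors in the same row together, so
    # horizontal layout (reading order) is preserved.
    grouped = features.reshape(H, W // ratio, ratio * C)
    # A 1x`ratio` conv with stride `ratio` acts as a linear projection
    # of each group; random weights stand in for the learned kernel.
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((ratio * C, C)) / np.sqrt(ratio * C)
    return grouped @ proj  # (H, W/ratio, C): a 4x shorter sequence

feats = np.ones((32, 32, 8))
out = h_reducer(feats)
print(out.shape)  # (32, 8, 8)
```

Merging only along the width keeps each row of patches intact, which matters because text in documents flows horizontally; naive 2D pooling would blur line boundaries.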

The two-stage training framework (a) and overall architecture (b) of DocOwl 1.5. The global image and cropped images are processed independently by the Visual Encoder and H-Reducer.

DocStruct4M and DocReason25K datasets

To enable the unified structure learning, the researchers compiled a new dataset called DocStruct4M by constructing structure-aware text sequences and bounding boxes for images from multiple existing datasets.

Detailed statistics of DocStruct4M

They also collected a smaller dataset called DocReason25K, designed to elicit the language model's reasoning abilities through step-by-step explanations for question answering.

Training procedure

mPLUG-DocOwl 1.5 is trained in two stages:

  1. Structure pre-training on DocStruct4M with the vision encoder and H-Reducer
  2. Multi-task fine-tuning on downstream tasks with the vision encoder frozen

This two-stage approach worked better than jointly training all components together.
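The schedule above can be sketched as a simple stage-to-trainable-modules mapping; the module names are illustrative labels, not identifiers from the authors' code:

```python
# Illustrative module names for the three trainable parts of the model.
MODULES = ["vision_encoder", "h_reducer", "language_model"]

def trainable_modules(stage):
    if stage == 1:
        # Structure pre-training on DocStruct4M: the vision encoder
        # and H-Reducer are updated along with the rest of the model.
        return MODULES
    if stage == 2:
        # Multi-task fine-tuning: the vision encoder is frozen and
        # only the remaining modules continue to update.
        return [m for m in MODULES if m != "vision_encoder"]
    raise ValueError(f"unknown stage: {stage}")

print(trainable_modules(2))  # ['h_reducer', 'language_model']
```

Freezing the vision encoder in stage 2 keeps the structure-aware visual representations learned in stage 1 from drifting while the model adapts to downstream tasks.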

State-of-the-art results

The resulting model achieves impressive results, setting a new state-of-the-art for OCR-free performance on 10 benchmarks:

DocOwl 1.5 achieves state-of-the-art OCR-free performance

It outperforms similar-sized models by over 10 points on 5 out of the 10 tasks. The strong results demonstrate the effectiveness of the unified structure learning and H-Reducer innovations.

mPLUG-DocOwl 1.5 also shows promising language reasoning capabilities, providing step-by-step explanations for its answers on the DocReason25K data:

Qualitative results of question answering with detailed explanations

However, like other large language models, it still sometimes hallucinates incorrect statements, which is an area for future work.

Conclusion

mPLUG-DocOwl 1.5 represents an exciting leap forward in OCR-free document understanding. By learning to comprehend document structure in a unified way across multiple domains, it achieves a new state-of-the-art on a wide range of benchmarks covering both understanding and reasoning. The innovations in unified structure learning and the H-Reducer vision-to-text architecture can likely inspire and pave the way for future advances in this challenging and impactful area of AI research.

Citation of the paper: Hu, A., Xu, H., Ye, J., Yan, M., Zhang, L., Zhang, B., Li, C., Zhang, J., Jin, Q., Huang, F. and Zhou, J., 2024. mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding. arXiv preprint arXiv:2403.12895.

Last Update: 27/05/2024