Key Points
- The research paper presents DocLLM, a lightweight extension to traditional large language models, designed for comprehending visually rich documents with complex layouts.
- DocLLM models both text semantics and spatial layout, using bounding box information obtained from optical character recognition (OCR) rather than a complex vision encoder.
- Spatial layout information is incorporated through a disentangled attention mechanism that decomposes transformer attention into separate text and spatial terms, capturing the cross-alignment between the two modalities in structured documents (a minimal sketch follows this list).
- The paper proposes a pre-training objective that learns to infill text blocks, addressing the irregular layouts and heterogeneous content frequently encountered in visual documents.
- The model is fine-tuned using a large-scale instruction dataset to cover four core document intelligence tasks: visual question answering, natural language inference, key information extraction, and document classification.
- The evaluation demonstrates that DocLLM outperforms state-of-the-art large language models on 14 out of 16 datasets across all tasks and generalizes well to 4 out of 5 previously unseen datasets.
- DocLLM is lightweight and efficient, offering an opportunity to extend generative pre-training to documents with complex, visually rich layouts.
- The paper describes the model architecture, pre-training and instruction tuning procedures, experimental settings, ablation studies, and potential areas for future work, such as infusing vision into DocLLM in a lightweight manner.
- The research paper was prepared by the Artificial Intelligence Research group of JPMorgan Chase & Co and its affiliates for information purposes and is not intended as investment research or investment advice.
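
The disentangled attention mentioned above can be pictured with a short PyTorch sketch. This is an illustrative single-head implementation, not the paper's code: the module name, the lambda weights, and the assumption that bounding boxes have already been embedded to the model dimension are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledSpatialAttention(nn.Module):
    """Single-head sketch of DocLLM-style disentangled attention.

    The attention score is a weighted sum of text-to-text, text-to-spatial,
    spatial-to-text, and spatial-to-spatial interaction terms, so the layout
    modality (bounding boxes) never has to be fused into the text embeddings
    themselves. Shapes and lambda weights are illustrative assumptions.
    """

    def __init__(self, d_model: int, lambdas=(1.0, 1.0, 1.0)):
        super().__init__()
        self.d_model = d_model
        # Separate query/key projections for the text and spatial modalities.
        self.q_text = nn.Linear(d_model, d_model)
        self.k_text = nn.Linear(d_model, d_model)
        self.q_spatial = nn.Linear(d_model, d_model)
        self.k_spatial = nn.Linear(d_model, d_model)
        # Values are drawn from the text modality only.
        self.v_text = nn.Linear(d_model, d_model)
        self.l_ts, self.l_st, self.l_ss = lambdas

    def forward(self, text_emb, box_emb, causal_mask=None):
        # text_emb, box_emb: (batch, seq_len, d_model)
        qt, kt = self.q_text(text_emb), self.k_text(text_emb)
        qs, ks = self.q_spatial(box_emb), self.k_spatial(box_emb)
        v = self.v_text(text_emb)

        # Four disentangled interaction terms, summed into one score matrix.
        scores = (
            qt @ kt.transpose(-2, -1)
            + self.l_ts * (qt @ ks.transpose(-2, -1))
            + self.l_st * (qs @ kt.transpose(-2, -1))
            + self.l_ss * (qs @ ks.transpose(-2, -1))
        ) / self.d_model ** 0.5

        if causal_mask is not None:
            scores = scores.masked_fill(causal_mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v
```

Keeping the spatial terms in separate projections means the layout signal can be re-weighted or dropped without disturbing the text pathway, which is part of what keeps the extension lightweight.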
Summary
The paper presents DocLLM, a lightweight extension to traditional large language models (LLMs) specifically designed for generative reasoning over visually rich documents with complex layouts. DocLLM incorporates spatial layout information without relying on a complex vision encoder, focusing exclusively on bounding box information to capture the spatial layout structure. The model employs a disentangled attention approach and a pre-training objective that learns to infill text blocks, addressing the irregular layouts and heterogeneous content encountered in visual documents.
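
As a rough illustration of the block-infilling objective, the sketch below builds a single training example from OCR text segments: some blocks are replaced by sentinels in the input, and the target asks the model to regenerate them. The sentinel format, masking ratio, and function name are assumptions for illustration; the paper's actual recipe also conditions the prediction on the blocks' spatial information.

```python
import random

def build_block_infilling_example(blocks, mask_ratio=0.15, seed=None):
    """Build one illustrative block-infilling example.

    `blocks` is a list of OCR text segments (e.g. lines or paragraphs).
    A random subset is replaced by numbered sentinel tokens in the input,
    and the target contains the masked blocks after their sentinels.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(blocks) * mask_ratio))
    masked_ids = set(rng.sample(range(len(blocks)), n_mask))

    input_parts, target_parts = [], []
    for i, block in enumerate(blocks):
        if i in masked_ids:
            sentinel = f"<infill_{len(target_parts)}>"
            input_parts.append(sentinel)
            target_parts.append(f"{sentinel} {block}")
        else:
            input_parts.append(block)

    return " ".join(input_parts), " ".join(target_parts)


# Toy invoice-like document.
blocks = ["Invoice #1234", "Date: 2024-01-02", "Total: $56.78", "Thank you"]
inp, tgt = build_block_infilling_example(blocks, seed=0)
print(inp)  # e.g. "Invoice #1234 <infill_0> Total: $56.78 Thank you"
print(tgt)  # e.g. "<infill_0> Date: 2024-01-02"
```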
The pre-trained model is fine-tuned on a comprehensive instruction dataset, and evaluation across document intelligence tasks shows that DocLLM outperforms state-of-the-art LLMs on most evaluated datasets and generalizes robustly to previously unseen datasets. The paper also includes an in-depth discussion of the architecture, experiments, and ablation studies that validate the key contributions of DocLLM.
Reference: https://arxiv.org/abs/2401.00908