Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) high computational costs for using OCR; 2) inflexibility of OCR models on languages or types of documents; 3) OCR error propagation to the subsequent process. To address these issues, in this paper, we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As the first step in OCR-free VDU research, we propose a simple architecture (i.e., Transformer) with a pre-training objective (i.e., cross-entropy loss). Donut is conceptually simple yet effective.
2021: Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park
https://arxiv.org/pdf/2111.15664v2.pdf