0
KH-FUNSD: A Hierarchical and Fine-Grained Layout Analysis Dataset for Low-Resource Khmer Business Document
arXiv:2512.11849v1 Announce Type: new
Abstract: Automated document layout analysis remains a major challenge for low-resource, non-Latin scripts. Khmer is a language spoken daily by over 17 million people in Cambodia, receiving little attention in the development of document AI tools. The lack of dedicated resources is particularly acute for business documents, which are critical for both public administration and private enterprise. To address this gap, we present \textbf{KH-FUNSD}, the first publicly available, hierarchically annotated dataset for Khmer form document understanding, including receipts, invoices, and quotations. Our annotation framework features a three-level design: (1) region detection that divides each document into core zones such as header, form field, and footer; (2) FUNSD-style annotation that distinguishes questions, answers, headers, and other key entities, together with their relationships; and (3) fine-grained classification that assigns specific semantic roles, such as field labels, values, headers, footers, and symbols. This multi-level approach supports both comprehensive layout analysis and precise information extraction. We benchmark several leading models, providing the first set of baseline results for Khmer business documents, and discuss the distinct challenges posed by non-Latin, low-resource scripts. The KH-FUNSD dataset and documentation will be available at URL.
Abstract: Automated document layout analysis remains a major challenge for low-resource, non-Latin scripts. Khmer is a language spoken daily by over 17 million people in Cambodia, receiving little attention in the development of document AI tools. The lack of dedicated resources is particularly acute for business documents, which are critical for both public administration and private enterprise. To address this gap, we present \textbf{KH-FUNSD}, the first publicly available, hierarchically annotated dataset for Khmer form document understanding, including receipts, invoices, and quotations. Our annotation framework features a three-level design: (1) region detection that divides each document into core zones such as header, form field, and footer; (2) FUNSD-style annotation that distinguishes questions, answers, headers, and other key entities, together with their relationships; and (3) fine-grained classification that assigns specific semantic roles, such as field labels, values, headers, footers, and symbols. This multi-level approach supports both comprehensive layout analysis and precise information extraction. We benchmark several leading models, providing the first set of baseline results for Khmer business documents, and discuss the distinct challenges posed by non-Latin, low-resource scripts. The KH-FUNSD dataset and documentation will be available at URL.
No comments yet.