Journal "Computational Technologies"

Article information

2022, Volume 27, № 5, p. 69-78

Shigarov A.O., Paramonov V.V.

Page text segmentation in untagged PDF documents

Currently, a large number of non-editable documents are published and distributed in PDF (Portable Document Format). Often they are “untagged”, i.e. they carry no annotation of their structure, including headings, paragraphs, tables, lists, figures, footers, etc. Document layout analysis consists in recognizing these structural elements. A basic part of this process is the segmentation of page text into blocks that can be classified as headings, paragraphs, table cells, etc. The well-known page segmentation algorithms are mainly designed to deal with either bitmap images of document pages or print-oriented ASCII text. Compared to these data formats, PDF provides additional information (rendering order, font metrics, ruling lines, etc.) that can improve document layout analysis. The paper describes our experience in adapting some existing algorithms for segmenting page text in document images and ASCII text so that they can be applied directly to untagged PDF documents.
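To make the idea of segmenting page text into blocks concrete, the following is a minimal sketch of bottom-up grouping of word bounding boxes by proximity. It is not the algorithm described in the paper. The `Word` type, the `segment` function, the thresholds `max_dx` and `max_dy`, and the top-left coordinate origin are all assumptions introduced here for illustration; a real PDF pipeline would first have to extract word boxes with a PDF library and could additionally exploit font metrics, rendering order, and ruling lines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word with its bounding box (hypothetical type; origin assumed top-left)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def _gap(a_lo, a_hi, b_lo, b_hi):
    """Signed distance between two 1-D intervals: positive gap, negative overlap."""
    return max(a_lo, b_lo) - min(a_hi, b_hi)


def near(a, b, max_dx, max_dy):
    """Two words are linked if they sit on one line and are close horizontally,
    or overlap horizontally and are close vertically."""
    dx = _gap(a.x0, a.x1, b.x0, b.x1)
    dy = _gap(a.y0, a.y1, b.y0, b.y1)
    same_line = dy < 0 and dx <= max_dx
    stacked = dx < 0 and dy <= max_dy
    return same_line or stacked


def segment(words, max_dx=6.0, max_dy=4.0):
    """Group words into blocks as connected components of the `near` relation."""
    parent = list(range(len(words)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if near(words[i], words[j], max_dx, max_dy):
                parent[find(i)] = find(j)

    blocks = {}
    for i, w in enumerate(words):
        blocks.setdefault(find(i), []).append(w)

    # Order words inside a block, and blocks on the page, top-to-bottom then left-to-right.
    ordered = [sorted(b, key=lambda w: (w.y0, w.x0)) for b in blocks.values()]
    return sorted(ordered, key=lambda b: (min(w.y0 for w in b), min(w.x0 for w in b)))
```

The pairwise comparison is quadratic in the number of words, which is acceptable for a single page; the thresholds are arbitrary here and would in practice be derived from font size or line spacing.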

Keywords: document layout analysis, document segmentation, document images, PDF accessibility, document information processing

doi: 10.25743/ICT.2022.27.5.007

Author(s):
Shigarov Alexei Olegovich
PhD.
Position: Leading research officer
Office: Institute for System Dynamics and Control Theory, Siberian Branch of RAS
Address: 664033, Russia, Irkutsk
Phone Office: (3952) 45-31-07
E-mail: shigarov@icc.ru
SPIN-code: 5159-9006

Paramonov Vyacheslav Vladimirovich
PhD.
Position: Junior Research Scientist
Office: Institute for System Dynamics and Control Theory, Siberian Branch of RAS
Address: 664033, Russia, Irkutsk
Phone Office: (3952) 45-30-73
E-mail: slv@icc.ru
SPIN-code: 2364-8270

Bibliography link:
Shigarov A.O., Paramonov V.V. Page text segmentation in untagged PDF documents // Computational technologies. 2022. V. 27. № 5. P. 69-78