| 
Article information
2024, Volume 29, No. 6, pp. 125-146
 Shigarov A.O.
Table recognition in untagged PDF documents using PDF-specific features
Nowadays, PDF is one of the most popular formats for distributing print-oriented documents in the electronic environment. PDF documents are often untagged, i.e. pages are represented only by low-level instructions for rendering text and graphics and are not accompanied by annotations of their structural components (headings, paragraphs, tables, etc.). Automatically recovering such annotations can make the structural components accessible. This requires solving a number of tasks, one of which is recognizing tables in untagged PDF documents: detecting the boundaries of their rows, columns, and cells. This paper proposes a method for recognizing tables in untagged PDF documents. Unlike existing analogues, it solves this task using PDF-specific features such as text output order, pen movement positions, etc. This made it possible to adapt to the task some known approaches and methods originally oriented towards raster images and unformatted text, including “word clustering”, “rows first” detection, whitespace segmentation, and connected component analysis. The presented performance evaluation results demonstrate the effectiveness of solutions implementing this method. Quantitative comparison with analogues indicates that these solutions are in line with the current state of the art in this area. At the same time, qualitative comparison reveals the following advantages over analogues. The implementation of the proposed table recognition method does not require preliminary parameter adjustment or supervised learning. However, if ready-to-use neural network models are available, they can replace rule-based table detection algorithms. The quality of the final results can also be improved by filtering candidate cases.
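As a rough illustration of one technique named in the abstract, whitespace segmentation, the sketch below finds vertical whitespace gaps between word bounding boxes, which can serve as candidate column separators. It is not the author's implementation: the box format `(x0, y0, x1, y1)`, the function name `column_gaps`, and the `min_gap` threshold are all assumptions made for this example, and the boxes are assumed to have been extracted from the PDF beforehand.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in page coordinates.
Box = Tuple[float, float, float, float]


def column_gaps(boxes: List[Box], min_gap: float = 5.0) -> List[Tuple[float, float]]:
    """Return horizontal ranges (x_start, x_end) that no box overlaps.

    Each range at least `min_gap` wide is a candidate column separator.
    `min_gap` is an illustrative threshold, not a value from the paper.
    """
    # Project every box onto the x-axis and sort by left edge.
    intervals = sorted((b[0], b[2]) for b in boxes)
    gaps: List[Tuple[float, float]] = []
    if not intervals:
        return gaps

    # Sweep left to right, tracking the rightmost covered coordinate.
    right = intervals[0][1]
    for x0, x1 in intervals[1:]:
        if x0 - right >= min_gap:
            gaps.append((right, x0))
        right = max(right, x1)
    return gaps


if __name__ == "__main__":
    words = [(10, 0, 40, 10), (60, 0, 90, 10), (12, 20, 38, 30), (62, 20, 95, 30)]
    print(column_gaps(words))
```

The same sweep applied to the y-axis projections would yield candidate row separators. A fuller approach would also have to handle cells that span several columns, which this sketch ignores.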
  Keywords: table recognition, table extraction, unstructured data, document tables, document page layout analysis
  doi: 10.25743/ICT.2024.29.6.008
Author(s): Shigarov Alexei Olegovich, PhD
Position: Leading research officer
Office: Institute for System Dynamics and Control Theory, Siberian Branch of RAS
Address: 664033, Russia, Irkutsk
Phone Office: (3952) 45-31-07
E-mail: shigarov@icc.ru
SPIN-code: 5159-9006
Bibliography link: Shigarov A.O. Table recognition in untagged PDF documents using PDF-specific features // Computational technologies. 2024. V. 29. No. 6. P. 125-146
 				 |