Article information
2020 , Volume 25, ¹ 3, p.99-110
Avdeenko T.V., Mezentsev Y.A.
Document clustering based on the semantic matrix of relationships for conceptual indexing
Purpose. The purpose of this work is to develop a method for document clustering based on conceptual indexing with the help of knowledge taxonomy. Methodology. Solving the problem of document clustering involves two fundamental stages. The first stage is preprocessing of a text document and representing it as a data table suitable for subsequent application of data analysis methods. The second stage is actually the optimization of clustering algorithm, which allows achieving optimal partitioning of the document collection in order to achieve, on the one hand, compactness of clusters, on the other hand, distinctness of clusters. We suggest a new approach to conceptual indexing of documents by transformation of a set of key terms to a weighted set of concepts for a certain hierarchical knowledge model of the application domain. The semantic matrix of document relationships with taxonomy concepts obtained as a result of the approach can be used as a data matrix for solving the clustering problem. For this purpose, we propose an original approach that uses the formalization of an NP-hard mixed programming problem, decomposition, and step-by-step solution that reduces its complexity. Results. As an example of applying the proposed approach, we consider the problem of clustering of 120 documents using 20 features that are terminal concepts of taxonomy. The result of direct clustering across 10 clusters did not guarantee closeness to the optimum, which caused the need for decomposition. The decomposition was carried out according to a two-stage hierarchical scheme: the first stage is the allocation of 2 clusters, the second stage is a sequential 5-clustering of each of the subsets formed at the first stage. The final result of clustering with two-step decomposition was about 20 percent better than the original one. Findings. The results of calculations confirmed the prospects of an optimization approach for clustering documents using a semantic matrix of relationships and revealed computational problems. In particular, the necessity of developing special computing tools and improving the formal statements themselves for reducing the overall complexity of calculations was revealed
[full text] [link to elibrary.ru]
Keywords: document clustering, conceptual indexing, taxonomy, ontology, mixed integer programming, NP-hard problem
doi: 10.25743/ICT.2020.25.3.011
Author(s): Avdeenko Tatiana Vladimirovna Dr. , Professor Position: Professor Office: Novosibirsk State Technical University Address: 630087, Russia, Novosibirsk, Karl Marx Ave. 20
E-mail: avdeenko@corp.nstu.ru Mezentsev Yuriy Anatolyevich Dr. , Associate Professor Position: Professor Office: Novosibirsk State Technical University Address: 630087, Russia, Novosibirsk, Karl Marx Ave. 20
E-mail: mezencev@corp.nstu.ru
References:
1. Bolelli L., Ertekin S., Lee G.C. Clustering scientific literature using sparse citation graph analysis.Proc. 10th European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD 2006). Germany; 2006: 3041.
2. Sedding J., Kazakov D. WordNet-based text document clustering. Proc. COLLING-2004 3rd Workshop on Robust Methods in Analysis of Natural Language Data. 2004: 104113.
3. Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys. 2002; (34):147.
4. Boubekeur F., Azzoug W. Concept-based indexing in text information retrieval. Intern. J. of ComputerScience and Information Technology (IJCSIT). 2013; 5(1):119136.
5. Scott S., Stan M. Text classification using wordnet hypernyms. WordNet@ACL/COLING, 1998: 4551.
6. Rajman M., Andrews P., Almenta M.D., Seydoux F. Conceptual document indexing using a largescale semantic dictionary providing a concept hierarchy. Proc. Applied Stochastic Models and Data Analysis (ASMDA 2005), France. 2005: 8805.
7. Lukashevich N.V. Modeli i metody avtomaticheskoy obrabotki nestrukturirovannoy informatsii na osnove bazy znaniy ontologicheskogo tipa [Models and methods for automatic processing of unstructured information based on an ontological knowledge base]. Moscow: VINITI RAN; 2014: 33. (In Russ.) 8. Barresi S., Nefti-Meziani S., Rezgui Y. A concept based indexing approach for document clustering. 2008 IEEE Intern. Conf. on Semantic Computing, Santa Clara, CA. 2008: 2633. DOI:10.1109/ICSC.2008.75.
9. Kelmanov A.V., Pyatkin A.V. NP-Difficulty of some Euclidean problems of partitioning a finite setof points. J. Computational Mathematics and Mathematical Physics. 2018; 58(5):852856.
10. Kelmanov A.V., Pyatkin A.V. On the complexity of some vector sequence clustering problems. J. ofApplied and Industrial Mathematics. 2013; 7(3):363369.
11. Mezentsev Yu.A., Razumnikova O.M., Tarasova I.V., Trubnikova O.A. On some problems of big dataclustering by minimax and additive criteria, application in medicine and neurophysiology. Information Technologies. 2019; 25(10):602608. (In Russ.)
12. MacQueen J. Some methods for classification and analysis of multivariate observations. Proc. of theFifth Berkeley Symposium on Mathematical Statistics and Probability. 1967; (1):281297.
13. Kohonen T. Self-organizing maps (Third extended edition). New York; 2001: 501.
14. Shen F., Ogurab T., Hasegawa O. An enhanced self-organizing incremental neural network for onlineunsupervised learning. Neural Networks. 2007; 20(8):893903.
15. Pinedo M. Scheduling. Theory, Algorithms, and Systems. Third Edition. New York: Springer; 2008: 672. DOI:10.1007/978-0-387-78935-4.
16. Lenstra J.K., Shmoys D.B., Tardos E. Approximation algorithms for scheduling unrelated parallelmachines. Mathematical Programming. 1987; (46):259271.
17. Mezentsev Yu.A. Binary cut-and-branch method for solving linear programming problems with booleanvariables. Proc. 9th Intern. Conf. on Discrete Optimization and Operations Research and Scientific School. 2016; (1623):7285.
18. Mezentsev Yu. Binary cut-and-branch method for solving mixed integer programming problems. Constructive Nonsmooth Analysis and Related Topics (Dedicated to the Memory of V.F. Demyanov), CNSA, 2017. https://doi.org/10.1109/cnsa.2017.7973989
19. Avdeenko T.V., Makarova E.S. The case-based decision support system in the field of IT-consulting.Journal of Physics: Conference Series. 2017; (803): 012008.
20. Timofeeva A.Y., Avdeenko T.V., Makarova E.S., Murtazina M.S. Combined use of correlation measures for selecting semantically close concepts of the ontology. CEUR Workshop Proceedings. 2018; (2212):349358. Bibliography link: Avdeenko T.V., Mezentsev Y.A. Document clustering based on the semantic matrix of relationships for conceptual indexing // Computational technologies. 2020. V. 25. ¹ 3. P. 99-110
|