Journal "Computational Technologies"

Article information

2026, Volume 31, № 1, p. 92-105

Berikov V.B., Kutnenko O.A.

Weakly supervised multiple instance learning based on informative feature selection and sample filtering

Purpose. The paper addresses the problem of weakly supervised multiple instance learning, where sets of objects referred to as “bags” are analyzed. Each object is represented by a set of observations of certain features. A binary classification case is considered: one class is conventionally labelled as positive, and the other as negative. A bag is labelled as positive if it contains at least one positive object (the specific object is unknown); otherwise, the bag is labelled as negative. The goal is to predict classes for new bags to achieve the best quality metrics.
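The bag labelling rule described above can be illustrated with a minimal sketch. The function name `bag_label` and the toy data are assumptions made for illustration only. In the actual problem the instance labels are hidden and only the bag label is observed.

```python
from typing import Sequence


def bag_label(instance_labels: Sequence[int]) -> int:
    """Return 1 if the bag contains at least one positive instance, else 0."""
    return int(any(label == 1 for label in instance_labels))


# Toy bags with (normally unobserved) instance labels: 1 = positive, 0 = negative.
bags = {
    "bag_a": [0, 0, 1],  # one positive instance makes the whole bag positive
    "bag_b": [0, 0, 0],  # no positive instances, so the bag is negative
}
labels = {name: bag_label(instances) for name, instances in bags.items()}
```

A learner sees only the values in `labels` together with the feature vectors of the instances, which is what makes the supervision weak.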

Methodology. Machine learning methods adapted to the problem are employed: 𝑘-nearest neighbours (𝑘NN), informative feature selection, sample filtering, and an ensemble approach to constructing decision functions. Additionally, the Function of Rival Similarity (FRiS) is used to evaluate the degree of “unusualness” of bags. An experimental study and comparison with existing methods are conducted on the real-world problem of identifying proteins containing structures with a thioredoxin fold.
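The abstract does not state the FRiS formula. The sketch below assumes the form commonly given in the FRiS literature, F = (r_rival - r_own) / (r_rival + r_own), where r_own and r_rival are the distances from an object to its nearest neighbour in its own class and in the rival class. It works on single points with Euclidean distance, which is a simplification and not necessarily how the authors apply FRiS to bags. The names `nearest_distance` and `fris` are hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def nearest_distance(z: Point, points: Sequence[Point]) -> float:
    """Euclidean distance from z to the closest point in `points`."""
    return min(math.dist(z, p) for p in points)


def fris(z: Point, own: Sequence[Point], rival: Sequence[Point]) -> float:
    """Assumed FRiS value in [-1, 1]: similarity of z to `own` in competition with `rival`."""
    r_own = nearest_distance(z, own)
    r_rival = nearest_distance(z, rival)
    total = r_own + r_rival
    if total == 0:
        # z coincides with points of both classes: no preference either way.
        return 0.0
    return (r_rival - r_own) / total


typical = fris((0.0, 0.0), own=[(0.0, 1.0)], rival=[(0.0, 3.0)])
unusual = fris((0.0, 0.0), own=[(0.0, 3.0)], rival=[(0.0, 1.0)])
```

Under this assumed definition, values close to 1 indicate an object typical of its own class, while values close to -1 indicate an object that looks more like the rival class and is therefore a candidate for being "unusual".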

Findings. A method for solving the problem was developed that combines selection of an informative feature space, filtering (self-correction) of the training set, and voting over a set of decision functions. The results obtained on the protein identification problem were compared with those of a number of well-known algorithms using quality metrics.

Originality/value. The developed method enables the selection of the most informative feature sets, which is crucial for improving the quality and interpretability of solutions, as well as self-correction of the training set, which reduces the impact of errors associated with inaccurate labelling, outliers, etc. In numerical experiments with protein structure recognition data, comparison with a number of well-known algorithms confirmed the sufficiently high efficiency of the proposed method according to the balanced accuracy metric.
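For reference, balanced accuracy for binary classification is the mean of the true positive rate and the true negative rate. The sketch below is a plain-Python illustration. The function name `balanced_accuracy` and the toy bag labels are assumptions and do not come from the paper's experiments.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity (TPR) and specificity (TNR) for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    tpr = tp / (tp + fn)  # share of positive bags recognised as positive
    tnr = tn / (tn + fp)  # share of negative bags recognised as negative
    return (tpr + tnr) / 2


# Toy example: 2 positive bags and 4 negative bags.
score = balanced_accuracy([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

Unlike plain accuracy, this metric weights both classes equally, which matters when positive and negative bags are unevenly represented.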


Keywords: weakly supervised learning, multi-instance classification, informative feature, filtering of sample objects

doi: 10.25743/ICT.2026.31.1.008

Author(s):
Berikov Vladimir Borisovich
Dr., Associate Professor
Position: General Scientist
Office: Sobolev Institute of Mathematics, Siberian Branch of the Russian Academy of Sciences
Address: 630090, Russia, Novosibirsk, 4, Acad. Koptyug Avenue
Phone Office: (383) 3297575
E-mail: berikov@math.nsc.ru
SPIN-code: 8108-2591

Kutnenko Olga Andreevna
PhD, Associate Professor
Position: Senior Research Scientist
Office: Sobolev Institute of Mathematics, Siberian Branch of the Russian Academy of Sciences
Address: 630090, Russia, Novosibirsk, 4, Acad. Koptyug Avenue
E-mail: olga@math.nsc.ru
SPIN-code: 7600-1424

References:

1. Zhou Z-H. A brief introduction to weakly supervised learning. National Science Review. 2018; 5(1):44–53.

2. Van Engelen J.E., Hoos H.H. A survey on semi-supervised learning. Machine Learning. 2020; 109(2):373–440.

3. Cohn D.A., Ghahramani Z., Jordan M.I. Active learning with statistical models. Journal of Artificial Intelligence Research. 1996; (4):129–145.

4. Foulds J., Frank E. A review of multi-instance learning assumptions. The Knowledge Engineering Review. 2010; 25(1):1–25.

5. Dietterich T.G., Lathrop R.H., Lozano-Pérez T. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence. 1997; 89(1–2):31–71.

6. Abusev R.A. On group choice procedures for problems of classification and reliability in the case of lognormal variance. Journal of Mathematical Sciences. 2013; 189(6):911–918. DOI:10.1007/s10958-013-1231-y.

7. Fatima S., Ali S., Kim H.-C. A comprehensive review on multiple instance learning. Electronics. 2023; 12(20):4323. DOI:10.3390/electronics12204323.

8. Andrews S., Tsochantaridis I., Hofmann T. Support vector machines for multiple-instance learning. Advances in Neural Information Processing Systems. 2002; (15):561–568.

9. Jia Y., Zhang C. Instance-level semi supervised multiple instance learning. Proceedings of the 23rd National Conference on Artificial Intelligence. 2008; (2):640–645.

10. Dollár P., Babenko B., Perona P., Belongie S., Tu Z. Multiple component learning for object detection. European Conference on Computer Vision. 2008: 211–224. Available at: https://link.springer.com/chapter/10.1007/978-3-540-88688-4_16 (accessed on March, 2024).

11. Wang X., Yan Y., Tang P., Bai X., Liu W. Revisiting multiple instance neural networks. Pattern Recognition. 2018; (74):15–24. Available at: https://arxiv.org/pdf/1610.02501.

12. Carbonneau M.A., Cheplygina V., Granger E., Gagnon G. Multiple instance learning: a survey of problem characteristics and applications. Pattern Recognition. 2018; (77):329–353.

13. Zhang M.L., Zhou Z.H. Multi-instance clustering with applications to multi-instance prediction. Applied Intelligence. 2009; (31):47–68.

14. Tao Q., Scott S., Vinodchandran N.V., Osugi T.T. SVM-based generalized multiple instance learning via approximate box counting. Proceedings of the 21st International Conference on Machine Learning. 2004: 799–806. DOI:10.1145/1015330.1015405.

15. Tao Q., Scott S., Vinodchandran N.V., Osugi T.T., Mueller B. An extended kernel for generalized multiple-instance learning. Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence. IEEE; 2004: 272–277.

16. Tao Q., Scott S. A faster algorithm for generalized multiple-instance learning. Proceedings of the 17th International Florida Artificial Intelligence Research Society Conference. 2004: 550–555.

17. Zhang Q., Goldman S.A. EM-DD: an improved multiple-instance learning technique. Advances in Neural Information Processing Systems. 2002; (14):1073–1080.

18. Chen Y., Wang J.Z. Image categorization by learning and reasoning with regions. Journal of Machine Learning Research. 2004: 913–939.

19. Maron O., Lozano-Pérez T. A framework for multiple-instance learning. Proceedings of the 10th International Conference on Neural Information Processing Systems. 1997; (10):570–576.

20. Gartner T., Flach P.A., Kowalczyk A., Smola A.J. Multi-instance kernels. International Conference on Machine Learning. 2002; 2(3):7.

21. Zhou Z.H., Zhang M.L. Solving multi-instance problems with classifier ensemble based on constructive clustering. Knowledge and Information Systems. 2007; (11):155–170.

22. Amores J. Multiple instance classification: review, taxonomy and comparative study. Artificial Intelligence. 2013; (201):81–105.

23. He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA; 2016: 770–778.

24. Ilse M., Tomczak J., Welling M. Attention-based deep multiple instance learning. Proceedings of the International Conference on Machine Learning. Stockholm, Sweden; 2018: 2127–2136.

25. Scott S., Zhang J., Brown J. On generalized multiple-instance learning. International Journal of Computational Intelligence and Applications. 2005; 5(01):21–35.

26. Arkad’ev A.G., Braverman E.M. Obuchenie mashiny raspoznavaniyu obrazov [Machine learning for pattern recognition]. Moscow: Nauka; 1964: 112. (In Russ.)

27. Zagoruiko N.G. Prikladnye metody analiza dannykh i znaniy [Applied methods of data and knowledge analysis]. Novosibirsk: Izdatel’stvo Instituta Matematiki SO RAN; 1999: 270. (In Russ.)

28. Li Y., Li T., Liu H. Recent advances in feature selection and its applications. Knowledge and Information Systems. 2017; (53):551–577. DOI:10.1007/s10115-017-1059-8.

29. Zagoruiko N.G., Kutnenko O.A. Recognition methods based on the AdDel algorithm. Sibirskiy Zhurnal Industrial’noy Matematiki. 2004; 7(1(17)):39–47. (In Russ.)

30. Zagoruiko N.G., Borisova I.A., Dyubanov V.V., Kutnenko O.A. Methods of recognition based on the function of rival similarity. Pattern Recognition and Image Analysis. 2008; 18(1):1–6. DOI:10.1134/S105466180801001X.

31. The data set Protein.csv. Available at: https://www.dropbox.com/scl/fo/t31oh0tnoqnch4h9n4w2n/AM6vLJJtIAfxD0WE54xHkd0?dl=0&e=2&preview=Protein.csv&rlkey=lgb2kg8t8eldx7u2564aug3tq (accessed on April, 2024).

32. Wang C., Scott S., Zhang J., Tao Q., Fomenko D., Gladyshev V. A study in modeling low-conservation protein superfamilies. Technical Report TR-UNL-CSE-2004-0003. Department of Computer Science, University of Nebraska. Available at: https://www.researchgate.net/publication/249960205_A_Study_in_Modeling_Low-Conservation_Protein_Superfamilies (accessed on April, 2024).

Bibliography link:
Berikov V.B., Kutnenko O.A. Weakly supervised multiple instance learning based on informative feature selection and sample filtering // Computational technologies. 2026. V. 31. № 1. P. 92-105