Article information
2021, Volume 26, No. 2, p. 98-108
Zolotov S.Y., Turchanovskii I.Y.
Application of Apache Big Data technologies for the problems of climate monitoring
The core of the Apache Big Data stack consists of two technologies: Apache Hadoop, which organizes distributed file storage of effectively unlimited capacity, and Apache Spark, which organizes parallel computing on clusters. Together, Apache Spark and Apache Hadoop are well suited to building big data processing systems. The main idea implemented by Spark is to divide data into separate parts (partitions) and to process these parts in the memory of many networked computers. Data is exchanged between computers only when needed, and Spark determines automatically when that exchange takes place. As a test problem, we chose the calculation of monthly, annual, and seasonal trends in the Earth's atmospheric temperature for the period from 1960 to 2010, using the NCEP/NCAR and JRA-55 reanalysis data. Four variants of solving the test problem were implemented during the experiment. The first variant is the simplest implementation, without parallelism. The second variant reads data from the local file system, aggregates it, and calculates trends in parallel. The third variant solves the test problem on a two-node cluster, with the NCEP/NCAR and JRA-55 reanalysis files stored in their original format in the Hadoop storage (HDFS), which combines the disk subsystems of the two computers. The disadvantage of this variant is that every reanalysis file is loaded in full into the random access memory of the worker process. The fourth variant addresses this by pre-converting the original file format into a form that allows selective reading from HDFS based on specified parameters.
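The partition-based trend calculation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the Spark API: it is plain Python that mimics the idea of splitting a series into partitions, computing partial sums on each partition independently, and combining them into a least-squares slope. The function names (`split_into_partitions`, `partial_sums`, `combine`, `linear_trend`) and the synthetic temperature series are assumptions made for illustration only.

```python
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Split a non-empty list of (year, temperature) pairs into chunks."""
    if not records or num_partitions < 1:
        raise ValueError("need at least one record and one partition")
    size = -(-len(records) // num_partitions)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def partial_sums(partition):
    """Per-partition step: accumulate the sums a least-squares fit needs."""
    n = sx = sy = sxx = sxy = 0.0
    for year, temp in partition:
        n += 1
        sx += year
        sy += temp
        sxx += year * year
        sxy += year * temp
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Merge step: partial sums from two partitions simply add up."""
    return tuple(x + y for x, y in zip(a, b))


def linear_trend(records, num_partitions=4):
    """Least-squares slope (temperature units per year) from merged sums."""
    partitions = split_into_partitions(records, num_partitions)
    n, sx, sy, sxx, sxy = reduce(combine, map(partial_sums, partitions))
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)


# Synthetic annual series for 1960-2010 (hypothetical values, not reanalysis data).
series = [(year, 250.0 + 0.02 * (year - 1960)) for year in range(1960, 2011)]
slope = linear_trend(series, num_partitions=4)
```

Because the partial sums are additive, each partition can be processed on a different machine and only five numbers per partition need to be exchanged, which is the property that makes this kind of aggregation a natural fit for a partitioned engine.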
Keywords: climate monitoring, Apache Big Data technology, scaling computation
doi: 10.25743/ICT.2021.26.2.008
Author(s):
Zolotov Sergey Yurievich
PhD, Associate Professor
Position: Associate Professor
Office: Federal Research Center for Information and Computational Technologies
Address: 634055, Russia, Tomsk, Academichesky avenue, 10/3
E-mail: sergey-zo@yandex.ru
SPIN-code: 101915

Turchanovskii Igor Yurievich
PhD
Position: Deputy Director for Science
Office: Federal Research Center for Information and Computational Technologies
Address: 645055, Russia, Tomsk, Academicheskii avenue, 10/4
Phone Office: (3822) 49 17 74
E-mail: tur@hcei.tsc.ru
SPIN-code: 6754-2929

References:
1. Otsenochnyy doklad ob izmeneniyakh klimata i ikh posledstviyakh na territorii Rossiyskoy Federatsii. Tom I. Izmeneniya klimata [Assessment report on climate changes and their consequences on the territory of the Russian Federation. Volume I. Climate changes]. Moscow: Rosgidromet; 2008: 228. (In Russ.)
2. Stocker T.F., Qin D., Plattner G.-K., Tignor M., Allen S.K., Boschung J., Nauels A., Xia Y., Bex V., Midgley P.M. Climate change 2013: The physical science basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge, United Kingdom: Cambridge University Press; 2013: 1535. DOI:10.1017/CBO9781107415324.
3. Apache Hadoop. Available at: https://hadoop.apache.org (accessed 2.12.2020).
4. Apache Spark. Available at: https://spark.apache.org (accessed 2.12.2020).
5. Big data architectures. Available at: https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data (accessed 2.12.2020).
6. Apache Zeppelin. Available at: https://zeppelin.apache.org (accessed 2.12.2020).
7. Kalnay E., Kanamitsu M., Kistler R., Collins W., Deaven D., Gandin L., Iredell M., Saha S., White G., Woollen J., Zhu Y., Chelliah M., Ebisuzaki W., Higgins W., Janowiak J., Mo K.C., Ropelewski C., Wang J., Leetmaa A., Reynolds R., Roy J., Dennis J. The NCEP/NCAR 40-year reanalysis project. Bulletin of the American Meteorological Society. 1996; 77(3):437-471.
8. Kobayashi S., Ota Y., Harada Y., Ebita A., Moriya M., Onoda H., Onogi K., Kamahori H., Kobayashi C., Endo H., Miyaoka K., Takahashi K. The JRA-55 reanalysis: General specifications and basic characteristics. Journal of the Meteorological Society of Japan. 2015; 93(1):5-48.
9. Yurchenko A.V. On the concept of information-analytical system for supporting data intensive science. Computational Technologies. 2017; 22(4):105-120. (In Russ.)
10. Boychenko I.V., Turchanovskiy I.Yu. Designing of the data service in information systems for science research based on Big Data paradigm. Vestnik NSU. Series: Information Technologies. 2015; 13(2):22-27. (In Russ.)
11. Boichenko I.V., Kataev M.Yu., Marichev V.N. Information system to analyze lidar ozonesonde data. Meteorologiya i Gidrologiya. 2001; (12):96-105. (In Russ.)
12. Kataev M.Yu., Boichenko I.V., Maksyutov S. Software 7S for simulating of the solar reflected radiance transfer in atmosphere. Proceedings Twelfth Joint International Symposium on Atmospheric and Ocean Optics/Atmospheric Physics. 2006; (6160):616007. DOI:10.1117/12.675203. (In Russ.)
13. Boychenko I.V., Kataev M.Yu. Program system of modeling of the satellite monitoring an atmosphere and earth surface. Proceedings of TUSUR University. Part 1. 2009; 1(19):88-95. (In Russ.)
14. Boychenko I.V., Marichev V.N., Turchanovskiy I.Yu. Service-oriented approach for designing the software systems for the atmosphere lidar sounding data processing and analysis. Vestnik NSU. Series: Information Technologies. 2014; 12(4):5-12. (In Russ.)
15. The HDF5 library & file format. Available at: https://www.hdfgroup.org/solutions/hdf5 (accessed 2.12.2020).
16. A guide to the code form FM 92-IX Ext. GRIB. Available at: https://www.wmo.int/pages/prog/www/WDM/Guides/Guide-binary-2.html (accessed 2.12.2020).
17. Palamuttam R., Mogrovejo R.M., Mattmann C., Wilson B., Whitehall K., Verma R., McGibbney L., Ramirez P. SciSpark: Applying in-memory distributed computing to weather event detection and tracking. IEEE International Conference on Big Data. Santa Clara, CA, USA: IEEE; 2015: 2020-2026.
18. Wilson B., Palamuttam R., Whitehall K., Mattmann C., Goodman A., Boustani M., Shah S., Zimdars P., Ramirez P. SciSpark: Highly interactive in-memory science data analytics. IEEE International Conference on Big Data. Washington DC, USA: IEEE; 2016: 2964-2973.
Bibliography link: Zolotov S.Y., Turchanovskii I.Y. Application of Apache Big Data technologies for the problems of climate monitoring // Computational technologies. 2021. V. 26. No. 2. P. 98-108