28.2.3 En - News of the Kabardino-Balkarian Scientific Center of RAS

Architecture of a distributed storage and big data processing system based on Apache Ozone and Argo Workflows

K.A. Polyantseva, A.V. Komlev, M.G. Gorodnichev

Abstract. The article presents the architecture of a distributed big data storage and processing system that integrates the Apache Ozone object store with the Argo Workflows system for orchestrating computing processes.
Aim. To develop and study the architecture of a distributed big data storage and processing system that integrates Apache Ozone and Argo Workflows and implements the principle of separating storage from computing, and to evaluate the effectiveness of the proposed solution against the traditional Apache Hadoop architecture.
Methods. The study uses system analysis of big data architectures, comparative experimental testing of distributed information storage and processing systems, and mathematical modeling to formalize resource scaling, computing time, and data storage efficiency. The experimental evaluation is carried out on Apache Ozone and Apache Hadoop clusters, with Apache Spark performing the computational tasks.
Results. A distributed system architecture has been developed that scales the storage and computing subsystems independently by combining Apache Ozone object storage with Argo Workflows orchestration of computing processes in a Kubernetes container environment. A method for integrating the components without an intermediate S3 gateway is proposed, which reduces interaction overhead. Experiments showed that the proposed solution performs comparably to a Hadoop cluster for data reading, writing, and processing, while offering more flexible scaling and more efficient use of disk space when erasure coding is used.
Conclusions. The results of the study confirm that an architecture based on Apache Ozone and Argo Workflows is a promising alternative to traditional big data platforms. Separating storage from computing increases infrastructure flexibility, optimizes resource usage, and lowers data storage costs while maintaining a comparable level of performance. The proposed approach can be applied to building corporate analytical platforms, big data processing systems, and machine learning infrastructures.
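The disk space advantage of erasure coding mentioned in the abstract can be illustrated with a short calculation. The sketch below is not taken from the article: the abstract does not state which coding scheme or replication factor was used, so the Reed-Solomon (6, 3) layout, the replication factor of 3, and the 100 TB dataset size are all assumptions chosen only for illustration.

```python
from fractions import Fraction


def replication_overhead(replicas: int) -> Fraction:
    """Raw bytes stored per byte of user data under n-way replication."""
    return Fraction(replicas)


def erasure_coding_overhead(data_blocks: int, parity_blocks: int) -> Fraction:
    """Raw bytes stored per byte of user data for k data + m parity blocks."""
    return Fraction(data_blocks + parity_blocks, data_blocks)


def raw_capacity_tb(user_data_tb: float, overhead: Fraction) -> float:
    """Raw disk capacity (TB) needed to hold the given amount of user data."""
    return user_data_tb * float(overhead)


if __name__ == "__main__":
    user_data_tb = 100.0  # hypothetical dataset size, not from the article

    # Assumed schemes: 3-way replication vs. Reed-Solomon with 6 data + 3 parity.
    rep = replication_overhead(3)
    ec = erasure_coding_overhead(6, 3)

    print(f"3x replication: {raw_capacity_tb(user_data_tb, rep):.1f} TB raw")
    print(f"RS(6,3) coding: {raw_capacity_tb(user_data_tb, ec):.1f} TB raw")
```

Under these assumptions, erasure coding needs half the raw capacity of triple replication for the same user data, at the cost of extra computation when encoding and when reconstructing lost blocks.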

Keywords: distributed storage systems, big data, Apache Ozone, Argo Workflows, Kubernetes, Apache Spark, object storage, separation of storage and computing, scalability, data processing, container computing, fault tolerance

For citation. Polyantseva K.A., Komlev A.V., Gorodnichev M.G. Architecture of a distributed storage and big data processing system based on Apache Ozone and Argo Workflows. News of the Kabardino-Balkarian Scientific Center of RAS. 2026. Vol. 28. No. 2. Pp. 34–50. DOI: 10.35330/1991-6639-2026-28-2-34-50

Content is available under a Creative Commons Attribution 4.0 License

Information about the authors

Ksenia A. Polyantseva, Candidate of Technical Sciences, Associate Professor of the Department of Data Mining, Moscow Technical University of Communications and Informatics;
8A, Aviamotornaya street, Moscow, 111024, Russia;
k.a.poliantseva@mtuci.ru, ORCID: https://orcid.org/0000-0002-7102-4208, SPIN-code: 8112-8560
Artem V. Komlev, Student, Moscow Technical University of Communications and Informatics;
8A, Aviamotornaya street, Moscow, 111024, Russia;
komlev1257@gmail.com
Mikhail G. Gorodnichev, Candidate of Technical Sciences, Associate Professor, Dean of the Faculty of Information Technology, Moscow Technical University of Communications and Informatics;
8A, Aviamotornaya street, Moscow, 111024, Russia;
m.g.gorodnichev@mtuci.ru, ORCID: https://orcid.org/0000-0003-1739-9831, SPIN-code: 4576-9642

Funding

The study was performed without external funding.
