News of the Kabardino-Balkarian Scientific Center of the Russian Academy of Sciences

Известия Кабардино-Балкарского научного центра РАН

1991-6639, 2949-1940

294391

10.35330/1991-6639-2025-27-2-86-102

NZSEKR

Informatics and information processes

Research Article

On the application of reinforcement learning in the task of choosing the optimal trajectory

https://orcid.org/0000-0003-1739-9831

4576-9642

Gorodnichev

Mikhail G.

Городничев

Михаил Геннадьевич

Russian Federation

Candidate of Engineering Sciences, Associate Professor, Dean of the Faculty of Information Technology

m.g.gorodnichev@mtuci.ru

Moscow Technical University of Communications and Informatics, 8A Aviamotornaya st.

11.06.2025

2025

Vol. 27, No. 2

Pp. 86–102

30.05.2025, 30.05.2025

2025

Gorodnichev M.G.

Городничев М.Г.

https://creativecommons.org/licenses/by/4.0

https://journals.rcsi.science/1991-6639/article/view/294391

This paper reviews state-of-the-art reinforcement learning methods, with a focus on their application in dynamic and complex environments. It begins by analysing the main approaches to reinforcement learning: dynamic programming, Monte Carlo methods, temporal-difference methods and policy gradients. Special attention is given to Generative Adversarial Imitation Learning (GAIL) and its effect on the optimisation of agents' strategies. Model-free learning is examined, and criteria are identified for selecting agents capable of operating in continuous action and state spaces. The experimental part analyses the training of agents equipped with different types of sensors, including visual sensors, and demonstrates their ability to adapt to the environment despite resolution constraints. Results are compared in terms of cumulative reward and episode length, revealing improved agent performance in the later stages of training. The study confirms that imitation learning significantly improves agent performance by reducing time costs and improving decision-making strategies. The work opens prospects for further study of mechanisms for improving sensor resolution and for fine-tuning hyperparameters.

reinforcement learning, intelligent agents, optimal trajectory, highly automated vehicles, policy-based learning, actor-critic architectures, imitation learning, sensors, continuous states, discrete states, PPO, SAC
