<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root>
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" article-type="research-article" dtd-version="1.2" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">News of the Kabardino-Balkarian Scientific Center of the Russian Academy of Sciences</journal-id><journal-title-group><journal-title xml:lang="en">News of the Kabardino-Balkarian Scientific Center of the Russian Academy of Sciences</journal-title><trans-title-group xml:lang="ru"><trans-title>Известия Кабардино-Балкарского научного центра РАН</trans-title></trans-title-group></journal-title-group><issn publication-format="print">1991-6639</issn><issn publication-format="electronic">2949-1940</issn></journal-meta><article-meta><article-id pub-id-type="publisher-id">294391</article-id><article-id pub-id-type="doi">10.35330/1991-6639-2025-27-2-86-102</article-id><article-id pub-id-type="edn">NZSEKR</article-id><article-categories><subj-group subj-group-type="toc-heading" xml:lang="ru"><subject>Информатика и информационные процессы</subject></subj-group><subj-group subj-group-type="toc-heading" xml:lang="en"><subject>Informatics and information processes</subject></subj-group><subj-group subj-group-type="article-type"><subject>Research Article</subject></subj-group></article-categories><title-group><article-title xml:lang="en">On the application of reinforcement learning in the task of choosing the optimal trajectory</article-title><trans-title-group xml:lang="ru"><trans-title>О применении обучения с подкреплением в задаче выбора оптимальной траектории движения</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author"><contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-1739-9831</contrib-id><contrib-id contrib-id-type="spin">4576-9642</contrib-id><name-alternatives><name 
xml:lang="en"><surname>Gorodnichev</surname><given-names>Mikhail G.</given-names></name><name xml:lang="ru"><surname>Городничев</surname><given-names>Михаил Геннадьевич</given-names></name></name-alternatives><address><country country="RU">Russian Federation</country></address><bio xml:lang="en"><p>Candidate of Engineering Sciences, Associate Professor, Dean of the Faculty of Information Technology</p></bio><bio xml:lang="ru"><p>канд. техн. наук, доцент, декан факультета «Информационные технологии»</p></bio><email>m.g.gorodnichev@mtuci.ru</email><xref ref-type="aff" rid="aff1"/></contrib></contrib-group><aff-alternatives id="aff1"><aff><institution xml:lang="en">Moscow Technical University of Communications and Informatics, 8A Aviamotornaya st</institution></aff><aff><institution xml:lang="ru">Московский технический университет связи и информатики</institution></aff></aff-alternatives><content-language>ru</content-language><pub-date date-type="pub" iso-8601-date="2025-06-11" publication-format="electronic"><day>11</day><month>06</month><year>2025</year></pub-date><pub-date date-type="collection"><year>2025</year></pub-date><volume>27</volume><issue>2</issue><issue-title xml:lang="en"/><issue-title xml:lang="ru"/><fpage>86</fpage><lpage>102</lpage><history><date date-type="received" iso-8601-date="2025-05-30"><day>30</day><month>05</month><year>2025</year></date><date date-type="accepted" iso-8601-date="2025-05-30"><day>30</day><month>05</month><year>2025</year></date></history><permissions><copyright-statement xml:lang="en">Copyright © 2025, Gorodnichev M.G.</copyright-statement><copyright-statement xml:lang="ru">Copyright © 2025, Городничев М.Г.</copyright-statement><copyright-year>2025</copyright-year><copyright-holder xml:lang="en">Gorodnichev M.G.</copyright-holder><copyright-holder xml:lang="ru">Городничев М.Г.</copyright-holder><ali:free_to_read xmlns:ali="http://www.niso.org/schemas/ali/1.0/"/><license><ali:license_ref 
xmlns:ali="http://www.niso.org/schemas/ali/1.0/">https://creativecommons.org/licenses/by/4.0</ali:license_ref></license></permissions><self-uri xlink:href="https://journals.rcsi.science/1991-6639/article/view/294391">https://journals.rcsi.science/1991-6639/article/view/294391</self-uri><abstract xml:lang="en"><p>This paper reviews state-of-the-art reinforcement learning methods, focusing on their application in dynamic and complex environments. It begins by analysing the main approaches to reinforcement learning: dynamic programming, Monte Carlo methods, temporal-difference methods, and policy gradients. Special attention is given to Generative Adversarial Imitation Learning (GAIL) and its effect on the optimisation of agent strategies. Model-free learning is examined, and criteria are identified for selecting agents capable of operating in continuous action and state spaces. The experimental part analyses the training of agents equipped with different types of sensors, including visual sensors, and demonstrates their ability to adapt to the environment despite resolution constraints. Results are compared in terms of cumulative reward and episode length, revealing improved agent performance in the later stages of training. The study confirms that imitation learning significantly improves agent performance by reducing time costs and improving decision-making strategies. The work opens prospects for further study of mechanisms for improving sensor resolution and fine-tuning hyperparameters.</p></abstract><trans-abstract xml:lang="ru"><p>В данной статье рассматриваются современные методы обучения с подкреплением, с акцентом на их применение в динамичных и сложных средах. Исследование начинается с анализа основных подходов к обучению с подкреплением, таких как динамическое программирование, методы Монте-Карло, методы временной разницы и градиенты политики. 
Особое внимание уделяется методологии Generative Adversarial Imitation Learning (GAIL) и ее влиянию на оптимизацию стратегий агентов. Приведено исследование безмодельного обучения и выделены критерии выбора агентов, способных работать в непрерывных пространствах действий и состояний. Экспериментальная часть посвящена анализу обучения агентов с использованием различных типов сенсоров, включая визуальные, и демонстрирует их способность адаптироваться к условиям среды, несмотря на ограничения разрешения. Представлено сравнение результатов на основе кумулятивной награды и длины эпизода, выявляющее улучшение производительности агентов на поздних этапах обучения. Исследование подтверждает, что использование имитационного обучения значительно повышает эффективность агента, сокращая временные затраты и улучшая стратегии принятия решений. Настоящая работа открывает перспективы для дальнейшего изучения механизмов улучшения разрешающей способности сенсоров и тонкой настройки гиперпараметров.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>обучение с подкреплением</kwd><kwd>интеллектуальные агенты</kwd><kwd>оптимальная траектория</kwd><kwd>высокоавтоматизированные транспортные средства</kwd><kwd>обучение на основе политик</kwd><kwd>архитектуры актер-критик</kwd><kwd>имитационное обучение</kwd><kwd>сенсоры</kwd><kwd>непрерывные состояния</kwd><kwd>дискретные состояния</kwd><kwd>PPO</kwd><kwd>SAC</kwd></kwd-group><kwd-group xml:lang="en"><kwd>reinforcement learning</kwd><kwd>intelligent agents</kwd><kwd>optimal trajectory</kwd><kwd>highly automated vehicles</kwd><kwd>policy-based learning</kwd><kwd>actor-critic architectures</kwd><kwd>imitation learning</kwd><kwd>sensors</kwd><kwd>continuous states</kwd><kwd>discrete states</kwd><kwd>PPO</kwd><kwd>SAC</kwd></kwd-group><funding-group/></article-meta></front><body></body><back><ref-list><ref id="B1"><label>1.</label><mixed-citation>Zhang S., Xia Q., Chen M., Cheng S. 
Multi-Objective Optimal Trajectory Planning for Robotic Arms Using Deep Reinforcement Learning. Sensors. 2023. Vol. 23. P. 5974. DOI: 10.3390/s23135974</mixed-citation></ref><ref id="B2"><label>2.</label><mixed-citation>Tamizi M.G., Yaghoubi M., Najjaran H. A review of recent trend in motion planning of industrial robots. International Journal of Intelligent Robotics and Applications. 2023. Vol. 7. Pp. 253–274. DOI: 10.1007/s41315-023-00274-2</mixed-citation></ref><ref id="B3"><label>3.</label><mixed-citation>Kollar T., Roy N. Trajectory Optimization using Reinforcement Learning for Map Exploration. International Journal of Robotics Research. 2008. Vol. 27. No. 2. Pp. 175–196. DOI: 10.1177/0278364907087426</mixed-citation></ref><ref id="B4"><label>4.</label><mixed-citation>Acar E.U., Choset H., Zhang Y., Schervish M. Path planning for robotic demining: robust sensor-based coverage of unstructured environments and probabilistic methods. International Journal of Robotics Research. 2003. Vol. 22. No. 7–8. Pp. 441–466.</mixed-citation></ref><ref id="B5"><label>5.</label><mixed-citation>Cohn D.A., Ghahramani Z., Jordan M.I. Active learning with statistical models. Journal of Artificial Intelligence Research. 1996. No. 4. Pp. 705–712.</mixed-citation></ref><ref id="B6"><label>6.</label><mixed-citation>Axhausen K. et al. Introducing MATSim. In: Horni A. et al. (eds.). Multi-Agent Transport Simulation MATSim. London: Ubiquity Press. 2016. Pp. 3–8. DOI: 10.5334/baw.1</mixed-citation></ref><ref id="B7"><label>7.</label><mixed-citation>Wu G., Zhang D., Miao Z., Bao W., Cao J. How to Design Reinforcement Learning Methods for the Edge: An Integrated Approach toward Intelligent Decision Making. Electronics. 2024. Vol. 13. P. 1281. DOI: 10.3390/electronics13071281</mixed-citation></ref><ref id="B8"><label>8.</label><mixed-citation>Zhou T., Lin M. Deadline-aware deep-recurrent-q-network governor for smart energy saving. IEEE Transactions on Network Science and Engineering. 2021. 
Vol. 9. Pp. 3886–3895. DOI: 10.1109/TNSE.2021.3123280</mixed-citation></ref><ref id="B9"><label>9.</label><mixed-citation>Yang Y., Wang J. An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv:2011.00583. 2020. DOI: 10.48550/arXiv.2011.00583</mixed-citation></ref><ref id="B10"><label>10.</label><mixed-citation>Mazyavkina N., Sviridov S., Ivanov S., Burnaev E. Reinforcement learning for combinatorial optimization: A survey. Computers &amp; Operations Research. 2021. Vol. 134. P. 105400. DOI: 10.1016/j.cor.2021.105400</mixed-citation></ref><ref id="B11"><label>11.</label><mixed-citation>Zhang J., Zhang Z., Han S., Lü S. Proximal policy optimization via enhanced exploration efficiency. Information Sciences. 2022. Vol. 609. Pp. 750–765. ISSN 0020-0255. DOI: 10.1016/j.ins.2022.07.111</mixed-citation></ref><ref id="B12"><label>12.</label><mixed-citation>Hessel M., Modayil J., van Hasselt H., Schaul T. et al. Rainbow: Combining improvements in deep reinforcement learning. In: AAAI Conference on Artificial Intelligence. 2018. Pp. 3215–3222. DOI: 10.1609/aaai.v32i1.11796</mixed-citation></ref><ref id="B13"><label>13.</label><mixed-citation>Haarnoja T., Zhou A., Abbeel P., Levine S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International Conference on Machine Learning. 2018. Pp. 1856–1865. DOI: 10.48550/arXiv.1801.01290</mixed-citation></ref><ref id="B14"><label>14.</label><mixed-citation>Lillicrap T.P., Hunt J.J., Pritzel A. et al. Continuous control with deep reinforcement learning. arXiv:1509.02971v1. 2015.</mixed-citation></ref><ref id="B15"><label>15.</label><mixed-citation>Chen Y., Lam C.T., Pau G., Ke W. From Virtual to Reality: A Deep Reinforcement Learning Solution to Implement Autonomous Driving with 3D-LiDAR. Applied Sciences. 2025. Vol. 15. No. 3. P. 1423. 
DOI: 10.3390/app15031423</mixed-citation></ref><ref id="B16"><label>16.</label><mixed-citation>Zuo G., Chen K., Lu J., Huang X. Deterministic generative adversarial imitation learning. Neurocomputing. 2020. Vol. 388. Pp. 60–69. ISSN 0925-2312. DOI: 10.1016/j.neucom.2020.01.016</mixed-citation></ref><ref id="B17"><label>17.</label><mixed-citation>Sawada R. Automatic Collision Avoidance Using Deep Reinforcement Learning with Grid Sensor. In: Sato H., Iwanaga S., Ishii A. (eds.). Proceedings of the 23rd Asia Pacific Symposium on Intelligent and Evolutionary Systems. IES 2019. Proceedings in Adaptation, Learning and Optimization. Springer, Cham. 2020. Vol. 12. Pp. 17–32. DOI: 10.1007/978-3-030-37442-6_3</mixed-citation></ref><ref id="B18"><label>18.</label><mixed-citation>Hachaj T., Piekarczyk M. On Explainability of Reinforcement Learning-Based Machine Learning Agents Trained with Proximal Policy Optimization That Utilizes Visual Sensor Data. Applied Sciences. 2025. Vol. 15. No. 2. P. 538. DOI: 10.3390/app15020538</mixed-citation></ref></ref-list></back></article>
