REINFORCEMENT LEARNING FOR AUTONOMOUS NAVIGATION OF ROBOTIC PLATFORMS UNDER UNCERTAINTY: DOMAIN RANDOMIZATION AND SIM-TO-REAL TRANSFER
DOI: https://doi.org/10.30857/2786-5371.2025.5.5

Keywords: reinforcement learning, autonomous navigation, Proximal Policy Optimization, Domain Randomization, Sim-to-Real transfer, ESP32, TensorFlow Lite Micro, embedded systems

Abstract
Purpose. To develop and experimentally evaluate the effectiveness of an autonomous navigation system for a four-wheeled mobile robotic platform based on the Proximal Policy Optimization (PPO) algorithm with subsequent deployment of the model on an ESP32 microcontroller and ensuring reliable Sim-to-Real transfer in an uncertain and dynamic environment.
Methodology. The study applied a deep reinforcement learning approach based on the PPO algorithm. A comparative analysis was conducted against TD3, SAC, DDPG, A2C, a Bug Algorithm, and a Random Policy baseline. The evaluation used the following indicators: goal-achievement success rate, collision frequency, trajectory efficiency, and learning stability (average reward and its standard deviation).
To bridge the gap between simulation and reality, the Domain Randomization method was used, varying six physical parameters: friction coefficient, robot mass, gyroscope noise, motor delay, obstacle size, and obstacle count. A neural network with a 64→32 neuron architecture was quantized to INT8 format using TensorFlow Lite Micro and optimized for execution on the ESP32.
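As an illustration only, per-episode Domain Randomization over the six parameter families named above can be sketched as follows. The ranges for sensor noise and motor delay follow the values stated later in this abstract (noise up to 40%, delay 10–30 ms); the remaining ranges and the nominal friction value are hypothetical placeholders, not the study's actual configuration.

```python
import random

def randomize_episode(rng: random.Random) -> dict:
    """Sample one set of physics parameters for a training episode."""
    return {
        "friction": 0.6 * rng.uniform(0.7, 1.3),   # nominal 0.6 (illustrative), varied +/-30%
        "robot_mass_kg": rng.uniform(0.9, 1.1),    # illustrative +/-10% mass variation
        "gyro_noise_std": rng.uniform(0.0, 0.40),  # sensor noise up to 40%
        "motor_delay_ms": rng.uniform(10.0, 30.0), # actuation delay 10-30 ms
        "obstacle_size_m": rng.uniform(0.1, 0.5),  # illustrative obstacle size range
        "num_obstacles": rng.randint(3, 10),       # illustrative obstacle count range
    }

# A fresh sample is drawn at the start of every training episode, so the
# policy never sees the same physics twice and cannot overfit to one world.
params = randomize_episode(random.Random(42))
```

Resampling all six parameters independently at every episode is what forces the learned policy to generalize across the simulation-to-reality gap rather than exploit one fixed simulator configuration.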
Findings. The PPO algorithm demonstrated the highest efficiency among all tested methods: a goal-achievement success rate of 82.0%, an average reward of 847.3, and the lowest variability of results (σ = 12.3).
Statistical analysis confirmed a significant advantage of PPO over the alternative approaches (p < 0.01, Cohen's d > 0.8). After quantization to INT8, the model size decreased to 2.8 KB with an accuracy loss of only 2.3%. The inference time on the ESP32 was 1.2 ms, which ensures real-time operation on a resource-constrained platform.
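The INT8 scheme that makes this compression possible is affine quantization, in which each float is mapped to an 8-bit integer via a scale and zero-point; this is the representation TensorFlow Lite Micro uses for weights and activations. The sketch below shows the arithmetic in pure Python with illustrative values; it is not the study's deployment code.

```python
def quantize_int8(values, lo, hi):
    """Map floats in [lo, hi] to INT8 via an affine scale/zero-point."""
    scale = (hi - lo) / 255.0                 # one INT8 step in float units
    zero_point = round(-128 - lo / scale)     # integer that represents float 0.0
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the INT8 representation."""
    return [(qi - zero_point) * scale for qi in q]

# Weights in [-1, 1] shrink from 4 bytes each to 1 byte each,
# at the cost of a small, bounded rounding error.
q, s, z = quantize_int8([-1.0, 0.0, 0.5, 1.0], -1.0, 1.0)
approx = dequantize(q, s, z)
```

The roughly 4x size reduction (32-bit floats to 8-bit integers) is consistent with the few-kilobyte model reported above, and the bounded rounding error explains why the accuracy loss stays small.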
Originality. A comprehensive approach to autonomous navigation using PPO with integration of Domain Randomization is proposed to improve the quality of Sim-to-Real transfer.
A systematic experimental comparison of modern deep reinforcement learning algorithms on the autonomous navigation problem for a mobile robot is performed.
The PPO model is effectively quantized to the INT8 format with minimal loss of accuracy and successfully deployed on the ESP32 microcontroller.
Practical value. The results obtained demonstrate the possibility of implementing deep reinforcement learning algorithms in real mobile robotic systems with limited computing resources. The developed system can be used in service robotics, small-class autonomous vehicles, monitoring and inspection systems, ensuring high navigation reliability and real-time performance.
In robotics, training effective navigation models under environmental uncertainty remains a critical challenge. The purpose of this study was to analyse the capabilities of the Proximal Policy Optimization (PPO) algorithm for solving autonomous navigation problems with limited and uncertain sensory information. The research methodology was based on a comprehensive analysis of reinforcement learning implementations in simulation environments with subsequent deployment on resource-constrained microcontrollers (ESP32). The main focus was on investigating Domain Randomization techniques, hyperparameter optimization strategies, and Sim-to-Real transfer under variable physical conditions. The research aimed to identify neural network architectures that provide high navigation accuracy with minimal computational resources.

It was established that the proposed approach has significant potential for effectively solving navigation problems under uncertainty. Domain Randomization mechanisms (surface friction variance ±30%, sensor noise up to 40%, motor delay 10-30 ms) significantly increased the generalization ability of models when transferring from simulation to reality. The results expand the theoretical understanding of deep reinforcement learning methods for robotics and outline promising directions for adapting algorithms to the constraints of specific embedded systems.

PPO was shown to have architectural features that allow efficient operation with continuous action spaces: its internal regularization mechanisms (clip range, entropy coefficient, Generalized Advantage Estimation (GAE)) provide resistance to policy collapse and high prediction accuracy. The study demonstrates the potential of reinforcement learning for solving complex navigation problems in robotics, autonomous vehicles, and service robots with limited computational resources.
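The clip-range mechanism credited above with preventing policy collapse is PPO's clipped surrogate objective: the probability ratio between the new and old policy is clipped so that a single update cannot move the policy too far. A minimal per-sample sketch, with illustrative inputs and PPO's standard default clip range of 0.2:

```python
def ppo_clip_loss(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """PPO clipped surrogate loss for one sample (a quantity to be minimized).

    ratio     -- pi_new(a|s) / pi_old(a|s), the policy probability ratio
    advantage -- advantage estimate for the action (e.g. from GAE)
    """
    unclipped = ratio * advantage
    # Clamp the ratio to [1 - clip_range, 1 + clip_range] before weighting.
    clipped_ratio = max(min(ratio, 1.0 + clip_range), 1.0 - clip_range)
    clipped = clipped_ratio * advantage
    # Taking the pessimistic minimum removes the incentive to push the
    # ratio outside the clip interval, which is what stabilizes training.
    return -min(unclipped, clipped)

# With ratio 1.5 and a positive advantage, the gain is capped at 1.2x:
loss = ppo_clip_loss(1.5, 1.0)
```

Because the gradient vanishes once the ratio leaves the clip interval, each update stays close to the previous policy, which is the resistance to policy collapse described above.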
The practical significance of the study was in developing methodological recommendations for selecting and configuring RL algorithms for various types of navigation problems on edge devices.