Deep Reinforcement Learning for Robot Navigation: Concepts, Current Trends, Challenges, and Future Directions

Written By

Nohaidda Sariff, Yahya Muhammad Adam, Intan Izafina Idrus, Zool Hilmi Ismail, Puteri Nor Aznie Fahsyar, Swee King Phang, Kok Seng Eu, Md Hasan Molla and Denesh Sooriamoorthy

Submitted: 08 August 2025 Reviewed: 19 January 2026 Published: 09 April 2026

DOI: 10.5772/intechopen.1014666

From the Edited Volume

Multi-Agent Systems - From Basic Concepts to Cutting-Edge Technologies [Working Title]

Edited by Nohaidda Sariff, Zool Hilmi Ismail, Puteri Nor Aznie Fahsyar and Denesh Sooriamoorthy

Abstract

Deep reinforcement learning (DRL) has emerged as a prominent framework in the field of autonomous robot navigation, enabling agents to acquire complex decision-making capabilities and learn optimal policies through continuous interaction with their environment. This chapter provides a comprehensive review of DRL in recent robot navigation research within real-time dynamic environments, addressing a gap left by the limited number of existing reviews in this area. It begins with fundamental concepts, highlights current trends, discusses key challenges, and concludes with insights into future research directions. Current studies emphasize a shift from static to dynamic environments, improvements in sample efficiency, integration with visual perception, multi-agent systems, multi-objective navigation, and bridging the gap between simulation and real-world applications. These trends underscore the importance of enhancing robot adaptability, learning efficiency, robustness, and scalability, enabling robots to reach their targets while avoiding obstacles effectively. Significant challenges remain, including handling continuous action spaces, designing effective reward functions to balance exploration and exploitation, and addressing learning issues in both dynamic and real-world settings. These challenges are examined in detail within this review. Furthermore, the chapter explores future research directions, such as addressing dynamic and actively changing obstacle configurations, integrating DRL with other artificial intelligence techniques, improving learning efficiency across varying scales, and developing strategies for cooperative multi-agent systems. Throughout this review, key limitations and research gaps are identified, with the aim of advancing toward more autonomous, reliable, and scalable DRL-based navigation systems capable of operating effectively and efficiently in real-time environments.

Keywords

  • deep reinforcement learning
  • robot navigation
  • path planning
  • obstacle avoidance
  • mobile robot

1. Introduction

Deep reinforcement learning (DRL) is a subset of machine learning. Machine learning focuses on developing algorithms that learn from data to make predictions, make decisions, and detect patterns without explicit programming. Beyond supervised and unsupervised learning, reinforcement learning involves an agent learning to act by interacting with an environment in order to maximize rewards [1]. Deep learning, on the other hand, focuses on representing policies, value functions, or environment models using deep neural networks. In other words, DRL is a revolutionary area of artificial intelligence that combines reinforcement learning with deep neural networks. The primary goal of DRL is to maximize cumulative rewards in complex environments, enabling agents to perform advanced decision-making and acquire sophisticated strategies.

The main components of deep reinforcement learning (DRL) that ensure successful operation include the agent, environment, state s_t, action a_t, reward r_t, and policy. The agent interacts with the environment, while the environment provides the context and feedback to the agent. The state represents the current situation or condition of the environment, based on which the agent selects actions and makes decisions. An action is a signal sent to the actuators, determined by the policy, and the reward serves as feedback indicating the quality of the agent’s performance. The policy is the strategy, often represented by a neural network, that maps states to actions, as illustrated in Figure 1. Through this structure, DRL can handle high-dimensional, continuous, dynamic, and complex inputs by leveraging sophisticated policies and value functions. This is in contrast to traditional reinforcement learning, which performs well mainly in small, discrete action spaces. Moreover, DRL has the capability to learn directly from raw sensor data and derive optimal policies without requiring an accurate model of the environment.

Figure 1.

General framework of deep reinforcement learning [2]. *Reprinted from Ref. [2], MDPI, CC BY.
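To make these components concrete, the following minimal sketch shows a policy network that maps states to actions, as in Figure 1. It is illustrative only: the dimensions (a 24-value range scan as the state, a two-dimensional velocity command as the action) and the PyTorch implementation are assumptions introduced here, not taken from any cited study.

```python
import torch
import torch.nn as nn

# Minimal sketch of the policy component in Figure 1: a neural network
# that maps a state s_t to an action a_t. All dimensions are illustrative.
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim=24, action_dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # bounded action outputs
        )

    def forward(self, state):
        return self.net(state)  # a_t = pi(s_t)

policy = PolicyNetwork()
state = torch.randn(1, 24)   # stand-in for one sensor observation s_t
action = policy(state)       # action a_t in [-1, 1]^2, e.g., wheel velocities
```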

In the context of robotics, deep reinforcement learning (DRL) enables robots to acquire complex behaviors through interaction with the environment by combining the decision-making capabilities of reinforcement learning with the perceptual strengths of deep learning [3, 4]. Applications of DRL in robotics include mobile robot navigation such as path planning [3] and obstacle avoidance [5, 6] as well as robotic manipulation tasks, including arm control [7], assembly and stacking, quadruped walking [7] and jumping [8], and humanoid balancing [9]. DRL has also been applied to multi-robot coordination for swarm behaviors and cooperative tasks [6], as well as human-robot interaction to develop adaptive and safe behaviors around humans [10]. The contribution of DRL to the advancement of robot navigation research is well documented. For example, Zhang et al. [3] reported an increasing trend in research on path planning using DRL, as opposed to conventional algorithms, based on their review of studies published between 1990 and 2024.

Popular DRL algorithms used for robotic path planning applications include Deep Q-Network (DQN) [11], Deep Deterministic Policy Gradient (DDPG) [12], Proximal Policy Optimization (PPO) [13], and Soft Actor-Critic (SAC) [14]. Two main factors to consider when selecting an algorithm are the type of action space, whether discrete or continuous, and the policy type, whether on-policy or off-policy. Other considerations include exploration strategy, sample efficiency, training stability and simplicity, as well as computational complexity. Generally, DQN is most suitable for small, discrete action spaces, DDPG is preferable when deterministic policies are required in continuous action spaces, PPO is valued for its stability and simplicity in both discrete and continuous spaces, and SAC is ideal for high-performance exploration in continuous action spaces with strong sample efficiency.
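This selection guidance can be condensed, purely as an illustration, into a small helper; the function name, arguments, and rules below are simplifications introduced here, and real choices also weigh exploration needs, compute budget, and implementation maturity.

```python
# Illustrative decision helper condensing the selection heuristics above.
def suggest_algorithm(action_space, prefer_off_policy=True,
                      need_deterministic=False):
    if action_space == "discrete":
        return "DQN" if prefer_off_policy else "PPO"
    if action_space == "continuous":
        if need_deterministic:
            return "DDPG"        # deterministic continuous control
        return "SAC" if prefer_off_policy else "PPO"
    raise ValueError("action_space must be 'discrete' or 'continuous'")

print(suggest_algorithm("continuous"))                          # SAC
print(suggest_algorithm("discrete", prefer_off_policy=False))   # PPO
```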

Figure 2 illustrates how DRL operates in a robotic navigation scenario. The agent, or robot, observes states by interacting with the environment, selects actions, and receives rewards. At each time step, the robot perceives the current state, processes it through a policy network to select an action, and obtains a scalar reward from the environment based on the outcome. This learning process continues until a termination condition is met, such as exceeding time limits, encountering an obstacle, or reaching the goal. To evaluate training duration and policy effectiveness across different navigation scenarios, it is essential to relate the step-by-step actions and feedback to the overall performance metrics measured at the end of a complete run or trial.

Figure 2.

Framework of deep reinforcement learning in robotics research [4]. *Reprinted from Ref. [4], MDPI, CC BY.
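The perceive-act-reward loop of Figure 2 can be sketched as follows. Here `env` is assumed to be a hypothetical Gym-style navigation environment exposing `reset()` and `step()`, and the dictionary returned at the end relates the per-step feedback to the end-of-run metrics discussed above.

```python
# Sketch of one navigation episode as in Figure 2. `env` and `policy`
# are hypothetical placeholders, not a specific library API.
def run_episode(env, policy, max_steps=500):
    state = env.reset()
    episode_return, steps = 0.0, 0
    while steps < max_steps:                        # time-limit termination
        action = policy.act(state)                  # policy selects a_t from s_t
        state, reward, done, info = env.step(action)
        episode_return += reward                    # accumulate scalar rewards r_t
        steps += 1
        if done:                                    # collision or goal reached
            break
    # End-of-run metrics tie step-by-step feedback to overall performance.
    return {"return": episode_return, "steps": steps,
            "success": info.get("success", False)}
```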

With the adaptive self-learning capability of DRL (Figures 1 and 2), its use in robotic navigation research has significantly increased. Therefore, further contributions in this area are essential, not only to provide a wider range of DRL algorithms for various robot navigation applications but also to explore the recent, unresolved issues and challenges in DRL. This will help enhance robot performance in real-time dynamic environments, leading to more robust and efficient navigation. The databases used to find the articles were Scopus and ScienceDirect, and the search keywords ‘Deep Reinforcement Learning’ and ‘Robot Navigation with Deep Reinforcement Learning’ were used to identify the articles included in this review.

2. Current research trends in deep reinforcement learning for robotic navigation

2.1 Transition of path planning from static to dynamic environments

Robot navigation initially focused on static environments before advancing to dynamic, real-time scenarios, with various path planning algorithms proposed over time [15, 16]. In situations without prior knowledge of the environment, a robot can determine its local path reactively by relying on sensor information, an approach known as behavior-based navigation [17, 18] or the artificial potential field method [19]. These are considered reactive approaches in the early evolution of mobile robot local path planning [20, 21]. However, such systems have limitations. While they can help a robot avoid obstacles in basic dynamic environments within a local domain, they are prone to issues such as the local minima problem and cannot guarantee the optimality of the global path.

By utilizing information about a known or partially known environment where a complete or partial map is available to the robot, global path planning can be applied in static environments [20, 22]. In such cases, appropriate algorithms are used to determine the robot’s global path on the given map. Common and well-known classical and heuristic-based algorithms include A* [23, 24], D*, distance wave transform, Dijkstra’s algorithm [25], RRT (rapidly exploring random trees) [26, 27, 28], and PRM (Probabilistic Roadmap). With the evolution of artificial intelligence, other optimization algorithms have been introduced, such as Genetic Algorithm (GA) [29, 30], Ant Colony Optimization (ACO) [31, 32], and Bee Colony Optimization [33], among others. These algorithms have been shown to produce optimal paths with the shortest travel distance. This pre-route computation allows the robot system to rely more on offline processing rather than online computation. Additionally, various intelligent controllers can be embedded into the system to enhance performance, including fuzzy logic [34, 35, 36], neural networks [37, 38], model predictive control (MPC) [39], and proportional-integral-derivative (PID) control [40], among others.
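For reference, a compact sketch of grid-based A*, one of the classical global planners listed above, is given below; with the heuristic set to zero it reduces to Dijkstra’s algorithm. The 4-connected occupancy grid and unit step costs are simplifying assumptions.

```python
import heapq

# Compact A* on a 4-connected occupancy grid (0 = free, 1 = occupied).
def astar(grid, start, goal):
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    open_set = [(h(start), 0, start, None)]    # (f, g, node, parent)
    came_from, g_cost = {}, {start: 0}
    while open_set:
        _, g, node, parent = heapq.heappop(open_set)
        if node in came_from:                  # already expanded
            continue
        came_from[node] = parent
        if node == goal:                       # reconstruct the path
            path = [node]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1]
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) \
                    and grid[nr][nc] == 0 and (nr, nc) not in came_from:
                ng = g + 1
                if ng < g_cost.get((nr, nc), float("inf")):
                    g_cost[(nr, nc)] = ng
                    heapq.heappush(open_set, (ng + h((nr, nc)), ng, (nr, nc), node))
    return None                                # no path exists

grid = [[0, 0, 0], [1, 1, 0], [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))  # route around the occupied middle row
```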

Recognizing the advantages of having two separate systems, one for global path planning and another for local path planning, many researchers have sought alternative approaches to combine both into a single hybrid system. Such a system enables greater flexibility for robots operating in dynamic environments. In this approach, both systems complement each other: the optimal path to the goal is pre-calculated offline using the global planner before the robot begins moving, while the local reactive method takes over during execution to handle unexpected dynamic obstacles within the environment. As a result, various hybrid methods have been applied in robot navigation, such as improved A* with the Fuzzy Dynamic Window Approach (FDWA) [41], Particle Swarm Optimization (PSO) combined with Simulated Annealing (SA) [42], and Fuzzy Logic integrated with Artificial Potential Field (APF) [43]. These methods have demonstrated improvements in optimizing travel time and distance through the global planner while ensuring safe obstacle avoidance via the local planner. Despite these successes, challenges remain, particularly the high computational cost in large-scale dynamic environments, where mapping choices and algorithm selection significantly affect performance. Furthermore, the local minima problem may still occur as obstacle complexity increases, especially in real-time dynamic scenarios.
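A minimal sketch of this global-local split is shown below: a waypoint precomputed by the global planner is tracked until the range sensor reports a nearby obstacle, at which point a reactive rule takes over. The geometry, turn-away rule, and safety threshold are illustrative assumptions, not drawn from the cited hybrid methods.

```python
import math

# Hybrid navigation step: follow the global waypoint unless the local
# reactive layer detects an obstacle within the safety distance.
def hybrid_heading(pose, waypoint, scan_angles, scan_ranges, safe_dist=0.5):
    """pose = (x, y, theta); returns a desired heading in radians."""
    nearest = min(range(len(scan_ranges)), key=lambda i: scan_ranges[i])
    if scan_ranges[nearest] < safe_dist:
        # Local layer: turn 90 degrees away from the closest obstacle beam.
        turn = math.pi / 2 if scan_angles[nearest] <= 0 else -math.pi / 2
        return pose[2] + turn
    # Global layer: head toward the next precomputed waypoint.
    return math.atan2(waypoint[1] - pose[1], waypoint[0] - pose[0])

heading = hybrid_heading((0, 0, 0), (5, 5), [-0.5, 0.0, 0.5], [2.0, 1.8, 2.2])
print(round(heading, 2))  # 0.79 rad: path is clear, so track the waypoint
```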

With the introduction of deep reinforcement learning (DRL), robotics navigation has gained significant advantages over conventional path planning and rule-based methods. DRL is well-suited for real-world applications as it can handle continuous action spaces and does not require complete prior knowledge of the environment. This adaptability makes it particularly effective in complex and uncertain scenarios. Over recent years, DRL has been successfully applied to various robot navigation case studies, demonstrating strong performance in challenging environments [44]. It has proven effective in complete navigation systems [45, 46], path planning tasks [47, 48], and obstacle avoidance [10], ensuring safety-aware navigation that prevents collisions and unsafe paths. DRL has also been applied to address specific motion planning problems [49] and to navigate different terrain types [50]. Several examples highlight DRL’s versatility. In a static environment, Sharma and Jain [5] showed that autonomous unmanned aerial vehicles (UAVs) using the Soft Actor-Critic (SAC) algorithm could efficiently plan paths, avoid obstacles, and optimize trajectories. In dynamic environments with varying numbers of obstacles, a TurtleBot 3 employing the Deep Deterministic Policy Gradient (DDPG) algorithm combined with Voronoi diagrams demonstrated high effectiveness [51]. In highly dynamic scenarios, UAVs equipped with a layered Deep Q-Network (DQN) achieved faster convergence, improved time and speed performance, and better generalization capabilities [52].

In the traditional robot navigation framework, the robot creates a map of its surroundings from LIDAR or camera inputs through simultaneous localization and mapping (SLAM) [53]. Based on the generated local and global maps, the path to the goal point is planned while avoiding obstacles in the environment. However, deep reinforcement learning (DRL) has emerged as a powerful alternative due to its ability to handle high-dimensional problems and process large amounts of input data through deep neural networks. As a result, DRL has been used to replace or integrate with traditional navigation frameworks, particularly in navigation tasks. Figure 3 demonstrates the interaction process between an agent and the environment in a typical DRL-based navigation framework. In this setup, the agent assumes roles traditionally handled by separate modules such as robot localization, map building, and local path planning.

Figure 3.

Deep reinforcement learning-based navigation architecture [54]. *Reprinted from Ref. [54], IEEE, CC BY license.

In summary, the evolution of path planning algorithms has progressed in proportion to the complexity of the environments in which robots operate. The more challenging and dynamic the environment, especially in real-time scenarios with moving objects and dense surroundings, the more advanced and adaptable the algorithms need to be. Such algorithms must be flexible enough to adapt to changes over time, learn from experience, generalize to unseen scenarios, and handle dynamics and uncertainty without requiring explicit mapping. This contrasts with conventional algorithms, which rely on predefined maps and world models and perform well only in known, static environments. Thanks to these capabilities, DRL has made significant contributions to the field. Krishna Teja et al. [16] present a graph showing the number of cases and algorithms applied across different environmental domains. According to the reviewed literature, DRL is the most widely applied method for robot path planning in dynamic and hybrid environments, outperforming many conventional algorithms in terms of both effectiveness and efficiency [44].

2.2 Sample inefficiency

To learn an effective policy, deep reinforcement learning (DRL) requires extensive sampling and interaction with the environment, and sample efficiency is often low because each policy improvement consumes a large number of interactions. Several strategies can improve sample efficiency, such as using off-policy algorithms (e.g., DQN, DDPG, SAC) that reuse past experiences [55]; compared to on-policy algorithms like PPO, off-policy methods such as SAC often achieve higher sample efficiency. Experience replay [56, 57, 58] further enhances learning by enabling more training per sample. Improving exploration can also prevent sample wastage; for example, SAC leverages entropy regularization to encourage more effective exploration [57].
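Experience replay itself is simple to implement; the sketch below stores transitions and samples them uniformly for repeated training, as in off-policy methods such as DQN, DDPG, and SAC. The capacity and batch size are illustrative defaults.

```python
import random
from collections import deque

# Minimal uniform experience-replay buffer for off-policy DRL.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are evicted

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        # Transpose to (states, actions, rewards, next_states, dones).
        return tuple(map(list, zip(*batch)))

    def __len__(self):
        return len(self.buffer)
```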

In addition, imitation learning, model-based reinforcement learning, and improved reward design [3, 59] can further boost efficiency. Imitation learning and model-based DRL allow agents to avoid starting from scratch, reducing the number of interactions required. Proper reward design significantly affects the learning process, as sparse or poorly shaped rewards hinder efficiency. Metaheuristic algorithms can be integrated with DRL to optimize reward shaping or path parameters [59, 60, 61], fine-tuning reward weights or structures to make them more informative. This accelerates convergence, improves generalization, and reduces wasted samples from misleading rewards, benefits that are particularly valuable in real-world robotics and simulation-to-real transfer, where sample collection is costly.
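One widely used form of informative reward design is potential-based shaping, which adds a dense progress signal to a sparse goal reward while provably leaving the optimal policy unchanged. The sketch below uses negative goal distance as the potential; the weights and terminal bonuses are illustrative assumptions.

```python
import math

# Potential-based reward shaping: F = gamma * phi(s') - phi(s), added on
# top of a sparse terminal reward. All constants here are illustrative.
def shaped_reward(pos, next_pos, goal, reached, collided, gamma=0.99, w=1.0):
    phi = lambda p: -math.dist(p, goal)            # potential: negative distance
    sparse = 10.0 if reached else (-10.0 if collided else 0.0)
    shaping = w * (gamma * phi(next_pos) - phi(pos))
    return sparse + shaping

print(shaped_reward((0, 0), (1, 0), (5, 0), False, False))  # positive: progress
```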

To address instability in DRL for continuous control, Ko and Huh [62] proposed a two-phase optimization framework, where a swarm intelligence algorithm is applied at the end of the learning phase. Experiments demonstrated that this approach achieved superior and more stable performance than conventional DRL methods in robot locomotion tasks. Finally, simulators such as GAZEBO [14, 63] and NVIDIA Isaac SIM [9] can significantly improve sample efficiency by allowing extensive pre-training in virtual environments before deployment in the real world. This enables agents to generalize better with fewer real-world samples.

2.3 Integration with perception and mapping

To improve the robustness and efficiency of deep reinforcement learning (DRL) in robotics, tighter integration with core robot subsystems such as perception and mapping is essential. For example, snake robots navigating in unknown dynamic environments can leverage multiple onboard sensors to reach a goal while avoiding obstacles [64]. A deep Q-Learning-based path planning method combined with simultaneous localization and mapping (SLAM) and perception from a 2D LiDAR and an IMU enables end-to-end navigation. Results show that active SLAM improves the success rate and reduces collisions compared with traditional A* and sampling-based RRT* algorithms.

Wang et al. [65] demonstrated this with an aerial robot exploring an unknown environment. SLAM was employed to map the area, while an improved frontier-based exploration strategy integrated with DRL and a neural network architecture guided the robot toward target points. The proposed method consistently explored environments of varying sizes, requiring shorter travel distances than competing approaches.

Hybrid models also show significant benefits. In a simulated indoor maze with moving obstacles, a mobile robot using LiDAR and IMU inputs achieved higher speed and goal-reaching success rates than standard DRL. In this case, Proximal Policy Optimization (PPO) learned obstacle avoidance strategies, while a Genetic Algorithm optimized hyperparameters such as discount factors, reward weights, and network layer sizes before training. This integration not only improved sample efficiency and robustness but also enhanced dynamic obstacle avoidance.
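A minimal sketch of such genetic-algorithm hyperparameter tuning is given below. `train_and_evaluate` is a placeholder fitness function standing in for a full training run scored by, for example, goal-reaching success rate; the population size, mutation rule, and parameter ranges are illustrative assumptions.

```python
import random

def train_and_evaluate(params):
    # Placeholder fitness: a real version would train PPO with `params`
    # and return a success rate or mean episode return.
    return -abs(params["gamma"] - 0.99) - abs(params["lr"] - 3e-4)

def mutate(parent):
    child = dict(parent)
    key = random.choice(list(child))
    child[key] *= random.uniform(0.8, 1.25)        # small multiplicative noise
    return child

population = [{"gamma": random.uniform(0.9, 0.999),
               "lr": 10 ** random.uniform(-5, -3)} for _ in range(8)]
for generation in range(10):
    ranked = sorted(population, key=train_and_evaluate, reverse=True)
    parents = ranked[:4]                           # selection: keep the best half
    population = parents + [mutate(random.choice(parents)) for _ in range(4)]

print(max(population, key=train_and_evaluate))     # best hyperparameters found
```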

2.4 Multi-agent robot research

A current research trend focuses on multi-agent systems with increasing numbers of robots, commonly referred to as swarm or cooperative navigation tasks, as well as collaborative and competitive multi-robot systems [66, 67, 68, 69]. When multiple agents work cooperatively, tasks can be distributed among them, improving overall performance [70]. To enable agents to learn effectively and adapt to various domains, DRL has been applied in multi-agent case studies to enhance scalability as the number of agents increases [45].

Multi-robot pathfinding (MRPF) [47] has been proposed to achieve efficient coordination and collaboration in high-dimensional multi-agent systems. The framework is built on DRL using a two-layer concept, and by integrating this with a hierarchical reward mechanism, the success rate improves by more than 20% compared to existing algorithms, leading to better generalization and efficiency.

An innovative DRL variant, known as the DRL-MPCGNNs model, integrates DRL with model predictive control (MPC) and graph neural networks (GNN) for optimized path planning and task allocation in multi-agent systems. Proposed by Li et al. [71], this approach significantly improves path planning efficiency, task allocation effectiveness, and inter-robot collaboration performance.

2.5 Multi-objective navigation

As robot navigation tasks grow increasingly complex, research focus has shifted from achieving single objectives to addressing multiple objectives through advanced variations of Deep Reinforcement Learning (DRL). For example, Cheng et al. [53] proposed a crowd-aware, multi-objective DRL-based navigation system called multi-objective dual-selection reinforcement learning (MODSRL), designed to navigate efficiently in crowded environments. By employing an appropriately designed reward function, MODSRL simultaneously achieves multiple objectives, including safety, time efficiency, collision avoidance, and smooth trajectory generation, demonstrating strong system robustness.

In the domain of aerial robotics, Wu et al. [71] introduced a multi-objective navigation reinforcement learning (MONRL) framework for drones to navigate and avoid obstacles in unknown environments. Leveraging DRL, the drone learns navigation policies that optimize path planning while mitigating wind disturbances using camera input. The method was evaluated in both a virtual environment and a real-world model of New York City, achieving effective and adaptive navigation performance.

2.6 Simulation-to-real transfer

Agents are typically trained in simulation to avoid the high cost, risk, and inefficiency associated with real-world training. However, generalization often fails when transferring policies from simulation to real-world environments due to the simulation-to-reality gap. To address this challenge, several approaches can be adopted, including domain adaptation, domain randomization, world models, and simulation fidelity improvement.

For domain adaptation, methods such as feature-level alignment [72], adversarial training, or fine-tuning can be applied to map sensor inputs such as camera or LiDAR data from simulated to real-world domains. Domain randomization improves robustness and generalization by exposing the robot to randomized simulation conditions during training, including variations in lighting, obstacle properties, and dynamics.
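In code, domain randomization often amounts to resampling simulator parameters at the start of every episode, as sketched below; the parameter names, ranges, and the `env.reconfigure` call are illustrative assumptions rather than any particular simulator’s API.

```python
import random

# Draw a fresh set of simulation conditions so the policy cannot
# overfit to a single rendering of the world. Ranges are illustrative.
def sample_domain():
    return {
        "lighting_intensity": random.uniform(0.3, 1.5),
        "floor_friction":     random.uniform(0.4, 1.2),
        "obstacle_count":     random.randint(2, 12),
        "sensor_noise_std":   random.uniform(0.0, 0.05),
        "actuation_delay_s":  random.uniform(0.0, 0.1),
    }

for episode in range(3):
    params = sample_domain()
    # env.reconfigure(**params)   # hypothetical call into the simulator
    print(episode, params)
```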

Using high-fidelity simulators such as NVIDIA Omniverse, Gazebo, MuJoCo, and Webots [73], which incorporate realistic noise, latency, and sensor inaccuracies, helps bridge the gap by better matching real-world physics and sensor data. In practice, most researchers train all policies in simulation before transferring them to real robots, and this approach has proven effective in achieving generalization across quadruped, mobile, and humanoid robot platforms.

3. Challenges in deploying deep reinforcement learning for robotic navigation

3.1 Challenges of working in continuous action spaces

Real-world settings, which correspond to continuous action spaces, pose several challenges for deep reinforcement learning, such as high computational complexity, control instability, exploration difficulty, and low sampling efficiency [3]. A few algorithms have been designed to handle continuous actions without discretization, including Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC). In PPO and SAC, the policy is modeled as a continuous distribution such as a Gaussian from which the agent samples actions, whereas DDPG learns a deterministic policy that outputs the action directly.

3.1.1 Exploration difficulty and high computational complexity

In continuous action spaces, where the set of possible actions is infinite, the agent must explore numerous state-action combinations to identify the optimal strategy. This significantly prolongs training time and increases the risk of converging to local optima. As the dimensionality of the action space increases, the agent must interact more with the environment to learn an effective policy. This increases data complexity, leading to higher computational requirements, which may be impractical for real-world robotic tasks.

3.1.2 Slow and unstable learning rate

In continuous domains, the real-valued actions taken by each agent depend on its policy, which determines the action to be taken in each state. The policy directly affects the gradients and learning process. Poor parameter choices can cause exploding or vanishing gradients, making learning slow and unstable. This instability is especially problematic when the policy diverges or when updates have negligible effects. Since policy design often involves neural networks with specific input–output configurations, researchers have addressed these issues by carefully initializing weights, applying output normalization, selecting appropriate algorithms, and employing entropy regularization [74] to improve stability.
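The sketch below illustrates three of these stabilization practices in PyTorch: orthogonal weight initialization, a tanh output layer that normalizes (bounds) the actions, and an entropy bonus in the actor loss. The layer sizes, gains, and entropy coefficient are illustrative assumptions.

```python
import torch
import torch.nn as nn

def init_layer(layer, gain=1.0):
    nn.init.orthogonal_(layer.weight, gain)   # careful weight initialization
    nn.init.zeros_(layer.bias)
    return layer

# Small output gain keeps initial actions near zero; tanh bounds them.
policy = nn.Sequential(
    init_layer(nn.Linear(24, 128)), nn.ReLU(),
    init_layer(nn.Linear(128, 2), gain=0.01), nn.Tanh(),
)

def actor_loss(policy_objective, action_dist, entropy_coef=0.01):
    # Entropy regularization discourages premature, overconfident policies;
    # maximizing objective and entropy means minimizing their negatives.
    return -(policy_objective + entropy_coef * action_dist.entropy().mean())
```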

3.1.3 Low sampling efficiency

Sampling efficiency is reduced because agents require a large number of samples across the infinite action space, which also increases computational demands. To ensure sufficient learning, the agent must gather enough interactions with the environment. To boost sampling efficiency (learning more from each sample), researchers have applied techniques such as experience replay and meta-learning [59]. Experience replay stores and reuses past samples instead of discarding them, improving stability and convergence. Meta-learning, on the other hand, enables the agent to adapt to new tasks quickly with only a few samples, rather than exploring the entire space. In this approach, the agent starts closer to the optimal policy, and with just a few gradient steps, it can learn parameters that adapt rapidly to a new task.

3.2 Time intensive

Training robots in real-world environments is time-consuming due to the complexity of the process. Transfer learning offers an alternative by enabling models to be trained in simulations before fine-tuning in the real world, thereby accelerating learning and reducing data requirements. This approach also improves the adaptability and responsiveness of DRL models to dynamic and changing real-world scenarios. Consequently, many researchers focus on simulation-based training [10, 48, 75, 76] or a combination of simulation and experimental setups [45] to ensure robust generalization. Studies have shown that DRL models trained in simulations can perform effectively in real-world environments [10, 14], demonstrating that good generalization is achievable. To bridge the simulation-to-reality gap, validation must be conducted in both domains to ensure effective domain adaptation.

3.3 Transfer from simulation to the real world

Testing in simulation differs significantly from real-world deployment due to variations in sensors, noise, dynamics, and environment. These discrepancies can cause learned policies to fail when applied to real robots, potentially leading to unsafe and unpredictable conditions. Additionally, latency issues in real-time hardware and controllers may impair DRL policy performance. Policies trained in one environment often struggle to generalize to new or slightly altered settings, necessitating transfer learning [77] and domain adaptation, both challenging for DRL.

To address these challenges, several techniques are commonly employed and should be continuously refined. These include training robust policies that are less sensitive to small environmental changes [78], using high-fidelity simulators [73], initiating training in simulation followed by fine-tuning with real-world data, and applying domain randomization during simulation to improve policy generalization across diverse robotic settings.

3.4 Exploration vs. exploitation: Complexities in reward design

In reinforcement learning, an agent must balance exploiting known strategies to maximize rewards while exploring new actions to discover better ones. Deep reinforcement learning (DRL) trains agents by maximizing cumulative rewards received from the environment, enabling them to learn desired behaviors. However, balancing exploration and exploitation is challenging, especially in high-dimensional or sparse-reward environments [58]. Poorly designed rewards can lead to suboptimal learning, unintended behaviors, or agents getting stuck in local optima due to inadequate exploration. For example, in navigation tasks, rewards may be higher when the agent is near the goal and penalized when far away [14].
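A distance-based navigation reward of this kind can be as simple as the sketch below; the constants are illustrative, and poorly balanced choices of exactly this sort are what can trap an agent in local optima.

```python
import math

# Illustrative navigation reward: higher near the goal, penalized when
# far away or colliding, with a terminal bonus inside the goal radius.
def navigation_reward(pos, goal, collided, goal_radius=0.3):
    distance = math.dist(pos, goal)
    if collided:
        return -20.0            # strong penalty for unsafe behavior
    if distance < goal_radius:
        return +20.0            # terminal success bonus
    return -distance            # dense signal: closer is better

print(navigation_reward((1.0, 1.0), (4.0, 5.0), collided=False))  # -5.0
```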

In Soft Actor-Critic (SAC), the reward function directly influences exploitation and indirectly affects exploration [79]. SAC’s trade-off is based on maximum entropy reinforcement learning, where the agent maximizes both expected returns and policy entropy (encouraging stochastic actions). Fine-tuning techniques such as adaptive and dynamic entropy adjustment have been developed to optimize SAC’s performance. Improper entropy tuning can cause excessive randomness, undermining learning efficiency.
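The entropy fine-tuning mentioned above is commonly realized by learning the temperature automatically, so that the policy’s entropy tracks a target value (often the negative of the action dimension). The sketch below follows the standard SAC temperature-update rule; the hyperparameters and surrounding training-loop wiring are simplified assumptions.

```python
import torch

action_dim = 2
target_entropy = -float(action_dim)           # common SAC default
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_probs):
    # J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)]: alpha rises
    # when entropy is below target (more exploration) and falls otherwise.
    loss = -(log_alpha.exp() * (log_probs + target_entropy).detach()).mean()
    alpha_opt.zero_grad()
    loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()             # temperature for actor/critic losses

print(update_alpha(torch.randn(64)))          # log-probs from a sampled batch
```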

While entropy encourages exploration, it may also lead to suboptimal actions in some applications. A well-shaped reward function accelerates learning by providing clear signals for exploitation, driving the agent toward desired behavior. Without meaningful rewards, the agent relies heavily on entropy-driven exploration, slowing progress. In summary, designing appropriate reward functions is a complex but crucial aspect of effective DRL training.

3.5 Dynamic environments: Dense and disaster-focused scenarios

Dynamic environments, characterized by changing obstacles, lighting, or terrain, pose significant challenges for deep reinforcement learning (DRL) [79]. The agent must continuously re-explore as the environment evolves, complicating exploration and increasing sample requirements. This leads to longer, more expensive training due to the greater number of interactions needed to learn effective policies across varying conditions.

Adapting quickly to such changes is difficult; policies may diverge instead of converging, especially for off-policy algorithms like deep Q-networks (DQN), whose replay buffers can contain outdated experiences that no longer represent the current dynamics [80]. The increased variance in reward signals caused by environmental changes further complicates stable policy learning. Unlike static settings, the same state in a dynamic environment may yield different outcomes at different times, causing the agent to struggle in adapting to new conditions.

This unpredictability increases the risk of outdated or mislearned policies, potentially leading to unsafe robot behaviors such as collisions or damage. Dense environments such as crowded pedestrian areas [75, 76] and disaster scenarios [46, 80] amplify these challenges, requiring robust and adaptive DRL solutions to ensure safe and effective navigation.

3.6 Scalability

Scalability refers to a DRL system’s ability to learn efficiently as the number of agents, tasks, or environmental complexities increases, while maintaining reasonable time and resource requirements, especially in high-dimensional action and complex state spaces. In swarm robotics, navigation and task allocation become increasingly challenging due to dynamic environments [81]. Bar and Karakose [6] proposed a DRL framework based on a distributed architecture combined with adaptive reward mechanisms, demonstrating improved navigation success rates even as the swarm size grows.

Martinez et al. [82] evaluated DRL performance in highly dynamic, real-world-like environments and confirmed its robustness in crowded and previously unseen scenarios. Meanwhile, Gao et al. [83] introduced Multi-Agent Deep Deterministic Policy Gradient (MADDPG) for multi-unmanned surface vehicle (USV) path planning, showing enhanced training effectiveness and scalability and yielding more efficient and practical path planning.

To further improve scalability in algorithms like Soft Actor-Critic, strategies such as minimizing communication overhead, employing complex policy optimization, decentralized learning, graph-based policies, and federated reinforcement learning have been proposed [84]. These approaches help DRL handle growing system complexity while preserving performance and learning efficiency.

4. Future directions of deep reinforcement learning in robotics research

4.1 Dynamic environment

The ultimate goal of autonomous robotic systems is to navigate unknown and dynamic real-world environments. Deep reinforcement learning (DRL) has been widely applied in this domain to achieve adaptive and intelligent control. However, significant gaps remain, particularly in disaster [80] and emergency scenarios [85], where unpredictable and continuous changes demand highly adaptive robot control systems. To prevent accidents and ensure practical efficiency, these systems must be further enhanced to cope with such challenging environments. Approaches like adversarial training and uncertainty-aware policy learning show promise for improving robustness in unpredictable settings.

In disaster environments, where human access is limited, robots are expected to navigate complex and hazardous terrain efficiently. Their role includes real-time monitoring, hazard identification, and survivor detection, supported by artificial intelligence and advanced onboard sensors. Such capabilities contribute to more resilient and responsive disaster management. DRL, particularly deep Q-networks, has demonstrated significantly higher rescue success rates across various natural disaster scenarios compared to traditional methods [80].

4.2 Time-varying obstacles environment

Highly dynamic environments are characterized by frequently changing obstacle configurations, where both the shape and position of obstacles vary continuously over time. For instance, pedestrian environments [75] exhibit constantly shifting movements of people [76], while logistics and intelligent transportation settings feature both pedestrians and vehicles as dynamic obstacles. The high level of dynamism in these environments demands safe and efficient navigation strategies. The risk of failure increases significantly in densely populated dynamic spaces, where obstacle density is high. Therefore, intensive research efforts should focus on these complex scenarios rather than on simpler obstacle configurations when applying deep reinforcement learning (DRL) [75].

4.3 Increasing learning efficiency through sample and data optimization

Learning with deep reinforcement learning (DRL) demands substantial computational resources due to the vast amounts of data generated from interactions among environment, actions, and rewards. In high-dimensional state and action spaces, the exponential growth of possible states leads to increased data usage, longer training times, and higher computational costs. To mitigate this complexity, future strategies could include improved exploration methods, model-based reinforcement learning, integration of prior knowledge, and extensive simulation training.

Exploration remains a critical challenge, especially in complex environments where extensive exploration is necessary to discover effective strategies, directly impacting learning efficiency. For example, exploration in Soft Actor-Critic (SAC) can be enhanced by increasing target entropy, augmenting reward terms, and incorporating model-based approaches for sparse-reward tasks [57].

4.4 Advanced robotics applications

The application of deep reinforcement learning (DRL) has demonstrated significant success in enhancing the adaptability and efficiency of robot navigation in challenging environments. Integrating DRL within robotic systems improves perception [3], instruction interpretation, and symbolic reasoning for high-level planning. For instance, Li and Wang [86] proposed a two-stage DRL control method combining Vision-TD3 and Force-TD3 networks, which leverage hybrid visual and force feedback for end-to-end lunar robot assembly operations, enabling autonomous, intelligent task execution without relying on model-based control.

To accelerate learning and align robot behavior with human intentions, incorporating human feedback such as reward shaping, demonstrations, and corrections has proven effective [87]. Liu et al. [88] introduced a mixed perception-based human-robot collaborative maintenance system employing a hierarchical structure, where online DRL aids decision-making. Experimental results demonstrated competitive performance compared to state-of-the-art approaches.

Hybridizing Soft Actor-Critic (SAC) with blockchain technology further enhances decentralized decision-making, security, and computational efficiency. By integrating SAC with distributed cloud frameworks, real-time deployment becomes feasible on autonomous ground vehicles, drones, and robotic arms, even on low-power processors [3].

As reviewed by Waseem et al. [89], the future of advanced robotic manipulators depends on intelligent processing technologies and expert systems. Emphasis should be placed on robust controllers capable of adapting to unforeseen scenarios, improving human-robot collaboration, and ensuring operational safety and efficiency. Soori et al. [90] highlighted applications of AI, machine learning, and DRL across advanced robotics industries, underscoring the need to continually enhance system robustness and performance.

Beyond navigation, DRL is increasingly applied to collaborative manipulation [91, 92], warehouse automation [93, 94], and swarm robot exploration [6], with future efforts focused on extending DRL to multi-robot cooperative and competitive scenarios to meet complex real-world demands.

4.5 Bridging the gap between simulation and real robot transfer

Bridging the reality gap between simulation and physical robots remains a critical challenge in robotics research. Future directions are likely to emphasize strategies such as learning invariant representations, domain randomization, domain adaptation, and the incorporation of real-world data into simulated training [45]. Given the high computational cost of DRL algorithms and their limited feasibility on embedded robotic platforms, developing efficient DRL methods that can run on constrained hardware is an important priority.

For example, the COBOT robot using Soft Actor-Critic (SAC) for obstacle avoidance has demonstrated efficiency in both simulation and real-world experiments [10]. To further bridge the simulation-to-reality gap, the robot's official Unified Robot Description Format (URDF) model is employed to enhance simulation fidelity. Physical parameters such as maximum joint speeds, accelerations, rotational limits, momentum, and center of mass are calibrated to match real robotic arm characteristics, significantly improving the accuracy of the simulation.

In another study, SAC was tested on the TurtleBot3 Waffle Pi, equipped with a Raspberry Pi, using the Gazebo emulator within the Robot Operating System (ROS) framework. The hardware included an NVIDIA GeForce GTX 1080 GPU and Intel Core i7 CPU (2.9 GHz). Wen et al. further advanced DRL-based path planning in unknown environments through Twin Long Short-Term Memory Delayed Deep (TLSMD) reinforcement learning, validated via numerous 3D simulations [14, 63].

5. Conclusion

Robot navigation remains a prominent research topic focused on achieving efficient and effective movement toward a goal while avoiding collisions with obstacles in the environment. Numerous algorithms have been proposed over time to generate optimal and safe paths as their capabilities have evolved. To address increasingly challenging scenarios such as dense, dynamic obstacles and unpredictable emergency conditions, robot navigation algorithms must support continual learning in continuous action spaces, complex decision-making, and policy optimization. Deep reinforcement learning (DRL), a subset of artificial intelligence and machine learning, offers promising solutions to these demands. This chapter provides a comprehensive review of DRL in robot navigation research, beginning with fundamental concepts, followed by current technological trends. Key challenges in this field are discussed alongside insights and future directions drawn from recent studies. With the goal of advancing autonomous systems for robots and related applications, this review aims to help researchers understand DRL’s capabilities and encourage its adoption in diverse domains. There remains significant potential to expand DRL applications into more complex environments while maintaining robustness, scalability, and efficiency. Furthermore, the learning efficiency of DRL algorithms is typically evaluated using metrics such as cumulative reward, sample efficiency, convergence time, success rate, learning stability, exploration efficiency, and policy robustness. These metrics should be incorporated to comprehensively assess real-time applications. To ensure the robustness and effectiveness of learning algorithms such as Soft Actor-Critic, they can be benchmarked against other DRL algorithms, including Deep Deterministic Policy Gradient (DDPG), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Proximal Policy Optimization (PPO).

Acknowledgments

The authors acknowledge the use of artificial intelligence-based language tools, specifically ChatGPT (OpenAI), to assist with language polishing, grammar checking, and clarity improvement during the preparation of this chapter. The use of these tools was limited to language refinement only. All technical content, analysis, interpretations, and conclusions were entirely developed by the authors, who take full responsibility for the accuracy and integrity of the work.

References

  1. Ekundayo OS, Ezugwu AE. Deep learning: Historical overview from inception to actualization, models, applications and future trends. Applied Soft Computing. 2025;181:113378. DOI: 10.1016/j.asoc.2025.113378
  2. del Real Torres A, Andreiana DS, Roldán ÁO, Bustos AH, Galicia LEA. A review of deep reinforcement learning approaches for smart manufacturing in industry 4.0 and 5.0 framework. Applied Sciences. 2022;12(23):12
  3. Zhang Y, Zhao W, Wang J, Yuan Y. Recent progress, challenges and future prospects of applied deep reinforcement learning: A practical perspective in path planning. Neurocomputing. 2024;608:128423. DOI: 10.1016/j.neucom.2024.128423
  4. Zhu Y, Wan Hasan WZ, Harun Ramli HR, Norsahperi NMH, Mohd Kassim MS, Yao Y. Deep reinforcement learning of mobile robot navigation in dynamic environment: A review. Sensors. 2025;25:3394. DOI: 10.3390/s25113394
  5. Sharma G, Jain S. Deep reinforcement learning-based framework for path planning of AUAVs. Procedia Computer Science. 2025;258:1112-1122. DOI: 10.1016/j.procs.2025.04.346
  6. Bar NF, Karakose M. Collaborative approach for swarm robot systems based on distributed DRL. Engineering Science and Technology, an International Journal. 2024;53:101701. DOI: 10.1016/j.jestch.2024.101701
  7. Tran DT et al. Central pattern generator based reflexive control of quadruped walking robots using a recurrent neural network. Robotics and Autonomous Systems. 2014;62(10):1497-1516. DOI: 10.1016/j.robot.2014.05.011
  8. Bellegarda G, Nguyen C, Nguyen Q. Robust quadruped jumping via deep reinforcement learning. Robotics and Autonomous Systems. 2024;182:104799. DOI: 10.1016/j.robot.2024.104799
  9. Baltes J, Christmann G, Saeedvand S. A deep reinforcement learning algorithm to control a two-wheeled scooter with a humanoid robot. Engineering Applications of Artificial Intelligence. 2023;126:106941. DOI: 10.1016/j.engappai.2023.106941
  10. Xia W, Lu Y, Xu W, Xu X. Deep reinforcement learning based proactive dynamic obstacle avoidance for safe human-robot collaboration. Manufacturing Letters. 2024;41:1246-1256. DOI: 10.1016/j.mfglet.2024.09.151
  11. Gök M. Dynamic path planning via dueling double deep Q-network (D3QN) with prioritized experience replay. Applied Soft Computing. 2024;158:111503. DOI: 10.1016/j.asoc.2024.111503
  12. Deshpande SV, Harikrishnan R, Babul Salam KSMKI, Ponnuru MDS. Mobile robot path planning using deep deterministic policy gradient with differential gaming (DDPG-DG) exploration. Cognitive Robotics. 2024;4:156-173. DOI: 10.1016/j.cogr.2024.08.002
  13. Chen X, Yin S, Li Y, Xiang Z. Dynamic path planning for multi-USV in complex ocean environments with limited perception via proximal policy optimization. Ocean Engineering. 2025;326:120907. DOI: 10.1016/j.oceaneng.2025.120907
  14. Wen S, Shu Y, Rad A, Wen Z, Guo Z, Gong S. A deep residual reinforcement learning algorithm based on soft actor-critic for autonomous navigation. Expert Systems with Applications. 2025;259:125238. DOI: 10.1016/j.eswa.2024.125238
  15. Loganathan A, Ahmad NS. A systematic review on recent advances in autonomous mobile robot navigation. Engineering Science and Technology, an International Journal. 2023;40:101343. DOI: 10.1016/j.jestch.2023.101343
  16. Krishna Teja G, Mohanty PK, Das S. Review on path planning methods for mobile robot. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science. 2025;239(14):5547-5580. DOI: 10.1177/09544062251330083
  17. Yeon ASA, Visvanathan R, Mamduh SM, Kamarudin K, Kamarudin LM, Zakaria A. Implementation of behaviour based robot with sense of smell and sight. Procedia Computer Science. 2015;76:119-125. DOI: 10.1016/j.procs.2015.12.300
  18. Jiang S, Arkin RC. SLAM-based spatial memory for behavior-based robots. IFAC-PapersOnLine. 2015;48(19):195-202. DOI: 10.1016/j.ifacol.2015.12.033
  19. Lazarowska A. Discrete artificial potential field approach to mobile robot path planning. IFAC-PapersOnLine. 2019;52(8):277-282. DOI: 10.1016/j.ifacol.2019.08.083
  20. Sariff N, Buniyamin N. An overview of autonomous mobile robot path planning algorithms. In: SCOReD 2006 - Proceedings of 2006 4th Student Conference on Research and Development “Towards Enhancing Research Excellence in the Region”. USA: IEEE; 2006. DOI: 10.1109/SCORED.2006.4339335
  21. Buniyamin N, Wan Ngah WAJ, Muhammad SNZ. A simple local path planning algorithm for autonomous mobile robots. International Journal of Systems Applications, Engineering & Development (ISAED). 2011;5(2):151-159
  22. Sariff N, Buniyamin N. Evaluation of robot path planning algorithms in global static environments: Genetic algorithm vs ant colony optimization algorithm. International Journal of Electrical and Electronic Systems Research. 2010;3(1):1-12
  23. Duchoň F et al. Path planning with modified a star algorithm for a mobile robot. Procedia Engineering. 2014;96:59-69. DOI: 10.1016/j.proeng.2014.12.098
  24. Huang J, Chen C, Shen J, Liu G, Xu F. A self-adaptive neighborhood search A-star algorithm for mobile robots global path planning. Computers and Electrical Engineering. 2025;123:110018. DOI: 10.1016/j.compeleceng.2024.110018
  25. Ahmad J, Nadhir Ab Wahab M. Enhancing the safety and smoothness of path planning through an integration of Dijkstra’s algorithm and piecewise cubic Bezier optimization. Expert Systems with Applications. 2025;289:128315. DOI: 10.1016/j.eswa.2025.128315
  26. Yang H et al. Research on multi-objective point path planning for mobile inspection robot based on multi-informed-rapidly exploring random tree*. Engineering Applications of Artificial Intelligence. 2025;151:110645. DOI: 10.1016/j.engappai.2025.110645
  27. Xu C, Zhu H, Zhu H, Wang J, Zhao Q. Improved RRT* algorithm for automatic charging robot obstacle avoidance path planning in complex environments. CMES - Computer Modeling in Engineering and Sciences. 2023;137(3):2567-2591. DOI: 10.32604/cmes.2023.029152
  28. Ge L, Phang SW, Sariff N. DPF-Bi-RRT∗: An improved path planning algorithm for complex 3D environments with adaptive sampling and dual potential field strategy. IEEE Access. 2025;13:35958-35972
  29. Sariff NB, Buniyamin N. Genetic algorithm versus ant colony optimization algorithm - Comparison of performances in robot path planning application. In: ICINCO 2010 - Proceedings of the 7th International Conference on Informatics in Control, Automation and Robotics. Setúbal, Portugal: SciTePress; 2010
  30. Sariff NB, Buniyamin N. Comparative study of genetic algorithm and ant colony optimization algorithm performances for robot path planning in global static environments of different complexities. In: Proceedings of IEEE International Symposium on Computational Intelligence in Robotics and Automation, CIRA. USA: IEEE; 2009. DOI: 10.1109/CIRA.2009.5423220
  31. Buniyamin N, Sariff N, Wan NWAJ, Mohamad Z. Robot global path planning overview and a variation of ant colony system algorithm. International Journal of Mathematics and Computers in Simulation. 2011;5(1):9-15
  32. Sariff NB, Buniyamin N. Ant Colony system for robot path planning in global static environment. In: International Conference on System Science and Simulation in Engineering – Proceedings. USA: World Scientific and Engineering Academy and Society; 2010
  33. Cui Y, Hu W, Rahmani A. Multi-robot path planning using learning-based artificial bee Colony algorithm. Engineering Applications of Artificial Intelligence. 2024;129:107579. DOI: 10.1016/j.engappai.2023.107579
  34. Mohammad SHA, Jeffril MA, Sariff N. Mobile robot obstacle avoidance by using fuzzy logic technique. In: Proceedings - 2013 IEEE 3rd International Conference on System Engineering and Technology, ICSET 2013. USA: IEEE; 2013. DOI: 10.1109/ICSEngT.2013.6650194
  35. Sariff N, Xing BTS. A wheeled mobile robot obstacles avoidance for navigation control in a static and dynamic environments. Journal of Physics Conference Series. 2023;2523(1):012028. DOI: 10.1088/1742-6596/2523/1/012028
  36. Adam YM, Sariff NB, Algeelani NA. E-puck mobile robot obstacles avoidance controller using the fuzzy logic approach. In: 2021 2nd International Conference on Smart Computing and Electronic Enterprise (ICSCEE). USA: IEEE; 2021. pp. 107-112. DOI: 10.1109/ICSCEE50312.2021.9497939
  37. Jeffril MA, Sariff N. The integration of fuzzy logic and artificial neural network methods for mobile robot obstacle avoidance in a static environment. In: Proceedings - 2013 IEEE 3rd International Conference on System Engineering and Technology, ICSET 2013. USA: IEEE; 2013. DOI: 10.1109/ICSEngT.2013.6650193
  38. Sariff NB, Wahab NHNBA. Automatic mobile robot obstacles avoidance in a static environment by using a hybrid technique based on fuzzy logic and artificial neural network. In: Proceedings - 2014 4th International Conference on Artificial Intelligence with Applications in Engineering and Technology, ICAIET 2014. USA: IEEE; 2014. DOI: 10.1109/ICAIET.2014.31
  39. Prados C, Hernando M, Gambao E. Path and footfall planning for N-legged and climbing robots — A model predictive control approach. Robotics and Autonomous Systems. 2025;194:105119. DOI: 10.1016/j.robot.2025.105119
  40. Si G, Zhang R, Jin X. Path planning of factory handling robot integrating fuzzy logic-PID control technology. Systems and Soft Computing. 2025;7:200188. DOI: 10.1016/j.sasc.2025.200188
  41. Wang Y, Fu C, Huang R, Tong K, He Y, Xu L. Path planning for mobile robots in greenhouse orchards based on improved A* and fuzzy DWA algorithms. Computers and Electronics in Agriculture. 2024;227:109598. DOI: 10.1016/j.compag.2024.109598
  42. Lin S, Liu A, Wang J, Kong X. An intelligence-based hybrid PSO-SA for mobile robot path planning in warehouse. Journal of Computer Science. 2023;67:101938. DOI: 10.1016/j.jocs.2022.101938
  43. Hu L, Wei C, Yin L. Fuzzy A* quantum multi-stage Q-learning artificial potential field for path planning of mobile robots. Engineering Applications of Artificial Intelligence. 2025;141:109866. DOI: 10.1016/j.engappai.2024.109866
  44. Li S, Li X, Wang P. Application of robot autonomous navigation based on reinforcement learning. Procedia Computer Science. 2025;262:1402-1409. DOI: 10.1016/j.procs.2025.05.188
  45. Lv X, Xiong W. Robot autonomous navigation path planning based on deep reinforcement learning. Procedia Computer Science. 2025;262:1130-1136. DOI: 10.1016/j.procs.2025.05.151
  46. Li C-Y, Zhang F, Chen L. Robot-assisted pedestrian evacuation in fire scenarios based on deep reinforcement learning. Chinese Journal of Physics. 2024;92:494-531. DOI: 10.1016/j.cjph.2024.09.008
  47. Huo L, Mao J, San H, Li R, Zhang S. Deep reinforcement learning of group consciousness for multi-robot pathfinding. Engineering Applications of Artificial Intelligence. 2025;155:110978. DOI: 10.1016/j.engappai.2025.110978
  48. Tao B, Kim J-H. Deep reinforcement learning-based local path planning in dynamic environments for mobile robot. Journal of King Saud University - Computer and Information Sciences. 2024;36(10):102254. DOI: 10.1016/j.jksuci.2024.102254
  49. Li Y et al. Peduncle collision-free grasping based on deep reinforcement learning for tomato harvesting robot. Computers and Electronics in Agriculture. 2024;216:108488. DOI: 10.1016/j.compag.2023.108488
  50. Huang S, Xiao Z, Zheng M, Shi W. Hierarchical reinforcement learning for enhancing stability and adaptability of hexapod robots in complex terrains. Biomimetic Intelligence and Robotics. 2025;5:100231. DOI: 10.1016/j.birob.2025.100231
  51. Zhao H, Guo Y, Liu Y, Jin J. Multirobot unknown environment exploration and obstacle avoidance based on a Voronoi diagram and reinforcement learning. Expert Systems with Applications. 2025;264:125900. DOI: 10.1016/j.eswa.2024.125900
  52. Guo T, Jiang N, Li B, Zhu X, Wang Y, Du W. UAV navigation in high dynamic environments: A deep reinforcement learning approach. Chinese Journal of Aeronautics. 2021;34(2):479-489. DOI: 10.1016/j.cja.2020.05.011
  53. Cheng C-L, Hsu C-C, Saeedvand S, Jo J-H. Multi-objective crowd-aware robot navigation system using deep reinforcement learning. Applied Soft Computing. 2024;151:111154. DOI: 10.1016/j.asoc.2023.111154
  54. Zhu K, Zhang T. Deep reinforcement learning based mobile robot navigation: A review. Tsinghua Science and Technology. 2021;26(5):674-691. DOI: 10.26599/TST.2021.9010012
  55. Yin Y, Chen Z, Liu G, Yin J, Guo J. Autonomous navigation of mobile robots in unknown environments using off-policy reinforcement learning with curriculum learning. Expert Systems with Applications. 2024;247:123202. DOI: 10.1016/j.eswa.2024.123202
  56. Luo Y et al. Relay hindsight experience replay: Self-guided continual reinforcement learning for sequential object manipulation tasks with sparse rewards. Neurocomputing. 2023;557:126620. DOI: 10.1016/j.neucom.2023.126620
  57. Zhao T, Wang M, Zhao Q, Zheng X, Gao H. A path-planning method based on improved soft actor-critic algorithm for mobile robots. Biomimetics. 2023;8:481
  58. Xiao W, Yuan L, Ran T, He L, Zhang J, Cui J. Multimodal fusion for autonomous navigation via deep reinforcement learning with sparse rewards and hindsight experience replay. Displays. 2023;78:102440. DOI: 10.1016/j.displa.2023.102440
  59. Khan MR, Mohd Ibrahim A, Al Mahmud S, Samat FA, Jasni F, Mardzuki MI. Advancing mobile robot navigation with DRL and heuristic rewards: A comprehensive review. Neurocomputing. 2025;652:131036. DOI: 10.1016/j.neucom.2025.131036
  60. Guo Q, Zhao W, Lyu Z, Zhao T. A GAN enhanced meta-deep reinforcement learning approach for DCN routing optimization. Information Fusion. 2025;121:103160. DOI: 10.1016/j.inffus.2025.103160
  61. Kharitonov A, Abani JI, Nahhas A, Turowski K. Literature survey on combining machine learning and metaheuristics for decision-making. Procedia Computer Science. 2025;253:199-208. DOI: 10.1016/j.procs.2025.01.083
  62. Ko G-J, Huh J. Metaheuristic-based weight optimization for robust deep reinforcement learning in continuous control. Swarm and Evolutionary Computation. 2025;95:101920. DOI: 10.1016/j.swevo.2025.101920
  63. Wen T, Wang X, Zheng Z, Sun Z. A DRL-based path planning method for wheeled mobile robots in unknown environments. Computers and Electrical Engineering. 2024;118:109425. DOI: 10.1016/j.compeleceng.2024.109425
  64. Liu X, Wen S, Hu Y, Han F, Zhang H, Karimi HR. An active SLAM with multi-sensor fusion for snake robots based on deep reinforcement learning. Mechatronics. 2024;103:103248. DOI: 10.1016/j.mechatronics.2024.103248
  65. Wang R, Zhang J, Lyu M, Yan C, Chen Y. An improved frontier-based robot exploration strategy combined with deep reinforcement learning. Robotics and Autonomous Systems. 2024;181:104783. DOI: 10.1016/j.robot.2024.104783
  66. Blais M-A, Akhloufi MA. Reinforcement learning for swarm robotics: An overview of applications, algorithms and simulators. Cognitive Robotics. 2023;3:226-256. DOI: 10.1016/j.cogr.2023.07.004
  67. Ali ZA, Israr A, Hasan R. Survey of methods applied in cooperative motion planning of multiple robots. In: Ali ZA, Israr A, editors. Motion Planning for Dynamic Agents. Rijeka: IntechOpen; 2023. DOI: 10.5772/intechopen.1002428
  68. Zhang J, Jia Q, Zhang S, Chen G. Dynamic and prioritized task scheduling of heterogeneous multi-robot systems using deep reinforcement learning. Neurocomputing. 2025;638:130184. DOI: 10.1016/j.neucom.2025.130184
  69. Sariff NB, Ismail ZH. A survey and analysis of cooperative multi-agent robot systems: Challenges and directions. In: Gorrostieta Hurtado E, editor. Applications of Mobile Robots. Rijeka: IntechOpen; 2018. DOI: 10.5772/intechopen.79337
  70. Sariff NB, Ismail ZH, Sooriamoorthy D, Syed Mahadzir PNAF, Md Yasir ASH. Multi-agent robot motion planning for rendezvous applications in a mixed environment with a broadcast event-triggered consensus controller. In: Ali ZA, Israr A, editors. Motion Planning for Dynamic Agents. Rijeka: IntechOpen; 2023. DOI: 10.5772/intechopen.1002494
  71. Wu J, Ye Y, Du J. Multi-objective reinforcement learning for autonomous drone navigation in urban areas with wind zones. Automation in Construction. 2024;158:105253. DOI: 10.1016/j.autcon.2023.105253
  72. Jang Y, Baek J, Jeon S, Han S. Bridging the simulation-to-real gap of depth images for deep reinforcement learning. Expert Systems with Applications. 2024;253:124310. DOI: 10.1016/j.eswa.2024.124310
  73. Farley A, Wang J, Marshall JA. How to pick a mobile robot simulator: A quantitative comparison of CoppeliaSim, Gazebo, MORSE and Webots with a focus on accuracy of motion. Simulation Modelling Practice and Theory. 2022;120:102629. DOI: 10.1016/j.simpat.2022.102629
  74. Hu Y, Fu J, Wen G, Lv Y, Ren W. Distributed entropy-regularized multi-agent reinforcement learning with policy consensus. Automatica. 2024;164:111652. DOI: 10.1016/j.automatica.2024.111652
  75. Kim J, Hwang H-S, Seok J. Adversarial environment design for crowd navigation based on deep reinforcement learning. Engineering Applications of Artificial Intelligence. 2025;159:111621. DOI: 10.1016/j.engappai.2025.111621
  76. Mishra N, Yamaguchi T, Okuda H, Suzuki T. Combination of reinforcement learning models towards considerate motion planning for multiple pedestrians. IFAC-PapersOnLine. 2023;56(2):3616-3621. DOI: 10.1016/j.ifacol.2023.10.1523
  77. Wen S, Wen Z, Zhang D, Zhang H, Wang T. A multi-robot path-planning algorithm for autonomous navigation using meta-reinforcement learning based on transfer learning. Applied Soft Computing. 2021;110:107605. DOI: 10.1016/j.asoc.2021.107605
  78. 78. Joshi Kumar V, Elumalai VK. A proximal policy optimization based deep reinforcement learning framework for tracking control of a flexible robotic manipulator. Results in Engineering. 2025;25:104178. DOI: 10.1016/j.rineng.2025.104178
  79. 79. Yang L, Bi J, Yuan H. Dynamic path planning for mobile robots with deep reinforcement learning. IFAC-PapersOnLine. 2022;55(11):19-24. DOI: 10.1016/j.ifacol.2022.08.042
  80. 80. Lei Y, Liu J, Ke Y. Multi-disaster emergency response decision support based on reinforcement learning algorithm. Procedia Computer Science. 2025;261:887-895. DOI: 10.1016/j.procs.2025.04.418
  81. 81. Khaldi B, Harrou F, Sun Y. Collaborative swarm robotics for sustainable environment monitoring and exploration: Emerging trends and research progress. Energy Nexus. 2025;17:100365. DOI: 10.1016/j.nexus.2025.100365
  82. 82. Martinez-Baselga D, Riazuelo L, Montano L. RUMOR: Reinforcement learning for understanding a model of the real world for navigation in dynamic environments. Robotics and Autonomous Systems. 2025;191:105020. DOI: 10.1016/j.robot.2025.105020
  83. 83. Gao Q, Li S, Ji Y, Liu J, Song Y. Scalable path planning algorithm for multi-unmanned surface vehicles based on multi-agent deep deterministic policy gradient. Ocean Engineering. 2025;320:120243. DOI: 10.1016/j.oceaneng.2024.120243
  84. 84. Hersi AH, Divya Udayan J. Efficient and robust multirobot navigation and task allocation using soft actor critic. Procedia Computer Science. 2024;235:484-495. DOI: 10.1016/j.procs.2024.04.048
  85. 85. Dong D, Wang Z, Guan J, Xiao Y, Wang Y. Research on key technology and application progress of rescue robot in nuclear accident emergency situation. Nuclear Engineering and Technology. 2025;57(6):103457. DOI: 10.1016/j.net.2025.103457
  86. 86. Li B, Wang Z. Two-stage DRL with hybrid perception of vision and force feedback for lunar construction robotic assembly control. Acta Astronautica. 2025;229:357-373. DOI: 10.1016/j.actaastro.2025.01.017
  87. 87. Shah R, Doss ASA, Lakshmaiya N. Advancements in AI-enhanced collaborative robotics: Towards safer, smarter, and human-centric industrial automation. Results in Engineering. 2025;27:105704. DOI: 10.1016/j.rineng.2025.105704
  88. 88. Liu C, Zhang Z, Tang D, Nie Q, Zhang L, Song J. A mixed perception-based human-robot collaborative maintenance approach driven by augmented reality and online deep reinforcement learning. Robotics and Computer-Integrated Manufacturing. 2023;83:102568. DOI: 10.1016/j.rcim.2023.102568
  89. 89. Waseem S, Adnan M, Iqbal MS, Amin AA, Shah A, Tariq M. From classical to intelligent control: Evolving trends in robotic manipulator technology. Computers and Electrical Engineering. 2025;127:110559. DOI: 10.1016/j.compeleceng.2025.110559
  90. 90. Soori M, Arezoo B, Dastres R. Artificial intelligence, machine learning and deep learning in advanced robotics, a review. Cognitive Robotics. 2023;3:54-70. DOI: 10.1016/j.cogr.2023.04.001
  91. 91. Maldonado-Ramirez A, Rios-Cabrera R, Lopez-Juarez I. A visual path-following learning approach for industrial robots using DRL. Robotics and Computer-Integrated Manufacturing. 2021;71:102130. DOI: 10.1016/j.rcim.2021.102130
  92. 92. Zheng P, Li S, Fan J, Li C, Wang L. A collaborative intelligence-based approach for handling human-robot collaboration uncertainties. CIRP Annals. 2023;72(1):1-4. DOI: 10.1016/j.cirp.2023.04.057
  93. 93. Konishi M, Sasaki T, Cai K. Efficient safe control via deep reinforcement learning and supervisory control – Case study on multi-robot warehouse automation. IFAC-PapersOnLine. 2022;55(28):16-21. DOI: 10.1016/j.ifacol.2022.10.318
  94. 94. Hosseini M, Chalil Madathil S, Khasawneh MT. Reinforcement learning-based simulation optimization for an integrated manufacturing-warehouse system: A two-stage approach. Expert Systems with Applications. 2025;290:128259. DOI: 10.1016/j.eswa.2025.128259