System Architecture for Training and Evaluating the RL-Based Chaser Drone Policy with Gazebo, ROS, YOLOv5, DDPG, and Ardupilot
Abstract
Unmanned aerial vehicles (UAVs) are fast becoming a low-cost tool for various security and surveillance tasks. This has also led to the use of UAVs (drones) for unlawful activities such as spying or infringing on restricted or private air spaces. This rogue use of drone technology makes it challenging for security agencies to maintain the safety of many critical infrastructures. Additionally, because of drones’ varied low-cost designs and agility, they have become challenging to identify and track using conventional radar systems. This paper proposes a deep reinforcement learning-based approach for identifying and tracking an intruder drone using a chaser drone. Our proposed solution employs computer vision techniques interleaved with deep reinforcement learning-based control for tracking the intruder drone within the chaser’s field of view. The complete end-to-end system has been implemented using the robot operating system and Gazebo, with an Ardupilot-based flight controller for flight stabilization and maneuverability. The proposed approach has been evaluated on multiple dynamic scenarios of intruder trajectories and compared with a proportional-integral-derivative-based controller. The results show that the deep reinforcement learning policy achieves a tracking accuracy of 85%. The intruder localization module is able to localize drones in 98.5% of the frames. Furthermore, the learned policy can track the intruder even when there is a change in the speed or orientation of the intruder drone.
1 Introduction
Unmanned aerial vehicles (UAVs), commonly known as drones, have quickly become a tool for carrying out remote surveillance missions. They have almost undetectable radar signatures and can perform controlled maneuvers that are both complex and highly unpredictable. These maneuvers can include rapid accelerations, sharp turns, and sudden changes in altitude, making drones highly agile and difficult to track or intercept using traditional methods such as patrol officers, radars, or passive surveillance systems like CCTV. These capabilities are further enhanced by drones’ ability to be programmed for complex waypoint-based missions and autonomous operations without direct human intervention. In addition, the ability of drones to operate autonomously complicates detection strategies because they can make real-time decisions based on the data they collect, alter their flight paths, and devise new tactics to evade detection and countermeasures. This presents unique challenges for legacy defense systems that rely on predictable flight dynamics, controlled waypoint execution, and human-in-the-loop oversight to detect and track an intruder in the airspace. In recent years, many drone applications have emerged, from the delivery of medical supplies in remote locations to security applications such as border patrol and surveillance [1]. In addition, drones have proven their utility and effectiveness in search and rescue operations during natural disasters [2], highlighting their cost effectiveness and adaptability as an emerging technology.
Due to the diverse capabilities inherent in drones, they are increasingly being used for illicit activities, such as unauthorized surveillance [3], intrusion into secure spaces, and covert transport of contraband and weapons. Current research addressing the application of pursuit-evasion techniques for the timely detection and mitigation of intrusions caused by these drones is limited [4]. The complexity of the issue is increased because intruder drones often evade traditional radar systems due to their varied miniature designs and configurations and their ability to fly at high speeds at low altitudes with minimal acoustic emissions. This requires rethinking traditional defense systems and developing new techniques and strategies capable of countering the unique challenges posed by drones.
In Ref. [5], we investigated the scenario of chasing intruder drones using a monocular camera, a deep reinforcement learning (RL) model, and an onboard computation unit to ensure security and a timely response to threats posed by drones to a secured area. Extending that work, this paper deals with a comprehensive reward formulation, scaling up the simulation for new chasing strategies, implementing novel training environments, evaluating the policy in dynamic environments similar to real-world settings, and comparisons with a proportional-integral-derivative (PID)-based controller. In highly dynamic pursuit-evasion environments characterized by rapidly evolving intruder positions, conventional drone control methods prove insufficient. Hence, a learning-based framework presents a promising approach for controlling chaser drones and intercepting unpredictable target drones. Specifically, RL offers an adaptive technique to derive optimal policies in non-stationary and dynamic environments. Unlike hard-coded logical programs or simple control flow scripts, RL systems enable agents to optimize behavior through iterative interactions with a simulated environment. This requires real-time sensing and trajectory modeling to estimate intruder motions continuously. To achieve the task of tracking and following an intruder drone, our approach involves rapid processing of the chaser drone’s incoming visuals, continuous estimation of subsequent movement actions, a high-level controller for the drone’s movements, and extensive testing to further fine-tune the learned control policy. The extended contributions of this work are as follows:
The reward function optimizes the chaser drone’s alignment and speed during pursuit.
A penalty is added when the chaser gets too close to the intruder to ensure safe dynamics.
Multiple dynamic trajectories are used to train a robust chaser drone policy.
Extensive experiments are conducted in varied environmental settings to evaluate the proposed model, along with comparisons with a PID-based controller on the same task.
2 Related Work
Various technological approaches have been proposed to track an intruder drone using the camera feed from a flying UAV. References [6] and [7] have explored deep learning for detecting and tracking cooperative and non-cooperative UAVs using visual cameras. In Ref. [8], authors have developed an autonomous long-range drone detection system with a high accuracy of 95.5% at 250 m. In Ref. [9], the authors present a novel approach utilizing a stereo camera system to detect, track, and intercept a faster UAV by reconstructing the intruder’s trajectory. This technique is suitable for medium-range detection of drones and can be used to identify the trajectory the intruder follows. Furthermore, Ref. [10] proposed an approach to detect flying objects such as UAVs and aircraft when they occupy a small portion of the field of view (FOV), possibly moving against complex backgrounds and filmed with a moving camera. In the context of tracking moving targets with UAV-borne radars, Ref. [11] presented improved Kalman filter variants for UAV tracking with radar motion models, which could provide insight into integrating radar-based tracking with visual feeds for enhanced intruder drone chases. In Ref. [12], authors have proposed using a grid to partition the environment into 25 distinct states and assign Q values. Although this approach may be viable in basic, indoor, and familiar settings, its effectiveness is limited when applied to real-world scenarios. The study specifically focuses on a relatively small state space, which is inadequate for most practical applications. In Ref. [13], authors have presented a solution for tracking drones using an action–decision network approach. It helps determine the optimal placement of the boundary box for subsequent steps. Although the method outlined in the paper accounts for the likely location of the target in the FOV and the direction of the next search, it does not address the need for real-time decisions to continuously pursue the intruder. In Ref. [14], the authors have proposed a system capable of identifying objects in the sky. It can distinguish between birds, clouds, and drones. It can detect false positives and provide accurate detection of drones. In Ref. [15], the author has published a data set of 500 video pairs, along with around 580k manually annotated boundary boxes. This data set is used to benchmark various drone detection and tracking methods. The paper uses a dual-flow semantic consistency method for drone tracking. In Ref. [16], authors have proposed a method for real-time agricultural surveillance using drones to detect, classify, and track objects while comparing various models such as YOLOv7, SSD, Mask R-CNN, and Faster R-CNN for object detection. In Ref. [17], the authors introduced a novel framework to improve drone navigation. This system dynamically adjusts task execution locations, input resolution, and image compression ratios to achieve low inference latency, high prediction accuracy, and extended flight distances. As previously discussed, while there are various techniques for detecting and localizing drones through visual data, there remains a notable gap in the research surrounding continuous interception and responsive counter-attack of intruder drones via reinforcement learning-based controls. Our proposed approach addresses this issue and establishes a reliable solution.
3 System Description
Eq. (1) is defined in terms of the central coordinates of the chaser’s FOV, the coordinates of the intruder within the chaser’s FOV, and the width and height of the chaser’s FOV in pixel units. The size of the intruder in the chaser’s FOV provides localization and a noisy depth estimate of the separation between the chaser and the intruder, as represented in Eq. (2). The distance between the chaser and the intruder is shown in Fig. 1.
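As an illustration of the quantities referenced around Eqs. (1) and (2), the following sketch computes a normalized offset of the intruder’s boundary box center from the center of the chaser’s FOV and a box-size ratio used as a noisy depth proxy. The function and variable names are hypothetical, and the exact formulas in Eqs. (1) and (2) may differ.

```python
# Hypothetical sketch of the FOV-based alignment and size quantities;
# names and exact expressions are illustrative, not the paper's equations.

def alignment_error(bbox, fov_w, fov_h):
    """Normalized offset of the intruder's boundary-box center from the
    center of the chaser's FOV (all quantities in pixel units)."""
    x_min, y_min, x_max, y_max = bbox
    bx = (x_min + x_max) / 2.0          # boundary-box center
    by = (y_min + y_max) / 2.0
    cx, cy = fov_w / 2.0, fov_h / 2.0   # FOV center
    return (bx - cx) / fov_w, (by - cy) / fov_h

def size_ratio(bbox, fov_w, fov_h):
    """Fraction of the FOV occupied by the intruder's boundary box,
    used as a noisy proxy for the chaser-intruder distance."""
    x_min, y_min, x_max, y_max = bbox
    return ((x_max - x_min) * (y_max - y_min)) / (fov_w * fov_h)
```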
4 Proposed Methodology
This work proposes an RL-based framework for autonomous tracking and chasing of an intruder drone by a chaser drone. An end-to-end pipeline is designed for this work, which captures images from the chaser drone’s camera via the robot operating system (ROS), runs a computer vision framework to detect the intruder in the chaser’s FOV using YOLOv5, feeds the required state information into the RL framework, and translates the output into appropriate high-level control signals for the chaser, which are then fed into the quadcopter running Ardupilot. YOLOv5 outputs a boundary box of the intruder, which is then preprocessed and fed into the RL-based framework. The source code for the implementation and the annotated data set used to train YOLOv5 are available publicly. The next section defines the computer vision-based localization module, as used in the framework.
4.1 Intruder Localization.
The chaser drone captures raw frames, which are subscribed to via a ROS topic. For detecting the intruder in these frames, the you only look once (YOLOv5) object detection framework is employed. It takes a raw image as input, preprocesses it, and provides the boundary box coordinates of the detected intruder. The localization module is trained by manually annotating 8000 images from real-world images and frames captured from the Gazebo simulation. During the creation of the dataset, various orientations and heights of the chaser drone are used along with multiple weather conditions and backgrounds. The dataset is expanded to 24,000 images using various transformation techniques, such as rotation, flipping, occlusion, and cropping. The resultant model can localize drones in 98.5% of the frames. The localization module can also be seamlessly integrated into real-world drone tracking applications. The localization module can be swapped with other frameworks, such as YOLOv4, SSDs, or R-CNN, without significant changes to the overall system architecture. In Ref. [18], the authors have compared YOLOv5 with YOLOv4 and YOLOv3 to detect landing zones for UAVs. The results showed improvements in accuracy, precision, and recall when using YOLOv5.
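A minimal sketch of how such a localization node could be wired up with ROS and an off-the-shelf YOLOv5 model is shown below. The topic names, the torch.hub loading path, and the message type used for the boundary box are assumptions for illustration, not the exact implementation used here.

```python
# Hedged sketch: subscribe to the chaser's camera topic, run YOLOv5 on each
# frame, and publish the highest-confidence boundary box. Topic names and the
# published message type are illustrative assumptions.
import rospy
import torch
from cv_bridge import CvBridge
from sensor_msgs.msg import Image
from std_msgs.msg import Float32MultiArray

rospy.init_node("intruder_localization")
bridge = CvBridge()
model = torch.hub.load("ultralytics/yolov5", "yolov5s")  # custom drone weights would be loaded here
bbox_pub = rospy.Publisher("/intruder/bbox", Float32MultiArray, queue_size=1)

def on_frame(msg):
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
    detections = model(frame).xyxy[0]      # rows: [x_min, y_min, x_max, y_max, conf, class]
    if len(detections) == 0:
        return                             # intruder not visible in this frame
    best = detections[detections[:, 4].argmax()]
    bbox_pub.publish(Float32MultiArray(data=best[:4].tolist()))

rospy.Subscriber("/chaser/camera/image_raw", Image, on_frame, queue_size=1)
rospy.spin()
```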
4.2 Markov Decision Process.
Typically, an RL problem is framed as a Markov decision process (MDP), with the agent having full access to the environment’s state. However, in this problem, the environment is partially observable; therefore, a limited amount of historical information is encoded in the state to make the system Markovian. An MDP is a tuple (S, A, P, R), where S denotes the set of environment states, A refers to the agent’s permissible actions, P represents the environment’s transition probabilities, and R is the reward function for the agent’s actions. However, in this particular problem, the environment’s transition probabilities are not known. Consequently, the drone must learn a policy by repeatedly interacting with and sampling the environment.
States: In the framework, the chaser has access to the images captured by its mounted camera and to its own velocity. The camera image is communicated using ROS to the intruder localization module, which processes it using YOLOv5. This network localizes the intruder in the frame and returns the boundary box coordinates. These pixel coordinates are then used as part of the state space of the chaser. The overall observation of the chaser has a total of eight components: the four boundary box coordinates of the intruder, the chaser’s velocities along the three axes, and the chaser’s current yaw orientation. Furthermore, we keep five such previous observation tuples to form a single state of the environment at any point in time.
Actions: The action space of the chaser drone is a four-tuple, where the first three values represent velocities in the forward, lateral, and vertical directions, and the fourth represents the yaw of the drone. All components of the action tuple are continuous, and each component is clipped to a fixed range.
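The following sketch illustrates how the state and action described above could be assembled in code. The ordering of the eight observation components and the symmetric clip bound on each action component are assumptions for illustration; the actual ranges are defined in the implementation.

```python
# Illustrative state/action construction; component ordering and the clip
# bound are assumptions, not values taken from the paper.
from collections import deque
import numpy as np

HISTORY = 5          # five most recent observation tuples form one state
OBS_DIM = 8          # assumed: 4 boundary-box coords + 3 velocities + yaw
ACTION_CLIP = 1.0    # assumed symmetric bound on each action component

history = deque(maxlen=HISTORY)

def make_observation(bbox, velocity, yaw):
    """Single 8-component observation: boundary box, velocities, yaw."""
    return np.array([*bbox, *velocity, yaw], dtype=np.float32)

def make_state(obs):
    """Stack the five most recent observations into one (approximately
    Markovian) state vector of length HISTORY * OBS_DIM."""
    history.append(obs)
    while len(history) < HISTORY:      # pad with the first observation at episode start
        history.append(obs)
    return np.concatenate(history)

def clip_action(raw_action):
    """Clip the continuous 4-tuple (vx, vy, vz, yaw) to the allowed range."""
    return np.clip(raw_action, -ACTION_CLIP, ACTION_CLIP)
```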
4.3 Learning the Tracking Policy.
As we are dealing with a continuous action space rather than a discrete one, policy gradient methods fit better in this setting, particularly deep deterministic policy gradients (DDPG) [19]. DDPG consists of two deep neural networks, the actor and the critic network, as shown in Fig. 2. The actor’s policy network is a function of the state that outputs an action for a given state; these actions are executed by the chaser in the environment, while the critic network is used to evaluate the viability of the actions generated by the actor. DDPG is a suitable framework for our objective of controlling the chaser continuously according to the visual feed, generating continuous actions along the three axes as well as controlling the orientation of the chaser. A replay buffer collects and stores previous samples from the environment in the form of tuples (s, a, r, s'), where s is the current state of the drone, a is the action taken in s, r is the reward observed after taking a in s, and s' is the next state that the drone lands in. The replay buffer addresses the issue of sample inefficiency and makes updates more productive.
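A compact PyTorch sketch of the actor, critic, and replay buffer is given below. The layer sizes are illustrative; only the buffer capacity and mini-batch size mirror the hyperparameters listed in Table 1.

```python
# Illustrative DDPG building blocks; network widths are assumptions.
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a bounded continuous action (forward, lateral, vertical, yaw)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # tanh bounds each action component
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Estimates Q(s, a) for the action proposed by the actor."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class ReplayBuffer:
    """Stores (s, a, r, s') samples and serves random mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=128):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next = (np.array(x) for x in zip(*batch))
        as_t = lambda x: torch.as_tensor(x, dtype=torch.float32)
        return as_t(s), as_t(a), as_t(r), as_t(s_next)
```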

The proposed DDPG-based model consists of actor, critic, and target networks for learning the control policy for the chaser drone
Detailed instructions and the environment setup scripts are available publicly.
5 Experimental Setup
To train and assess the proposed approach, we implemented the system for chaser and intruder drones using Gazebo and ROS. Gazebo, a 3D simulator with a robust physics engine, allows realistic simulation of various scenarios and interactions, incorporating sensors such as cameras, LiDARs, and GPS. ROS, a widely used open-source middleware for robotic functionalities, adopts a subscriber–publisher model with libraries and tools that facilitate communication among different modules within a robotic system. It promotes the development of reusable code in a standardized API format, enhancing the construction, modification, and interaction with robots.
Ardupilot, an open-source flight controller, was employed to control the drone. Its functionalities include GPS navigation, waypoint movement, return to launch, hovering, and an inertial measurement unit model. In our implementation, Gazebo serves as the 3D simulation platform, ROS facilitates communication between the chaser drone and the Gazebo environment, and Ardupilot guides the flight maneuvers of the chaser drone based on learned control from the DDPG model. Figure 3 shows the overall system architecture. ROS is central to the implementation, providing middleware support through its publisher–subscriber framework. Various ROS topics, as shown in Fig. 3, perform specific functions such as drone image capture, drone detection, training, and translation of actions for the drone.
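As an example of the action-translation step, the sketch below forwards a policy action as a velocity setpoint over MAVROS toward the Ardupilot-backed chaser. The topic name follows common MAVROS conventions and may differ from the ROS topics actually used in this system.

```python
# Hedged sketch of translating a DDPG action into a high-level velocity
# command for the chaser; the MAVROS topic name is an assumption.
import rospy
from geometry_msgs.msg import Twist

rospy.init_node("chaser_velocity_bridge")
cmd_pub = rospy.Publisher("/mavros/setpoint_velocity/cmd_vel_unstamped",
                          Twist, queue_size=1)

def send_action(vx, vy, vz, yaw_rate):
    """Forward the four-component action tuple as a velocity setpoint."""
    cmd = Twist()
    cmd.linear.x, cmd.linear.y, cmd.linear.z = vx, vy, vz
    cmd.angular.z = yaw_rate
    cmd_pub.publish(cmd)
```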

Comprehensive system architecture for training and evaluating the RL-based chaser drone policy using Gazebo, ROS, YOLOv5, DDPG, and Ardupilot
5.1 Training Simulation.
The comprehensive training of the DDPG model is conducted on a Dell server equipped with an Intel Xeon processor, an NVIDIA Quadro RTX A4000 8 GB graphics card, and 64 GB RAM. The Gazebo simulation runs on an ASUS system featuring an AMD 5800H processor, 16 GB RAM, and an NVIDIA RTX 3060 6 GB graphics card. Communication between the DDPG training and the Gazebo simulation uses Python API calls through the Flask framework. This distributed implementation streamlines the processes and establishes a controller–responder architecture, facilitating scalability to more clusters when necessary; a sketch of this Flask-based link is shown after Table 1.

The intruder localization module is tested under various environmental conditions: broad daylight, night, fog, and mist. The module had difficulty detecting intruders in foggy conditions but performed well in the other three. It is further tested in various environments, including rural versus urban settings and hilly areas.

To deploy the solution in the real world, a UAV with an onboard camera and a computation unit is required. The computation unit is responsible for intruder localization and policy execution. The updated velocity commands are sent to the Ardupilot firmware, which in turn is responsible for the movement of the UAV. Sim2Real transfer is challenging, but by designing the state space to be independent of environmental variations, fine-tuning can be minimized. A velocity clipping function, which clips the model output to a bounded range (in m/s), ensures that no abrupt change in the UAV’s movement arises during real-world testing. Safety considerations for real-world deployment are part of our other ongoing research.

The specific hyperparameter values are detailed in Table 1, which includes fine-tuning parameters for the deep RL framework and the upper and lower limits for the reward structure.
Hyper-parameters for DDPG model training

Hyper-parameter | Value |
---|---|
Discount factor (γ) | 0.99 |
Mini-batch size | 128 |
Actor learning rate | 0.001 |
Critic learning rate | 0.001 |
Replay buffer size | 100,000 |
Target update parameter (τ) | 0.001 |
Range ( ) | |
Range ( ) | |
Range ( ) | |
Time-step for penalty | 50 |
Reward function | |
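The sketch below illustrates the Flask-based controller–responder link referenced in Sec. 5.1. The endpoint names, payload fields, and simulator hooks are assumptions for illustration rather than the project’s actual API; the stubbed functions stand in for the ROS/Gazebo calls on the simulation machine.

```python
# Illustrative Flask bridge between the DDPG training machine (controller)
# and the Gazebo machine (responder); endpoints and payloads are assumptions.
from flask import Flask, jsonify, request

app = Flask(__name__)

def simulator_step(action):
    # In practice this would publish the action over ROS, advance the Gazebo
    # simulation, and read back the next state, reward, and done flag.
    return [0.0] * 40, 0.0, False

def simulator_reset():
    # In practice this would reset the Gazebo episode and return the initial state.
    return [0.0] * 40

@app.route("/step", methods=["POST"])
def step():
    action = request.get_json()["action"]          # [vx, vy, vz, yaw]
    next_state, reward, done = simulator_step(action)
    return jsonify(state=next_state, reward=reward, done=done)

@app.route("/reset", methods=["POST"])
def reset():
    return jsonify(state=simulator_reset())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```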
5.2 Deep Network Architectures.
DDPG utilizes two principal neural network architectures: the actor network and the critic network. The actor network directly maps states to actions and outputs the best-learned action for any given state, aiming to maximize the policy’s performance. The critic network evaluates the action output by the actor by computing the action-value function, which estimates the quality of the action taken from a particular state. Both networks update their weights to better predict and evaluate actions. The key hyperparameters in DDPG include separate learning rates for the actor and critic networks, which determine how quickly the networks adjust during training. The discount factor γ, usually set between 0.9 and 0.99, balances immediate and future rewards. The size of the replay buffer influences the range of experiences available for learning, while the batch size dictates the number of experiences sampled for each network update. The soft target update parameter τ, usually around 0.001, controls the rate at which the target networks are updated. Finally, the noise process, defined by its mean μ, mean-reversion rate θ, and volatility σ, governs the exploration behavior using the Ornstein–Uhlenbeck (OU) process.
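A minimal sketch of the OU exploration noise described above follows. The θ and σ values shown are common defaults and are not claimed to be the ones used in this work.

```python
# Sketch of Ornstein-Uhlenbeck exploration noise; parameter values shown are
# common defaults, not necessarily those used in this paper.
import numpy as np

class OUNoise:
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.full(action_dim, mu, dtype=np.float64)

    def reset(self):
        self.state[:] = self.mu

    def sample(self):
        # Mean-reverting, temporally correlated noise added to the actor's
        # action during training: dx = theta * (mu - x) + sigma * N(0, 1).
        dx = self.theta * (self.mu - self.state) \
             + self.sigma * np.random.randn(len(self.state))
        self.state += dx
        return self.state.copy()
```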
5.3 Testing Environments.
In this section, we discuss the various test environments, ranging from low-speed straight-line trajectories to very high-speed circular trajectories of the intruder drone, used to stress test and evaluate the proposed chaser drone model. During the training phase, multiple intruder trajectories must be included to make the training more robust and dynamic. If random starting locations of the intruder are not used along with random intruder trajectories, the framework may learn a suboptimal policy that does not capture the typical evading trajectories the intruder may take during real-world deployment. A sketch of how such intruder waypoints can be generated is given after the list of scenarios below.
Straight Path: In this scenario, the intruder drone navigates primarily along a straight path with slight turns in between. The speed of the intruder is fixed at 5 m/s, and it is equipped with a mechanism for dynamic evasion tactics whenever the chaser approaches close to it.
Zig-Zag Path: In this scenario, the intruder moves with a velocity of 5 m/s along a zig-zag path. This path introduces rapid changes in direction, challenging the chaser drone to adapt quickly to the unpredictable movements of the intruder.
Circular Path: In this scenario, the intruder follows a smooth, continuous circular path, which introduces a new challenge in chasing strategy. Here, the trajectory is described by significant curvature in a smooth manner, unlike scenarios with abrupt turns or zig-zag patterns.
Sinusoidal Trajectory: In this scenario, the intruder follows a trajectory similar to a sine wave; it oscillates three times per minute and has an amplitude of 50 m.
Random Trajectory: In this scenario, the intruder attempts a highly unpredictable trajectory with a lot of turns and sudden speed changes, which introduces a high uncertainty in movements. This scenario is designed to simulate situations where the intruder may employ very erratic and unpredictable movements to evade.
High-speed: In this scenario, the chaser policy is evaluated against an intruder moving at a steady speed of 10 m/s, testing cases where high speed can be used to evade or cover large areas rapidly. The chaser’s ability to adapt to this increased speed is crucial for maintaining an effective pursuit.
Varying Speed: In this scenario, the intruder’s speed is varied pseudorandomly over a range of speeds (in m/s) so that it can drift apart from the chaser. This scenario demands extensive use of the track component of the total reward to follow the intruder properly.
Occlusions: In this scenario, while the intruder is being followed, it gets occluded behind buildings and is not visible in the chaser’s FOV. This situation poses a new challenge in which the chaser must make extensive use of previously available information to estimate the approximate trajectory of the intruder and continue tracking until the intruder is visible again.
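The sketch referenced before the list shows one way the intruder waypoints for the zig-zag, circular, and sinusoidal scenarios could be generated. Apart from the 50 m amplitude and three oscillations per minute stated above, the parameters (radius, leg length, sampling interval) are illustrative assumptions.

```python
# Hedged sketch of intruder waypoint generation for the test scenarios;
# only the sinusoidal amplitude/frequency come from the text.
import numpy as np

def sinusoidal_path(duration_s, speed=5.0, amplitude=50.0, cycles_per_min=3.0, dt=1.0):
    """Forward motion along x with a lateral sine oscillation
    (50 m amplitude, three oscillations per minute)."""
    t = np.arange(0.0, duration_s, dt)
    x = speed * t
    y = amplitude * np.sin(2.0 * np.pi * (cycles_per_min / 60.0) * t)
    return np.stack([x, y], axis=1)

def zigzag_path(n_legs, leg_length=50.0, speed=5.0, dt=1.0):
    """Straight legs with alternating +/-45 deg headings."""
    points, pos, heading = [], np.zeros(2), np.pi / 4.0
    steps_per_leg = int(leg_length / (speed * dt))
    for _ in range(n_legs):
        direction = np.array([np.cos(heading), np.sin(heading)])
        for _ in range(steps_per_leg):
            pos = pos + speed * dt * direction
            points.append(pos.copy())
        heading = -heading                      # alternate the turn direction
    return np.array(points)

def circular_path(duration_s, radius=100.0, speed=5.0, dt=1.0):
    """Constant-curvature loop at a fixed tangential speed."""
    t = np.arange(0.0, duration_s, dt)
    angle = (speed / radius) * t
    return np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
```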
5.4 Performance Metrics.
During training, we tracked the progress of our DDPG model using several metrics, although it is challenging to accurately evaluate the output policy. We describe the metrics that helped us monitor the training process; a small computation sketch for two of the evaluation metrics follows the list:
Total reward: The sum of rewards over time reflects policy performance. Alone, it does not guarantee the robustness of policy.
Critic loss: This metric indicates information on convergence during training.
Absolute value error: Discrepancy between the actual return and the predicted Q-value. It assesses the agent’s understanding of the environment.
Average mean trajectory error: Measures the alignment between chaser and intruder UAVs during evaluation.
Episode length: Duration of chaser tracking intruder. It reflects policy improvement.
Average mean chase distance error: Measures proximity without collision during the chase for successful pursuit.
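The following sketch gives one plausible computation of the average mean trajectory error and the average mean chase distance error listed above. The exact definitions used in the experiments may differ, so treat this as an assumption-laden illustration.

```python
# Illustrative metric computations over an evaluation episode; the precise
# definitions in the paper may differ from these assumptions.
import numpy as np

def average_mean_trajectory_error(chaser_xyz, intruder_xyz):
    """Mean distance from each chaser position to the nearest point on the
    intruder's path (arrays of shape [T, 3]): how closely the chaser
    reproduces the intruder's route."""
    d = np.linalg.norm(chaser_xyz[:, None, :] - intruder_xyz[None, :, :], axis=2)
    return float(np.mean(d.min(axis=1)))

def average_mean_chase_distance_error(chaser_xyz, intruder_xyz):
    """Mean instantaneous chaser-intruder separation over the episode,
    used here as a proxy for proximity during the chase."""
    return float(np.mean(np.linalg.norm(chaser_xyz - intruder_xyz, axis=1)))
```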
Next, we present the results gathered during the training and further evaluation of the trained policy of the chaser drone on various test scenarios.
6 Results
In this section, we present the results of the performance of our proposed approach on the various metrics described in Sec. 5.4. During the performance evaluation of the chaser drone, the policy is not updated, and only the trained weights are used to generate control actions for the chaser drone. Multiple evaluation episodes are executed with random starting locations of the chaser and the intruder to evaluate the effectiveness of the learned control policy.
6.1 Policy Improvement During Training.
The training is performed in Gazebo and ROS, where the DDPG-based model trains the chaser control policy. The simulation runs for 5000 episodes until convergence of the total episodic return. Figure 4(a) shows the total reward collected in every episode during training. From the plot, it can be seen that as more episodes are completed, the total reward improves, which correlates with episode length. During the training phase, there is a provision for exploration to find alternative ways of tracking the intruder. A persistent exploration parameter, introduced by the OU noise, ensures that the chaser continues to experiment with various strategies for tracking the intruder. The gradual rise to higher values in the reward graph indicates that the chaser drone has acquired the ability to effectively track the intruder, leading to a higher accumulation of rewards in later episodes.
Figure 4(b) depicts the critic’s loss from the DDPG model. This graph essentially shows the performance of the critic model that learns the value function of an action. As noted, as the number of episodes increases, the critic loss decreases and reaches a very low value. This shows that the critic model can better estimate the actions; hence, the DDPG can learn a good control policy for the chaser drone.
Regarding the influence of the align and track components on the overall reward, initial observations indicate that, in the early stages, the contribution of the track component to the total reward is minimal. However, as the episodes unfold, there is a continuous improvement in the track component as the chaser becomes more adept at following the intruder, thereby leading to an overall increase in the total reward. Initially, there are instances where the chaser prioritizes the maximization of one component at the expense of the other. However, with the progression of episodes, there is a notable improvement in the track component, accompanied by a slight reduction in the align component. The total reward plot indicates that the chaser aligns properly with the intruder. In later episodes, the focus shifts to minimizing the distance between the chaser and the intruder for consistent tracking. This underscores the effectiveness of the proposed approach in learning a control policy for the pursuit of an intruder drone. The DDPG model demonstrates convergence toward a more refined tracking and following policy, emphasizing the suitability of the proposed reward function for the given task.
6.2 Policy Performance During Testing.
To test the policy learned in the previous section, we executed more than 2500 episodes of test runs in which the policy parameters were not updated in the DDPG model. Each test episode involves a mixture of trajectories from the testing environments described in Sec. 5.3.
Figure 5(a) shows the average reward per episode received by the chaser drone while tracking the intruder over a 2500-episode test run. From the graph, it can be observed that the chaser displays consistent performance, wherein the variations in per-episode total reward stay within a fixed range. Figure 5(b) shows the absolute value error for the chaser during the test episodes. The absolute value error also displays consistent performance across the 2500 episodes. These plots indicate the stable performance of the chaser drone in the identification and tracking of the intruder.

Total reward per episode during evaluation, when the trained policy was used without any updates and the absolute value error during this process
Figure 6 shows the visualization of the chaser drone’s FOV for a single test episode in which the intruder is moving in a zig-zag path. Beneath each FOV snapshot of the chaser drone, the track reward and align reward values are given, along with the action vector produced by the chaser drone policy. The trajectory locations from which the chaser drone’s FOV snapshots are taken are shown in the graph below, numbered 1–8. As can be observed, the size of the intruder as visible in the chaser’s FOV greatly affects the track reward, while the distance of the intruder from the center of the FOV affects the align reward.

The chaser’s view and the trajectory followed. The top part shows the FOV view, and the bottom part shows the trajectories of the chaser and the intruder. The X- and Y-axes represent distance in meters.
6.3 Testing Endurance.
We also ran an extended endurance test for the chaser drone, in which a single episode ran continuously for 4 h (30,000 time-steps) in our ROS and Gazebo simulation. Figure 7 shows the performance over the 30,000 time-steps of this single episode of the varying-speed test environment. The graphs show the total reward per step (in the middle), the align reward (at the top), and the track reward (at the bottom). As can be observed, the chaser drone performed consistently during the long endurance test until the end of the 30,000 steps. The align reward also shows consistent performance, while the track reward fluctuates, reaching a maximum at around 12,500 steps and then coming down. This is expected, as the align reward keeps the intruder close to the center of the chaser’s FOV, whereas the track reward focuses on following the intruder and is affected by the intruder’s speed and its sudden turns or changes in trajectory. The proposed DDPG-based learned policy maintains a fine trade-off between the align and track rewards such that the intruder always remains in the chaser’s FOV. This result also shows that the proposed approach can nullify any compounding of errors in a long-running chase of the intruder. With commercial drones typically offering 30–60 min of flight time, the learned chaser policy can run continuously until the intruder runs out of battery power. We further noted that the average mean trajectory error and average mean distance error are 37.5 m and 57.2 m, respectively, for the long-endurance episode. These errors are well within the maximum error ranges of 50 m and 75 m, respectively.

Reward spread along with its sub-components for long endurance test during evaluation and average mean trajectory and distance errors when the episode ran continuously for 4 h in Gazebo
6.4 Performance in Different Testing Environments.
We test the chaser drone policy on various testing environments as described in Sec. 5.3 using 4000 test episodes.
The graph in Fig. 8(a) depicts the average mean trajectory error for the different trajectories adopted by the intruder. As can be observed, the average mean trajectory error is the least for the zig-zag, straight path, and high-speed scenarios. This observation parallels common intuition, as these scenarios require fewer adjustments to keep the intruder in the center of the chaser’s FOV. The average mean trajectory error is greater in the sinusoidal and circular cases, as these scenarios require constant realignment of the chaser so that the intruder can be kept close to the center of the FOV. In the random trajectory and the trajectory with occlusions, the average mean trajectory error is the highest (20 m), mainly because the intruder is not in the FOV for many time-steps. In the variable-speed scenario, the error is greater because, once a chase is in progress, a sudden change in speed requires realignment and causes the intruder to drift from the center of the FOV. However, all trajectory errors remain below half of the maximum trajectory error limit of 50 m. This shows the robustness of the chaser drone policy in tracking and following the intruder drone in varying circumstances.
Figure 8(b) depicts the average mean chase distance error during the test runs in the various environments. As observed, the distance is the least for the straight path scenario, as the chaser can focus on decreasing the distance while few realignment decisions need to be made. In the zig-zag path scenario, for both 5 m/s and 10 m/s speeds, the error is 30 m on average, which indicates the chaser is also required to make some realignment decisions. In the variable-speed scenario, the path remains mostly the same; only the speed is adjusted, leading to quick adjustments of the chaser’s velocity. For the circular and sinusoidal trajectories, the reported values are 37.673 m and 35.847 m, respectively. In these scenarios, the chaser UAV must constantly adjust its speed and orientation to keep up with the intruder UAV, leading to a slight increase in the average distance. For random trajectories, the reported value is 40.583 m, which is higher than in the scenarios discussed above. This is due to the large amount of orientation adjustment required to keep track of the intruder UAV. In the case of occlusions, the highest error is reported, because occlusion forces the chaser UAV to perform multiple trajectory adjustments to relocate the intruder UAV and resume the chase. With a typical UAV visual range of 100–150 m, these errors are well below the requirement and lead neither to a sudden loss of sight of the intruder UAV nor to decision-making time limitations.
The graph in Fig. 9 shows the total reward collected per episode by the chaser while chasing the intruder on the various trajectories described in Sec. 5.3. As can be seen in the graph, the chaser consistently performs well for straight and zig-zag paths. In the evaluation phase, the chaser can track circular trajectories due to the inclusion of the history of the five most recent frames in the state space. This helps identify the general direction of the intruder’s movement based on historical data and track it effectively. The total reward accumulation is the lowest in the case of the circular and sinusoidal trajectories of the intruder.

Total reward per episode during evaluation episodes of the chaser drone in various testing environments
6.5 Comparison With a Proportional-Integral-Derivative Controller.
In this section, we compare the performance of the proposed RL-based chaser drone policy against a PID-based chaser drone controller. The proposed RL-based method can handle complex, non-linear systems and real-world scenarios where states evolve over time. In contrast, the PID controller calculates an “error,” which is the difference between the measured state and the desired set-point. This method is widely used in industrial control systems where the relationship between the input and the output is known and relatively stable. To perform the comparison, both controllers are evaluated on a test trajectory of the intruder, which consists of various portions, including straight, zig-zag, circular, sinusoidal, and random paths. Figure 10 shows the total reward per time-step during the evaluation of our proposed DDPG-based chaser policy and the PID-based controller. The chase starts with a straight path for the first 2000 time-steps, followed by a zig-zag path with a speed of 5 m/s until 6000 time-steps. During this initial stretch, the chaser can track the intruder under both DDPG and PID control; however, the total reward generated for the two differs by an average of 30 reward points. For the next 4000 steps, the intruder moves in a sinusoidal path until 10,000 time-steps, and then again in a zig-zag path with a speed of 10 m/s until 12,000 time-steps. In this part of the evaluation episode, the gap between the DDPG policy’s and the PID controller’s rewards decreases, mainly due to the sudden maneuvers required to keep up with the intruder. After that, the intruder moves in a straight path with varied velocity up to 14,000 time-steps and, lastly, in a pseudo-random path up to 17,500 time-steps. In this part of the evaluation track, the PID controller cannot keep up with the intruder; this is mainly because once the intruder is lost from the chaser’s FOV, the PID controller is not able to start tracking it again. In the last phase of the evaluation track, the DDPG-controlled chaser is able to chase the intruder consistently until the end of the evaluation. Overall, one can note that the DDPG controller maintains a higher reward than the PID controller and can chase the intruder for a longer duration over varied trajectories.
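To make the baseline concrete, the sketch below shows the kind of PID controller described above, with one loop per axis driven by the offset of the intruder’s boundary box from the FOV center. The gains and the use of box size as the forward set-point are assumptions for illustration, not the tuned values used in the comparison.

```python
# Illustrative PID baseline: per-axis loops on the FOV set-point error.
# Gains and the target box-size ratio are assumptions.
class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_error = 0.0, 0.0

    def update(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

yaw_pid = PID(kp=0.8, ki=0.01, kd=0.1)   # horizontal FOV offset -> yaw rate
alt_pid = PID(kp=0.8, ki=0.01, kd=0.1)   # vertical FOV offset -> climb rate
fwd_pid = PID(kp=1.0, ki=0.0, kd=0.2)    # box-size error -> forward speed

def pid_action(bbox_center, box_area, fov_w, fov_h,
               target_area_ratio=0.05, dt=0.033):
    """Return (forward, lateral, vertical, yaw) velocity commands from the
    intruder's boundary-box position and size in the chaser's FOV."""
    ex = (bbox_center[0] - fov_w / 2.0) / fov_w           # horizontal error
    ey = (fov_h / 2.0 - bbox_center[1]) / fov_h           # vertical error
    ea = target_area_ratio - box_area / (fov_w * fov_h)   # box too small -> move forward
    return fwd_pid.update(ea, dt), 0.0, alt_pid.update(ey, dt), yaw_pid.update(ex, dt)
```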

Comparison of the learned policy versus the PID controller during evaluation for variable trajectories. The graph shows the total reward accumulated over time, highlighting the effectiveness of the RL-based policy in maintaining the intruder within the FOV and adapting to sudden changes in trajectory and speed.
6.6 Total Tracking Time for Comparison With Proportional-Integral-Derivative Controller.
In addition to the reward-based evaluation, a new performance metric, total tracking time (TTT), has been introduced to provide a more comprehensive comparison between the proposed RL-based chaser drone policy and the PID controller. The TTT metric measures the duration for which the intruder drone remains visible in the FOV of the chaser drone across various trajectories. Figure 11 presents the tracking counts of the proposed RL-based and PID-based approaches, highlighting the significant differences in their tracking capabilities. While chasing the intruder, the DDPG-learned policy detects it in 1846 instances versus 1421 for the PID controller on the initial straight-line trajectory. For the zig-zag path, DDPG tracks the intruder for 1879 instances versus 1511 for the PID controller. Similarly, for the sinusoidal and high-speed zig-zag paths, DDPG tracks for 1726 and 1300 instances, respectively, while the PID controller tracks for 1695 and 157 instances. During the final stage of the evaluation trajectory, PID is unable to track the intruder, while DDPG tracks it for 1654 and 1718 instances. From the graph, it can be seen that the DDPG-learned policy maintains consistent tracking for a longer duration.

Comparison of the learned policy versus the PID controller during evaluation based on TTT. When the intruder is detected in the chaser’s FOV, the instance is recorded and shown in the graph.
7 Limitations
This section presents the scenarios in which the chaser may lose track of the intruder. In scenarios where the intruder’s speed is too high compared to that of the chaser, the intruder cannot be tracked for long. The chaser’s camera can handle 30 frames per second, and if the intruder’s movement is fast enough not to be captured by the camera, the intruder can again escape. These cases may be handled with more training and further fine-tuning of the policy. In extreme environmental conditions, the localization module can also provide incorrect location information, leading to the chaser moving in the opposite direction.
8 Conclusions and Future Work
Deep reinforcement learning methods have increasingly performed better in various control tasks. This paper focuses on the problem of tracking an intruder drone with a chaser drone governed by a deep reinforcement learning-based policy. The reward function has been formulated using the FOV of the chaser drone’s camera view. Training and evaluation of the proposed approach are carried out using Gazebo and ROS, along with Ardupilot as the flight controller. The deep reinforcement learning-based chaser drone policy has been evaluated in a range of test environments using multiple evaluation metrics. The learned policy is also compared to a PID controller. The results show that the learned policy is robust to various maneuvers of the intruder drone and can continue tracking the intruder for a longer duration. Performance against the PID controller further validates the adaptability and superior performance of the proposed approach.
As part of future work, the deployment of chaser swarms for intruder pursuit and neutralization with a larger combined FOV can be explored. Furthermore, exploration of hierarchical policies for autonomous takeoff, pursuit, and return to recharging stations for drone swarms is envisioned. This approach seeks to advance intelligent pursuit strategies and effectively enhance restricted airspace protection.
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The data and information that support the findings of this article are freely available online.