Abstract
Modeling and control of soft robotic arms are challenging due to their complex deformation behavior. Kinematic models offer strong interpretability but are limited by low accuracy, while model-free reinforcement learning (RL) methods, though widely applicable, suffer from inefficiency and require extensive training. To address these issues, we propose a residual reinforcement learning (RRL) modeling and control framework incorporating an inverse kinematic model as prior knowledge to enhance RL training efficiency. Despite the kinematic model producing high mean absolute errors (MAEs) ranging from 33.8 mm to 57.4 mm, it significantly accelerates RL training. Using the Proximal Policy Optimization (PPO) algorithm, our method achieves a 90% reduction in training time and decreases MAEs to 4.8 mm–7.6 mm with just 30,000 iterations. This significantly enhances control precision over inverse kinematic methods while improving efficiency compared to conventional RL approaches.
1 Introduction
Soft robots, a subset of industrial robots, are capable of achieving arbitrary spatial positions and postures, making them well-suited for tasks that require flexible grasping and precise manipulation. As intelligent robotics continues to advance, the need for control strategies specifically tailored to soft robots becomes increasingly pronounced [1,2]. Soft robots utilize a variety of actuation methods, including pneumatic actuation [3–5], artificial muscles [6,7], electro-active polymers, and low melting point alloys, with pneumatic actuation being the most prevalent. However, the deformation characteristics of pneumatic soft robots, characterized by nonlinear behavior and inherent hysteresis, present significant challenges for developing robust and reliable control frameworks [8]. Addressing these challenges is essential for the continued progress of soft robotics and its practical applications.
Kinematic modeling of soft robots typically relies on curvature assumption models, such as the constant curvature model [9], the piecewise constant curvature model [10], and the variable curvature model [11]. For instance, Wen et al. [12] applied the constant curvature assumption to design an underwater robotic arm, simplifying the structure for symmetric deformation, which is effective in underwater environments but introduces inaccuracies when used in air. To mitigate these errors, sensor-based compensation methods have been employed [13,14]. On the other hand, Mahl et al. [15,16] adopted a variable curvature model, dividing the soft arm into segments to enhance accuracy. However, both sensor-based and advanced modeling techniques are limited by the inherent accuracy and complexity of the physical model, and they lack robustness in real-world conditions, particularly under variable payloads and self-weight.
After establishing the forward kinematic model, an inversion process can be undertaken to develop the corresponding inverse kinematic model. This process facilitates the mapping from the task space of the soft robotic arm to the configuration space of the drive, which is crucial for accurate control and operation [9]. The inverse kinematics method for soft robots relies on the Jacobian matrix inversion technique. It allows for solving the actuator configuration space of the robot under any initial conditions, such as initial arm position, initial arm velocity, and target point position. By introducing the concept of instantaneous kinematics, the robot can be viewed as undergoing instantaneous changes in a sufficiently short time frame, enabling iterative solutions. In this Jacobian-based inverse kinematics approach, actuator constraints can be incorporated into the robot system, ensuring that the robot's trajectory remains achievable in the real robot system [17]. Yan et al. [18] introduced a medical soft robot designed for minimally invasive surgery inside the human body, utilizing the Jacobian matrix to establish an inverse kinematic model for static control. Additionally, the segmented affine curvature model can also be applied to the inverse kinematic model, enhancing its accuracy and providing a control framework [19,20]. Xie et al. [21] established a kinematic model for underwater soft robotic arms and, by inversely determining the drive space parameters, such as the cavity length, through shape parameters, enabled static control. However, while inverse kinematic models provide strong interpretability, they are often limited by low accuracy in practical control scenarios.
Reinforcement learning (RL) methods enhance the adaptability of systems in complex, dynamic environments [22,23], particularly in highly nonlinear, nonuniform, gravity-influenced, or unstructured settings where accurate modeling is difficult [24]. Wu et al. [25] applied a multilayer perceptron combined with RL to model and control a soft robotic arm within a neural network-based simulation. The Deep Q-Network algorithm was used to optimize control parameters, benefiting from the cable-driven robotic arm's robustness, which enabled repeated iterations without damage. However, pneumatic soft robotic arms are more susceptible to damage from repeated trials, which severely limits the number of allowable training iterations. This constraint necessitates the development of an effective algorithmic framework that can achieve high accuracy within a reduced training period, maximizing performance under such limitations.
Model-based methods offer better sample efficiency but often involve significant workload due to the necessity of building and maintaining accurate environmental models [26]. On the other hand, model-free methods often face the challenge of extended training times. To address these issues, physics-informed reinforcement learning incorporates physical priors into the learning algorithm. The core framework involves observation, induction, and learning [27], enhancing physics-based models' scalability, accuracy, and data efficiency. Additionally, guided reinforcement learning further improves training efficiency by integrating diverse prior knowledge into the learning process [28]. Integrating physical priors into reinforcement learning frameworks bridges the gap between traditional physics-based methods and data-driven approaches. This integration accelerates the learning process and enhances the model's generalization capabilities in complex and dynamic environments, showing great potential for application in future complex robotic systems.
Despite continuous improvements in training accuracy and efficiency through reinforcement learning strategies, minimizing training time remains a major challenge in soft robotic control. Excessive training time not only increases computational costs but also accelerates material degradation due to creep effects, which are particularly problematic for soft materials with nonlinear and viscoelastic properties. Creep-induced deformations can lead to significant alterations in mechanical responses over time, even within short durations. For example, Honji et al. proposed a stochastic viscoelastic model capturing joint displacement drift under constant load, demonstrating that creep effects are observable within minutes of use [29]. Furthermore, Kim et al. emphasized that time-dependent characteristics like creep and drift are fundamental challenges in soft actuator control, often leading to control inconsistency during extended learning sessions [30]. These challenges are especially pronounced during idle periods in extended simulation training, where material relaxation continues to alter actuator properties, even without active actuation. This time-dependent deformation introduces discrepancies between simulated and real-world performance, widening the Sim-to-Real gap and undermining control accuracy. Consequently, reducing training duration is not just computationally efficient but physically necessary to maintain material integrity and ensure reliable real-world performance.
In this study, we propose a novel modeling and control framework as illustrated in Fig. 1. Although inverse kinematic models typically exhibit substantial errors in soft robotics, these errors can be effectively mitigated using the residual reinforcement learning (RRL) [31] framework. RRL optimizes the RL training process by refining the initial control strategy derived from the inverse kinematic model. By using the inverse kinematic model as prior knowledge, the learning process is effectively guided at the beginning of each training episode, significantly enhancing training efficiency and control accuracy. The remainder of this paper is organized as follows: Sec. 2 presents the forward kinematics model; Sec. 3 describes the inverse kinematics model; Sec. 4 introduces the RRL framework and simulation results; and Sec. 5 concludes the study.
The main contributions of this paper are summarized as follows: (1) We propose a novel soft robotic arm control framework integrating an inverse kinematic model with RRL, substantially improving the efficiency of RL training. (2) By introducing the inverse kinematic model as prior knowledge, we reduce training time by approximately 90%. (3) The effectiveness of the proposed method is validated under various load conditions. Compared to the traditional inverse kinematics method, the trajectory tracking mean absolute errors (MAEs) were reduced from 33.8–57.4 mm to 4.8–7.6 mm, demonstrating high accuracy and robustness in complex trajectory tracking tasks.
2 Forward Kinematics of Soft Robotic Arm
2.1 Structure of Soft Robotic Arm.
As shown in Fig. 2, the soft robotic arm is constructed from two modular three-chamber soft actuators connected by three-dimensional (3D)-printed connectors. Each three-chamber soft actuator (Fig. 2(a)) is made of silicone and contains three independent chambers. Kevlar lines are wound around the outside of the chambers; their low elasticity restricts radial expansion during inflation and promotes axial elongation of the chambers. Each actuator also has chamber-free sections at its bottom and top that remain undeformed during pressurization, serving as the base fixation and the load mounting interface, respectively. The fabrication process of the soft manipulator was detailed in Ref. [32].

Soft robotic arm schematic diagram: (a) three-chamber actuator module, (b) segmented actuators are assembled into a soft robotic arm, and (c) soft robotic arm carrying a 100 g load
2.2 Forward Kinematics.
The kinematic model of the robotic arm maps the relationship between the input air pressures and the coordinates of the robotic arm's end-effector. The forward kinematics can be divided into three mapping relationships:
(1) the relationship between the air pressures and the lengths of the chambers;
(2) the relationship between the chamber lengths and the arc parameters, namely the arc length, the arc rotation angle around the z-axis, and the radius of curvature; and
(3) the relationship between the arc parameters and the end-effector coordinates.
In the first step, the mapping from air pressure to chamber length is implemented using the Neo-Hookean model [34]; the relevant model parameters are the shear modulus of the material, the principal stretch ratio, the cross-sectional area of the chamber, and the cross-sectional area of the chamber wall. The geometry of the constant curvature assumption for the bending deformation of the robotic arm is illustrated in Fig. 3. Because of the fiber constraint, the chamber barely changes in the circumferential direction during pressurization; hence, the circumferential stretch ratio is taken as 1.
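To make this mapping concrete, the sketch below solves an assumed axial force balance of the form p·A_chamber = μ(λ − λ⁻³)·A_wall for the axial stretch λ and returns the inflated chamber length. The specific balance equation, the material and geometric values (MU, A_CHAMBER, A_WALL, L0), and the function name chamber_length are illustrative assumptions, not the paper's exact Neo-Hookean formulation.

```python
# Hypothetical sketch: pressure -> chamber length under an assumed Neo-Hookean
# axial force balance (illustrative form, not the paper's exact equation).
from scipy.optimize import brentq

MU = 0.1e6        # assumed shear modulus of the silicone, Pa
A_CHAMBER = 2e-4  # assumed chamber cross-sectional area, m^2
A_WALL = 1e-4     # assumed wall cross-sectional area, m^2
L0 = 0.15         # assumed rest length of one chamber, m

def chamber_length(pressure_pa: float) -> float:
    """Solve p*A_CHAMBER = MU*(lam - lam**-3)*A_WALL for the axial stretch lam,
    then return the inflated chamber length lam*L0."""
    def residual(lam):
        return pressure_pa * A_CHAMBER - MU * (lam - lam**-3) * A_WALL
    lam = brentq(residual, 1.0, 3.0)  # search the stretch in a plausible range
    return lam * L0

if __name__ == "__main__":
    for p_kpa in (10.0, 20.0, 40.0):
        print(f"{p_kpa:.0f} kPa -> {chamber_length(p_kpa * 1e3):.4f} m")
```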
The piecewise constant curvature model is developed based on the following assumptions:
(1) the soft actuator has a homogeneous shape with a symmetric drive design; and
(2) external load and gravity effects are negligible, and there is no torsional deformation.
These assumptions reduce the description of each segment to the three arc parameters, which greatly simplifies the kinematic model.
In the resulting geometric relations, the rotation angle of the plane containing the arc about the z-axis is taken as positive counterclockwise and negative clockwise, b is the distance from the origin to the center point of the chamber, and the remaining quantities are the length of the chamber and the length of the arc.
The end-effector pose is then obtained by composing the homogeneous transformation matrices of the two segments that make up the soft robotic arm.
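Under the piecewise constant curvature assumption, each segment's arc parameters define a homogeneous transform, and the end-effector position follows from composing the two segment transforms. The sketch below uses the standard constant-curvature parametrization (bending-plane angle phi, curvature kappa = 1/radius, arc length s); the parameter ordering and the example values are illustrative and may differ from the paper's exact matrices.

```python
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def pcc_transform(phi, kappa, s):
    """Homogeneous transform of one constant-curvature segment.
    phi: rotation of the bending plane about the base z-axis,
    kappa: curvature (1 / radius of curvature), s: arc length."""
    theta = kappa * s                 # total bending angle of the segment
    T = np.eye(4)
    if abs(kappa) < 1e-9:             # straight-segment limit
        T[:3, 3] = [0.0, 0.0, s]
        return T
    T[:3, :3] = rot_z(phi) @ rot_y(theta) @ rot_z(-phi)
    T[:3, 3] = [(1 - np.cos(theta)) / kappa * np.cos(phi),
                (1 - np.cos(theta)) / kappa * np.sin(phi),
                np.sin(theta) / kappa]
    return T

def end_effector(arc_params_1, arc_params_2):
    """Compose the two segment transforms and return the tip position."""
    T = pcc_transform(*arc_params_1) @ pcc_transform(*arc_params_2)
    return T[:3, 3]

# Example with illustrative arc parameters: (phi, kappa, s) for each segment.
print(end_effector((0.0, 2.0, 0.15), (np.pi / 3, 3.0, 0.15)))
```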
3 Inverse Kinematics of Soft Robotic Arm
3.1 Inverse Kinematics.
In the inverse relations, the index i ranges from 1 to 3, corresponding to the three independently controlled chambers, and the shear modulus of the material reflects the shear deformation characteristics of the air chamber material.
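The Jacobian-based inversion described in Sec. 1 can be sketched as a damped least-squares iteration around a forward model. In the code below, the forward model f (mapping chamber pressures to the tip position), the damping factor, the step size, and the actuation limits are placeholders; only the overall iterative scheme reflects the Jacobian inversion idea, and clipping the iterate mirrors the actuator-constraint handling mentioned in the text [17].

```python
import numpy as np

def numerical_jacobian(f, q, eps=1e-4):
    """Finite-difference Jacobian of a forward model f: R^n -> R^3."""
    p0 = np.asarray(f(q), dtype=float)
    J = np.zeros((p0.size, q.size))
    for i in range(q.size):
        dq = np.zeros(q.size)
        dq[i] = eps
        J[:, i] = (np.asarray(f(q + dq), dtype=float) - p0) / eps
    return J

def jacobian_ik(f, q0, target, q_min, q_max,
                damping=1e-2, step=0.5, iters=200, tol=1e-3):
    """Iterative damped least-squares inverse kinematics.
    f maps the actuation vector q (e.g., chamber pressures) to the tip position;
    actuator limits are enforced by clipping the iterate at every step."""
    q = np.asarray(q0, dtype=float).copy()
    target = np.asarray(target, dtype=float)
    for _ in range(iters):
        err = target - np.asarray(f(q), dtype=float)
        if np.linalg.norm(err) < tol:
            break
        J = numerical_jacobian(f, q)
        # Damped pseudo-inverse update: dq = (J^T J + lambda^2 I)^{-1} J^T err
        dq = np.linalg.solve(J.T @ J + damping**2 * np.eye(q.size), J.T @ err)
        q = np.clip(q + step * dq, q_min, q_max)
    return q
```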
3.2 Control Simulation Based on Inverse Kinematics.
The accuracy of the inverse kinematics model was evaluated in a simulation environment. Circular trajectories with a radius of 100 mm and Viviani trajectories were used as target paths, and 50 points were uniformly selected from each trajectory for testing. The errors are depicted in Fig. 4; the MAEs of the inverse kinematic control ranged from 33.8 mm to 57.4 mm.
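For reference, the target points could be generated as follows. The 100 mm circle radius and the 50 uniformly sampled points come from the text; the Viviani parametrization x = a(1 + cos t), y = a sin t, z = 2a sin(t/2), its scale a, and the vertical offsets are assumptions made for illustration.

```python
import numpy as np

def circle_targets(radius=0.1, z_offset=-0.35, n=50):
    """50 uniformly spaced points on a 100 mm circle (plane offset assumed)."""
    t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return np.stack([radius * np.cos(t),
                     radius * np.sin(t),
                     np.full_like(t, z_offset)], axis=1)

def viviani_targets(a=0.05, z_offset=-0.35, n=50):
    """50 points on a Viviani curve; the scale a and the offset are illustrative."""
    t = np.linspace(0.0, 4.0 * np.pi, n, endpoint=False)
    return np.stack([a * (1.0 + np.cos(t)),
                     a * np.sin(t),
                     z_offset + 2.0 * a * np.sin(t / 2.0)], axis=1)
```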
For both trajectories, the average distance errors obtained by solving for the target points with the inverse kinematics model increase with the load. This error can be attributed to the simplifying assumptions of the model, which neglect gravity and load and therefore deviate from the actual kinematics of the soft robotic arm: the larger the load mass, the greater its effect on the shape of the arm.
It is also evident from Fig. 4 that the MAEs of the Viviani curve are smaller than those of the circular trajectory. This discrepancy arises because many points on the Viviani curve lie closer to the origin. When the soft robotic arm operates near the origin, the gravitational torque and the load exert less influence, resulting in a smaller error. Conversely, the points along the circular trajectory are distributed 100 mm away from the origin, making them more susceptible to the effects of gravity and consequently leading to a larger error.
4 Residual Reinforcement Learning
4.1 Markov Decision Process.
To apply reinforcement learning in robotic systems, a Markov Decision Process is employed. The environment and Markov Decision Process are characterized by states (S), actions (A), and rewards (R), which are represented as follows:
- The state space S of the robotic arm under a given pressure comprises the current end-effector coordinates, the deviation from the target-point coordinates, the load m, and the air pressures of the six air chambers; the state is defined in Eq. (17).
- The action space A is defined as the change in pressure value of each pneumatic chamber. In the proposed RRL framework, instead of directly outputting absolute pressure values, the policy network learns to generate residual actions, i.e., incremental adjustments applied to the initial pressures computed from the inverse kinematic model, as given in Eq. (18). This formulation allows reinforcement learning to iteratively refine the control solution while leveraging the inverse kinematic prior. To ensure stable exploration and prevent excessive fluctuations, the residual actions are constrained to a normalized action space: action values lie between −1 and 1 and are subsequently scaled by a predefined pressure range of 10 kPa (see the sketch after this list). This normalization restricts the magnitude of the residual actions while maintaining sufficient exploration around the inverse kinematic predictions, ensuring system stability and control precision.
- Rewards R quantify the impact of each action on the overall system, and the optimization objective during training is to maximize the cumulative reward. Throughout the reinforcement learning process, actions that bring the endpoint of the robotic arm closer to the target point receive higher rewards, as expressed in Eq. (19).
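A minimal sketch of the residual action composition described above is given below: the policy output is clipped to the normalized range [−1, 1], scaled by the 10 kPa residual range stated in the text, and added to the inverse kinematic pressures. The function name and the assumed absolute pressure limits are illustrative.

```python
import numpy as np

RESIDUAL_RANGE_KPA = 10.0   # scaling of normalized residual actions (from the text)

def compose_pressures(p_ik_kpa, policy_output, p_min_kpa=0.0, p_max_kpa=80.0):
    """Combine the inverse kinematic prior with a bounded residual action.
    policy_output: raw action for the six chambers, normalized to [-1, 1];
    p_min_kpa / p_max_kpa are assumed actuator pressure limits."""
    residual = np.clip(policy_output, -1.0, 1.0) * RESIDUAL_RANGE_KPA
    return np.clip(np.asarray(p_ik_kpa, dtype=float) + residual, p_min_kpa, p_max_kpa)
```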
4.2 Proximal Policy Optimization and Training Overview.
The PPO update maximizes the clipped surrogate objective

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

where $L^{\mathrm{CLIP}}(\theta)$ represents the clipped surrogate objective, $\theta$ denotes the policy parameters, $r_t(\theta)$ is the ratio of probabilities of the new policy to the old policy, $\hat{A}_t$ is the advantage function, and $\epsilon$ is the clipping parameter.
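A compact PyTorch sketch of this clipped surrogate loss (with generic tensor names) is shown below; it reproduces the standard PPO objective rather than any implementation detail specific to this paper.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective:
    L = -E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)], with r the probability
    ratio of the new policy to the old policy and A the advantage estimate."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```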
An intuitive overview of the training process is shown in Fig. 5. The policy network functions as both the controller and the reinforcement learning agent: it receives state inputs and outputs actions to the environment, which in turn evaluates the reward function. The critic network generates loss values that guide the updates of the policy network. At the start of each training episode, the initial air pressures are determined by solving the inverse kinematic model for the current target point. The current position of the robotic arm's endpoint in the virtual environment is then computed from these pressures, and the resulting error is incorporated into the initial state.
To clarify the integration of the inverse kinematic model into the reinforcement learning process, we employ a structured training procedure. At the beginning of each training episode, initial pressures are computed using the inverse kinematic model based on the target position. The reinforcement learning strategy follows the RRL framework, where the policy iteratively refines the initial pressures by generating small corrective adjustments. This approach allows the RL framework to leverage the inverse kinematic solution as prior knowledge while focusing on compensating for model inaccuracies and unmodeled dynamics, rather than searching for an optimal policy from scratch. Throughout training, actions are constrained to ensure stable exploration while maintaining system reliability. The detailed training steps are summarized in Algorithm 1.
Input: target trajectory points, maximum number of episodes N, maximum steps per episode T, PPO hyperparameters
Output: optimized policy parameters
Initialize the policy network and the critic network
for episode = 1 to N do
    Calculate the initial position via inverse kinematics
    Reset the environment and set the arm to the initial position
    for t = 1 to T do
        Compute the initial pressures via inverse kinematics
        Observe state s_t
        Generate action a_t (pressure residual)
        Apply the resulting pressures and simulate the deformation
        Calculate reward r_t and observe the new state s_{t+1}
        Store the transition (s_t, a_t, r_t, s_{t+1}) in the buffer
    end for
    Update the policy network and the critic network using the PPO clipped objective
end for
Return: optimized policy parameters
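To make Algorithm 1 concrete, a skeletal episode rollout under the RRL scheme might look as follows. The callables inverse_kinematics and simulate_arm stand in for the paper's inverse model and virtual environment, and the reward (negative Euclidean distance to the target) and pressure limits are assumptions rather than the paper's exact definitions.

```python
import numpy as np

def run_episode(policy, inverse_kinematics, simulate_arm, target, load_g,
                max_steps=20, residual_range_kpa=10.0):
    """One RRL episode in the spirit of Algorithm 1: start from the inverse
    kinematic prior and let the policy apply bounded pressure residuals.
    inverse_kinematics and simulate_arm are placeholder callables standing in
    for the paper's inverse model and virtual environment."""
    target = np.asarray(target, dtype=float)
    p = np.asarray(inverse_kinematics(target), dtype=float)   # prior pressures
    position = np.asarray(simulate_arm(p, load_g), dtype=float)
    transitions = []
    for _ in range(max_steps):
        state = np.concatenate([position, target - position, [load_g], p])
        residual = np.clip(policy(state), -1.0, 1.0) * residual_range_kpa
        p = np.clip(p + residual, 0.0, 80.0)          # assumed pressure limits (kPa)
        position = np.asarray(simulate_arm(p, load_g), dtype=float)
        reward = -np.linalg.norm(target - position)   # assumed reward: -distance
        transitions.append((state, residual, reward))
    return transitions
```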
A control strategy for arbitrary points in the workspace is obtained with only 30,000 training steps, in less than a day on a workstation equipped with an Intel Core i9-12900K CPU and an NVIDIA GeForce RTX 3080 GPU. To assess the efficacy of the control strategy, 50 points are uniformly sampled as targets along both the two-dimensional circular trajectory and the three-dimensional Viviani curve trajectory, and these trajectories are tested under load conditions ranging from 0 g to 300 g in the virtual environment.
4.3 Simulation Results.
The trajectory tracking results in Figs. 6 and 7 demonstrate that both the RL and RRL algorithms achieve accurate tracking under various loading conditions. The key difference between the two methods lies in the use of inverse kinematics as prior knowledge in RRL, whereas RL relies on a random initial guess. To ensure a fair comparison, both were trained under identical conditions, including network architecture and hyperparameters. Overall, both methods effectively follow the target trajectories despite increasing loads. Table 1 summarizes the tracking performance of RL, RRL, and the inverse kinematics-based approach, providing a quantitative comparison of their errors. As shown in Table 1, RRL achieves errors ranging from 4.8 mm to 7.6 mm, while RL exhibits a similar accuracy level with errors between 3.7 mm and 6.7 mm. In contrast, the inverse kinematics-based approach shows significantly larger errors, ranging from 33.2 mm to 65.4 mm, highlighting the superior tracking accuracy of the learning-based methods.

Although RRL and RL achieve accurate trajectory tracking overall, Fig. 6 shows that z-axis tracking errors are more pronounced than those in the y-direction. This discrepancy is primarily attributed to gravitational effects, which induce structural deformation under higher loads, reducing control precision. Additionally, the simplified assumptions in the inverse kinematics model can fail to capture nonlinear behaviors in the z-axis, lowering prediction accuracy. Other contributing factors include limited training data coverage in the z-direction and actuator asymmetries, such as restricted working range and output force, which collectively exacerbate z-axis errors.

The circular trajectory tracking results in Fig. 7 further confirm RRL's robustness across different loading conditions and target curves. Even under a 300 g load, RRL accurately tracks the target path with minimal error, demonstrating its high-precision control.

Viviani trajectory tracking results for RRL and RL under different loading conditions: (a) no load, (b) 100 g, (c) 200 g, and (d) 300 g. (i) and (ii) represent the trajectory projection onto both the xy-plane and the xz-plane, respectively.

Circular trajectory tracking results for RRL and RL under different loading conditions: (a) no load, (b) 100 g, (c) 200 g, and (d) 300 g
Table 1: MAEs in RL, RRL, and inverse kinematic model control compared to target curves

| Load | Circle, RL | Circle, RRL | Viviani, RL | Viviani, RRL |
|---|---|---|---|---|
| 0 g | 5.3 mm | 4.8 mm | 5.4 mm | 6.2 mm |
| 100 g | 3.7 mm | 5.1 mm | 6.4 mm | 5.8 mm |
| 200 g | 3.9 mm | 4.8 mm | 6.5 mm | 6.2 mm |
| 300 g | 5.7 mm | 6.3 mm | 6.7 mm | 7.6 mm |
| Load | Circle, inverse kinematics | Viviani, inverse kinematics |
|---|---|---|
| 0 g | 33.2 mm | 52.5 mm |
| 100 g | 38.2 mm | 57.3 mm |
| 200 g | 42.6 mm | 61.7 mm |
| 300 g | 45.9 mm | 65.4 mm |
Moreover, these simulation results highlight the significant advantages of the RRL algorithm in terms of learning efficiency. As shown in Fig. 8, the RRL algorithm markedly reduces the training time and the number of iterations: whereas a traditional model-free RL method requires 300,000 steps to train, the proposed RRL control framework takes only 30,000, reducing training time by approximately 90%. This efficient learning mechanism enables the RRL algorithm to achieve high-precision control while also lowering computational complexity and resource consumption.
5 Conclusion
This paper presented an RRL-based efficient modeling and control strategy for soft robotic arms, integrating an analytical framework with data-driven error modeling to build the virtual RL environment. An inverse kinematic model was developed to provide prior physics knowledge for RL training. Control simulations were performed in a virtual environment interacting with the hybrid model and validated through path-tracking simulations of two-dimensional circular and 3D Viviani trajectories.
The proposed method reduces path-tracking MAEs from 33.8–57.4 mm (inverse kinematic method) to 4.8–7.6 mm under varying load conditions in simulation. In addition, the RRL method cuts training time by approximately 90% compared to model-free RL methods, achieving high control precision with significantly improved learning efficiency. The shortened training duration could also help mitigate material degradation, particularly creep effects caused by prolonged loading during training, thereby helping to preserve actuator consistency and policy transferability. Furthermore, the RRL framework could facilitate rapid parameter tuning, enabling efficient testing of different network architectures or reward functions without lengthy iteration cycles, and could enable quicker validation for real-world deployment, allowing faster adaptation to dynamic environments. These benefits suggest the potential for enhanced deployment efficiency and long-term system robustness, highlighting the practical value of the proposed approach in soft robotic control.
In future work, we plan to extend and validate the proposed RRL framework in more complex real-world scenarios to further enhance its practical applicability. Specifically, we will integrate the RRL algorithm into more challenging robotic tasks, such as dexterous manipulation and adaptive locomotion, to improve its generalization ability across different control environments. Additionally, to bridge the gap between simulation and real-world applications, we will refine the Sim-to-Real transfer efficiency by leveraging domain adaptation techniques and real-time learning strategies. Instead of relying solely on pretrained models, we will incorporate online sensor feedback (e.g., pressure, strain, and end-effector positions) to dynamically adjust the policy, thereby enhancing the robustness of the system against external disturbances and actuation uncertainties.
Funding Data
National Natural Science Foundation of China (Grant No. 62203174; Funder ID: 10.13039/501100001809).
Fundamental Research Funds for the Central Universities (Grant No. 2024ZYGXZR028; Funder ID: 10.13039/501100012226).
Guangzhou Municipal Science and Technology Project (Grant No. 2025A04J5281; Funder ID: 10.13039/501100010256).
Data Availability Statement
The datasets generated and supporting the findings of this article are available from the corresponding author upon reasonable request.