Abstract

Modeling and control of soft robotic arms are challenging due to their complex deformation behavior. Kinematic models offer strong interpretability but are limited by low accuracy, while model-free reinforcement learning (RL) methods, though widely applicable, suffer from inefficiency and require extensive training. To address these issues, we propose a residual reinforcement learning (RRL) modeling and control framework incorporating an inverse kinematic model as prior knowledge to enhance RL training efficiency. Despite the kinematic model producing high mean absolute errors (MAEs) ranging from 33.8 mm to 57.4 mm, it significantly accelerates RL training. Using the Proximal Policy Optimization (PPO) algorithm, our method achieves a 90% reduction in training time and decreases MAEs to 4.8 mm–7.6 mm with just 30,000 iterations. This significantly enhances control precision over inverse kinematic methods while improving efficiency compared to conventional RL approaches.

1 Introduction

Soft robots, a subset of industrial robots, are capable of achieving arbitrary spatial positions and postures, making them well-suited for tasks that require flexible grasping and precise manipulation. As intelligent robotics continues to advance, the need to develop control strategies specifically tailored to soft robots becomes increasingly pronounced [1,2]. Soft robots utilize a variety of actuation methods, including pneumatic actuation [3–5], artificial muscles [6,7], electro-active polymers, and low melting point alloys, with pneumatic actuation being the most prevalent. However, the deformation characteristics of pneumatic soft robots, characterized by nonlinear behavior and inherent hysteresis, present significant challenges in developing robust and reliable control frameworks [8]. Addressing these challenges is essential for the continued progress of soft robotics and its practical applications.

Kinematic modeling of soft robots typically relies on curvature assumption models, such as the constant curvature model [9], the piecewise constant curvature model [10], and the variable curvature model [11]. For instance, Wen et al. [12] applied the constant curvature assumption to design an underwater robotic arm, simplifying the structure for symmetric deformation, which is effective in underwater environments but introduces inaccuracies when used in air. To mitigate these errors, sensor-based compensation methods have been employed [13,14]. On the other hand, Mahl et al. [15,16] adopted a variable curvature model, dividing the soft arm into segments to enhance accuracy. However, both sensor-based and advanced modeling techniques are limited by the inherent accuracy and complexity of the physical model, and they lack robustness in real-world conditions, particularly under variable payloads and self-weight.

After establishing the forward kinematic model, an inversion process can be undertaken to develop the corresponding inverse kinematic model. This process facilitates the mapping from the task space of the soft robotic arm to the configuration space of the drive, which is crucial for accurate control and operation [9]. The inverse kinematics method for soft robots relies on the Jacobian matrix inversion technique. It allows for solving the actuator configuration space of the robot under any initial conditions, such as initial arm position, initial arm velocity, and target point position. By introducing the concept of instantaneous kinematics, the robot can be viewed as undergoing instantaneous changes in a sufficiently short time frame, enabling iterative solutions. In this Jacobian-based inverse kinematics approach, actuator constraints can be incorporated into the robot system, ensuring that the robot's trajectory remains achievable on the real robot system [17]. Yan et al. [18] introduced a medical soft robot designed for minimally invasive surgery inside the human body, utilizing the Jacobian matrix to establish an inverse kinematic model for static control. Additionally, the segmented affine curvature model can also be applied to the inverse kinematic model, enhancing its accuracy and providing a control framework [19,20]. Xie et al. [21] established a kinematic model for underwater soft robotic arms and enabled static control by inversely determining drive-space parameters, such as the cavity lengths, from shape parameters. However, while inverse kinematic models provide strong interpretability, they are often limited by low accuracy in practical control scenarios.

Reinforcement learning (RL) methods enhance the adaptability of systems in complex, dynamic environments [22,23], particularly in highly nonlinear, nonuniform, gravity-influenced, or unstructured settings where accurate modeling is difficult [24]. Wu et al. [25] applied a multilayer perceptron combined with RL to model and control a soft robotic arm within a neural network-based simulation. The Deep Q-Network algorithm was used to optimize control parameters, benefiting from the cable-driven robotic arm's robustness, which enabled repeated iterations without damage. However, pneumatic soft robotic arms are more susceptible to damage from repeated trials, which severely limits the number of allowable training iterations. This constraint necessitates the development of an effective algorithmic framework that can achieve high accuracy within a reduced training period, maximizing performance under such limitations.

Model-based methods offer better sample efficiency but often involve significant workload due to the necessity of building and maintaining accurate environmental models [26]. On the other hand, model-free methods often face the challenge of extended training times. To address these issues, physics-informed reinforcement learning incorporates physical priors into the learning algorithm. The core framework involves observation, induction, and learning [27], enhancing physics-based models' scalability, accuracy, and data efficiency. Additionally, guided reinforcement learning further improves training efficiency by integrating diverse prior knowledge into the learning process [28]. Integrating physical priors into reinforcement learning frameworks bridges the gap between traditional physics-based methods and data-driven approaches. This integration accelerates the learning process and enhances the model's generalization capabilities in complex and dynamic environments, showing great potential for application in future complex robotic systems.

Despite continuous improvements in training accuracy and efficiency through reinforcement learning strategies, minimizing training time remains a major challenge in soft robotic control. Excessive training time not only increases computational costs but also accelerates material degradation due to creep effects, which are particularly problematic for soft materials with nonlinear and viscoelastic properties. Creep-induced deformations can lead to significant alterations in mechanical responses over time, even within short durations. For example, Honji et al. proposed a stochastic viscoelastic model capturing joint displacement drift under constant load, demonstrating that creep effects are observable within minutes of use [29]. Furthermore, Kim et al. emphasized that time-dependent characteristics like creep and drift are fundamental challenges in soft actuator control, often leading to control inconsistency during extended learning sessions [30]. These challenges are especially pronounced during idle periods in extended simulation training, where material relaxation continues to alter actuator properties, even without active actuation. This time-dependent deformation introduces discrepancies between simulated and real-world performance, widening the Sim-to-Real gap and undermining control accuracy. Consequently, reducing training duration is not just computationally efficient but physically necessary to maintain material integrity and ensure reliable real-world performance.

In this study, we propose a novel modeling and control framework as illustrated in Fig. 1. Although inverse kinematic models typically exhibit substantial errors in soft robotics, these errors can be effectively mitigated using the residual reinforcement learning (RRL) [31] framework. RRL optimizes the RL training process by refining the initial control strategy derived from the inverse kinematic model. By using the inverse kinematic model as prior knowledge, the learning process is effectively guided at the beginning of each training episode, significantly enhancing training efficiency and control accuracy. The remainder of this paper is organized as follows: Sec. 2 presents the forward kinematics model; Sec. 3 describes the inverse kinematics model; Sec. 4 introduces the RRL framework and simulation results; and Sec. 5 concludes the study.

Fig. 1
The proposed framework for controlling soft robotic arms using residual reinforcement learning

The main contributions of this paper are summarized as follows: (1) We propose a novel soft robotic arm control framework integrating an inverse kinematic model with RRL, substantially improving the efficiency of RL training. (2) By introducing the inverse kinematic model as prior knowledge, we significantly reduce training time by approximately 90%. (3) The effectiveness of the proposed method is validated under various load conditions. Compared to the traditional inverse kinematics method, the trajectory tracking mean absolute errors (MAEs) were reduced from 33.8–57.4 mm to 4.8–7.6 mm, demonstrating high accuracy and robustness in complex trajectory tracking tasks.

2 Forward Kinematics of Soft Robotic Arm

2.1 Structure of Soft Robotic Arm.

As shown in Fig. 2, the soft robotic arm is constructed from two modular three-chamber soft actuators connected by three-dimensional (3D)-printed connectors. Each three-chamber soft actuator (Fig. 2(a)) is made of silicone and comprises three independent internal chambers. Kevlar lines are wound around the outside of the chambers; their low elasticity restricts radial expansion during inflation and promotes axial elongation of the chambers. Each three-chamber actuator has chamber-free sections at the bottom and top that remain uninflated and undeformed, serving as the base fixture and the load mount, respectively. The fabrication process of the soft manipulator is detailed in Ref. [32].

Fig. 2
Soft robotic arm schematic diagram: (a) three-chamber actuator module, (b) segmented actuators are assembled into a soft robotic arm, and (c) soft robotic arm carrying a 100 g load

2.2 Forward Kinematics.

The mathematical model of the kinematics of the robotic arm maps the relationship between the input air pressure and the coordinates of the robotic arm's end-effector. The forward kinematics can be divided into three mapping relationships, which are composed in the code sketch following this list:

  1. The relationship between air pressures (P1,P2,P3) and the lengths of chambers (l1,l2,l3).

  2. The relationship between chamber lengths and arc parameters (l̄, φ, k), which represent the arc length, the rotation angle of the bending plane about the z-axis, and the curvature, respectively.

  3. The relationship between arc parameters and the end-effector coordinates.
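
Taken together, the three mappings above compose into the forward kinematics. The following is a minimal sketch of that composition; the function names (chamber_lengths_from_pressure, arc_params_from_lengths, tip_from_arc_params), the numeric values, and the particular sign/phase conventions of mapping 2 are illustrative assumptions, with the actual relations given by Eqs. (1)–(5) below.

import numpy as np

def chamber_lengths_from_pressure(P, l0=120.0):
    """Mapping 1 (placeholder): invert the pressure-length relation of
    Eq. (1) numerically; here a stub that simply returns the rest length."""
    return np.full(3, l0)  # replace with a root-finding step on Eq. (1)

def arc_params_from_lengths(l, b=15.0):
    """Mapping 2: chamber lengths -> (arc length, phi, curvature), using a
    standard three-chamber constant-curvature geometry (convention assumed)."""
    l1, l2, l3 = l
    lbar = (l1 + l2 + l3) / 3.0
    phi = np.arctan2(np.sqrt(3.0) * (l2 + l3 - 2.0 * l1), 3.0 * (l2 - l3))
    k = 2.0 * np.sqrt(l1**2 + l2**2 + l3**2 - l1*l2 - l2*l3 - l1*l3) \
        / (b * (l1 + l2 + l3))
    return lbar, phi, k

def tip_from_arc_params(lbar, phi, k):
    """Mapping 3: end position of a single constant-curvature arc."""
    if abs(k) < 1e-9:
        return np.array([0.0, 0.0, lbar])       # straight configuration
    theta = k * lbar                             # central angle of the arc
    return np.array([np.cos(phi) * (1.0 - np.cos(theta)) / k,
                     np.sin(phi) * (1.0 - np.cos(theta)) / k,
                     np.sin(theta) / k])

# Forward kinematics of one segment: pressures -> end-effector position
pressures = np.array([20.0, 10.0, 5.0])          # kPa, illustrative values
lengths = chamber_lengths_from_pressure(pressures)
tip = tip_from_arc_params(*arc_params_from_lengths(lengths))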

The relationship between air pressure and chamber length is given by Ref. [33]
(1)

where μ represents the shear modulus of the material, λ=l/l0 represents the principal stretch ratio, s1 denotes the cross-sectional area of the chamber, and s2 is the cross-sectional area of the wall.

In the first step, the mapping from air pressure to chamber length is implemented using the Neo-Hookean model [34]. The geometry of the constant curvature assumption for the bending deformation of the robotic arm is illustrated in Fig. 3. Due to the fiber constraint, the chamber hardly changes in the circumferential direction during pressurization; hence, the circumferential stretch ratio is taken as 1.
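
Equation (1) is not reproduced above. As an illustrative sketch only, assuming an incompressible Neo-Hookean wall with axial stretch λ, unit circumferential stretch, radial stretch 1/λ, negligible radial wall stress, and an axial force balance between the pressure acting on s1 and the wall stress acting on s2, one obtains a relation of the following form; the exact expression of Eq. (1) should be taken from Ref. [33]:

\[
P\, s_1 = \mu\, s_2 \left(\lambda^{2} - \lambda^{-2}\right), \qquad \lambda = \frac{l}{l_0}
\]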

Fig. 3
Schematic diagram of deformation assumptions for soft body actuators

The piecewise constant curvature model is developed based on the following assumptions:

  • The soft body actuator has a homogeneous shape with a symmetric drive design.

  • External load and gravity effects are negligible, and there is no torsional deformation.

  • Under these assumptions, each segment's configuration is fully described by three arc parameters, which reduces the kinematic model to three dimensions and simplifies the modeling.

Based on geometric relationships, the relationship between chamber lengths and arc parameters is
(2)

where φ0 denotes the rotation angle of the plane containing the arc about the Z-axis, with counterclockwise as positive and clockwise as negative. In addition, b is the distance from the origin to the center point of the chamber, li denotes the length of the chamber, and l¯ represents the length of the arc.

Because φ0 is solved using trigonometric functions, its range of values is (−π/2, π/2). However, in reality, the range of φ is (−π, π). Therefore, cases where a solution cannot be obtained must be separately analyzed
(3)
The central angle of the arc can be determined from θ=kl¯. According to the geometric relationships [35], the mapping of the arc parameters to the tip position is represented by the homogeneous transformation matrix, as follows:
(4)
The fourth column of matrix T represents the end coordinates of the soft driver, as follows:
(5)
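The homogeneous transformation matrix of Eq. (4) and the position vector of Eq. (5) are not reproduced above. For a single constant-curvature section, the standard result (see Ref. [35]), with central angle θ = kl̄, gives the tip position as

\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
=
\begin{bmatrix}
\cos\varphi\,\dfrac{1-\cos(k\bar{l})}{k} \\[1ex]
\sin\varphi\,\dfrac{1-\cos(k\bar{l})}{k} \\[1ex]
\dfrac{\sin(k\bar{l})}{k}
\end{bmatrix}
\]

which reduces to (0, 0, l̄) as the curvature k approaches zero.
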
As depicted in Fig. 2, the middle connector section of the soft robotic arm and the lower segment of the load-bearing component attached to the base do not undergo bending deformation. Consequently, these sections are treated as rigid elements. As illustrated in Fig. 3, they are assumed to extend in the direction tangent to the circular arc. Their respective transformation matrices Sb and Sh are
(6)
where lb is the length of the intermediate connector and lh is the length of the rigid unit in the head of the robotic arm. The forward kinematics for a two-section manipulator can then be generated by the product of four matrices of the form given in Eq. (7). The matrix T represents the relationship between the arc parameters of the two sections [φ1,k1,l1,φ2,k2,l2] and the end coordinates of the manipulator's arm as
(7)

where the matrices T1 and T2 represent the homogeneous transformation matrices of the two segments comprising the soft robotic arm itself, respectively.

3 Inverse Kinematics of Soft Robotic Arm

3.1 Inverse Kinematics.

A common method for solving inverse kinematics problems is a first-order analytical rate algorithm based on Jacobian matrices [36]. This approach requires establishing the velocity kinematics of the robotic arm. The differential kinematics of the soft robotic arm relates the first-order derivative of the endpoint position X to that of the arc parameters q of the two arc segments that make up the arm, expressed as follows:
(8)
(9)
(10)
The Jacobian matrix J is then:
(11)
The solution formula for the Jacobian matrix method is as follows:
(12)
where I and q0 are the identity matrix and an arbitrary initial arc-parameter vector, respectively, and J⁺(q) is the pseudo-inverse of the manipulator's Jacobian. Because the Jacobian is nonsingular (full row rank), following the Moore–Penrose pseudo-inverse [36], J⁺ is defined as
(13)
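For reference, the standard resolved-rate solution and Moore–Penrose pseudo-inverse described above (Ref. [36]) take the following form, where the arbitrary vector q̇0 plays the role of q0 in the text and the null-space term vanishes when no secondary task is specified; the paper's Eqs. (12) and (13) are assumed to follow this convention:

\[
\dot{q} = J^{+}(q)\,\dot{X} + \left(I - J^{+}(q)\,J(q)\right)\dot{q}_0,
\qquad
J^{+} = J^{\mathsf{T}}\left(J\,J^{\mathsf{T}}\right)^{-1}
\]
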
Once the arc parameters have been determined by the Jacobian method, the lengths of the three chambers (l1, l2, l3) can be expressed in terms of the arc parameters (l̄, φ, θ)
(14)
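Equation (14) is not reproduced above. For three chambers spaced 120 deg apart at radial offset b, one common constant-curvature form (the phase and sign conventions may differ from the paper's) is, with θ = kl̄:

\[
l_i = \bar{l}\left(1 - b\,k\,\cos\!\left(\varphi + \tfrac{2\pi}{3}(i-1)\right)\right)
    = \bar{l} - b\,\theta\,\cos\!\left(\varphi + \tfrac{2\pi}{3}(i-1)\right),
\qquad i = 1, 2, 3
\]
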
According to Eq. (1), the air pressure of each chamber can then be determined through inverse solution, based on the length of the chamber
(15)
(16)

In the formula, the index i ranges from 1 to 3, corresponding to the three independently controlled chambers. The parameter μ represents the shear modulus of the material, reflecting the shear deformation characteristics of the air chamber material.
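
A minimal numerical sketch of the Jacobian-based iteration described above is given below, using a single constant-curvature segment and a finite-difference Jacobian for illustration; the arc-to-tip map, damping term, and numeric values are simplifying assumptions rather than the paper's full two-segment model.

import numpy as np

def tip(q):
    """Tip position of one constant-curvature segment, q = [phi, k, lbar] (mm)."""
    phi, k, lbar = q
    if abs(k) < 1e-9:
        return np.array([0.0, 0.0, lbar])
    th = k * lbar
    return np.array([np.cos(phi) * (1 - np.cos(th)) / k,
                     np.sin(phi) * (1 - np.cos(th)) / k,
                     np.sin(th) / k])

def jacobian(q, h=1e-6):
    """Finite-difference Jacobian dX/dq (3 x 3)."""
    J = np.zeros((3, 3))
    for j in range(3):
        dq = np.zeros(3)
        dq[j] = h
        J[:, j] = (tip(q + dq) - tip(q - dq)) / (2 * h)
    return J

def solve_ik(x_target, q0, iters=200, damping=1e-4):
    """Damped pseudo-inverse iteration q <- q + J^+ (x_target - tip(q))."""
    q = np.array(q0, dtype=float)
    for _ in range(iters):
        err = x_target - tip(q)
        if np.linalg.norm(err) < 1e-3:
            break
        J = jacobian(q)
        J_pinv = J.T @ np.linalg.inv(J @ J.T + damping * np.eye(3))
        q = q + J_pinv @ err
    return q

q_sol = solve_ik(np.array([60.0, 40.0, 150.0]), q0=[0.1, 0.005, 180.0])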

3.2 Control Simulation Based on Inverse Kinematics.

The accuracy of the inverse kinematics model was evaluated in a simulation environment. Circular trajectories with a radius of 100 mm and Viviani trajectories were used as target paths. For each trajectory, 50 points were uniformly selected for testing. The errors are depicted in Fig. 4, and the MAEs of the inverse kinematic control ranged from 33.8 mm to 57.4 mm.

Fig. 4
MAEs of inverse kinematic control simulation

It is observed that the average distance errors of the two trajectories, obtained by solving for the target points with the inverse kinematics model, increase with the load. This error can be attributed to the simplifying assumptions employed in the modeling process, which deviate from the actual kinematics of the soft robotic arm; in particular, the inverse kinematics model neglects the effects of gravity and load. The greater the mass of the load, the greater its effect on the shape of the soft robotic arm.

Additionally, it is also evident from Fig. 4 that the MAEs of the Viviani curve are smaller compared to those of the circular trajectory. This discrepancy arises because many points on the Viviani curve are positioned closer to the origin. When the soft robotic arm is in proximity to the origin, the gravitational torque and load exert less influence, thereby resulting in a smaller error. Conversely, the points along the circular trajectory are distributed 100 mm away from the origin, making them more susceptible to the effects of gravity, consequently leading to a larger error.

4 Residual Reinforcement Learning

4.1 Markov Decision Process.

To apply reinforcement learning in robotic systems, a Markov Decision Process is employed. The environment and Markov Decision Process are characterized by states (S), actions (A), and rewards (R), which are represented as follows:

  • The state space S of the robotic arm under a given pressure can be described by the current coordinates X_t^curr ∈ R^3, the deviation from the target point e_t = X^tar − X_t^curr ∈ R^3, the load m, and the air pressures of the six air chambers P_t ∈ R^6. The state is hence defined as
    (17)
  • The action space A is defined as the change in pressure value Δp ∈ R^6 for each pneumatic chamber. In the proposed RRL framework, instead of directly outputting absolute pressure values, the policy network learns to generate residual actions Δp_t, which are incremental adjustments applied to the initial pressure computed from the inverse kinematic model. This formulation allows reinforcement learning to iteratively refine the control solution while leveraging the inverse kinematic prior. To ensure stable exploration and prevent excessive fluctuations, we further constrain the residual actions within a normalized action space, where action values lie between −1 and 1 and are subsequently scaled by a predefined pressure range of 10 kPa. This normalization restricts the magnitude of residual actions while maintaining sufficient exploration around the inverse kinematic predictions, ensuring system stability and control precision
    (18)
  • Rewards R_a(s, s′) quantify the impact of each action on the overall system. The optimization objective during reinforcement learning training is to maximize the rewards. Throughout the reinforcement learning process, actions that bring the endpoint of the robotic arm closer to the target point receive higher rewards
    (19)
If the pressure p exceeds the maximum limit or the number of training steps n per episode exceeds 100, the reward is set to −200. When the current error err_curr after taking an action is less than a small distance threshold ϵ set beforehand (here ϵ = 10 mm), the reward is set to 200, so the system receives a large positive reward when the endpoint is sufficiently close to the target point. In other cases, the reward is calculated from the reduction in error, i.e., the difference between the previous error and the current error err_curr, and ξ is an adjustable parameter that introduces an additional penalty when the error is significant. This penalty is defined as follows:
(20)
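
A sketch of the reward logic described above is given below; only ϵ = 10 mm and the 100-step limit come from the text, while p_max, ξ, and the large-error threshold err_big are illustrative assumptions.

def reward(err_prev, err_curr, pressures, step_count,
           p_max=60.0, eps=10.0, xi=0.1, err_big=100.0):
    """Reward shaping following Sec. 4.1 (distances in mm, pressures in kPa).
    p_max, xi, and err_big are illustrative placeholders."""
    if any(p > p_max for p in pressures) or step_count > 100:
        return -200.0                       # pressure limit exceeded or timeout
    if err_curr < eps:
        return 200.0                        # endpoint sufficiently close to target
    r = err_prev - err_curr                 # reward progress toward the target
    if err_curr > err_big:
        r -= xi * err_curr                  # extra penalty when the error is large
    return r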

4.2 Proximal Policy Optimization and Training Overview.

Proximal policy optimization (PPO) is an algorithm that can train agents in environments with large action spaces and continuous state spaces. Such environments are particularly well-suited for soft-bodied robots, as the continuous actions and states align well with their inherent characteristics. PPO stabilizes the training process by constraining the policy update step size, ensuring that each update does not significantly deviate from the previous policy. This constraint is implemented by clipping or limiting the objective function during policy updates, thereby maintaining the relative difference between the new and old policies within an acceptable range. The objective function for the PPO algorithm is
(21)

where δ(θ) represents the clipped surrogate objective, θ denotes the policy parameters, rt(θ) is the ratio of probabilities of the new policy to the old policy, Ât is the advantage function, and ϵ is the clipping parameter.
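
For completeness, the standard PPO clipped surrogate objective consistent with this description (written with the paper's symbol δ(θ) for the objective) is:

\[
\delta(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
\]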

An intuitive overview of the training process is shown in Fig. 5. The policy network functions as both a controller and an agent for reinforcement learning. It receives state inputs and outputs actions to the environment, which then evaluates the reward function. The critic network generates loss values to aid in updating the policy network. At the start of each training episode, the initial air pressure is determined by solving the inverse kinematic model for the current target point. Subsequently, the current position of the robotic arm's endpoint in the virtual environment is computed based on this air pressure. The current error is then calculated and incorporated into the initial state.

Fig. 5
Training process of RRL combined with the PPO algorithm

To clarify the integration of the inverse kinematic model into the reinforcement learning process, we employ a structured training procedure. At the beginning of each training episode, initial pressures are computed using the inverse kinematic model based on the target position. The reinforcement learning strategy follows the RRL framework, where the policy iteratively refines the initial pressures by generating small corrective adjustments. This approach allows the RL framework to leverage the inverse kinematic solution as prior knowledge while focusing on compensating for model inaccuracies and unmodeled dynamics, rather than searching for an optimal policy from scratch. Throughout training, actions are constrained to ensure stable exploration while maintaining system reliability. The detailed training steps are summarized in Algorithm 1.

Algorithm 1
Input: Target trajectory points Xtar, maximum episodes N, maximum steps per episode T, PPO hyperparameters (ϵ = 0.2, γ = 0.99)
Output: Optimized policy parameters θ
Initialize policy network πθ and critic network
for episode = 1, 2, …, N do
 Compute initial pressure P0 via inverse kinematics for the current target point
 Calculate the corresponding initial position; reset the environment and set Xcurr to this position
 for t = 1 to T do
  Observe state st = [X_t^curr, e_t, P_t, m]
  Generate action (pressure residual) at = ΔP_t ~ πθ(st)
  Apply pressure P_{t+1} = P_t + at, simulate deformation
  Calculate reward rt, observe new state s_{t+1}
  Store transition (st, at, rt, s_{t+1}) in buffer
 end for
 Update policy network πθ and critic network using the PPO clipped objective
end for
Return: Optimized policy parameters θ
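
The following is a minimal sketch of how a residual action is applied at each step of Algorithm 1, assuming the 10 kPa residual scaling from Sec. 4.1; the pressure limits and the policy/environment interfaces (env.reset, env.step) are hypothetical placeholders, not the actual simulation API.

import numpy as np

ACTION_SCALE_KPA = 10.0          # residual range per step, from Sec. 4.1
P_MIN, P_MAX = 0.0, 60.0         # assumed chamber pressure limits (kPa)

def rollout_episode(env, policy, ik_pressures, max_steps=100):
    """One RRL episode: start from the inverse-kinematic pressures and let
    the policy apply bounded residual corrections."""
    p = np.clip(np.asarray(ik_pressures, dtype=float), P_MIN, P_MAX)
    state = env.reset(initial_pressure=p)        # hypothetical environment API
    for _ in range(max_steps):
        a = np.clip(policy(state), -1.0, 1.0)    # normalized residual in [-1, 1]
        p = np.clip(p + ACTION_SCALE_KPA * a, P_MIN, P_MAX)
        state, reward, done = env.step(p)        # hypothetical environment API
        if done:
            break
    return state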

A control strategy for arbitrary points in the workspace is obtained with only 30,000 steps and in less than a day on a workstation equipped with an Intel Core i9-12900K CPU and an NVIDIA GeForce RTX 3080 GPU. To assess the efficacy of the control strategy, 50 points are uniformly sampled as targets along both the two-dimensional circular trajectory and the three-dimensional Viviani curve trajectory. Subsequently, these trajectories are tested under load conditions ranging from 0 g to 300 g in the virtual environment.
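
A sketch of how the evaluation targets can be sampled is shown below; the 50-point count and the 100 mm circle radius follow the text, while the Viviani curve size a, the vertical offsets, and the placement in the workspace are illustrative assumptions.

import numpy as np

def circle_targets(n=50, radius=100.0, z0=250.0):
    """n points on a horizontal circle of radius 100 mm (height z0 assumed)."""
    t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return np.stack([radius * np.cos(t), radius * np.sin(t),
                     np.full(n, z0)], axis=1)

def viviani_targets(n=50, a=50.0, z0=200.0):
    """n points on a Viviani curve (intersection of a sphere of radius 2a and
    a cylinder of radius a); size a and offset z0 are illustrative."""
    t = np.linspace(0.0, 4.0 * np.pi, n, endpoint=False)
    return np.stack([a * (1.0 + np.cos(t)), a * np.sin(t),
                     2.0 * a * np.sin(t / 2.0) + z0], axis=1)

targets = np.vstack([circle_targets(), viviani_targets()])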

4.3 Simulation Results.

The trajectory tracking results in Figs. 6 and 7 demonstrate that both the RL and RRL algorithms achieve accurate tracking under various loading conditions. The key difference between the two methods lies in the use of inverse kinematics as prior knowledge in RRL, whereas RL relies on a random initial guess. To ensure a fair comparison, both were trained under identical conditions, including network architecture and hyperparameters. Overall, both methods effectively follow the target trajectories despite increasing loads. Table 1 summarizes the tracking performance of RL, RRL, and the inverse kinematics-based approach, providing a quantitative comparison of their errors. As shown in Table 1, RRL achieves errors ranging from 4.8 mm to 7.6 mm, while RL exhibits a similar accuracy level with errors between 3.7 mm and 6.7 mm. In contrast, the inverse kinematics-based approach shows significantly larger errors, ranging from 33.2 mm to 65.4 mm, highlighting the superior tracking accuracy of the learning-based methods. Although RRL and RL achieve accurate trajectory tracking overall, Fig. 6 shows that z-axis tracking errors are more pronounced than those in the y-direction. This discrepancy is primarily attributed to gravitational effects, which induce structural deformation under higher loads, reducing control precision. Additionally, the simplified assumptions in the inverse kinematics model can fail to capture nonlinear behaviors in the z-axis, lowering prediction accuracy. Other contributing factors include limited training data coverage in the z-direction and actuator asymmetries, such as restricted working range and output force, collectively exacerbating z-axis errors. The circular trajectory tracking results in Fig. 7 further confirm RRL's robustness across different loading conditions and target curves. Even under a 300 g load, RRL accurately tracks the target path with minimal error, demonstrating its high precision control.

Fig. 6
Viviani trajectory tracking results for RRL and RL under different loading conditions: (a) no load, (b) 100 g, (c) 200 g, and (d) 300 g. (i) and (ii) represent the trajectory projection onto both the xy-plane and the xz-plane, respectively.
Fig. 7
Circular trajectory tracking results for RRL and RL under different loading conditions: (a) no load, (b) 100 g, (c) 200 g, and (d) 300 g
Table 1

MAEs in RL, RRL, and inverse kinematic model control compared to target curves

Load     Circle, RL    Circle, RRL    Viviani, RL    Viviani, RRL
0 g      5.3 mm        4.8 mm         5.4 mm         6.2 mm
100 g    3.7 mm        5.1 mm         6.4 mm         5.8 mm
200 g    3.9 mm        4.8 mm         6.5 mm         6.2 mm
300 g    5.7 mm        6.3 mm         6.7 mm         7.6 mm

Load     Circle, inverse kinematics    Viviani, inverse kinematics
0 g      33.2 mm                       52.5 mm
100 g    38.2 mm                       57.3 mm
200 g    42.6 mm                       61.7 mm
300 g    45.9 mm                       65.4 mm

Moreover, these simulation results highlight the significant advantages of the RRL algorithm in terms of learning efficiency. As shown in Fig. 8, the RRL algorithm substantially reduces training time and the number of iterations. Compared to traditional model-free RL methods that require 300,000 steps to train, the proposed RRL control framework only takes 30,000 steps, reducing training time by 90%. This efficient learning mechanism enables the RRL algorithm to achieve high-precision control while also effectively lowering computational complexity and resource consumption.

Fig. 8
Comparison of training steps between RL and RRL

5 Conclusion

This paper presented an RRL-based efficient modeling and control strategy for soft robotic arms, integrating an analytical framework with data-driven error modeling to build the virtual RL environment. An inverse kinematic model was developed to provide prior physics knowledge for RL training. Control simulations were performed in a virtual environment interacting with the hybrid model, validated through path-tracking simulations of two-dimensional circular and 3D Viviani trajectories.

The proposed method reduces path-tracking MAEs from 33.8–57.4 mm (inverse kinematic method) to 4.8–7.6 mm under varying load conditions in simulation. Additionally, the RRL method cuts training time by approximately 90% compared to model-free RL methods, achieving high control precision with significantly improved learning efficiency. Moreover, the shortened training duration could help mitigate material degradation, particularly creep effects caused by prolonged loading during training, potentially preserving actuator consistency and policy transferability. The RRL framework could also facilitate rapid parameter tuning, enabling efficient testing of different network architectures or reward functions without lengthy iteration cycles, and could enable quicker validation for real-world deployment, allowing for faster adaptation to dynamic environments. These benefits suggest the potential for enhanced deployment efficiency and long-term system robustness, highlighting the practical value of the proposed approach in soft robotic control.

In future work, we plan to extend and validate the proposed RRL framework in more complex real-world scenarios to further enhance its practical applicability. Specifically, we will integrate the RRL algorithm into more challenging robotic tasks, such as dexterous manipulation and adaptive locomotion, to improve its generalization ability across different control environments. Additionally, to bridge the gap between simulation and real-world applications, we will refine the Sim-to-Real transfer efficiency by leveraging domain adaptation techniques and real-time learning strategies. Instead of relying solely on pretrained models, we will incorporate online sensor feedback (e.g., pressure, strain, and end-effector positions) to dynamically adjust the policy, thereby enhancing the robustness of the system against external disturbances and actuation uncertainties.

Funding Data

  • National Natural Science Foundation of China (Grant No. 62203174; Funder ID: 10.13039/501100001809).

  • Fundamental Research Funds for the Central Universities (Grant No. 2024ZYGXZR028; Funder ID: 10.13039/501100012226).

  • Guangzhou Municipal Science and Technology Project (Grant No. 2025A04J5281; Funder ID: 10.13039/501100010256).

Data Availability Statement

The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.

References

1. Rus, D., and Tolley, M., 2018, "Design, Fabrication and Control of Origami Robots," Nat. Rev. Mater., 3(6), pp. 101–112. 10.1038/s41578-018-0009-8
2. McCann, L., Yan, L. L., Hassan, S., Garbini, J., and Devasia, S., 2025, "Active Data-Enabled Robot Learning of Elastic Workpiece Interactions," ASME J. Dyn. Syst., Meas., Control, 147(3), p. 031007. 10.1115/1.4066631
3. Zhou, Y., Headings, L. M., and Dapino, M. J., 2022, "Modeling of Fluidic Prestressed Composite Actuators With Application to Soft Robotic Grippers," IEEE Trans. Rob., 38(4), pp. 2166–2178. 10.1109/TRO.2021.3139770
4. Xu, Z., and Zhou, Y., 2024, "Bistable Composites With Intrinsic Pneumatic Actuation and Non-Cylindrical Curved Shapes," Mater. Lett., 354, p. 135381. 10.1016/j.matlet.2023.135381
5. Xu, Z., Hu, L., Xiao, L., Jiang, H., and Zhou, Y., 2024, "Modular Soft Robotic Crawlers Based on Fluidic Prestressed Composite Actuators," J. Bionic Eng., 21(2), pp. 694–706. 10.1007/s42235-024-00487-6
6. Yang, S. Y., Kim, K., Ko, J. U., Seo, S., Hwang, S. T., Park, J. H., Jung, H. S., et al., 2023, "Design and Control of Lightweight Bionic Arm Driven by Soft Twisted and Coiled Artificial Muscles," Soft Rob., 10(1), pp. 17–29. 10.1089/soro.2021.0058
7. Wang, M., Zhang, X., Zhang, M., Li, M., Zhang, C., and Jia, J., 2024, "Design of TCP-Actuator-Driven, Soft-Tendon-Integrated Anthropomorphic Dexterous Hand: Soroagilhand-1," Sens. Actuators, A, 378, p. 115760. 10.1016/j.sna.2024.115760
8. Zhou, Y., and Li, H., 2022, "A Scientometric Review of Soft Robotics: Intellectual Structures and Emerging Trends Analysis (2010–2021)," Front. Rob. AI, 9, p. 868682. 10.3389/frobt.2022.868682
9. Hannan, M. W., and Walker, I. D., 2003, "Kinematics and the Implementation of an Elephant's Trunk Manipulator and Other Continuum Style Robots," J. Rob. Syst., 20(2), pp. 45–63. 10.1002/rob.10070
10. Jones, B., and Walker, I., 2006, "Practical Kinematics for Real-Time Implementation of Continuum Robots," IEEE Trans. Rob., 22(6), pp. 1087–1099. 10.1109/TRO.2006.886268
11. Liu, Y., Shi, W., Chen, P., Cheng, L., Ding, Q., and Deng, Z., 2023, "Variable Curvature Modeling Method of Soft Continuum Robots With Constraints," Chin. J. Mech. Eng., 36(1), p. 148. 10.1186/s10033-023-00967-6
12. Gong, Z., Fang, X., Chen, X., Cheng, J., Xie, Z., Liu, J., Chen, B., et al., 2021, "A Soft Manipulator for Efficient Delicate Grasping in Shallow Water: Modeling, Control, and Real-World Experiments," Int. J. Rob. Res., 40(1), pp. 449–469. 10.1177/0278364920917203
13. Camarillo, D., Carlson, C., and Salisbury, J., 2009, "Task-Space Control of Continuum Manipulators With Coupled Tendon Drive," The Eleventh International Symposium, Athens, Greece, July 13–15, pp. 271–280. 10.1007/978-3-642-00196-3_32
14. Bajo, A., Goldman, R., and Simaan, N., 2011, "Configuration and Joint Feedback for Enhanced Performance of Multi-Segment Continuum Robots," IEEE International Conference on Robotics and Automation, Shanghai, China, May 9–13, pp. 2905–2912. 10.1109/ICRA.2011.5980005
15. Mahl, T., Mayer, A., Hildebrandt, A., and Sawodny, O., 2013, "A Variable Curvature Modeling Approach for Kinematic Control of Continuum Manipulators," Proceedings of the American Control Conference, Washington, DC, June 17–19, pp. 4945–4950. 10.1109/ACC.2013.6580605
16. Mahl, T., Hildebrandt, A., and Sawodny, O., 2014, "A Variable Curvature Continuum Kinematics for Kinematic Control of the Bionic Handling Assistant," IEEE Trans. Rob., 30(4), pp. 935–949. 10.1109/TRO.2014.2314777
17. Webster, R., III, Swensen, J., Romano, J., and Cowan, N., 2009, "Closed-Form Differential Kinematics for Concentric-Tube Continuum Robots With Application to Visual Servoing," The Eleventh International Symposium, Athens, Greece, July 13–15, pp. 485–494. 10.1007/978-3-642-00196-3_56
18. Bailly, Y., and Amirat, Y., 2005, "Modeling and Control of a Hybrid Continuum Active Catheter for Aortic Aneurysm Treatment," IEEE International Conference on Robotics and Automation, Barcelona, Spain, Apr. 18–22, pp. 924–929. 10.1109/ROBOT.2005.1570235
19. Jang, J. H., Jamil, B., Moon, Y., Coutinho, A., Park, G., and Rodrigue, H., 2023, "Design of Gusseted Pouch Motors for Improved Soft Pneumatic Actuation," IEEE/ASME Trans. Mechatron., 28(6), pp. 3053–3063. 10.1109/TMECH.2023.3244347
20. Stella, F., Guan, Q., Della Santina, C., and Hughes, J., 2023, "Piecewise Affine Curvature Model: A Reduced-Order Model for Soft Robot-Environment Interaction Beyond PCC," IEEE International Conference on Soft Robotics (RoboSoft), Singapore, Apr. 3–7, pp. 1–7. 10.1109/RoboSoft55895.2023.10121939
21. Xie, Q., Wang, T., Yao, S., Zhu, Z., Tan, N., and Zhu, S., 2020, "Design and Modeling of a Hydraulic Soft Actuator With Three Degrees of Freedom," Smart Mater. Struct., 29(12), p. 125017. 10.1088/1361-665X/abc26e
22. Thuruthel, T., Falotico, E., Renda, F., and Laschi, C., 2019, "Model-Based Reinforcement Learning for Closed-Loop Dynamic Control of Soft Robotic Manipulators," IEEE Trans. Rob., 35(1), pp. 124–134. 10.1109/TRO.2018.2878318
23. Kim, M., Seo, J., Lee, M., and Choi, J., 2021, "Vision-Based Uncertainty-Aware Lane Keeping Strategy Using Deep Reinforcement Learning," ASME J. Dyn. Syst., Meas., Control, 143(8), p. 084503. 10.1115/1.4050396
24. Bhagat, S., Banerjee, H., Tse, Z., and Ren, H., 2019, "Deep Reinforcement Learning for Soft, Flexible Robots: Brief Review With Impending Challenges," Robotics, 8(1), p. 4. 10.3390/robotics8010004
25. Wu, Q., Gu, Y., Li, Y., Zhang, B., Chepinskiy, S., Wang, J., Zhilenkov, A., Krasnov, A., and Chernyi, S., 2020, "Position Control of Cable-Driven Robotic Soft Arm Based on Deep Reinforcement Learning," Information, 11(6), p. 310. 10.3390/info11060310
26. Benyamen, H., Chowdhury, M., and Keshmiri, S., 2024, "Data-Driven Aircraft Modeling for Robust Reinforcement Learning Control Synthesis With Flight Test Validation," ASME J. Dyn. Syst., Meas., Control, 146(6), p. 061105. 10.1115/1.4065804
27. Banerjee, C., Nguyen, K., Fookes, C., and Raissi, M., 2025, "A Survey on Physics Informed Reinforcement Learning: Review and Open Problems," Expert Syst. Appl., 287, p. 128166. 10.1016/j.eswa.2025.128166
28. Eßer, J., Bach, N., Jestel, C., Urbann, O., and Kerner, S., 2023, "Guided Reinforcement Learning: A Review and Evaluation for Efficient and Effective Real-World Robotics [Survey]," IEEE Rob. Autom. Mag., 30(2), pp. 67–85. 10.1109/MRA.2022.3207664
29. Honji, S., Arita, H., and Tahara, K., 2023, "Stochastic Approach for Modeling Soft Fingers With Creep Behavior," Adv. Rob., 37(22), pp. 1471–1484. 10.1080/01691864.2023.2279600
30. Kim, D., Kim, S.-H., Kim, T., Kang, B. B., Lee, M., Park, W., Ku, S., et al., 2021, "Review of Machine Learning Methods in Soft Robotics," PLoS One, 16(2), p. e0246102. 10.1371/journal.pone.0246102
31. Wang, C., Lin, Z., Liu, B., Su, C., Chen, G., and Xie, L., 2024, "Task Attention-Based Multimodal Fusion and Curriculum Residual Learning for Context Generalization in Robotic Assembly," Appl. Intell., 54(6), pp. 4713–4735. 10.1007/s10489-024-05417-x
32. Lou, G., Wang, C., Xu, Z., Liang, J., and Zhou, Y., 2024, "Controlling Soft Robotic Arms Using Hybrid Modelling and Reinforcement Learning," IEEE Rob. Autom. Lett., 9(8), pp. 7070–7077. 10.1109/LRA.2024.3418312
33. Polygerinos, P., Wang, Z., Overvelde, J. T., Galloway, K. C., Wood, R. J., Bertoldi, K., and Walsh, C. J., 2015, "Modeling of Soft Fiber-Reinforced Bending Actuators," IEEE Trans. Rob., 31(3), pp. 778–789. 10.1109/TRO.2015.2428504
34. Ogden, R. W., 1997, Non-Linear Elastic Deformations, Courier Corporation, Mineola, NY.
35. Webster, R. J., III, and Jones, B. A., 2010, "Design and Kinematic Modeling of Constant Curvature Continuum Robots: A Review," Int. J. Rob. Res., 29(13), pp. 1661–1683. 10.1177/0278364910368147
36. Siciliano, B., 1990, "Kinematic Control of Redundant Robot Manipulators: A Tutorial," J. Intell. Rob. Syst., 3(3), pp. 201–212. 10.1007/BF00126069