Actor-Critic Methods
• The actor-critic method is a reinforcement learning approach that combines two components: the actor, which decides which actions to take (the policy), and the critic, which evaluates the actions taken by estimating the value function. The actor updates the policy based on feedback from the critic, creating a balance between exploration and optimization.
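A minimal sketch of a single actor-critic update step, assuming PyTorch with a small discrete-action policy; the network sizes, learning rates, and the helper name update are illustrative assumptions, not part of these slides:

    import torch
    import torch.nn as nn

    obs_dim, act_dim = 4, 2
    actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))  # policy logits
    critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))       # state value V(s)
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

    def update(state, action, reward, next_state, done, gamma=0.99):
        state = torch.as_tensor(state, dtype=torch.float32)
        next_state = torch.as_tensor(next_state, dtype=torch.float32)

        # Critic: one-step TD target and TD error (used as an advantage estimate).
        with torch.no_grad():
            target = reward + gamma * (1.0 - done) * critic(next_state)
        td_error = target - critic(state)
        critic_loss = td_error.pow(2).mean()
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        # Actor: push the policy toward actions the critic's feedback favours.
        log_prob = torch.distributions.Categorical(logits=actor(state)).log_prob(torch.as_tensor(action))
        actor_loss = -(log_prob * td_error.detach()).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

Here the critic's TD error is the feedback signal that scales the actor's policy-gradient step.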
Deep Deterministic Policy Gradient (DDPG)
Deep Deterministic Policy Gradient (DDPG) is a reinforcement learning algorithm that combines ideas from Q-learning (estimating the value of actions) and policy gradients (directly optimizing actions) to learn both the best actions and their value simultaneously.
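A minimal sketch of one DDPG update on a batch of replay-buffer transitions, again assuming PyTorch; the network sizes, learning rates, and the column-vector shapes assumed for r and done are illustrative, not taken from these slides:

    import copy
    import torch
    import torch.nn as nn

    obs_dim, act_dim = 3, 1
    actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())  # deterministic policy mu(s)
    critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))        # action value Q(s, a)
    actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

    def ddpg_update(s, a, r, s2, done, gamma=0.99, tau=0.005):
        # Critic (Q-learning side): regress Q(s, a) toward a bootstrapped target
        # computed with the slowly moving target networks.
        with torch.no_grad():
            q_targ = r + gamma * (1 - done) * critic_targ(torch.cat([s2, actor_targ(s2)], dim=-1))
        critic_loss = (critic(torch.cat([s, a], dim=-1)) - q_targ).pow(2).mean()
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        # Actor (policy-gradient side): move the deterministic policy toward
        # actions that the critic currently values highly.
        actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # Polyak averaging keeps the target networks trailing the online networks.
        for net, targ in ((actor, actor_targ), (critic, critic_targ)):
            for p, p_targ in zip(net.parameters(), targ.parameters()):
                p_targ.data.mul_(1 - tau).add_(tau * p.data)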
Energy-Efficient Trajectory Planning
• Energy-efficient trajectory planning is possible using the Deep Deterministic Policy Gradient (DDPG) algorithm for training. It involves parallel training by dividing the robot's dynamic model into submodules, facilitating faster training while obtaining high accuracy.
• This approach achieves significant energy savings (a 23.21% reduction compared to default trajectories) while eliminating the heavy computations involved in traditional nonlinear methods.
• The main advantage of this method is that it achieves real-time trajectory generation, in contrast with slower traditional optimization techniques such as genetic algorithms or dynamic programming.
Proximal Policy Optimization (PPO)
• Proximal Policy Optimization (PPO) is a reinforcement learning algorithm designed to optimize policies efficiently and reliably by improving upon Trust Region Policy Optimization (TRPO).
• PPO uses a clipped objective function to limit the size of policy updates, ensuring they stay within a safe range without requiring the computational complexity of TRPO's trust region constraints.
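A minimal sketch of PPO's clipped surrogate loss, assuming PyTorch; the log-probabilities and advantage estimates are assumed to come from a collected rollout, and clip_eps = 0.2 is a commonly used default rather than a value stated on these slides:

    import torch

    def ppo_clipped_loss(new_logp, old_logp, advantages, clip_eps=0.2):
        # Ratio of new to old policy probabilities for the sampled actions.
        ratio = torch.exp(new_logp - old_logp)
        # Clipping the ratio keeps each update inside a safe range around the old
        # policy, avoiding TRPO's explicit trust-region machinery.
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()

Taking the element-wise minimum makes the objective pessimistic: the policy gains nothing from pushing the ratio outside the clipped range.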
Robotic arm trajectory tracking method based on improved proximal policy
optimization
• In the study of trajectory tracking for robotic arms, traditional tracking methods have low accuracy and cannot realize complex tracking tasks.
• Compared with traditional methods, deep reinforcement learning is an effective scheme, with the advantages of robustness and the ability to solve complex problems.
Contd.
• If the step size is too large, the result is jittery and does not converge. The PPO algorithm uses the ratio of the new and old policies, which solves the problem that the learning rate is difficult to determine in the policy gradient (PG) algorithm. To improve the robustness of the tracking algorithm, the PPO algorithm is improved based on a stable policy gradient.
[Figure: (a) expected trajectory; (b) actual trajectory of the robotic arm]
• The solid blue line in Fig (a) is the expected trajectory of the robotic arm. The solid red line in Fig (b) shows the actual trajectory of the robotic arm.
• The simulation results show that the Improved-PPO algorithm outperforms the A3C
  and PPO algorithms for robotic arm trajectory tracking.
Trust Region Policy Optimization (TRPO)
• Trust Region Policy Optimization (TRPO) is a reinforcement learning algorithm that optimizes policies by constraining the step size during updates to ensure stable and reliable learning. It uses a trust region constraint to prevent the policy from changing too drastically, maintaining a balance between exploration and exploitation, as sketched below.
• Using TRPO, industrial robots can improve their decision-making in complex scenarios, such as assembly or material handling, while ensuring performance consistency and energy efficiency.
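A minimal sketch of the two quantities TRPO balances at each update (the surrogate objective and the trust-region KL term), assuming PyTorch and discrete action probabilities; the function name and the example value of delta are illustrative assumptions:

    import torch

    def trpo_surrogate_and_kl(new_logp, old_logp, advantages, old_probs, new_probs):
        # Surrogate objective: advantages weighted by the ratio of new to old policy.
        surrogate = (torch.exp(new_logp - old_logp) * advantages).mean()
        # Trust-region term: mean KL divergence between the old and new policies.
        kl = (old_probs * (old_probs.log() - new_probs.log())).sum(dim=-1).mean()
        return surrogate, kl

    # TRPO accepts a candidate step only while kl <= delta (e.g. delta = 0.01); in
    # practice the step is found with conjugate gradient and a backtracking line search.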
Complex Robot Manipulation Tasks Based On Hindsight Trust Region Policy Optimization
• In this experiment, the manipulator is placed in four challenging sparse-reward environments covering two types of tasks: a reaching task with obstacles, and three dynamic-object tasks. Both types of tasks are goal-conditioned, which means the robot receives a goal observation at every time step.
• The results show that HTRPO (Hindsight Trust Region Policy Optimization) achieves a higher success rate and better stability than HPG and TRPO on most of the tasks.
Thank You