Detailed Description
The embodiment of the application provides a method, a system and a storage medium for controlling a mechanical arm based on multi-mode driving. The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a specific flow of an embodiment of the present application is described below, referring to fig. 1, and an embodiment of a method for controlling a mechanical arm based on multi-mode driving in an embodiment of the present application includes:
Step S101, acquiring data through a sensor module on a mechanical arm and preprocessing the data to obtain a multi-mode data set, wherein the multi-mode data set comprises visual data, force-touch data, joint encoder data and inertia measurement unit data;
It can be understood that the execution body of the present application may be a multi-mode driving-based mechanical arm control system, and may also be a terminal or a server, which is not limited herein. The embodiment of the application is described by taking a server as an execution main body as an example.
Specifically, a high-speed camera is utilized to shoot the working space of the mechanical arm in real time, and an image stream and a video stream are obtained. These raw visual data typically contain motion blur and therefore need to be processed using a deblurring algorithm. Specifically, a convolutional neural network-based deblurring method, such as DeblurGAN, can be used, which can effectively remove motion blur and improve image quality by generating a mapping relationship from the anti-network learning blur to the sharpness. At the same time, depth cameras are used to acquire three-dimensional point cloud data of the workspace, which provides important information for subsequent environmental modeling and obstacle detection. In terms of force-touch data acquisition, a force-touch sensor array mounted on a robotic arm end effector records pressure distribution, friction and slip information upon contact with an object at a high sampling rate (typically 1 kHz). These data are critical for accurate control of gripping force and judgment of object stability. Meanwhile, the high-precision encoder of each joint of the mechanical arm acquires joint angle, angular velocity and angular acceleration information in real time, and provides basic data for kinematics and dynamics calculation. In addition, an Inertial Measurement Unit (IMU) mounted at a critical portion of the robotic arm collects linear acceleration and angular velocity data that, when fused with encoder data, can provide a more accurate pose estimate.
In order to adapt to different working environments, the mechanical arm is also provided with a temperature sensor and a humidity sensor for acquiring environmental parameter data. These data may be used to adjust control parameters to accommodate the effects of temperature and humidity changes on the performance of the robotic arm. All of this raw data needs to be denoised to reduce the effects of sensor noise and environmental interference. For visual data, wavelet transformation is adopted to remove noise, so that image edge information while can be effectively reserved to remove high-frequency noise. The force-sense of touch data and IMU data can then be noise suppressed using a Kalman filter, which can comprehensively consider the system model and measurement noise to provide an optimal estimate. In the data preprocessing stage, a median filter is used for outlier detection and rejection. This method is particularly effective for removing transient spike noise, and can keep the overall trend of the data while rejecting outliers that deviate significantly. Finally, a time alignment process is required to take into account that the sampling rate of the different sensors may be different. The Dynamic Time Warping (DTW) algorithm plays an important role here, and can adaptively align time series data with different lengths and sampling rates, so as to ensure consistency of multi-mode data in a time dimension.
Through the above steps, the resulting multimodal dataset includes pre-processed visual data, force-haptic data, joint encoder data, and inertial measurement unit data. The data lays a foundation for subsequent feature extraction and fusion, and ensures that a control system can obtain high-quality and synchronous multi-mode sensing information.
For example, in a scenario where a robotic arm performs a precision assembly task, a high speed camera captures images of the work area at a rate of 120 frames/second, with slight motion blur in the original image. After DeblurGAN processing, the image sharpness is improved by about 40%, and the edge sharpness is improved by 25%. The resolution of the point cloud data acquired by the depth camera at the same time is 640x480, and each frame comprises about 30 ten thousand three-dimensional points on average. The force-touch sensor records force data of the end effector in contact with the part at a sampling rate of 1kHz, with approximately + 0.05N random noise present in the raw data. After Kalman filtering, the noise level is reduced to +/-0.01N, and the accuracy of force feedback is remarkably improved. The joint encoder collects data at a frequency of 500Hz, with an angular resolution of up to 0.01 degrees. The IMU data sampling rate was 200Hz, with the raw data having acceleration noise of about + -0.1 m/s2 and angular velocity noise of + -0.05 DEG/s. After the filtering treatment, the noise is reduced to +/-0.02 m/s2 and +/-0.01 degree/s respectively. The environmental sensor records a working area temperature of 23 + -0.5 deg.c and a relative humidity of 45 + -2%. After time alignment processing is carried out on all the data, a unified multi-mode data set is formed, the sampling rate is unified to be 100Hz, and abundant and accurate input information is provided for subsequent feature extraction and control decisions.
Step S102, extracting features of the multi-mode data set and carrying out feature fusion through an attention mechanism to obtain a fusion feature vector;
Specifically, the spatial pyramid pooling process is performed on the visual data, the process divides the image into grids with different scales, and pooling operation is performed in each grid, so that a multi-scale visual characteristic diagram is obtained. Specifically, spatial pyramid pooling typically includes three levels, 1x1, 2x2, and 4x4, of meshing, with maximum pooling operations within each mesh. The method can capture the global and local features of the image and improve the scale invariance of the features. And carrying out self-adaptive feature selection on the multi-scale visual feature map. This step employs a feature selection algorithm based on a channel attention mechanism, such as the Squeeze-and-Excitation (SE) module. The SE module carries out global average pooling on the feature map to obtain a channel descriptor, learns the correlation between channels through two layers of fully-connected networks, and finally re-weights the original feature map by using the learned weight. This process can adaptively emphasize important visual features, suppress non-important features, and thus yield key visual feature descriptors.
For force-haptic data, time-frequency analysis was performed using wavelet packet transforms. The wavelet packet transform is an extension of the wavelet transform, which performs multi-layer decomposition on a signal to decompose not only a low frequency part but also a high frequency part, thus enabling finer time-frequency resolution. In the method, daubechies wavelet basis is used for 5-layer decomposition, and the time-frequency characteristics of 32 frequency bands are obtained. Then, principal Component Analysis (PCA) is performed on these time-frequency features, preserving the principal components that explain the 95% variance, resulting in a compressed pressure distribution and texture representation. The joint encoder data and the inertia measurement unit data are calculated through kinematic forward and backward solutions to obtain pose characteristics of the mechanical arm. The forward kinematics uses DH parameter method to calculate the position and posture of the end effector, while the reverse kinematics adopts analytic method and numerical iteration method to solve the joint angle. And then, establishing a dynamic model of the mechanical arm by using a Lagrangian equation, and obtaining the dynamic characteristics of the mechanical arm, including joint moment, inertia matrix, coriolis force and the like, through differential dynamic modeling.
In order to align the feature sequences of different modalities, a Dynamic Time Warping (DTW) algorithm is employed. DTW finds the best alignment path between two time series by dynamic programming, and can handle sequences of different lengths and sampling rates. In the method, the characteristic sequences of other modes are aligned with the visual characteristic sequences serving as references. And resampling the aligned characteristic sequences through cubic spline interpolation to obtain characteristic data with uniform sampling rate. The feature data of the uniform sampling rate is importance weighted by the cross-modal attention network. The network adopts a transducer architecture, and comprises a multi-head self-attention mechanism and a cross-mode attention mechanism. The self-attention mechanism is used to capture time-series dependency within a single modality, while the cross-modality attention mechanism is used to learn correlations between different modalities. Specifically, the features of each modality are first calculated through the self-attention layer, then cross-modal attention is calculated with the features of other modalities, and finally weighted feature representation is obtained through the feedforward neural network.
The weighted feature representation is subjected to multi-level cascade fusion to form a final fusion feature vector. Cascade fusion includes three levels, early fusion, mid fusion and late fusion. Early fusion directly splices low-level features of different modes, middle fusion adopts a convolutional neural network to fuse the middle-level features, and late fusion adopts a full-connection layer to fuse the high-level features. The multi-level fusion strategy can fully utilize the characteristic information of different levels, and improves the richness and the robustness of characteristic representation.
For example, in a precision assembly task, the robotic arm requires insertion of a microelectronic element into the circuit panel. Firstly, visual data is subjected to spatial pyramid pooling to obtain feature maps of three scales of 1x1, 2x2 and 4x4, wherein each feature map comprises 256 channels. By adaptive feature selection of the SE module, the importance weights of 256 channels are adjusted, with the most important channel weight reaching 0.95 and the least important channel weight falling to 0.05. The force-touch data is subjected to wavelet packet transformation to obtain time-frequency characteristics of 32 frequency bands, and the first 10 principal components are reserved after principal component analysis, so that a variance of 97.8% is explained. The joint encoder and IMU data are calculated through kinematics to obtain the position accuracy of the end effector reaching +/-0.01 mm, and the attitude accuracy is +/-0.1 degrees. The joint moment estimation error obtained by dynamic modeling is less than 2%. The DTW algorithm aligns the feature sequences of the different modalities, reducing the maximum time offset from the original 200ms to 5ms. In a cross-modal attention network, the importance score of a key frame in a self-attention mechanism capturing visual feature sequence reaches 0.8, and the importance score of a common frame is only 0.2. Cross-modal attention shows that at the critical moment of the insertion operation, the weight of the force-haptic feature rises to 0.6, while the visual feature weight falls to 0.3. After multi-level cascade fusion, a 1024-dimensional fusion feature vector is finally obtained, wherein the 1024-dimensional fusion feature vector contains comprehensive representation of visual, force touch, kinematic and dynamic information. The fusion feature vector provides rich and accurate input information for subsequent state estimation and control decisions, so that the mechanical arm can complete the task of inserting the micro-element with the precision of 0.1mm and the force control precision of 0.5N.
Step S103, estimating state data of the fusion feature vector through an unscented Kalman filter, and carrying out state prediction through a recurrent neural network to obtain the current state of the mechanical arm and the predicted state of the mechanical arm;
Specifically, joint space mapping is performed on the fusion feature vectors, and the high-dimensional feature vectors are converted into initial joint state estimation. This is accomplished by a fully connected neural network that maps 1024-dimensional fusion feature vectors to the joint space of the robotic arm, resulting in an initial state vector that contains joint angle, angular velocity, and angular acceleration. And performing forward kinematics calculation on the initial joint state estimation by using a kinematics model of the mechanical arm to obtain pose estimation of the end effector. This step uses DH parametric methods to transform the joint angle into the position and pose of the end effector in Cartesian space through a continuous matrix transformation. Meanwhile, the binding force-touch data is used for estimating the position of the target object, and the process utilizes force feedback information when the end effector is in contact with the object, and the centroid coordinates and the gesture of the object are estimated through a least square method by combining the target object profile information in the visual data.
The initial joint state estimate, end effector pose estimate, and target object position estimate are combined together to form a complete system state vector. The state vector contains the internal state (joint state) and the external state (position relation between the end effector and the target object) of the mechanical arm, and provides a comprehensive information basis for subsequent state estimation and prediction. Unscented Kalman Filter (UKF) is a state estimation method suitable for nonlinear systems that approximates a state distribution by a carefully selected set of sampling points, called sigma points. In the method, UKF is realized by firstly sampling a system state vector through unscented transformation to generate 2n+1 sigma points, wherein n is the dimension of the state vector. Each sigma point contains one possible value of the state variable. Then, nonlinear state propagation is carried out on the sigma points through a system model, and prior state estimation is obtained. The dynamic model of the mechanical arm is considered in the process, and the dynamic model comprises factors such as joint moment, inertia matrix, coriolis force and the like.
And updating the prior state estimation by using the fusion feature vector as a measurement input. UKF calculates the predicted measurement value corresponding to each sigma point, and then compares the predicted measurement values with the actual fusion feature vector to calculate the Kalman gain. And finally, correcting the prior estimation by using Kalman gain to obtain the current state estimation of the mechanical arm. The process can effectively fuse the information of the multi-mode sensor, and improves the accuracy and the robustness of state estimation. For state prediction, the method employs Recurrent Neural Networks (RNNs), in particular long and short term memory networks (LSTM). LSTM is an RNN variant that can effectively handle long-term dependencies, and controls the flow of information and the updating of memory through gating mechanisms. Specifically, the LSTM network receives as inputs the current state of the robotic arm and the adaptive dynamics model, and outputs a state prediction sequence over a period of time in the future.
The training process of the LSTM network is as follows, first, a large amount of historical state sequence data is collected, including robot arm motion data under different tasks and environmental conditions. Then, these data are divided into an input sequence containing state information of t time steps and a target sequence containing state information of k time steps in the future in time sequence. The LSTM network is trained using a Back Propagation Through Time (BPTT) algorithm, minimizing the mean square error between the predicted and actual values. After training, the LSTM network can predict the state change of the mechanical arm in a plurality of time steps in the future according to the current state and the dynamic model.
For example, in a scenario where a 6-degree-of-freedom mechanical arm performs a precision assembly task, the fusion feature vector is mapped in joint space to obtain an initial joint state estimate, where the initial joint state estimate includes angles, angular velocities, and angular accelerations of 6 joints, and the total number of variables is 18. And the kinematic positive solution is calculated to obtain the position accuracy of the end effector of +/-0.05 mm and the gesture accuracy of +/-0.1 degree. The accuracy of the target object position estimation reaches + -0.1 mm. The unscented Kalman filter uses 13 sigma points (2 x 6+1, state vector dimension 6) for state estimation. The diagonal elements of the initial state covariance matrix are set to [0.01, 0.01, 0.01, 0.001, 0.001, 0.001], indicating the initial uncertainty for position and pose. The diagonal elements of the process noise covariance matrix Q are set to [1e-6, 1e-6, 1e-6, 1e-4, 1e-4, 1e-4], and the diagonal elements of the measurement noise covariance matrix R are set to [1e-4, 1e-4, 1e-4, 1e-3, 1e-3, 1e-3]. After UKF processing, the precision of state estimation is improved to the position + -0.02 mm and the posture + -0.05 degree.
LSTM networks employ a 3-layer structure, each layer containing 128 hidden units. The input sequence length is 20 time steps (corresponding to 0.2 seconds of data, assuming a sampling rate of 100 Hz), and the output predicts the state of 10 time steps in the future. The network was trained on a data set containing 100 tens of thousands of samples for 50 rounds using an Adam optimizer with a learning rate of 0.001. After training is completed, the LSTM network can predict the state of the robot arm within 0.1 seconds in the future with an R2 score of 0.98.
In actual operation, when the robotic arm needs to insert a precision part into the hole, the UKF first estimates the current state, LSTM then predicts the next state change within 0.1 seconds. For example, during insertion, the UKF estimates that the current end effector is still 2.5mm from the target orifice with a pose deviation of 0.8 degrees. LSTM predicts that the end effector will move 2.3mm and the attitude bias will decrease to 0.3 degrees within 0.1 seconds of the future. The accurate state estimation and prediction information provides important basis for subsequent track planning and control strategy generation, so that the mechanical arm can complete the insertion task with sub-millimeter precision.
Step S104, carrying out planning analysis on the running track of the mechanical arm on the current state of the mechanical arm and the predicted state of the mechanical arm to obtain a target execution track;
Specifically, task decomposition is performed on the current state of the mechanical arm, and a complex operation task is decomposed into a series of basic action units. The process adopts a hierarchical task planning method to gradually refine a high-level task target into executable basic actions. For example, a task of "picking up an object and placing it" can be decomposed into basic action units such as "moving to the vicinity of the object", "adjusting the posture", "grabbing the object", "moving to the target position", and "releasing the object".
Then, these basic action units are symbolized to obtain a high-level task execution plan. Symbolizing means that predicate logic is used to represent each action unit as an operator with a pre-condition and a post-effect. For example, a "grab Object" operation may be expressed as GRASP (Object):
Precondition:At(HandPosition,ObjectPosition);HandEmpty ;
Effect: Holding(Object) ; ¬HandEmpty
In this representation:
GRASP (Object) is the operation name, where Object is the Object to be grabbed.
Precondition lists the preconditions required to perform this operation:
at (HandPosition, objectPosition) indicates that the position of the manipulator is the same as the object position.
HANDEMPTY indicates that the manipulator is currently empty (not holding other objects).
Effect describes the result after an operation is performed:
Holding (Object) indicates that the manipulator now grabs the object.
¬ HANDEMPTY indicates that the manipulator is no longer empty (¬ indicates negative).
This symbolized representation facilitates understanding and processing of task requirements by subsequent planning algorithms.
And then, performing three-dimensional reconstruction by utilizing the visual data in the multi-mode data set to obtain dynamic environment point cloud data. This step uses structured light or time of flight (ToF) depth cameras, combined with SLAM (synchronous localization and mapping) techniques, to generate a high-precision three-dimensional environmental model. The point cloud data typically contains hundreds of thousands of three-dimensional points, each point containing spatial coordinates (x, y, z) and possibly color information (R, G, B). And carrying out probability occupation analysis on the obtained dynamic environment point cloud data to generate a spatial obstacle distribution map. This process uses an Octree (Octree) data structure to efficiently represent and update occupancy probabilities in three-dimensional space. Each voxel (voxel) has an occupancy probability value ranging from 0 (free) to 1 (occupied). These probability values are updated continuously according to the new sensor data by means of bayesian update rules.
In order to process moving obstacles in a dynamic environment, time sequence prediction is carried out on a spatial obstacle distribution map, and dynamic obstacle avoidance constraint data are obtained. The Kalman filter is used to predict the trajectory of a moving object, typically in the range of 0.5-1 second into the future. The prediction results are used to update the space occupation probability at future time points. And performing kinematic forward solution on the mechanical arm prediction state and the task execution plan to obtain an end effector prediction track. The DH parameter and joint angle prediction sequence of the mechanical arm are utilized in the step, and the position and posture sequence of the end effector in Cartesian space is calculated.
Projecting the predicted trajectory of the end effector onto an image plane to obtain the predicted trajectory of the image feature. This process converts the trajectory points in three-dimensional space into two-dimensional image coordinates using the camera's inside and outside parameter matrix. This representation facilitates direct comparison and optimization with visual feedback.
And then, sampling and optimizing the image characteristic prediction track and the dynamic obstacle avoidance constraint data by adopting a rapid expansion random tree (RRT) algorithm to obtain a candidate track set. The RRT algorithm randomly samples in the configuration space, quickly explores a feasible path, and simultaneously ensures progressive optimality. In the method, the sampling strategy of the RRT is modified to be biased to the area near the predicted track of the image characteristic, so that the planning efficiency is improved.
And performing multi-objective evaluation on the obtained candidate track set and the task execution plan to obtain the pareto optimal track set. The evaluation index includes track length, energy consumption, smoothness, task completion, and the like. The multi-objective optimization adopts NSGA-II (Non-dominated Sorting Genetic Algorithm II) algorithm, and the pareto front is searched by means of genetic algorithm to obtain a group of optimal solutions for weighing among different objectives. And finally, carrying out dynamic constraint analysis on the pareto optimal track set and the current state of the mechanical arm to obtain a target execution track. This step takes into account the maximum joint moment, speed and acceleration limits of the robotic arm, ensuring that the generated trajectory is physically executable. Dynamic constraint analysis uses inverse dynamic calculations to calculate the required joint moment for each trace point and checks if the limit is exceeded.
For example, in a task where a robot arm is required to take a precision part from a table and insert it into a product on an assembly line, the task is first broken down into 5 basic action units, moving over the part, adjusting the gripping attitude, gripping the part, moving over the target product, and performing the insertion operation. Each action unit is symbolized, for example, the grabbing operation is represented as GRASP (Part 1), the pre-conditions include HandAbove (Part 1) and HANDEMPTY, and the post-effect is Holding (Part 1). The environment is reconstructed three-dimensionally to obtain point cloud data containing 500,000 points, and the resolution is 1mm. Probability occupancy analysis divides the workspace into a grid of 100x100x100, each voxel size 1cm3. Dynamic obstacle prediction shows that a human hand will enter the work area after 0.5 seconds, with an occupancy probability rising from 0 to 0.8.
The robot arm predicted state sequence contains 100 time steps of 10ms each. The predicted trajectory of the end effector forms a smooth curve in Cartesian space, having a length of approximately 50cm. The trajectory is projected onto an image plane at 640x480 resolution, resulting in a series of two-dimensional coordinate points. The RRT algorithm generates 10,000 sample points in the configuration space, resulting in 50 candidate trajectories. The multi-objective evaluation considers track length (weight 0.3), energy consumption (weight 0.3), smoothness (weight 0.2), and task completion (weight 0.2). And carrying out 100 generations of evolution on the NSGA-II algorithm to finally obtain 10 pareto optimal tracks.
Kinetic constraint analysis showed that of these 10 trajectories, 3 exceeded the joint velocity limit (maximally 5%) at some points and 2 exceeded the moment limit (maximally 3%) at some points. The final selected target execution trace length was 52cm, the predicted energy consumption was 120J, the smoothness index was 0.95 (full 1), and the task completion was 1. The trajectory satisfies the kinetic constraints at all time points, with a maximum joint speed of 80% of the nominal speed and a maximum moment of 75% of the nominal moment. This trajectory is discretized into 200 path points, each containing 6 joint angle values, forming a 6x200 matrix as input to the subsequent control algorithm.
Step 105, generating a control strategy for the target execution track through an event-triggered distributed coordination control algorithm, obtaining a real-time joint control instruction sequence, and controlling the mechanical arm to operate through the real-time joint control instruction sequence.
Specifically, the target execution trajectory is discretized and the continuous trajectory is converted into a series of critical path points. This process uses an equal time interval sampling method, typically at 10ms intervals, to discretize the trajectory into a series of time-position pairs. And performing inverse kinematics calculation on the key path points to obtain an initial joint angle sequence. Inverse kinematics calculation uses an analytical approach in combination with numerical iterations to convert end effector positions in cartesian space to angular values in joint space. And dynamically compensating the initial joint angle sequence to obtain a compensated joint angle sequence. The dynamic compensation takes the mass distribution, inertia characteristics and gravity influence of the mechanical arm into consideration, and the moment required by each joint is calculated by using a Newton-Euler method. The compensation process adjusts the joint angle to meet the dynamics constraint by reverse dynamics calculation, ensuring that the trajectory is physically executable.
And then, carrying out event triggering condition construction on the compensated joint angle sequence and the fusion feature vector to obtain a triggering threshold set. The event triggering condition is defined based on a state error and a control error, for example, triggering a control update when the joint angle error exceeds a preset threshold (e.g., 0.1 degrees) or the end effector position error exceeds a certain value (e.g., 1 mm). The design of the trigger threshold set takes into account the importance and task requirements of the different joints, typically the trigger threshold of the critical joint (e.g. the end joint) will be set smaller to ensure a higher control accuracy. And carrying out distributed configuration on the trigger threshold set to obtain independent trigger conditions of each joint of the mechanical arm. The distributed configuration considers the coupling relation and task characteristics among joints, establishes a joint network by using a graph theory method, and distributes trigger conditions through a consistency protocol. For example, adjacent joints may share certain triggering conditions to achieve coordinated control.
And performing reinforcement learning on the compensated joint angle sequence and the independent triggering condition by using a soft actor-critic algorithm to obtain an optimal control strategy. The soft actor-critique algorithm is a deep reinforcement learning-based method that includes an actor network (for generating actions) and a critique network (for evaluating action value). In the learning process, the actor network outputs joint control instructions, and the critic network evaluates the effects of the instructions and continuously optimizes the control strategy through back propagation. Learning the targets includes minimizing trajectory tracking errors, energy consumption, and trigger frequency. And carrying out multi-sampling rate decomposition on the optimal control strategy and the trigger threshold set to obtain a hierarchical control instruction set for vision, force feedback and joint control. The multisampling rate decomposition takes into account the update frequency of the different sensors and control loops, e.g., visual feedback may be updated at 30Hz, while joint control may be run at 1 kHz. The hierarchical control instructions ensure control consistency over different time scales by interpolation and prediction methods.
And then, carrying out fault mode analysis on the hierarchical control instruction set and the independent triggering condition of each joint to obtain a fault-tolerant control strategy library. The fault mode analysis considers the situations of sensor failure, actuator abnormality and the like, and designs alternative control strategies for each possible fault situation. The fault-tolerant control strategy library is organized by using a decision tree structure, and can be rapidly switched to an appropriate control strategy according to the system state monitored in real time. And performing man-machine interaction interface mapping on the fault-tolerant control strategy library and the optimized control strategy to obtain a visual control panel. The visual control panel displays the current control state, the trigger condition and the execution track through an intuitive graphical interface, and allows an operator to monitor and intervene in the control process in real time.
And finally, carrying out real-time data stream processing on the visual control panel and the layered control instruction set to obtain a real-time joint control instruction sequence. Real-time data streaming processes use a streaming computing framework, such as APACHE FLINK, to process high frequency sensor data and control instructions, ensuring control delays are minimized. The generated real-time joint control instruction sequence is directly sent to each joint actuator of the mechanical arm, so that the mechanical arm is accurately controlled.
For example, in a scenario where a 6-degree-of-freedom robotic arm performs a precision assembly task, the target execution trajectory is discretized into 500 critical path points, with a time interval of 10ms. The initial sequence of joint angles calculated by inverse kinematics forms a matrix of 6x 500. After dynamic compensation, some joint angle values are adjusted to + -0.5 degrees to compensate for the effects of dynamic loading.
The event triggering condition is set to be that the position error threshold of the end effector is 0.5mm, the attitude error threshold is 0.1 degree, and the angle error threshold of the joint is between 0.05 and 0.1 degree (according to the importance of the joint). After the distributed configuration, the joint triggering condition near the end is more strict, for example, the angle error threshold of the 6 th joint is set to 0.05 degrees, and the base joint is set to 0.1 degrees.
The soft actor-critique algorithm uses a three-layer fully connected network as the actor, inputs as the current state (18 dimensions, including position, velocity and moment of 6 joints), and outputs as control commands for 6 joints. The commentator network is similar in structure, but outputs a scalar value representing the value of the action. After 10000 times of iterative training, the control strategy converges, and the average track tracking error is reduced from the initial 2mm to 0.3mm.
The multisampling rate decomposition divides the control commands into three layers, visual feedback command (30 Hz), force feedback command (100 Hz) and joint control command (1 kHz). The high frequency control command is generated from the low frequency command by interpolation method, ensuring smooth transition.
Failure mode analysis takes into account 5 common failure conditions including single joint encoder failure, force sensor anomalies, and the like. Alternative control strategies are designed for each failure condition, such as switching to IMU-based pose estimation upon failure of the joint encoders.
Real-time data stream processing uses a sliding window of 5ms to process the data streams from 100 sensor nodes. In a normal operation state, the event triggering frequency is about 50Hz, and the communication burden is greatly reduced. The joint control command finally generated is updated at the frequency of 1kHz, the control precision reaches +/-0.02 degree, and the mechanical arm can finish the precision assembly task with the precision of the sub-millimeter level, for example, a bearing with the diameter of 5mm is accurately inserted into a hole with the aperture of 5.01 mm.
According to the embodiment of the application, the real-time synchronization and noise reduction of the data of the vision, force-touch, joint encoder and inertia measurement unit are realized by designing high-efficiency multi-modal data acquisition and preprocessing processes, the data quality and processing efficiency are greatly improved, a solid foundation is laid for subsequent feature extraction and fusion, the effective feature extraction of multi-modal data is realized by adopting innovative space pyramid pooling and adaptive feature selection technology and combining wavelet packet transformation and principal component analysis, the perception capability of the system to environments and tasks is remarkably enhanced, the cross-modal attention network and multi-level cascade fusion mechanism are introduced, the difficult problem of heterogeneous data fusion is successfully solved by the method, the information of different modalities can be organically combined, the respective advantages are fully exerted, the perception precision of the system to the current state and future trend of the mechanical arm is greatly improved based on the state estimation and prediction module of the unscented Kalman filter and the recurrent neural network, the reliable basis is provided for intelligent decision and control, the current state, the prediction state and the environment constraint are comprehensively considered, the perception capability of the multi-modal data is remarkably improved, the optimal control system is designed and the self-adaptive control system is capable of meeting the requirements of the optimal control and the requirements of the automatic control system, the self-adaptive control system, the requirements of the system is well calculated, and the requirements of the self-adaptive control and the self-adaptive control system are met, and the requirements of the system are well-adaptive control is calculated, and the requirements of the system is well-controlled by the environment control system and the control system is triggered by the conditions, the operation precision, flexibility and task adaptability of the mechanical arm in a complex dynamic environment are comprehensively improved, and the efficiency and accuracy of mechanical arm control based on multi-mode driving are improved.
In a specific embodiment, the process of executing step S101 may specifically include the following steps:
(1) Carrying out high-speed shooting on a working space of the mechanical arm to obtain a real-time image stream and a real-time video stream, and carrying out motion blur elimination processing on the real-time image stream and the real-time video stream to obtain clear image data;
(2) Scanning the working space of the mechanical arm through a depth camera to obtain three-dimensional point cloud data;
(3) Performing high-frequency sampling on a force-touch sensor array of an end effector of the mechanical arm to obtain pressure distribution data, friction force and sliding information, and performing data acquisition on encoders of all joints of the mechanical arm to obtain joint angles, joint angular velocities and joint angular accelerations;
(4) Acquiring data of an inertial measurement unit arranged on the mechanical arm to obtain linear acceleration of the mechanical arm and angular velocity of the mechanical arm, and acquiring environmental data through a temperature sensor and a humidity sensor arranged on the mechanical arm to obtain environmental parameter data;
(5) Denoising clear image data, three-dimensional point cloud data and environment parameter data to obtain initial visual data;
(6) Denoising the pressure distribution data, the friction force and the sliding information to obtain initial force-touch data;
(7) Denoising the joint angle, the joint angular velocity and the joint angular acceleration to obtain initial joint encoder data, and denoising the linear acceleration of the mechanical arm and the angular velocity of the mechanical arm to obtain initial inertial measurement unit data;
(8) Combining the initial visual data, the initial force-touch data and the initial inertia measurement unit data into noise reduction multi-modal data, and detecting and eliminating abnormal values of the noise reduction multi-modal data through a median filter to obtain smooth multi-modal data;
(9) And carrying out time alignment processing on the smooth multi-mode data to obtain a multi-mode data set, wherein the multi-mode data set comprises visual data, force-touch data, joint encoder data and inertia measurement unit data.
Specifically, high-speed imaging of the robot arm working space is performed. Real-time image and video streams are captured using a high frame rate camera (e.g., a 500fps high speed camera) followed by a convolutional neural network based deblurring algorithm (e.g., deblurGAN) to eliminate motion blur. The algorithm obtains clear image data by generating a fuzzy-to-clear mapping relation for resisting network learning and processing each frame of image for about 20 ms. Meanwhile, the workspace is 3D scanned with a structured light or time of flight (ToF) depth camera, acquiring point cloud data, typically updated at a frequency of 30Hz, containing about 100,000 3D points per frame.
For force-touch data acquisition, a force-touch sensor array mounted at the robotic arm end effector records pressure distribution, friction and slip information at a sampling rate of 1 kHz. These sensors are capable of detecting forces in the range of 0.1N to 50N with a precision of 0.01N. Meanwhile, the high-precision encoder of each joint collects joint angle data at the frequency of 500Hz, the resolution reaches 0.01 degree, and the angular speed and the angular acceleration are obtained through numerical differential calculation.
An Inertial Measurement Unit (IMU) mounted at a critical part of the robotic arm records linear acceleration and angular velocity at a sampling rate of 200 Hz. A typical MEMS IMU is capable of measuring acceleration of + -16 g and angular velocity of + -2000 deg./s. The environmental parameters are collected by a temperature sensor and a humidity sensor, the update frequency is low, the update frequency is usually 1Hz, the temperature precision is +/-0.5 ℃, and the humidity precision is +/-2% RH. Denoising the acquired data is a key step for improving the data quality. The clear image data and the point cloud data are denoised by using bilateral filtering, and the method can smooth noise while maintaining edge information. The environmental parameter data is slow in change and simple moving average filtering is adopted. These processed data together constitute the initial visual data.
The denoising of force-touch data adopts a wavelet threshold denoising method. And selecting db4 wavelet, performing 4-level decomposition, processing wavelet coefficients by using a soft threshold method, and then reconstructing signals to obtain initial force-touch data. The method can effectively remove high-frequency noise, and simultaneously retains the characteristic of rapid change of signals. Denoising of joint encoder data and IMU data uses a kalman filter. For joint angles, the state equation is established based on the kinematic relationships of angle, angular velocity, and angular acceleration, and the observation equation directly uses the encoder readings. The Kalman filtering of the IMU data is based on a rigid body kinematics model, and the accelerometer and gyroscope data are fused. And obtaining initial joint encoder data and initial IMU data after filtering.
All initially processed data (visual, force-tactile, joint encoder, IMU) are combined into noise-reducing multimodal data. Outlier detection and rejection are performed on these data using a median filter. The window size of the median filter is adjusted for different data types, typically 3-5 samples. The step can effectively remove the sudden noise and abnormal value, and obtain smooth multi-mode data. And performing time alignment processing on the smooth multi-mode data. Other modality data is interpolated and aligned using a Dynamic Time Warping (DTW) algorithm, based on the highest sample rate data (typically force-haptic data, 1 kHz). The DTW algorithm finds the best match between different time sequences through dynamic programming, and can solve the problems of inconsistent sampling rate and time delay.
For example, in a scenario where a 6-degree-of-freedom robotic arm performs a precision assembly task, a high-speed camera captures 1280x720 resolution images at a frame rate of 500 fps. There is motion blur of about 5 pixels in the original image, and after DeblurGAN processing, the blur degree is reduced to below 1 pixel, and each frame of image is processed for 18ms. The depth camera acquires a 640x480 resolution depth map at a frequency of 30Hz, containing 307,200 3D points per frame.
The force-touch sensor array contains 16 force-measuring cells, each sampled at a frequency of 1 kHz. When an insertion operation was performed, the maximum pressure was recorded as 20N and the minimum detectable pressure as 0.05N. The joint encoder data shows that the angle of the end joint ranges from-30 deg. to +45 deg. and the maximum angular velocity reaches 60 deg./s when this operation is performed.
The maximum linear acceleration recorded by the IMU data is 2.5g and the maximum angular velocity is 180 °/s. Environmental parameters showed that the operating zone temperature was maintained at 23±1 ℃ and the relative humidity was 45±3%rh. Wavelet denoising reduces the noise level of force-sense data from + -0.1N to + -0.02N. Kalman filtering reduces the joint angle measurement noise from + -0.05 DEG to + -0.01 DEG, and IMU acceleration measurement noise from + -0.1 m/s2 to + -0.02 m/s2.
The median filtering successfully removes 3 transient spikes (> 40N) in the force data, which may be due to collisions or electrical disturbances. After the time alignment process, all data are unified to a sampling rate of 1kHz, and the maximum time deviation is reduced from the original 20ms to within 0.5 ms.
In a specific embodiment, the process of executing step S102 may specifically include the following steps:
(1) Performing spatial pyramid pooling on visual data to obtain a multi-scale visual feature map, and performing adaptive feature selection on the multi-scale visual feature map to obtain a key visual feature descriptor;
(2) Performing wavelet packet transformation on the force-touch data to obtain time-frequency domain feature representation, and performing principal component analysis on the time-frequency domain feature representation to obtain target pressure distribution data and texture features;
(3) Performing kinematic forward and backward solution calculation on the joint encoder data and the inertia measurement unit data to obtain the pose characteristics of the mechanical arm, and performing differential dynamics modeling on the pose characteristics of the mechanical arm to obtain the dynamics characteristics of the mechanical arm;
(4) Performing time alignment on the key visual feature descriptors, the target pressure distribution data, the texture features, the mechanical arm pose features and the mechanical arm dynamics features to obtain an alignment feature sequence;
(5) Performing nonlinear interpolation on the aligned feature sequences to obtain feature data with uniform sampling rate;
(6) Importance weighting is carried out on the feature data with the uniform sampling rate through the cross-modal attention network, weighted feature representation is obtained, and multi-level cascade fusion is carried out on the weighted feature representation, so that fusion feature vectors are obtained.
Specifically, spatial pyramid pooling is performed on the visual data, which divides the image into grids of different scales and performs pooling operations within each grid. Specifically, a three-level pyramid structure is adopted, and 1x1, 2x2 and 4x4 grids are respectively divided, and the maximum pooling operation is performed in each grid. For example, for an original image of 1280x720, a feature map of 320x180 is obtained on a 4x4 level, a feature map of 640x360 is obtained on a 2x2 level, and the 1x1 level maintains the original resolution. The multi-scale feature extraction can capture global and local features of an image, and the scale invariance of the features is improved. And carrying out self-adaptive feature selection on the multi-scale visual feature map, and adopting a Squeeze-and-Excitation (SE) module. The SE module carries out global average pooling on the feature map to obtain a channel descriptor, learns the correlation between channels through two layers of fully-connected networks, and finally re-weights the original feature map by using the learned weight. For example, for a 256-channel feature map, the SE module might boost the weight of some important channels to 1.5, while the weight of non-important channels is reduced to 0.5, highlighting the key visual features.
For force-haptic data, time-frequency analysis was performed using wavelet packet transforms. The wavelet packet transform is an extension of the wavelet transform that performs multi-layer decomposition on a signal to decompose not only a low frequency part but also a high frequency part. In the method, daubechies wavelet basis is used for 5-layer decomposition, and the time-frequency characteristics of 32 frequency bands are obtained. Principal Component Analysis (PCA) of these time-frequency features, preserving the principal component that accounts for 95% variance, typically compresses the original high-dimensional features (e.g., 512 dimensions) to a lower dimension (e.g., 50 dimensions), resulting in a compressed pressure distribution and texture feature representation.
The joint encoder data and the inertia measurement unit data are calculated through kinematic forward and backward solutions to obtain pose characteristics of the mechanical arm. The forward kinematics uses DH parameter method to calculate the position and posture of the end effector, while the reverse kinematics adopts analytic method and numerical iteration method to solve the joint angle. For example, for a 6 degree-of-freedom robotic arm, forward kinematics maps 6 joint angles to a 6-dimensional pose (3-dimensional position and 3-dimensional pose) of the end effector, and reverse kinematics solves the reverse. And then, establishing a dynamic model of the mechanical arm by using a Lagrangian equation, and obtaining the dynamic characteristics of the mechanical arm, including joint moment, inertia matrix, coriolis force and the like, through differential dynamic modeling.
In order to align the feature sequences of different modalities, a Dynamic Time Warping (DTW) algorithm is employed. DTW finds the best alignment path between two time series by dynamic programming, and can handle sequences of different lengths and sampling rates. In the method, the characteristic sequences of other modes are aligned with the visual characteristic sequences serving as references. The aligned feature sequence is resampled by cubic spline interpolation to obtain feature data with a uniform sampling rate, and the highest sampling rate (such as 1 kHz) is generally selected as the uniform sampling rate. Finally, importance weighting is carried out on the characteristic data with the uniform sampling rate through the cross-modal attention network. The network adopts a transducer architecture, and comprises a multi-head self-attention mechanism and a cross-mode attention mechanism. The self-attention mechanism is used to capture time-series dependency within a single modality, while the cross-modality attention mechanism is used to learn correlations between different modalities. For example, in an 8-head attention mechanism, each head may be focused on a different combination of features, such as visual-force haptic, visual-kinematic, and the like. The attention weights were normalized by the softmax function, ranging between 0 and 1.
The weighted feature representation is subjected to multi-level cascade fusion to form a final fusion feature vector. Cascade fusion includes three levels, early fusion, mid fusion and late fusion. Early fusion directly splices low-level features of different modes, middle fusion adopts a convolutional neural network to fuse the middle-level features, and late fusion adopts a full-connection layer to fuse the high-level features. The resulting fused feature vector is typically a high-dimensional vector, e.g., 1024-dimensional, that contains the comprehensive information of each modality.
For example, in a precision assembly task, the robotic arm requires insertion of a microelectronic element into the circuit panel. The visual data is subjected to spatial pyramid pooling to obtain three-scale feature maps, namely 1280x720x64, 640x360x128 and 320x180x256. The SE module adaptively selects these feature maps to boost the weight of some key channels (e.g., edge detection related channels) to 1.8, while the weight of background related channels is reduced to 0.3. The force-touch data is subjected to wavelet packet transformation to obtain time-frequency characteristics of 32 frequency bands, and the first 20 principal components are reserved after principal component analysis, so that a variance of 98.2% is explained. The joint encoder and IMU data are calculated through kinematics to obtain that the position accuracy of the end effector reaches +/-0.005 mm and the attitude accuracy is +/-0.01 degrees. The joint moment estimation error obtained by dynamic modeling is less than 1%.
The DTW algorithm aligns the feature sequences of the different modalities, reducing the maximum time offset from the original 50ms to 1ms. In a cross-modal attention network, at the key moment of the insertion operation, the attention weight of the force-tactile feature rises to 0.6, the visual feature weight is 0.3, and the kinematic feature weight is 0.1. And finally obtaining a 1024-dimensional fusion feature vector after multilevel cascade fusion. The fusion feature vector enables the mechanical arm to comprehensively utilize visual, force touch and kinematic information, achieves the insertion task of the micro element with the precision of 5 mu m and the force control precision of 0.1N, and improves the insertion success rate from 90% to 99.5%.
In a specific embodiment, the process of executing step S103 may specifically include the following steps:
(1) Performing joint space mapping on the fusion feature vectors to obtain initial joint state estimation data, and performing kinematic orthometric calculation on the initial joint state estimation data to obtain end effector pose estimation data;
(2) Performing object position estimation analysis on the end effector pose estimation data and the force-touch data to obtain target object position estimation data, wherein the target object position estimation data comprises object centroid coordinates and object contour data;
(3) Performing data combination on the initial joint state estimation data, the end effector pose estimation data and the target object position estimation data to obtain a system state vector;
(4) Sampling a system state vector through unscented transformation to obtain a sigma point set, and carrying out nonlinear state propagation on the sigma point set to obtain priori state estimation data;
(5) Carrying out data updating on the prior state estimation data by utilizing the fusion feature vector to obtain the current state of the mechanical arm;
(6) Carrying out dynamic parameter identification on the current state of the mechanical arm to obtain a self-adaptive dynamic model;
(7) Carrying out time sequence modeling on the current state of the mechanical arm and the self-adaptive dynamics model through a gate control circulation unit network to obtain a state evolution sequence;
(8) And carrying out mechanical arm simulation operation analysis on the state evolution sequence to obtain a mechanical arm prediction state.
Specifically, a fully connected neural network is used to map a high-dimensional fusion feature vector (for example 1024 dimensions) to a joint space of the mechanical arm to obtain initial joint state estimation data. For a 6 degree of freedom manipulator, the output of this mapping network is an 18-dimensional vector containing the angles, angular velocities and angular accelerations of the 6 joints. Then, a DH parameter method is used for kinematic orthometric calculation, and the state of the joint space is converted into the pose of the end effector in Cartesian space. This process involves a series of matrix transformations that result in 6-dimensional end effector pose estimation data, including 3-dimensional positions and 3-dimensional poses (e.g., euler angles or quaternions representations). Target object position estimation is performed in combination with end effector pose estimation data and force-sense of touch data. The process combines visual information and force feedback using a fusion algorithm based on Kalman filtering. The visual information provides the general location and contour of the object, while the force-tactile data provides the precise location and contact force of the contact point. And estimating the barycenter coordinates of the object by a least square method, and reconstructing the outline of the object by using shape priori knowledge and an edge detection algorithm. The resulting target object position estimate data typically includes 3-dimensional centroid coordinates and a set of points or parametric surfaces describing the object's contour.
The initial joint state estimation data, the end effector pose estimation data, and the target object position estimation data are combined into a comprehensive system state vector. For a 6-degree-of-freedom robot arm and a target object, this state vector may be about 30 dimensions, containing the complete state of the robot arm and the position information of the target object.
The system state vector is sampled using an unscented transformation to generate a sigma point set. For an n-dimensional state vector, 2n+1 sigma points are typically generated. Each sigma point is a sample point in the state space representing one possible implementation of the state distribution. These sigma points are propagated in nonlinear state through the system model to obtain prior state estimation data. The dynamic model of the mechanical arm is considered in the process, and the dynamic model comprises joint moment, inertia matrix, coriolis force and other factors. And updating the prior state estimation by using the fusion feature vector as observation data. The step adopts Unscented Kalman Filtering (UKF) algorithm, calculates Kalman gain and corrects state estimation to obtain the current state of the mechanical arm. The advantage of the UKF is that it is able to handle highly non-linear systems, without the need for linearization processes, and thus to estimate the state of the robot arm more accurately.
And carrying out dynamic parameter identification on the obtained current state of the mechanical arm, and constructing a self-adaptive dynamic model. This process uses a Recursive Least Squares (RLS) algorithm to estimate parameters such as mass, inertia, and friction coefficient of the robotic arm in real time. RLS algorithms can converge quickly and can accommodate slow changes in parameters such as load changes or joint wear. And adopting a gate control loop unit (GRU) network to carry out time sequence modeling on the current state of the mechanical arm and the self-adaptive dynamics model. GRU is a variant of recurrent neural network that can effectively capture long-term dependencies while being computationally efficient over LSTM. The GRU network receives as input a sequence of states and kinetic parameters within a time window, and outputs state predictions for a number of time steps in the future. The state evolution sequence is obtained in the process, and the change trend of the state of the mechanical arm along with time is reflected.
And carrying out mechanical arm simulation operation analysis on the state evolution sequence, wherein forward dynamics simulation is used for calculating the motion trail of the mechanical arm in a future time period based on the predicted state and the control input. The analysis considers a complete mechanical arm dynamics model, including joint limit, speed and acceleration constraint and the like, and finally obtains the mechanical arm prediction state.
For example, in a scenario where a 6-degree-of-freedom robotic arm performs a precision assembly task, the fused feature vector is a 1024-dimensional vector. It is mapped to an 18-dimensional initial joint state estimate, including angles, angular velocities, and angular accelerations of 6 joints, by a fully connected neural network. For example, the estimated state of the first joint may be an angle of 30, an angular velocity of 5/s, and an angular acceleration of 1/s 2. The kinematic orthometric calculations transform these joint states into the pose of the end effector, resulting in position (x=500 mm, y=300 mm, z=200 mm) and pose (roll=10°, pitch=5°, yaw=15°).
Bond force-haptic data, estimating the position of a target object (e.g., an electronic component). Assuming that the force sensor detects a contact force of 0.5N, the direction is the negative z-axis direction. In combination with visual information, the object centroid coordinates (505 mm, 305mm, 195 mm) were estimated and the profile data was represented as a 10mm x 5mm x 2mm rectangular prism.
The system state vector combines this information to form a 30-dimensional vector. The unscented transform samples this vector to generate 61 sigma points. Each sigma point is propagated through the kinetic model, for example, taking into account a time step of 0.1 seconds, calculating the possible states after 0.1 seconds given the initial state and the control input. The UKF updating step uses the fusion feature vector as an observation to correct the state estimation. Assuming an initial estimated tip position error of 5mm, the update is followed by a reduction to 1mm. The kinetic parameter identification process may find that the effective mass of the robot arm tip increases from 1kg to 1.1kg as a result of gripping the object.
The GRU network uses 128 hidden units, inputs a sequence of states and kinetic parameters for the past 10 time steps (10 ms per step), predicting the state for the next 5 time steps. For example, after 0.05 seconds of prediction, the end effector will move to the (510 mm, 310mm, 190 mm) position. Finally, the mechanical arm simulation operation analysis considers the predicted state and the control strategy, and calculates that the mechanical arm can accurately place the object within 0.5 seconds in the future, and the predicted final position error is smaller than 0.1mm. This high-precision state estimation and prediction enables the robotic arm to achieve sub-millimeter operational accuracy in complex assembly tasks.
In a specific embodiment, the process of executing step S104 may specifically include the following steps:
(1) Performing task decomposition on the current state of the mechanical arm to obtain a basic action unit sequence, and performing symbolized representation on the basic action unit sequence to obtain a task execution plan;
(2) Performing three-dimensional reconstruction on visual data in the multi-mode data set to obtain dynamic environment point cloud data;
(3) Probability occupation analysis is carried out on the dynamic environment point cloud data to obtain a spatial obstacle distribution map;
(4) Performing time sequence prediction on the spatial obstacle distribution map to obtain dynamic obstacle avoidance constraint data, and performing kinematic forward solution on the predicted state of the mechanical arm and the task execution plan to obtain a predicted track of the end effector;
(5) Performing image plane projection on the predicted track of the end effector to obtain an image characteristic predicted track, and performing sampling optimization on the image characteristic predicted track and dynamic obstacle avoidance constraint data to obtain a candidate track set;
(6) Performing multi-objective evaluation on the candidate track set and the task execution plan to obtain a pareto optimal track set;
(7) And carrying out dynamic constraint analysis on the pareto optimal track set and the current state of the mechanical arm to obtain a target execution track.
Specifically, task decomposition is performed on the current state of the mechanical arm, and a complex operation task is decomposed into a series of basic action units. The process adopts a hierarchical task planning method to gradually refine a high-level task target into executable basic actions. For example, a task of "picking up an object and placing it" can be decomposed into basic action units such as "moving to the vicinity of the object", "adjusting the posture", "grabbing the object", "moving to the target position", and "releasing the object".
Then, these basic action units are symbolized to obtain a high-level task execution plan. Symbolizing means that predicate logic is used to represent each action unit as an operator with a pre-condition and a post-effect. For example, a "grab Object" operation may be expressed as GRASP (Object):
Precondition:At(HandPosition,ObjectPosition);HandEmpty ;
Effect: Holding(Object) ; ¬HandEmpty
In this representation:
GRASP (Object) is the operation name, where Object is the Object to be grabbed.
Precondition lists the preconditions required to perform this operation:
at (HandPosition, objectPosition) indicates that the position of the manipulator is the same as the object position.
HANDEMPTY indicates that the manipulator is currently empty (not holding other objects).
Effect describes the result after an operation is performed:
Holding (Object) indicates that the manipulator now grabs the object.
¬ HANDEMPTY indicates that the manipulator is no longer empty (¬ indicates negative).
This symbolized representation facilitates understanding and processing of task requirements by subsequent planning algorithms.
And then, performing three-dimensional reconstruction by utilizing the visual data in the multi-mode data set to obtain dynamic environment point cloud data. This process typically employs structured light or time of flight (ToF) depth cameras, combined with synchronous localization and mapping (SLAM) techniques, to generate a high-precision three-dimensional environmental model. For example, a typical workspace may be reconstructed as a point cloud containing 500,000 points, each containing (x, y, z) coordinates and RGB color information.
And carrying out probability occupation analysis on the obtained dynamic environment point cloud data to generate a spatial obstacle distribution map. This step uses an Octree (Octree) data structure to efficiently represent and update occupancy probabilities in three-dimensional space. The workspace may be divided into a grid of 100x100x100, each voxel size 1cm3. The occupancy probability for each voxel is continuously updated from the new sensor data, ranging from 0 (free) to 1 (occupied), by bayesian update rules.
In order to process the dynamic environment, time sequence prediction is carried out on the spatial obstacle distribution map, and dynamic obstacle avoidance constraint data are obtained. The Kalman filter is used here to predict the trajectory of a moving object, typically predicting states within 0.5-1 seconds of the future. For example, if an object moving at a speed of 0.5m/s is detected, the predictive algorithm calculates the area of space that the object may occupy in the next 1 second. And performing kinematic forward solution on the mechanical arm prediction state and the task execution plan to obtain an end effector prediction track. This step calculates the position and attitude sequence of the end effector in Cartesian space using DH parameters of the robotic arm and the predicted joint angle sequence. For a 6 degree of freedom manipulator, this process involves a series of 4x4 matrix transformations, resulting in a three-dimensional space curve marked by a time stamp.
Projecting the predicted trajectory of the end effector onto an image plane to obtain the predicted trajectory of the image feature. This process converts the trajectory points in three-dimensional space into two-dimensional image coordinates using the camera's inside and outside parameter matrix. For example, a 3D point (500 mm, 300mm, 200 mm) may be projected as image coordinates (320, 240). Using fast expanding random treesAnd the algorithm performs sampling optimization on the image characteristic prediction track and the dynamic obstacle avoidance constraint data to obtain a candidate track set. The fast extended random tree algorithm samples randomly in the configuration space, exploring feasible paths quickly. In the method, the sampling strategy of the algorithm is modified to favor the area near the predicted track of the image characteristic, so as to improve the planning efficiency. For example, in one planning, the algorithm may generate 10,000 sample points, resulting in 50 viable candidate trajectories.
And performing multi-objective evaluation on the obtained candidate track set and the task execution plan to obtain the pareto optimal track set. The evaluation index includes track length, energy consumption, smoothness, task completion, and the like. The multi-objective optimization adopts NSGA-II (Non-dominated Sorting Genetic Algorithm II) algorithm, and the pareto front is searched by means of genetic algorithm. For example, 10 pareto optimal trajectories may be obtained, each trajectory making different trade-offs between different targets. And carrying out dynamic constraint analysis on the pareto optimal track set and the current state of the mechanical arm to obtain a target execution track. This step takes into account the maximum joint moment, speed and acceleration limits of the robotic arm, ensuring that the generated trajectory is physically executable. Dynamic constraint analysis uses inverse dynamic calculations to calculate the required joint moment for each trace point and checks if the limit is exceeded.
For example, in a task where a robot arm is required to take a precision part from a table and insert it into a product on an assembly line, the task is first broken down into 5 basic action units, moving over the part, adjusting the gripping attitude, gripping the part, moving over the target product, and performing the insertion operation. Each action unit is symbolized, for example, the grabbing operation is represented as GRASP (Part 1), the pre-conditions include HandAbove (Part 1) and HANDEMPTY, and the post-effect is Holding (Part 1). The environment is reconstructed three-dimensionally to obtain point cloud data containing 500,000 points, and the resolution is 1mm. Probability occupancy analysis divides the workspace into a grid of 100x100x100, each voxel size 1cm3. Dynamic obstacle prediction shows that a human hand will enter the work area after 0.5 seconds, with an occupancy probability rising from 0 to 0.8. The robot arm predicted state sequence contains 100 time steps of 10ms each. The predicted trajectory of the end effector forms a smooth curve in Cartesian space, having a length of approximately 50cm.
The fast expanding random tree algorithm generates 10,000 sample points in the configuration space, resulting in 50 candidate trajectories. The multi-objective evaluation considers track length (weight 0.3), energy consumption (weight 0.3), smoothness (weight 0.2), and task completion (weight 0.2). And carrying out 100 generations of evolution on the NSGA-II algorithm to finally obtain 10 pareto optimal tracks. Kinetic constraint analysis showed that of these 10 trajectories, 3 exceeded the joint velocity limit (maximally 5%) at some points and 2 exceeded the moment limit (maximally 3%) at some points. The final selected target execution trace length was 52cm, the predicted energy consumption was 120J, the smoothness index was 0.95 (full 1), and the task completion was 1. The trajectory satisfies the kinetic constraints at all time points, with a maximum joint speed of 80% of the nominal speed and a maximum moment of 75% of the nominal moment. This trajectory is discretized into 200 path points, each containing 6 joint angle values, forming a 6x200 matrix as input to the subsequent control algorithm.
In a specific embodiment, the process of executing step S105 may specifically include the following steps:
(1) Performing discretization sampling on the target execution track to obtain a critical path point sequence, and performing inverse kinematics calculation on the critical path point sequence to obtain an initial joint angle sequence;
(2) Dynamically compensating the initial joint angle sequence to obtain a compensated joint angle sequence;
(3) Carrying out event triggering condition construction on the compensation joint angle sequence and the fusion feature vector to obtain a triggering threshold set, and carrying out distributed configuration on the triggering threshold set to obtain independent triggering conditions of each joint of the mechanical arm;
(4) Performing reinforcement learning on the compensation joint angle sequence and the independent triggering condition of each joint of the mechanical arm to obtain an optimal control strategy;
(5) Performing multi-sampling rate decomposition on the optimal control strategy and the trigger threshold set to obtain a hierarchical control instruction set for vision, force feedback and joint control;
(6) Performing fault mode analysis on the layered control instruction set and the independent triggering conditions of each joint of the mechanical arm to obtain a fault-tolerant control strategy library;
(7) Performing man-machine interaction interface mapping on the fault-tolerant control strategy library and the optimized control strategy to obtain a visual control panel;
(8) And carrying out real-time data stream processing on the visual control panel and the layered control instruction set to obtain a real-time joint control instruction sequence, and controlling the mechanical arm to operate through the real-time joint control instruction sequence.
Specifically, key steps of the method for realizing the control of the mechanical arm based on multi-mode driving start from discretizing and sampling the target execution track. This process employs an equal time interval sampling, typically at 10ms intervals, to convert a continuous track into a series of critical path points. For example, a5 second trajectory may be discretized into 500 path points, each containing position and attitude information of the end effector in Cartesian space. And performing inverse kinematics calculation on the key path points to obtain an initial joint angle sequence. Inverse kinematics calculations use analytic methods in combination with numerical iterations, such as newton-raffinsen, to convert end effector positions in cartesian space to angular values in joint space. For a 6 degree of freedom manipulator, this step will generate a 6x500 matrix, with each column representing 6 joint angles at one point in time.
And dynamically compensating the initial joint angle sequence to obtain a compensated joint angle sequence. The dynamic compensation takes the mass distribution, inertia characteristics and gravity influence of the mechanical arm into consideration, and the moment required by each joint is calculated by using a Newton-Euler method. The compensation process adjusts the joint angle to meet the dynamics constraint by reverse dynamics calculation, ensuring that the trajectory is physically executable. For example, the angle of some joints may need to be adjusted by ±0.5 degrees, taking into account the load and gravitational effects. And carrying out event triggering condition construction on the compensation joint angle sequence and the fusion feature vector to obtain a triggering threshold set. The event triggering condition is defined based on a status error and a control error, such as triggering a control update when the joint angle error exceeds 0.1 degrees or the end effector position error exceeds 1 mm. The design of the trigger threshold set takes into account the importance and task requirements of the different joints, typically the trigger threshold of the critical joint (e.g. the end joint) will be set smaller to ensure a higher control accuracy.
And carrying out distributed configuration on the trigger threshold set to obtain independent trigger conditions of each joint of the mechanical arm. The distributed configuration considers the coupling relation and task characteristics among joints, establishes a joint network by using a graph theory method, and distributes trigger conditions through a consistency protocol. For example, adjacent joints may share certain triggering conditions to achieve coordinated control. And performing reinforcement learning on the compensation joint angle sequence and the independent triggering condition by using a soft actor-critic algorithm to obtain an optimal control strategy. The soft actor-critique algorithm includes an actor network (for generating actions) and a critique network (for evaluating action value). In the learning process, the actor network outputs joint control instructions, and the critic network evaluates the effects of the instructions and continuously optimizes the control strategy through back propagation. Learning the targets includes minimizing trajectory tracking errors, energy consumption, and trigger frequency.
And carrying out multi-sampling rate decomposition on the optimal control strategy and the trigger threshold set to obtain a hierarchical control instruction set for vision, force feedback and joint control. The multisampling rate decomposition considers the update frequency of the different sensors and control loops, such as visual feedback at 30Hz, force feedback at 100Hz, and joint control at 1 kHz. The hierarchical control instructions ensure control consistency over different time scales by interpolation and prediction methods. And carrying out fault mode analysis on the hierarchical control instruction set and the independent triggering condition of each joint to obtain a fault-tolerant control strategy library. The fault mode analysis considers the situations of sensor failure, actuator abnormality and the like, and designs alternative control strategies for each possible fault situation. The fault-tolerant control strategy library is organized by using a decision tree structure, and can be rapidly switched to an appropriate control strategy according to the system state monitored in real time.
And performing man-machine interaction interface mapping on the fault-tolerant control strategy library and the optimized control strategy to obtain a visual control panel. The visual control panel displays the current control state, the trigger condition and the execution track through an intuitive graphical interface, and allows an operator to monitor and intervene in the control process in real time.
And carrying out real-time data stream processing on the visual control panel and the layered control instruction set to obtain a real-time joint control instruction sequence. Real-time data streaming processes use a streaming computing framework, such as APACHE FLINK, to process high frequency sensor data and control instructions, ensuring control delays are minimized. The generated real-time joint control instruction sequence is directly sent to each joint actuator of the mechanical arm, so that the mechanical arm is accurately controlled.
For example, in a scenario where a 6-degree-of-freedom robotic arm performs a precision assembly task, the target execution trajectory is discretized into 500 critical path points, with a time interval of 10ms. The initial sequence of joint angles calculated by inverse kinematics forms a matrix of 6x 500. After dynamic compensation, some joint angle values are adjusted to + -0.5 degrees to compensate for the effects of dynamic loading. The event triggering condition is set to be that the position error threshold of the end effector is 0.5mm, the attitude error threshold is 0.1 degree, and the angle error threshold of the joint is between 0.05 and 0.1 degree (according to the importance of the joint). After the distributed configuration, the joint triggering condition near the end is more strict, for example, the angle error threshold of the 6 th joint is set to 0.05 degrees, and the base joint is set to 0.1 degrees. The soft actor-critique algorithm uses a three-layer fully connected network as the actor, inputs as the current state (18 dimensions, including position, velocity and moment of 6 joints), and outputs as control commands for 6 joints. The commentator network is similar in structure, but outputs a scalar value representing the value of the action. After 10000 times of iterative training, the control strategy converges, and the average track tracking error is reduced from the initial 2mm to 0.3mm.
The multisampling rate decomposition divides the control commands into three layers, visual feedback command (30 Hz), force feedback command (100 Hz) and joint control command (1 kHz). The high frequency control command is generated from the low frequency command by interpolation method, ensuring smooth transition. Failure mode analysis takes into account 5 common failure conditions including single joint encoder failure, force sensor anomalies, and the like. Alternative control strategies are designed for each failure condition, such as switching to IMU-based pose estimation upon failure of the joint encoders.
Real-time data stream processing uses a sliding window of 5ms to process the data streams from 100 sensor nodes. In a normal operation state, the event triggering frequency is about 50Hz, and the communication burden is greatly reduced. The joint control command finally generated is updated at the frequency of 1kHz, the control precision reaches +/-0.02 degree, so that the mechanical arm can finish the precision assembly task with the precision of sub-millimeter level, for example, a bearing with the diameter of 5mm is accurately inserted into a hole with the aperture of 5.01mm, and the maximum force error in the insertion process is controlled within 0.1N.
In a specific embodiment, the process of performing the step of performing event triggering condition construction on the compensated joint angle sequence and the fusion feature vector to obtain a trigger threshold set, and performing distributed configuration on the trigger threshold set to obtain an independent triggering condition of each joint of the mechanical arm may specifically include the following steps:
(1) Performing joint space analysis on the compensated joint angle sequence to obtain a joint movement range;
(2) Moment threshold calculation is carried out on the force-touch data in the multi-mode data set to obtain a joint moment range, constraint analysis is carried out on the joint movement range and the joint moment range to obtain a preliminary triggering condition;
(3) Estimating the position of a target object for the visual data in the multi-mode data set to obtain the position error of the end effector, and performing jacobian matrix mapping for the position error of the end effector to obtain a joint angle error threshold;
(4) Weighting and fusing the preliminary triggering condition and the joint angle error threshold value to obtain a comprehensive triggering threshold value;
(5) Carrying out dynamic adjustment coefficient calculation on the comprehensive trigger threshold value to obtain a self-adaptive trigger threshold value set;
(6) Performing noise sensitivity analysis on the self-adaptive trigger threshold set to obtain an anti-interference trigger threshold;
(7) And performing joint coupling analysis on the anti-interference trigger threshold value to obtain a trigger threshold value set, and performing mechanical arm topological structure mapping on the trigger threshold value set to obtain independent trigger conditions of each joint of the mechanical arm.
Specifically, the range of motion of each joint throughout the task is calculated by statistical methods, including maximum angle, minimum angle, and angular velocity limits. For example, for a 6 degree of freedom mechanical arm, the range of motion of the first joint (base rotation) may be [ -180 °, 180 ° ], while the range of motion of the end joint may be smaller, such as [ -90 °, 90 ° ]. And calculating a moment threshold value of the force-touch data in the multi-mode data set to obtain a joint moment range. The process uses the data of the force-touch sensor and combines the dynamic model of the mechanical arm to calculate the maximum moment which each joint can encounter when executing the task. For example, when the end effector grips a 10kg object, it may cause the second joint (shoulder) to experience a torque of 100 Nm.
And carrying out constraint analysis on the joint movement range and the joint moment range to obtain a preliminary triggering condition. This step considers both kinematic and dynamic constraints and sets a preliminary trigger threshold for each joint. In general, joints with moments approaching a limit will be given more sensitive triggering conditions. For example, if the maximum allowable torque for a joint is 150Nm, a control update may be triggered when the actual torque exceeds 120Nm (80% threshold). And estimating the position of the target object for the visual data in the multi-mode data set to obtain the position error of the end effector. This process uses computer vision algorithms, such as deep learning based target detection and pose estimation methods, to calculate the relative position between the target object and the end effector. For example, in a precision assembly task, if a positional deviation of the end effector from the target jack of more than 0.5mm is detected, a trigger control update is required.
And performing jacobian matrix mapping on the position error of the end effector to obtain a joint angle error threshold. The jacobian describes the sensitivity of the end effector position to joint angle. The end position error can be converted into the angle error of each joint by the inverse operation of the jacobian matrix. For example, a tip position error of 0.5mm may correspond to an angular error of 0.1 ° for the sixth joint, but only 0.01 ° for the first joint. And carrying out weighted fusion on the preliminary triggering condition and the joint angle error threshold value to obtain a comprehensive triggering threshold value. This step uses a weighted average or fuzzy logic approach, taking into account both moment constraints and accuracy requirements. The assignment of the weights reflects the characteristics of the task, as in force-controlled tasks, the weights of the moment thresholds may be higher.
And carrying out dynamic adjustment coefficient calculation on the comprehensive trigger threshold value to obtain a self-adaptive trigger threshold value set. This process takes into account the different phases of the task and environmental changes, dynamically adjusting the trigger sensitivity. For example, the trigger threshold may be automatically lowered by 20% when approaching the target object to improve control accuracy. And carrying out noise sensitivity analysis on the self-adaptive trigger threshold set to obtain an anti-interference trigger threshold. This step uses Monte Carlo simulation or statistical analysis methods to evaluate the reliability of the trigger conditions at different noise levels. For example, if the sensor noise standard deviation is 0.01 °, the trigger threshold may need to be set to 0.03 ° (3 standard deviations) to avoid false triggers.
And then, performing joint coupling analysis on the anti-interference trigger threshold value to obtain a final trigger threshold value set. This process takes into account the kinetic coupling between the joints of the robotic arm and uses a structured learning or graph model approach to build a network of relationships between the joints. For example, the triggering conditions of adjacent joints may need to be cooperatively set to prevent concussion. And mapping the mechanical arm topological structure of the trigger threshold set to obtain independent trigger conditions of each joint of the mechanical arm. This step assigns globally optimized trigger thresholds to the various joints, taking into account the geometry and motion characteristics of the robotic arm. For example, a joint near the base may have a larger trigger threshold, while the threshold for the end joint is more stringent.
For example, in a scenario where a 6-degree-of-freedom mechanical arm performs a precision insertion task, the range of motion of each joint is first obtained by joint space analysis, namely joint 1 is [ -180 °, 180 ° ], joint 2 is [ -90 °, 90 ° ], joint 3 is [ -120 °, 120 ° ], joint 4 is [ -180 °, 180 ° ], joint 5 is [ -90 °, 90 ° ], joint 6 is [ -360 °, 360 ° ]. Moment threshold calculations show that during insertion, the second joint may be subjected to a maximum of 100Nm moment, while the other joints are less. The vision system detects the position of the target jack and calculates the position error of the end effector to be 0.3mm. Through jacobian matrix mapping, this error translates to a joint angle error of 0.01 for joint 1, 0.02 for joint 2, 0.03 for joint 3, 0.05 for joint 4, 0.08 for joint 5, and 0.1 for joint 6.
Considering the moment and precision requirements comprehensively, the initial setting triggering condition is that triggering is performed when the moment exceeds 80% of the maximum allowable value or the angle error exceeds 2 times of the maximum allowable value. In dynamic adjustment, all thresholds are reduced by 25% within the last 1cm distance of insertion to improve accuracy. Noise analysis showed that the sensor noise standard deviation was 0.005 °, thus setting the trigger threshold to 3 times the standard deviation, i.e. 0.015 °. In view of joint coupling, the triggering conditions of adjacent joints are cooperatively set, for example, the triggering of joint 2 and joint 3 needs to satisfy the respective conditions at the same time. Finally, after mapping to the mechanical arm topology, the independent triggering conditions of each joint were obtained, namely joint 1 was 0.03 ° or 80Nm, joint 2 was 0.04 ° or 90Nm, joint 3 was 0.06 ° or 70Nm, joint 4 was 0.1 ° or 50Nm, joint 5 was 0.16 ° or 30Nm, and joint 6 was 0.2 ° or 20Nm. The trigger conditions ensure that the mechanical arm can respond to the change of the position and the force in time when the precise insertion task is executed, and the sub-millimeter operation precision is realized.
The method for controlling the mechanical arm based on the multi-mode driving in the embodiment of the present application is described above, and the system for controlling the mechanical arm based on the multi-mode driving in the embodiment of the present application is described below, referring to fig. 2, an embodiment of the system for controlling the mechanical arm based on the multi-mode driving in the embodiment of the present application includes:
The processing module 201 is configured to acquire data through a sensor module on the mechanical arm and perform preprocessing to obtain a multi-mode data set, where the multi-mode data set includes visual data, force-touch data, joint encoder data and inertial measurement unit data;
the fusion module 202 is configured to perform feature extraction on the multimodal dataset and perform feature fusion through an attention mechanism to obtain a fusion feature vector;
The prediction module 203 is configured to perform state data estimation on the fusion feature vector through an unscented kalman filter, and perform state prediction through a recurrent neural network, so as to obtain a current state of the mechanical arm and a predicted state of the mechanical arm;
the analysis module 204 is configured to perform planning analysis on the motion track of the mechanical arm on the current state of the mechanical arm and the predicted state of the mechanical arm, so as to obtain a target execution track;
The generating module 205 is configured to generate a control policy for the target execution track through an event-triggered distributed coordination control algorithm, obtain a real-time joint control instruction sequence, and control the mechanical arm to perform an operation through the real-time joint control instruction sequence.
Through the collaborative cooperation of the components, the method realizes the real-time synchronization and noise reduction of vision, force-touch, joint encoder and inertia measurement unit data by designing an efficient multi-mode data acquisition and preprocessing flow, greatly improves the data quality and processing efficiency, lays a solid foundation for subsequent feature extraction and fusion, adopts innovative space pyramid pooling and adaptive feature selection technology, combines wavelet packet transformation and principal component analysis, realizes the effective feature extraction of multi-mode data, remarkably enhances the perception capability of the system to environments and tasks, successfully solves the problem of heterogeneous data fusion by introducing a cross-mode attention network and a multi-level cascade fusion mechanism, enables the information of different modes to be organically combined, fully exerts respective advantages, greatly improves the perception precision of the system to the current state and future trend of a mechanical arm, provides a reliable basis for intelligent decision making and control by comprehensively considering the current state, the prediction state and the environment constraint of the mechanical arm, combines a fast random tree algorithm and a multi-target optimization algorithm, meets the requirements of the optimal control and the optimal control of the intelligent control system, has the requirements of the optimal control system, has the capability of generating the optimal control and the self-adaptive control system, has the capability of generating the optimal control path by the optimal control, and the optimal control of the environment, and the self-adaptive control system has the capability of the requirements of the optimal control and the self-adaptive control system, and the requirements of the control system, and the optimal control system is realized by the requirements of the algorithm, the requirements of the control of the system, and the control of the system is realized, and the requirements of the system, and the system is based on the requirements of the conditions of the control of the conditions of the system and the system, the operation precision, flexibility and task adaptability of the mechanical arm in a complex dynamic environment are comprehensively improved, and the efficiency and accuracy of mechanical arm control based on multi-mode driving are improved.
The present application also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, and may also be a volatile computer readable storage medium, where instructions are stored in the computer readable storage medium, when the instructions are executed on a computer, cause the computer to perform the steps of the multi-mode drive-based mechanical arm control method.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, systems and units may refer to the corresponding processes in the foregoing method embodiments, which are not repeated herein.
While the application has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that the foregoing embodiments may be modified or equivalents may be substituted for some of the features thereof, and that the modifications or substitutions do not depart from the spirit and scope of the embodiments of the application.