CN118982678A - Intelligent scene understanding method for humanoid robots based on laser point cloud - Google Patents
- Publication number: CN118982678A
- Application number: CN202411118957.6A
- Authority: CN (China)
- Prior art keywords: point cloud, point, laser radar, scene, data
- Legal status: Granted
Classifications
- G06V10/40 — Extraction of image or video features
- G06N3/0455 — Auto-encoder networks; encoder-decoder networks
- G06N3/0475 — Generative networks
- G06N3/094 — Adversarial learning
- G06T7/85 — Stereo camera calibration
- G06V10/24 — Aligning, centring, orientation detection or correction of the image
- G06V10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/26 — Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
- G06V10/817 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, by voting
Abstract
The invention discloses a laser point cloud-based intelligent scene understanding method for a humanoid robot, which comprises the following steps: the humanoid robot performs autonomous navigation and acquires three-dimensional live-action data through a laser radar scanning system to obtain scene point cloud data; the obtained scene point cloud data are processed with an adaptive radius sampling module; RPV-Net-based point cloud semantic feature extraction is performed on the extracted effective point cloud data to obtain the 3D bounding boxes and semantic tags of the point cloud data; the point cloud data are processed with a channel attention module, the overall features of the point cloud are acquired by average pooling, distinctive individual features are extracted by max pooling, and the two are aggregated to obtain the global features of the point cloud; and a multi-modal hybrid model combining a scene-understanding Bert pre-training model and an encoder-decoder model is constructed to process the global features of the point cloud. The method provides real-time, high-precision three-dimensional point cloud data for the humanoid robot by means of laser radar technology, and can realize depth perception and intelligent understanding of the environment.
Description
Technical Field
The invention relates to the field of humanoid robot perception and intelligent understanding. More specifically, the invention relates to a humanoid robot scene intelligent understanding method based on laser point clouds.
Background
With the rapid development of robot technology, humanoid robots show wide application potential in various scenarios, such as disaster response, high-risk environment detection and daily services. However, for a humanoid robot to operate effectively in a complex environment, highly accurate environmental awareness is required. Traditional robotic vision systems rely primarily on cameras or simple sensors, which face significant challenges in environments with limited visibility or complex lighting conditions. Especially in dynamic environments, it is difficult for conventional vision systems to acquire high-quality environmental data accurately and in real time. In the research, development and application of humanoid robots, acquiring accurate three-dimensional environment information in real time and realizing efficient intelligent scene understanding are therefore key to improving the autonomy and functionality of the robot.
Disclosure of Invention
The invention provides an intelligent understanding method of a humanoid robot scene based on laser point clouds, which provides real-time and high-precision three-dimensional point cloud data for the humanoid robot by utilizing a laser radar technology and can realize depth perception and intelligent understanding of the environment.
To achieve these objects and other advantages and in accordance with the purpose of the invention, as embodied and broadly described herein, there is provided a humanoid robot scene intelligent understanding method based on a laser point cloud, including:
A laser radar scanning system is assembled on a humanoid robot, the humanoid robot performs autonomous navigation and acquires three-dimensional live-action data through the laser radar scanning system to obtain scene point cloud data;
Processing the obtained scene point cloud data by using an adaptive radius sampling module to extract enough effective point cloud data;
Performing RPV-Net-based point cloud semantic feature extraction on the extracted effective point cloud data to obtain the 3D bounding boxes and semantic tags of the point cloud data of the building components in the scene;
Processing the point cloud data after point cloud semantic feature extraction with a channel attention module, acquiring the overall features of the point cloud by average pooling, extracting distinctive individual features by max pooling, and aggregating the overall features and the individual features to obtain the global features of the point cloud;
And constructing a multi-modal hybrid model of a scene-understanding Bert pre-training model and an encoder-decoder model to process the acquired global features of the point cloud, increasing the diversity of the training data through the scene-understanding Bert pre-training model, and interpreting the three-dimensional point cloud as three-dimensional ground objects and their parameter information through the encoder-decoder model, thereby improving the humanoid robot's ability to understand the scene.
Preferably, the laser point cloud-based intelligent understanding method for a humanoid robot scene, the laser radar scanning system comprises: the system comprises a laser radar scanner, an inertial measurement unit and a PTP time soft synchronization module, wherein the laser radar scanner and the inertial measurement unit are fused in a tight coupling mode, and the PTP time soft synchronization module enables time synchronization of the laser radar, the inertial measurement unit and the humanoid robot.
Preferably, the intelligent understanding method of the humanoid robot scene based on the laser point cloud further comprises calibrating the laser radar scanning system, specifically:
Using an Allan variance analysis method to evaluate noise characteristics and bias stability of the inertial measurement unit, and performing internal calibration on the inertial measurement unit;
and calibrating external parameters of the laser radar scanner and the inertial measurement unit by adopting a lidar_align calibration method, and acquiring a conversion matrix between a coordinate system of the laser radar scanner and a coordinate system of the inertial measurement unit so as to convert point cloud data under the coordinate system of the laser radar scanner into the coordinate system of the inertial measurement unit.
Preferably, the intelligent understanding method of the humanoid robot scene based on the laser point cloud further comprises the step of removing motion distortion of scene point cloud data acquired by the laser radar scanner after calibrating the laser radar scanning system, and the specific process is as follows:
Motion distortion occurring in scene point cloud data includes: a. if the scanning field of view of the laser radar scanner always falls on the same straight wall surface, then when the scanner moves in the direction perpendicular to the wall, the distance from the scanner to the same point of the wall keeps decreasing, so the point cloud data bulges along the movement direction of the scanner; b. when the movement of the laser radar scanner has an angular offset, the movement of the centers of the scan-line petals stretches the petal point clouds into an irregular shape, i.e. center distortion; c. the field angle of the laser radar scanner is fixed and cannot cover a full circle, so the coverage of the field of view scales with the change of distance, and the size of the scan-line petals scales with the movement of the scanner, i.e. edge distortion;
For the motion distortions of a and b, the inertial measurement unit is used to perform motion compensation on the intra-frame point cloud. The specific calculation is:
$t_i = \dfrac{i}{S_{num}}\,(t_{k+1} - t_k)$, $T_{k,i} = \dfrac{t_i}{t_{k+1} - t_k}\,T_{k,k+1}$, $P_{correct} = T_{k,i} \times P_i$,
wherein $t_k$ and $t_{k+1}$ are the start time and end time of a frame of point cloud recorded by the laser radar scanner, $t_i$ is the time of the i-th point $P_i$ relative to the start time in the k-th frame point cloud set S, $S_{num}$ is the total number of points in a frame point cloud set, $T_{k,k+1}$ is the frame pose transformation, $T_{k,i}$ is the pose transformation of point $P_i$ relative to the start time, and $P_{correct}$ is the point coordinate after motion distortion correction, calculated using the pose transformation matrix corresponding to point $P_i$;
for the motion distortion of c, the scanning points with unreliable edges are filtered out by limiting the size of the angle of view.
Preferably, in the intelligent understanding method of the humanoid robot scene based on the laser point cloud, after removing the motion distortion of the scene point cloud data, the method further comprises the step of reconstructing the scene point cloud data for point cloud storage, and the specific process is as follows:
The laser radar scanner scans a frame of point cloud data, wherein the frame of point cloud data comprises 20 scanning line petals in a graph, each scanning line petal is continuous at the center, and each scanning line petal sequentially stores the whole petals into one row of a matrix by taking the point closest to the scanning center as a starting point;
Storing spatially adjacent scan line petals into adjacent rows of the matrix, specifically: the 20 scanning line petals in one frame of point cloud data are obtained by rotating for 4 times by taking 5 scanning line petals as a period, and in one scanning period, the starting point numbers recorded by the 5 scanning line petals clockwise are sequentially 0, 1000, 2000, 500 and 1500; the difference of the start point numbers of the petals of the scanning lines in 4 scanning periods is 2500, and the start point numbers of the petals of the adjacent scanning lines indexed clockwise by one frame of scanning point cloud are calculated through the point number relation; and sequentially indexing corresponding scanning line petals through the obtained starting point numbers, and sequentially storing the petals according to rows to obtain a matrix with a continuous spatial relationship among the rows.
Preferably, the intelligent understanding method of the humanoid robot scene based on the laser point cloud further comprises the following steps after the reconstructed point cloud storage is performed on the scene point cloud data: the formula for extracting the angular points and plane point characteristics of each frame of point cloud data is modified, and specifically comprises the following steps:
A. Calculating the vector differences from the point whose curvature is being computed and from each of its five preceding and five following points to the scanning origin, summing these vector differences, and taking the length of the resulting sum as the curvature value of the point; the curvature value calculation formula is:
$c_i = \left\| \sum_{j \in S} \left( P^{L}_{(k,i)} - P^{L}_{(k,i+j)} \right) \right\|$,
wherein $c_i$ is the curvature value of the scan point $P_i$, S is the set of integers from -5 to 5 excluding 0, and $P^{L}_{(k,i)}$ and $P^{L}_{(k,i+j)}$ are the vectors of the i-th and (i+j)-th points in the k-th frame point cloud data under the laser radar scanner coordinate system L;
B. When the curvature value of the scanning point is larger than 1, judging the scanning point as an angular point; when the curvature value of the scanning point is smaller than 0.1, judging the scanning point as a plane point;
C. Extracting the corner points and plane points in each frame of point cloud data according to steps A and B.
Preferably, in the intelligent understanding method of the humanoid robot scene based on the laser point cloud, the adaptive radius sampling module is used for processing the obtained scene point cloud data, specifically: firstly, calculating Euclidean distances between a sampling center point and all points in scene point cloud data, marking the points exceeding a specific threshold value R as Q, and keeping original values of the rest points; then, the points in the spherical sampling radius are arranged in ascending order according to the distance, and the nearest w points are extracted as sampling points; if the sampling points contain Q values, which means that the number of effective points in the spherical sampling radius is less than w, the sampling radius needs to be gradually increased until at least w effective points are met, so that enough effective points are ensured to be used for feature learning;
The performing of RPV-Net-based point cloud semantic feature extraction on the extracted effective point cloud data is specifically: inputting a point cloud $P \in R^{N \times 3}$, i.e. point cloud data originally containing three dimensions, where N is the number of points; obtaining seed points through a semantic segmentation network; extracting boundary voxel features of the whole pillar using the pillar-based volume-column idea, and voting on the obtained pillars; then performing Vote & Sampling & Grouping operations on the obtained point cloud features, wherein each seed independently generates a vote through the voting module; the votes are then grouped into h groups and processed by the Proposal module to generate the final proposal, which outputs the 3D bounding box and semantic tags.
Preferably, in the intelligent understanding method of the humanoid robot scene based on the laser point cloud, the channel attention module is used for processing the point cloud data extracted by the semantic features of the point cloud, and the specific process is as follows:
The features of the point cloud data may be represented as $F \in R^{B \times N \times 1 \times C}$, where B, N and C represent the batch size, the number of points and the number of feature channels, respectively;
And an average pooling method and a maximum pooling method are adopted along the dimension of the feature channel number C, the average pooling method is used for acquiring the integral features of the point cloud, the maximum pooling method extracts individual features with more distinguishing property, and then the integral features and the individual features are subjected to feature aggregation to generate two different global features F avg and F max of the point cloud.
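As an illustration of the channel attention step described above, the following is a minimal PyTorch sketch, assuming a shared-MLP aggregation of the average-pooled and max-pooled descriptors; the pooling axis, reduction ratio and aggregation actually used by the invention are not specified here, so these choices are assumptions.

```python
# A minimal sketch (assumed layout) of a channel-attention block of the kind described
# above: features F of shape (B, N, 1, C) are reduced by average pooling and max pooling
# to two global descriptors F_avg and F_max, which are aggregated through a shared MLP
# into per-channel weights. Illustrative only; the invention's exact design may differ.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Shared MLP applied to both pooled descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, N, 1, C) point features
        f_avg = f.mean(dim=(1, 2))   # (B, C) overall (average-pooled) features
        f_max = f.amax(dim=(1, 2))   # (B, C) distinctive (max-pooled) features
        weights = torch.sigmoid(self.mlp(f_avg) + self.mlp(f_max))  # feature aggregation
        return f * weights[:, None, None, :]  # re-weight the channels of the point features

# Usage: attn = ChannelAttention(64); out = attn(torch.randn(2, 1024, 1, 64))
```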
Preferably, in the laser point cloud-based intelligent scene understanding method for a humanoid robot, the increasing of the diversity of the training data through the scene-understanding Bert pre-training model is specifically:
the scene-understanding Bert pre-training model expands the dataset with a generative adversarial network (GAN) model, increasing the diversity of the training data;
the standard generative adversarial network (GAN) model includes a generator G and a discriminator D;
training samples are set for the scene point cloud dataset by adjusting the parameter settings of the generative adversarial network (GAN) model;
the objective of the generator G is to generate fake data as close as possible to the real data, and the objective of the discriminator D is to distinguish the real data from the fake data generated by the generator G; a balance coefficient γ is added to adjust the weight between the two.
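The following is a minimal PyTorch sketch of a standard GAN objective with a balance coefficient γ; where γ enters the loss is not fixed by the text above, so placing it on the fake-sample term of the discriminator loss is only an assumption for illustration.

```python
# A minimal sketch of the GAN objective with a balance coefficient gamma. The placement
# of gamma on the fake-sample term is an assumption; the patent only states that gamma
# adjusts the weight between the two objectives.
import torch
import torch.nn.functional as F

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor, gamma: float = 0.5):
    # d_real / d_fake: discriminator logits for real and generated point-cloud samples
    real_term = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    fake_term = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return real_term + gamma * fake_term

def generator_loss(d_fake: torch.Tensor):
    # The generator tries to make the discriminator label its samples as real
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
```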
Preferably, in the intelligent understanding method of the humanoid robot scene based on the laser point cloud, the processing procedure of the constructed encoder-decoder model is specifically as follows:
the point cloud encoder takes a point cloud $P \in R^{n \times d}$ as input, where n is the number of input points and d is the feature dimension of each input point;
the output of the point cloud encoder is a point feature sequence $X = (x_1, x_2, \ldots, x_m) \in R^{m \times c}$, where m is the number of output point features and c is the output point feature dimension;
the point features X are mapped to point tokens $Y = (y_1, y_2, \ldots, y_m) \in R^{m \times c'}$, where c' is the dimension of a point token, by a multi-layer perceptron MLP;
3D-MELL is a decoder-only architecture that accepts a token sequence consisting of text tokens and point tokens, i.e. a mixed token sequence, denoted $Z = (z_1, z_2, \ldots, z_k) \in R^{k \times c'}$, where k is the total number of tokens;
with the self-attention mechanism, 3D-MELL can understand the contextual relationship between different types of tokens, enabling it to generate a response based on text and point cloud input; the prediction of the i-th output token is conditioned on all previous tokens $Z_{<i} = (z_1, z_2, \ldots, z_{i-1})$.
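As a rough illustration of the mixed token sequence described above, the following sketch assumes an MLP maps the point features X to point tokens Y, which are concatenated with embedded text tokens to form Z before being fed to a decoder-only model; the actual 3D-MELL interfaces are not specified here, so all module names are placeholders.

```python
# A minimal sketch of assembling the mixed token sequence Z from text tokens and point
# tokens. PointTokenizer and build_mixed_tokens are hypothetical helpers, not the
# invention's actual modules.
import torch
import torch.nn as nn

class PointTokenizer(nn.Module):
    def __init__(self, c: int, c_prime: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c, c_prime), nn.GELU(), nn.Linear(c_prime, c_prime))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (m, c) point features from the point cloud encoder -> (m, c') point tokens
        return self.mlp(x)

def build_mixed_tokens(text_tokens: torch.Tensor, point_feats: torch.Tensor,
                       tokenizer: PointTokenizer) -> torch.Tensor:
    # text_tokens: (t, c') embedded text tokens; point_feats: (m, c) encoder output
    point_tokens = tokenizer(point_feats)              # (m, c')
    return torch.cat([text_tokens, point_tokens], 0)   # Z: (k, c') with k = t + m
```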
The invention at least comprises the following beneficial effects:
For the problem of uneven point cloud density, the adaptive radius sampling module processes the obtained scene point cloud data to extract enough effective point cloud data, ensuring that enough points are available for feature learning and improving the humanoid robot's understanding capability and segmentation precision for complex scenes.
Secondly, for the problems of low processing speed and lost features caused by the large amount of environmental data, the invention designs a channel attention module that reduces the number of parameters while avoiding the loss of both individual and overall features.
Thirdly, the laser radar scanner used in the invention is a solid-state laser radar scanner. For its characteristics, such as the small field of view, petal-shaped scanning trajectory and non-repetitive scanning mode, the motion distortion is removed, the point cloud of the solid-state laser radar scanner is reconstructed to establish the spatial relationship between scanning lines, and the feature extraction formula is corrected, thereby improving extraction efficiency and precision, effectively improving the cases of insufficient and wrongly selected feature points caused by the laser radar scanner and the environment, and alleviating the problems of point cloud layering and aggravation.
Fourth, the multi-modal hybrid model of the Bert pre-training model and the encoder-decoder model is used for scene understanding, which improves the humanoid robot's understanding of complex scenes.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
Fig. 1 is a schematic flow chart of a humanoid robot scene intelligent understanding method based on laser point clouds according to an embodiment of the invention;
FIG. 2 is a schematic diagram of PTP time synchronization according to an embodiment of the present invention;
FIG. 3 is a flow chart of the laser radar scanner and IMU external parameter calibration according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of distortion of a lidar scanner perpendicular to a wall surface in an embodiment of the present invention;
FIG. 5 is a schematic diagram of center distortion when the laser radar scanner is angularly offset in an embodiment of the present invention;
FIG. 6 is a schematic diagram of edge distortion occurring when the field angle of a lidar scanner is small in an embodiment of the present invention;
FIG. 7 is a schematic diagram of the petal-shaped scanning point cloud of the laser scanning radar, which has a certain inherent radian, in an embodiment of the present invention;
FIG. 8 is a graph showing the correspondence between the point number and the standard curvature difference under the standard condition in the embodiment of the present invention;
FIG. 9 is a graph showing the change of the distance correction value, obtained by taking the difference between the curvatures of corresponding points at different distances and the curvature of the corresponding points under the standard condition as the distance correction value, in an embodiment of the invention;
FIG. 10 is a schematic diagram of a sampling scheme using Pointnet ++;
FIG. 11 is a schematic diagram of an adaptive radius sampling method of the present invention employed in an embodiment of the present invention;
FIG. 12 is a schematic diagram of point cloud semantic segmentation in an embodiment of the present invention;
FIG. 13 is a schematic diagram of a channel attention module according to an embodiment of the present invention;
FIG. 14 is a schematic diagram of an encoder-decoder model process in accordance with an embodiment of the present invention;
FIG. 15 is a 3D bounding box estimation plot of an overall point cloud acquired by a lidar system in an example of the present invention;
FIG. 16 is a 3D bounding box estimation map of a local extraction map obtained in an example of the present invention;
Fig. 17 is a 3D bounding box estimation map of an accurate local extraction map obtained in an example of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the drawings to enable those skilled in the art to practice the invention by referring to the description.
It will be understood that terms, such as "having," "including," and "comprising," as used herein, do not preclude the presence or addition of one or more other elements or groups thereof.
As shown in fig. 1, the embodiment of the invention provides a humanoid robot scene intelligent understanding method based on laser point cloud, which comprises the following steps:
S100, assembling a laser radar scanning system on the humanoid robot, wherein the humanoid robot performs autonomous navigation and acquires three-dimensional live-action data through the laser radar scanning system to obtain scene point cloud data.
In the ROS operating system of the humanoid robot, a move_base navigation planning framework is added and a mbot_navigation random autonomous navigation function module is built, upgrading the three-dimensional live-action data acquisition of the humanoid robot from fixed-route travel to random autonomous navigation.
The laser radar scanning system includes: the system comprises a laser radar scanner, an inertial measurement unit and a PTP time soft synchronization module, wherein the laser radar scanner and the inertial measurement unit are fused in a tight coupling mode, and the PTP time soft synchronization module enables time synchronization of the laser radar, the inertial measurement unit and the humanoid robot.
Specifically, the laser radar scanner is a solid-state laser radar scanner, a scanning sensor that acquires the geometric shape of a measured target through ranging. It mainly comprises a laser emitter, a detector and an information processing system, has strong anti-interference capability and can work in harsh environments. Its working principle is to emit visible and near-infrared laser toward the measured object; the emitted laser is reflected back after reaching the object and is received by the receiver. By recording the time from emission to reception of the laser beam and combining it with the speed of light, the distance from the measured object to the scanner and the intensity of the reflected laser are calculated, thereby obtaining the geometric shape of the measured object. Specific measurement methods include triangulation, time of flight (ToF) and phase difference.
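A minimal sketch of the time-of-flight principle mentioned above (distance as half the round-trip time multiplied by the speed of light); the numeric example is illustrative only.

```python
# Time-of-flight ranging: distance = c * t / 2 for a round-trip time t.
C_LIGHT = 299_792_458.0  # speed of light in m/s

def tof_distance(round_trip_time_s: float) -> float:
    return C_LIGHT * round_trip_time_s / 2.0

# e.g. a round trip of about 66.7 ns corresponds to roughly 10 m
print(tof_distance(66.7e-9))
```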
An Inertial Measurement Unit (IMU) is a sensor for measuring physical quantities of an object, including linear acceleration, angular velocity and magnetic field direction, and is mainly composed of an accelerometer, a gyroscope and a magnetometer. The IMU works on the principle of determining the motion state of the instrument by measuring acceleration and angular velocity. The accelerometer can measure linear acceleration along X, Y, Z three axes, the gyroscope can measure angular velocity around X, Y, Z three axes, and the magnetometer can measure the direction and intensity of geomagnetic field to calibrate drift of the gyroscope, so that the moving direction and the rotating state of the equipment are obtained. The deviation exists between the physical quantity measured by the IMU and the actual physical quantity, and the deviation is mainly divided into two types: systematic errors such as bias (bias) and random errors such as noise, the presence of errors is therefore taken into account in the calculation process.
The basic measurement model of the IMU is:
$\tilde{a} = a_{true} + b_a + n_a$, $\tilde{\omega} = \omega_{true} + b_g + n_g$,
where $\tilde{a}$ is the measured acceleration, $a_{true}$ is the true acceleration, $b_a$ is the bias of the accelerometer, $n_a$ is the noise during the accelerometer measurement, $\tilde{\omega}$ is the measured angular velocity, $\omega_{true}$ is the true angular velocity, $b_g$ is the bias of the gyroscope, and $n_g$ is the noise during the gyroscope measurement.
In practical applications, such errors need to be compensated for in order to ensure measurement accuracy. The compensation method typically includes two calibration and filtering algorithms. Calibration may be performed with the device stationary or in motion, the magnitude of the error determined, and the calibrated error value subtracted from the subsequent measurement. The filtering algorithm (such as Kalman filtering) estimates and compensates for the bias in real time while in motion, reducing the effects of noise.
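A minimal numpy sketch of the measurement model and the simple calibration-based compensation described above, assuming the biases b_a and b_g were estimated during a static calibration; real systems additionally estimate and filter the noise terms (e.g. with a Kalman filter).

```python
# Calibration-based IMU compensation: subtract the calibrated biases from the raw
# measurements; noise handling is omitted in this sketch.
import numpy as np

def compensate_imu(accel_raw: np.ndarray, gyro_raw: np.ndarray,
                   b_a: np.ndarray, b_g: np.ndarray):
    # accel_raw, gyro_raw: (3,) raw accelerometer / gyroscope readings
    # b_a, b_g: accelerometer and gyroscope biases estimated during static calibration
    accel = accel_raw - b_a   # approximates a_true (noise n_a remains)
    gyro = gyro_raw - b_g     # approximates omega_true (noise n_g remains)
    return accel, gyro
```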
In order to obtain a high-precision point cloud map, a plurality of sensors are required to be used in a fusion mode, and multi-source data mutual constraint correction is obtained. The fusion mode of the sensor is loose coupling and tight coupling, and the fusion mode used in the embodiment of the invention is tight coupling. The tight coupling is the observation equation and the motion equation for motion state estimation constructed by the lidar scanner and the IMU together. The principle is that the data collected by the laser radar scanner and the IMU are put into the same state vector, the rotation matrix, the translation matrix, the speed and the IMU zero offset are jointly optimized, and the state vector is only required to be optimized. And constructing error items comprising laser radar measurement residual errors and IMU measurement residual errors by utilizing the laser radar scanner and IMU data so as to achieve the purpose of simultaneously optimizing the pose, the speed and the deviation of the tightly coupled system. The tight coupling system has compact structure, high integration among the sensors, mutual constraint among the sensor models, more sufficient information utilization and higher positioning and mapping precision and efficiency. The key problem of the multi-sensor fusion operation is multi-sensor data fusion, wherein the data fusion mainly comprises two aspects of content, namely time synchronization and coordinate system alignment of multi-sensor data.
For time synchronization, the embodiment of the invention adopts a PTP time soft synchronization module, which specifically comprises the following steps:
The lidar scanner used in the embodiments of the present invention is a solid-state lidar, so PTPv2 may be used for time synchronization between the lidar scanner and other devices without GPS and PPS hardware signals. The method requires minimal external hardware support and only needs one master clock device in the whole network. In use, the clock of the solid-state laser radar scanner is synchronized to the host computer; the synchronization process follows the IEEE 1588 v2.0 PTP protocol, and using the delay request-response mechanism of the protocol, the Livox Mid-40 acts as the slave end and performs PTP time synchronization with the master port (the clock device of the host computer).
The PTP synchronization principle is that periodic time information is sent from a master port, the slave port receives the information with a time stamp, and local time is adjusted by calculating time delay and time difference between sending and receiving of the information, so that time between the slave device and the master device is kept consistent, and the schematic diagram is shown in fig. 2.
The transmission path delay can be expressed as:
$Delay = \dfrac{(T_2 - T_1) + (T_4 - T_3)}{2}$
and the time offset can be expressed as:
$Offset = \dfrac{(T_2 - T_1) - (T_4 - T_3)}{2}$
The synchronization flow is as follows:
(1) The master port sends a Sync synchronous message to the slave port, and records time T1;
(2) After receiving the Sync message from the port, recording the receiving time T2;
(3) The master port sends a Sync message, and then sends a Follow_Up following message with a T1 value to the slave port;
(4) The slave port sends delay_req Delay request message to the master port, and the sending time T3 is recorded;
(5) After receiving the delay_req message, the main port records the receiving time T4;
(6) The master port sends a delay_resp Delay request response message carrying T4 to the slave port.
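Based on the four timestamps exchanged in the flow above, a minimal sketch of the delay and offset calculation might look as follows.

```python
# Delay and offset from the four PTP timestamps (T1: Sync sent, T2: Sync received,
# T3: Delay_Req sent, T4: Delay_Req received).
def ptp_delay_offset(t1: float, t2: float, t3: float, t4: float):
    delay = ((t2 - t1) + (t4 - t3)) / 2.0   # one-way transmission path delay
    offset = ((t2 - t1) - (t4 - t3)) / 2.0  # slave clock offset relative to the master
    return delay, offset

# The slave then adjusts its local time by the offset to stay synchronized with the master.
```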
S110, calibrating the laser radar scanning system, specifically:
S111, evaluating noise characteristics and bias stability of the inertial measurement unit by using an Allan variance analysis method, and performing internal calibration on the inertial measurement unit.
The error of the IMU is mainly Gaussian white noise and bias (bias random walk), and the aim of calibrating the internal parameters of the IMU is to improve the accuracy of positioning, navigation and attitude estimation.
The Allan variance is a time-domain analysis method that can separate and identify noise of different sources from time-series data [59]. The Allan variance $\sigma^2(\tau)$ is a function of the observation time interval $\tau$:
$\sigma^2(\tau) = \dfrac{1}{2(N-1)} \sum_{i=1}^{N-1} \left( x_{i+1} - x_i \right)^2$,
where N is the number of data points, $x_i$ is the i-th data point, and $\tau$ is the sampling time interval.
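A minimal numpy sketch of the Allan variance computation, assuming x already holds the values averaged over the interval τ; sweeping τ and examining σ²(τ) is what separates the different noise sources.

```python
# Allan variance of a sequence of data points averaged over the observation interval tau.
import numpy as np

def allan_variance(x: np.ndarray) -> float:
    # x: (N,) data points for one observation interval tau
    diffs = np.diff(x)                # x_{i+1} - x_i
    return 0.5 * np.mean(diffs ** 2)  # 1 / (2(N-1)) * sum of squared differences
```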
S112, calibrating external parameters of the laser radar scanner and the inertial measurement unit by adopting a lidar_align calibration method, and acquiring a conversion matrix between a coordinate system of the laser radar scanner and a coordinate system of the inertial measurement unit so as to convert point cloud data under the coordinate system of the laser radar scanner into the coordinate system of the inertial measurement unit.
Coordinate system alignment means converting the point cloud data into the coordinate system of the IMU sensor to facilitate subsequent data processing. Therefore, the relative positional relationship between the two sensor coordinate systems must be obtained, which requires external parameter calibration of the laser radar and the IMU to obtain the transformation matrix between the two coordinate systems. The essence of external parameter calibration is a data optimization problem: with which extrinsic values can the point cloud data acquired by the laser radar scanner be converted into the IMU coordinate system such that the point clouds of consecutive frames overlap to the greatest extent. The calibration method adopted in the embodiment of the invention is lidar_align. Its core principle is to randomly sample the original point cloud (most of the points extracted by default are repeated points), find the closest point in the next frame of point cloud with a k-d tree and compute the distance; this produces a k-nearest-neighbour error, which increases when the transformation matrix is not optimal. The nonlinear optimization library nlopt is used to minimize this error so that the k-nearest-neighbour error approaches 0, and the rotation matrix and translation matrix at that point are solved.
The specific calibration flow is shown in fig. 3, wherein, the calibration data should be noted to contain a large amount of rotation and translation during the recording process, and the data amount is controlled within 2G, so that the calibration result has higher precision.
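For illustration, a minimal sketch of the k-nearest-neighbour error that a lidar_align-style calibration minimizes is given below, assuming scipy is available; the actual lidar_align implementation and its nlopt optimization loop are more involved.

```python
# Nearest-neighbour error for a candidate extrinsic (R, t): transform one frame, look up
# the closest points of the next frame with a k-d tree, and sum the distances. The error
# shrinks as the candidate extrinsics approach the true ones.
import numpy as np
from scipy.spatial import cKDTree

def nn_error(frame_a: np.ndarray, frame_b: np.ndarray,
             R: np.ndarray, t: np.ndarray) -> float:
    # frame_a, frame_b: (N, 3) consecutive point clouds; R: (3, 3) rotation; t: (3,) translation
    transformed = frame_a @ R.T + t
    dists, _ = cKDTree(frame_b).query(transformed, k=1)
    return float(np.sum(dists))
```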
S120, after calibrating the laser radar scanning system, the method further comprises the step of removing motion distortion of scene point cloud data acquired by the laser radar scanner, wherein the specific process is as follows:
Motion distortion occurring in scene point cloud data includes: a. if the scanning field of view of the laser radar scanner always falls on the same straight wall surface, then when the scanner moves in the direction perpendicular to the wall, the distance from the scanner to the same point of the wall keeps decreasing, so the point cloud data bulges along the movement direction of the scanner, as shown in fig. 4; b. when the movement of the laser radar scanner has an angular offset, the movement of the centers of the scan-line petals stretches the petal point clouds into an irregular shape, i.e. center distortion, as shown in fig. 5; c. the field angle of the laser radar scanner is fixed and cannot cover a full circle, so the coverage of the field of view scales with the change of distance, and the size of the scan-line petals scales with the movement of the scanner, i.e. edge distortion, as shown in fig. 6;
For the motion distortions of a and b, the inertial measurement unit is used to perform motion compensation on the intra-frame point cloud; the acquisition frequency of the IMU is 100 Hz, so the 10 Hz laser radar point cloud can be interpolated. The specific calculation is:
$t_i = \dfrac{i}{S_{num}}\,(t_{k+1} - t_k)$, $T_{k,i} = \dfrac{t_i}{t_{k+1} - t_k}\,T_{k,k+1}$, $P_{correct} = T_{k,i} \times P_i$,
wherein $t_k$ and $t_{k+1}$ are the start time and end time of a frame of point cloud recorded by the laser radar scanner, $t_i$ is the time of the i-th point $P_i$ relative to the start time in the k-th frame point cloud set S, $S_{num}$ is the total number of points in a frame point cloud set, $T_{k,k+1}$ is the frame pose transformation, $T_{k,i}$ is the pose transformation of point $P_i$ relative to the start time, and $P_{correct}$ is the point coordinate after motion distortion correction, calculated using the pose transformation matrix corresponding to point $P_i$. It should be noted that, in the calculation of $t_i$, it is assumed that the solid-state laser radar scanner emits laser pulses at uniform time intervals and the internal prism rotates at a uniform speed; in the calculation of $T_{k,i}$, it is assumed that the laser radar scanner remains in uniform motion during one frame, so the pose change within one frame is also uniform.
By the motion distortion removal method, point clouds in the same frame can be corrected to the starting moment of the current frame approximately, and the influence of the motion distortion on the point cloud characteristics is eliminated.
For the motion distortion of c, the scanning points with unreliable edges are filtered out by limiting the size of the angle of view.
It should be noted that, in the embodiment of the invention, the scanning frequency of the solid-state laser radar scanner is 10Hz, and the scanning time of one frame of point cloud is 0.1s. Ideally, the same frame of point clouds should be acquired at the same time, but in practice, a frame of point clouds is acquired uniformly within 0.1s. While the laser radar scanner is in motion during data acquisition, when the last point of the point cloud of the current frame is acquired, the scanning positions of the laser radar scanner and the first point of the current frame are different. However, the time stamps of all scanning points in a frame are the time when the point cloud starts to be acquired, and the point cloud of the frame is considered to be acquired at the starting time. The motion of the laser radar scanner itself affects the scanning point cloud, and motion distortion of the point cloud is generated. The more intense the lidar motion during a frame time, the more severe the resulting motion distortion.
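A minimal numpy/scipy sketch of the per-point motion compensation described above, under the stated assumptions of uniform emission intervals and uniform motion within a frame; interpolating the pose with a linear translation and a spherical (slerp) rotation is one common realization and not necessarily the exact one used by the invention.

```python
# Per-point de-skew: interpolate the frame pose transformation T_{k,k+1} for each point
# according to its fraction of the frame time, then apply it to obtain P_correct.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def deskew(points: np.ndarray, R_frame: np.ndarray, t_frame: np.ndarray) -> np.ndarray:
    # points: (S_num, 3) one frame of points ordered by emission time
    # R_frame, t_frame: rotation (3x3) and translation (3,) accumulated over the frame
    s_num = points.shape[0]
    fractions = np.arange(s_num) / s_num                         # t_i / (t_{k+1} - t_k)
    slerp = Slerp([0.0, 1.0], Rotation.from_matrix([np.eye(3), R_frame]))
    rots = slerp(fractions)                                      # per-point interpolated rotation
    corrected = rots.apply(points) + fractions[:, None] * t_frame
    return corrected                                             # P_correct for each point
```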
S130, performing motion distortion removal on scene point cloud data, and then performing reconstruction point cloud storage on the scene point cloud data, wherein the specific process is as follows:
The laser radar scanner scans a frame of point cloud data, wherein the frame of point cloud data comprises 20 scanning line petals in a graph, each scanning line petal is continuous at the center, and each scanning line petal sequentially stores the whole petals into one row of a matrix by taking the point closest to the scanning center as a starting point;
Storing spatially adjacent scan line petals into adjacent rows of the matrix, specifically: the 20 scanning line petals in one frame of point cloud data are obtained by rotating for 4 times by taking 5 scanning line petals as a period, and in one scanning period, the starting point numbers recorded by the 5 scanning line petals clockwise are sequentially 0, 1000, 2000, 500 and 1500; the difference of the start point numbers of the petals of the scanning lines in 4 scanning periods is 2500, and the start point numbers of the petals of the adjacent scanning lines indexed clockwise by one frame of scanning point cloud are calculated through the point number relation; and sequentially indexing corresponding scanning line petals through the obtained starting point numbers, and sequentially storing the petals according to rows to obtain a matrix with a continuous spatial relationship among the rows.
It should be noted that, a frame of point cloud after removing the motion distortion is generally stored in a matrix according to a scanning sequence, so that the algorithm can quickly index and call the point cloud row by row or column by column. The solid-state laser radar scanning point cloud used in the embodiment of the invention can be stored in the matrix, but the scanning point cloud does not have obvious row and column characteristics in space due to the non-repeated scanning mode of the solid-state laser radar, so that the row and column characteristics of the storage matrix have no physical significance, and the index and relation construction of the point cloud cannot be carried out in space by using the row and column. Therefore, the point cloud of the solid-state lidar needs to be reconstructed to establish the spatial relationship between the scanning lines.
The solid-state laser radar scanning point cloud is composed of 20 petals in a figure; each petal is continuous at the center, and the scanning point cloud of a single petal can be stored in a row of the matrix. Each scan-line petal takes the point closest to the scanning center as its starting point, and the whole petal is stored sequentially into one row of the matrix. If there is a point on a scan-line petal with no return value, which is replaced by the origin coordinates, the null point is removed to ensure that the scan lines in the same row of the matrix are spatially continuous. Since petals with consecutive point numbers on a scanning line are not spatially adjacent, a matrix stored in order of point numbers has no spatial relationship between adjacent rows, and spatially adjacent petals need to be stored in adjacent rows of the matrix.
After the point cloud spatial relationship matrix is established according to step S130, the matrix row-column relationship can be used to quickly index the same feature in adjacent rows and columns during subsequent feature point extraction and feature matching, without traversing all feature points, thereby reducing the amount of calculation and speeding up the algorithm. The method of removing null points and storing the petals in spatially adjacent order may make the point numbers of adjacent rows and columns of the matrix discontinuous. Although a matrix stored in this way loses the original continuity of the point numbers, it provides a more efficient and faster spatial index than indexing by point number and reconstructs the spatial relationship between points.
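A minimal numpy sketch of the reconstructed point cloud storage described above; the exact clockwise interleaving of petals from different scanning periods is not fully specified in the text, so the index ordering below is an assumption, and a full frame of 20 × 500 points is assumed.

```python
# Reorganize one frame so that each scan-line petal occupies one matrix row and
# spatially adjacent petals occupy adjacent rows.
import numpy as np

def petal_start_indices() -> list:
    base = [0, 1000, 2000, 500, 1500]   # clockwise start numbers within one scanning period
    # assumed interleaving: consecutive clockwise petals come from consecutive periods
    # (each period is offset by 2500 points)
    return [b + 2500 * period for b in base for period in range(4)]

def reorder_frame(points: np.ndarray, points_per_petal: int = 500) -> np.ndarray:
    # points: (10000, 3) frame stored in acquisition (point-number) order
    rows = [points[s:s + points_per_petal] for s in petal_start_indices()]
    return np.stack(rows)               # (20, points_per_petal, 3) spatially ordered rows
```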
S140, after the scene point cloud data are subjected to the reconstruction point cloud storage, the method further comprises the following steps: the formula for extracting the angular points and plane point characteristics of each frame of point cloud data is modified, and specifically comprises the following steps:
A. Calculating the vector differences from the point whose curvature is being computed and from each of its five preceding and five following points to the scanning origin, summing these vector differences, and taking the length of the resulting sum as the curvature value of the point; the curvature value calculation formula is:
$c_i = \left\| \sum_{j \in S} \left( P^{L}_{(k,i)} - P^{L}_{(k,i+j)} \right) \right\|$,
wherein $c_i$ is the curvature value of the scan point $P_i$, S is the set of integers from -5 to 5 excluding 0, and $P^{L}_{(k,i)}$ and $P^{L}_{(k,i+j)}$ are the vectors of the i-th and (i+j)-th points in the k-th frame point cloud data under the laser radar scanner coordinate system L;
B. When the curvature value of the scanning point is larger than 1, judging the scanning point as an angular point; when the curvature value of the scanning point is smaller than 0.1, judging the scanning point as a plane point;
C. Extracting the corner points and plane points in each frame of point cloud data according to steps A and B.
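A minimal numpy sketch of steps A to C, applied to one scan-line row of the reorganized matrix; the thresholds 1 and 0.1 follow the text above.

```python
# Curvature-based feature extraction: sum the vector differences to the five preceding
# and five following points, use the length of the sum as the curvature, and classify
# points as corner points (> 1) or plane points (< 0.1).
import numpy as np

def extract_features(scan_line: np.ndarray):
    # scan_line: (M, 3) points of one scan-line row, ordered along the line
    corners, planes = [], []
    offsets = [j for j in range(-5, 6) if j != 0]   # the set S
    for i in range(5, scan_line.shape[0] - 5):
        diff_sum = np.sum([scan_line[i] - scan_line[i + j] for j in offsets], axis=0)
        c = np.linalg.norm(diff_sum)                # curvature value c_i
        if c > 1.0:
            corners.append(i)
        elif c < 0.1:
            planes.append(i)
    return corners, planes
```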
It should be noted that the embodiment of the invention provides a LEGO-LoopLOAM algorithm, in which the corner points and plane points of each frame in the point cloud are extracted by using the curvature of each point as the judging condition. The method calculates the vector differences from the point whose curvature is computed and from its five preceding and five following points to the scanning origin, sums them, and uses the length of the sum to reflect the curvature of the point; corner points are defined as points with curvature values larger than 1 and plane points as points with curvature values smaller than 0.1. Let the spatial coordinates of the curvature point be $P_i(x_i, y_i, z_i)$; the curvature calculation formula is as follows:
$c_i = \left\| \sum_{j \in S} \left( P^{L}_{(k,i)} - P^{L}_{(k,i+j)} \right) \right\|$,
where $c_i$ is the curvature of point $P_i$, S is the set of integers from -5 to 5 that does not contain 0, and $P^{L}_{(k,i)}$ and $P^{L}_{(k,i+j)}$ are the vectors of the i-th and (i+j)-th points in the k-th frame point cloud data under the laser radar scanner coordinate system L.
The method for replacing the point curvature by the calculated value skillfully reduces the calculated amount of the actual curvature calculation method, has higher efficiency and has good application effect. It requires that the points on the scan line have no height difference and can be projected on a plane under the radar coordinate system without loss. The non-repetitive petal-shaped scanning line of the solid-state laser radar scanner does not meet the requirement and is not suitable for the method, so that the characteristic extraction formula is corrected according to the scanning characteristics of the solid-state laser radar scanner.
The single-scan-line point cloud of a mechanical laser radar scanner is arranged in a straight line along the XOY plane, and the curvature calculated according to the original formula is the curvature of the point on the XOY plane. However, the single-frame scanning point cloud of the solid-state laser radar scanner is petal-shaped: the arrangement of the points on a plane is not flat and itself has a certain curvature, as shown in fig. 7. The inherent radian of the scan line interferes with the calculation result of the feature point extraction formula, resulting in erroneous judgment of corner feature points and planar feature points.
Because the scanning line of the solid-state laser radar scanner has a radian, each point on the scanning line has a corresponding curvature even on a plane, so the curvature values calculated for the solid-state laser radar scanning point cloud with the above formula change discontinuously and are inconsistent with the actual curvature change. In actual experiments, planar feature points are sometimes misjudged as corner feature points, so the influence of the inherent curvature of the scan points needs to be eliminated.
First, the difference between the curvature of each point on the solid-state scanner scan line and the curvature of a flat scan-line point needs to be determined under standard conditions. This study defines the standard condition as the scanning head being 10 meters from a vertical plane with the scanning direction perpendicular to that plane. The point curvatures calculated for the scanning point cloud under this condition are taken as the standard curvature difference $c_{stdif}$ between the solid-state laser radar scan line and a flat scan line. Under the standard condition, the petals of each scanning line in the scanning point cloud are essentially similar, each scanning petal has 500 points, and the two sides of the line connecting the scanning center point and the farthest end point of a single petal are basically symmetrical. Therefore, only the standard curvature differences of the 250 points of a half petal need to be calculated, and the standard curvature differences of all points on the standard-condition scan line can be replaced by the corresponding points. Each point from the point closest to the scanning center on a single scan line to the farthest point of the petal is numbered, corresponding to numbers 0 through 249. The correspondence between the point number and the standard curvature difference under the standard condition is shown in fig. 8. The standard curvature difference as a function of the point number can be approximated by the following expression:
$c_{stdif} = 5.28 \times 10^{-7} \times \mathrm{Index}^2 + 0.002$
The standard curvature difference of the scanning point is not a constant value, and the size of the standard curvature difference is mainly influenced by two factors of an X-axis included angle of a connecting line of the scanning point and the scanning origin and a distance between the scanning point and the scanning origin. Therefore, the dynamic corresponding relation between the angle and the distance is required to be constructed according to the standard curvature difference so as to compensate the dynamic curvature difference of the scanner at different distances and different postures.
The inherent curvature of a point on a single petal of a single-frame point cloud can change along with the change of the angle theta of an X-axis connecting line of a scanning point and a scanning origin, so that the corresponding relation between the angle and a standard curvature difference needs to be determined. And calculating the corresponding relation between the included angle and the point number, so as to deduce the corresponding relation of the standard curvature difference along with the change of the included angle. The data recorded by the laser radar point cloud only has three-dimensional coordinates and reflection intensity of space points, so that the available known quantity calculated in the subsequent expression is limited to be the three-dimensional coordinates. The calculation formula of the included angle theta of any point on the scanning line comprises the following steps:
The angle corresponding to each point number is computed under the standard condition, and the fitted linear expression is:
θ = 1.9×10⁻⁴ × Index + 0.54
Combining the two fitted expressions yields the relation between the standard curvature difference and the angle between the X axis and the line connecting the scan point to the scan origin, which is then converted into a relation between the standard curvature difference and the spatial coordinates of the point, given by the following formula:
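The converted expression is not reproduced here. A sketch of how the two fits above combine, assuming direct substitution of the linear angle fit into the quadratic curvature fit, is:

```latex
% Eliminate Index from the two fitted expressions (assumed direct substitution)
\mathrm{Index} = \frac{\theta - 0.54}{1.9\times 10^{-4}}, \qquad
c_{\mathrm{stdif}}(\theta) = 5.28\times 10^{-7}\left(\frac{\theta - 0.54}{1.9\times 10^{-4}}\right)^{2} + 0.002
```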
The curvature formula adopted by the invention essentially reflects curvature change through differences of local point-to-point distances. As the scanning distance grows, the difference between the distances from neighboring points to the scan center also grows, so the computed curvature value increases accordingly. The same feature therefore yields different curvatures at different scanning distances, which degrades feature matching and loop-closure detection accuracy. The influence of distance on curvature, which manifests mainly as curvature values that grow with scanning distance, must therefore be removed as far as possible.
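For context only, a commonly used local-distance-difference curvature (a LOAM-style measure, which may differ in detail from the formula actually used by the invention) can be sketched as:

```python
import numpy as np

def local_curvature(scan: np.ndarray, i: int, half_window: int = 5) -> float:
    """LOAM-style curvature proxy at point i of one scan line: norm of the sum
    of coordinate differences to neighbours, normalised by the point range.
    Illustrative only; assumes i has half_window neighbours on each side."""
    neighbours = np.r_[scan[i - half_window:i], scan[i + 1:i + 1 + half_window]]
    diff = (scan[i] - neighbours).sum(axis=0)
    return float(np.linalg.norm(diff) / (len(neighbours) * np.linalg.norm(scan[i])))
```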
To obtain the distance-correction equation for the curvature change, point clouds of the same wall surface were collected every 2 meters over a distance range of 10-60 meters, for a total of 25 acquisitions. For each acquisition, the curvatures of the points with the same point numbers on a single petal of the scan lines at the different distances were computed, and the difference between the curvature of each corresponding point at distance x_i and the curvature of the corresponding point under the standard condition was taken as the distance correction value. The variation of the distance correction value is shown in fig. 9, and the corresponding fitted expression is:
c_dis = 0.016 × x_i + 0.006
Combining the above formulas, the dynamic correction value for the curvature is obtained as follows:
The corrected curvature formula of the invention is then:
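The corrected formula is not reproduced in this text. The sketch below shows only one plausible way the pieces could fit together, assuming the dynamic correction is the sum of the angle-dependent standard curvature difference and the distance term, and that it is subtracted from the raw curvature; all function names are illustrative:

```python
def dynamic_correction(theta: float, distance: float) -> float:
    """Assumed combination of the two fitted terms: the angle-dependent
    standard curvature difference plus the distance correction value."""
    index = (theta - 0.54) / 1.9e-4            # invert the linear angle fit
    c_stdif = 5.28e-7 * index ** 2 + 0.002     # standard curvature difference
    c_dis = 0.016 * distance + 0.006           # distance correction value
    return c_stdif + c_dis

def corrected_curvature(raw_curvature: float, theta: float, distance: float) -> float:
    """Raw curvature minus the dynamic correction (assumed form)."""
    return raw_curvature - dynamic_correction(theta, distance)
```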
S200, the obtained scene point cloud data are processed by the adaptive radius sampling module to extract a sufficient amount of valid point cloud data.
The adaptive radius sampling module processes the obtained scene point cloud data as follows. First, the Euclidean distances between the sampling center point and all points in the scene point cloud are computed; points whose distance exceeds a threshold R are marked as Q, and the remaining points keep their original values. Next, the points within the spherical sampling radius are sorted in ascending order of distance, and the nearest w points are extracted as sampling points. If the sampled points contain Q values, fewer than w valid points lie within the spherical sampling radius, and the sampling radius is gradually enlarged until at least w valid points are found. This guarantees enough valid points for feature learning and improves the humanoid robot model's understanding of complex scenes and its segmentation accuracy. As shown in fig. 10 and fig. 11, fig. 10 shows the sampling scheme of PointNet++ and fig. 11 shows the adaptive radius sampling method of the invention, where the green points represent point cloud data within the initial radius, the yellow points represent point cloud data duplicated by selecting the maximum value, and the red and blue points represent point cloud features learned from the initial radius.
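A minimal sketch of this sampling step, assuming 3D Euclidean distance and a simple multiplicative radius-growth rule; the growth factor is an illustrative choice not specified in the text:

```python
import numpy as np

def adaptive_radius_sample(points: np.ndarray, center: np.ndarray,
                           radius: float, w: int, grow: float = 1.2) -> np.ndarray:
    """Return the w nearest valid neighbours of `center`, enlarging the
    sampling radius until at least w points fall inside it."""
    if len(points) < w:
        raise ValueError("need at least w points in the scene point cloud")
    dist = np.linalg.norm(points - center, axis=1)
    while True:
        inside = np.where(dist <= radius)[0]          # valid points; the rest play the role of Q
        if len(inside) >= w:
            order = inside[np.argsort(dist[inside])]  # ascending distance
            return points[order[:w]]
        radius *= grow                                # gradually enlarge the radius
```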
In addition, the invention performs local sampling on the point cloud by defining a range and selecting the points within that range for local feature extraction, so as to capture the fine geometric structure of a small neighborhood. Each local point set is processed by a multi-layer perceptron (MLP) followed by local max pooling to obtain a local global feature. These local features are then regrouped and processed at a larger scale to produce feature representations over a wider extent, and the process is repeated until features for the entire point cloud have been obtained, strengthening the model's ability to understand and represent the overall point cloud structure.
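A sketch of one group-and-pool stage of this hierarchy, assuming a PointNet++-style set abstraction; the layer widths and the use of PyTorch are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LocalFeatureBlock(nn.Module):
    """One group-and-pool stage: a shared MLP on each local neighbourhood
    followed by max pooling, producing one feature vector per group."""
    def __init__(self, in_dim: int = 3, out_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim), nn.ReLU())

    def forward(self, groups: torch.Tensor) -> torch.Tensor:
        # groups: (num_groups, points_per_group, in_dim)
        feats = self.mlp(groups)            # per-point features
        return feats.max(dim=1).values      # local max pooling -> (num_groups, out_dim)
```

Repeating such a block on the regrouped outputs yields the progressively larger-scale features described above.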
S300, RPV-Net-based point cloud semantic feature extraction is performed on the extracted valid point cloud data, and the 3D bounding boxes and semantic labels of the point cloud data forming the building components in the scene are obtained.
The RPV-Net-based point cloud semantic feature extraction on the extracted valid point cloud data proceeds as follows. The input is a point cloud P ∈ R^{N×3}, i.e., point cloud data with three original dimensions. The points are randomly sampled and processed by a semantic segmentation backbone network to obtain point features carrying depth information; according to these features the points are clustered into M point subsets, and a vote is obtained for each subset. A pillar-based BEV method is used to estimate and classify the bounding-box shape, and the seeds are redefined, i.e., the seeds are obtained through the semantic segmentation network. Using the pillar (voxel-column) concept, boundary voxel features of the whole pillar are extracted and the obtained pillars are voted on; a Vote & Sampling & Grouping operation is then applied to the resulting point cloud features, with each seed independently generating a vote through the voting module. The votes are grouped into h groups and processed by the proposal module to generate the final proposal, which outputs the 3D bounding boxes and semantic labels. As shown in fig. 12, given an input point cloud with XYZ coordinates, votes are generated from the point cloud and grouped into M clusters, and f grids are divided for each of the M vote cluster centers. Representative points in the partition grid indicate possible object regions. The seed points around each representative point are then revisited, and the surrounding seed point features are aggregated onto the representative point. The proposal refinement and classification module fuses and processes the aggregated vote features and the revisited seed features to generate refined representative points and object semantic categories, which can readily be converted into three-dimensional object bounding boxes.
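Purely to make the data flow concrete, a highly simplified sketch of the vote-and-group step is given below; the offset-regression voting, the random choice of cluster centres, and the tensor shapes are illustrative assumptions and do not reproduce the actual RPV-Net backbone, pillar-based BEV head, or proposal module:

```python
import torch
import torch.nn as nn

class VoteAndGroup(nn.Module):
    """Seeds regress offsets toward object centres (votes); votes are then
    grouped by nearest cluster centre. Simplified illustration only."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.offset_head = nn.Linear(feat_dim, 3)   # one vote offset per seed

    def forward(self, seed_xyz: torch.Tensor, seed_feat: torch.Tensor, m: int):
        votes = seed_xyz + self.offset_head(seed_feat)        # (N, 3) votes
        centres = votes[torch.randperm(votes.shape[0])[:m]]   # crude choice of M cluster centres
        assign = torch.cdist(votes, centres).argmin(dim=1)    # nearest-centre grouping
        return votes, assign
```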
S400, the point cloud data after point cloud semantic feature extraction are processed by the channel attention module: average pooling is used to obtain the overall features of the point cloud, max pooling is used to extract discriminative individual features, and the overall and individual features are aggregated to obtain the global features of the point cloud.
The specific processing procedure is as follows:
After obtaining the high-dimensional features of the three-dimensional point cloud data, the method focuses on the influence of each feature channel on the point cloud information as a whole, implementing a channel attention mechanism for point cloud data. The mechanism highlights the information of key channels by analyzing the information association on each channel of the point cloud features. First, information reflecting the global characteristics of the point cloud is extracted; the attention weight of each channel is then learned from this information, yielding the point cloud features processed by the channel attention module. The channel attention module is shown in fig. 13.
The point cloud data features can be denoted F ∈ R^{B×N×1×C}, where B, N and C are the batch size, the number of points, and the number of feature channels, respectively. The invention applies pooling along the feature-channel dimension C to generate a point cloud channel attention map Mc ∈ R^{B×N×1×C}. The output feature Fc is obtained by multiplying the point cloud features F with Mc, which can be written as: Fc = F × Mc(F).
To generate the point cloud channel attention map Mc, the method applies both average pooling and max pooling along the point-number dimension N. Average pooling captures the global features of the point cloud, while max pooling helps extract more discriminative features. The two pooling operations aggregate the input point cloud features F into two different global point cloud features, F_avg and F_max. A multi-layer perceptron (MLP) with shared parameters is used to process the point cloud features: the channel dimension is first reduced by a coefficient k and then restored to the original size C, and a Sigmoid nonlinear activation yields the required channel attention weights. The point cloud channel attention map can be expressed as: Mc = s{MLP(Avg(F)) + MLP(Max(F))}, where s denotes the Sigmoid function.
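A compact sketch of this channel attention module, assuming the features are reshaped to (B, N, C), the pooling is taken over the point dimension N, and the module is implemented in PyTorch; the reduction coefficient k = 4 and the layer layout are illustrative:

```python
import torch
import torch.nn as nn

class PointChannelAttention(nn.Module):
    """Average- and max-pool point features over the point dimension,
    pass both through a shared MLP, add, and squash with a Sigmoid."""
    def __init__(self, channels: int, k: int = 4):
        super().__init__()
        self.shared_mlp = nn.Sequential(nn.Linear(channels, channels // k), nn.ReLU(),
                                        nn.Linear(channels // k, channels))

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, N, C) point features
        f_avg = f.mean(dim=1)                    # global feature via average pooling
        f_max = f.max(dim=1).values              # discriminative feature via max pooling
        mc = torch.sigmoid(self.shared_mlp(f_avg) + self.shared_mlp(f_max))  # (B, C) weights
        return f * mc.unsqueeze(1)               # re-weight each feature channel
```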
S500, a multimodal hybrid model consisting of a scene-understanding BERT pre-training model and an encoder-decoder model is constructed to process the acquired global features of the point cloud. The scene-understanding BERT pre-training model increases the diversity of the training data, while the encoder-decoder model interprets three-dimensional point cloud objects as three-dimensional ground objects together with their parameter information, improving the humanoid robot's understanding of the scene.
S510, increasing the diversity of the training data through the scene-understanding BERT pre-training model, specifically:
S511, the scene-understanding BERT pre-training model helps the generative adversarial network (GAN) model produce an expanded data set, thereby increasing the diversity of the training data;
S512, the standard generative adversarial network GAN model comprises a generator G and a discriminator D;
S513, training samples are set for the scene point cloud data set by adjusting the parameter settings of the generative adversarial network GAN model;
The goal of the generator G is to produce fake data as close to the real data as possible, while the goal of the discriminator D is to distinguish the real data from the fake data produced by the generator G; a balance coefficient γ is added to adjust the weight between the real-data and fake-data terms.
It should be noted that the scene-understanding BERT pre-training model can denoise the extracted global point cloud features, enlarge the variety of the data set, increase the diversity of the training data, and strengthen the sample set. The output of the scene-understanding BERT pre-training model serves as the data input of the generative adversarial network GAN, which in its standard form consists mainly of a generator G and a discriminator D. The scene-understanding BERT pre-training model helps the GAN model learn a wider data distribution, improving its generalization ability.
Here V(D, G) is the overall optimization objective, E denotes expectation, x is real data, z is noise sampled from a prior distribution p_z, G(z) is the data generated from the noise by the generator, and γ is a dynamically adjusted parameter that is tuned during training according to the behavior of the generator and the discriminator so as to maintain the balance between the two. If the value of log(1 − D(G(z))) is low, the generator's loss can be strengthened by increasing γ; the discriminator's accuracy is evaluated after each epoch and γ is adjusted accordingly.
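The objective V(D, G) itself is not reproduced above. Assuming that γ simply re-weights the fake-data term of the standard GAN objective, one plausible form is:

```latex
\min_{G}\max_{D} V(D,G) \;=\;
\mathbb{E}_{x\sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
\;+\;\gamma\,\mathbb{E}_{z\sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```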
S520, the processing procedure of the constructed encoder-decoder model is specifically as follows:
S521, the point cloud encoder takes a point cloud P ∈ R^{n×d} as input, where n is the number of input points and d is the feature dimension of each input point; the output of the point cloud encoder is a sequence of point features X = (x_1, x_2, …, x_m) ∈ R^{m×c}, where m is the number of output point features and c is the output point feature dimension;
S522, a multi-layer perceptron (MLP) maps the point features X to point tokens Y = (y_1, y_2, …, y_m) ∈ R^{m×c′}, where c′ is the dimension of the point tokens;
S523, 3D-MELL is a decoder-only interface that accepts a token sequence composed of text tokens and point tokens, i.e., a mixed token sequence denoted Z = (z_1, z_2, …, z_k) ∈ R^{k×c′}, where k is the total number of tokens. Using a self-attention mechanism, 3D-MELL can understand the context across the different token types, enabling it to generate responses based on both text and point cloud input. The output of 3D-MELL is a series of predicted tokens, with the prediction of the i-th token conditioned on all previous tokens Z_{<i} = (z_1, z_2, …, z_{i−1}).
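A minimal sketch of the decoder-only prediction loop, assuming greedy decoding over discrete token ids and a placeholder `decoder` model; 3D-MELL's actual interface, and the handling of continuous point tokens, are not specified here:

```python
import torch

@torch.no_grad()
def generate(decoder, token_ids: torch.Tensor, steps: int) -> torch.Tensor:
    """Greedy autoregressive decoding: the i-th prediction is conditioned
    on all previous tokens Z_<i of the mixed text/point sequence."""
    seq = token_ids                                   # (1, k) mixed token ids
    for _ in range(steps):
        logits = decoder(seq)                         # placeholder decoder-only model
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, next_id], dim=1)        # append and condition on it next step
    return seq
```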
Fig. 14 is a schematic diagram of the encoder-decoder model processing procedure; the point cloud encoder comprises a self-attention module, a feed-forward neural network module, and a geometric-information-aware adapter module.
The constructed encoder-decoder model can interpret the intelligently extracted three-dimensional point cloud objects as three-dimensional ground objects such as beams and columns, arches, tablets, pavements, flower beds, and trees, together with information such as their size, age, and material. The multimodal hybrid model can therefore intelligently recognize point cloud, image, and text information.
In summary, through the above embodiments the humanoid robot gains strong environmental perception and intelligent understanding capabilities, allowing it to work more effectively and autonomously in a variety of application scenarios. This will not only advance humanoid robot technology but also contribute to the construction of a future intelligent society.
The following is an example.
The lidar scanning system is mounted behind the eyes of the robot head. The lidar scanning system acquires three-dimensional information of the components to be assembled, producing high-precision three-dimensional scene point cloud data. The adaptive radius sampling module processes the obtained scene point cloud data, RPV-Net-based point cloud semantic feature extraction is performed on the extracted valid point cloud data, and the 3D bounding boxes and semantic labels of the point cloud data forming the building components in the scene are obtained, as shown in figs. 15 to 17. Fig. 15 is the 3D bounding box estimation of the overall point cloud acquired by the lidar system; fig. 16 is the 3D bounding box estimation of the acquired partial extraction; fig. 17 is the 3D bounding box estimation of the acquired accurate local extraction. Understanding and communication are then carried out in combination with a large language model.
The number of devices and the scale of processing described here are intended to simplify the description of the present invention. Applications, modifications, and variations of the present invention will be readily apparent to those skilled in the art.
Although embodiments of the present invention have been disclosed above, the invention is not limited to the details and embodiments shown and described; it is well suited to various fields of use that will be readily apparent to those skilled in the art, and accordingly the invention is not limited to the specific details and illustrations shown and described herein, without departing from the general concepts defined in the claims and their equivalents.