CN112347923A

CN112347923A - A Roadside Pedestrian Trajectory Prediction Algorithm Based on Generative Adversarial Network

Info

Publication number: CN112347923A
Application number: CN202011229272.0A
Authority: CN
Inventors: 杨彪; 何才臻; 徐黎明; 闫国成; 吕继东; 陈阳
Original assignee: Jiangsu China Israel Industrial Technology Research Institute; Changzhou University
Current assignee: Jiangsu China Israel Industrial Technology Research Institute; Changzhou University
Priority date: 2020-11-06
Filing date: 2020-11-06
Publication date: 2021-02-09

Abstract

The invention relates to a roadside pedestrian trajectory generation algorithm based on a generative adversarial network, which generates multi-modal predicted trajectories using a social attention mechanism and pedestrian trajectory latent variables. Through adversarial training of the trajectory generator and the discriminator, the capabilities of both are continuously optimized and the accuracy of the trajectories produced by the generator is improved. A social attention mechanism based on head orientation is provided: the head orientation of a pedestrian is taken from the pedestrian's last velocity direction, the cosine of the included angle between pedestrians is calculated from the head orientation information, soft and hard attention mechanisms use the calculated angle information to refine the output of the social attention mechanism, and the output is aggregated through a max-pooling layer. A new latent variable generation method is proposed: two feedforward neural networks learn latent variables from the pedestrians' historical trajectories and observed trajectories respectively, the input of the latent variable generator comprises position, velocity and acceleration, and three latent variable distributions are generated from the three types of input respectively.

Description

Roadside pedestrian trajectory prediction algorithm based on a generative adversarial network

Technical Field

The invention relates to the technical field of autonomous driving, in particular to pedestrian trajectory prediction, and provides a roadside pedestrian trajectory prediction algorithm based on a generative adversarial network.

Background

With the continuous development of autonomous robot navigation and autonomous driving, driverless technology has attracted wide attention and has promising applications. Driverless vehicles can make daily life more convenient, but while driving they must monitor the trajectories of pedestrians on the road and predict their future trajectories in order to avoid collisions. To predict pedestrian motion well, a driverless vehicle needs to process the observed pedestrian trajectory data, learn the regularities of pedestrian motion, and predict the pedestrian's next motion state from them. The difficulty of accurately predicting pedestrian trajectories stems from the complexity of human behavior, which is driven both by the pedestrian's own intentions and by a variety of external stimuli. Pedestrian motion may be driven by the pedestrian's own goal, by interactions with surrounding agents, by social relationships, social rules and norms, or by the topology, geometry and semantics of the environment. Most of these factors are not directly observable and must be inferred from complex motion patterns or modeled from contextual information. The key to accurate pedestrian trajectory prediction is therefore how to let the driverless vehicle learn these latent motion patterns.

Because pedestrian behavior is random, neither machines nor humans can predict a pedestrian's future trajectory exactly. A pedestrian's trajectory is influenced by the surrounding environment, for example through person-to-person and person-to-object interactions, and these influences are latent and hard to describe explicitly. However, a pedestrian's future trajectory is always influenced by the motion of the people and objects in front of the pedestrian, and exploiting this common knowledge helps to model the pedestrian's social interactions and thus to predict the future trajectory well.

Pedestrian motion patterns are complex and diverse, and complex pedestrian motion is difficult to describe with a single dynamic model. A common way to model the general motion of a maneuvering target is to define and fuse several typical motion modes, each described by a different dynamic model. The modes may be linear movement, turning maneuvers or sudden acceleration, and over time they form a sequence capable of describing complex motion behavior. The diversity of pedestrian motion patterns must therefore also be considered in pedestrian trajectory prediction.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: in the prior art, pedestrian motion patterns are complex and diverse, and complex pedestrian motion is difficult to describe with a dynamic model; to address this, a roadside pedestrian trajectory prediction algorithm based on a generative adversarial network is provided.

The technical solution adopted by the invention to solve this problem is a roadside pedestrian trajectory prediction algorithm based on a generative adversarial network, comprising the following steps:

S10: encoding the input trajectory using an encoder;

S20: calculating the social attention of the pedestrian using the pedestrian's head orientation;

S30: applying a latent variable predictor to generate a predictable latent variable distribution;

S40: generating the predicted future trajectory of the pedestrian;

S50: optimizing the pedestrian trajectory generated by the generator using a discriminator;

step S30 includes the following steps:

S31: designing a latent variable predictor;

S32: predicting the latent variable distribution of the pedestrian using the latent variable predictor.

Further, in step S31, the latent variable predictor consists of two feedforward neural networks defined as follows:

where Ψ(·) and

are feedforward neural networks,

and

are the parameters of the two feedforward neural networks respectively,

and

is the k-th type input of the latent variable predictor.

Further, in step S32, k = 1, 2, 3 denotes the position, velocity and acceleration of the pedestrian respectively: the position reveals the layout of the underlying scene, the velocity reflects the motion patterns of different pedestrians, and the acceleration indicates the intensity of the pedestrian's motion; the latent variable predictor estimates three latent distributions from these three inputs; Gaussian random noise is used to produce multi-modal output, and in the training stage the three latent variable distributions are fused with the Gaussian random noise to form the final latent variable distribution parameters;

in the testing stage, the latent variable predictor predicts the latent variable distribution from the observed trajectory of the pedestrian: it takes the pedestrian's position, velocity and acceleration as input, predicts a latent variable distribution for each of the three types of input, and combines the three latent variables with Gaussian random noise to form the final latent variable, which is output to the trajectory generator;
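As a non-limiting illustration, the three types of input (position, velocity, acceleration) might be derived from an observed trajectory by finite differences. The sketch below is an assumption and not the method fixed by the specification: the function names, the zero padding of the first time step and the sampling interval `dt` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def finite_difference(seq: List[Point], dt: float) -> List[Point]:
    """Backward difference of a 2-D sequence.

    The first step has no predecessor, so it is padded with zeros
    (an assumption made only to keep all three inputs the same length).
    """
    out: List[Point] = [(0.0, 0.0)]
    for (x0, y0), (x1, y1) in zip(seq, seq[1:]):
        out.append(((x1 - x0) / dt, (y1 - y0) / dt))
    return out


def kinematic_inputs(positions: List[Point], dt: float = 0.4):
    """Return the k = 1, 2, 3 inputs: position, velocity, acceleration.

    dt = 0.4 s is a hypothetical sampling interval, not a value from the text.
    """
    velocity = finite_difference(positions, dt)
    acceleration = finite_difference(velocity, dt)
    return positions, velocity, acceleration
```

Because of the zero padding, the first velocity entry and the first two acceleration entries are artifacts of the padding rather than measured quantities.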

in the training process, the latent variable loss function measures the difference between the latent variable distribution of the observed trajectory and the latent variable distribution of the real trajectory, and the KL divergence is used to calculate this error according to the following formula:

where

and

respectively denote the latent variable distribution of the observed trajectory and the latent variable distribution of the real trajectory.
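As a non-limiting illustration of such a KL-divergence term, the sketch below evaluates the closed-form KL divergence between two diagonal Gaussian distributions. The Gaussian form of the latent distributions, the direction of the divergence and the function name are assumptions; the text above only states that a KL divergence is used.

```python
import math
from typing import Sequence


def kl_diag_gaussians(mu_p: Sequence[float], sigma_p: Sequence[float],
                      mu_q: Sequence[float], sigma_q: Sequence[float]) -> float:
    """KL(P || Q) for diagonal Gaussians P = N(mu_p, sigma_p^2), Q = N(mu_q, sigma_q^2).

    Per dimension: log(sq / sp) + (sp^2 + (mp - mq)^2) / (2 * sq^2) - 1/2.
    Which distribution plays the role of P (observed vs. real trajectory)
    is an assumption left to the caller.
    """
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        total += math.log(sq / sp) + (sp * sp + (mp - mq) ** 2) / (2.0 * sq * sq) - 0.5
    return total
```

The divergence is zero when both distributions coincide and grows as their means or spreads diverge, which is the behavior the latent variable loss relies on.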

Further, step S10 specifically includes the following steps:

S11: processing the input trajectory data: the input trajectory is a time series of trajectory points

where

is the position coordinate of target i at time t; the position coordinates of each trajectory at the different time steps are fed into the encoding network;

S12: converting the two-dimensional position information into a fixed-length multi-dimensional vector using a single-layer multilayer perceptron

The multilayer perceptron is defined as follows:

where φ(·) is a multilayer perceptron using the ReLU nonlinear activation function and W_ee is the parameter of the multilayer perceptron;
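As a non-limiting illustration of step S12, the sketch below embeds one two-dimensional position into a fixed-length vector with a single linear layer followed by ReLU. The function names, the explicit bias term and the example weights are hypothetical; only the use of a single-layer perceptron with ReLU is taken from the text.

```python
from typing import List, Sequence


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def embed_position(xy: Sequence[float],
                   weights: Sequence[Sequence[float]],
                   bias: Sequence[float]) -> List[float]:
    """Map (x, y) to a len(weights)-dimensional vector: ReLU(W @ xy + b).

    `weights` stands in for the parameter W_ee; `bias` is an assumed extra term.
    """
    return [relu(sum(w * c for w, c in zip(row, xy)) + b)
            for row, b in zip(weights, bias)]


# Example with hand-picked (hypothetical) parameters: 2-D input, 3-D embedding.
example = embed_position((2.0, 3.0),
                         weights=[[1.0, 0.0], [0.0, -1.0], [1.0, 1.0]],
                         bias=[0.0, 0.0, -1.0])
```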

S13: feeding the multi-dimensional vector into an encoder based on a long short-term memory (LSTM) network to generate the hidden state of the pedestrian's motion

The encoder LSTM is defined as follows:

where LSTM(·) is a long short-term memory network and W_encoder is the parameter of the encoder LSTM, which can be shared among all pedestrians in the same scene.
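As a non-limiting illustration of step S13, the sketch below runs a standard LSTM cell over a sequence of embedded positions and returns the last hidden state. The gate layout, the parameter dictionary `params` and the zero initial state are assumptions; the text does not specify the internal structure of the encoder beyond it being an LSTM.

```python
import math
from typing import Dict, List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(W: Sequence[Sequence[float]], x: Sequence[float],
           b: Sequence[float]) -> List[float]:
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def lstm_step(x: List[float], h: List[float], c: List[float],
              params: Dict[str, list]) -> Tuple[List[float], List[float]]:
    """One standard LSTM step; every gate sees the concatenation [x, h]."""
    z = list(x) + list(h)
    i = [sigmoid(v) for v in affine(params["W_i"], z, params["b_i"])]    # input gate
    f = [sigmoid(v) for v in affine(params["W_f"], z, params["b_f"])]    # forget gate
    o = [sigmoid(v) for v in affine(params["W_o"], z, params["b_o"])]    # output gate
    g = [math.tanh(v) for v in affine(params["W_g"], z, params["b_g"])]  # candidate
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def encode(embeddings: Sequence[List[float]], params: Dict[str, list],
           hidden_size: int) -> List[float]:
    """Run the LSTM over the embedded trajectory; the same params serve all pedestrians."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for e in embeddings:
        h, c = lstm_step(e, h, c, params)
    return h
```

Passing the same `params` to `encode` for every pedestrian in a scene corresponds to the parameter sharing described above.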

Further, step S20 specifically includes the following steps:

S21: calculating the azimuth angle between pedestrians: the velocity at the pedestrian's last position is taken as the pedestrian's future velocity, and its direction is taken as both the head orientation and the direction of motion; the cosine of the azimuth angle between pedestrians is then calculated from the head orientations of all pedestrians as follows:

where n is the number of pedestrians in the same scene and β_ij denotes the included angle between pedestrian i and pedestrian j;
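As a non-limiting illustration of step S21, the sketch below builds an n × n matrix of azimuth cosines. It assumes that the angle for the pair (i, j) is measured between the heading of pedestrian i (its last displacement) and the line from pedestrian i to pedestrian j; this interpretation, the function names and the convention of returning 0 for undefined angles are assumptions, since the formula itself is not reproduced in the text.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def heading(traj: Sequence[Point]) -> Point:
    """Last displacement of a trajectory, used as head orientation."""
    (x0, y0), (x1, y1) = traj[-2], traj[-1]
    return (x1 - x0, y1 - y0)


def cos_azimuth_matrix(trajs: Sequence[Sequence[Point]]) -> List[List[float]]:
    """cos[i][j]: cosine between the heading of i and the vector from i to j.

    Entries stay 0.0 on the diagonal and whenever a vector has zero length.
    """
    n = len(trajs)
    cos = [[0.0] * n for _ in range(n)]
    for i in range(n):
        hx, hy = heading(trajs[i])
        xi, yi = trajs[i][-1]
        for j in range(n):
            if i == j:
                continue
            dx, dy = trajs[j][-1][0] - xi, trajs[j][-1][1] - yi
            norm = math.hypot(hx, hy) * math.hypot(dx, dy)
            if norm > 0.0:
                cos[i][j] = (hx * dx + hy * dy) / norm
    return cos
```

Under this assumption a pedestrian directly ahead yields a cosine of 1 and a pedestrian directly behind yields -1.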

S22: designing the attention mechanism: a soft attention mechanism and a hard attention mechanism are designed according to the cosine of the azimuth angle between pedestrians; the effect of one pedestrian on another decreases as the azimuth cosine between them increases; the hard attention mechanism is represented by a matrix H_A with the same shape as cos(β), and each element h_ij of H_A is set to 0 or 1: when the cosine of the azimuth angle between the pedestrians is greater than the preset threshold 0.2, the corresponding attention weight h_ij is 1, and when it is less than the preset threshold 0.2, the corresponding attention weight h_ij is 0; the soft attention mechanism and the hard attention mechanism calculate attention weights through thresholds; the soft attention mechanism adaptively computes the correlation between pedestrians, and the weight S_A of the soft attention mechanism is calculated as follows:

where δ(·) denotes the sigmoid activation function,

represents a 1 × 1 convolutional layer;

the soft and hard attention mechanisms act on the output of the second multilayer perceptron and are used to refine it, and the result is aggregated through a max-pooling layer to obtain the output
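As a non-limiting illustration of step S22, the sketch below derives a hard mask with the threshold 0.2, a soft weight map, and a max-pooled social feature. Modeling the 1 × 1 convolution on a single-channel cosine map as a scalar affine map (`w`, `b`), combining the two attentions by element-wise product, and the layout of `features` are all assumptions; the text does not fix these details.

```python
import math
from typing import List, Sequence

THRESHOLD = 0.2  # preset threshold stated in the text


def hard_attention(cos: Sequence[Sequence[float]]) -> List[List[float]]:
    """h_ij = 1 if cos > 0.2 else 0 (equality is unspecified; treated as 0 here)."""
    return [[1.0 if c > THRESHOLD else 0.0 for c in row] for row in cos]


def soft_attention(cos: Sequence[Sequence[float]],
                   w: float = 1.0, b: float = 0.0) -> List[List[float]]:
    """sigmoid(w * cos + b): a 1x1 convolution on one channel reduces to this."""
    return [[1.0 / (1.0 + math.exp(-(w * c + b))) for c in row] for row in cos]


def attend_and_pool(features: Sequence[Sequence[Sequence[float]]],
                    cos: Sequence[Sequence[float]],
                    w: float = 1.0, b: float = 0.0) -> List[List[float]]:
    """features[i][j] is the feature of neighbor j as seen by pedestrian i.

    Each feature is scaled by h_ij * s_ij, then max-pooled over neighbors j.
    """
    H, S = hard_attention(cos), soft_attention(cos, w, b)
    pooled = []
    for i, row in enumerate(features):
        weighted = [[H[i][j] * S[i][j] * v for v in feat] for j, feat in enumerate(row)]
        pooled.append([max(col) for col in zip(*weighted)])
    return pooled
```

Neighbors whose cosine does not exceed the threshold contribute a zero vector, so with non-negative features (for example after ReLU) they never win the max-pooling.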

Further, step S40 specifically includes the following steps:

S41: concatenating the output of the social attention module

the output of the latent variable predictor

and the hidden state of the pedestrian's motion

into a single vector;

S42: inputting the concatenated result into a decoder based on a long short-term memory (LSTM) network to obtain a new trajectory hidden state that fuses the various kinds of information

The decoder LSTM is defined as follows:

where LSTM(·) is a long short-term memory network and W_decoder is the parameter of the decoder LSTM, which can be shared among all pedestrians in the same scene;

S43: decoding the new hidden state with a multilayer perceptron to obtain the future trajectory coordinates of the pedestrian; the multilayer perceptron is defined as follows:

where γ(·) is a multilayer perceptron using the ReLU nonlinear activation function,

is the future position coordinate of the pedestrian, and the output prediction is a series of position coordinates

where T_obs is the length of the predicted trajectory; the invention adopts multi-modal output: the trajectory generator outputs m trajectories at a time, and a 2-norm loss function is used to calculate the deviation between the m trajectories and the ground truth, expressed as follows:

where

is the real trajectory of the pedestrian,

is the m-th future trajectory of the pedestrian predicted by the generator, and m = 20 in the present invention.
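As a non-limiting illustration of a 2-norm loss over m candidate trajectories, the sketch below scores each candidate by its summed point-wise Euclidean distance to the real trajectory and keeps the best one. Taking the minimum over the m candidates is an assumption borrowed from common multi-modal trajectory predictors; the text only states that a 2-norm loss measures the deviation between the m trajectories and the ground truth.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def l2_distance(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Sum over time steps of the Euclidean distance between prediction and truth."""
    return sum(math.hypot(px - tx, py - ty)
               for (px, py), (tx, ty) in zip(pred, truth))


def best_of_m_l2_loss(candidates: Sequence[Sequence[Point]],
                      truth: Sequence[Point]) -> float:
    """Loss of the closest of the m candidates (assumed aggregation over m)."""
    return min(l2_distance(c, truth) for c in candidates)
```

With m = 20 as stated above, `candidates` would hold twenty generated trajectories for the same pedestrian.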

Further, step S50 specifically includes the following steps:

S51: inputting the trajectory generated by the generator and the real trajectory of the pedestrian into the discriminator;

S52: the discriminator determines whether the input trajectory was produced by the generator or is a real trajectory: the discriminator encodes the real trajectory and the generated trajectory with an encoder based on a long short-term memory (LSTM) network, and a multilayer perceptron is applied to the hidden state output by the encoder to obtain a classification score; ideally, the discriminator learns the social rules of pedestrian trajectories, and trajectories that do not conform to these rules are judged to be fake;

the loss function of the adversarial training is expressed as follows:

where D is the discriminator, G is the generator, z is the latent variable distribution parameter, x is the trajectory data,

is the k-th input (position, velocity, acceleration) of the latent variable predictor for the observed trajectory; through the adversarial training of the generator and the discriminator, the generator eventually produces samples that resemble the training set and conform to social rules; because the generator learns a probability distribution similar to that of the training set, each sampling yields a different plausible sample, so the generator can be used to predict multiple possible futures;
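As a non-limiting illustration of the adversarial objective, the sketch below computes the discriminator and generator losses of a standard generative adversarial network from discriminator scores in (0, 1). The unconditional form, the non-saturating generator loss and the small constant `EPS` are assumptions; the conditional form used by the invention is not reproduced in the text.

```python
import math
from typing import Sequence

EPS = 1e-12  # numerical guard against log(0); an implementation assumption


def discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """-( mean log D(x) + mean log(1 - D(G(z))) ): D wants real -> 1, fake -> 0."""
    real_term = sum(math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss(d_fake: Sequence[float]) -> float:
    """Non-saturating variant: -mean log D(G(z)); G wants fake -> 1."""
    return -sum(math.log(p + EPS) for p in d_fake) / len(d_fake)
```

Alternating updates that lower `discriminator_loss` and `generator_loss` in turn correspond to the iterative adversarial training of the two networks.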

the total loss function consists of three parts: the adversarial training loss, the KL divergence of the latent variable distributions, and the deviation between the predicted value and the ground truth; the weighted total loss function is defined as follows:

where α and β are each set to a number between 1 and 10, with the specific values obtained by cross-validation on a benchmark dataset; during training, the generator and the discriminator are trained iteratively, the batch size is set to 64, training runs for 600 epochs, the learning rate is set to 0.001, and the Adam optimizer is used to optimize the parameters.
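As a non-limiting illustration, the stated hyperparameters and a weighted total loss might be collected as in the sketch below. Which of the three terms α and β multiply is an assumption, because the weighting formula is not reproduced in the text; the class and field names and the default values of `alpha` and `beta` are hypothetical placeholders within the stated range of 1 to 10.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    batch_size: int = 64          # stated in the text
    epochs: int = 600             # stated in the text
    learning_rate: float = 1e-3   # stated in the text
    optimizer: str = "adam"       # stated in the text
    alpha: float = 1.0            # placeholder; to be chosen by cross-validation
    beta: float = 1.0             # placeholder; to be chosen by cross-validation


def total_loss(l_adv: float, l_kl: float, l_l2: float, cfg: TrainConfig) -> float:
    """Assumed weighting: L = L_adv + alpha * L_KL + beta * L_L2."""
    return l_adv + cfg.alpha * l_kl + cfg.beta * l_l2
```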

The beneficial effects of the roadside pedestrian trajectory prediction algorithm based on a generative adversarial network are as follows: (1) a social attention module is provided, which exploits the correlation between head orientation and trajectory prediction, and whose attention mechanism improves the ability of the social pooling layer to capture social interactions in different scenes;

(2) a new latent variable predictor is provided, which estimates knowledge-rich latent variables so that trajectories are predicted better; the inputs of the predictor are extracted from the trajectory data alone, so only a small computational overhead is added;

(3) the social attention module and the latent variable predictor are embedded into a generative adversarial network framework to generate multi-modal outputs that are socially acceptable.

Drawings

The invention is further illustrated with reference to the following figures and examples.

FIG. 1 is a schematic diagram of the adversarial training strategy proposed in the present invention;

FIG. 2 is a schematic diagram of the generator proposed in the present invention;

FIG. 3 is a schematic diagram of the latent variable predictor proposed in the present invention;

FIG. 4 is a schematic diagram of the discriminator proposed in the present invention.

Detailed Description

The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic views illustrating only the basic structure of the present invention in a schematic manner, and thus show only the constitution related to the present invention.

As shown in FIGS. 1-4, the roadside pedestrian trajectory prediction algorithm based on a generative adversarial network includes the following steps:

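S10: encoding the input trajectory using an encoder;

S20: calculating the social attention of the pedestrian using the pedestrian's head orientation;

S30: applying a latent variable predictor to generate a predictable latent variable distribution;

S40: generating the predicted future trajectory of the pedestrian;

S50: optimizing the pedestrian trajectory generated by the generator using a discriminator;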

Further, step S20 specifically includes the following steps:

S21: calculating the azimuth angle between pedestrians: the invention is based on the observation that the future trajectory of a pedestrian is always influenced by the crowd in front and not by the crowd behind; the velocity at the pedestrian's last position is taken as the pedestrian's future velocity, and its direction is taken as both the head orientation and the direction of motion; the cosine of the azimuth angle between pedestrians is then calculated from the head orientations of all pedestrians as follows:

S22: designing the attention mechanism: a soft attention mechanism and a hard attention mechanism are designed according to the cosine of the azimuth angle between pedestrians; the effect of one pedestrian on another decreases as the azimuth cosine between them increases; the hard attention mechanism is represented by a matrix H_A with the same shape as cos(β), and each element h_ij of H_A is set to 0 or 1: when the cosine of the azimuth angle between the pedestrians is greater than the preset threshold 0.2, the corresponding attention weight h_ij is 1, and when it is less than the preset threshold 0.2, the corresponding attention weight h_ij is 0; the soft attention mechanism and the hard attention mechanism calculate attention weights through thresholds; the soft attention mechanism adaptively computes the correlation between pedestrians, and the weight S_A of the soft attention mechanism is calculated as follows:

where δ(·) denotes the sigmoid activation function,

represents a 1 × 1 convolutional layer;

Step S30 includes the following steps:

S31: designing a latent variable predictor;

In step S31, the invention applies a latent variable predictor to generate a predictable latent variable distribution, that is, the latent variable distribution parameters are predicted in a data-driven manner; in the training stage, the latent variable predictor predicts the latent variable distribution parameters from both the observed trajectory and the real trajectory of the pedestrian, so that the underlying motion patterns can be learned; the latent variable predictor consists of two feedforward neural networks defined as follows:

where Ψ(·) and

are feedforward neural networks,

and

are the parameters of the two feedforward neural networks respectively,

and

is the k-th type input of the latent variable predictor.

Further, step S40 specifically includes the following steps:

S41: the interactions between pedestrians are obtained by the social attention module and the latent variable distribution of the pedestrian's motion is obtained by the latent variable predictor; the output of the social attention module

the output of the latent variable predictor

and the hidden state of the pedestrian's motion

are concatenated;

The decoder LSTM is defined as follows:

where

is the real trajectory of the pedestrian,

is the m-th future trajectory of the pedestrian predicted by the generator, and m is set to 20 in the present invention.

Further, step S50 specifically includes the following steps:

the loss function of the adversarial training is expressed as follows:

The invention provides a roadside pedestrian trajectory generation algorithm based on a generative adversarial network, which generates multi-modal predicted trajectories using a social attention mechanism and pedestrian trajectory latent variables. Through adversarial training of the trajectory generator and the discriminator, the capabilities of both are continuously optimized and the accuracy of the trajectories produced by the generator is improved. The invention provides a social attention mechanism based on head orientation: the head orientation of a pedestrian is taken from the pedestrian's last velocity direction, the cosine of the included angle between pedestrians is calculated from the head orientation information, the calculated angle information is used to refine the output of the social attention mechanism, and the output is aggregated through a max-pooling layer. The invention also provides a new latent variable generation method: two feedforward neural networks learn latent variables from the pedestrians' historical trajectories and observed trajectories respectively, the input of the latent variable generator comprises position, velocity and acceleration, and three latent variable distributions are generated from the three types of input respectively; the three latent variable distributions are combined with Gaussian random noise to generate multi-modal output and to retain the ability to handle uncertain future inputs.

In light of the foregoing description of the preferred embodiment of the present invention, those skilled in the art can make many modifications and variations without departing from the spirit and scope of the invention. The technical scope of the present invention is not limited to the content of the specification and must be determined according to the scope of the claims.

Claims

1. A roadside pedestrian trajectory prediction algorithm based on a generative adversarial network, characterized by comprising the following steps:

S10: encoding the input trajectory using an encoder;

S20: calculating the social attention of the pedestrian using the pedestrian's head orientation;

S30: applying a latent variable predictor to generate a predictable latent variable distribution;

S40: generating the predicted future trajectory of the pedestrian;

S50: optimizing the pedestrian trajectory generated by the generator using a discriminator;

step S30 includes the following steps:

S31: designing a latent variable predictor;

S32: predicting the latent variable distribution of the pedestrian using the latent variable predictor.

2. The roadside pedestrian trajectory prediction algorithm based on a generative adversarial network according to claim 1, characterized in that, in step S31, the latent variable predictor consists of two feedforward neural networks defined as follows:

where Ψ(·) and

are feedforward neural networks,

and

are the parameters of the two feedforward neural networks respectively,

and

is the k-th type input of the latent variable predictor.

3. The roadside pedestrian trajectory prediction algorithm based on a generative adversarial network according to claim 2, characterized in that, in step S32, k = 1, 2, 3 denotes the position, velocity and acceleration of the pedestrian respectively: the position reveals the layout of the underlying scene, the velocity reflects the motion patterns of different pedestrians, and the acceleration indicates the intensity of the pedestrian's motion; the latent variable predictor estimates three latent distributions from these three inputs; Gaussian random noise is used to produce multi-modal output, and in the training stage the three latent variable distributions are fused with the Gaussian random noise to form the final latent variable distribution parameters;

in the testing stage, the latent variable predictor predicts the latent variable distribution from the observed trajectory of the pedestrian: it takes the pedestrian's position, velocity and acceleration as input, predicts a latent variable distribution for each of the three types of input, and combines the three latent variables with Gaussian random noise to form the final latent variable, which is output to the trajectory generator;

in the training process, the latent variable loss function measures the difference between the latent variable distribution of the observed trajectory and the latent variable distribution of the real trajectory, and the KL divergence is used to calculate this error according to the following formula:

where

and

respectively denote the latent variable distribution of the observed trajectory and the latent variable distribution of the real trajectory.

4. The roadside pedestrian trajectory prediction algorithm based on a generative adversarial network according to claim 1, characterized in that step S10 specifically comprises the following steps:

S11: processing the input trajectory data: the input trajectory is a time series of trajectory points

where

is the position coordinate of target i at time t; the position coordinates of each trajectory at the different time steps are fed into the encoding network;

S12: converting the two-dimensional position information into a fixed-length multi-dimensional vector using a single-layer multilayer perceptron

The multilayer perceptron is defined as follows:

where φ(·) is a multilayer perceptron using the ReLU nonlinear activation function and W_ee is the parameter of the multilayer perceptron;

S13: feeding the multi-dimensional vector into an encoder based on a long short-term memory (LSTM) network to generate the hidden state of the pedestrian's motion

The encoder LSTM is defined as follows:

where LSTM(·) is a long short-term memory network and W_encoder is the parameter of the encoder LSTM, which can be shared among all pedestrians in the same scene.

5. The roadside pedestrian trajectory prediction algorithm based on a generative adversarial network according to claim 1, characterized in that step S20 specifically comprises the following steps:

S21: calculating the azimuth angle between pedestrians: the velocity at the pedestrian's last position is taken as the pedestrian's future velocity, and its direction is taken as both the head orientation and the direction of motion; the cosine of the azimuth angle between pedestrians is then calculated from the head orientations of all pedestrians as follows:

where n is the number of pedestrians in the same scene and β_ij denotes the included angle between pedestrian i and pedestrian j;

S22: designing the attention mechanism: a soft attention mechanism and a hard attention mechanism are designed according to the cosine of the azimuth angle between pedestrians; the influence of one pedestrian on another decreases as the cosine of the azimuth angle between them increases; the hard attention mechanism is represented by a matrix H_A with the same shape as cos(β), and each element h_ij of H_A is set to 0 or 1: when the cosine of the azimuth angle between the pedestrians is greater than the preset threshold 0.2, the corresponding attention weight h_ij is 1, and when it is less than the preset threshold 0.2, the corresponding attention weight h_ij is 0; the soft attention mechanism and the hard attention mechanism calculate different attention weights through the threshold; the soft attention mechanism adaptively computes the correlation between pedestrians, and the weight S_A of the soft attention mechanism is calculated as follows:

where δ(·) denotes the sigmoid activation function,

represents a 1 × 1 convolutional layer;

the soft and hard attention mechanisms act on the output of the second multilayer perceptron and are used to refine it, and the result is aggregated through a max-pooling layer to obtain the output

6. The roadside pedestrian trajectory prediction algorithm based on a generative adversarial network according to claim 1, characterized in that step S40 specifically comprises the following steps:

S41: concatenating the output of the social attention module

the output of the latent variable predictor

and the hidden state of the pedestrian's motion

into a single vector;

S42: inputting the concatenated result into a decoder based on a long short-term memory (LSTM) network to obtain a new trajectory hidden state that fuses the various kinds of information

The decoder LSTM is defined as follows:

where LSTM(·) is a long short-term memory network and W_decoder is the parameter of the decoder LSTM, which can be shared among all pedestrians in the same scene;

S43: decoding the new hidden state with a multilayer perceptron to obtain the future trajectory coordinates of the pedestrian; the multilayer perceptron is defined as follows:

where γ(·) is a multilayer perceptron using the ReLU nonlinear activation function,

is the future position coordinate of the pedestrian, and the output prediction is a series of position coordinates

where T_obs is the length of the predicted trajectory; the invention adopts multi-modal output: the trajectory generator outputs m trajectories at a time, and a 2-norm loss function is used to calculate the deviation between the m trajectories and the ground truth, expressed as follows:

where

is the real trajectory of the pedestrian,

is the m-th future trajectory of the pedestrian predicted by the generator, and m = 20 in the present invention.

7. The roadside pedestrian trajectory prediction algorithm based on a generative adversarial network according to claim 1, characterized in that step S50 specifically comprises the following steps:

S51: inputting the trajectory generated by the generator and the real trajectory of the pedestrian into the discriminator;

S52: the discriminator determines whether the input trajectory was produced by the generator or is a real trajectory: the discriminator encodes the real trajectory and the generated trajectory with an encoder based on a long short-term memory (LSTM) network, and a multilayer perceptron is applied to the hidden state output by the encoder to obtain a classification score; ideally, the discriminator learns the social rules of pedestrian trajectories, and trajectories that do not conform to these rules are judged to be fake;

the loss function of the adversarial training is expressed as follows:

is the k-th input (position, velocity, acceleration) of the latent variable predictor for the observed trajectory; through the adversarial training of the generator and the discriminator, the generator eventually produces samples that resemble the training set and conform to social rules; because the generator learns a probability distribution similar to that of the training set, each sampling yields a different plausible sample, so the generator can be used to predict multiple possible futures;

the total loss function consists of three parts: the adversarial training loss, the KL divergence of the latent variable distributions, and the deviation between the predicted value and the ground truth; the weighted total loss function is defined as follows:

where α and β are each set to a number between 1 and 10, with the specific values obtained by cross-validation on a benchmark dataset; during training, the generator and the discriminator are trained iteratively, the batch size is set to 64, training runs for 600 epochs, the learning rate is set to 0.001, and the Adam optimizer is used to optimize the parameters.