
CN119252066B - A vehicle trajectory optimization method based on imitation learning and related device - Google Patents

A vehicle trajectory optimization method based on imitation learning and related device

Info

Publication number
CN119252066B
CN119252066B (application CN202411780013.5A)
Authority
CN
China
Prior art keywords
vehicle
target
sample
flow direction
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411780013.5A
Other languages
Chinese (zh)
Other versions
CN119252066A (en)
Inventor
周俊杰
吴劲峰
吴文浩
虞霄璐
陈瑞生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Supcon Information Industry Co Ltd
Original Assignee
Zhejiang Supcon Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Supcon Information Industry Co Ltd filed Critical Zhejiang Supcon Information Industry Co Ltd
Priority to CN202411780013.5A priority Critical patent/CN119252066B/en
Publication of CN119252066A publication Critical patent/CN119252066A/en
Application granted granted Critical
Publication of CN119252066B publication Critical patent/CN119252066B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/09Arrangements for giving variable traffic instructions
    • G08G1/0962Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0968Systems involving transmission of navigation instructions to the vehicle
    • G08G1/096805Systems involving transmission of navigation instructions to the vehicle where the transmitted instructions are used to compute a route
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/09Arrangements for giving variable traffic instructions
    • G08G1/0962Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0968Systems involving transmission of navigation instructions to the vehicle
    • G08G1/096833Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route
    • G08G1/096838Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route where the user preferences are taken into account or the user selects one route out of a plurality
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/16Anti-collision systems
    • G08G1/167Driving aids for lane monitoring, lane changing, e.g. blind spot detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Mathematical Physics (AREA)
  • Traffic Control Systems (AREA)

Abstract

The application discloses a vehicle track optimization method based on imitation learning and a related device. The method comprises: obtaining initial track data of a target vehicle; determining a target flow direction of the target vehicle; screening a target lane from at least one lane corresponding to the target flow direction based on the initial track data; if the initial track data meets a correction condition, extracting lane information of the target lane from configuration information of the intersection, and obtaining current position information and current motion information of the target vehicle and its surrounding vehicles based on the initial track data; inputting the current position information, the current motion information and the lane information of the target vehicle and its surrounding vehicles into a generative adversarial imitation learning model to obtain target track data determined by the model; and correcting the initial track data based on the target track data.

Description

Vehicle track optimization method and related device based on imitation learning
Technical Field
The application relates to the technical field of intelligent traffic management, in particular to a vehicle track optimization method based on imitation learning and a related device.
Background
With the acceleration of the urbanization process, traffic pressure and security challenges are becoming more severe, and Intelligent Transportation Systems (ITS) are rapidly developing as a key technology to cope with these challenges. The ITS realizes real-time monitoring, efficient management and scientific guidance of traffic flow by integrating advanced information technology, data communication transmission technology and computer technology. The holographic track construction technology is used as an important component of the ITS, can comprehensively capture and analyze the dynamic behaviors of vehicles in the intersection, and is important for improving the optimization and prediction capabilities of traffic flow.
Currently, the main technical scheme for constructing complex holographic trajectories in intersections relies on radar-vision fusion technology. This technology combines the advantages of radar and visual detection equipment, and can acquire characteristic information such as the position, speed, movement direction, appearance, and license plate of a vehicle. However, in practical applications, due to factors such as line-of-sight shielding, illumination variation, and distance limitation, the visual detection equipment often cannot continuously or accurately capture the track of the vehicle, resulting in incomplete track information.
In order to solve these problems, the current technical scheme adopts a track correction method. The core of this approach is to set a preset trajectory, i.e. a series of expected travel paths and speeds for the vehicle according to road design and traffic rules. In practical application, the system firstly acquires actual running data of the vehicle through radar and visual detection equipment, and then compares and analyzes the data with a preset track. When the deviation between the actual track and the preset track is found, the system can correct the track by utilizing an algorithm so as to simulate and restore the real running state of the vehicle at the intersection.
However, although the trajectory correction method alleviates the detection error to some extent, it still has limitations. On the one hand, the setting of the preset track depends on road design and traffic rules, and deep learning and understanding of vehicle behaviors are lacking. Therefore, in a complex traffic environment, particularly in the case of a large traffic flow and a traffic incident, the prediction accuracy of the preset trajectory may be limited. On the other hand, the track correction method mainly depends on comparison and analysis of an actual track and a preset track by an algorithm, and lacks the capability of predicting the running intention and the path of the vehicle in real time. This limits the ability of holographic trajectory construction techniques to cope with sudden traffic events and complex traffic scenarios.
Disclosure of Invention
In view of the above problems, the present application provides a vehicle track optimization method based on imitation learning and a related device, so as to better cope with sudden traffic events and complex traffic scenes. The specific scheme is as follows:
the first aspect of the present application provides a vehicle track optimization method based on imitation learning, comprising:
obtaining initial track data of a target vehicle;
Determining a target flow direction of the target vehicle;
Screening a target lane from at least one lane corresponding to the target flow based on the initial track data;
if the initial track data meets the correction condition, extracting lane information of the target lane from configuration information of an intersection, and acquiring current position information and current motion information of the target vehicle and surrounding vehicles based on the initial track data;
Inputting the current position information, the current movement information and the lane information of the target vehicle and its surrounding vehicles into a generative adversarial imitation learning model to obtain target track data determined by the generative adversarial imitation learning model;
and correcting the initial track data based on the target track data.
In one possible implementation, determining a target flow direction of the target vehicle includes:
if the target vehicle has locked a flow direction, taking the locked flow direction as the target flow direction of the target vehicle;
If the target vehicle has not locked a flow direction and the lane flow direction of the entrance lane of the target vehicle is configured as a single flow direction, taking the single flow direction as the target flow direction of the target vehicle;
If the target vehicle has not locked a flow direction and the lane flow direction of the entrance lane of the target vehicle is configured as multiple flow directions, selecting, by comparing the vehicle flows corresponding to the respective flow directions, the flow direction with the largest vehicle flow among the multiple flow directions as the target flow direction of the target vehicle;
And if the vehicle flows corresponding to the multiple flow directions are consistent, determining a temporary target point based on the historical track of the target vehicle, and if the temporary target point is located in one of the multiple flow directions, taking the flow direction containing the temporary target point as the target flow direction of the target vehicle.
In one possible implementation, screening a target lane from the at least one lane corresponding to the target flow direction based on the initial trajectory data includes:
determining the latest track course angle of the target vehicle based on the track point of the initial track data where the target vehicle is currently located;
Acquiring an outlet course angle corresponding to a target flow direction of the target vehicle;
determining an average included angle between the latest track course angle and the exit course angle;
and determining the angle between the target point of each lane in at least one lane corresponding to the target flow direction and the current track point of the target vehicle, and taking the lane with the smallest difference between the angle and the average included angle as the target lane.
In one possible implementation, whether the initial trajectory data satisfies the correction condition is determined in the following manner:
Determining a first distance between a first initial track point in the initial track data, which enters an intersection of the target lane, and a target point of the target lane;
Determining a second distance between a track point where the target vehicle is currently located and the first initial track point in the initial track data;
if the target flow direction is a left-turn flow direction and the ratio of the second distance to the first distance is not smaller than a left-turn threshold, the correction condition is satisfied;
if the target flow direction is a right-turn flow direction and the ratio of the second distance to the first distance is not smaller than a right-turn threshold, the correction condition is satisfied;
If the target flow direction is a straight-going flow direction, the ratio of the second distance to the first distance is not smaller than a straight-going threshold, and visual tracking of the target vehicle has failed, the correction condition is satisfied;
and if the target flow direction is a u-turn flow direction and visual tracking of the target vehicle has failed, the correction condition is satisfied.
In one possible implementation, the generative adversarial imitation learning model is obtained by training based on an adversarial network, the adversarial network including a generator and a discriminator;
the process of training the generative adversarial imitation learning model based on the adversarial network includes:
Acquiring a vehicle running track of an intersection, and determining a state action pair of an expert based on the vehicle running track of the intersection;
At the current moment, sampling, according to the curriculum distribution set, a plurality of vehicles as sample vehicles;
The method comprises the steps of acquiring sample information of each sample vehicle in the plurality of sample vehicles at the current position, wherein the sample information comprises position information and motion information of the sample vehicle at the current position, position information and motion information of surrounding sample vehicles and lane information of a target sample lane;
processing sample information corresponding to each sample vehicle according to the current strategy of the generator to generate the track of each sample vehicle;
Determining a punishment value corresponding to the track of each sample vehicle;
Scoring, based on the discriminator, each state-action pair (s, a) in the trajectory of each sample vehicle to generate a reward value for each sample vehicle, the reward value being determined as follows:

$$ r(s,a) = -\log\bigl(D_{\psi}(s,a)\bigr) - p $$

where $D_{\psi}(s,a)$ denotes the value produced by the discriminator $D$ with parameters $\psi$ for the state-action pair $(s,a)$, and $p$ is the penalty value corresponding to the trajectory of the sample vehicle;
Updating the policy parameters of the generator based on the trust region optimization method comprises solving the following constrained optimization problem:

$$ \max_{\theta'} \; \mathbb{E}_{(o_t,a_t)\sim\pi_{\theta}}\!\left[\frac{\pi_{\theta'}(a_t \mid o_t)}{\pi_{\theta}(a_t \mid o_t)}\,A(o_t,a_t)\right] \quad \text{s.t.} \quad \mathbb{E}_{o_t}\!\left[D_{\mathrm{KL}}\bigl(\pi_{\theta}(\cdot \mid o_t)\,\|\,\pi_{\theta'}(\cdot \mid o_t)\bigr)\right] \le \delta $$

where $\theta$ denotes the parameters of the policy $\pi$; $\mathbb{E}$ denotes the expectation; $\pi_{\theta}$ denotes the current policy taken at time $t$, defined by the old parameters $\theta$; $\pi_{\theta'}$ denotes the new policy; $\pi_{\theta}(a_t \mid o_t)$ denotes the probability that the current policy takes action $a_t$ under the observation condition $o_t$ at time $t$; $\pi_{\theta'}(a_t \mid o_t)$ denotes the probability that the new policy takes action $a_t$ under the observation condition $o_t$ at time $t$; $\pi_{\theta}(\cdot \mid o_t)$ denotes the probability distribution over actions taken by the current policy under the observation condition $o_t$; $\pi_{\theta'}(\cdot \mid o_t)$ denotes the probability distribution over actions taken by the new policy under the observation condition $o_t$; $D_{\mathrm{KL}}$ denotes the KL (Kullback-Leibler) divergence between $\pi_{\theta}(\cdot \mid o_t)$ and $\pi_{\theta'}(\cdot \mid o_t)$; $\delta$ denotes the step-size parameter used to control the maximum variation of the policy in each optimization step; $A(o_t,a_t)$ denotes the advantage function, used to measure the degree of difference between the action value expectation $Q(o_t,a_t)$ of taking action $a_t$ under the observation condition $o_t$ and the state value expectation $V(o_t)$ estimated by the value estimator; action $a_t$ represents an action taken by the sample vehicle according to the policy;
The advantage function is estimated by the following generalized advantage estimation (GAE) method:

$$ \hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) $$

where $\gamma$ denotes the discount rate; $\lambda$ is a parameter between 0 and 1 used to balance the weights of the TD (Temporal Difference) errors $\delta_t$; $r_t$ denotes the reward value determined by the discriminator; and $V(s_t)$ and $V(s_{t+1})$ denote the state value expectations at time $t$ and time $t+1$, respectively;
maintaining the policy parameters of the generator unchanged, and updating the discriminator parameters $\psi$ based on the state-action pairs of the expert and the state-action pairs generated by the new policy of the generator, the discriminator parameters $\psi$ being updated through the following objective function:

$$ \max_{\psi} \; \mathbb{E}_{(s,a)\sim\rho_{\pi_{\theta'}}}\bigl[\log D_{\psi}(s,a)\bigr] + \mathbb{E}_{(s,a)\sim\rho_{\pi_{E}}}\bigl[\log\bigl(1 - D_{\psi}(s,a)\bigr)\bigr] $$

where $\pi_E$ denotes the expert policy and $\pi_{\theta'}$ denotes the new policy; $\rho_{\pi_{\theta'}}(s,a)$ denotes the probability that the state-action pair $(s,a)$ is visited when executing the policy $\pi_{\theta'}$, and can be written as $\rho_{\pi_{\theta'}}(s,a) = d_{\pi_{\theta'}}(s)\,\pi_{\theta'}(a \mid s)$, where $d_{\pi_{\theta'}}(s)$ denotes the probability of being in state $s$ under the policy $\pi_{\theta'}$ and $\pi_{\theta'}(a \mid s)$ denotes the probability that the current policy takes action $a$ in state $s$; $\rho_{\pi_E}(s,a)$ denotes the probability that the state-action pair $(s,a)$ is visited when executing the policy $\pi_E$, and can likewise be written as $\rho_{\pi_E}(s,a) = d_{\pi_E}(s)\,\pi_E(a \mid s)$, where $d_{\pi_E}(s)$ denotes the probability of being in state $s$ under the policy $\pi_E$ and $\pi_E(a \mid s)$ denotes the probability that the expert policy takes action $a$ in state $s$; $D_{\psi}(s,a)$ is a simplified notation for the value produced by the discriminator $D$ with parameters $\psi$ for the state-action pair $(s,a)$.
In one possible implementation, determining a penalty value corresponding to the trajectory of each of the sample vehicles includes:
determining, through a penalty function $p$, the penalty value corresponding to the trajectory of each sample vehicle;

$$ p = \mathbb{1}[\text{collision}]\,p_{\mathrm{col}} + \mathbb{1}[\text{too close to road edge}]\,p_{\mathrm{dist}} + \mathbb{1}[\text{kinematic constraints violated}]\,p_{\mathrm{kin}} + \mathbb{1}[\text{sudden braking}]\,p_{\mathrm{brake}} $$

where $d_{\min}$ denotes the minimum distance between any two sample vehicles (used to detect collisions), $p_{\mathrm{col}}$ denotes the collision penalty value, $d_{\mathrm{edge}}$ denotes the closest distance of the sample vehicle from the road edge, $d_{\mathrm{edge}} = \min(d_{\mathrm{left}}, d_{\mathrm{right}})$, $d_{\mathrm{left}}$ denotes the closest distance of the sample vehicle from the left edge of the road, $d_{\mathrm{right}}$ denotes the closest distance of the sample vehicle from the right edge of the road, $p_{\mathrm{dist}}$ denotes the distance penalty value, $p_{\mathrm{kin}}$ denotes the constraint penalty value applied when the vehicle kinematic constraints are not met, $p_{\mathrm{brake}}$ denotes the sudden-braking penalty value, and $a$ denotes the acceleration.
In one possible implementation, the collision penalty value is determined by:
extracting first n consecutive location points from the trajectory of the sample vehicle;
For each of the first n consecutive location points, marking the location point as abnormal if the sample vehicle did not collide with its surrounding vehicles before moving to the location point, but did collide with its surrounding vehicles at the location point; if the sample vehicle did not collide with its surrounding vehicles before moving to the location point and the location point did not collide with its surrounding vehicles, marking the location point as a candidate;
if the position points marked as candidates exist in the first n continuous position points, arranging the last position point in the position points marked as candidates as a new current position of the sample vehicle, determining a penalty value corresponding to the position points marked as abnormal, and accumulating the penalty values corresponding to the position points marked as abnormal to obtain a collision penalty value;
and if the first n continuous position points are marked as abnormal, taking the first position point in the first n continuous position points as a new current position of the sample vehicle, determining a punishment value corresponding to the position point marked as abnormal, and accumulating the punishment values corresponding to the position point marked as abnormal to obtain a collision punishment value.
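As an illustrative sketch of the marking-and-accumulation rule above, the following Python fragment shows one possible simplification in which each of the first n points is simply checked for a collision at that point; the helper `collides_at` and all other names are hypothetical and not taken from the patent.

```python
def collision_penalty(track, surrounding_vehicles, n, per_point_penalty, collides_at):
    """Sketch: classify the first n points of a generated trajectory as 'candidate'
    (no collision) or 'abnormal' (collision), pick the new current position, and
    accumulate a penalty for every abnormal point."""
    candidates, abnormal = [], []
    for point in track[:n]:
        if collides_at(point, surrounding_vehicles):
            abnormal.append(point)
        else:
            candidates.append(point)

    if candidates:
        new_position = candidates[-1]   # last collision-free point
    else:
        new_position = track[0]         # all n points abnormal: fall back to the first

    collision_penalty_value = per_point_penalty * len(abnormal)
    return new_position, collision_penalty_value
```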
Another aspect of the present application provides a vehicle trajectory optimization device based on imitation learning, including:
the first obtaining module is used for obtaining initial track data of the target vehicle;
the first determining module is used for determining a target flow direction of the target vehicle;
The screening module is used for screening a target lane from at least one lane corresponding to the target flow direction based on the initial track data;
The second obtaining module is used for extracting the lane information of the target lane from the configuration information of the intersection if the initial track data meets the correction condition, and obtaining the current position information and the current motion information of the target vehicle and surrounding vehicles based on the initial track data;
The second determining module is used for inputting the current position information, the current motion information and the lane information of the target vehicle and its surrounding vehicles into a generative adversarial imitation learning model to obtain target track data determined by the generative adversarial imitation learning model;
And the correction module is used for correcting the initial track data based on the target track data.
A third aspect of the present application provides an electronic apparatus, comprising:
the memory is used for storing a computer program;
The processor is configured to execute the computer program to enable the electronic device to implement a method of vehicle trajectory optimization based on imitation learning as described in any one of the above.
A fourth aspect of the present application provides a computer storage medium carrying one or more computer programs which, when executed by an electronic device, enable the electronic device to implement a vehicle track optimization method based on imitation learning as described in any one of the preceding.
In the present application, by training the generative adversarial imitation learning model, the model can learn how to predict the future travel intention and path of a vehicle based on its current position, speed, and movement direction, and on the dynamic changes of surrounding vehicles. Therefore, when the current position information, the current motion information, and the lane information of the target vehicle and its surrounding vehicles are input into the generative adversarial imitation learning model, the model can generate more accurate and reliable target track data, so that the target track data can intelligently correct incomplete or deviated vehicle tracks determined by the radar device and the visual detection device, overcoming problems such as line-of-sight shielding, illumination change, and excessive distance, and thereby continuously and accurately capturing the motion state of the vehicle in the intersection. Moreover, by having the generative adversarial imitation learning model deeply learn and understand vehicle behavior, the dependence on preset tracks can be abandoned, the driving intention and path of the vehicle can be accurately predicted, and sudden traffic incidents and complex traffic scenes can be better handled. In addition, the generative adversarial imitation learning model can optimize the data fusion process in complex traffic scenes, such as the accuracy and adaptability of radar-vision fusion under traffic congestion.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
Fig. 1 is a schematic flow chart of a vehicle track optimization method based on imitation learning according to embodiment 1 of the present application;
FIG. 2 is a schematic flow chart of guiding a vehicle to move;
FIG. 3 is a schematic diagram of another scenario implementation of a vehicle trajectory optimization method based on imitation learning provided by the present application;
FIG. 4 is a schematic diagram of a training process for a generator and a discriminator provided by the present application;
FIG. 5 is a schematic diagram of still another scenario implementation of a vehicle trajectory optimization method based on imitation learning provided by the present application;
Fig. 6 is a schematic structural diagram of a vehicle track optimizing device based on imitation learning.
Detailed Description
Embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. The terminology used in the description of the embodiments of the application herein is for the purpose of describing particular embodiments of the application only and is not intended to be limiting of the application.
Embodiments of the present application are described below with reference to the accompanying drawings. As one of ordinary skill in the art can know, with the development of technology and the appearance of new scenes, the technical scheme provided by the embodiment of the application is also applicable to similar technical problems.
The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely illustrative of the manner in which embodiments of the application have been described in connection with the description of the objects having the same attributes. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
In order that the above-recited objects, features and advantages of the present application will become more readily apparent, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description.
Referring to fig. 1, which is a schematic flow chart of a vehicle track optimization method based on imitation learning according to embodiment 1 of the present application, the method may include, but is not limited to, the following steps:
step S101, obtaining initial trajectory data of a target vehicle.
In this embodiment, the radar device may be used to accurately capture dynamic data such as position information, speed, and movement direction of the target vehicle, and the visual detection device may be used to capture unique identification information such as appearance characteristics (e.g., vehicle type, color, etc.) and license plate of the target vehicle. Initial trajectory data of the target vehicle is generated based on dynamic data such as position information, speed, movement direction and the like of the target vehicle, appearance characteristics (such as vehicle type, color and the like) of the target vehicle, license plates and the like.
Step S102, determining a target flow direction of the target vehicle.
The target flow direction may be understood as a direction or path that the target vehicle may or is expected to travel in the future. Determining the target flow direction may help more accurately screen and locate the lane in which the target vehicle may travel.
And step S103, selecting a target lane from at least one lane corresponding to the target flow direction based on the initial track data.
After the target flow direction of the target vehicle is determined, a lane most likely to be driven by the target vehicle, that is, a target lane, may be selected from a plurality of lanes corresponding to the target flow direction according to the initial trajectory data. This step helps to further narrow the range of travel of the target vehicle, facilitating subsequent analysis and processing.
Step S104, if the initial track data meets the correction condition, extracting the lane information of the target lane from the configuration information of the intersection, and acquiring the current position information and the current motion information of the target vehicle and surrounding vehicles based on the initial track data.
When the initial track data meets a certain correction condition, the lane information related to the target lane, such as a lane number, a lane corresponding flow direction, an exit target point and the like, can be extracted from the configuration information of the intersection.
In this embodiment, the current position information and the current motion information of the target vehicle may be extracted from the initial trajectory data. And determining surrounding vehicles of the target vehicle based on the initial trajectory data. The current position information and the current movement information of the surrounding vehicles are determined by the track data of the surrounding vehicles.
The current motion information of the target vehicle may include, but is not limited to, at least one of a current speed, a current acceleration, and a current vehicle corner of the target vehicle. Accordingly, the current motion information of the surrounding vehicles of the target vehicle may also include, but is not limited to, at least one of the current speed, the current acceleration, and the current vehicle corner of the surrounding vehicles.
If the initial trajectory data does not satisfy the correction condition, the initial trajectory data may be used to output the position in the initial trajectory data. The output position may be used to guide the movement of the target vehicle.
Step S105, inputting the current position information and current movement information of the target vehicle and its surrounding vehicles and the lane information of the target lane into the generative adversarial imitation learning model, and obtaining the target track data determined by the generative adversarial imitation learning model.
In this embodiment, the generative adversarial imitation learning model may be trained based on actual vehicle driving tracks at the intersection, and its policy parameters may be updated so that the model can imitate expert policies (i.e., policies corresponding to actual vehicle behaviors) and deeply learn and understand vehicle behaviors. The model can then generate action parameters similar to actual vehicle behaviors and thereby obtain target track data, ensuring that the target track data is more reasonable.
In this embodiment, the vehicle driving tracks at the intersection may be grouped by flow direction to obtain the corresponding vehicle driving tracks of the left-turn, straight-going, right-turn and u-turn flow directions, and training may be performed simultaneously based on these tracks, so that the generative adversarial imitation learning model can simultaneously imitate and learn the expert policies of different flow directions.
And step S106, correcting the initial track data based on the target track data.
In this embodiment, the initial track data may be replaced with the target track data, so as to complete the correction of the initial track data.
Or the initial track data may be updated based on the target track data to complete the correction.
After the initial trajectory data is corrected, the target vehicle may be guided to move based on the corrected trajectory data. In the process of guiding the target vehicle to move, it can be judged whether to continue generating target track data. For example, as shown in fig. 2, after the vehicle moves and the post-movement position is output, it may be determined based on the post-movement position whether the target trajectory data has reached the effective areas of the radar device and the visual detection device. If so, the step of generating target trajectory data may be ended; if not, the post-movement position is taken as new current position information, new current motion information of the target vehicle is acquired from the target trajectory data, new current position information and new current motion information of the surrounding vehicles are acquired, and the step of generating target trajectory data is continued.
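A minimal sketch of the loop in fig. 2, under assumed interfaces (`generate_step` advances the generative model by one position, `in_effective_area` checks the radar/visual detection coverage; both are hypothetical):

```python
def guide_until_detectable(current_obs, generate_step, in_effective_area, max_steps=200):
    """Sketch: keep generating positions from the model until the vehicle's output
    position re-enters the effective detection area of the radar/visual devices."""
    generated_positions = []
    for _ in range(max_steps):
        next_position, current_obs = generate_step(current_obs)  # one model step
        generated_positions.append(next_position)
        if in_effective_area(next_position):
            break  # the sensors can track the vehicle again; stop generating
    return generated_positions
```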
In the present embodiment, by training the generative adversarial imitation learning model, the model can learn how to predict the future travel intention and path of a vehicle based on its current position, speed, and movement direction, and on the dynamic changes of surrounding vehicles. Therefore, when the current position information, the current motion information, and the lane information of the target vehicle and its surrounding vehicles are input into the generative adversarial imitation learning model, the model can generate more accurate and reliable target track data, so that the target track data can intelligently correct incomplete or deviated vehicle tracks determined by the radar device and the visual detection device, overcoming problems such as line-of-sight shielding, illumination change, and excessive distance, and thereby continuously and accurately capturing the motion state of the vehicle in the intersection. Moreover, by having the generative adversarial imitation learning model deeply learn and understand vehicle behavior, the dependence on preset tracks can be abandoned, the driving intention and path of the vehicle can be accurately predicted, and sudden traffic incidents and complex traffic scenes can be better handled. In addition, the generative adversarial imitation learning model can optimize the data fusion process in complex traffic scenes, such as the accuracy and adaptability of radar-vision fusion under traffic congestion.
As another optional embodiment of the present application, embodiment 2 of the present application provides a vehicle track optimization method based on imitation learning; this embodiment is mainly an implementation of step S102 in embodiment 1, where step S102 may include, but is not limited to, the following steps:
step S1021, if the target vehicle has locked a flow direction, the locked flow direction is used as the target flow direction of the target vehicle.
In the present embodiment, if the target vehicle has been determined in its travel direction before entering the intersection due to its travel path or traffic signal control or the like and is not allowed to change, it may be determined that the target vehicle has locked the flow direction.
In most cases, a straight-traveling vehicle may be "locked" as it approaches an intersection, because the straight-traveling vehicle typically does not need to make a directional selection in front of the intersection, but instead travels directly along the current lane.
Step S1022, if the target vehicle has not locked a flow direction and the lane flow direction of the entrance lane of the target vehicle is configured as a single flow direction, the single flow direction is used as the target flow direction of the target vehicle.
A single flow direction is understood to mean that the entrance lane allows only vehicles of a single flow direction to travel, such as only left or only right turns.
Step S1023, if the target vehicle has not locked a flow direction and the lane flow direction of the entrance lane of the target vehicle is configured as multiple flow directions, selecting, by comparing the vehicle flows corresponding to the respective flow directions, the flow direction with the largest vehicle flow among the multiple flow directions as the target flow direction of the target vehicle.
If the multiple flow directions include a left-turn flow direction and a straight-going flow direction, the numbers of left-turning and straight-going vehicles in the intersection can be monitored to judge the current flow direction of the vehicle.
If the number of left-turning vehicles is larger than the number of straight-going vehicles, the left-turn flow direction is taken as the target flow direction of the target vehicle.
If the number of left-turning vehicles is smaller than the number of straight-going vehicles, the straight-going flow direction is taken as the target flow direction of the target vehicle.
If the multiple flow directions include a right-turn flow direction and a straight-going flow direction, the numbers of right-turning and straight-going vehicles in the intersection can be monitored to judge the flow direction.
If the number of right-turning vehicles is greater than the number of straight-going vehicles, or the number of straight-going vehicles is 0, the right-turn flow direction may be taken as the target flow direction of the target vehicle.
And step S1024, if the vehicle flows corresponding to the multiple flow directions are consistent, determining a temporary target point based on the historical track of the target vehicle, and if the temporary target point is located in one of the multiple flow directions, taking the flow direction containing the temporary target point as the target flow direction of the target vehicle.
In this embodiment, when the vehicle flows of the respective flow directions of a multi-flow-direction lane are consistent, a temporary target point based on the historical track of the target vehicle is used to determine the target flow direction, which can reflect the real intention and driving habit of the vehicle more accurately and improve the accuracy of the target flow direction.
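Steps S1021-S1024 form a small decision cascade; a sketch under assumed data structures follows (all names are hypothetical, and ties for the largest vehicle flow simply fall through to the history-based rule):

```python
def determine_target_flow_direction(locked_flow, lane_flows, flow_counts, history_flow):
    """Sketch of steps S1021-S1024.

    locked_flow:  flow direction already locked by the target vehicle, or None.
    lane_flows:   list of flow directions configured for the entrance lane.
    flow_counts:  dict mapping each flow direction to its current vehicle count.
    history_flow: flow direction containing the temporary target point derived
                  from the vehicle's historical track, or None.
    """
    if locked_flow is not None:                  # S1021: flow direction already locked
        return locked_flow
    if len(lane_flows) == 1:                     # S1022: single-flow-direction lane
        return lane_flows[0]
    counts = {flow: flow_counts.get(flow, 0) for flow in lane_flows}
    best = max(counts, key=counts.get)
    if list(counts.values()).count(counts[best]) == 1:
        return best                              # S1023: unique largest vehicle flow
    return history_flow                          # S1024: consistent flows -> use history
```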
As another optional embodiment of the present application, embodiment 3 of the present application provides a vehicle track optimization method based on imitation learning; this embodiment is mainly an implementation of step S103 in embodiment 1, where step S103 may include, but is not limited to, the following steps:
and step S1031, determining the latest track course angle of the target vehicle based on the track point of the initial track data where the target vehicle is currently located.
In this embodiment, the latest track course angle of the target vehicle may be calculated based on the track point where the target vehicle is currently located in the initial track data and the original track point that is N points away from the current track point. N may be set as needed and is not limited in the present application; for example, N may be 3.
Step S1032, obtaining an exit course angle corresponding to the target flow direction of the target vehicle.
Step S1033, determining an average included angle between the latest track course angle and the exit course angle.
In this embodiment, the average included angle can be calculated by the following relation:

$$ \theta_{\mathrm{avg}} = \frac{\theta_{\mathrm{track}} + \theta_{\mathrm{exit}}}{2} $$

where $\theta_{\mathrm{avg}}$ represents the average included angle, $\theta_{\mathrm{track}}$ represents the latest track course angle, and $\theta_{\mathrm{exit}}$ represents the exit course angle.
And step S1034, determining the angle between the target point of each lane in at least one lane corresponding to the target flow direction and the current track point of the target vehicle, and taking the lane with the smallest difference between the angle and the average included angle as the target lane.
As shown in fig. 3, the target point of each lane in at least one lane corresponding to the target flow direction and the track point where the target vehicle is currently located may be connected, an angle between the target point of each lane in at least one lane corresponding to the target flow direction and the track point where the target vehicle is currently located is calculated, and the lane with the smallest difference between the angle and the average included angle is used as the target lane.
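The screening rule of steps S1031-S1034 can be sketched as follows; the heading here is taken from the two most recent track points and angle wrap-around is ignored for brevity, so the names and simplifications are assumptions rather than the patent's exact computation:

```python
import math

def select_target_lane(current_point, previous_point, exit_heading_deg, lane_target_points):
    """Sketch: pick the lane whose target point lies at an angle closest to the
    average of the latest track course angle and the exit course angle."""
    track_heading = math.degrees(math.atan2(current_point[1] - previous_point[1],
                                            current_point[0] - previous_point[0]))
    average_angle = (track_heading + exit_heading_deg) / 2.0

    best_lane, smallest_diff = None, float("inf")
    for lane_id, (tx, ty) in lane_target_points.items():
        # Angle from the current track point to this lane's target point.
        angle = math.degrees(math.atan2(ty - current_point[1], tx - current_point[0]))
        diff = abs(angle - average_angle)
        if diff < smallest_diff:
            best_lane, smallest_diff = lane_id, diff
    return best_lane
```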
As another optional embodiment of the present application, embodiment 4 of the present application provides a vehicle track optimization method based on imitation learning; this embodiment is mainly an implementation of the manner of determining that the initial track data in embodiment 1 satisfies the correction condition, where whether the initial track data satisfies the correction condition may be determined in the following manner:
step S11, determining a first distance between a first initial track point in the intersection entering the target lane and a target point of the target lane in the initial track data.
In this embodiment, based on the coordinates of the first initial track point entering the intersection toward the target lane in the initial track data and the coordinates of the target point of the target lane, the distance between the first initial track point and the target point of the target lane (denoted d1) may be calculated, and d1 is taken as the first distance.
And step S12, determining a second distance between the track point of the target vehicle in the initial track data and the first initial track point.
In this embodiment, based on the coordinates of the track point where the target vehicle is currently located in the initial track data and the coordinates of the first initial track point, the distance between the current track point of the target vehicle and the first initial track point (denoted d2) may be calculated, and d2 is taken as the second distance.
Step S13, if the target flow direction is a left-turn flow direction and the ratio of the second distance to the first distance is not smaller than a left-turn threshold, the correction condition is met.
In the present embodiment, the ratio d2/d1 may be compared with the left-turn threshold; if d2/d1 is not smaller than the left-turn threshold, it is determined that the correction condition is satisfied.
Step S14, if the target flow direction is a right-turn flow direction and the ratio of the second distance to the first distance is not smaller than a right-turn threshold, the correction condition is met.
In the present embodiment, the ratio d2/d1 may be compared with the right-turn threshold; if d2/d1 is not smaller than the right-turn threshold, it is determined that the correction condition is satisfied.
Step S15, if the target flow direction is a straight-going flow direction, the ratio of the second distance to the first distance is not smaller than a straight-going threshold, and visual tracking of the target vehicle has failed, the correction condition is met.
In the present embodiment, the ratio d2/d1 may be compared with the straight-going threshold; if d2/d1 is not smaller than the straight-going threshold, the target vehicle may be marked as being in a locked-flow-direction state; when it is further determined that visual tracking of the target vehicle has failed, it is determined that the correction condition is satisfied.
Step S16, if the target flow direction is a u-turn flow direction and visual tracking of the target vehicle has failed, the correction condition is met.
Besides vehicles in the straight-going flow direction, which lock the flow direction before the correction condition is met, vehicles in the left-turn and right-turn flow directions lock the flow direction at the same time as the correction condition is met, and vehicles in the u-turn flow direction lock the flow direction when the correction condition is met after a tracking failure.
In this embodiment, thresholds are set for different flow directions, and whether the vehicle has entered an "untrusted zone" (i.e., a dynamic blind zone) is determined according to the actual driving distance of the vehicle (the second distance d2) and the reference distance (the first distance d1). Compared with determining the blind-zone range by relying on physical calibration of the radar device and the visual detection device, this method is more flexible and can adapt to changes of different environments and intersections. When the vehicle enters the dynamic blind zone, the system automatically enters a correction mode and intelligently corrects the vehicle track in the blind zone. This helps to reduce trajectory errors and false positives caused by blind zones. By dynamically defining the blind zone and intelligently correcting the track, the method improves the robustness of the system, so that the system can track and predict the running track of the vehicle more accurately. In addition, the radar-vision fusion process can be improved, so that radar and visual information can be combined more effectively, and the accuracy and reliability of overall perception are improved.
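The per-flow-direction test of steps S13-S16 reduces to a few comparisons; the sketch below uses placeholder threshold values, since the patent does not fix them:

```python
def meets_correction_condition(flow_direction, d1, d2, visual_tracking_failed,
                               left_threshold=0.5, right_threshold=0.5, straight_threshold=0.5):
    """Sketch of steps S13-S16: d1 is the first distance, d2 the second distance."""
    ratio = d2 / d1 if d1 > 0 else 0.0
    if flow_direction == "left_turn":
        return ratio >= left_threshold
    if flow_direction == "right_turn":
        return ratio >= right_threshold
    if flow_direction == "straight":
        return ratio >= straight_threshold and visual_tracking_failed
    if flow_direction == "u_turn":
        return visual_tracking_failed
    return False
```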
As another optional embodiment of the present application, embodiment 5 of the present application provides a vehicle trajectory optimization method based on imitation learning; this embodiment is mainly an implementation of the generative adversarial imitation learning model described in embodiment 1, where the generative adversarial imitation learning model may be obtained by training based on an adversarial network, and the adversarial network may include a generator and a discriminator.
In this embodiment, the discount rate, batch size, number of training iterations, curriculum learning schedule, maximum number of training vehicles, and the meaning of the policy may be determined before training. For example, two training phases may be used: a first training phase with a discount rate of 0.95, a batch size of 10000 observation-action pairs, and 1000 training iterations, in which 10 vehicles are added to the environment every 200 iterations as a curriculum learning process, so that learning efficiency and performance are improved by gradually increasing task difficulty; and a second training phase (fine-tuning phase) with a discount rate of 0.99, a batch size of 40000 observation-action pairs, and 200 training iterations, in which training is performed with 100 vehicles. The policy selects the actions that the vehicle should perform based on the current environmental observations and the policy parameters.
The discount rate may be understood as a coefficient between 0 and 1 used to measure the current value of future rewards in scenarios such as reinforcement learning. It determines the importance of future rewards: a higher discount rate pays more attention to future rewards, whereas a lower discount rate pays more attention to current rewards.
The observation action pair may be understood as a combination of an observation (observation) received by the vehicle at the current time (an observation is information acquired from the environment that describes the current state or part of the state of the environment) and an action to be taken subsequently.
In this embodiment, the penalty function may be defined in accordance with vehicle kinematic constraints and anti-collision rules.
In this embodiment, the policy parameters θ of the generator, the discriminator parameters ψ (by updating ψ, the discriminator can learn to judge whether a state-action pair is generated by the expert policy or by the generator policy), the step-size parameter δ (used to control the maximum variation of the policy in each optimization step, ensuring that the policy update is both efficient and stable; this parameter is a key hyper-parameter that balances algorithm stability and learning progress), and the curriculum distribution set (a concept in curriculum learning that defines the distribution of task difficulty an agent will face during training) may be initialized. In a multi-agent (i.e., multi-vehicle) environment, the curriculum distribution set can be defined as a time-varying distribution that gradually increases the number of agents under policy control, thereby gradually increasing the difficulty of training. This helps the agents learn to handle more complex interaction scenarios step by step and avoids overly complex decision-making at the beginning.
In the PS-GAIL (Parameter Sharing Generative Adversarial Imitation Learning) method, the policy parameters may be shared. In a multi-agent environment, all agents share the same set of policy parameters, meaning that, given the same input, agents act according to the same policy. By sharing policy parameters, the agents can learn driving policies that interact with other agents in complex traffic scenarios. Sharing policy parameters helps reduce model complexity, improve training efficiency, and promote collaborative behavior between agents.
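A one-function sketch of the parameter-sharing idea (hypothetical interface): every vehicle draws its action from the same policy network given its own observation.

```python
def rollout_shared_policy(policy, observations):
    """Sketch of PS-GAIL parameter sharing: one policy, many vehicles."""
    # observations: dict mapping vehicle_id -> observation vector for that vehicle.
    return {vehicle_id: policy.act(obs) for vehicle_id, obs in observations.items()}
```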
The process of training the generative adversarial imitation learning model based on the adversarial network includes:
and S31, acquiring the vehicle running track of the intersection, and determining the state action pair of the expert based on the vehicle running track of the intersection.
In this embodiment, the vehicle travel tracks at the intersection can be collected by means such as unmanned aerial vehicles and mobile phone positioning data, and tracks that are too short and discontinuous tracks are removed to obtain the remaining vehicle travel tracks.
In this embodiment, the remaining vehicle travel tracks may be grouped by flow direction to accommodate different driving intentions such as turning left, going straight, and turning right. Based on the expert policy, discrete track points are extracted from each vehicle travel track corresponding to each flow direction as expert state-action pairs (s, a), forming an expert behavior set; training is performed simultaneously for the different driving intentions. Here s represents a state (e.g., position, velocity, etc.) and a represents an action (e.g., acceleration, deceleration, steering, etc.).
Step S32, at the current moment, sampling, according to the curriculum distribution set, a plurality of vehicles as sample vehicles.
And step S33, acquiring sample information of each sample vehicle in the plurality of sample vehicles at its current position, wherein the sample information includes the position information and motion information of the sample vehicle at the current position, the position information and motion information of its surrounding sample vehicles, and the lane information of the target sample lane.
The position information and the motion information may be referred to in the previous embodiments, and the description thereof is omitted herein.
And step S34, processing sample information corresponding to each sample vehicle according to the current strategy of the generator, and generating the track of each sample vehicle.
And step S35, determining a penalty value corresponding to the track of each sample vehicle.
In this embodiment, a penalty value may be generated if the behavior of the vehicle violates a rule.
Step S36, scoring, based on the discriminator, each state-action pair (s, a) in the trajectory of each sample vehicle to generate a reward value for each sample vehicle, the reward value being determined as follows:

$$ r(s,a) = -\log\bigl(D_{\psi}(s,a)\bigr) - p $$

where $D_{\psi}(s,a)$ denotes the value produced by the discriminator $D$ with parameters $\psi$ for the state-action pair $(s,a)$, and $p$ denotes the penalty value corresponding to the trajectory of the sample vehicle.
When the behavior of the vehicle violates a rule, the penalty value $p$ takes a positive value, thereby reducing the overall reward. In this way, the vehicle will tend to avoid behaviors that would incur a penalty when making decisions.
Step S37, keeping the parameters of the discriminator unchanged, the policy parameters of the generator are updated based on the trust region optimization method; updating the policy parameters of the generator based on the trust region optimization method comprises solving the following constrained optimization problem:

$$ \max_{\theta'} \; \mathbb{E}_{(o_t,a_t)\sim\pi_{\theta}}\!\left[\frac{\pi_{\theta'}(a_t \mid o_t)}{\pi_{\theta}(a_t \mid o_t)}\,A(o_t,a_t)\right] \quad \text{s.t.} \quad \mathbb{E}_{o_t}\!\left[D_{\mathrm{KL}}\bigl(\pi_{\theta}(\cdot \mid o_t)\,\|\,\pi_{\theta'}(\cdot \mid o_t)\bigr)\right] \le \delta $$

where $\theta$ denotes the parameters of the policy $\pi$; $\mathbb{E}$ denotes the expectation; $\pi_{\theta}$ denotes the current policy taken at time $t$, defined by the old parameters $\theta$; $\pi_{\theta'}$ denotes the new policy; $\pi_{\theta}(a_t \mid o_t)$ denotes the probability that the current policy takes action $a_t$ under the observation condition $o_t$ at time $t$; $\pi_{\theta'}(a_t \mid o_t)$ denotes the probability that the new policy takes action $a_t$ under the observation condition $o_t$ at time $t$; $\pi_{\theta}(\cdot \mid o_t)$ denotes the probability distribution over actions taken by the current policy under the observation condition $o_t$; $\pi_{\theta'}(\cdot \mid o_t)$ denotes the probability distribution over actions taken by the new policy under the observation condition $o_t$; $D_{\mathrm{KL}}$ denotes the KL (Kullback-Leibler) divergence between $\pi_{\theta}(\cdot \mid o_t)$ and $\pi_{\theta'}(\cdot \mid o_t)$; $\delta$ denotes the step-size parameter used to control the maximum variation of the policy in each optimization step; $A(o_t,a_t)$ denotes the advantage function, used to measure the degree of difference between the action value expectation $Q(o_t,a_t)$ of taking action $a_t$ under the observation condition $o_t$ and the state value expectation $V(o_t)$ estimated by the value estimator; action $a_t$ represents an action that the vehicle should perform, selected according to the current policy of the generator;

The advantage function is estimated by the following generalized advantage estimation (GAE) method:

$$ \hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) $$

where $\gamma$ denotes the discount rate; $\lambda$ is a parameter between 0 and 1 used to balance the weights of the TD (Temporal Difference) errors $\delta_t$; $r_t$ denotes the reward value determined by the discriminator; and $V(s_t)$ and $V(s_{t+1})$ denote the state value expectations at time $t$ and time $t+1$, respectively;
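An illustrative sketch of the generalized advantage estimation above, computed backwards over one sample-vehicle trajectory (function and argument names are assumptions):

```python
import numpy as np

def generalized_advantage_estimation(rewards, values, gamma=0.95, lam=0.97):
    """Sketch of GAE: rewards[t] is the discriminator-based reward r_t (minus penalty),
    values has one more entry than rewards so values[t + 1] bootstraps the last step."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):          # accumulate discounted TD errors
        td_error = rewards[t] + gamma * values[t + 1] - values[t]
        gae = td_error + gamma * lam * gae
        advantages[t] = gae
    return advantages
```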
Step S38, keeping the policy parameters of the generator unchanged, the discriminator parameters $\psi$ are updated based on the state-action pairs of the expert and the state-action pairs generated by the new policy of the generator; the discriminator parameters $\psi$ are updated through the following objective function:

$$ \max_{\psi} \; \mathbb{E}_{(s,a)\sim\rho_{\pi_{\theta'}}}\bigl[\log D_{\psi}(s,a)\bigr] + \mathbb{E}_{(s,a)\sim\rho_{\pi_{E}}}\bigl[\log\bigl(1 - D_{\psi}(s,a)\bigr)\bigr] $$

where $\pi_E$ denotes the expert policy and $\pi_{\theta'}$ denotes the new policy; $\rho_{\pi_{\theta'}}(s,a)$ denotes the probability that the state-action pair $(s,a)$ is visited when executing the policy $\pi_{\theta'}$, i.e., its occupancy measure, which can be written as $\rho_{\pi_{\theta'}}(s,a) = d_{\pi_{\theta'}}(s)\,\pi_{\theta'}(a \mid s)$, where $d_{\pi_{\theta'}}(s)$ denotes the probability of being in state $s$ under the policy $\pi_{\theta'}$ and $\pi_{\theta'}(a \mid s)$ denotes the probability that the current policy takes action $a$ in state $s$; $\rho_{\pi_E}(s,a)$ denotes the probability that the state-action pair $(s,a)$ is visited when executing the policy $\pi_E$, i.e., its occupancy measure, which can likewise be written as $\rho_{\pi_E}(s,a) = d_{\pi_E}(s)\,\pi_E(a \mid s)$, where $d_{\pi_E}(s)$ denotes the probability of being in state $s$ under the policy $\pi_E$ and $\pi_E(a \mid s)$ denotes the probability that the expert policy takes action $a$ in state $s$; $D_{\psi}(s,a)$ is a simplified notation for the value produced by the discriminator $D$ with parameters $\psi$ for the state-action pair $(s,a)$.
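A minimal PyTorch-style sketch of the discriminator update in step S38, under the label convention assumed above (expert pairs labelled 0, generator pairs labelled 1, discriminator output in (0, 1)); all names are illustrative:

```python
import torch
import torch.nn.functional as F

def update_discriminator(discriminator, optimizer, expert_pairs, generated_pairs):
    """Sketch of step S38: binary classification between expert and generator
    state-action pairs; the discriminator is assumed to end with a sigmoid."""
    optimizer.zero_grad()
    d_expert = discriminator(expert_pairs)        # D_psi(s, a) on expert pairs
    d_generated = discriminator(generated_pairs)  # D_psi(s, a) on generator pairs
    loss = (F.binary_cross_entropy(d_expert, torch.zeros_like(d_expert))
            + F.binary_cross_entropy(d_generated, torch.ones_like(d_generated)))
    loss.backward()
    optimizer.step()
    return loss.item()
```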
If the training end condition is not reached, steps S31-S37 may be continued; if the training end condition is reached, training may be ended, and the generator in the adversarial network serves as the generative adversarial imitation learning model. The training end condition may be set as needed and is not limited in the present application; for example, the training end condition may include reaching a set number of iterations or meeting an objective function requirement (e.g., a convergence index). Referring to fig. 4, the iterative execution of these steps is described as follows. As shown in fig. 4, after the training procedure is determined, the penalty function is defined, and the parameters are initialized, the vehicle travel tracks of the intersection are acquired and grouped by flow direction to obtain the corresponding expert behavior sets. After the sample information of each sample vehicle at its current position is obtained, the sample information corresponding to each sample vehicle is processed according to the current policy of the generator to generate the trajectory of each sample vehicle, and it is evaluated whether the number of training iterations has been reached or the objective function requirement is met. If not, each state-action pair in the trajectory of each sample vehicle is scored by the discriminator to generate a reward value for each sample vehicle; the parameters of the discriminator are kept unchanged, and the policy parameters of the generator are updated based on the trust region optimization method, in which the reward values generated by the discriminator are used; after the policy parameters of the generator are updated, the generator learns a new policy; the policy parameters of the generator are then kept unchanged, and the discriminator parameters $\psi$ are updated by means of the objective function, based on the state-action pairs of the expert and the state-action pairs generated by the new policy of the generator.
And ending the training if the training times are reached or the requirements of the objective function are met.
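The overall iteration of fig. 4 can be summarized by the following training-loop sketch. Every behavior is injected as a callable so that the loop only fixes the ordering of the steps; the callable names, their signatures, and the iteration-count stopping rule are assumptions for illustration, not the patented procedure itself.

def train_gail(generator, judge, expert_pairs,
               sample_vehicles, rollout, penalty_of, reward_of,
               update_policy, update_judge, max_iters=1000):
    # sample_vehicles()          -> sample vehicles per the course distribution set (S32)
    # rollout(generator, v)      -> trajectory of one sample vehicle under the current policy (S34)
    # penalty_of(traj)           -> penalty value of a trajectory (S35)
    # reward_of(judge, traj, c)  -> per-step reward values from the judge and the penalty (S36)
    # update_policy(...)         -> trust-region update of the generator's policy parameters (S37)
    # update_judge(...)          -> objective-function update of the judge's parameters (S38)
    for it in range(max_iters):
        vehicles = sample_vehicles()
        trajectories = [rollout(generator, v) for v in vehicles]
        penalties = [penalty_of(traj) for traj in trajectories]
        rewards = [reward_of(judge, traj, c) for traj, c in zip(trajectories, penalties)]
        update_policy(generator, trajectories, rewards)   # judge parameters held fixed
        update_judge(judge, trajectories, expert_pairs)   # generator parameters held fixed
    return generator  # the trained generator serves as the GAIL model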
As another optional embodiment of the present application, embodiment 6 provides a vehicle trajectory optimization method based on imitation learning that is mainly an implementation of step S35 in embodiment 5 above; step S35 may include, but is not limited to, the following steps:
Step S351: the penalty value corresponding to the track of each sample vehicle is determined through the penalty function C.
Wherein the penalty function takes as inputs the minimum distance between any two sample vehicles, from which the collision penalty value is obtained; the closest distance of the sample vehicle to the road edge, taken as the smaller of its closest distance to the left road edge and its closest distance to the right road edge, from which the distance penalty value is obtained; whether the vehicle kinematic constraints are satisfied, from which the constraint penalty value is obtained; and the acceleration, from which the sudden-braking penalty value is obtained.
It will be appreciated that the penalty value C may consist of a single penalty term (i.e., one of the collision penalty value, the distance penalty value, the constraint penalty value, and the sudden-braking penalty value) or of a plurality of penalty terms (i.e., at least two of the collision penalty value, the distance penalty value, the constraint penalty value, and the sudden-braking penalty value).
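One possible way to assemble the penalty value from the four terms named above is sketched below; all thresholds (safe_gap, edge_margin, brake_limit) and the individual penalty magnitudes are illustrative assumptions, since the present application does not fix their numerical values.

def trajectory_penalty(min_vehicle_gap, d_left, d_right, kinematics_ok, acceleration,
                       safe_gap=1.0, edge_margin=0.5, brake_limit=-4.0):
    # min_vehicle_gap: minimum distance between any two sample vehicles
    # d_left / d_right: closest distance to the left / right road edge
    # kinematics_ok:   whether the vehicle kinematic constraints are satisfied
    # acceleration:    longitudinal acceleration (strongly negative when braking hard)
    penalty = 0.0
    if min_vehicle_gap < safe_gap:           # collision penalty term
        penalty += 10.0
    if min(d_left, d_right) < edge_margin:   # distance (road-edge) penalty term
        penalty += 5.0
    if not kinematics_ok:                    # constraint penalty term
        penalty += 5.0
    if acceleration < brake_limit:           # sudden-braking penalty term
        penalty += 2.0
    return penalty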
As another optional embodiment of the present application, embodiment 7 provides a vehicle trajectory optimization method based on imitation learning that is mainly an implementation of the collision penalty value in the foregoing embodiment 6, where the collision penalty value is determined in the following manner:
Step S41: the first n consecutive position points are extracted from the track of the sample vehicle.
In the present embodiment, the value of n can be set as needed and is not limited in the present application. For example, n may be set to 6.
The first n consecutive position points can be expressed as p_1, p_2, …, p_n.
Step S42: for each of the first n consecutive position points, the position point is marked as abnormal if the sample vehicle did not collide with its surrounding vehicles before moving to the position point but collides with its surrounding vehicles at the position point; the position point is marked as a candidate if the sample vehicle did not collide with its surrounding vehicles before moving to the position point and does not collide with its surrounding vehicles at the position point.
For example, for a position point p_i, if the sample vehicle did not collide with its surrounding vehicles before moving to p_i but collides with its surrounding vehicles at p_i, p_i is marked as abnormal, that is, the current round of travel does not proceed to p_i. If the sample vehicle did not collide with its surrounding vehicles before moving to p_i and does not collide with its surrounding vehicles at p_i, p_i is marked as a candidate, that is, the current round of travel may proceed to p_i.
Step S43: if there are position points marked as candidates among the first n consecutive position points, the last of the position points marked as candidates is taken as the new current position of the sample vehicle, the penalty values corresponding to the position points marked as abnormal are determined, and the penalty values corresponding to the position points marked as abnormal are accumulated to obtain the collision penalty value.
For example, with n = 6, if p_1, p_2 and p_3 are marked as candidates and p_4, p_5 and p_6 are marked as abnormal, then p_3 is taken as the new current position of the sample vehicle, and the penalty values corresponding to p_4, p_5 and p_6 are accumulated to obtain the collision penalty value.
Step S44: if all of the first n consecutive position points are marked as abnormal, the first of the n consecutive position points is taken as the new current position of the sample vehicle, the penalty values corresponding to the position points marked as abnormal are determined, and the penalty values corresponding to the position points marked as abnormal are accumulated to obtain the collision penalty value.
For example, with n = 6, if p_1 through p_6 are all marked as abnormal, p_1 is taken as the new current position of the sample vehicle, and the penalty values corresponding to p_1 through p_6 are accumulated to obtain the collision penalty value.
If all of the first n consecutive position points are marked as abnormal, taking the first of the n consecutive position points as the new current position of the sample vehicle avoids the situation in which no new current position can be determined, which would affect model training.
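Steps S41-S44 can be expressed compactly as the sketch below. It assumes the checks start from a collision-free current position, so the "did not collide before moving" condition is already satisfied; collides_at is a hypothetical helper that tests whether the sample vehicle collides with any surrounding vehicle at a given position point, and point_penalty stands in for the per-point penalty whose exact value the patent leaves open.

def collision_penalty(track_points, surrounding, collides_at, n=6, point_penalty=10.0):
    points = track_points[:n]                                   # S41: first n consecutive position points
    labels = ["abnormal" if collides_at(p, surrounding) else "candidate" for p in points]  # S42
    candidates = [p for p, lab in zip(points, labels) if lab == "candidate"]
    if candidates:
        new_position = candidates[-1]                           # S43: last candidate point becomes the new current position
    else:
        new_position = points[0]                                # S44: all abnormal -> fall back to the first point
    penalty = point_penalty * labels.count("abnormal")          # accumulate the per-point penalties
    return new_position, penalty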
In this embodiment, optimizing the anti-collision judgment logic reduces unnecessary computation and thereby speeds up model training. Although the calculation process is simplified, accumulating the penalty values corresponding to the position points marked as abnormal still quantifies the risk of collision along the trajectory of the sample vehicle, and this quantitative evaluation helps to assess the safety of the vehicle behavior more accurately. For example, as shown in fig. 5, after the generative adversarial imitation learning model is trained with the collision penalty value, the target trajectory data generated by the model can, once the initial trajectory data is corrected, avoid collisions between the target vehicle and other vehicles.
Next, the vehicle trajectory optimization device based on imitation learning provided by the present application is described; the device described below and the vehicle trajectory optimization method based on imitation learning described above correspond to each other and may be referred to mutually.
Referring to fig. 6, the vehicle trajectory optimization device based on the imitation learning includes a first obtaining module 100, a first determining module 200, a screening module 300, a second obtaining module 400, a second determining module 500, and a modifying module 600.
A first obtaining module 100 is configured to obtain initial trajectory data of a target vehicle.
A first determining module 200 is configured to determine a target flow direction of the target vehicle.
And the screening module 300 is used for screening the target lane from the at least one lane corresponding to the target flow direction based on the initial track data.
The second obtaining module 400 is configured to extract lane information of the target lane from configuration information of an intersection if the initial track data meets a correction condition, and obtain current position information and current movement information of the target vehicle and surrounding vehicles based on the initial track data.
The second determining module 500 is configured to input the current position information and current motion information of the target vehicle and its surrounding vehicles and the lane information of the target lane into the generative adversarial imitation learning model, and obtain the target trajectory data determined by the generative adversarial imitation learning model.
And a correction module 600, configured to correct the initial track data based on the target track data.
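The way these modules could chain together at inference time is sketched below; the module behaviors are injected as callables, and splicing the model output into the remaining trajectory is merely one plausible correction strategy, not the one mandated by the patent.

def optimize_trajectory(initial_track, target_vehicle, nearby_vehicles, intersection_config,
                        determine_flow, screen_lane, correction_needed, gail_model, blend):
    # determine_flow / screen_lane / correction_needed / blend stand for the modules described above
    flow = determine_flow(target_vehicle, intersection_config)            # first determining module
    target_lane = screen_lane(initial_track, flow, intersection_config)   # screening module
    if not correction_needed(initial_track, flow, target_lane):
        return initial_track                                              # keep the original trajectory
    lane_info = intersection_config["lanes"][target_lane]                 # second obtaining module
    model_input = {"target": target_vehicle, "neighbours": nearby_vehicles, "lane": lane_info}
    target_track = gail_model(model_input)                                # second determining module
    return blend(initial_track, target_track)                             # correction module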
The first determining module 200 may specifically be configured to:
if the target vehicle has locked a flow direction, take the locked flow direction as the target flow direction of the target vehicle;
if the target vehicle has not locked a flow direction and the lane flow direction of the entrance road of the target vehicle is configured as a single flow direction, take the single flow direction as the target flow direction of the target vehicle;
if the target vehicle has not locked a flow direction and the lane flow direction of the entrance road of the target vehicle is configured as multiple flow directions, compare the vehicle flows corresponding to the respective flow directions and select the flow direction with the largest vehicle flow from the multiple flow directions as the target flow direction of the target vehicle;
and if the vehicle flows corresponding to the respective flow directions are identical, determine a temporary target point based on the historical track of the target vehicle, and if the temporary target point lies in one of the multiple flow directions, take the flow direction containing the temporary target point as the target flow direction of the target vehicle (this selection order is illustrated in the sketch below).
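A compact sketch of this selection order (locked flow, single configured flow, busiest flow, tie-break by temporary target point) follows; the argument names and the flow_contains helper are assumptions made for illustration.

def determine_target_flow(locked_flow, configured_flows, flow_volumes, temp_target_point, flow_contains):
    # locked_flow:       flow direction already locked for the vehicle, or None
    # configured_flows:  flow directions configured for the vehicle's entrance lane
    # flow_volumes:      mapping from flow direction to its current traffic volume
    # temp_target_point: temporary target point extrapolated from the vehicle's historical track
    # flow_contains(f, p): tells whether point p lies within flow direction f
    if locked_flow is not None:
        return locked_flow
    if len(configured_flows) == 1:
        return configured_flows[0]
    best = max(configured_flows, key=lambda f: flow_volumes[f])
    volumes = [flow_volumes[f] for f in configured_flows]
    if volumes.count(flow_volumes[best]) == 1:
        return best                                    # unique busiest flow direction
    for f in configured_flows:
        if flow_contains(f, temp_target_point):        # tie: fall back to the historical track
            return f
    return None                                        # undecided case, not covered by the patent text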
The screening module 300 may specifically be configured to:
determine the latest track heading angle of the target vehicle based on the trajectory point of the initial trajectory data at which the target vehicle is currently located;
acquire the exit-road heading angle corresponding to the target flow direction of the target vehicle;
determine the average included angle between the latest track heading angle and the exit-road heading angle;
and determine the angle between the target point of each lane of the at least one lane corresponding to the target flow direction and the trajectory point at which the target vehicle is currently located, and take the lane with the smallest difference between that angle and the average included angle as the target lane (a sketch of this rule follows).
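The lane-screening rule can be sketched as follows; the angles are assumed to be expressed in degrees within a common convention, and the dictionary layout of the lane target points is an illustrative assumption.

import math

def screen_target_lane(current_point, track_heading, exit_heading, lane_target_points):
    # current_point:      (x, y) of the trajectory point where the target vehicle currently is
    # track_heading:      latest track heading angle, in degrees
    # exit_heading:       exit-road heading angle of the target flow direction, in degrees
    # lane_target_points: mapping lane id -> (x, y) target point of that lane
    avg_angle = (track_heading + exit_heading) / 2.0
    best_lane, best_diff = None, math.inf
    for lane_id, (tx, ty) in lane_target_points.items():
        angle = math.degrees(math.atan2(ty - current_point[1], tx - current_point[0]))
        diff = abs((angle - avg_angle + 180.0) % 360.0 - 180.0)   # wrapped angular difference
        if diff < best_diff:
            best_lane, best_diff = lane_id, diff
    return best_lane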
The apparatus may further include:
The judging module is used for:
Determining a first distance between a first initial track point in the initial track data, which enters an intersection of the target lane, and a target point of the target lane;
Determining a second distance between a track point where the target vehicle is currently located and the first initial track point in the initial track data;
if the target flow direction is a left-turn flow direction and the ratio of the second distance to the first distance is not smaller than a left-turn threshold, the correction condition is satisfied;
if the target flow direction is a right-turn flow direction and the ratio of the second distance to the first distance is not smaller than a right-turn threshold, the correction condition is satisfied;
if the target flow direction is a straight-going flow direction, the ratio of the second distance to the first distance is not smaller than a straight-going threshold, and the target vehicle has failed in visual tracking, the correction condition is satisfied;
and if the target flow direction is a u-turn flow direction and the target vehicle has failed in visual tracking, the correction condition is satisfied (a sketch of this test follows).
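The flow-direction-specific correction test can be sketched as below; the threshold defaults and the tracking_failed flag are placeholders for quantities the patent leaves configurable.

def correction_condition_met(flow, first_dist, second_dist, tracking_failed,
                             left_threshold=0.8, right_threshold=0.8, straight_threshold=0.9):
    # first_dist:  distance from the first in-intersection trajectory point to the lane's target point
    # second_dist: distance from the vehicle's current trajectory point to that first point
    ratio = second_dist / first_dist if first_dist > 0 else 0.0
    if flow == "left":
        return ratio >= left_threshold
    if flow == "right":
        return ratio >= right_threshold
    if flow == "straight":
        return ratio >= straight_threshold and tracking_failed
    if flow == "u_turn":
        return tracking_failed
    return False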
The apparatus may also include a training module.
The generative adversarial imitation learning model is trained based on an adversarial network, and the adversarial network comprises a generator and a judge (discriminator);
the training module is configured to:
Acquiring a vehicle running track of an intersection, and determining a state action pair of an expert based on the vehicle running track of the intersection;
at the current time t, sampling out a set number of vehicles according to the setting of the course (curriculum) distribution set, as a plurality of sample vehicles;
acquiring sample information of each of the plurality of sample vehicles at its current position, wherein the sample information comprises the position information and motion information of the sample vehicle at the current position, the position information and motion information of its surrounding sample vehicles, and the lane information of the target sample lane;
processing sample information corresponding to each sample vehicle according to the current strategy of the generator to generate the track of each sample vehicle;
Determining a punishment value corresponding to the track of each sample vehicle;
scoring each state-action pair (o_t, a_t) in the track of each sample vehicle based on the judge to generate the reward value of each sample vehicle, wherein the reward value is determined from D_ω(o_t, a_t), the value produced by the judge with parameters ω for the state-action pair (o_t, a_t), and from C, the penalty value corresponding to the trajectory of the sample vehicle;
keeping the parameters of the judge unchanged, the policy parameters of the generator are updated based on the trust-region optimization method, which comprises solving the following constrained optimization problem:

max_θ  E_t[ (π_θ(a_t|o_t) / π_{θ_old}(a_t|o_t)) · Â_t ]   subject to   E_t[ D_KL( π_{θ_old}(·|o_t) ‖ π_θ(·|o_t) ) ] ≤ δ;

wherein θ represents the parameters of the policy π_θ; E_t represents the expectation; π_{θ_old} represents the current policy taken at time t, defined by the old parameters θ_old; π_θ represents the new policy; π_{θ_old}(a_t|o_t) represents the probability that the current policy takes action a_t under the observation condition o_t at time t; π_θ(a_t|o_t) represents the probability that the new policy takes action a_t under the observation condition o_t at time t; π_{θ_old}(·|o_t) represents the probability distribution of actions taken by the current policy under the observation condition o_t; π_θ(·|o_t) represents the probability distribution of actions taken by the new policy under the observation condition o_t; D_KL(·‖·) represents the KL (Kullback-Leibler) divergence between π_{θ_old}(·|o_t) and π_θ(·|o_t); δ represents a step-size parameter used to control the maximum variation of the policy in each optimization step; Â_t represents the advantage function, used to measure the degree of difference between the action value expectation Q(o_t, a_t) of taking action a_t under the observation condition o_t and the state value expectation V(o_t) estimated by the observer; the action a_t represents the behavior taken by the sample vehicle according to the policy;
the advantage function is estimated by the following generalized advantage estimation method: Â_t = Σ_{l≥0} (γλ)^l · δ_{t+l}, with the TD error δ_t = r_t + γV(o_{t+1}) − V(o_t); wherein γ represents the discount rate; λ is a parameter between 0 and 1 used to balance the weights of the TD (Temporal Difference) errors δ_t; r_t represents the reward value determined by the judge; V(o_t) and V(o_{t+1}) respectively represent the state value expectations at time t and time t+1;
keeping the policy parameters of the generator unchanged, the judgment parameters ω of the judge are updated based on the expert state-action pairs and the state-action pairs generated by the new policy of the generator; the judgment parameters ω are updated through the following objective function:

max_ω  E_{(o,a)∼ρ_{π'}}[log D_ω(o,a)] + E_{(o,a)∼ρ_{π_E}}[log(1 − D_ω(o,a))];

wherein π_E represents the expert policy and π' represents the new policy; ρ_{π'}(o,a) represents the probability that the state-action pair (o,a) is visited while executing policy π'; p_{π'}(o) represents the probability of being in state o at time t under policy π'; π'(a|o) represents the probability of taking action a in state o based on the current policy; ρ_{π_E}(o,a) represents the probability that the state-action pair (o,a) is visited while executing policy π_E; p_{π_E}(o) represents the probability of being in state o at time t under policy π_E; π_E(a|o) represents the probability of taking action a in state o based on the expert policy; D_ω(o,a) is a simplified representation of the value produced by the judge D with parameters ω for the state-action pair (o,a).
Determining, by the training module, the penalty value corresponding to the track of each sample vehicle may include:
determining the penalty value corresponding to the track of each sample vehicle through the penalty function C;
wherein the penalty function takes as inputs the minimum distance between any two sample vehicles, from which the collision penalty value is obtained; the closest distance of the sample vehicle to the road edge, taken as the smaller of its closest distance to the left road edge and its closest distance to the right road edge, from which the distance penalty value is obtained; whether the vehicle kinematic constraints are satisfied, from which the constraint penalty value is obtained; and the acceleration, from which the sudden-braking penalty value is obtained.
The collision penalty value may be determined by:
extracting the first n consecutive position points from the trajectory of the sample vehicle;
for each of the first n consecutive position points, marking the position point as abnormal if the sample vehicle did not collide with its surrounding vehicles before moving to the position point but collides with its surrounding vehicles at the position point; marking the position point as a candidate if the sample vehicle did not collide with its surrounding vehicles before moving to the position point and does not collide with its surrounding vehicles at the position point;
if there are position points marked as candidates among the first n consecutive position points, taking the last of the position points marked as candidates as the new current position of the sample vehicle, determining the penalty values corresponding to the position points marked as abnormal, and accumulating the penalty values corresponding to the position points marked as abnormal to obtain the collision penalty value;
if all of the first n consecutive position points are marked as abnormal, taking the first of the n consecutive position points as the new current position of the sample vehicle, determining the penalty values corresponding to the position points marked as abnormal, and accumulating the penalty values corresponding to the position points marked as abnormal to obtain the collision penalty value.
In another embodiment of the present application, there is provided an electronic apparatus including:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to enable the electronic device to implement a vehicle trajectory optimization method based on simulation learning as described in any one of embodiments 1-7.
In another embodiment of the application, a computer storage medium is provided, which carries one or more computer programs, which when executed by an electronic device, enable the electronic device to implement a method of vehicle trajectory optimization based on impersonation learning as introduced in any of embodiments 1-7.
It should be further noted that the above-described apparatus embodiments are merely illustrative, and that the units described as separate units may or may not be physically separate, and that units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiment of the device provided by the application, the connection relation between the modules represents that the modules have communication connection, and can be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus the necessary general-purpose hardware, or of course by means of special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components, and the like. Generally, any function performed by a computer program can also be implemented by corresponding hardware, and the specific hardware structures implementing the same function can vary, such as analog circuits, digital circuits, or dedicated circuits. In most cases, however, a software program implementation is the preferred embodiment of the present application. Based on such understanding, the technical solution of the present application, or the part of it contributing to the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB disk, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk of a computer, etc., comprising several instructions for causing a computer device (which may be a personal computer, a training device, a network device, etc.) to perform the method according to the embodiments of the present application.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a training device or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.

Claims (9)

1. A vehicle trajectory optimization method based on imitation learning, characterized by comprising:
obtaining initial trajectory data of a target vehicle;
determining a target flow direction of the target vehicle;
screening a target lane from at least one lane corresponding to the target flow direction based on the initial trajectory data;
if the initial trajectory data satisfies a correction condition, extracting lane information of the target lane from configuration information of an intersection, and obtaining current position information and current motion information of the target vehicle and its surrounding vehicles based on the initial trajectory data;
inputting the current position information and current motion information of the target vehicle and its surrounding vehicles and the lane information of the target lane into a generative adversarial imitation learning model to obtain target trajectory data determined by the generative adversarial imitation learning model; the generative adversarial imitation learning model is trained based on actual vehicle running tracks of the intersection and is used to imitate an expert policy and generate action parameters similar to actual vehicle behavior to obtain the target trajectory data; the generative adversarial imitation learning model is trained based on an adversarial network, the adversarial network comprising a generator and a judge; the process of training the generative adversarial imitation learning model based on the adversarial network comprises: acquiring vehicle running tracks of the intersection, and determining expert state-action pairs based on the vehicle running tracks of the intersection; at the current time t, sampling out a set number of vehicles according to the setting of the course distribution set, as a plurality of sample vehicles; acquiring sample information of each of the plurality of sample vehicles at its current position, the sample information comprising the position information and motion information of the sample vehicle at the current position, the position information and motion information of its surrounding sample vehicles, and the lane information of the target sample lane; processing the sample information corresponding to each sample vehicle according to the current policy of the generator to generate the track of each sample vehicle; determining the penalty value corresponding to the track of each sample vehicle; scoring each state-action pair (o_t, a_t) in the track of each sample vehicle based on the judge to generate the reward value of each sample vehicle, the reward value being determined from D_ω(o_t, a_t), the value produced by the judge with parameters ω for the state-action pair (o_t, a_t), and from C, the penalty value corresponding to the trajectory of the sample vehicle; keeping the parameters of the judge unchanged and updating the policy parameters of the generator based on a trust-region optimization method, the updating comprising solving the following constrained optimization problem:
max_θ  E_t[ (π_θ(a_t|o_t) / π_{θ_old}(a_t|o_t)) · Â_t ]   subject to   E_t[ D_KL( π_{θ_old}(·|o_t) ‖ π_θ(·|o_t) ) ] ≤ δ;
wherein θ denotes the parameters of the policy π_θ; E_t denotes the expectation; π_{θ_old} denotes the current policy taken at time t, defined by the old parameters θ_old; π_θ denotes the new policy; π_{θ_old}(a_t|o_t) denotes the probability that the current policy takes action a_t under the observation condition o_t at time t; π_θ(a_t|o_t) denotes the probability that the new policy takes action a_t under the observation condition o_t at time t; π_{θ_old}(·|o_t) denotes the probability distribution of actions taken by the current policy under the observation condition o_t; π_θ(·|o_t) denotes the probability distribution of actions taken by the new policy under the observation condition o_t; D_KL(·‖·) denotes the KL (Kullback-Leibler) divergence between π_{θ_old}(·|o_t) and π_θ(·|o_t); δ denotes a step-size parameter used to control the maximum variation of the policy in each optimization step; Â_t denotes the advantage function, used to measure the degree of difference between the action value expectation Q(o_t, a_t) of taking action a_t under the observation condition o_t and the state value expectation V(o_t) estimated by the observer; the action a_t denotes the behavior taken by the sample vehicle according to the policy; the advantage function is estimated by the following generalized advantage estimation method:
Â_t = Σ_{l≥0} (γλ)^l · δ_{t+l},  with the TD error δ_t = r_t + γV(o_{t+1}) − V(o_t);
wherein γ denotes the discount rate; λ is a parameter between 0 and 1 used to balance the weights of the TD (Temporal Difference) errors δ_t; r_t denotes the reward value determined by the judge; V(o_t) and V(o_{t+1}) respectively denote the state value expectations at time t and time t+1; keeping the policy parameters of the generator unchanged and updating the judgment parameters ω of the judge based on the expert state-action pairs and the state-action pairs generated by the new policy of the generator, the judgment parameters ω being updated through the following objective function:
max_ω  E_{(o,a)∼ρ_{π'}}[log D_ω(o,a)] + E_{(o,a)∼ρ_{π_E}}[log(1 − D_ω(o,a))];
wherein π_E denotes the expert policy and π' denotes the new policy; ρ_{π'}(o,a) denotes the probability that the state-action pair (o,a) is visited while executing policy π'; p_{π'}(o) denotes the probability of being in state o at time t under policy π'; π'(a|o) denotes the probability of taking action a in state o based on the current policy; ρ_{π_E}(o,a) denotes the probability that the state-action pair (o,a) is visited while executing policy π_E; p_{π_E}(o) denotes the probability of being in state o at time t under policy π_E; π_E(a|o) denotes the probability of taking action a in state o based on the expert policy; D_ω(o,a) is a simplified representation of the value produced by the judge D with parameters ω for the state-action pair (o,a); and
correcting the initial trajectory data based on the target trajectory data.
2. The vehicle trajectory optimization method based on imitation learning according to claim 1, characterized in that determining the target flow direction of the target vehicle comprises:
if the target vehicle has locked a flow direction, taking the locked flow direction of the target vehicle as the target flow direction of the target vehicle;
if the target vehicle has not locked a flow direction and the lane flow direction of the entrance road of the target vehicle is configured as a single flow direction, taking the single flow direction as the target flow direction of the target vehicle;
if the target vehicle has not locked a flow direction and the lane flow direction of the entrance road of the target vehicle is configured as multiple flow directions, selecting, by comparing the vehicle flows corresponding to the respective flow directions of the multiple flow directions, the flow direction with the largest vehicle flow from the multiple flow directions as the target flow direction of the target vehicle;
if the vehicle flows corresponding to the respective flow directions of the multiple flow directions are identical, determining a temporary target point based on the historical track of the target vehicle, and if the temporary target point lies in one of the multiple flow directions, taking the flow direction of the multiple flow directions that contains the temporary target point as the target flow direction of the target vehicle.
3. The vehicle trajectory optimization method based on imitation learning according to claim 1, characterized in that screening the target lane from the at least one lane corresponding to the target flow direction based on the initial trajectory data comprises:
determining the latest track heading angle of the target vehicle based on the trajectory point of the initial trajectory data at which the target vehicle is currently located;
acquiring the exit-road heading angle corresponding to the target flow direction of the target vehicle;
determining the average included angle between the latest track heading angle and the exit-road heading angle;
determining the angle between the target point of each lane of the at least one lane corresponding to the target flow direction and the trajectory point at which the target vehicle is currently located, and taking the lane with the smallest difference between that angle and the average included angle as the target lane.
4. The vehicle trajectory optimization method based on imitation learning according to claim 1, characterized in that whether the initial trajectory data satisfies the correction condition is judged in the following manner:
determining a first distance between the first initial trajectory point in the initial trajectory data that enters the intersection of the target lane and the target point of the target lane;
determining a second distance between the trajectory point of the initial trajectory data at which the target vehicle is currently located and the first initial trajectory point;
if the target flow direction is a left-turn flow direction and the ratio of the second distance to the first distance is not smaller than a left-turn threshold, the correction condition is satisfied;
if the target flow direction is a right-turn flow direction and the ratio of the second distance to the first distance is not smaller than a right-turn threshold, the correction condition is satisfied;
if the target flow direction is a straight-going flow direction, the ratio of the second distance to the first distance is not smaller than a straight-going threshold, and the target vehicle has failed in visual tracking, the correction condition is satisfied;
if the target flow direction is a u-turn flow direction and the target vehicle has failed in visual tracking, the correction condition is satisfied.
5. The vehicle trajectory optimization method based on imitation learning according to claim 1, characterized in that determining the penalty value corresponding to the track of each sample vehicle comprises:
determining the penalty value corresponding to the track of each sample vehicle through the penalty function C; wherein the penalty function takes as inputs the minimum distance between any two sample vehicles, from which the collision penalty value is obtained; the closest distance of the sample vehicle to the road edge, taken as the smaller of its closest distance to the left road edge and its closest distance to the right road edge, from which the distance penalty value is obtained; whether the vehicle kinematic constraints are satisfied, from which the constraint penalty value is obtained; and the acceleration, from which the sudden-braking penalty value is obtained.
6. The vehicle trajectory optimization method based on imitation learning according to claim 5, characterized in that the collision penalty value is determined in the following manner:
extracting the first n consecutive position points from the trajectory of the sample vehicle;
for each of the first n consecutive position points, marking the position point as abnormal if the sample vehicle did not collide with its surrounding vehicles before moving to the position point but collides with its surrounding vehicles at the position point; marking the position point as a candidate if the sample vehicle did not collide with its surrounding vehicles before moving to the position point and does not collide with its surrounding vehicles at the position point;
if there are position points marked as candidates among the first n consecutive position points, taking the last of the position points marked as candidates as the new current position of the sample vehicle, determining the penalty values corresponding to the position points marked as abnormal, and accumulating the penalty values corresponding to the position points marked as abnormal to obtain the collision penalty value;
if all of the first n consecutive position points are marked as abnormal, taking the first of the n consecutive position points as the new current position of the sample vehicle, determining the penalty values corresponding to the position points marked as abnormal, and accumulating the penalty values corresponding to the position points marked as abnormal to obtain the collision penalty value.
7. A vehicle trajectory optimization device based on imitation learning, characterized by comprising:
a first obtaining module, configured to obtain initial trajectory data of a target vehicle;
a first determining module, configured to determine a target flow direction of the target vehicle;
a screening module, configured to screen a target lane from at least one lane corresponding to the target flow direction based on the initial trajectory data;
a second obtaining module, configured to, if the initial trajectory data satisfies a correction condition, extract lane information of the target lane from configuration information of an intersection, and obtain current position information and current motion information of the target vehicle and its surrounding vehicles based on the initial trajectory data;
a second determining module, configured to input the current position information and current motion information of the target vehicle and its surrounding vehicles and the lane information of the target lane into a generative adversarial imitation learning model to obtain target trajectory data determined by the generative adversarial imitation learning model; the generative adversarial imitation learning model is trained based on actual vehicle running tracks of the intersection and is used to imitate an expert policy and generate action parameters similar to actual vehicle behavior to obtain the target trajectory data; the generative adversarial imitation learning model is trained based on an adversarial network, the adversarial network comprising a generator and a judge; the process of training the generative adversarial imitation learning model based on the adversarial network comprises: acquiring vehicle running tracks of the intersection, and determining expert state-action pairs based on the vehicle running tracks of the intersection; at the current time t, sampling out a set number of vehicles according to the setting of the course distribution set, as a plurality of sample vehicles; acquiring sample information of each of the plurality of sample vehicles at its current position, the sample information comprising the position information and motion information of the sample vehicle at the current position, the position information and motion information of its surrounding sample vehicles, and the lane information of the target sample lane; processing the sample information corresponding to each sample vehicle according to the current policy of the generator to generate the track of each sample vehicle; determining the penalty value corresponding to the track of each sample vehicle; scoring each state-action pair (o_t, a_t) in the track of each sample vehicle based on the judge to generate the reward value of each sample vehicle, the reward value being determined from D_ω(o_t, a_t), the value produced by the judge with parameters ω for the state-action pair (o_t, a_t), and from C, the penalty value corresponding to the trajectory of the sample vehicle; keeping the parameters of the judge unchanged and updating the policy parameters of the generator based on a trust-region optimization method, the updating comprising solving the following constrained optimization problem:
max_θ  E_t[ (π_θ(a_t|o_t) / π_{θ_old}(a_t|o_t)) · Â_t ]   subject to   E_t[ D_KL( π_{θ_old}(·|o_t) ‖ π_θ(·|o_t) ) ] ≤ δ;
wherein θ denotes the parameters of the policy π_θ; E_t denotes the expectation; π_{θ_old} denotes the current policy taken at time t, defined by the old parameters θ_old; π_θ denotes the new policy; π_{θ_old}(a_t|o_t) denotes the probability that the current policy takes action a_t under the observation condition o_t at time t; π_θ(a_t|o_t) denotes the probability that the new policy takes action a_t under the observation condition o_t at time t; π_{θ_old}(·|o_t) denotes the probability distribution of actions taken by the current policy under the observation condition o_t; π_θ(·|o_t) denotes the probability distribution of actions taken by the new policy under the observation condition o_t; D_KL(·‖·) denotes the KL (Kullback-Leibler) divergence between π_{θ_old}(·|o_t) and π_θ(·|o_t); δ denotes a step-size parameter used to control the maximum variation of the policy in each optimization step; Â_t denotes the advantage function, used to measure the degree of difference between the action value expectation Q(o_t, a_t) of taking action a_t under the observation condition o_t and the state value expectation V(o_t) estimated by the observer; the action a_t denotes the behavior taken by the sample vehicle according to the policy; the advantage function is estimated by the following generalized advantage estimation method:
Â_t = Σ_{l≥0} (γλ)^l · δ_{t+l},  with the TD error δ_t = r_t + γV(o_{t+1}) − V(o_t);
wherein γ denotes the discount rate; λ is a parameter between 0 and 1 used to balance the weights of the TD (Temporal Difference) errors δ_t; r_t denotes the reward value determined by the judge; V(o_t) and V(o_{t+1}) respectively denote the state value expectations at time t and time t+1; keeping the policy parameters of the generator unchanged and updating the judgment parameters ω of the judge based on the expert state-action pairs and the state-action pairs generated by the new policy of the generator, the judgment parameters ω being updated through the following objective function:
max_ω  E_{(o,a)∼ρ_{π'}}[log D_ω(o,a)] + E_{(o,a)∼ρ_{π_E}}[log(1 − D_ω(o,a))];
wherein π_E denotes the expert policy and π' denotes the new policy; ρ_{π'}(o,a) denotes the probability that the state-action pair (o,a) is visited while executing policy π'; p_{π'}(o) denotes the probability of being in state o at time t under policy π'; π'(a|o) denotes the probability of taking action a in state o based on the current policy; ρ_{π_E}(o,a) denotes the probability that the state-action pair (o,a) is visited while executing policy π_E; p_{π_E}(o) denotes the probability of being in state o at time t under policy π_E; π_E(a|o) denotes the probability of taking action a in state o based on the expert policy; D_ω(o,a) is a simplified representation of the value produced by the judge D with parameters ω for the state-action pair (o,a); and
a correction module, configured to correct the initial trajectory data based on the target trajectory data.
8. An electronic device, characterized by comprising:
a memory, configured to store a computer program; and
a processor, configured to execute the computer program so that the electronic device can implement the vehicle trajectory optimization method based on imitation learning according to any one of claims 1 to 6.
9. A computer storage medium, characterized in that the storage medium carries one or more computer programs which, when executed by an electronic device, enable the electronic device to implement the vehicle trajectory optimization method based on imitation learning according to any one of claims 1 to 6.
CN202411780013.5A 2024-12-05 2024-12-05 A vehicle trajectory optimization method based on imitation learning and related device Active CN119252066B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411780013.5A CN119252066B (en) 2024-12-05 2024-12-05 A vehicle trajectory optimization method based on imitation learning and related device


Publications (2)

Publication Number Publication Date
CN119252066A (en) 2025-01-03
CN119252066B (en) 2025-03-25

Family

ID=94016773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411780013.5A Active CN119252066B (en) 2024-12-05 2024-12-05 A vehicle trajectory optimization method based on imitation learning and related device

Country Status (1)

Country Link
CN (1) CN119252066B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120452201B (en) * 2025-06-16 2025-09-26 浙江中控信息产业股份有限公司 Intersection inlet flow direction identification method for preferential passing of pilot line


Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111483468B (en) * 2020-04-24 2021-09-07 广州大学 A lane-changing decision-making method and system for unmanned vehicles based on adversarial imitation learning
US20230045360A1 (en) * 2021-07-14 2023-02-09 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Imitation Learning
US20230082365A1 (en) * 2021-09-16 2023-03-16 Waymo Llc Generating simulated agent trajectories using parallel beam search
CN114742317A (en) * 2022-05-11 2022-07-12 苏州易航远智智能科技有限公司 Vehicle track prediction method and device, electronic equipment and storage medium
CN115447574B (en) * 2022-08-25 2025-11-07 上汽大众汽车有限公司 Intersection vehicle track correction method and system combining signal lamp perception
CN117808113A (en) * 2022-09-23 2024-04-02 毫末智行科技有限公司 Training method and device of track planning model, terminal equipment and storage medium
CN116153084B (en) * 2023-04-20 2023-09-08 智慧互通科技股份有限公司 Vehicle flow direction prediction method, prediction system and urban traffic signal control method
CN118560530B (en) * 2024-08-02 2024-10-01 杭州电子科技大学 Multi-agent driving behavior modeling method based on generation of countermeasure imitation learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110789528A (en) * 2019-08-29 2020-02-14 腾讯科技(深圳)有限公司 Vehicle driving track prediction method, device, equipment and storage medium
CN114004406A (en) * 2021-11-03 2022-02-01 京东鲲鹏(江苏)科技有限公司 Vehicle trajectory prediction method, device, storage medium and electronic device

Also Published As

Publication number Publication date
CN119252066A (en) 2025-01-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant