
CN119252066B - A vehicle trajectory optimization method based on imitation learning and related device - Google Patents

A vehicle trajectory optimization method based on imitation learning and related device

Info

Publication number
CN119252066B
CN119252066B (application CN202411780013.5A)
Authority
CN
China
Prior art keywords
vehicle
target
sample
flow direction
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411780013.5A
Other languages
Chinese (zh)
Other versions
CN119252066A (en)
Inventor
周俊杰
吴劲峰
吴文浩
虞霄璐
陈瑞生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Supcon Information Industry Co Ltd
Original Assignee
Zhejiang Supcon Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Supcon Information Industry Co Ltd filed Critical Zhejiang Supcon Information Industry Co Ltd
Priority to CN202411780013.5A priority Critical patent/CN119252066B/en
Publication of CN119252066A publication Critical patent/CN119252066A/en
Application granted granted Critical
Publication of CN119252066B publication Critical patent/CN119252066B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/09Arrangements for giving variable traffic instructions
    • G08G1/0962Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0968Systems involving transmission of navigation instructions to the vehicle
    • G08G1/096805Systems involving transmission of navigation instructions to the vehicle where the transmitted instructions are used to compute a route
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/09Arrangements for giving variable traffic instructions
    • G08G1/0962Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0968Systems involving transmission of navigation instructions to the vehicle
    • G08G1/096833Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route
    • G08G1/096838Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route where the user preferences are taken into account or the user selects one route out of a plurality
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/16Anti-collision systems
    • G08G1/167Driving aids for lane monitoring, lane changing, e.g. blind spot detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Mathematical Physics (AREA)
  • Traffic Control Systems (AREA)

Abstract

The application discloses a vehicle track optimization method based on imitation learning and a related device. The method comprises: obtaining initial track data of a target vehicle; determining a target flow direction of the target vehicle; screening a target lane from at least one lane corresponding to the target flow direction based on the initial track data; if the initial track data meets a correction condition, extracting lane information of the target lane from configuration information of the intersection, and obtaining current position information and current motion information of the target vehicle and its surrounding vehicles based on the initial track data; inputting the current position information, the current motion information and the lane information of the target vehicle and its surrounding vehicles into a generative adversarial imitation learning model to obtain target track data determined by the model; and correcting the initial track data based on the target track data.

Description

Vehicle track optimization method and related device based on imitation learning
Technical Field
The application relates to the technical field of intelligent traffic management, in particular to a vehicle track optimization method based on imitation learning and a related device.
Background
With the acceleration of the urbanization process, traffic pressure and security challenges are becoming more severe, and Intelligent Transportation Systems (ITS) are rapidly developing as a key technology to cope with these challenges. The ITS realizes real-time monitoring, efficient management and scientific guidance of traffic flow by integrating advanced information technology, data communication transmission technology and computer technology. The holographic track construction technology is used as an important component of the ITS, can comprehensively capture and analyze the dynamic behaviors of vehicles in the intersection, and is important for improving the optimization and prediction capabilities of traffic flow.
Currently, the main technical scheme for constructing complex holographic trajectories in intersections relies on radar-vision fusion technology. This technology combines the advantages of radar and visual detection equipment, and can acquire characteristic information such as the position, speed, movement direction, appearance, and license plate of a vehicle. However, in practical applications, due to factors such as line-of-sight shielding, illumination variation, and distance limitation, the visual detection equipment often cannot continuously or accurately capture the track of the vehicle, resulting in incomplete track information.
In order to solve these problems, the current technical scheme adopts a track correction method. The core of this approach is to set a preset trajectory, i.e. a series of expected travel paths and speeds for the vehicle according to road design and traffic rules. In practical application, the system firstly acquires actual running data of the vehicle through radar and visual detection equipment, and then compares and analyzes the data with a preset track. When the deviation between the actual track and the preset track is found, the system can correct the track by utilizing an algorithm so as to simulate and restore the real running state of the vehicle at the intersection.
However, although the trajectory correction method alleviates the detection error to some extent, it still has limitations. On the one hand, the setting of the preset track depends on road design and traffic rules, and deep learning and understanding of vehicle behaviors are lacking. Therefore, in a complex traffic environment, particularly in the case of a large traffic flow and a traffic incident, the prediction accuracy of the preset trajectory may be limited. On the other hand, the track correction method mainly depends on comparison and analysis of an actual track and a preset track by an algorithm, and lacks the capability of predicting the running intention and the path of the vehicle in real time. This limits the ability of holographic trajectory construction techniques to cope with sudden traffic events and complex traffic scenarios.
Disclosure of Invention
In view of the above problems, the present application provides a vehicle track optimization method based on imitation learning and a related device, so as to better cope with sudden traffic events and complex traffic scenes. The specific scheme is as follows:
the first aspect of the present application provides a vehicle track optimization method based on imitation learning, comprising:
obtaining initial track data of a target vehicle;
Determining a target flow direction of the target vehicle;
Screening a target lane from at least one lane corresponding to the target flow based on the initial track data;
if the initial track data meets the correction condition, extracting lane information of the target lane from configuration information of an intersection, and acquiring current position information and current motion information of the target vehicle and surrounding vehicles based on the initial track data;
Inputting the current position information, the current movement information and the lane information of the target vehicle and its surrounding vehicles into a generative adversarial imitation learning model to obtain target track data determined by the generative adversarial imitation learning model;
and correcting the initial track data based on the target track data.
In one possible implementation, determining a target flow direction of the target vehicle includes:
if the target vehicle has locked a flow direction, taking the locked flow direction as the target flow direction of the target vehicle;
If the target vehicle has not locked a flow direction and the lane flow direction of the entrance lane of the target vehicle is configured as a single flow direction, taking the single flow direction as the target flow direction of the target vehicle;
If the target vehicle has not locked a flow direction and the lane flow direction of the entrance lane of the target vehicle is configured as multiple flow directions, selecting, by comparing the vehicle flows corresponding to the respective flow directions, the flow direction with the largest vehicle flow among the multiple flow directions as the target flow direction of the target vehicle;
And if the vehicle flows corresponding to the multiple flow directions are consistent, determining a temporary target point based on the historical track of the target vehicle, and if the temporary target point is located in one of the multiple flow directions, taking the flow direction containing the temporary target point as the target flow direction of the target vehicle.
In one possible implementation, screening a target lane from the at least one lane corresponding to the target flow direction based on the initial trajectory data includes:
determining the latest track course angle of the target vehicle based on the track point of the initial track data where the target vehicle is currently located;
Acquiring an outlet course angle corresponding to a target flow direction of the target vehicle;
determining an average included angle between the latest track course angle and the exit course angle;
and determining the angle between the target point of each lane in at least one lane corresponding to the target flow direction and the current track point of the target vehicle, and taking the lane with the smallest difference between the angle and the average included angle as the target lane.
In one possible implementation, whether the initial trajectory data satisfies the correction condition is determined in the following manner:
Determining a first distance between a first initial track point in the initial track data, which enters an intersection of the target lane, and a target point of the target lane;
Determining a second distance between a track point where the target vehicle is currently located and the first initial track point in the initial track data;
if the target flow direction is a left-turn flow direction and the ratio of the second distance to the first distance is not smaller than a left-turn threshold, the correction condition is satisfied;
if the target flow direction is a right-turn flow direction and the ratio of the second distance to the first distance is not smaller than a right-turn threshold, the correction condition is satisfied;
If the target flow direction is a straight-going flow direction, the ratio of the second distance to the first distance is not smaller than a straight-going threshold, and visual tracking of the target vehicle has failed, the correction condition is satisfied;
and if the target flow direction is a u-turn flow direction and visual tracking of the target vehicle has failed, the correction condition is satisfied.
In one possible implementation, the generative adversarial imitation learning model is obtained by training based on an adversarial network, the adversarial network including a generator and a discriminator;
the process of training the generative adversarial imitation learning model based on the adversarial network includes:
Acquiring a vehicle running track of an intersection, and determining a state action pair of an expert based on the vehicle running track of the intersection;
At the current moment, sampling, according to the curriculum distribution set, a plurality of vehicles as sample vehicles;
The method comprises the steps of acquiring sample information of each sample vehicle in the plurality of sample vehicles at the current position, wherein the sample information comprises position information and motion information of the sample vehicle at the current position, position information and motion information of surrounding sample vehicles and lane information of a target sample lane;
processing sample information corresponding to each sample vehicle according to the current strategy of the generator to generate the track of each sample vehicle;
Determining a punishment value corresponding to the track of each sample vehicle;
Scoring, based on the discriminator, each state-action pair (s, a) in the trajectory of each sample vehicle to generate a reward value for each sample vehicle, the reward value being determined as follows:

$$ r(s,a) = -\log\bigl(D_{\psi}(s,a)\bigr) - p $$

where $D_{\psi}(s,a)$ denotes the value produced by the discriminator $D$ with parameters $\psi$ for the state-action pair $(s,a)$, and $p$ is the penalty value corresponding to the trajectory of the sample vehicle;
Updating the policy parameters of the generator based on the trust region optimization method comprises solving the following constrained optimization problem:

$$ \max_{\theta'} \; \mathbb{E}_{(o_t,a_t)\sim\pi_{\theta}}\!\left[\frac{\pi_{\theta'}(a_t \mid o_t)}{\pi_{\theta}(a_t \mid o_t)}\,A(o_t,a_t)\right] \quad \text{s.t.} \quad \mathbb{E}_{o_t}\!\left[D_{\mathrm{KL}}\bigl(\pi_{\theta}(\cdot \mid o_t)\,\|\,\pi_{\theta'}(\cdot \mid o_t)\bigr)\right] \le \delta $$

where $\theta$ denotes the parameters of the policy $\pi$; $\mathbb{E}$ denotes the expectation; $\pi_{\theta}$ denotes the current policy taken at time $t$, defined by the old parameters $\theta$; $\pi_{\theta'}$ denotes the new policy; $\pi_{\theta}(a_t \mid o_t)$ denotes the probability that the current policy takes action $a_t$ under the observation condition $o_t$ at time $t$; $\pi_{\theta'}(a_t \mid o_t)$ denotes the probability that the new policy takes action $a_t$ under the observation condition $o_t$ at time $t$; $\pi_{\theta}(\cdot \mid o_t)$ denotes the probability distribution over actions taken by the current policy under the observation condition $o_t$; $\pi_{\theta'}(\cdot \mid o_t)$ denotes the probability distribution over actions taken by the new policy under the observation condition $o_t$; $D_{\mathrm{KL}}$ denotes the KL (Kullback-Leibler) divergence between $\pi_{\theta}(\cdot \mid o_t)$ and $\pi_{\theta'}(\cdot \mid o_t)$; $\delta$ denotes the step-size parameter used to control the maximum variation of the policy in each optimization step; $A(o_t,a_t)$ denotes the advantage function, used to measure the degree of difference between the action value expectation $Q(o_t,a_t)$ of taking action $a_t$ under the observation condition $o_t$ and the state value expectation $V(o_t)$ estimated by the value estimator; action $a_t$ represents an action taken by the sample vehicle according to the policy;
The advantage function is estimated by the following generalized advantage estimation (GAE) method:

$$ \hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) $$

where $\gamma$ denotes the discount rate; $\lambda$ is a parameter between 0 and 1 used to balance the weights of the TD (Temporal Difference) errors $\delta_t$; $r_t$ denotes the reward value determined by the discriminator; and $V(s_t)$ and $V(s_{t+1})$ denote the state value expectations at time $t$ and time $t+1$, respectively;
maintaining the policy parameters of the generator unchanged, and updating the discriminator parameters $\psi$ based on the state-action pairs of the expert and the state-action pairs generated by the new policy of the generator, the discriminator parameters $\psi$ being updated through the following objective function:

$$ \max_{\psi} \; \mathbb{E}_{(s,a)\sim\rho_{\pi_{\theta'}}}\bigl[\log D_{\psi}(s,a)\bigr] + \mathbb{E}_{(s,a)\sim\rho_{\pi_{E}}}\bigl[\log\bigl(1 - D_{\psi}(s,a)\bigr)\bigr] $$

where $\pi_E$ denotes the expert policy and $\pi_{\theta'}$ denotes the new policy; $\rho_{\pi_{\theta'}}(s,a)$ denotes the probability that the state-action pair $(s,a)$ is visited when executing the policy $\pi_{\theta'}$, and can be written as $\rho_{\pi_{\theta'}}(s,a) = d_{\pi_{\theta'}}(s)\,\pi_{\theta'}(a \mid s)$, where $d_{\pi_{\theta'}}(s)$ denotes the probability of being in state $s$ under the policy $\pi_{\theta'}$ and $\pi_{\theta'}(a \mid s)$ denotes the probability that the current policy takes action $a$ in state $s$; $\rho_{\pi_E}(s,a)$ denotes the probability that the state-action pair $(s,a)$ is visited when executing the policy $\pi_E$, and can likewise be written as $\rho_{\pi_E}(s,a) = d_{\pi_E}(s)\,\pi_E(a \mid s)$, where $d_{\pi_E}(s)$ denotes the probability of being in state $s$ under the policy $\pi_E$ and $\pi_E(a \mid s)$ denotes the probability that the expert policy takes action $a$ in state $s$; $D_{\psi}(s,a)$ is a simplified notation for the value produced by the discriminator $D$ with parameters $\psi$ for the state-action pair $(s,a)$.
In one possible implementation, determining a penalty value corresponding to the trajectory of each of the sample vehicles includes:
determining, through a penalty function $p$, the penalty value corresponding to the trajectory of each sample vehicle;

$$ p = \mathbb{1}[\text{collision}]\,p_{\mathrm{col}} + \mathbb{1}[\text{too close to road edge}]\,p_{\mathrm{dist}} + \mathbb{1}[\text{kinematic constraints violated}]\,p_{\mathrm{kin}} + \mathbb{1}[\text{sudden braking}]\,p_{\mathrm{brake}} $$

where $d_{\min}$ denotes the minimum distance between any two sample vehicles (used to detect collisions), $p_{\mathrm{col}}$ denotes the collision penalty value, $d_{\mathrm{edge}}$ denotes the closest distance of the sample vehicle from the road edge, $d_{\mathrm{edge}} = \min(d_{\mathrm{left}}, d_{\mathrm{right}})$, $d_{\mathrm{left}}$ denotes the closest distance of the sample vehicle from the left edge of the road, $d_{\mathrm{right}}$ denotes the closest distance of the sample vehicle from the right edge of the road, $p_{\mathrm{dist}}$ denotes the distance penalty value, $p_{\mathrm{kin}}$ denotes the constraint penalty value applied when the vehicle kinematic constraints are not met, $p_{\mathrm{brake}}$ denotes the sudden-braking penalty value, and $a$ denotes the acceleration.
In one possible implementation, the collision penalty value is determined by:
extracting first n consecutive location points from the trajectory of the sample vehicle;
For each of the first n consecutive location points, marking the location point as abnormal if the sample vehicle did not collide with its surrounding vehicles before moving to the location point, but did collide with its surrounding vehicles at the location point; if the sample vehicle did not collide with its surrounding vehicles before moving to the location point and the location point did not collide with its surrounding vehicles, marking the location point as a candidate;
if the position points marked as candidates exist in the first n continuous position points, arranging the last position point in the position points marked as candidates as a new current position of the sample vehicle, determining a penalty value corresponding to the position points marked as abnormal, and accumulating the penalty values corresponding to the position points marked as abnormal to obtain a collision penalty value;
and if the first n continuous position points are marked as abnormal, taking the first position point in the first n continuous position points as a new current position of the sample vehicle, determining a punishment value corresponding to the position point marked as abnormal, and accumulating the punishment values corresponding to the position point marked as abnormal to obtain a collision punishment value.
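As an illustrative sketch of the marking-and-accumulation rule above, the following Python fragment shows one possible simplification in which each of the first n points is simply checked for a collision at that point; the helper `collides_at` and all other names are hypothetical and not taken from the patent.

```python
def collision_penalty(track, surrounding_vehicles, n, per_point_penalty, collides_at):
    """Sketch: classify the first n points of a generated trajectory as 'candidate'
    (no collision) or 'abnormal' (collision), pick the new current position, and
    accumulate a penalty for every abnormal point."""
    candidates, abnormal = [], []
    for point in track[:n]:
        if collides_at(point, surrounding_vehicles):
            abnormal.append(point)
        else:
            candidates.append(point)

    if candidates:
        new_position = candidates[-1]   # last collision-free point
    else:
        new_position = track[0]         # all n points abnormal: fall back to the first

    collision_penalty_value = per_point_penalty * len(abnormal)
    return new_position, collision_penalty_value
```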
Another aspect of the present application provides a vehicle trajectory optimization device based on imitation learning, including:
the first obtaining module is used for obtaining initial track data of the target vehicle;
the first determining module is used for determining a target flow direction of the target vehicle;
The screening module is used for screening a target lane from at least one lane corresponding to the target flow direction based on the initial track data;
The second obtaining module is used for extracting the lane information of the target lane from the configuration information of the intersection if the initial track data meets the correction condition, and obtaining the current position information and the current motion information of the target vehicle and surrounding vehicles based on the initial track data;
The second determining module is used for inputting the current position information, the current motion information and the lane information of the target vehicle and its surrounding vehicles into a generative adversarial imitation learning model to obtain target track data determined by the generative adversarial imitation learning model;
And the correction module is used for correcting the initial track data based on the target track data.
A third aspect of the present application provides an electronic apparatus, comprising:
the memory is used for storing a computer program;
The processor is configured to execute the computer program to enable the electronic device to implement a method of vehicle trajectory optimization based on imitation learning as described in any one of the above.
A fourth aspect of the present application provides a computer storage medium carrying one or more computer programs which, when executed by an electronic device, enable the electronic device to implement a vehicle track optimization method based on imitation learning as described in any one of the preceding.
In the present application, by training the generative adversarial imitation learning model, the model can learn how to predict the future travel intention and path of a vehicle based on its current position, speed, and movement direction, and on the dynamic changes of surrounding vehicles. Therefore, when the current position information, the current motion information, and the lane information of the target vehicle and its surrounding vehicles are input into the generative adversarial imitation learning model, the model can generate more accurate and reliable target track data, so that the target track data can intelligently correct incomplete or deviated vehicle tracks determined by the radar device and the visual detection device, overcoming problems such as line-of-sight shielding, illumination change, and excessive distance, and thereby continuously and accurately capturing the motion state of the vehicle in the intersection. Moreover, by having the generative adversarial imitation learning model deeply learn and understand vehicle behavior, the dependence on preset tracks can be abandoned, the driving intention and path of the vehicle can be accurately predicted, and sudden traffic incidents and complex traffic scenes can be better handled. In addition, the generative adversarial imitation learning model can optimize the data fusion process in complex traffic scenes, such as the accuracy and adaptability of radar-vision fusion under traffic congestion.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
Fig. 1 is a schematic flow chart of a vehicle track optimization method based on imitation learning according to embodiment 1 of the present application;
FIG. 2 is a schematic flow chart of guiding a vehicle to move;
FIG. 3 is a schematic diagram of another scenario implementation of a vehicle trajectory optimization method based on imitation learning provided by the present application;
FIG. 4 is a schematic diagram of a training process for a generator and a discriminator provided by the present application;
FIG. 5 is a schematic diagram of still another scenario implementation of a vehicle trajectory optimization method based on imitation learning provided by the present application;
Fig. 6 is a schematic structural diagram of a vehicle track optimizing device based on imitation learning.
Detailed Description
Embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. The terminology used in the description of the embodiments of the application herein is for the purpose of describing particular embodiments of the application only and is not intended to be limiting of the application.
Embodiments of the present application are described below with reference to the accompanying drawings. As one of ordinary skill in the art can know, with the development of technology and the appearance of new scenes, the technical scheme provided by the embodiment of the application is also applicable to similar technical problems.
The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely illustrative of the manner in which embodiments of the application have been described in connection with the description of the objects having the same attributes. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
In order that the above-recited objects, features and advantages of the present application will become more readily apparent, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description.
Referring to fig. 1, which is a schematic flow chart of a vehicle track optimization method based on imitation learning according to embodiment 1 of the present application, the method may include, but is not limited to, the following steps:
step S101, obtaining initial trajectory data of a target vehicle.
In this embodiment, the radar device may be used to accurately capture dynamic data such as position information, speed, and movement direction of the target vehicle, and the visual detection device may be used to capture unique identification information such as appearance characteristics (e.g., vehicle type, color, etc.) and license plate of the target vehicle. Initial trajectory data of the target vehicle is generated based on dynamic data such as position information, speed, movement direction and the like of the target vehicle, appearance characteristics (such as vehicle type, color and the like) of the target vehicle, license plates and the like.
Step S102, determining a target flow direction of the target vehicle.
The target flow direction may be understood as a direction or path that the target vehicle may or is expected to travel in the future. Determining the target flow direction may help more accurately screen and locate the lane in which the target vehicle may travel.
And step S103, selecting a target lane from at least one lane corresponding to the target flow direction based on the initial track data.
After the target flow direction of the target vehicle is determined, a lane most likely to be driven by the target vehicle, that is, a target lane, may be selected from a plurality of lanes corresponding to the target flow direction according to the initial trajectory data. This step helps to further narrow the range of travel of the target vehicle, facilitating subsequent analysis and processing.
Step S104, if the initial track data meets the correction condition, extracting the lane information of the target lane from the configuration information of the intersection, and acquiring the current position information and the current motion information of the target vehicle and surrounding vehicles based on the initial track data.
When the initial track data meets a certain correction condition, the lane information related to the target lane, such as a lane number, a lane corresponding flow direction, an exit target point and the like, can be extracted from the configuration information of the intersection.
In this embodiment, the current position information and the current motion information of the target vehicle may be extracted from the initial trajectory data. And determining surrounding vehicles of the target vehicle based on the initial trajectory data. The current position information and the current movement information of the surrounding vehicles are determined by the track data of the surrounding vehicles.
The current motion information of the target vehicle may include, but is not limited to, at least one of a current speed, a current acceleration, and a current vehicle corner of the target vehicle. Accordingly, the current motion information of the surrounding vehicles of the target vehicle may also include, but is not limited to, at least one of the current speed, the current acceleration, and the current vehicle corner of the surrounding vehicles.
If the initial trajectory data does not satisfy the correction condition, the initial trajectory data may be used to output the position in the initial trajectory data. The output position may be used to guide the movement of the target vehicle.
Step S105, inputting the current position information and current movement information of the target vehicle and its surrounding vehicles and the lane information of the target lane into the generative adversarial imitation learning model, and obtaining the target track data determined by the generative adversarial imitation learning model.
In this embodiment, the generative adversarial imitation learning model may be trained based on actual vehicle driving tracks at the intersection, and its policy parameters may be updated so that the model can imitate expert policies (i.e., policies corresponding to actual vehicle behaviors) and deeply learn and understand vehicle behaviors. The model can then generate action parameters similar to actual vehicle behaviors and thereby obtain target track data, ensuring that the target track data is more reasonable.
In this embodiment, the vehicle driving tracks at the intersection may be grouped by flow direction to obtain the corresponding vehicle driving tracks of the left-turn, straight-going, right-turn and u-turn flow directions, and training may be performed simultaneously based on these tracks, so that the generative adversarial imitation learning model can simultaneously imitate and learn the expert policies of different flow directions.
And step S106, correcting the initial track data based on the target track data.
In this embodiment, the initial track data may be replaced with the target track data, so as to complete the correction of the initial track data.
Or the initial track data may be updated based on the target track data to complete the correction.
After the initial trajectory data is corrected, the target vehicle may be guided to move based on the corrected trajectory data. In the process of guiding the target vehicle to move, it can be judged whether to continue generating target track data. For example, as shown in fig. 2, after the vehicle moves and the post-movement position is output, it may be determined based on the post-movement position whether the target trajectory data has reached the effective areas of the radar device and the visual detection device. If so, the step of generating target trajectory data may be ended; if not, the post-movement position is taken as new current position information, new current motion information of the target vehicle is acquired from the target trajectory data, new current position information and new current motion information of the surrounding vehicles are acquired, and the step of generating target trajectory data is continued.
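A minimal sketch of the loop in fig. 2, under assumed interfaces (`generate_step` advances the generative model by one position, `in_effective_area` checks the radar/visual detection coverage; both are hypothetical):

```python
def guide_until_detectable(current_obs, generate_step, in_effective_area, max_steps=200):
    """Sketch: keep generating positions from the model until the vehicle's output
    position re-enters the effective detection area of the radar/visual devices."""
    generated_positions = []
    for _ in range(max_steps):
        next_position, current_obs = generate_step(current_obs)  # one model step
        generated_positions.append(next_position)
        if in_effective_area(next_position):
            break  # the sensors can track the vehicle again; stop generating
    return generated_positions
```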
In the present embodiment, by training the generative adversarial imitation learning model, the model can learn how to predict the future travel intention and path of a vehicle based on its current position, speed, and movement direction, and on the dynamic changes of surrounding vehicles. Therefore, when the current position information, the current motion information, and the lane information of the target vehicle and its surrounding vehicles are input into the generative adversarial imitation learning model, the model can generate more accurate and reliable target track data, so that the target track data can intelligently correct incomplete or deviated vehicle tracks determined by the radar device and the visual detection device, overcoming problems such as line-of-sight shielding, illumination change, and excessive distance, and thereby continuously and accurately capturing the motion state of the vehicle in the intersection. Moreover, by having the generative adversarial imitation learning model deeply learn and understand vehicle behavior, the dependence on preset tracks can be abandoned, the driving intention and path of the vehicle can be accurately predicted, and sudden traffic incidents and complex traffic scenes can be better handled. In addition, the generative adversarial imitation learning model can optimize the data fusion process in complex traffic scenes, such as the accuracy and adaptability of radar-vision fusion under traffic congestion.
As another optional embodiment of the present application, embodiment 2 of the present application provides a vehicle track optimization method based on imitation learning; this embodiment is mainly an implementation of step S102 in embodiment 1, where step S102 may include, but is not limited to, the following steps:
step S1021, if the target vehicle has locked a flow direction, the locked flow direction is used as the target flow direction of the target vehicle.
In the present embodiment, if the target vehicle has been determined in its travel direction before entering the intersection due to its travel path or traffic signal control or the like and is not allowed to change, it may be determined that the target vehicle has locked the flow direction.
In most cases, a straight-traveling vehicle may be "locked" as it approaches an intersection, because the straight-traveling vehicle typically does not need to make a directional selection in front of the intersection, but instead travels directly along the current lane.
Step S1022, if the target vehicle has not locked a flow direction and the lane flow direction of the entrance lane of the target vehicle is configured as a single flow direction, the single flow direction is used as the target flow direction of the target vehicle.
A single flow direction is understood to mean that the entrance lane allows only vehicles of a single flow direction to travel, such as only left or only right turns.
Step S1023, if the target vehicle has not locked a flow direction and the lane flow direction of the entrance lane of the target vehicle is configured as multiple flow directions, selecting, by comparing the vehicle flows corresponding to the respective flow directions, the flow direction with the largest vehicle flow among the multiple flow directions as the target flow direction of the target vehicle.
If the multiple flow directions include a left-turn flow direction and a straight-going flow direction, the numbers of left-turning and straight-going vehicles in the intersection can be monitored to judge the current flow direction of the vehicle.
If the number of left-turning vehicles is larger than the number of straight-going vehicles, the left-turn flow direction is taken as the target flow direction of the target vehicle.
If the number of left-turning vehicles is smaller than the number of straight-going vehicles, the straight-going flow direction is taken as the target flow direction of the target vehicle.
If the multiple flow directions include a right-turn flow direction and a straight-going flow direction, the numbers of right-turning and straight-going vehicles in the intersection can be monitored to judge the flow direction.
If the number of right-turning vehicles is greater than the number of straight-going vehicles, or the number of straight-going vehicles is 0, the right-turn flow direction may be taken as the target flow direction of the target vehicle.
And step S1024, if the vehicle flows corresponding to the multiple flow directions are consistent, determining a temporary target point based on the historical track of the target vehicle, and if the temporary target point is located in one of the multiple flow directions, taking the flow direction containing the temporary target point as the target flow direction of the target vehicle.
In this embodiment, when the vehicle flows of the respective flow directions of a multi-flow-direction lane are consistent, a temporary target point based on the historical track of the target vehicle is used to determine the target flow direction, which can reflect the real intention and driving habit of the vehicle more accurately and improve the accuracy of the target flow direction.
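Steps S1021-S1024 form a small decision cascade; a sketch under assumed data structures follows (all names are hypothetical, and ties for the largest vehicle flow simply fall through to the history-based rule):

```python
def determine_target_flow_direction(locked_flow, lane_flows, flow_counts, history_flow):
    """Sketch of steps S1021-S1024.

    locked_flow:  flow direction already locked by the target vehicle, or None.
    lane_flows:   list of flow directions configured for the entrance lane.
    flow_counts:  dict mapping each flow direction to its current vehicle count.
    history_flow: flow direction containing the temporary target point derived
                  from the vehicle's historical track, or None.
    """
    if locked_flow is not None:                  # S1021: flow direction already locked
        return locked_flow
    if len(lane_flows) == 1:                     # S1022: single-flow-direction lane
        return lane_flows[0]
    counts = {flow: flow_counts.get(flow, 0) for flow in lane_flows}
    best = max(counts, key=counts.get)
    if list(counts.values()).count(counts[best]) == 1:
        return best                              # S1023: unique largest vehicle flow
    return history_flow                          # S1024: consistent flows -> use history
```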
As another optional embodiment of the present application, embodiment 3 of the present application provides a vehicle track optimization method based on imitation learning; this embodiment is mainly an implementation of step S103 in embodiment 1, where step S103 may include, but is not limited to, the following steps:
and step S1031, determining the latest track course angle of the target vehicle based on the track point of the initial track data where the target vehicle is currently located.
In this embodiment, the latest track course angle of the target vehicle may be calculated based on the track point where the target vehicle is currently located in the initial track data and the original track point that is N points away from the current track point. N may be set as needed and is not limited in the present application; for example, N may be 3.
Step S1032, obtaining an exit course angle corresponding to the target flow direction of the target vehicle.
Step S1033, determining an average included angle between the latest track course angle and the exit course angle.
In this embodiment, the average included angle can be calculated by the following relation:

$$ \theta_{\mathrm{avg}} = \frac{\theta_{\mathrm{track}} + \theta_{\mathrm{exit}}}{2} $$

where $\theta_{\mathrm{avg}}$ represents the average included angle, $\theta_{\mathrm{track}}$ represents the latest track course angle, and $\theta_{\mathrm{exit}}$ represents the exit course angle.
And step S1034, determining the angle between the target point of each lane in at least one lane corresponding to the target flow direction and the current track point of the target vehicle, and taking the lane with the smallest difference between the angle and the average included angle as the target lane.
As shown in fig. 3, the target point of each lane in at least one lane corresponding to the target flow direction and the track point where the target vehicle is currently located may be connected, an angle between the target point of each lane in at least one lane corresponding to the target flow direction and the track point where the target vehicle is currently located is calculated, and the lane with the smallest difference between the angle and the average included angle is used as the target lane.
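The screening rule of steps S1031-S1034 can be sketched as follows; the heading here is taken from the two most recent track points and angle wrap-around is ignored for brevity, so the names and simplifications are assumptions rather than the patent's exact computation:

```python
import math

def select_target_lane(current_point, previous_point, exit_heading_deg, lane_target_points):
    """Sketch: pick the lane whose target point lies at an angle closest to the
    average of the latest track course angle and the exit course angle."""
    track_heading = math.degrees(math.atan2(current_point[1] - previous_point[1],
                                            current_point[0] - previous_point[0]))
    average_angle = (track_heading + exit_heading_deg) / 2.0

    best_lane, smallest_diff = None, float("inf")
    for lane_id, (tx, ty) in lane_target_points.items():
        # Angle from the current track point to this lane's target point.
        angle = math.degrees(math.atan2(ty - current_point[1], tx - current_point[0]))
        diff = abs(angle - average_angle)
        if diff < smallest_diff:
            best_lane, smallest_diff = lane_id, diff
    return best_lane
```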
As another optional embodiment of the present application, embodiment 4 of the present application provides a vehicle track optimization method based on imitation learning; this embodiment is mainly an implementation of the manner of determining that the initial track data in embodiment 1 satisfies the correction condition, where whether the initial track data satisfies the correction condition may be determined in the following manner:
step S11, determining a first distance between a first initial track point in the intersection entering the target lane and a target point of the target lane in the initial track data.
In this embodiment, based on the coordinates of the first initial track point entering the intersection toward the target lane in the initial track data and the coordinates of the target point of the target lane, the distance between the first initial track point and the target point of the target lane (denoted d1) may be calculated, and d1 is taken as the first distance.
And step S12, determining a second distance between the track point of the target vehicle in the initial track data and the first initial track point.
In this embodiment, based on the coordinates of the track point where the target vehicle is currently located in the initial track data and the coordinates of the first initial track point, the distance between the current track point of the target vehicle and the first initial track point (denoted d2) may be calculated, and d2 is taken as the second distance.
Step S13, if the target flow direction is a left-turn flow direction and the ratio of the second distance to the first distance is not smaller than a left-turn threshold, the correction condition is met.
In the present embodiment, the ratio d2/d1 may be compared with the left-turn threshold; if d2/d1 is not smaller than the left-turn threshold, it is determined that the correction condition is satisfied.
Step S14, if the target flow direction is a right-turn flow direction and the ratio of the second distance to the first distance is not smaller than a right-turn threshold, the correction condition is met.
In the present embodiment, the ratio d2/d1 may be compared with the right-turn threshold; if d2/d1 is not smaller than the right-turn threshold, it is determined that the correction condition is satisfied.
Step S15, if the target flow direction is a straight-going flow direction, the ratio of the second distance to the first distance is not smaller than a straight-going threshold, and visual tracking of the target vehicle has failed, the correction condition is met.
In the present embodiment, the ratio d2/d1 may be compared with the straight-going threshold; if d2/d1 is not smaller than the straight-going threshold, the target vehicle may be marked as being in a locked-flow-direction state; when it is further determined that visual tracking of the target vehicle has failed, it is determined that the correction condition is satisfied.
Step S16, if the target flow direction is a u-turn flow direction and visual tracking of the target vehicle has failed, the correction condition is met.
Besides vehicles in the straight-going flow direction, which lock the flow direction before the correction condition is met, vehicles in the left-turn and right-turn flow directions lock the flow direction at the same time as the correction condition is met, and vehicles in the u-turn flow direction lock the flow direction when the correction condition is met after a tracking failure.
In this embodiment, thresholds are set for different flow directions, and whether the vehicle has entered an "untrusted zone" (i.e., a dynamic blind zone) is determined according to the actual driving distance of the vehicle (the second distance d2) and the reference distance (the first distance d1). Compared with determining the blind-zone range by relying on physical calibration of the radar device and the visual detection device, this method is more flexible and can adapt to changes of different environments and intersections. When the vehicle enters the dynamic blind zone, the system automatically enters a correction mode and intelligently corrects the vehicle track in the blind zone. This helps to reduce trajectory errors and false positives caused by blind zones. By dynamically defining the blind zone and intelligently correcting the track, the method improves the robustness of the system, so that the system can track and predict the running track of the vehicle more accurately. In addition, the radar-vision fusion process can be improved, so that radar and visual information can be combined more effectively, and the accuracy and reliability of overall perception are improved.
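The per-flow-direction test of steps S13-S16 reduces to a few comparisons; the sketch below uses placeholder threshold values, since the patent does not fix them:

```python
def meets_correction_condition(flow_direction, d1, d2, visual_tracking_failed,
                               left_threshold=0.5, right_threshold=0.5, straight_threshold=0.5):
    """Sketch of steps S13-S16: d1 is the first distance, d2 the second distance."""
    ratio = d2 / d1 if d1 > 0 else 0.0
    if flow_direction == "left_turn":
        return ratio >= left_threshold
    if flow_direction == "right_turn":
        return ratio >= right_threshold
    if flow_direction == "straight":
        return ratio >= straight_threshold and visual_tracking_failed
    if flow_direction == "u_turn":
        return visual_tracking_failed
    return False
```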
As another optional embodiment of the present application, embodiment 5 of the present application provides a vehicle trajectory optimization method based on imitation learning; this embodiment is mainly an implementation of the generative adversarial imitation learning model described in embodiment 1, where the generative adversarial imitation learning model may be obtained by training based on an adversarial network, and the adversarial network may include a generator and a discriminator.
In this embodiment, the discount rate, batch size, number of training iterations, curriculum learning schedule, maximum number of training vehicles, and the meaning of the policy may be determined before training. For example, two training phases may be used: a first training phase with a discount rate of 0.95, a batch size of 10000 observation-action pairs, and 1000 training iterations, in which 10 vehicles are added to the environment every 200 iterations as a curriculum learning process, so that learning efficiency and performance are improved by gradually increasing task difficulty; and a second training phase (fine-tuning phase) with a discount rate of 0.99, a batch size of 40000 observation-action pairs, and 200 training iterations, in which training is performed with 100 vehicles. The policy selects the actions that the vehicle should perform based on the current environmental observations and the policy parameters.
The discount rate may be understood as a coefficient between 0 and 1 used to measure the current value of future rewards in scenarios such as reinforcement learning. It determines the importance of future rewards: a higher discount rate pays more attention to future rewards, whereas a lower discount rate pays more attention to current rewards.
The observation action pair may be understood as a combination of an observation (observation) received by the vehicle at the current time (an observation is information acquired from the environment that describes the current state or part of the state of the environment) and an action to be taken subsequently.
In this embodiment, the penalty function may be defined in accordance with vehicle kinematic constraints and anti-collision rules.
In this embodiment, the policy parameters θ of the generator, the discriminator parameters ψ (by updating ψ, the discriminator can learn to judge whether a state-action pair is generated by the expert policy or by the generator policy), the step-size parameter δ (used to control the maximum variation of the policy in each optimization step, ensuring that the policy update is both efficient and stable; this parameter is a key hyper-parameter that balances algorithm stability and learning progress), and the curriculum distribution set (a concept in curriculum learning that defines the distribution of task difficulty an agent will face during training) may be initialized. In a multi-agent (i.e., multi-vehicle) environment, the curriculum distribution set can be defined as a time-varying distribution that gradually increases the number of agents under policy control, thereby gradually increasing the difficulty of training. This helps the agents learn to handle more complex interaction scenarios step by step and avoids overly complex decision-making at the beginning.
In the PS-GAIL (Parameter Sharing Generative Adversarial Imitation Learning) method, the policy parameters may be shared. In a multi-agent environment, all agents share the same set of policy parameters, meaning that, given the same input, agents act according to the same policy. By sharing policy parameters, the agents can learn driving policies that interact with other agents in complex traffic scenarios. Sharing policy parameters helps reduce model complexity, improve training efficiency, and promote collaborative behavior between agents.
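A one-function sketch of the parameter-sharing idea (hypothetical interface): every vehicle draws its action from the same policy network given its own observation.

```python
def rollout_shared_policy(policy, observations):
    """Sketch of PS-GAIL parameter sharing: one policy, many vehicles."""
    # observations: dict mapping vehicle_id -> observation vector for that vehicle.
    return {vehicle_id: policy.act(obs) for vehicle_id, obs in observations.items()}
```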
The process of training the generative adversarial imitation learning model based on the adversarial network includes:
and S31, acquiring the vehicle running track of the intersection, and determining the state action pair of the expert based on the vehicle running track of the intersection.
In this embodiment, the vehicle travel tracks at the intersection can be collected by means such as unmanned aerial vehicles and mobile phone positioning data, and tracks that are too short and discontinuous tracks are removed to obtain the remaining vehicle travel tracks.
In this embodiment, the remaining vehicle travel tracks may be grouped by flow direction to accommodate different driving intentions such as turning left, going straight, and turning right. Based on the expert policy, discrete track points are extracted from each vehicle travel track corresponding to each flow direction as expert state-action pairs (s, a), forming an expert behavior set; training is performed simultaneously for the different driving intentions. Here s represents a state (e.g., position, velocity, etc.) and a represents an action (e.g., acceleration, deceleration, steering, etc.).
Step S32, at the current moment, sampling, according to the curriculum distribution set, a plurality of vehicles as sample vehicles.
And step S33, acquiring sample information of each sample vehicle in the plurality of sample vehicles at its current position, wherein the sample information includes the position information and motion information of the sample vehicle at the current position, the position information and motion information of its surrounding sample vehicles, and the lane information of the target sample lane.
The position information and the motion information may be referred to in the previous embodiments, and the description thereof is omitted herein.
And step S34, processing sample information corresponding to each sample vehicle according to the current strategy of the generator, and generating the track of each sample vehicle.
And step S35, determining a penalty value corresponding to the track of each sample vehicle.
In this embodiment, a penalty value may be generated if the behavior of the vehicle violates a rule.
Step S36, scoring, based on the discriminator, each state-action pair (s, a) in the trajectory of each sample vehicle to generate a reward value for each sample vehicle, the reward value being determined as follows:

$$ r(s,a) = -\log\bigl(D_{\psi}(s,a)\bigr) - p $$

where $D_{\psi}(s,a)$ denotes the value produced by the discriminator $D$ with parameters $\psi$ for the state-action pair $(s,a)$, and $p$ denotes the penalty value corresponding to the trajectory of the sample vehicle.
When the behavior of the vehicle violates a rule, the penalty value $p$ takes a positive value, thereby reducing the overall reward. In this way, the vehicle will tend to avoid behaviors that would incur a penalty when making decisions.
Step S37, keeping the parameters of the discriminator unchanged, the policy parameters of the generator are updated based on the trust region optimization method; updating the policy parameters of the generator based on the trust region optimization method comprises solving the following constrained optimization problem:

$$ \max_{\theta'} \; \mathbb{E}_{(o_t,a_t)\sim\pi_{\theta}}\!\left[\frac{\pi_{\theta'}(a_t \mid o_t)}{\pi_{\theta}(a_t \mid o_t)}\,A(o_t,a_t)\right] \quad \text{s.t.} \quad \mathbb{E}_{o_t}\!\left[D_{\mathrm{KL}}\bigl(\pi_{\theta}(\cdot \mid o_t)\,\|\,\pi_{\theta'}(\cdot \mid o_t)\bigr)\right] \le \delta $$

where $\theta$ denotes the parameters of the policy $\pi$; $\mathbb{E}$ denotes the expectation; $\pi_{\theta}$ denotes the current policy taken at time $t$, defined by the old parameters $\theta$; $\pi_{\theta'}$ denotes the new policy; $\pi_{\theta}(a_t \mid o_t)$ denotes the probability that the current policy takes action $a_t$ under the observation condition $o_t$ at time $t$; $\pi_{\theta'}(a_t \mid o_t)$ denotes the probability that the new policy takes action $a_t$ under the observation condition $o_t$ at time $t$; $\pi_{\theta}(\cdot \mid o_t)$ denotes the probability distribution over actions taken by the current policy under the observation condition $o_t$; $\pi_{\theta'}(\cdot \mid o_t)$ denotes the probability distribution over actions taken by the new policy under the observation condition $o_t$; $D_{\mathrm{KL}}$ denotes the KL (Kullback-Leibler) divergence between $\pi_{\theta}(\cdot \mid o_t)$ and $\pi_{\theta'}(\cdot \mid o_t)$; $\delta$ denotes the step-size parameter used to control the maximum variation of the policy in each optimization step; $A(o_t,a_t)$ denotes the advantage function, used to measure the degree of difference between the action value expectation $Q(o_t,a_t)$ of taking action $a_t$ under the observation condition $o_t$ and the state value expectation $V(o_t)$ estimated by the value estimator; action $a_t$ represents an action that the vehicle should perform, selected according to the current policy of the generator;

The advantage function is estimated by the following generalized advantage estimation (GAE) method:

$$ \hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) $$

where $\gamma$ denotes the discount rate; $\lambda$ is a parameter between 0 and 1 used to balance the weights of the TD (Temporal Difference) errors $\delta_t$; $r_t$ denotes the reward value determined by the discriminator; and $V(s_t)$ and $V(s_{t+1})$ denote the state value expectations at time $t$ and time $t+1$, respectively;
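An illustrative sketch of the generalized advantage estimation above, computed backwards over one sample-vehicle trajectory (function and argument names are assumptions):

```python
import numpy as np

def generalized_advantage_estimation(rewards, values, gamma=0.95, lam=0.97):
    """Sketch of GAE: rewards[t] is the discriminator-based reward r_t (minus penalty),
    values has one more entry than rewards so values[t + 1] bootstraps the last step."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):          # accumulate discounted TD errors
        td_error = rewards[t] + gamma * values[t + 1] - values[t]
        gae = td_error + gamma * lam * gae
        advantages[t] = gae
    return advantages
```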
Step S38, keeping the policy parameters of the generator unchanged, the discriminator parameters $\psi$ are updated based on the state-action pairs of the expert and the state-action pairs generated by the new policy of the generator; the discriminator parameters $\psi$ are updated through the following objective function:

$$ \max_{\psi} \; \mathbb{E}_{(s,a)\sim\rho_{\pi_{\theta'}}}\bigl[\log D_{\psi}(s,a)\bigr] + \mathbb{E}_{(s,a)\sim\rho_{\pi_{E}}}\bigl[\log\bigl(1 - D_{\psi}(s,a)\bigr)\bigr] $$

where $\pi_E$ denotes the expert policy and $\pi_{\theta'}$ denotes the new policy; $\rho_{\pi_{\theta'}}(s,a)$ denotes the probability that the state-action pair $(s,a)$ is visited when executing the policy $\pi_{\theta'}$, i.e., its occupancy measure, which can be written as $\rho_{\pi_{\theta'}}(s,a) = d_{\pi_{\theta'}}(s)\,\pi_{\theta'}(a \mid s)$, where $d_{\pi_{\theta'}}(s)$ denotes the probability of being in state $s$ under the policy $\pi_{\theta'}$ and $\pi_{\theta'}(a \mid s)$ denotes the probability that the current policy takes action $a$ in state $s$; $\rho_{\pi_E}(s,a)$ denotes the probability that the state-action pair $(s,a)$ is visited when executing the policy $\pi_E$, i.e., its occupancy measure, which can likewise be written as $\rho_{\pi_E}(s,a) = d_{\pi_E}(s)\,\pi_E(a \mid s)$, where $d_{\pi_E}(s)$ denotes the probability of being in state $s$ under the policy $\pi_E$ and $\pi_E(a \mid s)$ denotes the probability that the expert policy takes action $a$ in state $s$; $D_{\psi}(s,a)$ is a simplified notation for the value produced by the discriminator $D$ with parameters $\psi$ for the state-action pair $(s,a)$.
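A minimal PyTorch-style sketch of the discriminator update in step S38, under the label convention assumed above (expert pairs labelled 0, generator pairs labelled 1, discriminator output in (0, 1)); all names are illustrative:

```python
import torch
import torch.nn.functional as F

def update_discriminator(discriminator, optimizer, expert_pairs, generated_pairs):
    """Sketch of step S38: binary classification between expert and generator
    state-action pairs; the discriminator is assumed to end with a sigmoid."""
    optimizer.zero_grad()
    d_expert = discriminator(expert_pairs)        # D_psi(s, a) on expert pairs
    d_generated = discriminator(generated_pairs)  # D_psi(s, a) on generator pairs
    loss = (F.binary_cross_entropy(d_expert, torch.zeros_like(d_expert))
            + F.binary_cross_entropy(d_generated, torch.ones_like(d_generated)))
    loss.backward()
    optimizer.step()
    return loss.item()
```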
If the training end condition is not reached, steps S31-S37 may be continued; if the training end condition is reached, training may be ended, and the generator in the adversarial network serves as the generative adversarial imitation learning model. The training end condition may be set as needed and is not limited in the present application; for example, the training end condition may include reaching a set number of iterations or meeting an objective function requirement (e.g., a convergence index). Referring to fig. 4, the iterative execution of these steps is described as follows. As shown in fig. 4, after the training procedure is determined, the penalty function is defined, and the parameters are initialized, the vehicle travel tracks of the intersection are acquired and grouped by flow direction to obtain the corresponding expert behavior sets. After the sample information of each sample vehicle at its current position is obtained, the sample information corresponding to each sample vehicle is processed according to the current policy of the generator to generate the trajectory of each sample vehicle, and it is evaluated whether the number of training iterations has been reached or the objective function requirement is met. If not, each state-action pair in the trajectory of each sample vehicle is scored by the discriminator to generate a reward value for each sample vehicle; the parameters of the discriminator are kept unchanged, and the policy parameters of the generator are updated based on the trust region optimization method, in which the reward values generated by the discriminator are used; after the policy parameters of the generator are updated, the generator learns a new policy; the policy parameters of the generator are then kept unchanged, and the discriminator parameters $\psi$ are updated by means of the objective function, based on the state-action pairs of the expert and the state-action pairs generated by the new policy of the generator.
And ending the training if the training times are reached or the requirements of the objective function are met.
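The overall iteration of fig. 4 can be summarized by the following training-loop sketch. Every behavior is injected as a callable so that the loop only fixes the ordering of the steps; the callable names, their signatures, and the iteration-count stopping rule are assumptions for illustration, not the patented procedure itself.

def train_gail(generator, judge, expert_pairs,
               sample_vehicles, rollout, penalty_of, reward_of,
               update_policy, update_judge, max_iters=1000):
    # sample_vehicles()          -> sample vehicles per the course distribution set (S32)
    # rollout(generator, v)      -> trajectory of one sample vehicle under the current policy (S34)
    # penalty_of(traj)           -> penalty value of a trajectory (S35)
    # reward_of(judge, traj, c)  -> per-step reward values from the judge and the penalty (S36)
    # update_policy(...)         -> trust-region update of the generator's policy parameters (S37)
    # update_judge(...)          -> objective-function update of the judge's parameters (S38)
    for it in range(max_iters):
        vehicles = sample_vehicles()
        trajectories = [rollout(generator, v) for v in vehicles]
        penalties = [penalty_of(traj) for traj in trajectories]
        rewards = [reward_of(judge, traj, c) for traj, c in zip(trajectories, penalties)]
        update_policy(generator, trajectories, rewards)   # judge parameters held fixed
        update_judge(judge, trajectories, expert_pairs)   # generator parameters held fixed
    return generator  # the trained generator serves as the GAIL model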
As another optional embodiment of the present application, embodiment 6 provides a vehicle trajectory optimization method based on imitation learning that is mainly an implementation of step S35 in embodiment 5 above; step S35 may include, but is not limited to, the following steps:
Step S351: the penalty value corresponding to the track of each sample vehicle is determined through the penalty function C.
Wherein the penalty function takes as inputs the minimum distance between any two sample vehicles, from which the collision penalty value is obtained; the closest distance of the sample vehicle to the road edge, taken as the smaller of its closest distance to the left road edge and its closest distance to the right road edge, from which the distance penalty value is obtained; whether the vehicle kinematic constraints are satisfied, from which the constraint penalty value is obtained; and the acceleration, from which the sudden-braking penalty value is obtained.
It will be appreciated that the penalty value C may consist of a single penalty term (i.e., one of the collision penalty value, the distance penalty value, the constraint penalty value, and the sudden-braking penalty value) or of a plurality of penalty terms (i.e., at least two of the collision penalty value, the distance penalty value, the constraint penalty value, and the sudden-braking penalty value).
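One possible way to assemble the penalty value from the four terms named above is sketched below; all thresholds (safe_gap, edge_margin, brake_limit) and the individual penalty magnitudes are illustrative assumptions, since the present application does not fix their numerical values.

def trajectory_penalty(min_vehicle_gap, d_left, d_right, kinematics_ok, acceleration,
                       safe_gap=1.0, edge_margin=0.5, brake_limit=-4.0):
    # min_vehicle_gap: minimum distance between any two sample vehicles
    # d_left / d_right: closest distance to the left / right road edge
    # kinematics_ok:   whether the vehicle kinematic constraints are satisfied
    # acceleration:    longitudinal acceleration (strongly negative when braking hard)
    penalty = 0.0
    if min_vehicle_gap < safe_gap:           # collision penalty term
        penalty += 10.0
    if min(d_left, d_right) < edge_margin:   # distance (road-edge) penalty term
        penalty += 5.0
    if not kinematics_ok:                    # constraint penalty term
        penalty += 5.0
    if acceleration < brake_limit:           # sudden-braking penalty term
        penalty += 2.0
    return penalty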
As another optional embodiment of the present application, embodiment 7 provides a vehicle trajectory optimization method based on imitation learning that is mainly an implementation of the collision penalty value in the foregoing embodiment 6, where the collision penalty value is determined in the following manner:
Step S41: the first n consecutive position points are extracted from the track of the sample vehicle.
In the present embodiment, the value of n can be set as needed and is not limited in the present application. For example, n may be set to 6.
The first n consecutive position points can be expressed as p_1, p_2, …, p_n.
Step S42: for each of the first n consecutive position points, the position point is marked as abnormal if the sample vehicle did not collide with its surrounding vehicles before moving to the position point but collides with its surrounding vehicles at the position point; the position point is marked as a candidate if the sample vehicle did not collide with its surrounding vehicles before moving to the position point and does not collide with its surrounding vehicles at the position point.
For example, for a position point p_i, if the sample vehicle did not collide with its surrounding vehicles before moving to p_i but collides with its surrounding vehicles at p_i, p_i is marked as abnormal, that is, the current round of travel does not proceed to p_i. If the sample vehicle did not collide with its surrounding vehicles before moving to p_i and does not collide with its surrounding vehicles at p_i, p_i is marked as a candidate, that is, the current round of travel may proceed to p_i.
Step S43: if there are position points marked as candidates among the first n consecutive position points, the last of the position points marked as candidates is taken as the new current position of the sample vehicle, the penalty values corresponding to the position points marked as abnormal are determined, and the penalty values corresponding to the position points marked as abnormal are accumulated to obtain the collision penalty value.
For example, with n = 6, if p_1, p_2 and p_3 are marked as candidates and p_4, p_5 and p_6 are marked as abnormal, then p_3 is taken as the new current position of the sample vehicle, and the penalty values corresponding to p_4, p_5 and p_6 are accumulated to obtain the collision penalty value.
Step S44: if all of the first n consecutive position points are marked as abnormal, the first of the n consecutive position points is taken as the new current position of the sample vehicle, the penalty values corresponding to the position points marked as abnormal are determined, and the penalty values corresponding to the position points marked as abnormal are accumulated to obtain the collision penalty value.
For example, with n = 6, if p_1 through p_6 are all marked as abnormal, p_1 is taken as the new current position of the sample vehicle, and the penalty values corresponding to p_1 through p_6 are accumulated to obtain the collision penalty value.
If all of the first n consecutive position points are marked as abnormal, taking the first of the n consecutive position points as the new current position of the sample vehicle avoids the situation in which no new current position can be determined, which would affect model training.
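Steps S41-S44 can be expressed compactly as the sketch below. It assumes the checks start from a collision-free current position, so the "did not collide before moving" condition is already satisfied; collides_at is a hypothetical helper that tests whether the sample vehicle collides with any surrounding vehicle at a given position point, and point_penalty stands in for the per-point penalty whose exact value the patent leaves open.

def collision_penalty(track_points, surrounding, collides_at, n=6, point_penalty=10.0):
    points = track_points[:n]                                   # S41: first n consecutive position points
    labels = ["abnormal" if collides_at(p, surrounding) else "candidate" for p in points]  # S42
    candidates = [p for p, lab in zip(points, labels) if lab == "candidate"]
    if candidates:
        new_position = candidates[-1]                           # S43: last candidate point becomes the new current position
    else:
        new_position = points[0]                                # S44: all abnormal -> fall back to the first point
    penalty = point_penalty * labels.count("abnormal")          # accumulate the per-point penalties
    return new_position, penalty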
In this embodiment, optimizing the anti-collision judgment logic reduces unnecessary computation and thereby speeds up model training. Although the calculation process is simplified, accumulating the penalty values corresponding to the position points marked as abnormal still quantifies the risk of collision along the trajectory of the sample vehicle, and this quantitative evaluation helps to assess the safety of the vehicle behavior more accurately. For example, as shown in fig. 5, after the generative adversarial imitation learning model is trained with the collision penalty value, the target trajectory data generated by the model can, once the initial trajectory data is corrected, avoid collisions between the target vehicle and other vehicles.
Next, the vehicle trajectory optimization device based on imitation learning provided by the present application is described; the device described below and the vehicle trajectory optimization method based on imitation learning described above correspond to each other and may be referred to mutually.
Referring to fig. 6, the vehicle trajectory optimization device based on the imitation learning includes a first obtaining module 100, a first determining module 200, a screening module 300, a second obtaining module 400, a second determining module 500, and a modifying module 600.
A first obtaining module 100 is configured to obtain initial trajectory data of a target vehicle.
A first determining module 200 is configured to determine a target flow direction of the target vehicle.
And the screening module 300 is used for screening the target lane from the at least one lane corresponding to the target flow direction based on the initial track data.
The second obtaining module 400 is configured to extract lane information of the target lane from configuration information of an intersection if the initial track data meets a correction condition, and obtain current position information and current movement information of the target vehicle and surrounding vehicles based on the initial track data.
The second determining module 500 is configured to input the current position information and current motion information of the target vehicle and its surrounding vehicles and the lane information of the target lane into the generative adversarial imitation learning model, and obtain the target trajectory data determined by the generative adversarial imitation learning model.
And a correction module 600, configured to correct the initial track data based on the target track data.
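The way these modules could chain together at inference time is sketched below; the module behaviors are injected as callables, and splicing the model output into the remaining trajectory is merely one plausible correction strategy, not the one mandated by the patent.

def optimize_trajectory(initial_track, target_vehicle, nearby_vehicles, intersection_config,
                        determine_flow, screen_lane, correction_needed, gail_model, blend):
    # determine_flow / screen_lane / correction_needed / blend stand for the modules described above
    flow = determine_flow(target_vehicle, intersection_config)            # first determining module
    target_lane = screen_lane(initial_track, flow, intersection_config)   # screening module
    if not correction_needed(initial_track, flow, target_lane):
        return initial_track                                              # keep the original trajectory
    lane_info = intersection_config["lanes"][target_lane]                 # second obtaining module
    model_input = {"target": target_vehicle, "neighbours": nearby_vehicles, "lane": lane_info}
    target_track = gail_model(model_input)                                # second determining module
    return blend(initial_track, target_track)                             # correction module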
The first determining module 200 may specifically be configured to:
if the target vehicle has locked a flow direction, take the locked flow direction as the target flow direction of the target vehicle;
if the target vehicle has not locked a flow direction and the lane flow direction of the entrance road of the target vehicle is configured as a single flow direction, take the single flow direction as the target flow direction of the target vehicle;
if the target vehicle has not locked a flow direction and the lane flow direction of the entrance road of the target vehicle is configured as multiple flow directions, compare the vehicle flows corresponding to the respective flow directions and select the flow direction with the largest vehicle flow from the multiple flow directions as the target flow direction of the target vehicle;
and if the vehicle flows corresponding to the respective flow directions are identical, determine a temporary target point based on the historical track of the target vehicle, and if the temporary target point lies in one of the multiple flow directions, take the flow direction containing the temporary target point as the target flow direction of the target vehicle (this selection order is illustrated in the sketch below).
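A compact sketch of this selection order (locked flow, single configured flow, busiest flow, tie-break by temporary target point) follows; the argument names and the flow_contains helper are assumptions made for illustration.

def determine_target_flow(locked_flow, configured_flows, flow_volumes, temp_target_point, flow_contains):
    # locked_flow:       flow direction already locked for the vehicle, or None
    # configured_flows:  flow directions configured for the vehicle's entrance lane
    # flow_volumes:      mapping from flow direction to its current traffic volume
    # temp_target_point: temporary target point extrapolated from the vehicle's historical track
    # flow_contains(f, p): tells whether point p lies within flow direction f
    if locked_flow is not None:
        return locked_flow
    if len(configured_flows) == 1:
        return configured_flows[0]
    best = max(configured_flows, key=lambda f: flow_volumes[f])
    volumes = [flow_volumes[f] for f in configured_flows]
    if volumes.count(flow_volumes[best]) == 1:
        return best                                    # unique busiest flow direction
    for f in configured_flows:
        if flow_contains(f, temp_target_point):        # tie: fall back to the historical track
            return f
    return None                                        # undecided case, not covered by the patent text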
The screening module 300 may specifically be configured to:
determine the latest track heading angle of the target vehicle based on the trajectory point of the initial trajectory data at which the target vehicle is currently located;
acquire the exit-road heading angle corresponding to the target flow direction of the target vehicle;
determine the average included angle between the latest track heading angle and the exit-road heading angle;
and determine the angle between the target point of each lane of the at least one lane corresponding to the target flow direction and the trajectory point at which the target vehicle is currently located, and take the lane with the smallest difference between that angle and the average included angle as the target lane (a sketch of this rule follows).
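The lane-screening rule can be sketched as follows; the angles are assumed to be expressed in degrees within a common convention, and the dictionary layout of the lane target points is an illustrative assumption.

import math

def screen_target_lane(current_point, track_heading, exit_heading, lane_target_points):
    # current_point:      (x, y) of the trajectory point where the target vehicle currently is
    # track_heading:      latest track heading angle, in degrees
    # exit_heading:       exit-road heading angle of the target flow direction, in degrees
    # lane_target_points: mapping lane id -> (x, y) target point of that lane
    avg_angle = (track_heading + exit_heading) / 2.0
    best_lane, best_diff = None, math.inf
    for lane_id, (tx, ty) in lane_target_points.items():
        angle = math.degrees(math.atan2(ty - current_point[1], tx - current_point[0]))
        diff = abs((angle - avg_angle + 180.0) % 360.0 - 180.0)   # wrapped angular difference
        if diff < best_diff:
            best_lane, best_diff = lane_id, diff
    return best_lane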
The apparatus may further include:
The judging module is used for:
Determining a first distance between a first initial track point in the initial track data, which enters an intersection of the target lane, and a target point of the target lane;
Determining a second distance between a track point where the target vehicle is currently located and the first initial track point in the initial track data;
if the target flow direction is a left-turn flow direction and the ratio of the second distance to the first distance is not smaller than a left-turn threshold, the correction condition is satisfied;
if the target flow direction is a right-turn flow direction and the ratio of the second distance to the first distance is not smaller than a right-turn threshold, the correction condition is satisfied;
if the target flow direction is a straight-going flow direction, the ratio of the second distance to the first distance is not smaller than a straight-going threshold, and the target vehicle has failed in visual tracking, the correction condition is satisfied;
and if the target flow direction is a u-turn flow direction and the target vehicle has failed in visual tracking, the correction condition is satisfied (a sketch of this test follows).
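The flow-direction-specific correction test can be sketched as below; the threshold defaults and the tracking_failed flag are placeholders for quantities the patent leaves configurable.

def correction_condition_met(flow, first_dist, second_dist, tracking_failed,
                             left_threshold=0.8, right_threshold=0.8, straight_threshold=0.9):
    # first_dist:  distance from the first in-intersection trajectory point to the lane's target point
    # second_dist: distance from the vehicle's current trajectory point to that first point
    ratio = second_dist / first_dist if first_dist > 0 else 0.0
    if flow == "left":
        return ratio >= left_threshold
    if flow == "right":
        return ratio >= right_threshold
    if flow == "straight":
        return ratio >= straight_threshold and tracking_failed
    if flow == "u_turn":
        return tracking_failed
    return False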
The apparatus may also include a training module.
The generative adversarial imitation learning model is trained based on an adversarial network, and the adversarial network comprises a generator and a judge (discriminator);
the training module is configured to:
Acquiring a vehicle running track of an intersection, and determining a state action pair of an expert based on the vehicle running track of the intersection;
at the current time t, sampling out a set number of vehicles according to the setting of the course (curriculum) distribution set, as a plurality of sample vehicles;
acquiring sample information of each of the plurality of sample vehicles at its current position, wherein the sample information comprises the position information and motion information of the sample vehicle at the current position, the position information and motion information of its surrounding sample vehicles, and the lane information of the target sample lane;
processing sample information corresponding to each sample vehicle according to the current strategy of the generator to generate the track of each sample vehicle;
Determining a punishment value corresponding to the track of each sample vehicle;
scoring each state-action pair (o_t, a_t) in the track of each sample vehicle based on the judge to generate the reward value of each sample vehicle, wherein the reward value is determined from D_ω(o_t, a_t), the value produced by the judge with parameters ω for the state-action pair (o_t, a_t), and from C, the penalty value corresponding to the trajectory of the sample vehicle;
keeping the parameters of the judge unchanged, the policy parameters of the generator are updated based on the trust-region optimization method, which comprises solving the following constrained optimization problem:

max_θ  E_t[ (π_θ(a_t|o_t) / π_{θ_old}(a_t|o_t)) · Â_t ]   subject to   E_t[ D_KL( π_{θ_old}(·|o_t) ‖ π_θ(·|o_t) ) ] ≤ δ;

wherein θ represents the parameters of the policy π_θ; E_t represents the expectation; π_{θ_old} represents the current policy taken at time t, defined by the old parameters θ_old; π_θ represents the new policy; π_{θ_old}(a_t|o_t) represents the probability that the current policy takes action a_t under the observation condition o_t at time t; π_θ(a_t|o_t) represents the probability that the new policy takes action a_t under the observation condition o_t at time t; π_{θ_old}(·|o_t) represents the probability distribution of actions taken by the current policy under the observation condition o_t; π_θ(·|o_t) represents the probability distribution of actions taken by the new policy under the observation condition o_t; D_KL(·‖·) represents the KL (Kullback-Leibler) divergence between π_{θ_old}(·|o_t) and π_θ(·|o_t); δ represents a step-size parameter used to control the maximum variation of the policy in each optimization step; Â_t represents the advantage function, used to measure the degree of difference between the action value expectation Q(o_t, a_t) of taking action a_t under the observation condition o_t and the state value expectation V(o_t) estimated by the observer; the action a_t represents the behavior taken by the sample vehicle according to the policy;
the advantage function is estimated by the following generalized advantage estimation method: Â_t = Σ_{l≥0} (γλ)^l · δ_{t+l}, with the TD error δ_t = r_t + γV(o_{t+1}) − V(o_t); wherein γ represents the discount rate; λ is a parameter between 0 and 1 used to balance the weights of the TD (Temporal Difference) errors δ_t; r_t represents the reward value determined by the judge; V(o_t) and V(o_{t+1}) respectively represent the state value expectations at time t and time t+1;
keeping the policy parameters of the generator unchanged, the judgment parameters ω of the judge are updated based on the expert state-action pairs and the state-action pairs generated by the new policy of the generator; the judgment parameters ω are updated through the following objective function:

max_ω  E_{(o,a)∼ρ_{π'}}[log D_ω(o,a)] + E_{(o,a)∼ρ_{π_E}}[log(1 − D_ω(o,a))];

wherein π_E represents the expert policy and π' represents the new policy; ρ_{π'}(o,a) represents the probability that the state-action pair (o,a) is visited while executing policy π'; p_{π'}(o) represents the probability of being in state o at time t under policy π'; π'(a|o) represents the probability of taking action a in state o based on the current policy; ρ_{π_E}(o,a) represents the probability that the state-action pair (o,a) is visited while executing policy π_E; p_{π_E}(o) represents the probability of being in state o at time t under policy π_E; π_E(a|o) represents the probability of taking action a in state o based on the expert policy; D_ω(o,a) is a simplified representation of the value produced by the judge D with parameters ω for the state-action pair (o,a).
Determining, by the training module, the penalty value corresponding to the track of each sample vehicle may include:
determining the penalty value corresponding to the track of each sample vehicle through the penalty function C;
wherein the penalty function takes as inputs the minimum distance between any two sample vehicles, from which the collision penalty value is obtained; the closest distance of the sample vehicle to the road edge, taken as the smaller of its closest distance to the left road edge and its closest distance to the right road edge, from which the distance penalty value is obtained; whether the vehicle kinematic constraints are satisfied, from which the constraint penalty value is obtained; and the acceleration, from which the sudden-braking penalty value is obtained.
The collision penalty value may be determined by:
extracting the first n consecutive position points from the trajectory of the sample vehicle;
for each of the first n consecutive position points, marking the position point as abnormal if the sample vehicle did not collide with its surrounding vehicles before moving to the position point but collides with its surrounding vehicles at the position point; marking the position point as a candidate if the sample vehicle did not collide with its surrounding vehicles before moving to the position point and does not collide with its surrounding vehicles at the position point;
if there are position points marked as candidates among the first n consecutive position points, taking the last of the position points marked as candidates as the new current position of the sample vehicle, determining the penalty values corresponding to the position points marked as abnormal, and accumulating the penalty values corresponding to the position points marked as abnormal to obtain the collision penalty value;
if all of the first n consecutive position points are marked as abnormal, taking the first of the n consecutive position points as the new current position of the sample vehicle, determining the penalty values corresponding to the position points marked as abnormal, and accumulating the penalty values corresponding to the position points marked as abnormal to obtain the collision penalty value.
In another embodiment of the present application, there is provided an electronic apparatus including:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to enable the electronic device to implement a vehicle trajectory optimization method based on simulation learning as described in any one of embodiments 1-7.
In another embodiment of the application, a computer storage medium is provided, which carries one or more computer programs, which when executed by an electronic device, enable the electronic device to implement a method of vehicle trajectory optimization based on impersonation learning as introduced in any of embodiments 1-7.
It should be further noted that the above-described apparatus embodiments are merely illustrative, and that the units described as separate units may or may not be physically separate, and that units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiment of the device provided by the application, the connection relation between the modules represents that the modules have communication connection, and can be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus the necessary general-purpose hardware, or of course by means of special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components, and the like. Generally, any function performed by a computer program can also be implemented by corresponding hardware, and the specific hardware structures implementing the same function can vary, such as analog circuits, digital circuits, or dedicated circuits. In most cases, however, a software program implementation is the preferred embodiment of the present application. Based on such understanding, the technical solution of the present application, or the part of it contributing to the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB disk, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk of a computer, etc., comprising several instructions for causing a computer device (which may be a personal computer, a training device, a network device, etc.) to perform the method according to the embodiments of the present application.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a training device or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.

Claims (9)

1. A vehicle trajectory optimization method based on imitation learning, characterized by comprising:
obtaining initial trajectory data of a target vehicle;
determining a target flow direction of the target vehicle;
screening a target lane from at least one lane corresponding to the target flow direction based on the initial trajectory data;
if the initial trajectory data satisfies a correction condition, extracting lane information of the target lane from configuration information of an intersection, and obtaining current position information and current motion information of the target vehicle and its surrounding vehicles based on the initial trajectory data;
inputting the current position information and current motion information of the target vehicle and its surrounding vehicles and the lane information of the target lane into a generative adversarial imitation learning model to obtain target trajectory data determined by the generative adversarial imitation learning model; the generative adversarial imitation learning model is trained based on actual vehicle running tracks of the intersection and is used to imitate an expert policy and generate action parameters similar to actual vehicle behavior to obtain the target trajectory data; the generative adversarial imitation learning model is trained based on an adversarial network, the adversarial network comprising a generator and a judge; the process of training the generative adversarial imitation learning model based on the adversarial network comprises: acquiring vehicle running tracks of the intersection, and determining expert state-action pairs based on the vehicle running tracks of the intersection; at the current time t, sampling out a set number of vehicles according to the setting of the course distribution set, as a plurality of sample vehicles; acquiring sample information of each of the plurality of sample vehicles at its current position, the sample information comprising the position information and motion information of the sample vehicle at the current position, the position information and motion information of its surrounding sample vehicles, and the lane information of the target sample lane; processing the sample information corresponding to each sample vehicle according to the current policy of the generator to generate the track of each sample vehicle; determining the penalty value corresponding to the track of each sample vehicle; scoring each state-action pair (o_t, a_t) in the track of each sample vehicle based on the judge to generate the reward value of each sample vehicle, the reward value being determined from D_ω(o_t, a_t), the value produced by the judge with parameters ω for the state-action pair (o_t, a_t), and from C, the penalty value corresponding to the trajectory of the sample vehicle; keeping the parameters of the judge unchanged and updating the policy parameters of the generator based on a trust-region optimization method, the updating comprising solving the following constrained optimization problem:
max_θ  E_t[ (π_θ(a_t|o_t) / π_{θ_old}(a_t|o_t)) · Â_t ]   subject to   E_t[ D_KL( π_{θ_old}(·|o_t) ‖ π_θ(·|o_t) ) ] ≤ δ;
wherein θ denotes the parameters of the policy π_θ; E_t denotes the expectation; π_{θ_old} denotes the current policy taken at time t, defined by the old parameters θ_old; π_θ denotes the new policy; π_{θ_old}(a_t|o_t) denotes the probability that the current policy takes action a_t under the observation condition o_t at time t; π_θ(a_t|o_t) denotes the probability that the new policy takes action a_t under the observation condition o_t at time t; π_{θ_old}(·|o_t) denotes the probability distribution of actions taken by the current policy under the observation condition o_t; π_θ(·|o_t) denotes the probability distribution of actions taken by the new policy under the observation condition o_t; D_KL(·‖·) denotes the KL (Kullback-Leibler) divergence between π_{θ_old}(·|o_t) and π_θ(·|o_t); δ denotes a step-size parameter used to control the maximum variation of the policy in each optimization step; Â_t denotes the advantage function, used to measure the degree of difference between the action value expectation Q(o_t, a_t) of taking action a_t under the observation condition o_t and the state value expectation V(o_t) estimated by the observer; the action a_t denotes the behavior taken by the sample vehicle according to the policy; the advantage function is estimated by the following generalized advantage estimation method:
Â_t = Σ_{l≥0} (γλ)^l · δ_{t+l},  with the TD error δ_t = r_t + γV(o_{t+1}) − V(o_t);
wherein γ denotes the discount rate; λ is a parameter between 0 and 1 used to balance the weights of the TD (Temporal Difference) errors δ_t; r_t denotes the reward value determined by the judge; V(o_t) and V(o_{t+1}) respectively denote the state value expectations at time t and time t+1; keeping the policy parameters of the generator unchanged and updating the judgment parameters ω of the judge based on the expert state-action pairs and the state-action pairs generated by the new policy of the generator, the judgment parameters ω being updated through the following objective function:
max_ω  E_{(o,a)∼ρ_{π'}}[log D_ω(o,a)] + E_{(o,a)∼ρ_{π_E}}[log(1 − D_ω(o,a))];
wherein π_E denotes the expert policy and π' denotes the new policy; ρ_{π'}(o,a) denotes the probability that the state-action pair (o,a) is visited while executing policy π'; p_{π'}(o) denotes the probability of being in state o at time t under policy π'; π'(a|o) denotes the probability of taking action a in state o based on the current policy; ρ_{π_E}(o,a) denotes the probability that the state-action pair (o,a) is visited while executing policy π_E; p_{π_E}(o) denotes the probability of being in state o at time t under policy π_E; π_E(a|o) denotes the probability of taking action a in state o based on the expert policy; D_ω(o,a) is a simplified representation of the value produced by the judge D with parameters ω for the state-action pair (o,a); and
correcting the initial trajectory data based on the target trajectory data.
2. The vehicle trajectory optimization method based on imitation learning according to claim 1, characterized in that determining the target flow direction of the target vehicle comprises:
if the target vehicle has locked a flow direction, taking the locked flow direction of the target vehicle as the target flow direction of the target vehicle;
if the target vehicle has not locked a flow direction and the lane flow direction of the entrance road of the target vehicle is configured as a single flow direction, taking the single flow direction as the target flow direction of the target vehicle;
if the target vehicle has not locked a flow direction and the lane flow direction of the entrance road of the target vehicle is configured as multiple flow directions, selecting, by comparing the vehicle flows corresponding to the respective flow directions of the multiple flow directions, the flow direction with the largest vehicle flow from the multiple flow directions as the target flow direction of the target vehicle;
if the vehicle flows corresponding to the respective flow directions of the multiple flow directions are identical, determining a temporary target point based on the historical track of the target vehicle, and if the temporary target point lies in one of the multiple flow directions, taking the flow direction of the multiple flow directions that contains the temporary target point as the target flow direction of the target vehicle.
3. The vehicle trajectory optimization method based on imitation learning according to claim 1, characterized in that screening the target lane from the at least one lane corresponding to the target flow direction based on the initial trajectory data comprises:
determining the latest track heading angle of the target vehicle based on the trajectory point of the initial trajectory data at which the target vehicle is currently located;
acquiring the exit-road heading angle corresponding to the target flow direction of the target vehicle;
determining the average included angle between the latest track heading angle and the exit-road heading angle;
determining the angle between the target point of each lane of the at least one lane corresponding to the target flow direction and the trajectory point at which the target vehicle is currently located, and taking the lane with the smallest difference between that angle and the average included angle as the target lane.
4. The vehicle trajectory optimization method based on imitation learning according to claim 1, characterized in that whether the initial trajectory data satisfies the correction condition is judged in the following manner:
determining a first distance between the first initial trajectory point in the initial trajectory data that enters the intersection of the target lane and the target point of the target lane;
determining a second distance between the trajectory point of the initial trajectory data at which the target vehicle is currently located and the first initial trajectory point;
if the target flow direction is a left-turn flow direction and the ratio of the second distance to the first distance is not smaller than a left-turn threshold, the correction condition is satisfied;
if the target flow direction is a right-turn flow direction and the ratio of the second distance to the first distance is not smaller than a right-turn threshold, the correction condition is satisfied;
if the target flow direction is a straight-going flow direction, the ratio of the second distance to the first distance is not smaller than a straight-going threshold, and the target vehicle has failed in visual tracking, the correction condition is satisfied;
if the target flow direction is a u-turn flow direction and the target vehicle has failed in visual tracking, the correction condition is satisfied.
5. The vehicle trajectory optimization method based on imitation learning according to claim 1, characterized in that determining the penalty value corresponding to the track of each sample vehicle comprises:
determining the penalty value corresponding to the track of each sample vehicle through the penalty function C; wherein the penalty function takes as inputs the minimum distance between any two sample vehicles, from which the collision penalty value is obtained; the closest distance of the sample vehicle to the road edge, taken as the smaller of its closest distance to the left road edge and its closest distance to the right road edge, from which the distance penalty value is obtained; whether the vehicle kinematic constraints are satisfied, from which the constraint penalty value is obtained; and the acceleration, from which the sudden-braking penalty value is obtained.
6. The vehicle trajectory optimization method based on imitation learning according to claim 5, characterized in that the collision penalty value is determined in the following manner:
extracting the first n consecutive position points from the trajectory of the sample vehicle;
for each of the first n consecutive position points, marking the position point as abnormal if the sample vehicle did not collide with its surrounding vehicles before moving to the position point but collides with its surrounding vehicles at the position point; marking the position point as a candidate if the sample vehicle did not collide with its surrounding vehicles before moving to the position point and does not collide with its surrounding vehicles at the position point;
if there are position points marked as candidates among the first n consecutive position points, taking the last of the position points marked as candidates as the new current position of the sample vehicle, determining the penalty values corresponding to the position points marked as abnormal, and accumulating the penalty values corresponding to the position points marked as abnormal to obtain the collision penalty value;
if all of the first n consecutive position points are marked as abnormal, taking the first of the n consecutive position points as the new current position of the sample vehicle, determining the penalty values corresponding to the position points marked as abnormal, and accumulating the penalty values corresponding to the position points marked as abnormal to obtain the collision penalty value.
7. A vehicle trajectory optimization device based on imitation learning, characterized by comprising:
a first obtaining module, configured to obtain initial trajectory data of a target vehicle;
a first determining module, configured to determine a target flow direction of the target vehicle;
a screening module, configured to screen a target lane from at least one lane corresponding to the target flow direction based on the initial trajectory data;
a second obtaining module, configured to, if the initial trajectory data satisfies a correction condition, extract lane information of the target lane from configuration information of an intersection, and obtain current position information and current motion information of the target vehicle and its surrounding vehicles based on the initial trajectory data;
a second determining module, configured to input the current position information and current motion information of the target vehicle and its surrounding vehicles and the lane information of the target lane into a generative adversarial imitation learning model to obtain target trajectory data determined by the generative adversarial imitation learning model; the generative adversarial imitation learning model is trained based on actual vehicle running tracks of the intersection and is used to imitate an expert policy and generate action parameters similar to actual vehicle behavior to obtain the target trajectory data; the generative adversarial imitation learning model is trained based on an adversarial network, the adversarial network comprising a generator and a judge; the process of training the generative adversarial imitation learning model based on the adversarial network comprises: acquiring vehicle running tracks of the intersection, and determining expert state-action pairs based on the vehicle running tracks of the intersection; at the current time t, sampling out a set number of vehicles according to the setting of the course distribution set, as a plurality of sample vehicles; acquiring sample information of each of the plurality of sample vehicles at its current position, the sample information comprising the position information and motion information of the sample vehicle at the current position, the position information and motion information of its surrounding sample vehicles, and the lane information of the target sample lane; processing the sample information corresponding to each sample vehicle according to the current policy of the generator to generate the track of each sample vehicle; determining the penalty value corresponding to the track of each sample vehicle; scoring each state-action pair (o_t, a_t) in the track of each sample vehicle based on the judge to generate the reward value of each sample vehicle, the reward value being determined from D_ω(o_t, a_t), the value produced by the judge with parameters ω for the state-action pair (o_t, a_t), and from C, the penalty value corresponding to the trajectory of the sample vehicle; keeping the parameters of the judge unchanged and updating the policy parameters of the generator based on a trust-region optimization method, the updating comprising solving the following constrained optimization problem:
max_θ  E_t[ (π_θ(a_t|o_t) / π_{θ_old}(a_t|o_t)) · Â_t ]   subject to   E_t[ D_KL( π_{θ_old}(·|o_t) ‖ π_θ(·|o_t) ) ] ≤ δ;
wherein θ denotes the parameters of the policy π_θ; E_t denotes the expectation; π_{θ_old} denotes the current policy taken at time t, defined by the old parameters θ_old; π_θ denotes the new policy; π_{θ_old}(a_t|o_t) denotes the probability that the current policy takes action a_t under the observation condition o_t at time t; π_θ(a_t|o_t) denotes the probability that the new policy takes action a_t under the observation condition o_t at time t; π_{θ_old}(·|o_t) denotes the probability distribution of actions taken by the current policy under the observation condition o_t; π_θ(·|o_t) denotes the probability distribution of actions taken by the new policy under the observation condition o_t; D_KL(·‖·) denotes the KL (Kullback-Leibler) divergence between π_{θ_old}(·|o_t) and π_θ(·|o_t); δ denotes a step-size parameter used to control the maximum variation of the policy in each optimization step; Â_t denotes the advantage function, used to measure the degree of difference between the action value expectation Q(o_t, a_t) of taking action a_t under the observation condition o_t and the state value expectation V(o_t) estimated by the observer; the action a_t denotes the behavior taken by the sample vehicle according to the policy; the advantage function is estimated by the following generalized advantage estimation method:
Â_t = Σ_{l≥0} (γλ)^l · δ_{t+l},  with the TD error δ_t = r_t + γV(o_{t+1}) − V(o_t);
wherein γ denotes the discount rate; λ is a parameter between 0 and 1 used to balance the weights of the TD (Temporal Difference) errors δ_t; r_t denotes the reward value determined by the judge; V(o_t) and V(o_{t+1}) respectively denote the state value expectations at time t and time t+1; keeping the policy parameters of the generator unchanged and updating the judgment parameters ω of the judge based on the expert state-action pairs and the state-action pairs generated by the new policy of the generator, the judgment parameters ω being updated through the following objective function:
max_ω  E_{(o,a)∼ρ_{π'}}[log D_ω(o,a)] + E_{(o,a)∼ρ_{π_E}}[log(1 − D_ω(o,a))];
wherein π_E denotes the expert policy and π' denotes the new policy; ρ_{π'}(o,a) denotes the probability that the state-action pair (o,a) is visited while executing policy π'; p_{π'}(o) denotes the probability of being in state o at time t under policy π'; π'(a|o) denotes the probability of taking action a in state o based on the current policy; ρ_{π_E}(o,a) denotes the probability that the state-action pair (o,a) is visited while executing policy π_E; p_{π_E}(o) denotes the probability of being in state o at time t under policy π_E; π_E(a|o) denotes the probability of taking action a in state o based on the expert policy; D_ω(o,a) is a simplified representation of the value produced by the judge D with parameters ω for the state-action pair (o,a); and
a correction module, configured to correct the initial trajectory data based on the target trajectory data.
8. An electronic device, characterized by comprising:
a memory, configured to store a computer program; and
a processor, configured to execute the computer program so that the electronic device can implement the vehicle trajectory optimization method based on imitation learning according to any one of claims 1 to 6.
9. A computer storage medium, characterized in that the storage medium carries one or more computer programs which, when executed by an electronic device, enable the electronic device to implement the vehicle trajectory optimization method based on imitation learning according to any one of claims 1 to 6.
CN202411780013.5A 2024-12-05 2024-12-05 A vehicle trajectory optimization method based on imitation learning and related device Active CN119252066B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411780013.5A CN119252066B (en) 2024-12-05 2024-12-05 A vehicle trajectory optimization method based on imitation learning and related device


Publications (2)

Publication Number Publication Date
CN119252066A (en) 2025-01-03
CN119252066B (en) 2025-03-25

Family

ID=94016773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411780013.5A Active CN119252066B (en) 2024-12-05 2024-12-05 A vehicle trajectory optimization method based on imitation learning and related device

Country Status (1)

Country Link
CN (1) CN119252066B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120452201B (en) * 2025-06-16 2025-09-26 浙江中控信息产业股份有限公司 Intersection inlet flow direction identification method for preferential passing of pilot line


Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111483468B (en) * 2020-04-24 2021-09-07 广州大学 A lane-changing decision-making method and system for unmanned vehicles based on adversarial imitation learning
US20230045360A1 (en) * 2021-07-14 2023-02-09 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Imitation Learning
US20230082365A1 (en) * 2021-09-16 2023-03-16 Waymo Llc Generating simulated agent trajectories using parallel beam search
CN114742317A (en) * 2022-05-11 2022-07-12 苏州易航远智智能科技有限公司 Vehicle track prediction method and device, electronic equipment and storage medium
CN115447574B (en) * 2022-08-25 2025-11-07 上汽大众汽车有限公司 Intersection vehicle track correction method and system combining signal lamp perception
CN117808113A (en) * 2022-09-23 2024-04-02 毫末智行科技有限公司 Training method and device of track planning model, terminal equipment and storage medium
CN116153084B (en) * 2023-04-20 2023-09-08 智慧互通科技股份有限公司 Vehicle flow direction prediction method, prediction system and urban traffic signal control method
CN118560530B (en) * 2024-08-02 2024-10-01 杭州电子科技大学 Multi-agent driving behavior modeling method based on generation of countermeasure imitation learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110789528A (en) * 2019-08-29 2020-02-14 腾讯科技(深圳)有限公司 Vehicle driving track prediction method, device, equipment and storage medium
CN114004406A (en) * 2021-11-03 2022-02-01 京东鲲鹏(江苏)科技有限公司 Vehicle trajectory prediction method, device, storage medium and electronic device

Also Published As

Publication number Publication date
CN119252066A (en) 2025-01-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant