CN107544516A

CN107544516A - Automated driving system and method based on relative entropy depth against intensified learning

Info

Publication number: CN107544516A
Application number: CN201710940590.XA
Authority: CN
Inventors: 林嘉豪; 章宗长
Original assignee: Suzhou University
Current assignee: Nanqi Xiance Nanjing Technology Co ltd
Priority date: 2017-10-11
Filing date: 2017-10-11
Publication date: 2018-01-05
Also published as: WO2019071909A1

Abstract

The invention relates to an automatic driving system based on relative entropy deep inverse reinforcement learning, including: (1) client: display driving strategy; (2) driving basic data collection subsystem: collect road information; (3) storage module: with The client terminal and the driving basic data acquisition subsystem are connected to and store the road information collected by the driving basic data acquisition subsystem; wherein, the driving basic data acquisition subsystem collects road information and transmits the road information to the client terminal and the storage module, The storage module receives the road information, and stores a continuous section of road information as a historical track, analyzes and calculates the driving strategy based on the historical track, and the storage module transmits the driving strategy to the client for user selection, and the client receives the road information And implement automatic driving according to the user's choice. The system of the present invention adopts a deep inverse reinforcement learning algorithm of relative entropy to realize automatic driving without models.

Description

Automatic driving system and method based on relative entropy deep inverse reinforcement learning

技术领域technical field

本发明涉及一种基于相对熵深度逆强化学习的自动驾驶系统及方法，属于自动驾驶技术领域。The invention relates to an automatic driving system and method based on relative entropy deep inverse reinforcement learning, belonging to the technical field of automatic driving.

背景技术Background technique

随着我国汽车持有量的增加，道路交通拥堵现象越来越严重，每年发生的交通事故也在不断上升，为了更好的解决这一问题，研究和开发汽车自动驾驶系统很有必要。且随着人们对生活质量追求的提升，人们希望从疲劳的驾驶活动中得到解放，自动驾驶技术应运而生。With the increase of the number of cars in our country, road traffic congestion is becoming more and more serious, and the annual traffic accidents are also rising. In order to better solve this problem, it is necessary to research and develop auto-driving systems. And with the improvement of people's pursuit of quality of life, people hope to be liberated from tiring driving activities, and automatic driving technology has emerged as the times require.

现有的一种汽车自动驾驶系统是由装在驾驶室的摄像机和图像识别系统辨别驾驶环境，然后由车载主控计算机、GPS定位系统和路径规划软件根据预先存好的道路地图等信息对车辆进行导航，在车辆的当前位置和目的地之间规划出合理的行驶路径将车辆导向目的地。An existing auto-driving system for a vehicle uses a camera installed in the driver's cab and an image recognition system to identify the driving environment, and then the vehicle's main control computer, GPS positioning system, and path planning software provide information on the vehicle based on pre-stored road maps and other information. Carry out navigation, plan a reasonable driving path between the current position of the vehicle and the destination, and guide the vehicle to the destination.

上述汽车自动驾驶系统中，由于道路地图是预存于车辆内，其数据的更新依赖于驾驶员的人工操作，更新频率不能够保证，并且，即使驾驶员能够做到及时更新，也可能由于现有资源里没有关于道路的最新信息而使得最终得到的资料不能够反应当下的道路情况，最终造成行车路线不合理，导航准确率不高，给行车带来不便。并且，目前在自动驾驶技术领域的大部分汽车自动驾驶系统还需要人工进行干预，并不能达到完全的自动驾驶的地步。In the above autopilot system, since the road map is pre-stored in the vehicle, the update of its data depends on the manual operation of the driver, and the update frequency cannot be guaranteed. There is no latest information about roads in the resources, so the final data cannot reflect the current road conditions, which eventually leads to unreasonable driving routes and low navigation accuracy, which brings inconvenience to driving. Moreover, most of the current autopilot systems in the field of autopilot technology still require manual intervention, and cannot achieve complete autopilot.

发明内容Contents of the invention

本发明的目的在于提供一种基于相对熵深度逆强化学习的自动驾驶系统及方法，利用深度神经网络结构并输入用户驾驶员的历史驾驶轨迹信息，获取多种代表个性驾驶习惯的驾驶策略，通过这些驾驶策略进行个性、智能的自动驾驶。The purpose of the present invention is to provide an automatic driving system and method based on relative entropy deep inverse reinforcement learning, which uses a deep neural network structure and inputs the historical driving track information of the user driver to obtain a variety of driving strategies representing individual driving habits. These driving strategies carry out personalized and intelligent automatic driving.

为达到上述目的，本发明提供如下技术方案：一种基于相对熵深度逆强化学习的自动驾驶系统，所述系统包括：In order to achieve the above object, the present invention provides the following technical solution: an automatic driving system based on relative entropy deep inverse reinforcement learning, the system includes:

客户端：显示驾驶策略；Client: display driving strategy;

驾驶基础数据采集子系统：采集道路信息；Driving basic data acquisition subsystem: collect road information;

存储模块：与所述客户端及驾驶基础数据采集子系统连接并存储所述驾驶基础数据采集子系统所采集到的道路信息；Storage module: connected with the client and the driving basic data collection subsystem and storing the road information collected by the driving basic data collection subsystem;

其中，所述驾驶基础数据采集子系统采集道路信息并将所述道路信息传输给所述客户端及存储模块，所述存储模块接收所述道路信息，并将持续的一段道路信息存储为历史轨迹，根据所述历史轨迹进行分析计算模拟出驾驶策略，所述存储模块将所述驾驶策略传输至客户端以供用户选择，所述客户端接受并根据所述道路信息和用户个性选择的所述驾驶策略实施自动驾驶。Wherein, the driving basic data collection subsystem collects road information and transmits the road information to the client and the storage module, and the storage module receives the road information and stores a continuous section of road information as a historical track , analyze and calculate the driving strategy according to the historical trajectory, the storage module transmits the driving strategy to the client for selection by the user, and the client accepts and selects the driving strategy based on the road information and user personality Driving strategy implements autonomous driving.

进一步地，所述存储模块包括用于存储历史驾驶轨迹的驾驶轨迹库、根据驾驶轨迹及驾驶习惯计算并模拟出驾驶策略的轨迹信息处理子系统及存储驾驶策略的驾驶策略库；所述驾驶轨迹库将驾驶轨迹数据传输给所述轨迹信息处理子系统，所述轨迹信息处理子系统根据所述驾驶轨迹数据分析计算并模拟出驾驶策略并传输给所述驾驶策略库，所述驾驶策略库接收并存储所述驾驶策略。Further, the storage module includes a driving trajectory library for storing historical driving trajectories, a trajectory information processing subsystem for calculating and simulating a driving strategy according to driving trajectories and driving habits, and a driving strategy library for storing driving strategies; the driving trajectory The library transmits the driving trajectory data to the trajectory information processing subsystem, and the trajectory information processing subsystem analyzes and calculates and simulates a driving strategy based on the driving trajectory data and transmits it to the driving strategy library, and the driving strategy library receives And store the driving strategy.

进一步地，所述轨迹信息处理子系统采用多目标的相对熵深度逆强化学习算法计算并模拟驾驶策略。Further, the trajectory information processing subsystem adopts a multi-objective relative entropy deep inverse reinforcement learning algorithm to calculate and simulate the driving strategy.

进一步地，所述多目标的逆强化学习算法采用EM算法框架嵌套相对熵深度逆强化学习计算多奖赏函数的参数。Further, the multi-objective inverse reinforcement learning algorithm uses the EM algorithm framework to nest the relative entropy depth inverse reinforcement learning to calculate the parameters of the multi-reward function.

进一步地，所述驾驶基础数据采集子系统包括用于采集道路信息的传感器。Further, the driving basic data collection subsystem includes sensors for collecting road information.

本发明还提供了一种基于相对熵深度逆强化学习的自动驾驶的方法，所述方法包括如下步骤：The present invention also provides a method for automatic driving based on relative entropy deep inverse reinforcement learning, said method comprising the following steps:

包括如下步骤：Including the following steps:

S1：采集道路信息并将所述道路信息传输给客户端及存储模块；S1: collect road information and transmit the road information to the client and the storage module;

S2：所述存储模块接收所述道路信息并将持续的一段道路信息存储为历史轨迹，根据所述历史轨迹分析计算并模拟多种驾驶策略，并将所述驾驶策略传递给所述客户端；S2: The storage module receives the road information and stores a continuous piece of road information as a historical trajectory, analyzes and simulates various driving strategies according to the historical trajectory, and transmits the driving strategy to the client;

S3：所述客户端接收所述道路信息及驾驶策略，并根据用户选择的个性驾驶策略及道路信息实施自动驾驶。S3: The client receives the road information and driving strategy, and implements automatic driving according to the personalized driving strategy and road information selected by the user.

进一步地，所述存储模块包括用于存储历史驾驶轨迹的驾驶轨迹库、根据驾驶规划及驾驶习惯计算并模拟出驾驶策略的轨迹信息处理子系统及存储驾驶策略的驾驶策略库；所述驾驶轨迹库将驾驶轨迹数据传输给所述轨迹信息处理子系统，所述轨迹信息处理子系统根据所述驾驶轨迹数据分析计算并模拟出驾驶策略并传输给所述驾驶策略库，所述驾驶策略库接收并存储所述驾驶策略。Further, the storage module includes a driving trajectory library for storing historical driving trajectories, a trajectory information processing subsystem for calculating and simulating a driving strategy according to driving planning and driving habits, and a driving strategy library for storing driving strategies; the driving trajectory The library transmits the driving trajectory data to the trajectory information processing subsystem, and the trajectory information processing subsystem analyzes and calculates and simulates a driving strategy based on the driving trajectory data and transmits it to the driving strategy library, and the driving strategy library receives And store the driving strategy.

本发明的有益效果在于：通过在系统中设置驾驶基础数据采集子系统，实时采集道路信息，并将道路信息传递给存储模块，存储模块接收道路信息后并将持续的一段道路信息存储为历史轨迹，根据历史驾驶轨迹模拟驾驶策略，实现个性、智能的自动驾驶。The beneficial effects of the present invention are: by setting the driving basic data collection subsystem in the system, the road information is collected in real time, and the road information is transmitted to the storage module, and the storage module receives the road information and stores a continuous section of road information as a historical track , to simulate driving strategies based on historical driving trajectories to realize personalized and intelligent automatic driving.

上述说明仅是本发明技术方案的概述，为了能够更清楚了解本发明的技术手段，并可依照说明书的内容予以实施，以下以本发明的较佳实施例并配合附图详细说明如后。The above description is only an overview of the technical solutions of the present invention. In order to understand the technical means of the present invention more clearly and implement them according to the contents of the description, the preferred embodiments of the present invention and accompanying drawings are described in detail below.

附图说明Description of drawings

图1为本发明的基于相对熵深度逆强化学习的自动驾驶系统及方法的流程图。FIG. 1 is a flow chart of the automatic driving system and method based on relative entropy deep inverse reinforcement learning of the present invention.

图2为马尔科夫决策过程MDP示意图。Figure 2 is a schematic diagram of a Markov decision process MDP.

具体实施方式detailed description

下面结合附图和实施例，对本发明的具体实施方式作进一步详细描述。以下实施例用于说明本发明，但不用来限制本发明的范围。The specific implementation manners of the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. The following examples are used to illustrate the present invention, but are not intended to limit the scope of the present invention.

请参见图1，本发明的一较佳实施例的基于相对熵深度逆强化学习的自动驾驶系统包括：Please refer to Fig. 1, the automatic driving system based on relative entropy deep inverse reinforcement learning of a preferred embodiment of the present invention includes:

客户端1：显示驾驶策略；Client 1: display driving strategy;

驾驶基础数据采集子系统2：采集道路信息；Driving basic data acquisition subsystem 2: collect road information;

存储模块3：与所述客户端1及驾驶基础数据采集子系统2连接并存储所述驾驶基础数据采集子系统2所采集到的道路信息；Storage module 3: connected with the client 1 and the driving basic data collection subsystem 2 and storing the road information collected by the driving basic data collection subsystem 2;

其中，所述驾驶基础数据采集子系统2采集道路信息并将所述道路信息传输给所述客户端1及存储模块3，所述存储模块3接收所述道路信息，并将持续的一段道路信息存储为历史轨迹，根据所述历史轨迹进行分析计算模拟出驾驶策略，所述存储模块3将所述驾驶策略传输至客户端1以供用户选择，所述客户端1接收所述道路信息并根据用户选择的个性驾驶策略实施自动驾驶。在本实施例中，所述存储模块3为云端。Wherein, the driving basic data collection subsystem 2 collects road information and transmits the road information to the client 1 and the storage module 3, and the storage module 3 receives the road information and stores a continuous section of road information It is stored as a historical trajectory, and the driving strategy is simulated according to the analysis and calculation of the historical trajectory. The storage module 3 transmits the driving strategy to the client 1 for the user to choose, and the client 1 receives the road information and according to the The personalized driving strategy selected by the user implements automatic driving. In this embodiment, the storage module 3 is a cloud.

所述客户端1最主要的功能是与用户完成人机交互过程，提供给个性的、智能的多种驾驶策略选择以及服务。客户端1根据用户个性的驾驶策略选择，从云端3驾驶策略库33下载相应的驾驶策略，接着根据驾驶策略和基础数据进行实时的驾驶决策，实现实时的无人驾驶控制。The main function of the client 1 is to complete the human-computer interaction process with the user, and provide personalized and intelligent multiple driving strategy selections and services. The client 1 downloads the corresponding driving strategy from the cloud 3 driving strategy library 33 according to the user's personalized driving strategy selection, and then makes real-time driving decisions according to the driving strategy and basic data to realize real-time unmanned driving control.

所述驾驶基础数据采集子系统2通过传感器(未图示)采集道路信息。采集到的信息有两个用途：将信息传递给客户端1，为当前的驾驶决策提供基础数据；将信息传递到云端3的驾驶轨迹库31，存储为用户驾驶员的历史驾驶轨迹数据。The driving basic data collection subsystem 2 collects road information through sensors (not shown). The collected information has two purposes: transfer the information to the client 1 to provide basic data for the current driving decision; transfer the information to the driving track database 31 in the cloud 3 and store it as the historical driving track data of the user driver.

所述云端3包括用于历史驾驶轨迹的驾驶轨迹库31、根据驾驶规划及驾驶习惯计算并模拟出驾驶策略的轨迹信息处理子系统32及存储驾驶策略的驾驶策略库33；所述驾驶轨迹库31将驾驶轨迹数据传输给所述轨迹信息处理子系统32，所述轨迹信息处理子系统32根据所述驾驶轨迹数据分析计算并模拟出驾驶策略并传输给所述驾驶策略库33，所述驾驶策略库33接收并存储所述驾驶策略。所述轨迹信息处理子系统32采用多目标的相对熵深度逆强化学习算法计算并模拟驾驶策略。在本实施例中，所述多目标的逆强化学习算法采用EM算法框架嵌套相对熵深度逆强化学习计算多奖赏函数的参数。所述历史驾驶轨迹包括专家历史驾驶轨迹及用户的历史轨迹。The cloud 3 includes a driving track library 31 for historical driving tracks, a track information processing subsystem 32 that calculates and simulates a driving strategy according to driving planning and driving habits, and a driving strategy library 33 that stores driving strategies; the driving track library 31 transmits the driving trajectory data to the trajectory information processing subsystem 32, and the trajectory information processing subsystem 32 analyzes and calculates and simulates a driving strategy according to the driving trajectory data and transmits it to the driving strategy library 33. The strategy library 33 receives and stores the driving strategies. The trajectory information processing subsystem 32 uses a multi-objective relative entropy deep inverse reinforcement learning algorithm to calculate and simulate driving strategies. In this embodiment, the multi-objective inverse reinforcement learning algorithm uses the EM algorithm framework to nest relative entropy depth inverse reinforcement learning to calculate the parameters of multiple reward functions. The historical driving trajectories include expert historical driving trajectories and user historical trajectories.

所述逆强化学习IRL是指在环境已知的马尔科夫决策过程MDP中奖赏函数R未知的问题。在一般的强化学习问题RL中，往往利用已知的环境、给定的奖赏函数R以及马尔科夫性质来估计一个状态动作对的值Q(s,a)(也可称为动作累积奖赏值)，然后利用收敛的各个状态动作对的值Q(s,a)来求取策略π，智能体(Agent)便可利用策略π进行决策。在现实中，奖赏函数R往往是极难获知的，但是一些优秀的轨迹T^N是比较容易获得的。在奖赏函数未知的马尔科夫决策过程MDP/R中，利用优秀的轨迹T^N还原奖赏函数R的问题被称为逆强化学习问题IRL。The inverse reinforcement learning IRL refers to the problem that the reward function R is unknown in the environment-known Markov decision process MDP. In the general reinforcement learning problem RL, the known environment, the given reward function R and the Markov property are often used to estimate the value Q(s, a) of a state-action pair (also called the action cumulative reward value ), and then use the converged value Q(s, a) of each state-action pair to obtain the strategy π, and the agent (Agent) can use the strategy π to make decisions. In reality, the reward function R is often extremely difficult to obtain, but some excellent trajectories T ^N are relatively easy to obtain. In the Markov decision process MDP/R where the reward function is unknown, the problem of using an excellent trajectory T ^N to restore the reward function R is called the inverse reinforcement learning problem IRL.

在本实施例中，利用所述驾驶轨迹库31中已知的用户历史驾驶轨迹数据，进行相对熵深度逆强化学习，还原出多种用户个性的奖赏函数R，进而模拟出相应的驾驶策略π。相对熵深度逆强化学习算法是一种无模型的算法，无需已知环境模型中的状态转移函数T(s,a,s′)，相对熵逆强化学习算法可以利用重要性采样的方法在计算中避开状态转移函数T(s,a,s′)。In this embodiment, the user’s historical driving trajectory data known in the driving trajectory database 31 is used to perform relative entropy depth inverse reinforcement learning, restore the reward function R of various user personalities, and then simulate the corresponding driving strategy π . The relative entropy deep inverse reinforcement learning algorithm is a model-free algorithm. It does not need to know the state transition function T(s, a, s′) in the environment model. The relative entropy inverse reinforcement learning algorithm can use the method of importance sampling to calculate Avoid the state transition function T(s, a, s′) in it.

在本实施例中，汽车的自动驾驶决策过程是一个没有奖赏函数的马尔科夫决策过程MDP/R，可以表示为集合{状态空间S,动作空间A，环境定义的状态转移概率T(省略对环境转移概率T的要求)。汽车Agent的值函数(累计奖赏值)可以表示为而汽车Agent的状态动作值函数可以表示为Q(s,a)＝R_θ(s,a)+γE_T(s,a,s′)[V(s′)]。为了解决更加复杂的真实驾驶问题，对奖赏函数的假设不再只是简单的线性组合，而是假设为一个深度神经网络R(s,a,θ)＝g₁(g₂(…(g_n(f(s,a),θ_n),…),θ₂),θ₁),其中f(s,a)表示(s,a)处的驾驶的道路特征信息，θ_i表示深度神经网络第i层的参数。In this embodiment, the automatic driving decision process of the car is a Markov decision process MDP/R without a reward function, which can be expressed as a set {state space S, action space A, state transition probability T defined by the environment (omitted for environment transition probability T requirements). The value function (cumulative reward value) of the car Agent can be expressed as The state-action-value function of the vehicle Agent can be expressed as Q(s,a)=R _θ (s,a)+γE _T(s,a,s′) [V(s′)]. In order to solve more complex real driving problems, the assumption of the reward function is no longer a simple linear combination, but a deep neural network R(s,a,θ)=g ₁ (g ₂ (…(g _n ( f(s,a),θ _n ),…),θ ₂ ),θ ₁ ), where f(s,a) represents the road feature information of driving at (s,a), θ _i represents the deep neural network The parameters of layer i.

同时，为了满足更个性、更智能的真实驾驶场景，假设有多个奖赏函数R(目标)同时存在，代表用户驾驶员不同的驾驶习惯。假设存在G个奖赏函数，令这G个奖赏函数的先验概率分布为ρ₁,…,ρ_G,奖赏权重为θ¹,…,θ^G,令Θ＝(ρ₁,…,ρ_G,θ¹,…,θ^G),表示这G个奖赏函数的参数集合。At the same time, in order to meet more personalized and intelligent real driving scenarios, it is assumed that there are multiple reward functions R (objectives) existing at the same time, representing the different driving habits of the user driver. Suppose there are G reward functions, let the prior probability distribution of these G reward functions be ρ ₁ ,…,ρ _G , the reward weights are θ ¹ ,…,θ ^G , let Θ=(ρ ₁ ,…,ρ _G , θ ¹ ,…,θ ^G ), which represent the set of parameters of these G reward functions.

请参见图2，在已知有假设奖赏函数(由初始化或经过迭代获得)的条件下，此时我们可以把问题描述为一个完全的马尔科夫决策过程MDP。此时在完全的马尔科夫决策过程MDP下，根据强化学习的知识，利用奖赏函数R(s,a,θ)＝g₁(g₂(…(g_n(f,θ_n),…),θ₂),θ₁)，我们可以对V值以及Q值进行评估。对于强化学习的评估算法，采用一种新的软最大化方法(MellowMax)来估计V值的期望值。MellowMax的生成器定义为：MellowMax是一种更优化的算法，它能够保证对V值的估计能够收敛于唯一一点。同时，MellowMax又具备特质：科学的概率分配机制和期望估计方法。在本实施例中，结合了MellowMax的强化学习算法在自动驾驶过程中对环境的探索和利用方面将更加合理。这保证了在强化学习过程收敛时，自动驾驶系统对各种情景已经有了足够的学习并能对当前状态产生较科学的评估。Please refer to Figure 2, under the condition that there is a hypothetical reward function (obtained by initialization or iteration), we can describe the problem as a complete Markov decision process MDP at this time. At this time, under the complete Markov decision process MDP, according to the knowledge of reinforcement learning, use the reward function R(s,a,θ)=g ₁ (g ₂ (…(g _n (f,θ _n ),…) ,θ ₂ ),θ ₁ ), we can evaluate the V value and Q value. For the evaluation algorithm of reinforcement learning, a new soft maximization method (MellowMax) is adopted to estimate the expected value of V value. The generator for MellowMax is defined as: MellowMax is a more optimized algorithm, which can guarantee that the estimation of V value can converge to a unique point. At the same time, MellowMax has special features: scientific probability distribution mechanism and expectation estimation method. In this embodiment, the reinforcement learning algorithm combined with MellowMax will be more reasonable in the exploration and utilization of the environment during the automatic driving process. This ensures that when the reinforcement learning process converges, the autopilot system has learned enough about various scenarios and can produce a more scientific assessment of the current state.

在本实施例中，根据结合了一种软最大化算法MellowMax的强化学习，可以获得对状态的特征的期望值更科学的评价。利用MellowMax可以获得动作选取的概率分布为在该软最大化的动作选取的规则下，利用强化学习的迭代过程，可以获得在以当前深度神经网络的参数为θ构成的奖赏函数所能够获得特征的期望值μ。μ可以理解为特征的累计的期望。In this embodiment, according to reinforcement learning combined with a soft maximization algorithm MellowMax, a more scientific evaluation of the expected value of the feature of the state can be obtained. Using MellowMax, the probability distribution of action selection can be obtained as Under the rule of soft maximization action selection, using the iterative process of reinforcement learning, the expected value μ of the features that can be obtained in the reward function composed of the parameters of the current deep neural network as θ can be obtained. μ can be understood as the cumulative expectation of features.

在本实施例中，利用EM算法来求解上述带隐变量的多目标逆强化学习问题。EM算法按步骤可分为E步和M步，通过E步、M步的不断迭代，逼近似然估计的极大值。In this embodiment, the EM algorithm is used to solve the above multi-objective inverse reinforcement learning problem with hidden variables. The EM algorithm can be divided into E step and M step according to the steps. Through the continuous iteration of E step and M step, the maximum value of the likelihood estimate is approximated.

E步：首先计算其中Z为正则项。z_ij代表第i个驾驶轨迹属于驾驶习惯(奖赏函数)j的概率。Step E: first calculate where Z is a regular term. z _ij represents the probability that the i-th driving trajectory belongs to driving habit (reward function) j.

令y_i＝j表示第i个驾驶轨迹属于驾驶习惯j，并用y＝(y₁,…,y_N)的集合表示N个驾驶轨迹的从属集合。Let y _i =j indicate that the i-th driving trajectory belongs to driving habit j, and use the set of y=(y ₁ ,...,y _N ) to represent the subordinate set of N driving trajectories.

计算似然估计值Q(Θ,Θ^t)＝∑_yL(Θ|D,y)Pr(y|D,Θ^t)(这里所指的Q函数Q(Θ,Θ^t)是EM算法的更新目标函数，注意与强化学习中的Q动作状态值函数相区别)，经过推算获得似然估计值 Calculate the likelihood estimate Q(Θ,Θ ^t )=∑ _y L(Θ|D,y)Pr(y|D,Θ ^t ) (the Q function Q(Θ,Θ ^t ) referred to here is the EM algorithm Update the objective function, pay attention to the difference from the Q action state value function in reinforcement learning), and obtain the likelihood estimate after calculation

M步：选取合适的多驾驶习惯参数集合Θ(ρ_l和θ_l)使得E步中的似然估计值Q(Θ,Θ^t)极大化。由于ρ_l和θ_l的相互独立性，可以分开求它们的极大化。可以得到后半部分 Step M: Select an appropriate multi-driving habit parameter set Θ(ρ _l and θ _l ) to maximize the likelihood estimate Q(Θ,Θ ^t ) in step E. Due to the mutual independence of ρ _l and θ _l , their maximization can be obtained separately. can get second half

对于极大化Q(Θ,Θ^t)后半部分的更新目标:可以理解为是关于在第l簇目标的参数为θ_l的条件下得到观察的轨迹集合所能够获得最大似然方程。我们可以利用相对熵深度逆强化学习的知识来求解这个最大似然方程。相对熵的求解公式，在符合最大似然更新目标的同时，可以很自然应用到深度神经网络参数的反向传播更新。令深度神经网络的最大化目标函数为L(θ)＝logP(D,θ|r),根据联合似然函数的分解公式，可以获得L(θ)＝logP(D,θ|r)＝logP(D|r)+logP(θ)。对该联合似然目标函数求偏导可以获得对于该偏导的前半部分,可以进一步做分解，表示为 For the update objective that maximizes the second half of Q(Θ,Θ ^t ): can be understood as is the set of observed trajectories under the condition that the parameter of the l-th cluster target is θ _l The maximum likelihood equation can be obtained. We can use the knowledge of relative entropy deep inverse reinforcement learning to solve this maximum likelihood equation. The solution formula of relative entropy can be naturally applied to the backpropagation update of deep neural network parameters while meeting the maximum likelihood update objective. Let the maximization objective function of the deep neural network be L(θ)=logP(D,θ|r), and according to the decomposition formula of the joint likelihood function, L(θ)=logP(D,θ|r)=logP can be obtained (D|r)+logP(θ). The partial derivative of the joint likelihood objective function can be obtained For the first half of the partial derivative, it can be further decomposed, expressed as

其中根据相对熵逆强化学习的知识，可以得到求解结果为当前奖赏函数下的特征期望值与专家特征值的差值其中,利用重要性采样，其中，π是一种给定的策略，根据这种策略π采样获得个轨迹。其中其中τ＝s₁a₁,…,s_Ha_H。进一步的，其中表示为更新深度神经网络中隐藏层参数时通过反向传播算法计算的梯度。in According to the knowledge of relative entropy inverse reinforcement learning, the solution result can be obtained as the difference between the feature expectation value under the current reward function and the expert feature value Among them, using importance sampling, Among them, π is a given strategy, and according to this strategy π sampling is obtained track. in where τ=s ₁ a ₁ ,...,s _H a _H . further, in Represented as gradients computed by the backpropagation algorithm when updating hidden layer parameters in a deep neural network.

梯度更新完成标志着一次相对熵深度逆强化学习迭代更新的完成。利用更新完成了参数更新的新的深度网络奖赏函数产生新的策略π，进行新的迭代。The completion of the gradient update marks the completion of an iterative update of relative entropy deep inverse reinforcement learning. The new deep network reward function that has completed the parameter update is used to generate a new strategy π for a new iteration.

不断迭代进行E步和M步的计算，直至似然估计值Q(Θ,Θ^t)收敛至极大值。此时获得的参数集合Θ＝(ρ₁,…,ρ_G,θ¹,…,θ^G)，就是我们想要求解的代表多驾驶习惯的奖赏函数的先验分布和权重。E-step and M-step calculations are iteratively performed until the likelihood estimate Q(Θ,Θ ^t ) converges to the maximum value. The parameter set Θ=(ρ ₁ ,...,ρ _G ,θ ¹ ,...,θ ^G ) obtained at this time is the prior distribution and weight of the reward function representing multiple driving habits that we want to solve.

在本实施例中，根据这个参数集合Θ，经过强化学习RL的计算，获得每个驾驶习惯R的驾驶策略π。输出多驾驶策略，并保存在云端的驾驶策略库中。用户便可在客户端中选择个性、智能的驾驶策略。In this embodiment, according to this parameter set Θ, the driving strategy π for each driving habit R is obtained through reinforcement learning RL calculation. Output multiple driving strategies and save them in the driving strategy library in the cloud. The user can choose a personalized and intelligent driving strategy in the client.

S2：所述存储模块接收所述道路信息并根据所述道路信息分析计算并模拟多种驾驶策略，并将所述驾驶策略传递给所述客户端；S2: The storage module receives the road information, analyzes and calculates and simulates various driving strategies according to the road information, and transmits the driving strategies to the client;

综上所述：通过在系统中设置驾驶基础数据采集子系统2，实时采集道路信息，并将道路信息传递给存储模块3及客户端1，存储模块3接收道路信息后根据历史驾驶轨迹模拟驾驶策略，实现个性、智能的自动驾驶。To sum up: by setting the driving basic data acquisition subsystem 2 in the system, road information is collected in real time, and the road information is transmitted to the storage module 3 and the client 1. After receiving the road information, the storage module 3 simulates driving according to the historical driving trajectory strategies to realize personalized and intelligent autonomous driving.

基于本方法的自动驾驶中，驾驶策略都在云端3中实现计算，而不是在客户端1中运行计算过程。当用户在需要进行自动驾驶的时候，所有驾驶策略都已经在云端3完成。用户只需要选择下载自己所需的驾驶策略，车体就可以根据用户所选择的驾驶策略和实时道路信息进行实时的自动驾驶。同时，在完成任何一次驾驶之后，大量的道路信息上传到云端3被存储为历史驾驶轨迹。利用存储的历史驾驶轨迹大数据，再实现对驾驶策略库的更新。利用轨迹信息大数据，本系统将实现更加贴近用户需求的自动驾驶。In the automatic driving based on this method, the driving strategy is calculated in the cloud 3 instead of running the calculation process in the client 1 . When the user needs to drive automatically, all driving strategies have been completed on the cloud 3. Users only need to choose to download the driving strategy they need, and the car body can perform real-time automatic driving according to the driving strategy selected by the user and real-time road information. Simultaneously, after completing any driving, a large amount of road information is uploaded to the cloud 3 and stored as historical driving tracks. Utilize the big data of stored historical driving trajectories to update the driving strategy library. Using the big data of trajectory information, this system will realize automatic driving that is closer to the needs of users.

以上所述实施例的各技术特征可以进行任意的组合，为使描述简洁，未对上述实施例中的各个技术特征所有可能的组合都进行描述，然而，只要这些技术特征的组合不存在矛盾，都应当认为是本说明书记载的范围。The various technical features of the above-mentioned embodiments can be combined arbitrarily. To make the description concise, all possible combinations of the various technical features in the above-mentioned embodiments are not described. However, as long as there is no contradiction in the combination of these technical features, should be considered as within the scope of this specification.

以上所述实施例仅表达了本发明的几种实施方式，其描述较为具体和详细，但并不能因此而理解为对发明专利范围的限制。应当指出的是，对于本领域的普通技术人员来说，在不脱离本发明构思的前提下，还可以做出若干变形和改进，这些都属于本发明的保护范围。因此，本发明专利的保护范围应以所附权利要求为准。The above-mentioned embodiments only express several implementation modes of the present invention, and the descriptions thereof are relatively specific and detailed, but should not be construed as limiting the patent scope of the invention. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present invention, and these all belong to the protection scope of the present invention. Therefore, the protection scope of the patent for the present invention should be based on the appended claims.

Claims

1. a kind of automated driving system based on relative entropy depth against intensified learning, it is characterised in that the system includes：

Client：Show driving strategy；

Drive basic data acquisition subsystem：Gather road information；

Memory module：It is connected with the client and driving basic data acquisition subsystem and stores the driving basic data and is adopted The road information that subsystem is collected；

Wherein, the driving basic data acquisition subsystem gathers road information and the road information is transferred into the client End and memory module, the memory module receives the road information, and one section of lasting road information is stored as into history rail Mark, analysis calculating is carried out according to the historical track and simulates driving strategy, the memory module transmits the driving strategy To client for selection by the user, the driving that the client receives and selected according to the road information and user personality Strategy implement automatic Pilot.

2. the automated driving system based on relative entropy depth against intensified learning as claimed in claim 1, it is characterised in that described Memory module is including being used to store the driving locus storehouse of history driving locus, calculating and simulate according to driving locus and driving habit Go out the trace information processing subsystem of driving strategy and the driving strategy storehouse of memory of driving strategy；The driving locus storehouse will drive Track data is transferred to the trace information processing subsystem, and the trace information processing subsystem is according to the driving locus number Calculated according to analysis and simulate driving strategy and be transferred to the driving strategy storehouse, the driving strategy storehouse receives and stored described Driving strategy.

3. the automated driving system based on relative entropy depth against intensified learning as claimed in claim 2, it is characterised in that described Trace information processing subsystem is calculated using the relative entropy depth of multiple target against nitrification enhancement and drive simulating strategy.

4. the automated driving system based on relative entropy depth against intensified learning as claimed in claim 3, it is characterised in that described The inverse nitrification enhancement of multiple target calculates more reward functions using EM algorithm frames nesting relative entropy depth against intensified learning Parameter.

5. the personalized automated driving system based on relative entropy depth against intensified learning, its feature exist as claimed in claim 1 In the basic data acquisition subsystem that drives includes being used for the sensor for gathering road information.

A kind of 6. method based on relative entropy depth against the automatic Pilot of intensified learning, it is characterised in that methods described is included such as Lower step：

S1：The road information is simultaneously transferred to client and memory module by collection road information；

S2：The memory module receives the road information and one section of lasting road information is stored as into historical track, according to The historical track analysis is calculated and simulates a variety of driving strategies, and the driving strategy is passed into the client；

S3：The client receives the road information and driving strategy, and the individual character driving strategy and road selected according to user Road information implements automatic Pilot.

7. the method based on relative entropy depth against the automatic Pilot of intensified learning as claimed in claim 6, it is characterised in that institute Memory module is stated including being used to store the driving locus storehouse of history driving locus, calculating simultaneously mould according to driving planning and driving habit Draw up the trace information processing subsystem of driving strategy and the driving strategy storehouse of memory of driving strategy；The driving locus storehouse will drive Sail track data and be transferred to the trace information processing subsystem, the trace information processing subsystem is according to the driving locus Data analysis calculates and simulates driving strategy and be transferred to the driving strategy storehouse, and the driving strategy storehouse receives and stores institute State driving strategy.

8. the method based on relative entropy depth against the automatic Pilot of intensified learning as claimed in claim 7, it is characterised in that institute State trace information processing subsystem and simultaneously drive simulating strategy is calculated against nitrification enhancement using the relative entropy depth of multiple target.

9. the method based on relative entropy depth against the automatic Pilot of intensified learning as claimed in claim 8, it is characterised in that institute The inverse nitrification enhancement for stating multiple target calculates more reward functions using EM algorithm frames nesting relative entropy depth against intensified learning Parameter.