CN110458244B

CN110458244B - Traffic accident severity prediction method applied to regional road network

Info

Publication number: CN110458244B
Application number: CN201910770584.3A
Authority: CN
Inventors: 石琴; 杨慧敏; 陈一锴; 骆仁佳; 于淑君; 董满生
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2019-08-20
Filing date: 2019-08-20
Publication date: 2021-03-30
Anticipated expiration: 2039-08-20
Also published as: CN110458244A

Abstract

The invention discloses a traffic accident severity prediction method applied to a regional road network, comprising the following steps: 1. collecting and preprocessing traffic accident data of the regional road network; 2. establishing a latent class analysis model based on the traffic accident data of the regional road network; 3. establishing a CART decision tree model for each subcategory according to the latent class analysis result; 4. establishing, for each subcategory, an accident severity model based on binary logistic regression that includes both the independent variables and their interaction terms, and taking the intersection of the sensitivity curve and the specificity curve as the classification threshold of the model. The method reduces the adverse effect of accident data heterogeneity on the analysis result, overcomes the problems that traditional traffic accident severity prediction models ignore interaction terms and perform poorly overall on unbalanced data, and improves the prediction accuracy and goodness of fit of the accident severity model.

Description

A traffic accident severity prediction method applied to regional road network

Technical Field

The invention relates to a traffic accident severity prediction method applied to a regional road network, and belongs to the technical field of road traffic safety analysis.

Background

According to the Global Status Report on Road Safety, road traffic accidents are the eighth leading cause of death worldwide, killing more than 1.35 million people every year, and road traffic safety has become a major focus of global concern. Analyzing traffic accident data to determine the factors that affect accident severity and to propose countermeasures that reduce the risk of fatal accidents is currently one of the most practical ways to improve traffic safety. However, road traffic accidents are complex events that involve the responses of different drivers to the external environment as well as interactions among vehicles, road conditions, traffic factors and environmental factors, and some influencing factors may be unobserved. As a result, traffic accident data are highly heterogeneous, and accident severity may be affected by interactions among the factors.

For the analysis of accident severity (fatal and non-fatal accidents), the binary logistic regression model is the most widely used method. However, it ignores the effects of accident data heterogeneity and of interactions among the independent variables on the analysis results, which may lead to inaccurate parameter estimates or cause important hidden relationships to be missed. Yu Rongjie et al. used latent class analysis to divide accident data into several homogeneous latent classes and so reduce the effect of data heterogeneity on the analysis results (Yu R, Wang X, Abdel-Aty M. A Hybrid Latent Class Analysis Modeling Approach to Analyze Urban Expressway Crash Risk[J]. Accident Analysis and Prevention, 2017, 101: 37-43.). Rusli et al. used decision trees to screen high-order interactions among independent variables and included the high-order interaction terms together with the main effects in the accident severity model, to quantify the effect of these interactions on accident severity; however, that method considers only the high-order interactions and ignores the interactions of other orders among the independent variables (Rusdi Rusli, Md. Mazharul Haque, Mohammad Saifuzzaman, Mark King. Crash severity along rural mountainous highways in Malaysia: An application of a combined decision tree and logistic regression model[J]. Traffic Injury Prevention, 2018, 19(7): 741-748.). In addition, the traditional binary logistic regression model considers only the overall prediction accuracy and uses 0.5 as the classification threshold. Fatal accidents usually make up a small share of traffic accident data (that is, the data are unbalanced), so although a threshold of 0.5 gives the model a high overall prediction accuracy, it makes the sensitivity so low that the model loses its predictive value.

Summary of the Invention

To overcome the shortcomings of the prior art, the present invention proposes a traffic accident severity prediction method applied to a regional road network, so as to reduce the adverse effect of accident data heterogeneity on the analysis results, identify the interaction terms among the independent variables, and adjust the classification threshold of the prediction model, thereby overcoming the problems that traditional traffic accident severity prediction models ignore interaction terms and perform poorly overall on unbalanced data, and improving the prediction accuracy and goodness of fit of the accident severity model.

To achieve the above object, the present invention adopts the following technical solution:

The traffic accident severity prediction method applied to a regional road network according to the present invention is characterized by the following steps:

Step 1. Collection and preprocessing of road traffic accident data for the regional road network;

Obtain N accident records from a road traffic accident database as the accident data set D, and select K categorical variables to form the set X = {x₁, x₂, …, x_k, …, x_K} that characterizes the i-th accident, where x_k denotes the k-th categorical variable, which has C_k categories; the value taken by x_k among its C_k categories is denoted s_k. Let s_ik denote the value of the k-th categorical variable in the i-th accident; the set of values of all K categorical variables in the i-th accident is then S_i = {s_i1, s_i2, ..., s_ik, ..., s_iK}. Let $\tilde{S} = \{s_1, s_2, \ldots, s_K\}$ denote any one of the possible value combinations of the K categorical variables of the i-th accident; k = 1, 2, 3, ..., K; i = 1, 2, 3, ..., N;

Take the severity of the i-th accident as the response (predicted) variable, denoted y_i, where y_i = 0 denotes a non-fatal accident and y_i = 1 denotes a fatal accident;

Step 2. Establish a latent class analysis model based on the road traffic accident data of the regional road network;

Step 2.1. Define a latent class variable V in the latent class analysis model; V has T classes, any one of which is denoted t, t = 1, 2, ..., T; let V_i denote the value of V for the i-th accident;

Step 2.1.1. Let τ be the outer-loop counter and τ_max the maximum number of outer-loop iterations; let T_τ be the number of classes set in the τ-th iteration; initialize τ = 1;

Step 2.1.2. Initialize t = 1;

Step 2.1.3. Use equation (1) to obtain the conditional probability that the i-th accident takes the value set $\tilde{S} = \{s_1, \ldots, s_K\}$ on the K categorical variables given V_i = t, that is, given that it belongs to the t-th latent class:

$$P(S_i = \tilde{S} \mid V_i = t) = \prod_{k=1}^{K} P(s_{ik} = s_k \mid V_i = t) \tag{1}$$

In equation (1), P(s_ik = s_k | V_i = t) is the conditional probability that the k-th categorical variable takes the value s_k when the i-th accident belongs to the t-th latent class;

Step 2.1.4. Use equation (2) to obtain the unconditional probability that the K categorical variables of the i-th accident take the value set $\tilde{S} = \{s_1, \ldots, s_K\}$, which is the joint probability of the latent class analysis model:

$$P(S_i = \tilde{S}) = \sum_{t=1}^{T} P(V_i = t) \prod_{k=1}^{K} P(s_{ik} = s_k \mid V_i = t) \tag{2}$$

In equation (2), P(V_i = t) is the probability that the i-th accident belongs to the t-th latent class, that is, the share of latent class t in the population;
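The local-independence product of equation (1) and the mixture sum of equation (2) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the 0-based category codes and the toy probability tables are all assumptions made for the example.

```python
from math import prod


def conditional_prob(values, item_probs_t):
    """Equation (1): P(S_i = S | V_i = t) as a product over the K variables.

    values: list of K category codes (0-based, an assumption of this sketch).
    item_probs_t[k][c]: P(s_ik = c | V_i = t) for one latent class t.
    """
    return prod(item_probs_t[k][v] for k, v in enumerate(values))


def joint_prob(values, class_probs, item_probs):
    """Equation (2): weight each class-conditional probability by P(V_i = t)."""
    return sum(
        class_probs[t] * conditional_prob(values, item_probs[t])
        for t in range(len(class_probs))
    )


# Hypothetical toy model: T = 2 latent classes, K = 2 binary variables.
class_probs = [0.6, 0.4]
item_probs = [
    [[0.9, 0.1], [0.8, 0.2]],  # class 0: tables for variables 0 and 1
    [[0.3, 0.7], [0.5, 0.5]],  # class 1
]
p = joint_prob([0, 1], class_probs, item_probs)
```

Summing `joint_prob` over all four value combinations gives 1, which is a quick sanity check on any set of probability tables.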

Step 2.2. Estimate the model parameters by the maximum likelihood method to obtain the estimates $\hat{P}(V_i = t)$ of the latent class probabilities and $\hat{P}(s_{ik} = s_k \mid V_i = t)$ of the conditional probabilities of the categorical variables, together with the maximized likelihood value L_τ of the latent class analysis model in the τ-th iteration;

Step 2.3. Use equation (3) to compute the posterior probability that the i-th accident is classified into the t-th latent class:

$$\hat{P}(V_i = t \mid S_i = \tilde{S}) = \frac{\hat{P}(V_i = t)\,\hat{P}(S_i = \tilde{S} \mid V_i = t)}{\hat{P}(S_i = \tilde{S})} \tag{3}$$
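The posterior probability of step 2.3 follows from Bayes' rule applied to the estimated class probabilities and class-conditional probabilities. The sketch below assumes the same hypothetical toy tables and 0-based codes as an illustration; none of the names come from the patent.

```python
from math import prod


def posterior(values, class_probs, item_probs):
    """Posterior P(V_i = t | S_i = S) for every latent class t (Bayes' rule)."""
    weighted = [
        class_probs[t] * prod(item_probs[t][k][v] for k, v in enumerate(values))
        for t in range(len(class_probs))
    ]
    total = sum(weighted)  # the unconditional probability of the value set
    return [w / total for w in weighted]


# Hypothetical estimates: T = 2 latent classes, K = 2 binary variables.
class_probs = [0.6, 0.4]
item_probs = [
    [[0.9, 0.1], [0.8, 0.2]],
    [[0.3, 0.7], [0.5, 0.5]],
]
post = posterior([0, 1], class_probs, item_probs)
```

The posteriors of one accident always sum to 1, so they can be compared directly across classes.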

Step 2.4. Assign t+1 to t and check whether t > T_τ holds; if so, go to step 2.5; otherwise, return to step 2.1.3;

Step 2.5. Use equations (4), (5), (6) and (7) to obtain the model fit indices of the τ-th iteration: the Akaike information criterion AIC_τ, the Bayesian information criterion BIC_τ, the sample-size-adjusted Bayesian information criterion aBIC_τ, and the entropy value $R_\tau^2$:

AIC_τ = -2ln(L_τ) + 2M (4)

BIC_τ = -2ln(L_τ) + ln(N) × M (5)

aBIC_τ = -2ln(L_τ) + ln(n^*) × M (6)

In equations (4), (5), (6) and (7), M is the number of unknown parameters in the latent class analysis model, and n^* is the adjusted sample size, n^* = (N+2)/24;
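The three information criteria can be computed directly from the log-likelihood, as sketched below. The log-likelihood and parameter count are made-up inputs; N = 2595 is the sample size reported in the embodiment. Equation (7) does not appear in this text, so `relative_entropy` uses the relative entropy commonly reported for latent class models, which is an assumption rather than the patent's stated formula.

```python
from math import log


def information_criteria(log_lik, n_params, n_obs):
    """AIC, BIC and sample-size-adjusted BIC from ln(L), M and N."""
    n_star = (n_obs + 2) / 24  # adjusted sample size n*
    return {
        "AIC": -2 * log_lik + 2 * n_params,
        "BIC": -2 * log_lik + log(n_obs) * n_params,
        "aBIC": -2 * log_lik + log(n_star) * n_params,
    }


def relative_entropy(posteriors):
    """Assumed form: 1 - sum_i sum_t(-p_it ln p_it) / (N ln T)."""
    n, t = len(posteriors), len(posteriors[0])
    if t == 1:
        return 1.0
    h = -sum(p * log(p) for row in posteriors for p in row if p > 0)
    return 1 - h / (n * log(t))


# Hypothetical log-likelihood and parameter count; N from the embodiment.
ic = information_criteria(log_lik=-1000.0, n_params=10, n_obs=2595)
```

Lower AIC, BIC and aBIC indicate a better fit, while an entropy close to 1 indicates that accidents are assigned to classes with little ambiguity.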

Step 2.6. Assign τ+1 to τ and check whether τ > τ_max holds; if so, go to step 2.7; otherwise, return to step 2.1.3;

Step 2.7. From the τ_max sets of fit indices (AIC, BIC, aBIC and the entropy value R²), select the number of latent classes at which the fit indices are jointly optimal, denoted T^*; divide the accident data set D into T^* accident sub-categories, denoted $\{D_1, \ldots, D_{t^*}, \ldots, D_{T^*}\}$, where $D_{t^*}$ is the accident data in the t^*-th accident sub-category, t^* = 1, 2, …, T^*;

Step 3. Based on the latent class analysis results, establish a CART decision tree model for each of the T^* accident sub-categories;

Step 3.1. Take the accident data $D_{t^*}$ of the t^*-th accident sub-category as the training sample set, and take the set X of K categorical variables as the feature set of the CART decision tree model; let σ be the node sample threshold, α a feature-value split point, and ε the Gini index threshold;

Step 3.2. Initialize t^* = 1;

Step 3.3. Input the training sample set $D_{t^*}$, the feature set X, the node sample threshold σ and the Gini index threshold ε into the CART decision tree model;

Step 3.4. Assign t^*+1 to t^* and check whether t^* > T^* holds; if so, T^* binary decision trees have been obtained, and step 3.5 is performed; otherwise, return to step 3.3;

Step 3.5. From the tree diagrams of the T^* binary decision trees, determine the interaction terms among the categorical variables, the binary decision tree of the t^*-th accident sub-category determining the interaction terms of that sub-category;

Step 4. Establish an accident severity model based on binary logistic regression for each of the T^* accident sub-categories;

Step 4.1. Take the accident data $D_{t^*}$ of the t^*-th accident sub-category as the data for fitting the accident severity model, and take the set X of K categorical variables together with the interaction terms of the t^*-th accident sub-category as the independent variables X^* of the accident severity model; let the t^*-th accident sub-category contain J accident records, with J = $|D_{t^*}|$, and denote the response variable of the j-th accident by y_j;

Step 4.2. Initialize t^* = 1;

Step 4.3. Use equation (11) to obtain, by binary logistic regression, the probability P(y = 1 | X^*) of a fatal accident (y_j = 1) given the independent variables X^*:

$$P(y = 1 \mid X^*) = \frac{\exp(w^* \cdot X^*)}{1 + \exp(w^* \cdot X^*)} \tag{11}$$

In equation (11), w^* is the vector of regression coefficients of the independent variables X^*;
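A small sketch of how the fatal-accident probability might be evaluated once interaction terms are appended to the main effects. It assumes the usual logistic form and 0/1 dummy-coded variables; the helper names, the variable values and the coefficients are all invented for illustration. An intercept, if wanted, can be modeled by adding a constant 1 to the input.

```python
from math import exp


def fatal_probability(x, w):
    """Logistic probability P(y = 1 | x) = 1 / (1 + exp(-w . x))."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + exp(-z))


def with_interactions(x, pairs):
    """Append the product x[a] * x[b] for each interaction pair (a, b)."""
    return list(x) + [x[a] * x[b] for a, b in pairs]


# Hypothetical dummy-coded main effects plus one two-way interaction term.
x_star = with_interactions([1, 0, 1], pairs=[(0, 2)])
w_star = [0.5, -0.2, 0.3, 0.8]  # made-up coefficients, not fitted values
p = fatal_probability(x_star, w_star)
```

With dummy coding, an interaction column equals 1 only when both of its parent categories are present, so its coefficient measures the extra effect of the combination beyond the two main effects.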

Step 4.4. Estimate the parameters w^* of the binary logistic regression accident severity model by the maximum likelihood method:

For the j-th accident, let $P_j = P(y_j = 1 \mid X_j^*)$ be the probability that y_j = 1 given the independent variables $X_j^*$; the probability that y_j = 0 given $X_j^*$ is then 1 - P_j. The likelihood function L(w^*) is obtained from equation (12):

$$L(w^*) = \prod_{j=1}^{J} P_j^{\,y_j} (1 - P_j)^{1 - y_j} \tag{12}$$

Maximum likelihood estimation yields the estimate w′ at which L(w^*) attains its maximum;

From the estimate w′, obtain the predicted probability $\hat{P}_j$ that y_j = 1 for the j-th accident given $X_j^*$, and hence the predicted probabilities $\{\hat{P}_1, \ldots, \hat{P}_j, \ldots, \hat{P}_J\}$ of the J accidents; sort them in ascending order and denote the sorted set of predicted probabilities by {P′₁, ..., P′_j, ..., P′_J};
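The estimation in step 4.4 can be illustrated with a tiny maximum likelihood fit. The sketch below maximizes the log-likelihood by plain gradient ascent, which is a stand-in chosen for brevity and not the solver the patent uses; the data, the learning rate and the step count are assumptions.

```python
from math import exp, log


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def log_likelihood(w, xs, ys):
    """ln L(w) = sum_j [y_j ln P_j + (1 - y_j) ln(1 - P_j)]."""
    return sum(
        y * log(predict(w, x)) + (1 - y) * log(1 - predict(w, x))
        for x, y in zip(xs, ys)
    )


def fit(xs, ys, lr=0.1, steps=2000):
    """Gradient ascent on ln L(w); the gradient is sum_j (y_j - P_j) x_j."""
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = y - predict(w, x)
            for i, xi in enumerate(x):
                grad[i] += err * xi
        w = [wi + lr * g for wi, g in zip(w, grad)]
    return w


# Hypothetical data: column 0 is a constant 1 (intercept), column 1 a 0/1 factor.
xs = [[1, 0], [1, 0], [1, 0], [1, 1], [1, 1], [1, 1]]
ys = [0, 0, 1, 0, 1, 1]
w_hat = fit(xs, ys)
sorted_probs = sorted(predict(w_hat, x) for x in xs)  # ascending order
```

In this toy data one third of the accidents with factor 0 and two thirds of those with factor 1 are fatal, and the fitted probabilities reproduce those shares.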

Step 4.5. Adjust the classification threshold of the accident severity model;

Step 4.6. Assign t^*+1 to t^* and check whether t^* > T^* holds; if so, T^* accident severity prediction models have been obtained; otherwise, return to step 4.3.

The traffic accident severity prediction method of the present invention is further characterized in that step 3.3 is carried out as follows:

Step 3.3.1. The CART decision tree uses the Gini index as the criterion for deciding whether to branch, and a binary decision tree model is built. According to a feature-value split point α, the training sample set $D_{t^*}$ is divided into a first subset D_α1 and a second subset D_α2, and the Gini index Gini(D_α) of the split point α is obtained from equation (8):

$$\mathrm{Gini}(D_\alpha) = \frac{|D_{\alpha 1}|}{|D_{t^*}|}\,\mathrm{Gini}(D_{\alpha 1}) + \frac{|D_{\alpha 2}|}{|D_{t^*}|}\,\mathrm{Gini}(D_{\alpha 2}) \tag{8}$$

In equation (8), $|D_{t^*}|$, |D_α1| and |D_α2| are the total numbers of accidents in the training sample set $D_{t^*}$, the first subset D_α1 and the second subset D_α2, respectively;

Gini(D_α1) is the Gini index of the first subset D_α1:

$$\mathrm{Gini}(D_{\alpha 1}) = 1 - \left(p_{\alpha 1}^{0}\right)^2 - \left(p_{\alpha 1}^{1}\right)^2 \tag{9}$$

In equation (9), $p_{\alpha 1}^{0}$ and $p_{\alpha 1}^{1}$ are the probabilities of non-fatal and fatal accidents in the first subset D_α1, respectively;

In equation (8), Gini(D_α2) is the Gini index of the second subset D_α2:

$$\mathrm{Gini}(D_{\alpha 2}) = 1 - \left(p_{\alpha 2}^{0}\right)^2 - \left(p_{\alpha 2}^{1}\right)^2 \tag{10}$$

In equation (10), $p_{\alpha 2}^{0}$ and $p_{\alpha 2}^{1}$ are the probabilities of non-fatal and fatal accidents in the second subset D_α2, respectively;

Step 3.3.2. Traverse the split points of every feature in the feature set X and compute the Gini index of each split point; if the Gini index of every split point in the feature set X is less than the threshold ε, the CART decision tree model is a single-node tree, and that single-node tree is output; otherwise, go to step 3.3.3;

Step 3.3.3. Select the feature X_min whose split point has the smallest Gini index in the feature set X, together with the corresponding split point α_min; according to α_min, divide the training sample set $D_{t^*}$ into two subsets D_min1 and D_min2, and assign them to the two child nodes of the parent node that holds $D_{t^*}$;

If the sample counts of D_min1 and D_min2 are both smaller than the given node sample threshold σ, both child nodes are leaf nodes, and the binary decision tree is output; if the sample count of D_min1 and/or D_min2 is greater than σ, the corresponding child node is a non-leaf node that can be split further, and step 3.3.4 is performed;

Step 3.3.4. For each non-leaf node, set the training sample set $D_{t^*}$ equal to the subset held by that node, delete the feature X_min from the feature set X, and return to step 3.3.1, until the sample counts of all child nodes are smaller than the node sample threshold σ or the feature set X is empty; then output the final binary decision tree.
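Steps 3.3.1 to 3.3.4 can be sketched as a short recursive routine. This is an illustrative simplification rather than the patent's algorithm: splits take the form "feature equals value", a node stops when its own Gini index falls below ε (instead of testing every candidate split), and the variable names `night` and `truck` and the eight toy records are invented.

```python
def gini(labels):
    """Gini index 1 - p0^2 - p1^2 for 0/1 (non-fatal/fatal) labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - (1.0 - p1) ** 2 - p1 ** 2


def best_split(rows, labels, features):
    """Return (gini, feature, value) of the lowest size-weighted Gini split."""
    best = None
    for f in features:
        for v in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] == v]
            right = [y for r, y in zip(rows, labels) if r[f] != v]
            if not left or not right:
                continue
            g = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or g < best[0]:
                best = (g, f, v)
    return best


def build(rows, labels, features, sigma, eps):
    """Grow a binary tree; stop on small or pure nodes, or when no feature is left."""
    leaf = {"n": len(labels), "fatal_rate": sum(labels) / len(labels)}
    if len(labels) < sigma or gini(labels) < eps or not features:
        return leaf
    split = best_split(rows, labels, features)
    if split is None:
        return leaf
    _, f, v = split
    rest = [g for g in features if g != f]  # drop the feature once it is used
    yes = [(r, y) for r, y in zip(rows, labels) if r[f] == v]
    no = [(r, y) for r, y in zip(rows, labels) if r[f] != v]
    return {
        "feature": f,
        "value": v,
        "yes": build([r for r, _ in yes], [y for _, y in yes], rest, sigma, eps),
        "no": build([r for r, _ in no], [y for _, y in no], rest, sigma, eps),
    }


# Hypothetical records: fatal only when both night = 1 and truck = 1.
base = [{"night": 0, "truck": 0}, {"night": 0, "truck": 1},
        {"night": 1, "truck": 0}, {"night": 1, "truck": 1}]
tree = build(base * 2, [0, 0, 0, 1] * 2, ["night", "truck"], sigma=2, eps=0.01)
```

In the resulting tree the `truck` split appears only under the `night` branch, which is the kind of path that would be read off as a candidate `night` × `truck` interaction term.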

Step 4.5 is carried out as follows:

Step 4.5.1. Let θ be the classification threshold of the model, with 0 < θ < 1; $\hat{P}_j \ge \theta$ means that the accident severity model predicts the j-th accident to be a fatal accident, and $\hat{P}_j < \theta$ means that it predicts the j-th accident to be a non-fatal accident;

Step 4.5.2. Initialize j′ = 1;

Step 4.5.3. Set the j′-th classification threshold θ_j′ equal to P′_j′, and use equation (13) to obtain the j′-th sensitivity Se(θ_j′) of the accident severity model, that is, the probability that a fatal accident in the accident data set is predicted as fatal:

$$Se(\theta_{j'}) = P(\hat{P}_s \ge \theta_{j'} \mid y_s = 1) \tag{13}$$

In equation (13), $\hat{P}_s$ is the predicted probability that the s-th accident is a fatal accident, y_s = 1 indicates that the s-th accident is a fatal accident, and 1 ≤ s ≤ J;

Use equation (14) to obtain the j′-th specificity Sp(θ_j′) of the accident severity model, that is, the probability that a non-fatal accident in the accident data set is predicted as non-fatal:

$$Sp(\theta_{j'}) = P(\hat{P}_s < \theta_{j'} \mid y_s = 0) \tag{14}$$

In equation (14), $\hat{P}_s$ is the predicted probability that the s-th accident is a fatal accident, y_s = 0 indicates that the s-th accident is a non-fatal accident, and 1 ≤ s ≤ J;

Step 4.5.4. Assign j′+1 to j′ and check whether j′ > J holds; if so, J pairs of sensitivity and specificity values have been obtained, and step 4.5.5 is performed; otherwise, return to step 4.5.3;

Step 4.5.5. With the j′-th classification threshold θ_j′ on the horizontal axis and the corresponding sensitivity Se(θ_j′) and specificity Sp(θ_j′) on the vertical axis, plot the sensitivity curve and the specificity curve, and take the threshold at the intersection of the two curves as the optimal classification threshold θ′ of the model.
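The threshold search of step 4.5 can be sketched as a scan over the predicted probabilities. On finite data the two curves rarely cross exactly at a scanned point, so this sketch returns the candidate threshold where sensitivity and specificity are closest; that nearest-point rule, the use of "greater than or equal to" for a fatal prediction, and the toy probabilities are assumptions.

```python
def sensitivity(probs, ys, theta):
    """Share of fatal accidents (y = 1) predicted fatal, i.e. with p >= theta."""
    fatal = [p for p, y in zip(probs, ys) if y == 1]
    return sum(p >= theta for p in fatal) / len(fatal)


def specificity(probs, ys, theta):
    """Share of non-fatal accidents (y = 0) predicted non-fatal, i.e. p < theta."""
    nonfatal = [p for p, y in zip(probs, ys) if y == 0]
    return sum(p < theta for p in nonfatal) / len(nonfatal)


def crossing_threshold(probs, ys):
    """Scan each predicted probability as a threshold, in ascending order,
    and return the one where |Se - Sp| is smallest."""
    return min(
        sorted(set(probs)),
        key=lambda th: abs(sensitivity(probs, ys, th) - specificity(probs, ys, th)),
    )


# Hypothetical unbalanced sample: 2 fatal accidents out of 8.
probs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.25, 0.35]
ys = [0, 0, 0, 0, 0, 0, 1, 1]
theta = crossing_threshold(probs, ys)
```

In this toy sample a threshold of 0.5 would classify no accident as fatal (sensitivity 0), whereas the selected threshold trades a little specificity for a usable sensitivity.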

Compared with the prior art, the beneficial effects of the present invention are:

1. Based on the traffic accident data of a regional road network, the method first builds a latent class analysis model that divides the accident data into several homogeneous sub-categories; next, it builds a CART decision tree model for each sub-category to identify the interaction terms among the independent variables; then, for each sub-category, it builds a binary logistic regression accident severity model that includes the interaction terms, and sets the intersection of the sensitivity and specificity curves as the classification threshold of the model. The method reduces the adverse effect of accident data heterogeneity on the analysis results, overcomes the problems that traditional traffic accident severity prediction models ignore interaction terms and perform poorly overall on unbalanced data, and improves the prediction accuracy and goodness of fit of the accident severity model.

2. The method divides the traffic accident data into several homogeneous sub-categories through latent class analysis, which both reflects the heterogeneity of the accident data and makes it possible to identify and analyze the latent patterns and mechanisms of road traffic accident occurrence accurately;

3. The method identifies interaction terms of every order among the independent variables through the CART decision tree model and includes them in the binary logistic regression model, which improves the goodness of fit of the model and identifies the important independent variables and interaction terms that affect the severity of traffic accidents on the regional road network, helping to raise the level of road traffic safety on the regional road network;

4. The method uses the threshold at the intersection of the sensitivity and specificity curves as the classification threshold of the binary logistic regression model, which addresses the unbalanced data classification problem and improves the prediction accuracy of the accident severity model.

Description of the Drawings

Fig. 1 is the CART decision tree diagram for category 1 of the present invention;

Fig. 2 is the sensitivity and specificity curve for category 1 of the present invention;

Fig. 3 is the ROC curve for category 1 of the present invention;

Fig. 4 is a flow chart of the method of the present invention.

Detailed Description

In this embodiment, as shown in Fig. 4, a traffic accident severity prediction method applied to a regional road network is carried out according to the following steps:

Step 1.1. Collect the traffic accident data of a regional road network from a road traffic accident platform, and delete the records in the traffic accident database that are incomplete (contain blank fields) or implausible. A total of 2595 accident records (N = 2595) are obtained as the accident data set D. From five aspects (person, vehicle, accident characteristics, road and environment), 26 categorical variables are selected to form the set X = {x₁, x₂, ..., x₂₆} that characterizes the i-th accident, and they serve as the independent variables of the prediction model; their coding is given in Table 1. Here x_k denotes the k-th categorical variable, which has C_k categories, and the value of x_k among its C_k categories is denoted s_k (for example, the first categorical variable x₁ has two categories, so C₁ = 2 and s₁ is 1 for female or 2 for male). Each accident can be represented as the set of values of the 26 categorical variables, S_i = {s_i1, s_i2, ..., s_ik, ..., s_i26}. Let $\tilde{S} = \{s_1, s_2, \ldots, s_K\}$ denote any one of the possible value combinations of the K categorical variables of the i-th accident; k = 1, 2, 3, ..., K; i = 1, 2, 3, ..., N;

The severity of each accident is taken as the response (predicted) variable, denoted y_i, where y_i = 0 denotes a non-fatal accident and y_i = 1 denotes a fatal accident;

Step 1.2. Use SPSS to test for multicollinearity and delete any collinear categorical variables. The test shows that the variance inflation factors (VIF) are all less than 5 and the corresponding tolerances (TOL) are all greater than 0.1 (see Table 1), so there is no collinearity among the 26 categorical variables and all of them can be included in the model analysis.

Table 1. Definition, coding and collinearity test of the independent variables
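The collinearity check of step 1.2 was run in SPSS; the sketch below only illustrates what VIF and TOL measure. It relies on the standard identity that the VIF of each variable is the corresponding diagonal entry of the inverse correlation matrix, with TOL = 1/VIF. Treating the coded columns as numeric and the two toy columns themselves are assumptions.

```python
def correlation_matrix(cols):
    """Pearson correlation matrix of a list of equally long numeric columns."""
    n = len(cols[0])
    dev = [[v - sum(c) / n for v in c] for c in cols]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return [[dot(a, b) / (dot(a, a) * dot(b, b)) ** 0.5 for b in dev] for a in dev]


def invert(m):
    """Gauss-Jordan inversion with partial pivoting for a small square matrix."""
    n = len(m)
    a = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        pv = a[c][c]
        a[c] = [v / pv for v in a[c]]
        for r in range(n):
            if r != c:
                factor = a[r][c]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[c])]
    return [row[n:] for row in a]


def vif_and_tolerance(cols):
    """Return one (VIF, TOL) pair per column."""
    inv = invert(correlation_matrix(cols))
    return [(inv[k][k], 1.0 / inv[k][k]) for k in range(len(cols))]


# Hypothetical 0/1 coded variables with a pairwise correlation of 0.5.
x1 = [0, 0, 1, 1, 0, 1, 0, 1]
x2 = [0, 1, 1, 1, 0, 1, 0, 0]
result = vif_and_tolerance([x1, x2])
```

Both toy variables pass the criteria quoted above (VIF below 5 and TOL above 0.1), so neither would be dropped.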

Step 2.1. Define a latent class variable V in the latent class analysis model; V has T classes, any one of which is denoted t, t = 1, 2, ..., T; let V_i denote the value of V for the i-th accident;

Step 2.1.1. Let τ be the outer-loop counter and set the maximum number of outer-loop iterations to 5; let T_τ be the number of classes set in the τ-th iteration, with T_τ = τ; initialize τ = 1;

Step 2.1.2. Initialize t = 1;

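Step 2.1.3. Use equation (1) to obtain the conditional probability that the i-th accident takes the value set $\tilde{S} = \{s_1, \ldots, s_K\}$ on the K categorical variables given that it belongs to the t-th latent class:

$$P(S_i = \tilde{S} \mid V_i = t) = \prod_{k=1}^{K} P(s_{ik} = s_k \mid V_i = t) \tag{1}$$

Step 2.1.4. Use equation (2) to obtain the unconditional probability that the K categorical variables of the i-th accident take the value set $\tilde{S}$, which is the joint probability of the latent class analysis model:

$$P(S_i = \tilde{S}) = \sum_{t=1}^{T} P(V_i = t) \prod_{k=1}^{K} P(s_{ik} = s_k \mid V_i = t) \tag{2}$$

In addition, the basic constraints of the latent class analysis model are that the latent class probabilities sum to 1 and that the conditional probabilities of each categorical variable sum to 1, as shown in equations (3) and (4):

$$\sum_{t=1}^{T} P(V_i = t) = 1 \tag{3}$$

$$\sum_{s_k = 1}^{C_k} P(s_{ik} = s_k \mid V_i = t) = 1 \tag{4}$$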

Step 2.3. According to Bayes' theorem, use equation (5) to compute the posterior probability that the i-th accident is classified into the t-th latent class:

$$\hat{P}(V_i = t \mid S_i = \tilde{S}) = \frac{\hat{P}(V_i = t)\,\hat{P}(S_i = \tilde{S} \mid V_i = t)}{\hat{P}(S_i = \tilde{S})} \tag{5}$$

where $\hat{P}(S_i = \tilde{S})$ is given by equation (6):

$$\hat{P}(S_i = \tilde{S}) = \sum_{t=1}^{T} \hat{P}(V_i = t) \prod_{k=1}^{K} \hat{P}(s_{ik} = s_k \mid V_i = t) \tag{6}$$

The i-th accident is assigned to the sub-category for which its posterior probability is largest; the posterior probabilities are computed and compared for all N accidents, which achieves the clustering;
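The assignment rule just described (pick the class with the largest posterior) amounts to an argmax per accident followed by a count per class. The posterior rows below are invented for illustration and class indices are 0-based, both of which are assumptions of the sketch.

```python
from collections import Counter


def assign_classes(posteriors):
    """Assign each accident to the latent class with the largest posterior."""
    return [max(range(len(row)), key=row.__getitem__) for row in posteriors]


# Hypothetical posterior probabilities for 4 accidents and 3 latent classes.
posteriors = [
    [0.70, 0.20, 0.10],
    [0.15, 0.25, 0.60],
    [0.05, 0.90, 0.05],
    [0.55, 0.30, 0.15],
]
labels = assign_classes(posteriors)
sizes = Counter(labels)  # number of accidents in each sub-category
```

Applied to a full data set, the counts in `sizes` are the sub-category sizes that the later modeling steps work with.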

Step 2.5. Use equations (7), (8), (9) and (10) to obtain the model fit indices of the τ-th iteration: the Akaike information criterion AIC_τ, the Bayesian information criterion BIC_τ, the sample-size-adjusted Bayesian information criterion aBIC_τ, and the entropy value $R_\tau^2$:

AIC_τ = -2ln(L_τ) + 2M (7)

BIC_τ = -2ln(L_τ) + ln(N) × M (8)

aBIC_τ = -2ln(L_τ) + ln(n^*) × M (9)

In equations (7), (8), (9) and (10), M is the number of unknown parameters in the latent class analysis model, and n^* is the adjusted sample size, n^* = (N+2)/24;

Step 2.7. The latent class analysis model is built and its parameters are estimated with Mplus version 7.4, with the number of latent classes T fixed for each run. The number of latent classes is increased step by step from T = 1 to T = 5, which gives the estimated log-likelihood ln(L) of five different latent class analysis models, so that τ runs up to 5. The fit indices of the five models are then computed, including the Akaike information criterion AIC_τ, the Bayesian information criterion BIC_τ, the sample-size-adjusted Bayesian information criterion aBIC_τ, and the entropy value $R_\tau^2$; the resulting model fit indices are listed in Table 2.

Table 2. Summary of model fit indices

In Table 2, smaller values of AIC, BIC and aBIC indicate a better fit; an entropy value greater than 0.8 indicates a classification accuracy above 90%; LMR and BLRT are relative fit indices, for which a significant P value indicates that the model with T classes is significantly better than the model with T-1 classes. On this basis, the accident data are divided into 3 categories for analysis, that is, T^* = 3. The estimation results of the latent class analysis model for T^* = 3 are shown in Table 3. The conditional probability distributions reveal the accident characteristics of each accident sub-category: category 1 is named passenger car accidents on county roads, category 2 motor vehicle accidents on rural roads, and category 3 non-motor-vehicle accidents involving elderly people, which identifies the latent patterns of road traffic accident occurrence.

According to Bayes' theorem, formula (5) is used to calculate the posterior probability that the i-th observed accident is classified into each of the 3 latent classes. The posterior probabilities are calculated and compared for all accidents, which divides the 2595 accidents into 3 accident sub-categories, denoted {D₁, D₂, D₃}, containing 1104, 485 and 1006 accidents respectively;

Table 3 Latent class probabilities and conditional probabilities of the independent variables for T^* = 3 (excerpt)

Step 3: According to the results of the latent class analysis model, establish a CART decision tree model for each of the 3 accident sub-categories;

Step 3.1: Take the accident data in the t^*-th accident sub-category as the training sample set, t^* = 1, 2, 3, and take the set X of 26 categorical variables as the feature set of the CART decision tree model; let the node sample threshold be σ, the feature value cut point be α, and the Gini index threshold be ε;

Step 3.2: Initialize t^* = 1;

Step 3.3: Build the CART decision tree model with SPSS: input the accident data set, set the feature set X to the variables identified as significant in step 3.1, set the node sample threshold σ to 50, and set the Gini index threshold ε to 0.001;

Step 3.3.1: The CART decision tree uses the Gini index as the criterion for deciding whether to branch, and a binary decision tree model is built. According to the feature value cut point α, the training sample set is divided into a first subset D_α1 and a second subset D_α2; that is, taking a category C_k of the categorical variable x_k as the cut point α divides the sample set D into the two subsets D_α1 and D_α2. Formula (11) gives the Gini index Gini(D_α) of the feature value cut point α.

In formula (11), |D_α1| and |D_α2| denote the total numbers of accidents contained in the first subset D_α1 and the second subset D_α2 of the training sample set, and Gini(D_α1) denotes the Gini index of the first subset D_α1, given by formula (12).

In formula (12), the two probability terms are the probabilities of non-fatal and fatal accidents in the first subset D_α1;

In formula (11), Gini(D_α2) denotes the Gini index of the second subset D_α2, given by formula (13).

In formula (13), the two probability terms are the probabilities of non-fatal and fatal accidents in the second subset D_α2;
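The Gini index of a candidate cut point can be illustrated as follows. The sketch assumes the standard CART definitions (for a binary outcome, one minus the sum of squared class proportions, and a split score equal to the size-weighted average over the two subsets), which match the quantities named for formulas (11) to (13); the function names `gini` and `gini_split` are hypothetical.

```python
def gini(labels):
    """Gini index of a set of 0/1 severity labels (1 = fatal accident)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_fatal = sum(labels) / n
    p_nonfatal = 1.0 - p_fatal
    return 1.0 - p_nonfatal ** 2 - p_fatal ** 2

def gini_split(values, labels, cut):
    """Size-weighted Gini index of splitting on `values == cut`.

    values : category of one categorical variable for each accident.
    labels : 0/1 severity of each accident.
    cut    : the category used as cut point; records equal to it form
             the first subset, all other records the second subset.
    """
    first = [y for x, y in zip(values, labels) if x == cut]
    second = [y for x, y in zip(values, labels) if x != cut]
    n = len(labels)
    return len(first) / n * gini(first) + len(second) / n * gini(second)
```

A cut point that separates fatal from non-fatal accidents perfectly scores 0, and one that leaves both subsets evenly mixed scores 0.5.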

Step 3.3.2: Traverse the cut points of every feature in the feature set X and calculate the Gini index of each cut point. If the Gini index of every cut point of every feature in X is below the threshold 0.001, the CART decision tree model is a single-node tree; output the single-node tree, in which case there is no interaction term. Otherwise, go to step 3.3.3;

Step 3.3.3: Select the feature X_min in the feature set X whose cut point has the smallest Gini index, together with the corresponding cut point α_min; divide the training sample set D_t* into two subsets D_min1 and D_min2 according to α_min, and assign them to the two child nodes of the parent node that holds the training sample set;

If the numbers of samples in subset D_min1 and subset D_min2 are both below the node sample threshold of 50, the child nodes holding D_min1 and D_min2 are both leaf nodes; output the binary decision tree, in which case only second-order interaction terms exist. If the number of samples in D_min1 and/or D_min2 is greater than the node sample threshold of 50, the child node holding that subset is a non-leaf node that can be divided further; go to step 3.3.4;

Step 3.3.4: For each non-leaf node, set the training sample set equal to the subset held by that node, delete from the feature set X the feature X_min whose cut point had the smallest Gini index, and return to step 3.3.1. Repeat until the number of samples in every child node is below the node sample threshold of 50 or the feature set X is empty, then output the final binary decision tree;
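Steps 3.3.2 to 3.3.4 can be sketched as a short recursive routine. The patent builds its trees in SPSS, so the code below is only an illustration of the stopping rules as written (Gini threshold ε, node sample threshold σ, and removal of the chosen feature after each split); the dictionary-based tree and the names `build_tree` and `split_gini` are hypothetical.

```python
def gini(labels):
    """Gini index of a list of 0/1 labels (1 = fatal accident)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - (1.0 - p1) ** 2 - p1 ** 2

def split_gini(rows, labels, feature, cut):
    """Size-weighted Gini index of splitting on rows[feature] == cut."""
    first = [y for r, y in zip(rows, labels) if r[feature] == cut]
    second = [y for r, y in zip(rows, labels) if r[feature] != cut]
    n = len(labels)
    return len(first) / n * gini(first) + len(second) / n * gini(second)

def build_tree(rows, labels, features, sigma=50, eps=0.001):
    """Grow a binary tree following steps 3.3.2-3.3.4.

    rows     : list of dicts mapping feature name -> category.
    labels   : list of 0/1 severities.
    features : feature names still available for splitting.
    """
    node = {"n": len(labels), "fatal": sum(labels)}
    # Leaf: too few samples in the node or no feature left.
    if len(labels) < sigma or not features:
        return node
    candidates = [(split_gini(rows, labels, f, c), f, c)
                  for f in features
                  for c in sorted({r[f] for r in rows})]
    # Leaf: every cut point already has a Gini index below eps.
    if not candidates or max(g for g, _, _ in candidates) < eps:
        return node
    _, feat, cut = min(candidates, key=lambda item: item[0])
    first = [(r, y) for r, y in zip(rows, labels) if r[feat] == cut]
    second = [(r, y) for r, y in zip(rows, labels) if r[feat] != cut]
    if not first or not second:  # degenerate split, keep as leaf
        return node
    rest = [f for f in features if f != feat]  # drop the used feature
    node["feature"], node["cut"] = feat, cut
    node["left"] = build_tree([r for r, _ in first], [y for _, y in first],
                              rest, sigma, eps)
    node["right"] = build_tree([r for r, _ in second], [y for _, y in second],
                               rest, sigma, eps)
    return node
```

Reading the split variables along each root-to-leaf path of the returned tree gives the candidate interaction terms that step 3.5 extracts from the tree diagram.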

Step 3.4: Assign t^* + 1 to t^* and check whether t^* > 3 holds; if so, 3 binary decision tree models have been obtained and step 3.5 is executed; otherwise return to step 3.3;

Step 3.5: Determine the interaction terms between the categorical variables from the tree diagrams of the 3 binary decision trees, including the interaction terms determined by the binary decision tree of the t^*-th accident sub-category;

Figure 1 shows the tree diagram of the binary decision tree for category 1. The tree takes all data in category 1 as its root node and has 4 levels and 5 leaf nodes. The rectangular box of each node gives the total number of accidents in the node, the numbers of fatal and non-fatal accidents, and their proportions. The tree diagram (Figure 1) shows second-order interactions between vehicle type and passengers, between vehicle type and road technical grade, and between road technical grade and road alignment, and a third-order interaction among vehicle type, road technical grade and road alignment;

In the same way, the second-order interaction terms in category 2 are accident type with lighting conditions and accident type with vehicle type, and the second-order interaction term in category 3 is vehicle type with driver age.

Step 4: Establish an accident severity model based on binary logistic regression for each of the 3 accident sub-categories;

Step 4.1: Take the accident data in the t^*-th accident sub-category as the fitting data of the accident severity model, and take the set X of K categorical variables together with the interaction terms of the t^*-th accident sub-category as the independent variables X^* of the model. The t^*-th accident sub-category contains J accidents, and the response variable of the j-th accident is denoted y_j;

A univariate chi-square test is performed on each accident sub-category in SPSS, where a P value below 0.05 indicates that the independent variable is significantly associated with the dependent variable. The results are given in Table 4; in category 1, 16 variables are significantly associated with accident severity.

Table 4 Univariate chi-square test results for each accident sub-category

Step 4.2: Initialize t^* = 1;

Step 4.3: Use formula (14) to obtain, based on binary logistic regression, the probability P(y = 1 | X^*) that a fatal accident (y_j = 1) occurs given the independent variables X^*.

In formula (14), w^* denotes the regression coefficients of the independent variables X^*;

Step 4.4: Estimate the parameters w^* of the binary logistic regression accident severity model by the maximum likelihood method:

For the j-th accident, let P_j be the probability that y_j = 1 given its independent variables; the probability that y_j = 0 given the same independent variables is then 1 - P_j. Formula (15) gives the likelihood function L(w^*).

Maximum likelihood estimation yields the parameter estimate w′ at which L(w^*) attains its maximum. The parameters of the accident severity model are estimated in SPSS. Interaction terms between categorical variables enter the model as products of the categorical variables, and dummy variables are set for each independent variable to make the results easier to interpret. Independent variables are entered into or removed from the model using the Wald test, with entry and removal criteria of P < 0.05 and P > 0.1 respectively, and the number of iterations is set to 20;
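The estimation step can be illustrated with a small Newton-Raphson fit. This is a sketch under stated assumptions, not the SPSS procedure: it assumes the usual logistic form P = 1 / (1 + exp(-w·x)) and the Bernoulli likelihood (the product over accidents of P_j^y_j (1 - P_j)^(1 - y_j)), it omits the Wald-based stepwise selection, and the helper names `add_interaction`, `fit_logistic` and `predict_proba` are hypothetical. The 20 iterations mirror the iteration limit quoted above.

```python
import numpy as np

def add_interaction(X, i, j):
    """Append the product of dummy columns i and j as an interaction term."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([X, X[:, i] * X[:, j]])

def fit_logistic(X, y, n_iter=20):
    """Maximum likelihood logistic regression via Newton-Raphson.

    Returns the coefficient vector w with the intercept in w[0].
    """
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))                # current P(y = 1 | x)
        gradient = X.T @ (y - p)                        # score vector
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])  # observed information
        w = w + np.linalg.solve(hessian, gradient)      # Newton step
    return w

def predict_proba(w, X):
    """Predicted probability of a fatal accident for each row of X."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    return 1.0 / (1.0 + np.exp(-X @ w))
```

With two dummy variables and their product, the model is saturated, so the fitted probabilities reproduce the observed fatal-accident proportion in each of the four cells; the product coefficient measures how far the joint effect departs from the sum of the two main effects on the log-odds scale.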

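From the parameter estimate w′, the predicted probability that y_j = 1 is obtained for the j-th accident given its independent variables, which gives the predicted probabilities of all J accidents; these are sorted in ascending order to give the set {P′₁, ..., P′_j, ..., P′_J};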

Step 4.5.1: Define θ as the classification threshold of the model prediction, with 0 < θ < 1; depending on how its predicted probability compares with θ, the accident severity model predicts the j-th accident as a fatal accident or as a non-fatal accident;

Step 4.5.2: Initialize j′ = 1;

Step 4.5.3: Let the j′-th classification threshold θ_j′ of the model equal P′_j′, and use formula (15) to obtain the j′-th sensitivity Se(θ_j′) of the accident severity model, i.e., the probability that a fatal accident in the accident data set is predicted as a fatal accident.

In formula (15), the probability term is the probability that the s-th accident is predicted as a fatal accident, and y_s = 1 indicates that the s-th accident is a fatal accident, 1 ≤ s ≤ J;

Use formula (16) to obtain the j′-th specificity Sp(θ_j′) of the accident severity model, i.e., the probability that a non-fatal accident in the accident data set is predicted as a non-fatal accident.

In formula (16), the probability term is the probability that the s-th accident is predicted as a fatal accident, and y_s = 0 indicates that the s-th accident is a non-fatal accident, 1 ≤ s ≤ J;

Step 4.5.4: Assign j′ + 1 to j′ and check whether j′ > J holds; if so, J pairs of sensitivity and specificity values have been obtained and step 4.5.5 is executed; otherwise return to step 4.5.3;

Step 4.5.5: With the j′-th classification threshold θ_j′ on the horizontal axis and the corresponding sensitivity Se(θ_j′) and specificity Sp(θ_j′) on the vertical axis, draw the sensitivity and specificity curves, and take the threshold at the intersection of the two curves as the optimal model prediction classification threshold θ′;
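The threshold search in steps 4.5.1 to 4.5.5 can be sketched as follows. Two points are assumptions because the formulas do not survive in this text: an accident is counted as predicted fatal when its probability is at least θ, and the intersection of the two curves is approximated by the candidate threshold with the smallest gap between sensitivity and specificity. The function name `crossing_threshold` is hypothetical.

```python
import numpy as np

def crossing_threshold(probs, y):
    """Threshold at which sensitivity and specificity are closest.

    probs : predicted probabilities of a fatal accident.
    y     : observed 0/1 severities (1 = fatal); both classes must occur.
    Returns (threshold, sensitivity, specificity).
    """
    probs = np.asarray(probs, dtype=float)
    y = np.asarray(y, dtype=int)
    thresholds = np.sort(probs)  # each predicted probability is a candidate
    # Sensitivity: share of fatal accidents predicted fatal (prob >= theta).
    se = np.array([(probs[y == 1] >= t).mean() for t in thresholds])
    # Specificity: share of non-fatal accidents predicted non-fatal.
    sp = np.array([(probs[y == 0] < t).mean() for t in thresholds])
    best = int(np.argmin(np.abs(se - sp)))
    return float(thresholds[best]), float(se[best]), float(sp[best])
```

Choosing the crossing point rather than a fixed 0.5 keeps the two error rates balanced when fatal accidents are a minority of the records, which is the imbalance problem the method sets out to address.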

Step 4.6: Assign t^* + 1 to t^* and check whether t^* > 3 holds; if so, 3 accident severity prediction models have been obtained; otherwise return to step 4.3.

The parameter estimates of the 3 binary logistic regression accident severity models are shown in Table 6. The regression coefficient vector w^* consists of the constant term β₀ and the regression coefficients B of the independent variables. A positive B indicates that the variable increases the likelihood of a fatal accident, and a negative B indicates that it decreases it; OR = exp(B) is the factor by which the presence of an independent variable multiplies the odds of a fatal accident.

Table 6 Estimation results of the accident severity models

Note: B is the model regression coefficient; OR is the odds ratio, OR = exp(B);
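The relation between B, OR and a percentage change can be made concrete with a short helper. The coefficients of Table 6 are not reproduced in this text, so the values in the usage note are illustrative only; the function names are hypothetical.

```python
import math

def odds_ratio(b):
    """Odds ratio OR = exp(B) for a regression coefficient B."""
    return math.exp(b)

def percent_change_in_odds(b):
    """Percentage change in the odds of a fatal accident implied by B."""
    return (math.exp(b) - 1.0) * 100.0
```

For instance, a coefficient of ln(0.4) gives an odds ratio of 0.4, a reduction of about 60%, and a coefficient of 0 leaves the odds unchanged.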

At the same time, the predicted probability that y_j = 1 is obtained for each accident from the parameter estimate w′, which gives the prediction classification threshold at the intersection of the sensitivity and specificity curves; Figure 2 shows the sensitivity and specificity curves for category 1. The prediction classification thresholds of the 3 accident sub-categories are 0.2930, 0.3928 and 0.4133, and the model prediction accuracies at these thresholds are 68.8%, 75.5% and 66.3% respectively;

Step 4.6.1: Analysis of the accident severity model results:

Table 6 shows that the factors affecting accident severity differ markedly between the accident sub-categories. Unlicensed driving, drunk driving, speeding, central median facilities, terrain, the second-order interaction between motorcycles and passengers, and the third-order interaction among trucks, Class IV highways and road alignment are significant only in category 1. Agricultural vehicles, collisions with fixed objects, off-peak hours, road alignment and visibility are significant only in category 2. Vehicles falling off the road, substandard (unclassified) roads, traffic control facilities, and the interaction between age and non-motor vehicles are significant only in category 3.

Taking category 1 as an example, the regression coefficients of unlicensed driving, speeding and drunk driving are all positive, and the odds of a fatal accident increase by about 132%, 140% and 124% respectively in these three cases. Regarding accident type, a collision with a non-fixed object increases the odds of a fatal accident by 96%; carrying passengers increases them by 165%; the absence of a central median facility increases them by 120%; and at night they rise by about 44%.

Regarding variable interactions, the odds of a fatal accident fall by about 60% when a motorcycle is ridden with a passenger. When a truck travels on a Class IV highway, accident severity is strongly affected by road alignment: combined curve-and-slope sections have the largest effect (OR = 12.036), followed by curved sections (OR = 5.57).

Step 4.6.2: Model comparison:

To compare the method of the present invention with the traditional binary logistic regression model for accident severity analysis, model prediction accuracy and the ROC curve are used to measure predictive performance, and the Hosmer-Lemeshow (HL) statistic is used to measure goodness of fit.

Model prediction accuracy is obtained with the intersection of the sensitivity and specificity curves as the classification threshold; a higher value indicates better model performance. The ROC curve is drawn with 1 - specificity on the horizontal axis and sensitivity on the vertical axis, and the area under the ROC curve (AUC) is used to evaluate the classification performance of the model: an AUC above 0.5 indicates predictive value better than random guessing, and the closer the AUC is to 1, the better the model classifies. Taking category 1 as an example, Figure 2 shows the threshold at the intersection of the sensitivity and specificity curves used as the model prediction classification threshold, and Figure 3 shows the ROC curve. The goodness of fit is measured by the Hosmer-Lemeshow (HL) statistic, which follows a chi-square distribution; a non-significant P value (> 0.05) indicates that the model fits the data well.
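The AUC described above can be computed directly from the predicted probabilities. The sketch below traces the ROC curve over all distinct thresholds and integrates it with the trapezoidal rule; it assumes an accident is predicted fatal when its probability is at least the threshold, and the function name `roc_auc` is hypothetical. The Hosmer-Lemeshow test is not covered here.

```python
import numpy as np

def roc_auc(probs, y):
    """Area under the ROC curve (1 - specificity vs. sensitivity).

    probs : predicted probabilities of a fatal accident.
    y     : observed 0/1 severities (1 = fatal); both classes must occur.
    """
    probs = np.asarray(probs, dtype=float)
    y = np.asarray(y, dtype=int)
    # Thresholds from strictest to loosest; +inf yields the point (0, 0).
    thresholds = np.concatenate(([np.inf], np.unique(probs)[::-1]))
    tpr = np.array([(probs[y == 1] >= t).mean() for t in thresholds])  # sensitivity
    fpr = np.array([(probs[y == 0] >= t).mean() for t in thresholds])  # 1 - specificity
    # Trapezoidal rule along the curve.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

The result equals the share of (fatal, non-fatal) pairs in which the fatal accident receives the higher predicted probability, which is why 0.5 corresponds to random guessing.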

Table 7 Summary of model test indices

Table 7 shows that the traffic accident severity prediction method applied to a regional road network proposed by the present invention outperforms the traditional binary logistic regression model in both model prediction accuracy and goodness of fit.

Claims

1. A traffic accident severity prediction method applied to a regional road network, characterized by comprising the following steps:

Step 1: collect and preprocess road traffic accident data of the regional road network;

acquire N accident records from a road traffic accident database as an accident data set D, and for any i-th accident record select K categorical variables to form a set X = {x₁, x₂, …, x_k, …, x_K} that characterizes the i-th accident, where x_k denotes the k-th categorical variable, x_k has C_k categories, and the value of x_k among its C_k categories is denoted s_k; let s_ik denote the value of the k-th categorical variable in the i-th accident, and denote the set of values of all K categorical variables in the i-th accident as S_i = {s_i1, s_i2, ..., s_ik, ..., s_iK}; an arbitrary set of possible values of the K categorical variables of the i-th accident is defined likewise; k = 1, 2, 3, ..., K; i = 1, 2, 3, ..., N;

take the severity of the i-th accident as the response variable, denoted y_i, where y_i takes the value "0" or "1", representing a non-fatal accident and a fatal accident respectively;

Step 2: establish a latent class analysis model from the road traffic accident data of the regional road network;

Step 2.1: define a latent class variable V in the latent class analysis model, where V has T classes and any class is denoted t, t = 1, 2, ..., T; denote the value of V in the i-th accident as V_i;

Step 2.1.1: define the outer loop counter as τ and the maximum number of outer loop iterations as τ_max; let the number of classes set in the τ-th iteration be T_τ; initialize τ = 1;

Step 2.1.2: initialize t = 1;

Step 2.1.3: use formula (1) to obtain the conditional probability of the value set of the i-th accident on the K categorical variables given V_i = t, i.e., given that the accident belongs to the t-th latent class;

in formula (1), P(s_ik = s_k | V_i = t) denotes the conditional probability that the k-th categorical variable takes the value s_k when the i-th accident belongs to the t-th latent class;

Step 2.1.4: use formula (2) to obtain the unconditional probability of the value set of the K categorical variables in the i-th accident, i.e., the joint probability of the latent class analysis model;

in formula (2), P(V_i = t) is the probability that the i-th accident belongs to the t-th latent class, i.e., the proportion of the population accounted for by latent class t;

Step 2.2: estimate the model parameters by the maximum likelihood method to obtain the estimates of the latent class probabilities and of the conditional probabilities of the categorical variables, together with the τ-th maximum likelihood function value L_τ of the latent class analysis model;

Step 2.3: use formula (3) to calculate the posterior probability that the i-th accident is classified into the t-th latent class;

Step 2.4: assign t + 1 to t and check whether t > T_τ holds; if so, go to step 2.5; otherwise return to step 2.1.3;

Step 2.5: use formulas (4), (5), (6) and (7) to obtain the model fit indices, comprising: the τ-th information criterion AIC_τ, the τ-th Bayesian information criterion BIC_τ, the τ-th sample-size-adjusted Bayesian information criterion aBIC_τ, and the τ-th entropy value;

AIC_τ = -2ln(L_τ) + 2M (4)

BIC_τ = -2ln(L_τ) + ln(N)×M (5)

aBIC_τ = -2ln(L_τ) + ln(n^*)×M (6)

in formulas (4), (5), (6) and (7), M is the number of unknown parameters in the latent class analysis model; n^* is the adjusted sample size, with n^* = (N+2)/24;

Step 2.6: assign τ + 1 to τ and check whether τ > τ_max holds; if so, go to step 2.7; otherwise return to step 2.1.3;

Step 2.7: from the τ_max sets of values of the information criterion AIC, the Bayesian information criterion BIC, the sample-size-adjusted Bayesian information criterion aBIC and the entropy value R², select the number of latent classes corresponding to the optimal values of the model fit indices, denoted T^*; divide the accident data set D into T^* accident sub-categories, each holding the accident data of the t^*-th accident sub-category, t^* = 1, 2, …, T^*;

Step 3: according to the results of the latent class analysis model, establish a CART decision tree model for each of the T^* accident sub-categories;

Step 3.1: take the accident data in the t^*-th accident sub-category as the training sample set, and take the set X of K categorical variables as the feature set of the CART decision tree model; let the node sample threshold be σ, the feature value cut point be α, and the Gini index threshold be ε;

Step 3.2: initialize t^* = 1;

Step 3.3: input the training sample set, the feature set X, and the defined node sample threshold σ and Gini index threshold ε into the CART decision tree model;

Step 3.4: assign t^* + 1 to t^* and check whether t^* > T^* holds; if so, T^* binary decision trees have been obtained and step 3.5 is executed; otherwise return to step 3.3;

Step 3.5: determine the interaction terms between the categorical variables from the tree diagrams of the T^* binary decision trees, including the interaction terms determined by the binary decision tree of the t^*-th accident sub-category;

Step 4: establish an accident severity model based on binary logistic regression for each of the T^* accident sub-categories;

Step 4.1: take the accident data in the t^*-th accident sub-category as the fitting data of the accident severity model, and take the set X of K categorical variables together with the interaction terms of the t^*-th accident sub-category as the independent variables X^* of the accident severity model; the t^*-th accident sub-category contains J accidents, and the response variable of the j-th accident is denoted y_j;

Step 4.2: initialize t^* = 1;

Step 4.3: use formula (11) to obtain, based on binary logistic regression, the probability P(y = 1 | X^*) that a fatal accident (y_j = 1) occurs given the independent variables X^*;

in formula (11), w^* denotes the regression coefficients of the independent variables X^*;

Step 4.4: estimate the parameters w^* of the binary logistic regression accident severity model by the maximum likelihood method:

for the j-th accident, let P_j be the probability that y_j = 1 given its independent variables; the probability that y_j = 0 given the same independent variables is then 1 - P_j; the likelihood function L(w^*) is obtained using formula (12);

using maximum likelihood estimation, find the parameter estimate w′ at which L(w^*) attains its maximum;

from the parameter estimate w′, obtain the predicted probability that y_j = 1 for the j-th accident given its independent variables, and hence the predicted probabilities of the J accidents, and sort them in ascending order to obtain the sorted set of predicted probabilities, denoted {P′₁, ..., P′_j, ..., P′_J};

Step 4.5: adjust the prediction classification threshold of the accident severity model;

Step 4.6: assign t^* + 1 to t^* and check whether t^* > T^* holds; if so, T^* accident severity prediction models have been obtained; otherwise return to step 4.3.

2. The traffic accident severity prediction method according to claim 1, wherein step 3.3 is performed as follows:

Step 3.3.1: the CART decision tree uses the Gini index as the criterion for deciding whether to branch, and a binary decision tree model is built; according to the feature value cut point α, the training sample set is divided into a first subset D_α1 and a second subset D_α2, and the Gini index Gini(D_α) of the feature value cut point α is obtained using formula (8);

in formula (8), |D_α1| and |D_α2| denote the total numbers of accidents contained in the first subset D_α1 and the second subset D_α2 of the training sample set;

Gini(D_α1) denotes the Gini index of the first subset D_α1, given by formula (9);

in formula (9), the two probability terms are the probabilities of non-fatal and fatal accidents in the first subset D_α1;

in formula (8), Gini(D_α2) denotes the Gini index of the second subset D_α2, given by formula (10);

in formula (10), the two probability terms are the probabilities of non-fatal and fatal accidents in the second subset D_α2;

Step 3.3.2: traverse the cut points of every feature in the feature set X and calculate the Gini index of each cut point; if the Gini index of every cut point of every feature in X is below the threshold ε, the CART decision tree model is a single-node tree, and the single-node tree is output; otherwise, go to step 3.3.3;

Step 3.3.3: select the feature X_min in the feature set X whose cut point has the smallest Gini index, together with the corresponding cut point α_min; divide the training sample set D_t* into two subsets D_min1 and D_min2 according to α_min, and assign D_min1 and D_min2 to the two child nodes of the parent node that holds the training sample set D_t*;

if the numbers of samples in subset D_min1 and subset D_min2 are both below the given node sample threshold σ, the child nodes holding D_min1 and D_min2 are both leaf nodes, and the binary decision tree is output; if the number of samples in D_min1 and/or D_min2 is greater than the node sample threshold σ, the child node holding that subset is a non-leaf node that can be divided further, and step 3.3.4 is executed;

Step 3.3.4: for each non-leaf node, set the training sample set equal to the subset held by that node, delete from the feature set X the feature X_min whose cut point had the smallest Gini index, and return to step 3.3.1, until the number of samples in every child node is below the node sample threshold σ or the feature set X is empty; then output the final binary decision tree.

3. The traffic accident severity prediction method according to claim 1, wherein step 4.5 is performed as follows:

Step 4.5.1: define θ as the classification threshold of the model prediction, with 0 < θ < 1; depending on how its predicted probability compares with θ, the accident severity model predicts the j-th accident as a fatal accident or as a non-fatal accident;

Step 4.5.2: initialize j′ = 1;

Step 4.5.3: let the j′-th classification threshold θ_j′ of the model equal P′_j′, and use formula (13) to obtain the j′-th sensitivity Se(θ_j′) of the accident severity model, i.e., the probability that a fatal accident in the accident data set is predicted as a fatal accident;

in formula (13), the probability term is the probability that the s-th accident is predicted as a fatal accident, and y_s = 1 indicates that the s-th accident is a fatal accident, 1 ≤ s ≤ J;

use formula (14) to obtain the j′-th specificity Sp(θ_j′) of the accident severity model, i.e., the probability that a non-fatal accident in the accident data set is predicted as a non-fatal accident;

in formula (14), the probability term is the probability that the s-th accident is predicted as a fatal accident, and y_s = 0 indicates that the s-th accident is a non-fatal accident, 1 ≤ s ≤ J;

Step 4.5.4: assign j′ + 1 to j′ and check whether j′ > J holds; if so, J pairs of sensitivity and specificity values have been obtained and step 4.5.5 is executed; otherwise return to step 4.5.3;

Step 4.5.5: with the j′-th classification threshold θ_j′ on the horizontal axis and the corresponding sensitivity Se(θ_j′) and specificity Sp(θ_j′) on the vertical axis, draw the sensitivity and specificity curves, and take the threshold at the intersection of the two curves as the optimal model prediction classification threshold θ′.