CN110458244B - Traffic accident severity prediction method applied to regional road network - Google Patents

Info

Publication number
CN110458244B
CN110458244B CN201910770584.3A CN201910770584A CN110458244B CN 110458244 B CN110458244 B CN 110458244B CN 201910770584 A CN201910770584 A CN 201910770584A CN 110458244 B CN110458244 B CN 110458244B
Authority
CN
China
Prior art keywords
accident
model
formula
value
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910770584.3A
Other languages
Chinese (zh)
Other versions
CN110458244A (en)
Inventor
石琴
杨慧敏
陈一锴
骆仁佳
于淑君
董满生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN201910770584.3A
Publication of CN110458244A
Application granted
Publication of CN110458244B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/29 Graphical models, e.g. Bayesian networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/40 Business processes related to the transportation industry

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Development Economics (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a traffic accident severity prediction method applied to a regional road network, comprising the following steps: 1. collecting and preprocessing traffic accident data of a regional road network; 2. establishing a latent class analysis model based on the traffic accident data of the regional road network; 3. establishing a CART decision tree model for each subcategory according to the latent class analysis result; 4. establishing, for each subcategory, an accident severity model based on binary logistic regression that includes both the independent variables and their interaction terms, and taking the intersection of the sensitivity curve and the specificity curve as the model's classification threshold. The method reduces the adverse effect of accident data heterogeneity on the analysis results, overcomes two weaknesses of traditional traffic accident severity prediction models (ignoring interaction terms, and poor overall prediction on imbalanced data), and improves the prediction accuracy and goodness of fit of the accident severity model.

Description

A traffic accident severity prediction method applied to a regional road network

Technical field

The invention relates to a traffic accident severity prediction method applied to a regional road network, and belongs to the technical field of road traffic safety analysis.

Background art

According to the Global Status Report on Road Safety, road traffic accidents are the eighth leading cause of death worldwide, killing more than 1.35 million people every year, and road traffic safety has become a major focus of global attention. Analyzing traffic accident data to identify the factors that affect accident severity, and to propose countermeasures that reduce the risk of fatal accidents, is currently one of the most practical ways to improve traffic safety. However, a road traffic accident is a complex event involving drivers' responses to the external environment and the interaction of vehicle, road, traffic and environmental factors, and some influencing factors may be unobserved. As a result, traffic accident data are highly heterogeneous, and accident severity may be affected by interactions between factors.

For analyzing accident severity (fatal versus non-fatal accidents), the binary logistic regression model is the most widely used method. However, it ignores both the heterogeneity of accident data and the interactions between independent variables, which may lead to inaccurate parameter estimates or to important hidden relationships being missed. Yu Rongjie et al. used latent class analysis to divide accident data into several homogeneous latent classes, reducing the influence of data heterogeneity on the analysis results (Yu R, Wang X, Abdel-Aty M. A Hybrid Latent Class Analysis Modeling Approach to Analyze Urban Expressway Crash Risk [J]. Accident Analysis and Prevention, 2017, 101: 37-43.). Rusli et al. used decision trees to screen high-order interactions between independent variables and included these high-order interaction terms together with the main effects in the accident severity model, so as to quantify the effect of variable interactions on accident severity; however, their method considers only the high-order interactions and ignores interactions of other orders (Rusdi Rusli, Md. Mazharul Haque, Mohammad Saifuzzaman, Mark King. Crash severity along rural mountainous highways in Malaysia: An application of a combined decision tree and logistic regression model [J]. Traffic Injury Prevention, 2018, 19(7): 741-748.). In addition, the traditional binary logistic regression model considers only overall prediction accuracy and uses 0.5 as the classification threshold. Fatal accidents usually make up a small share of traffic accident data (that is, the data are imbalanced), so although a threshold of 0.5 yields high overall accuracy, it makes the sensitivity so low that the model loses its predictive value.

Summary of the invention

To overcome the shortcomings of the prior art, the present invention proposes a traffic accident severity prediction method applied to a regional road network. The method aims to reduce the adverse effect of accident data heterogeneity on the analysis results, to identify interaction terms between independent variables, and to adjust the classification threshold of the prediction model. It thereby overcomes the problems that traditional traffic accident severity prediction models ignore interaction terms and predict poorly overall on imbalanced data, and it improves the prediction accuracy and goodness of fit of the accident severity model.

To achieve the above object, the present invention adopts the following technical solution:

The traffic accident severity prediction method applied to a regional road network according to the present invention is characterized by the following steps:

Step 1. Collect and preprocess road traffic accident data of the regional road network;

Obtain $N$ accident records from the road traffic accident database as the accident data set $D$, and select $K$ categorical variables from any $i$-th accident record to form the set $X=\{x_1,x_2,\dots,x_k,\dots,x_K\}$ that characterizes the $i$-th accident. Here $x_k$ denotes the $k$-th categorical variable, which has $C_k$ categories, and the value taken by $x_k$ among its $C_k$ categories is denoted $s_k$. Let $s_{ik}$ denote the value of the $k$-th categorical variable for the $i$-th accident; the set of values of all $K$ categorical variables for the $i$-th accident is then $S_i=\{s_{i1},s_{i2},\dots,s_{ik},\dots,s_{iK}\}$. Let $\{s_1,s_2,\dots,s_K\}$ denote any one of the possible value sets of the $K$ categorical variables of the $i$-th accident; $k=1,2,3,\dots,K$; $i=1,2,3,\dots,N$;

Take the severity of the $i$-th accident as the predicted (response) variable, denoted $y_i$, where $y_i=0$ indicates a non-fatal accident and $y_i=1$ indicates a fatal accident;

Step 2. Establish a latent class analysis model from the road traffic accident data of the regional road network;

Step 2.1. Define a latent class variable $V$ in the latent class analysis model; $V$ has $T$ classes, any one of which is denoted $t$, $t=1,2,\dots,T$; the value of $V$ for the $i$-th accident is denoted $V_i$;

Step 2.1.1. Let $\tau$ be the outer loop counter and $\tau_{\max}$ the maximum number of outer loop iterations; let $T_\tau$ be the number of classes set in the $\tau$-th iteration; initialize $\tau=1$;

Step 2.1.2. Initialize $t=1$;

Step 2.1.3. Use equation (1) to obtain the conditional probability $P(S_i=\{s_1,\dots,s_K\}\mid V_i=t)$ that the $i$-th accident takes the value set $\{s_1,\dots,s_K\}$ on the $K$ categorical variables, given that $V_i=t$, i.e. that the accident belongs to the $t$-th latent class:

$$P(S_i=\{s_1,\dots,s_K\}\mid V_i=t)=\prod_{k=1}^{K}P(s_{ik}=s_k\mid V_i=t) \tag{1}$$

In equation (1), $P(s_{ik}=s_k\mid V_i=t)$ is the conditional probability that the $k$-th categorical variable takes the value $s_k$ when the $i$-th accident belongs to the $t$-th latent class;

Step 2.1.4. Use equation (2) to obtain the unconditional probability $P(S_i=\{s_1,\dots,s_K\})$ that the $K$ categorical variables of the $i$-th accident take the value set $\{s_1,\dots,s_K\}$, which is the joint probability of the latent class analysis model:

$$P(S_i=\{s_1,\dots,s_K\})=\sum_{t=1}^{T}P(V_i=t)\prod_{k=1}^{K}P(s_{ik}=s_k\mid V_i=t) \tag{2}$$

In equation (2), $P(V_i=t)$ is the probability that the $i$-th accident belongs to the $t$-th latent class, i.e. the proportion of the population that falls in latent class $t$;

Step 2.2. Estimate the model parameters by maximum likelihood, obtaining the estimates $\hat{P}(V_i=t)$ and $\hat{P}(s_{ik}=s_k\mid V_i=t)$ of the latent class probabilities and of the conditional probabilities of the categorical variables, together with the $\tau$-th maximized likelihood value $L_\tau$ of the latent class analysis model;

Step 2.3. Use equation (3) to compute the posterior probability $\hat{P}(V_i=t\mid S_i=\{s_1,\dots,s_K\})$ that the $i$-th accident is classified into the $t$-th latent class:

$$\hat{P}(V_i=t\mid S_i=\{s_1,\dots,s_K\})=\frac{\hat{P}(V_i=t)\prod_{k=1}^{K}\hat{P}(s_{ik}=s_k\mid V_i=t)}{\sum_{t'=1}^{T}\hat{P}(V_i=t')\prod_{k=1}^{K}\hat{P}(s_{ik}=s_k\mid V_i=t')} \tag{3}$$
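The calculation in equations (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration, assuming the class probabilities and conditional probabilities have already been estimated; the function name `lca_posterior`, the nested-list layout of `cond_probs`, and the two-class example values are hypothetical and are not taken from the patent.

```python
from math import prod

def lca_posterior(case, class_probs, cond_probs):
    """Return (joint probability, posterior class probabilities) for one accident.

    case        -- observed category index of each of the K categorical variables
    class_probs -- class_probs[t] = P(V = t)
    cond_probs  -- cond_probs[t][k][s] = P(x_k = s | V = t)
    """
    # Equation (1): within a class, the K variables are multiplied together.
    conditional = [
        prod(cond_probs[t][k][s] for k, s in enumerate(case))
        for t in range(len(class_probs))
    ]
    # Equation (2): weight each class-conditional probability by the class share.
    weighted = [p_t * c for p_t, c in zip(class_probs, conditional)]
    joint = sum(weighted)
    # Equation (3): Bayes' rule gives the posterior membership probabilities.
    posterior = [w / joint for w in weighted]
    return joint, posterior

# Hypothetical model: two latent classes, K = 2 binary variables.
class_probs = [0.6, 0.4]
cond_probs = [
    [[0.9, 0.1], [0.8, 0.2]],  # class 0
    [[0.2, 0.8], [0.3, 0.7]],  # class 1
]
joint, posterior = lca_posterior([0, 0], class_probs, cond_probs)
```

In this sketch an accident would be assigned to the class with the largest posterior probability, which is one common way to form the subcategories used in the later steps.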

Step 2.4. Assign $t+1$ to $t$ and check whether $t>T_\tau$; if so, go to step 2.5; otherwise, return to step 2.1.3;

Step 2.5. Use equations (4), (5), (6) and (7) to obtain the model fit indices: the $\tau$-th information criterion $\mathrm{AIC}_\tau$, the $\tau$-th Bayesian information criterion $\mathrm{BIC}_\tau$, the $\tau$-th sample-size-adjusted Bayesian information criterion $\mathrm{aBIC}_\tau$, and the $\tau$-th entropy value $R^2_\tau$:

$$\mathrm{AIC}_\tau=-2\ln(L_\tau)+2M \tag{4}$$

$$\mathrm{BIC}_\tau=-2\ln(L_\tau)+\ln(N)\times M \tag{5}$$

$$\mathrm{aBIC}_\tau=-2\ln(L_\tau)+\ln(n^*)\times M \tag{6}$$

[Equation (7), defining the entropy value $R^2_\tau$, is an image in the source and is not reproduced here.]

In equations (4), (5), (6) and (7), $M$ is the number of unknown parameters in the latent class analysis model, and $n^*$ is the adjusted sample size, $n^*=(N+2)/24$;
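As a worked illustration of equations (4) to (6), the sketch below computes the three information criteria from a log-likelihood value. The entropy function is an assumption: equation (7) is not legible in this text, so the code uses the relative-entropy definition commonly reported by latent class software, which may differ from the patent's own formula. All function names and the example numbers are hypothetical, except N = 2595, which is the sample size given in the embodiment.

```python
from math import log

def fit_indices(log_likelihood, n_params, n_cases):
    """AIC, BIC and sample-size-adjusted BIC, following equations (4)-(6).

    log_likelihood is ln(L), n_params is M, n_cases is N.
    """
    aic = -2.0 * log_likelihood + 2.0 * n_params
    bic = -2.0 * log_likelihood + log(n_cases) * n_params
    n_star = (n_cases + 2) / 24.0  # adjusted sample size n*
    abic = -2.0 * log_likelihood + log(n_star) * n_params
    return aic, bic, abic

def relative_entropy(posteriors):
    """ASSUMED definition (not from the source): 1 - sum(-p ln p) / (N ln T).

    posteriors[i][t] is the posterior probability of class t for accident i;
    at least two classes are required, otherwise ln T is zero.
    """
    n_cases = len(posteriors)
    n_classes = len(posteriors[0])
    uncertainty = -sum(p * log(p) for row in posteriors for p in row if p > 0.0)
    return 1.0 - uncertainty / (n_cases * log(n_classes))

# Hypothetical fit: ln(L) = -1000 with M = 10 parameters and N = 2595 accidents.
aic, bic, abic = fit_indices(-1000.0, 10, 2595)
```

Lower AIC, BIC and aBIC indicate a better trade-off between fit and complexity, while an entropy value closer to 1 indicates cleaner separation between classes.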

Step 2.6. Assign $\tau+1$ to $\tau$ and check whether $\tau>\tau_{\max}$; if so, go to step 2.7; otherwise, return to step 2.1.3;

Step 2.7. From the $\tau_{\max}$ sets of values of the information criterion AIC, the Bayesian information criterion BIC, the sample-size-adjusted Bayesian information criterion aBIC and the entropy value $R^2$, select the number of latent classes at which the fit indices attain their best values, denoted $T^*$; partition the accident data set $D$ into $T^*$ accident subcategories, denoted $\{D_1,\dots,D_{t^*},\dots,D_{T^*}\}$, where $D_{t^*}$ is the accident data in the $t^*$-th accident subcategory, $t^*=1,2,\dots,T^*$;

Step 3. Based on the latent class analysis results, establish a CART decision tree model for each of the $T^*$ accident subcategories;

Step 3.1. Take the accident data $D_{t^*}$ of the $t^*$-th accident subcategory as the training sample set, and take the set $X$ of $K$ categorical variables as the feature set of the CART decision tree model; let $\sigma$ be the node sample threshold, $\alpha$ a feature value split point, and $\varepsilon$ the Gini index threshold;

Step 3.2. Initialize $t^*=1$;

Step 3.3. Input the training sample set $D_{t^*}$, the feature set $X$, the node sample threshold $\sigma$ and the Gini index threshold $\varepsilon$ into the CART decision tree model;

Step 3.4. Assign $t^*+1$ to $t^*$ and check whether $t^*>T^*$; if so, $T^*$ binary decision trees have been obtained, and go to step 3.5; otherwise, return to step 3.3;

Step 3.5. From the tree diagrams of the $T^*$ binary decision trees, determine the interaction terms between categorical variables, where each binary decision tree determines the interaction terms of its own accident subcategory;

Step 4. Establish an accident severity model based on binary logistic regression for each of the $T^*$ accident subcategories;

Step 4.1. Take the accident data $D_{t^*}$ of the $t^*$-th accident subcategory as the fitting data of the accident severity model, and take the set $X$ of $K$ categorical variables together with the interaction terms of the $t^*$-th accident subcategory as the independent variables $X^*$ of the model; let the $t^*$-th accident subcategory contain $J$ accidents, $J=|D_{t^*}|$, and denote the predicted variable of the $j$-th accident by $y_j$;

Step 4.2. Initialize $t^*=1$;

Step 4.3. Use equation (11) to obtain, from binary logistic regression, the probability $P(y=1\mid X^*)$ of a fatal accident ($y_j=1$) given the independent variables $X^*$:

$$P(y=1\mid X^*)=\frac{\exp(w^*\cdot X^*)}{1+\exp(w^*\cdot X^*)} \tag{11}$$

In equation (11), $w^*$ is the vector of regression coefficients of the independent variables $X^*$;

Step 4.4. Estimate the parameters $w^*$ of the binary logistic regression accident severity model by maximum likelihood:

For the $j$-th accident, let $P_j=P(y_j=1\mid X_j^*)$ be the probability that $y_j=1$ given the independent variables $X_j^*$; the probability that $y_j=0$ given $X_j^*$ is then $1-P_j$. Equation (12) gives the likelihood function $L(w^*)$:

$$L(w^*)=\prod_{j=1}^{J}P_j^{\,y_j}(1-P_j)^{1-y_j} \tag{12}$$

By maximum likelihood estimation, find the parameter estimate $w'$ at which $L(w^*)$ attains its maximum;

Using the estimate $w'$, obtain the predicted probability $\hat{P}_j$ that $y_j=1$ for the $j$-th accident given $X_j^*$, and hence the predicted probabilities $\{\hat{P}_1,\dots,\hat{P}_j,\dots,\hat{P}_J\}$ of the $J$ accidents; sort them in ascending order and denote the sorted set by $\{P'_1,\dots,P'_j,\dots,P'_J\}$;
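Steps 4.3 and 4.4 can be illustrated with a small pure-Python sketch of equations (11) and (12). Several choices here are assumptions made only for illustration: an intercept column is added to the design row, the likelihood is maximized by plain gradient ascent (the patent specifies maximum likelihood but not the optimizer), and the helper names and the ten-record toy data set are hypothetical.

```python
from math import exp, log

def fatal_probability(weights, row):
    """Equation (11): P(y = 1 | X*) for one accident; row[0] is a constant 1 (assumed intercept)."""
    z = sum(w * x for w, x in zip(weights, row))
    return 1.0 / (1.0 + exp(-z))

def log_likelihood(weights, rows, labels):
    """Natural logarithm of the likelihood in equation (12)."""
    total = 0.0
    for row, y in zip(rows, labels):
        p = fatal_probability(weights, row)
        total += y * log(p) + (1 - y) * log(1.0 - p)
    return total

def fit(rows, labels, steps=2000, rate=0.1):
    """Maximize the log-likelihood by gradient ascent (an illustrative optimizer choice)."""
    weights = [0.0] * len(rows[0])
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for row, y in zip(rows, labels):
            error = y - fatal_probability(weights, row)
            for k, x in enumerate(row):
                gradient[k] += error * x
        weights = [w + rate * g / len(rows) for w, g in zip(weights, gradient)]
    return weights

def with_interaction(x1, x2):
    """Design row: intercept, two main effects, and their product as the interaction term."""
    return [1.0, x1, x2, x1 * x2]

# Hypothetical records: (x1, x2, y) with y = 1 for a fatal accident.
cells = [(0, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 0), (1, 0, 1),
         (0, 1, 0), (0, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0)]
rows = [with_interaction(x1, x2) for x1, x2, _ in cells]
labels = [y for _, _, y in cells]
weights = fit(rows, labels)
# Predicted probabilities in ascending order, as at the end of step 4.4.
predicted = sorted(fatal_probability(weights, row) for row in rows)
```

The product column in `with_interaction` shows how an interaction term identified from a decision tree path could enter the model alongside the main effects.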

Step 4.5. Adjust the classification threshold of the accident severity model;

Step 4.6. Assign $t^*+1$ to $t^*$ and check whether $t^*>T^*$; if so, $T^*$ accident severity prediction models have been obtained; otherwise, return to step 4.3.

The traffic accident severity prediction method of the present invention is further characterized in that step 3.3 proceeds as follows:

Step 3.3.1. The CART decision tree uses the Gini index as the criterion for deciding whether to branch, and builds a binary decision tree model. A feature value split point $\alpha$ divides the training sample set $D_{t^*}$ into a first subset $D_{\alpha1}$ and a second subset $D_{\alpha2}$, and equation (8) gives the Gini index $\mathrm{Gini}(D_\alpha)$ of the split point $\alpha$:

$$\mathrm{Gini}(D_\alpha)=\frac{|D_{\alpha1}|}{|D_{t^*}|}\,\mathrm{Gini}(D_{\alpha1})+\frac{|D_{\alpha2}|}{|D_{t^*}|}\,\mathrm{Gini}(D_{\alpha2}) \tag{8}$$

In equation (8), $|D_{t^*}|$, $|D_{\alpha1}|$ and $|D_{\alpha2}|$ are the numbers of accidents in the training sample set $D_{t^*}$, the first subset $D_{\alpha1}$ and the second subset $D_{\alpha2}$, respectively;

$\mathrm{Gini}(D_{\alpha1})$ is the Gini index of the first subset $D_{\alpha1}$:

$$\mathrm{Gini}(D_{\alpha1})=1-\left(p_0^{\alpha1}\right)^2-\left(p_1^{\alpha1}\right)^2 \tag{9}$$

In equation (9), $p_0^{\alpha1}$ and $p_1^{\alpha1}$ are the probabilities of non-fatal and fatal accidents in the first subset $D_{\alpha1}$, respectively;

In equation (8), $\mathrm{Gini}(D_{\alpha2})$ is the Gini index of the second subset $D_{\alpha2}$:

$$\mathrm{Gini}(D_{\alpha2})=1-\left(p_0^{\alpha2}\right)^2-\left(p_1^{\alpha2}\right)^2 \tag{10}$$

In equation (10), $p_0^{\alpha2}$ and $p_1^{\alpha2}$ are the probabilities of non-fatal and fatal accidents in the second subset $D_{\alpha2}$, respectively;

Step 3.3.2. Traverse the split points of every feature in the feature set $X$ and compute the Gini index of each split point; if the Gini index of every split point of every feature in $X$ is below the threshold $\varepsilon$, the CART decision tree model is a single-node tree, and that tree is output; otherwise, go to step 3.3.3;

Step 3.3.3. Select the feature $X_{\min}$ whose split point has the smallest Gini index in the feature set $X$, together with that split point $\alpha_{\min}$; use $\alpha_{\min}$ to divide the training sample set $D_{t^*}$ into two subsets $D_{\min1}$ and $D_{\min2}$, and assign them to the two child nodes of the parent node that holds $D_{t^*}$;

If the sample counts of both $D_{\min1}$ and $D_{\min2}$ are below the node sample threshold $\sigma$, both child nodes are leaf nodes, and the binary decision tree is output; if the sample count of $D_{\min1}$ and/or $D_{\min2}$ exceeds $\sigma$, the corresponding child node is a non-leaf node that can be split further, and go to step 3.3.4;

Step 3.3.4. For each non-leaf node, set the training sample set $D_{t^*}$ to the subset held by that node, delete the feature $X_{\min}$ with the smallest Gini index from the feature set $X$, and return to step 3.3.1; when the sample counts of all child nodes are below the node sample threshold $\sigma$ or the feature set $X$ is empty, output the final binary decision tree.
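The split search of steps 3.3.1 to 3.3.3 can be sketched as follows. The sketch assumes that a split point on a categorical variable separates one category from all the others, which is one common CART convention and is not stated explicitly in the text; the function names and the toy `features` dictionary are hypothetical.

```python
def gini(labels):
    """Gini index of a node with labels 0 (non-fatal) or 1 (fatal), as in equations (9) and (10)."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    p0 = 1.0 - p1
    return 1.0 - p0 * p0 - p1 * p1

def gini_of_split(values, labels, category):
    """Equation (8): size-weighted Gini index of splitting on `value == category` (assumed split form)."""
    left = [y for v, y in zip(values, labels) if v == category]
    right = [y for v, y in zip(values, labels) if v != category]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(features, labels):
    """Return (Gini index, feature name, category) of the candidate split with the lowest Gini index."""
    candidates = [
        (gini_of_split(values, labels, category), name, category)
        for name, values in features.items()
        for category in sorted(set(values))
    ]
    return min(candidates)

# Hypothetical subcategory with two categorical features and four accidents.
features = {"weather": [1, 1, 2, 2], "road": [1, 2, 1, 2]}
labels = [0, 0, 1, 1]
best = best_split(features, labels)
```

A full tree would apply `best_split` recursively to each child subset, stopping when a node holds fewer samples than σ or no features remain, as described in step 3.3.4.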

Step 4.5 proceeds as follows:

Step 4.5.1. Define $\theta$ as the classification threshold of the model, with $0<\theta<1$; $\hat{P}_j\ge\theta$ means that the accident severity model predicts the $j$-th accident to be a fatal accident, and $\hat{P}_j<\theta$ means that it predicts the $j$-th accident to be a non-fatal accident;

Step 4.5.2. Initialize $j'=1$;

Step 4.5.3. Set the $j'$-th classification threshold $\theta_{j'}$ equal to $P'_{j'}$, and use equation (13) to obtain the $j'$-th sensitivity $Se(\theta_{j'})$ of the accident severity model, i.e. the proportion of fatal accidents in the accident data set that are predicted to be fatal:

$$Se(\theta_{j'})=\frac{\left|\{s:\hat{P}_s\ge\theta_{j'},\ y_s=1\}\right|}{\left|\{s:y_s=1\}\right|} \tag{13}$$

In equation (13), $\hat{P}_s$ is the predicted probability that the $s$-th accident is a fatal accident, $y_s=1$ means that the $s$-th accident is a fatal accident, and $1\le s\le J$;

Use equation (14) to obtain the $j'$-th specificity $Sp(\theta_{j'})$ of the accident severity model, i.e. the proportion of non-fatal accidents in the accident data set that are predicted to be non-fatal:

$$Sp(\theta_{j'})=\frac{\left|\{s:\hat{P}_s<\theta_{j'},\ y_s=0\}\right|}{\left|\{s:y_s=0\}\right|} \tag{14}$$

In equation (14), $\hat{P}_s$ is the predicted probability that the $s$-th accident is a fatal accident, $y_s=0$ means that the $s$-th accident is a non-fatal accident, and $1\le s\le J$;

Step 4.5.4. Assign $j'+1$ to $j'$ and check whether $j'>J$; if so, $J$ pairs of sensitivity and specificity values have been obtained, and go to step 4.5.5; otherwise, return to step 4.5.3;

Step 4.5.5. With the $j'$-th classification threshold $\theta_{j'}$ as the abscissa and the corresponding sensitivity $Se(\theta_{j'})$ and specificity $Sp(\theta_{j'})$ as ordinates, plot the sensitivity and specificity curves, and take the threshold at the intersection of the two curves as the optimal classification threshold $\theta'$ of the model.
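The threshold search of step 4.5 can be sketched as below. Two assumptions are made: an accident is predicted fatal when its probability is greater than or equal to the threshold, and, because the curves are only evaluated at the J candidate thresholds, the "intersection" is approximated by the candidate at which sensitivity and specificity are closest. The function names and the eight example probabilities are hypothetical.

```python
def sensitivity_specificity(probabilities, labels, threshold):
    """Equations (13) and (14); predicted fatal when probability >= threshold (assumed convention).

    Requires at least one fatal and one non-fatal accident in `labels`.
    """
    true_pos = sum(1 for p, y in zip(probabilities, labels) if y == 1 and p >= threshold)
    true_neg = sum(1 for p, y in zip(probabilities, labels) if y == 0 and p < threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return true_pos / positives, true_neg / negatives

def crossing_threshold(probabilities, labels):
    """Try each predicted probability as a threshold; keep the one where the two curves are closest."""
    best = None
    for theta in sorted(probabilities):
        se, sp = sensitivity_specificity(probabilities, labels, theta)
        gap = abs(se - sp)
        if best is None or gap < best[0]:
            best = (gap, theta, se, sp)
    return best[1], best[2], best[3]

# Hypothetical imbalanced subcategory: 3 fatal accidents out of 8.
probabilities = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.60, 0.70]
labels = [0, 0, 0, 1, 0, 0, 1, 1]
theta, se, sp = crossing_threshold(probabilities, labels)
se_half, sp_half = sensitivity_specificity(probabilities, labels, 0.5)
```

In this toy example the selected threshold is well below 0.5, which illustrates why a fixed 0.5 cut-off tends to favor specificity over sensitivity when fatal accidents are rare.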

Compared with the prior art, the beneficial effects of the present invention are:

1. Based on regional road network traffic accident data, the method first establishes a latent class analysis model that divides the accident data into several homogeneous subcategories; it then establishes a CART decision tree model for each subcategory to identify interaction terms between independent variables; finally, it establishes for each subcategory a binary logistic regression accident severity model that includes the interaction terms, and sets the intersection of the sensitivity and specificity curves as the classification threshold of the model. The method reduces the adverse effect of accident data heterogeneity on the analysis results, overcomes the problems that traditional traffic accident severity prediction models ignore interaction terms and predict poorly overall on imbalanced data, and improves the prediction accuracy and goodness of fit of the accident severity model.

2. By dividing traffic accident data into several homogeneous subcategories through latent class analysis, the method both reflects the heterogeneity of the accident data and allows latent patterns and mechanisms of road traffic accident occurrence to be identified and analyzed;

3. The method uses the CART decision tree model to identify interaction terms of every order between independent variables and includes them in the binary logistic regression model. This improves the goodness of fit of the model and identifies the important independent variables and interaction terms that affect traffic accident severity on the regional road network, which helps improve road traffic safety on that network;

4. By using the threshold at the intersection of the sensitivity and specificity curves as the classification threshold of the binary logistic regression model, the method addresses the imbalanced data classification problem and improves the prediction accuracy of the accident severity model.

Description of drawings

Fig. 1 is the CART decision tree diagram for category 1 of the present invention;

Fig. 2 is the sensitivity and specificity curve diagram for category 1 of the present invention;

Fig. 3 is the ROC curve diagram for category 1 of the present invention;

Fig. 4 is a flow chart of the method of the present invention.

Detailed description of the embodiments

In this embodiment, as shown in Fig. 4, a traffic accident severity prediction method applied to a regional road network is carried out according to the following steps:

Step 1: Collect and preprocess the road traffic accident data of the regional road network.

Step 1.1: Collect the traffic accident data for the road network of a region from the road traffic accident platform, and delete accident records in the traffic accident database that are incomplete (contain blank items) or unreasonable. A total of 2595 accidents (N = 2595) are obtained as the accident data set D for analysis. From five aspects (person, vehicle, accident characteristics, road and environment), 26 categorical variables are selected to form the set X = {x_1, x_2, ..., x_26} that characterizes the i-th accident, and they are used as the independent variables of the prediction model; their specific values are listed in Table 1. Here x_k denotes the k-th categorical variable, which has C_k categories, and the value of x_k among the C_k categories is denoted s_k (for example, the first categorical variable x_1 has two categories, so C_1 = 2 and s_1 is either 1 for female or 2 for male). Each accident can be represented as the set of values of the 26 categorical variables, S_i = {s_i1, s_i2, ..., s_ik, ..., s_i26}. Let

Figure GDA0002805037980000071

denote any one of the possible value sets of the K categorical variables of the i-th accident; k = 1, 2, 3, ..., K; i = 1, 2, 3, ..., N.

The accident severity of each accident is taken as the predicted (dependent) variable, denoted y_i; y_i = 0 indicates a non-fatal accident and y_i = 1 indicates a fatal accident.

Step 1.2: Perform a multicollinearity test with SPSS and delete any categorical variables that exhibit collinearity. The test shows that all variance inflation factors (VIF) are less than 5 and all corresponding tolerances (TOL) are greater than 0.1 (see Table 1), which indicates that there is no collinearity among the 26 categorical variables, so all of them can be included in the model analysis.
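The collinearity screening in step 1.2 can be illustrated outside SPSS. The following is a minimal sketch, assuming the categorical variables are treated as numerically coded columns; the function names are hypothetical and NumPy is used only for illustration. It relies on the identity that the k-th diagonal entry of the inverse correlation matrix equals VIF_k = 1 / (1 - R_k^2), with TOL_k = 1 / VIF_k.

```python
import numpy as np


def vif_and_tolerance(X):
    """Return (VIF, TOL) arrays, one entry per column of the coded matrix X."""
    X = np.asarray(X, dtype=float)
    # Correlation matrix of the columns; the diagonal of its inverse holds the VIFs.
    corr = np.corrcoef(X, rowvar=False)
    vif = np.diag(np.linalg.inv(corr))
    return vif, 1.0 / vif


def passes_collinearity_check(X, vif_max=5.0, tol_min=0.1):
    """Apply the thresholds quoted in step 1.2 (VIF < 5 and TOL > 0.1)."""
    vif, tol = vif_and_tolerance(X)
    return bool(np.all(vif < vif_max) and np.all(tol > tol_min))
```

A matrix with an exactly duplicated column would make the correlation matrix singular, so this sketch is only meant for data that are not perfectly collinear.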

Table 1 Definition and coding of the independent variables and collinearity test results

Figure GDA0002805037980000081

Step 2: Build a latent class analysis model from the road traffic accident data of the regional road network.

Step 2.1: Define a latent class variable V in the latent class analysis model. V has T classes, any one of which is denoted t, t = 1, 2, ..., T; the value of V for the i-th accident is denoted V_i.

Step 2.1.1: Let τ be the outer loop counter and let the maximum number of outer loop iterations be τ_max = 5; let the number of classes set in the τ-th iteration be T_τ, with T_τ = τ; initialize τ = 1.

Step 2.1.2: Initialize t = 1.

Step 2.1.3: Use formula (1) to obtain the conditional probability that, when V_i takes the value t (that is, the i-th accident belongs to the t-th latent class), the value set of the i-th accident over the K categorical variables is

Figure GDA0002805037980000082

This conditional probability is denoted

Figure GDA0002805037980000083

and is given by formula (1):

Figure GDA0002805037980000084

In formula (1), P(s_ik = s_k | V_i = t) is the conditional probability that the k-th categorical variable takes the value s_k when the i-th accident belongs to the t-th latent class.

Step 2.1.4: Use formula (2) to obtain the unconditional probability that the value set of the K categorical variables in the i-th accident is

Figure GDA0002805037980000091

This unconditional probability, which is the joint probability of the latent class analysis model, is denoted

Figure GDA0002805037980000092

and is given by formula (2):

Figure GDA0002805037980000093

In formula (2), P(V_i = t) is the probability that the i-th accident belongs to the t-th latent class, that is, the proportion of the population that falls in latent class t.
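The images of formulas (1) and (2) are not reproduced in this text. As a hedged reconstruction, assuming the standard local-independence form of latent class analysis and writing \tilde{S}_i (a placeholder symbol that is not taken from the source) for the value set of the i-th accident, the two formulas would read roughly as follows:

```latex
\begin{align}
P(S_i = \tilde{S}_i \mid V_i = t) &= \prod_{k=1}^{K} P(s_{ik} = s_k \mid V_i = t) \tag{1} \\
P(S_i = \tilde{S}_i) &= \sum_{t=1}^{T} P(V_i = t) \prod_{k=1}^{K} P(s_{ik} = s_k \mid V_i = t) \tag{2}
\end{align}
```

The notation in the original formula images may differ from this sketch.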

In addition, the basic constraints of the latent class analysis model are that the latent class probabilities sum to 1 and that the conditional probabilities of each categorical variable sum to 1, as shown in formulas (3) and (4):

Figure GDA0002805037980000094

Figure GDA0002805037980000095
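To make the probability model concrete, here is a minimal sketch of how the conditional and joint probabilities could be evaluated for one accident while checking the two sum-to-one constraints. It assumes the usual local-independence form of latent class analysis; the function name, the argument layout and the 0-based category indices are illustrative assumptions, not part of the patent.

```python
import numpy as np


def joint_probability(class_prob, cond_prob, values):
    """Joint probability of one accident's value set under a latent class model.

    class_prob: shape (T,), class_prob[t] = P(V = t).
    cond_prob:  list of K arrays, cond_prob[k][t, c] = P(s_k = c | V = t).
    values:     K category indices (0-based) observed for the accident.
    Returns (joint probability, per-class conditional probabilities).
    """
    class_prob = np.asarray(class_prob, dtype=float)
    assert np.isclose(class_prob.sum(), 1.0)  # class probabilities sum to 1
    per_class = np.ones_like(class_prob)
    for table, c in zip(cond_prob, values):
        table = np.asarray(table, dtype=float)
        assert np.allclose(table.sum(axis=1), 1.0)  # conditional probabilities sum to 1
        per_class *= table[:, c]  # product over the K variables for each class
    return float(class_prob @ per_class), per_class
```

Summing the returned joint probability over every possible value set gives 1, which is a quick sanity check on a fitted parameter set.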

Step 2.2: Estimate the model parameters by the maximum likelihood method to obtain the estimates of the latent class probabilities and of the conditional probabilities of the categorical variables,

Figure GDA0002805037980000096

as well as the τ-th maximized likelihood function value L_τ of the latent class analysis model.

Step 2.3: According to Bayes' theorem, use formula (5) to calculate the posterior probability that the i-th accident is classified into the t-th latent class,

Figure GDA0002805037980000097

which is given by formula (5):

Figure GDA0002805037980000098

where

Figure GDA0002805037980000099

is given by formula (6):

Figure GDA00028050379800000910

The i-th accident is assigned to the sub-category for which its posterior probability is largest. The posterior probabilities are calculated and compared for all N accidents, which achieves the clustering.
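The assignment rule above can be sketched as follows, assuming the posterior takes the usual Bayes form (class probability times class-conditional probability, normalized over the classes). The function name and the array layout are hypothetical, and category indices are 0-based for convenience.

```python
import numpy as np


def posterior_classes(class_prob, cond_prob, data):
    """Posterior class membership and hard assignment for N accidents.

    class_prob: shape (T,), P(V = t).
    cond_prob:  list of K arrays, cond_prob[k][t, c] = P(s_k = c | V = t).
    data:       (N, K) integer array of 0-based category indices.
    Returns (posterior of shape (N, T), assigned class index of shape (N,)).
    """
    class_prob = np.asarray(class_prob, dtype=float)
    data = np.asarray(data)
    # weighted[i, t] = P(V = t) * prod_k P(s_ik | V = t)
    weighted = np.tile(class_prob, (data.shape[0], 1))
    for k, table in enumerate(cond_prob):
        weighted *= np.asarray(table, dtype=float)[:, data[:, k]].T
    posterior = weighted / weighted.sum(axis=1, keepdims=True)
    # Each accident goes to the class with the largest posterior probability.
    return posterior, posterior.argmax(axis=1)
```

Grouping the rows of the data by the returned class index yields the accident sub-categories used in the later steps.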

Step 2.4: Assign t + 1 to t and check whether t > T_τ holds; if so, go to step 2.5; otherwise, return to step 2.1.3.

Step 2.5: Use formulas (7), (8), (9) and (10) to obtain the model fit indices: the τ-th Akaike information criterion AIC_τ, the τ-th Bayesian information criterion BIC_τ, the τ-th sample-size-adjusted Bayesian information criterion aBIC_τ, and the τ-th entropy value

Figure GDA0002805037980000101

AIC_τ = -2 ln(L_τ) + 2M (7)

BIC_τ = -2 ln(L_τ) + ln(N) × M (8)

aBIC_τ = -2 ln(L_τ) + ln(n*) × M (9)

Figure GDA0002805037980000102

In formulas (7), (8), (9) and (10), M is the number of unknown parameters in the latent class analysis model, and n* is the adjusted sample size, n* = (N + 2)/24.
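Formulas (7) to (9) translate directly into code. The sketch below uses hypothetical function names. The entropy function is an assumption: the image of formula (10) is not reproduced here, so the sketch uses the relative entropy commonly reported for latent class models, 1 - Σ_i Σ_t (-p_it ln p_it) / (N ln T), which may not match the patent's formula exactly.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """AIC, BIC and sample-size-adjusted BIC as in formulas (7) to (9)."""
    n_star = (n_obs + 2) / 24  # adjusted sample size n*
    base = -2.0 * log_likelihood
    return {
        "AIC": base + 2 * n_params,
        "BIC": base + math.log(n_obs) * n_params,
        "aBIC": base + math.log(n_star) * n_params,
    }


def relative_entropy(posterior):
    """Assumed entropy index for an N x T table of posterior probabilities (T >= 2)."""
    n, t = len(posterior), len(posterior[0])
    if t < 2:
        raise ValueError("the entropy index needs at least two classes")
    uncertainty = -sum(p * math.log(p) for row in posterior for p in row if p > 0)
    return 1.0 - uncertainty / (n * math.log(t))
```

Calling `information_criteria` once per candidate number of classes and comparing the results mirrors the comparison of the five models described in the text.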

Step 2.6: Assign τ + 1 to τ and check whether τ > τ_max holds; if so, go to step 2.7; otherwise, return to step 2.1.3.

Step 2.7: The latent class analysis model is built and its parameters are estimated with Mplus version 7.4 by fixing the number of latent classes T. The number of latent classes is increased step by step from T = 1 to T = 5, which gives the estimated log-likelihood ln(L) of 5 different latent class analysis models, that is, τ runs up to 5. The fit indices of the 5 models are then calculated: the τ-th Akaike information criterion AIC_τ, the τ-th Bayesian information criterion BIC_τ, the τ-th sample-size-adjusted Bayesian information criterion aBIC_τ, and the τ-th entropy value

Figure GDA0002805037980000103

The corresponding model fit indices are listed in Table 2.

Table 2 Summary of model fit indices

Figure GDA0002805037980000104

In Table 2, smaller AIC, BIC and aBIC values indicate a better model fit, and an entropy value greater than 0.8 indicates a classification accuracy above 90%. LMR and BLRT are relative fit indices; a significant P value indicates that the model with T classes is significantly better than the model with T-1 classes. On this basis the accident data are divided into 3 categories for analysis, that is, T* = 3. The estimation results of the latent class analysis model for T* = 3 are shown in Table 3. The accident characteristics of each sub-category are identified from the conditional probability distributions: category 1 is named passenger car accidents on county roads, category 2 motor vehicle accidents on rural roads, and category 3 non-motor-vehicle accidents involving elderly people. This identifies the latent patterns of road traffic accident occurrence.

According to Bayes' theorem, formula (5) is used to calculate the posterior probability that the i-th observed accident is classified into each of the 3 latent classes,

Figure GDA0002805037980000111

The posterior probabilities are calculated and compared for all accidents, so that the 2595 accidents are divided into 3 accident sub-categories, denoted {D_1, D_2, D_3}, which contain 1104, 485 and 1006 accidents, respectively.

Table 3 Latent class probabilities and conditional probabilities of the independent variables for T* = 3 (partial)

Figure GDA0002805037980000112

Step 3: According to the results of the latent class analysis model, build a CART decision tree model for each of the 3 accident sub-categories.

Step 3.1: Take the accident data

Figure GDA0002805037980000116

in the t*-th accident sub-category as the training sample set, t* = 1, 2, 3; let the set X of the 26 categorical variables be the feature set of the CART decision tree model; let the node sample threshold be σ, the feature value split point be α, and the Gini index threshold be ε.

Step 3.2: Initialize t* = 1.

Step 3.3: Build the CART decision tree model with SPSS. Input the accident data set

Figure GDA0002805037980000113

and set the feature set X to the variables identified as significant in step 3.1, the node sample threshold σ to 50, and the Gini index threshold ε to 0.001.

Step 3.3.1: The CART decision tree uses the Gini index as the criterion for deciding whether to branch, and a binary decision tree model is built. According to the feature value split point α, the training sample set

Figure GDA0002805037980000114

is divided into a first subset D_α1 and a second subset D_α2; that is, with one category of the categorical variable x_k taken as the split point α, the sample set D is divided into the two subsets D_α1 and D_α2. The Gini index Gini(D_α) of the split point α is obtained from formula (11):

Figure GDA0002805037980000115

In formula (11),

Figure GDA0002805037980000121

|D_α1| and |D_α2| denote the total numbers of accidents contained in the training sample set

Figure GDA0002805037980000122

the first subset D_α1 and the second subset D_α2, respectively.

Gini(D_α1) denotes the Gini index of the first subset D_α1 and is given by formula (12):

Figure GDA0002805037980000123

In formula (12),

Figure GDA0002805037980000124

and

Figure GDA0002805037980000125

denote the probabilities of non-fatal and fatal accidents in the first subset D_α1, respectively.

In formula (11), Gini(D_α2) denotes the Gini index of the second subset D_α2 and is given by formula (13):

Figure GDA0002805037980000126

In formula (13),

Figure GDA0002805037980000127

and

Figure GDA0002805037980000128

denote the probabilities of non-fatal and fatal accidents in the second subset D_α2, respectively.
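The Gini computation for one split point can be sketched as below. This assumes formulas (11) to (13) take the standard CART form: the split's Gini index is the size-weighted average of the two subsets' Gini indices, and each subset's Gini index is 1 minus the sum of the squared class probabilities. The function names are hypothetical.

```python
def gini(labels):
    """Gini index of a set of binary labels (0 = non-fatal, 1 = fatal)."""
    if not labels:
        return 0.0
    p_fatal = sum(labels) / len(labels)
    p_non_fatal = 1.0 - p_fatal
    return 1.0 - p_non_fatal ** 2 - p_fatal ** 2


def gini_of_split(feature_values, labels, alpha):
    """Weighted Gini index when category alpha of one variable is the split point."""
    first = [y for x, y in zip(feature_values, labels) if x == alpha]   # D_alpha1
    second = [y for x, y in zip(feature_values, labels) if x != alpha]  # D_alpha2
    n = len(labels)
    return len(first) / n * gini(first) + len(second) / n * gini(second)
```

A split that separates fatal from non-fatal accidents perfectly has a Gini index of 0, while a split that leaves both subsets evenly mixed has a Gini index of 0.5.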

Step 3.3.2: Traverse the split points of every feature value in the feature set X and calculate the Gini index of each split point. If the Gini index of every split point in X is less than the threshold 0.001, the CART decision tree model is a single-node tree; output the single-node tree, in which case there is no interaction term. Otherwise, go to step 3.3.3.

Step 3.3.3: Select the feature X_min whose split point has the smallest Gini index in the feature set X, together with the corresponding split point α_min, and divide the training sample set

Figure GDA0002805037980000129

into two subsets D_min1 and D_min2 according to α_min. Then assign D_min1 and D_min2 to the two child nodes whose parent node is the training sample set

Figure GDA00028050379800001210

If the numbers of samples in both D_min1 and D_min2 are less than the given node sample threshold of 50, the child nodes containing D_min1 and D_min2 are both leaf nodes; output the binary decision tree, in which case only second-order interaction terms exist. If the number of samples in D_min1 and/or D_min2 is greater than the node sample threshold of 50, the child node containing that subset is a non-leaf node that can be split further; go to step 3.3.4.

Step 3.3.4: For a non-leaf node, set the training sample set

Figure GDA00028050379800001211

equal to the subset corresponding to the non-leaf node, delete the feature X_min corresponding to the smallest split-point Gini index from the feature set X, and return to step 3.3.1. Repeat until the number of samples in every child node is less than the node sample threshold of 50 or the feature set X is empty, and then output the final binary decision tree.
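The recursive procedure of steps 3.3.1 to 3.3.4 can be sketched in a few lines. This is an illustrative interpretation, not the SPSS implementation: rows are assumed to be dictionaries of coded categorical values, all names are hypothetical, and the stopping rule on ε is applied to the node's own Gini index (the usual CART convention) rather than to the Gini index of every candidate split. Root-to-leaf paths that use two or more variables are reported as candidate interaction terms.

```python
def gini(labels):
    """Gini index of binary labels (0 = non-fatal, 1 = fatal)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p ** 2 - (1.0 - p) ** 2


def best_split(rows, labels, features):
    """Return (weighted Gini, feature, split value) with the smallest Gini, or None."""
    best = None
    for f in features:
        for alpha in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] == alpha]
            right = [y for r, y in zip(rows, labels) if r[f] != alpha]
            if not left or not right:
                continue  # this value does not actually divide the node
            g = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or g < best[0]:
                best = (g, f, alpha)
    return best


def build_tree(rows, labels, features, sigma=50, eps=0.001):
    """Grow a binary tree; a used feature is removed before splitting further."""
    node = {"n": len(labels), "fatal": sum(labels)}
    if len(labels) < sigma or not features or gini(labels) < eps:
        return node  # leaf node
    split = best_split(rows, labels, features)
    if split is None:
        return node
    _, f, alpha = split
    remaining = [x for x in features if x != f]
    left = [i for i, r in enumerate(rows) if r[f] == alpha]
    right = [i for i, r in enumerate(rows) if r[f] != alpha]
    node["feature"], node["alpha"] = f, alpha
    node["left"] = build_tree([rows[i] for i in left], [labels[i] for i in left],
                              remaining, sigma, eps)
    node["right"] = build_tree([rows[i] for i in right], [labels[i] for i in right],
                               remaining, sigma, eps)
    return node


def interaction_paths(node, path=()):
    """Feature tuples along root-to-leaf paths that involve at least two features."""
    if "feature" not in node:
        return {path} if len(path) >= 2 else set()
    path = path + (node["feature"],)
    return interaction_paths(node["left"], path) | interaction_paths(node["right"], path)
```

A path with two features suggests a second-order interaction and a path with three features a third-order interaction; sub-paths of a longer path would have to be read off separately.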

Step 3.4: Assign t* + 1 to t* and check whether t* > 3 holds; if so, 3 binary decision tree models have been obtained and step 3.5 is executed; otherwise, return to step 3.3.

Step 3.5: Determine the interaction terms between the categorical variables from the tree diagrams of the 3 binary decision trees; the binary decision tree of the t*-th accident sub-category determines the interaction terms of that sub-category.

Fig. 1 shows the binary decision tree for category 1. The tree takes all data in category 1 as the root node and has a height of 4 levels and 5 leaf nodes. The rectangular box of each node shows the total number of accidents in the node, the numbers of fatal and non-fatal accidents, and their proportions. The tree diagram (Fig. 1) shows second-order interactions between vehicle type and passenger, between vehicle type and road technical grade, and between road technical grade and road alignment, and a third-order interaction among vehicle type, road technical grade and road alignment.

Similarly, the second-order interaction terms in category 2 are accident type with lighting condition and accident type with vehicle type, and the second-order interaction term in category 3 is vehicle type with driver age.

Step 4: Build an accident severity model based on binary logistic regression for each of the 3 accident sub-categories.

Step 4.1: Take the accident data

Figure GDA0002805037980000131

in the t*-th accident sub-category as the fitting data of the accident severity model, and take the set X of the K categorical variables together with the interaction terms of the t*-th accident sub-category as the independent variables X* of the accident severity model. Let the t*-th accident sub-category contain J accidents, where the value of J is

Figure GDA0002805037980000132

and denote the predicted variable of the j-th accident by y_j.

A univariate chi-square test is performed on each accident sub-category with SPSS; a P value less than 0.05 indicates that the independent variable is significantly associated with the dependent variable. The results of the univariate chi-square test are shown in Table 4; in category 1, 16 variables are significantly associated with accident severity.

Table 4 Univariate chi-square test results for each accident sub-category

Figure GDA0002805037980000133

Step 4.2: Initialize t* = 1.

Step 4.3: Use formula (14) to obtain, based on binary logistic regression, the probability P(y = 1 | X*) of a fatal accident (y_j = 1) given the independent variables X*:

Figure GDA0002805037980000141

In formula (14), w* is the vector of regression coefficients of the independent variables X*.

Step 4.4: Estimate the parameters w* of the accident severity model based on binary logistic regression by the maximum likelihood method:

For the j-th accident, let

Figure GDA0002805037980000142

be the probability of y_j = 1 given the independent variables

Figure GDA0002805037980000143

Then the probability of y_j = 0 given the independent variables

Figure GDA0002805037980000144

is 1 - P_j, and the likelihood function L(w*) is obtained from formula (15):

Figure GDA0002805037980000145
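The images of formulas (14) and (15) are not reproduced in this text. As a hedged reconstruction, assuming the standard binary logistic regression model in which X* includes a constant 1 for the intercept, they would read roughly as follows:

```latex
\begin{align}
P(y = 1 \mid X^{*}) &= \frac{\exp\left(w^{*\top} X^{*}\right)}{1 + \exp\left(w^{*\top} X^{*}\right)} \tag{14} \\
L(w^{*}) &= \prod_{j=1}^{J} P_j^{\,y_j} \left(1 - P_j\right)^{1 - y_j} \tag{15}
\end{align}
```

The notation in the original formula images may differ from this sketch.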

Maximum likelihood estimation is used to find the estimated parameters w′ at which L(w*) attains its maximum. The parameters of the accident severity model are estimated with SPSS. The interaction terms of the categorical variables enter the model as independent variables in the form of products of the categorical variables, and dummy variables are set for each independent variable to make the model results easier to interpret. The Wald test is used for entering independent variables into the model or removing them from it, with entry and removal criteria of P < 0.05 and P > 0.1, respectively, and the number of iterations is set to 20.
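The estimation step can be illustrated with a small Newton-Raphson fit in which an interaction term enters as the product of two dummy variables. This is a minimal sketch with hypothetical function names; it omits the Wald-based variable selection that SPSS performs and assumes the data are not perfectly separable, so that the Hessian stays invertible.

```python
import numpy as np


def design_matrix(a, b):
    """Columns: intercept, dummy a, dummy b, and the interaction a * b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.column_stack([np.ones_like(a), a, b, a * b])


def predict(X, w):
    """Predicted probability of a fatal accident for each row of X."""
    return 1.0 / (1.0 + np.exp(-X @ w))


def fit_logistic(X, y, n_iter=20):
    """Maximize the log-likelihood with a fixed number of Newton-Raphson steps."""
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = predict(X, w)
        gradient = X.T @ (y - p)                        # score vector
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])  # observed information
        w = w + np.linalg.solve(hessian, gradient)
    return w
```

With two dummies and their product the model is saturated, so the fitted probabilities reproduce the observed fatal-accident proportion in each of the four cells, which is a convenient way to check an implementation.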

From the estimated parameters w′, for the j-th accident with the independent variables

Figure GDA0002805037980000146

obtain the predicted probability of y_j = 1,

Figure GDA0002805037980000147

This gives the predicted probabilities of the J accidents,

Figure GDA0002805037980000148

which are sorted in ascending order; the sorted set of predicted probabilities is denoted {P′_1, ..., P′_j, ..., P′_J}.

Step 4.5: Adjust the prediction classification threshold of the accident severity model.

Step 4.5.1: Define θ as the classification threshold of the model prediction, with 0 < θ < 1;

Figure GDA0002805037980000149

indicates that the accident severity model predicts the j-th accident to be a fatal accident, and

Figure GDA00028050379800001410

indicates that the accident severity model predicts the j-th accident to be a non-fatal accident.

Step 4.5.2: Initialize j′ = 1.

Step 4.5.3: Set the j′-th classification threshold θ_j′ of the model equal to P′_j′, and use formula (15) to obtain the j′-th sensitivity Se(θ_j′) predicted by the accident severity model, that is, the probability that a fatal accident in the accident data set is predicted to be a fatal accident:

Figure GDA00028050379800001411

In formula (15),

Figure GDA00028050379800001412

denotes the probability that the s-th accident is predicted to be a fatal accident; y_s = 1 indicates that the s-th accident is a fatal accident; 1 ≤ s ≤ J.

Use formula (16) to obtain the j′-th specificity Sp(θ_j′) predicted by the accident severity model, that is, the probability that a non-fatal accident in the accident data set is predicted to be a non-fatal accident:

Figure GDA00028050379800001413

In formula (16),

Figure GDA0002805037980000151

denotes the probability that the s-th accident is predicted to be a fatal accident; y_s = 0 indicates that the s-th accident is a non-fatal accident; 1 ≤ s ≤ J.

Step 4.5.4: Assign j′ + 1 to j′ and check whether j′ > J holds; if so, J pairs of sensitivity and specificity values have been obtained and step 4.5.5 is executed; otherwise, return to step 4.5.3.

Step 4.5.5: With the j′-th classification threshold θ_j′ on the abscissa and the corresponding sensitivity Se(θ_j′) and specificity Sp(θ_j′) on the ordinate, plot the sensitivity curve and the specificity curve. The threshold at the intersection of the two curves is taken as the optimal prediction classification threshold θ′ of the model.
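Steps 4.5.1 to 4.5.5 can be sketched as a scan over the sorted predicted probabilities. The sketch makes two assumptions that the text does not spell out: an accident is predicted fatal when its probability is greater than or equal to the threshold, and the intersection of the two curves is approximated by the candidate threshold at which |Se - Sp| is smallest. The function name is hypothetical.

```python
import numpy as np


def crossing_threshold(prob, y):
    """Return (threshold, sensitivity, specificity) where the two curves are closest."""
    prob = np.asarray(prob, dtype=float)
    y = np.asarray(y, dtype=int)
    thresholds = np.sort(prob)  # every predicted probability is a candidate threshold
    # Sensitivity: share of fatal accidents (y = 1) predicted fatal.
    se = np.array([np.mean(prob[y == 1] >= t) for t in thresholds])
    # Specificity: share of non-fatal accidents (y = 0) predicted non-fatal.
    sp = np.array([np.mean(prob[y == 0] < t) for t in thresholds])
    best = int(np.argmin(np.abs(se - sp)))
    return float(thresholds[best]), float(se[best]), float(sp[best])
```

Because sensitivity falls and specificity rises as the threshold increases, the two curves cross once, and on imbalanced data the crossing point typically lies below the default threshold of 0.5.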

Step 4.6: Assign t* + 1 to t* and check whether t* > 3 holds; if so, 3 accident severity prediction models have been obtained; otherwise, return to step 4.3.

The parameter estimation results of the accident severity models obtained from the 3 binary logistic regression models,

Figure GDA0002805037980000152

are shown in Table 6. The regression coefficient w* is a vector consisting of the constant term β_0 and the regression coefficients B of the independent variables. A positive B indicates a positive effect on the occurrence of fatal accidents and a negative B indicates a negative effect. OR = exp(B) is the odds ratio, which measures how much the presence of an independent variable raises or lowers the chance of a fatal accident.

Table 6 Estimation results of the accident severity models

Figure GDA0002805037980000153

Figure GDA0002805037980000161

Note: B is the model regression coefficient; OR is the odds ratio, OR = exp(B).
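The relation between B and OR in the note can be illustrated with a short sketch. The helper names are hypothetical, and reading a percentage change as (OR - 1) × 100 applied to the odds is an assumption about how such figures are usually derived, not a statement taken from Table 6.

```python
import math


def odds_ratio(b):
    """Odds ratio for a regression coefficient B: OR = exp(B)."""
    return math.exp(b)


def percent_change_in_odds(b):
    """Percentage change in the odds of a fatal accident implied by B."""
    return (math.exp(b) - 1.0) * 100.0
```

A coefficient of 0 leaves the odds unchanged (OR = 1), a positive coefficient gives OR > 1, and a negative coefficient gives OR < 1.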

Meanwhile, from the estimated parameters w′, for the j-th accident with the independent variables

Figure GDA0002805037980000162

the predicted probability of y_j = 1 is obtained,

Figure GDA0002805037980000163

which gives the prediction classification threshold at the intersection of sensitivity and specificity; Fig. 2 shows the sensitivity and specificity curves for category 1. The prediction classification thresholds of the 3 accident sub-categories are 0.2930, 0.3928 and 0.4133, respectively, and the model prediction accuracies at these thresholds are 68.8%, 75.5% and 66.3%.

Step 4.6.1: Analysis of the accident severity model results:

Table 6 shows that the factors affecting accident severity differ markedly between the accident sub-categories. Unlicensed driving, drunk driving, speeding, central separation facilities, terrain, the second-order interaction between motorcycle and passenger, and the third-order interaction among truck, class-4 highway and road alignment are significant only in category 1. Agricultural vehicles, collision with fixed objects, off-peak hours, road alignment and visibility are significant only in category 2. Vehicles falling off the road, substandard roads, traffic control facilities, and the interaction between age and non-motor vehicles are significant only in category 3.

Taking category 1 as an example, the regression coefficients of unlicensed driving, speeding and drunk driving are all positive, and the probability of a fatal accident increases by about 132%, 140% and 124% in these three cases, respectively. Regarding accident type, collision with a non-fixed object increases the probability of a fatal accident by 96%; carrying passengers increases it by 165%; the absence of a central separation facility increases it by 120%; and at night it rises by about 44%.

Regarding variable interactions, the probability of a fatal accident decreases by about 60% when a motorcycle carries a passenger. When a truck travels on a class-4 highway, accident severity is readily affected by road alignment, with combined curve-and-slope sections having the largest effect (OR = 12.036), followed by curved sections (OR = 5.57).

Step 4.6.2: Model comparison:

To compare the method of the present invention with the traditional binary logistic regression model in accident severity analysis, model prediction accuracy and the ROC curve are used to measure prediction performance, and the Hosmer-Lemeshow (HL) statistic is used to measure goodness of fit.

The model prediction accuracy is obtained by taking the intersection of the sensitivity and specificity curves as the classification threshold; a higher value indicates better model performance. The ROC curve is plotted with 1 - specificity as the abscissa and sensitivity as the ordinate, and the area under the ROC curve (AUC) is used to evaluate the classification performance of the model: an AUC above 0.5 indicates that the model is better than random guessing and has predictive value, and the closer the AUC is to 1, the better the model's classification ability. Taking category 1 as an example, Figure 2 shows the threshold at the intersection of the sensitivity and specificity curves used as the model prediction classification threshold, and Figure 3 shows the ROC curve plotted with 1 - specificity as the abscissa and sensitivity as the ordinate. In addition, goodness of fit is assessed with the Hosmer-Lemeshow (HL) statistic, which follows a chi-square distribution; a non-significant P value (> 0.05) indicates that the model fits the data well.
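The ROC construction described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: it assumes an accident is classified as fatal when its predicted probability is at or above the threshold, that both outcome classes are present, and that the AUC is computed by the trapezoidal rule. The function names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (1 - specificity, sensitivity) pairs, one per distinct threshold.

    An observation is classified as positive when its score >= threshold.
    Assumes y_true contains at least one 0 and at least one 1.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step moves along the curve from (0, 0) to (1, 1).
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data: 1 = fatal accident, 0 = non-fatal accident.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
area = auc(roc_points(y, p))
```

An `area` of 0.5 corresponds to random guessing and 1.0 to perfect separation, matching the interpretation given in the text.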

Table 7. Summary of model test indices

Figure GDA0002805037980000171

Table 7 shows that the traffic accident severity prediction method for a regional road network proposed by the present invention outperforms the traditional binary logistic regression model in both prediction accuracy and goodness of fit.

Claims (3)

1. A traffic accident severity prediction method applied to a regional road network is characterized by comprising the following steps:
step one, collecting and preprocessing road traffic accident data of a regional road network;
acquiring N accident records from a road traffic accident database as an accident data set D, and selecting K categorical variables from any i-th accident record to form a set $X=\{x_1,x_2,\ldots,x_k,\ldots,x_K\}$ that characterizes the i-th accident, wherein $x_k$ represents the k-th categorical variable, the k-th categorical variable $x_k$ comprises $C_k$ categories, and the value taken by $x_k$ among the $C_k$ categories is denoted $s_k$; letting $s_{ik}$ represent the value of the k-th categorical variable in the i-th accident, and recording the set of values of all K categorical variables in the i-th accident as $S_i=\{s_{i1},s_{i2},\ldots,s_{ik},\ldots,s_{iK}\}$; letting
Figure FDA0002805037970000011
represent any one of the possible value sets of the K categorical variables of the i-th accident; $k=1,2,3,\ldots,K$; $i=1,2,3,\ldots,N$;
taking the severity of the i-th accident as the predicted variable, denoted $y_i$, wherein $y_i$ takes the value '0' or '1', representing a non-fatal accident and a fatal accident respectively;
step two, establishing a latent class analysis model from the road traffic accident data of the regional road network;
step 2.1, defining a latent class variable V in the latent class analysis model, wherein V comprises T classes and any one class is denoted t, $t=1,2,\ldots,T$; denoting the value of the latent class variable V in the i-th accident as $V_i$;
step 2.1.1, defining the outer-loop counter as $\tau$ and the maximum number of outer-loop iterations as $\tau_{max}$; letting the number of classes set in the $\tau$-th iteration be $T_\tau$; initializing $\tau=1$;
step 2.1.2, initializing $t=1$;
step 2.1.3, using formula (1) to obtain the conditional probability that, when $V_i$ of the i-th accident takes the value t, i.e. the accident belongs to the t-th latent class, the value set of the i-th accident on the K categorical variables is
Figure FDA0002805037970000012
, this conditional probability being
Figure FDA0002805037970000013
Figure FDA0002805037970000014
in formula (1), $P(s_{ik}=s_k \mid V_i=t)$ represents the conditional probability that the k-th categorical variable takes the value $s_k$ when the i-th accident belongs to the t-th latent class;
step 2.1.4, obtaining the joint probability, under the latent class analysis model, that the value set of the K categorical variables in the i-th accident is
Figure FDA0002805037970000015
, this joint probability being
Figure FDA0002805037970000016
Figure FDA0002805037970000017
in formula (2), $P(V_i=t)$ is the probability that the i-th accident belongs to the t-th latent class, i.e. the proportion of the population accounted for by latent class t;
step 2.2, estimating the model parameters by the maximum likelihood method to obtain the estimated values of the latent class probabilities and of the categorical-variable conditional probabilities
Figure FDA0002805037970000018
and the $\tau$-th maximum likelihood function value $L_\tau$ of the latent class analysis model;
step 2.3, calculating, using formula (3), the posterior probability that the i-th accident is classified into the t-th latent class
Figure FDA0002805037970000021
Figure FDA0002805037970000022
step 2.4, assigning $t+1$ to t and judging whether $t>T_\tau$ holds; if so, executing step 2.5; otherwise, returning to step 2.1.3;
step 2.5, using formulas (4), (5), (6) and (7) to obtain the model-fit evaluation indices, comprising: the $\tau$-th information criterion $AIC_\tau$, the $\tau$-th Bayesian information criterion $BIC_\tau$, the $\tau$-th sample-size-adjusted Bayesian information criterion $aBIC_\tau$, and the $\tau$-th entropy value
Figure FDA0002805037970000023
$$AIC_\tau=-2\ln(L_\tau)+2M \tag{4}$$
$$BIC_\tau=-2\ln(L_\tau)+\ln(N)\times M \tag{5}$$
$$aBIC_\tau=-2\ln(L_\tau)+\ln(n^*)\times M \tag{6}$$
Figure FDA0002805037970000024
in formulas (4), (5), (6) and (7), M is the number of unknown parameters in the latent class analysis model; $n^*$ is the adjusted sample size, with $n^*=(N+2)/24$;
step 2.6, assigning $\tau+1$ to $\tau$ and judging whether $\tau>\tau_{max}$ holds; if so, executing step 2.7; otherwise, returning to step 2.1.3;
step 2.7, from the $\tau_{max}$ sets of values of the information criterion AIC, the Bayesian information criterion BIC, the sample-size-adjusted Bayesian information criterion aBIC and the entropy value $R^2$, selecting the number of latent classes corresponding to the optimal value of the model-fit evaluation indices, denoted $T^*$; dividing the accident data set D into $T^*$ accident subcategories, denoted
Figure FDA0002805037970000025
, wherein
Figure FDA0002805037970000026
denotes the accident data in the $t^*$-th accident subcategory, $t^*=1,2,\ldots,T^*$;
step three, establishing a CART decision tree model for each of the $T^*$ accident subcategories according to the results of the latent class analysis model;
step 3.1, taking the accident data in the $t^*$-th accident subcategory
Figure FDA0002805037970000027
as the training sample set, and taking the set X composed of the K categorical variables as the feature set in the CART decision tree model; letting the node sample threshold be $\sigma$, the feature-value split point be $\alpha$, and the Gini index threshold be $\varepsilon$;
step 3.2, initializing $t^*=1$;
step 3.3, inputting the training sample set
Figure FDA0002805037970000028
, the feature set X, the defined node sample threshold $\sigma$ and the Gini index threshold $\varepsilon$ into the CART decision tree model;
step 3.4, assigning $t^*+1$ to $t^*$ and judging whether $t^*>T^*$ holds; if so, $T^*$ binary decision trees are obtained and step 3.5 is executed; otherwise, returning to step 3.3;
step 3.5, determining the interaction terms among the categorical variables from the tree diagrams of the $T^*$ binary decision trees, wherein the interaction terms of the $t^*$-th accident subcategory are those determined by the binary decision tree corresponding to that subcategory;
step four, establishing an accident severity model based on binary logistic regression for each of the $T^*$ accident subcategories;
step 4.1, taking the accident data in the $t^*$-th accident subcategory
Figure FDA0002805037970000031
as the fitting data of the accident severity model, and taking the set X composed of the K categorical variables together with the interaction terms of the $t^*$-th accident subcategory as the independent variables $X^*$ of the accident severity model; defining that the $t^*$-th accident subcategory contains J accident records, the value of J being
Figure FDA0002805037970000032
; the predicted variable of the j-th accident is denoted $y_j$;
step 4.2, initializing $t^*=1$;
step 4.3, using formula (11) to obtain, based on binary logistic regression, the probability $P(y=1 \mid X^*)$ that a fatal accident ($y_j=1$) occurs under the condition of the independent variables $X^*$:
Figure FDA0002805037970000033
in formula (11), $w^*$ is the regression coefficient of the independent variables $X^*$;
step 4.4, estimating the parameters $w^*$ of the binary logistic regression accident severity model by the maximum likelihood method;
for the j-th accident,
Figure FDA0002805037970000034
is the probability that $y_j=1$ under the condition of the given independent variables
Figure FDA0002805037970000035
, so that the probability that $y_j=0$ under the condition of the given independent variables
Figure FDA0002805037970000036
is $1-P_j$; the likelihood function $L(w^*)$ is obtained using formula (12):
Figure FDA0002805037970000037
using maximum likelihood estimation, obtaining the estimated parameters $w'$ at which $L(w^*)$ attains its maximum;
obtaining, from the estimated parameters $w'$, for the j-th accident under the condition of the independent variables
Figure FDA0002805037970000038
, the predicted probability that $y_j=1$
Figure FDA0002805037970000039
, thereby obtaining the predicted probabilities of the J accidents
Figure FDA00028050379700000310
, and sorting them in ascending order to obtain the sorted predicted probability set, denoted $\{P'_1,\ldots,P'_j,\ldots,P'_J\}$;
step 4.5, adjusting the prediction classification threshold of the accident severity model;
step 4.6, assigning $t^*+1$ to $t^*$ and judging whether $t^*>T^*$ holds; if so, $T^*$ accident severity prediction models are obtained; otherwise, returning to step 4.3.
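The quantities used in step two of claim 1 can be illustrated with a small sketch. Formulas (1) to (3) appear only as images in this record, so the code assumes the standard latent class form: local independence of the categorical variables within a class, a mixture over classes for the joint probability, and Bayes' rule for the posterior. Formulas (4) to (6) are taken directly from the text. All function names and the toy parameter values are hypothetical, and the parameter estimation itself (step 2.2) is not shown.

```python
import math


def class_conditional(pattern, cond_probs_t):
    """P(pattern | class t): product over variables of P(s_ik = s_k | V_i = t)."""
    p = 1.0
    for k, s_k in enumerate(pattern):
        p *= cond_probs_t[k][s_k]
    return p


def joint_probability(pattern, class_probs, cond_probs):
    """Mixture over classes: sum_t P(V_i = t) * P(pattern | V_i = t)."""
    return sum(
        class_probs[t] * class_conditional(pattern, cond_probs[t])
        for t in range(len(class_probs))
    )


def posterior(pattern, class_probs, cond_probs):
    """Posterior probability of each latent class for one accident (Bayes' rule)."""
    weights = [
        class_probs[t] * class_conditional(pattern, cond_probs[t])
        for t in range(len(class_probs))
    ]
    total = sum(weights)
    return [w / total for w in weights]


def information_criteria(log_likelihood, n_params, n_obs):
    """AIC, BIC and sample-size-adjusted BIC as in formulas (4) to (6)."""
    aic = -2.0 * log_likelihood + 2.0 * n_params
    bic = -2.0 * log_likelihood + math.log(n_obs) * n_params
    abic = -2.0 * log_likelihood + math.log((n_obs + 2) / 24) * n_params
    return aic, bic, abic


# Toy model: two latent classes, two binary variables (values 0 or 1).
class_probs = [0.6, 0.4]
cond_probs = [
    [[0.9, 0.1], [0.8, 0.2]],  # class 0: P(x_1 = 0/1), P(x_2 = 0/1)
    [[0.2, 0.8], [0.3, 0.7]],  # class 1
]
post = posterior((0, 0), class_probs, cond_probs)
assigned_class = post.index(max(post))  # modal assignment of the accident
```

In the claimed procedure these criteria would be computed for each candidate number of classes $T_\tau$, with lower AIC, BIC and aBIC values indicating a better trade-off between fit and complexity.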
2. The method of predicting the severity of a traffic accident according to claim 1, wherein said step 3.3 is performed as follows:
step 3.3.1, the CART decision tree uses the Gini index as the criterion for deciding whether the tree branches, and a binary decision tree model is established; according to the feature-value split point $\alpha$, the training sample set
Figure FDA00028050379700000411
is divided into a first subset $D_{\alpha 1}$ and a second subset $D_{\alpha 2}$, and the Gini index $Gini(D_\alpha)$ of the feature-value split point $\alpha$ is obtained using formula (8):
Figure FDA0002805037970000041
in formula (8),
Figure FDA0002805037970000042
, $|D_{\alpha 1}|$ and $|D_{\alpha 2}|$ respectively represent the total numbers of accidents contained in the training sample set
Figure FDA0002805037970000043
, the first subset $D_{\alpha 1}$ and the second subset $D_{\alpha 2}$;
$Gini(D_{\alpha 1})$ represents the Gini index of the first subset $D_{\alpha 1}$, given by:
Figure FDA0002805037970000044
in formula (9),
Figure FDA0002805037970000045
and
Figure FDA0002805037970000046
respectively represent the probabilities of non-fatal and fatal accidents in the first subset $D_{\alpha 1}$;
in formula (8), $Gini(D_{\alpha 2})$ represents the Gini index of the second subset $D_{\alpha 2}$, given by:
Figure FDA0002805037970000047
in formula (10),
Figure FDA0002805037970000048
and
Figure FDA0002805037970000049
respectively represent the probabilities of non-fatal and fatal accidents in the second subset $D_{\alpha 2}$;
step 3.3.2, traversing the split points of every feature in the feature set X and calculating the Gini index of each split point; if the Gini index of every split point in the feature set X is smaller than the threshold $\varepsilon$, the CART decision tree model is a single-node tree and the single-node tree is output; otherwise, executing step 3.3.3;
step 3.3.3, selecting the feature $x_{min}$ in the feature set X whose split point has the minimum Gini index, together with its corresponding split point $\alpha_{min}$, and dividing the training sample set $D_{t^*}$ into two subsets $D_{min1}$ and $D_{min2}$ according to the split point $\alpha_{min}$; then assigning the subsets $D_{min1}$ and $D_{min2}$ respectively to the two child nodes whose parent node is the training sample set $D_{t^*}$;
if the numbers of samples in subset $D_{min1}$ and subset $D_{min2}$ are both less than the given node sample threshold $\sigma$, the child nodes of both subsets are leaf nodes and the binary decision tree is output; if the number of samples in subset $D_{min1}$ and/or subset $D_{min2}$ is greater than the node sample threshold $\sigma$, the child node of that subset is a non-leaf node that can be split further, and step 3.3.4 is executed;
step 3.3.4, for a non-leaf node, setting the training sample set
Figure FDA00028050379700000410
equal to the subset corresponding to the non-leaf node, deleting the feature $x_{min}$ whose split point has the minimum Gini index from the feature set X, and returning to step 3.3.1, until the number of samples of every child node is smaller than the node sample threshold $\sigma$ or the feature set X is empty, whereupon the final binary decision tree is output.
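The split search of steps 3.3.1 to 3.3.3 can be sketched as follows. Formulas (8) to (10) appear only as images in this record, so the code assumes the standard CART definitions: the Gini index of a two-class node is one minus the sum of squared class proportions, and the Gini index of a split is the size-weighted average of the two subsets. The one-value-versus-rest split on a categorical feature, the function names and the feature names `night` and `curve` are illustrative assumptions.

```python
def gini(labels):
    """Gini index of a node with binary labels (0 = non-fatal, 1 = fatal)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    p0 = 1.0 - p1
    return 1.0 - p0 ** 2 - p1 ** 2


def gini_split(left, right):
    """Size-weighted Gini index of a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split(rows, labels, features):
    """Return (gini, feature, value) for the split with the lowest Gini index.

    Each candidate split sends rows with row[feature] == value to one subset
    and all other rows to the other subset.
    """
    best = None
    for f in features:
        for value in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] == value]
            right = [y for r, y in zip(rows, labels) if r[f] != value]
            if not left or not right:
                continue  # a split must produce two non-empty subsets
            g = gini_split(left, right)
            if best is None or g < best[0]:
                best = (g, f, value)
    return best


# Toy accident records with two hypothetical binary features.
rows = [
    {"night": 1, "curve": 0},
    {"night": 1, "curve": 1},
    {"night": 0, "curve": 0},
    {"night": 0, "curve": 1},
]
labels = [1, 1, 0, 0]
split = best_split(rows, labels, ["night", "curve"])
```

Applying `best_split` recursively to each resulting subset, while removing the chosen feature and stopping at the node sample threshold, gives the tree whose branches suggest interaction terms between variables.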
3. The method of predicting the severity of a traffic accident according to claim 1, wherein said step 4.5 is performed as follows:
step 4.5.1, defining $\theta$ as the prediction classification threshold of the model, with $0<\theta<1$;
Figure FDA0002805037970000051
represents that the accident severity model predicts the j-th accident to be a fatal accident;
Figure FDA0002805037970000052
represents that the accident severity model predicts the j-th accident to be a non-fatal accident;
step 4.5.2, initializing $j'=1$;
step 4.5.3, letting the j'-th classification threshold $\theta_{j'}$ of the model equal $P'_{j'}$, and using formula (13) to obtain the j'-th sensitivity $Se(\theta_{j'})$ of the accident severity model's predictions, i.e. the probability that a fatal accident in the accident data set is predicted to be a fatal accident:
Figure FDA0002805037970000053
in formula (13),
Figure FDA0002805037970000054
represents the probability that the s-th accident is predicted to be a fatal accident, $y_s=1$ represents that the s-th accident is a fatal accident, and $1\le s\le J$;
using formula (14) to obtain the j'-th specificity $Sp(\theta_{j'})$ of the accident severity model's predictions, i.e. the probability that a non-fatal accident in the accident data set is predicted to be a non-fatal accident:
Figure FDA0002805037970000055
in formula (14),
Figure FDA0002805037970000056
represents the probability that the s-th accident is predicted to be a fatal accident, $y_s=0$ represents that the s-th accident is a non-fatal accident, and $1\le s\le J$;
step 4.5.4, assigning $j'+1$ to $j'$ and judging whether $j'>J$ holds; if so, J pairs of sensitivity and specificity values have been obtained and step 4.5.5 is executed; otherwise, returning to step 4.5.3;
step 4.5.5, taking the j'-th classification threshold $\theta_{j'}$ as the abscissa and the corresponding sensitivity $Se(\theta_{j'})$ and specificity $Sp(\theta_{j'})$ values as the ordinates, plotting the sensitivity and specificity curves, and taking the threshold corresponding to the intersection of the two curves as the optimal model prediction classification threshold $\theta'$.
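The threshold search in claim 3 can be sketched as below. The prediction rule and formulas (13) and (14) appear only as images in this record, so the code assumes that an accident is predicted fatal when its probability is at or above the threshold, and it approximates the intersection of the two curves by the candidate threshold with the smallest gap between sensitivity and specificity. The function names and toy data are hypothetical, and both outcome classes are assumed to be present.

```python
def sensitivity(y_true, probs, theta):
    """Share of fatal accidents (y = 1) predicted fatal, i.e. with prob >= theta."""
    fatal = [p for y, p in zip(y_true, probs) if y == 1]
    return sum(1 for p in fatal if p >= theta) / len(fatal)


def specificity(y_true, probs, theta):
    """Share of non-fatal accidents (y = 0) predicted non-fatal, i.e. prob < theta."""
    non_fatal = [p for y, p in zip(y_true, probs) if y == 0]
    return sum(1 for p in non_fatal if p < theta) / len(non_fatal)


def crossing_threshold(y_true, probs):
    """Candidate threshold at which sensitivity and specificity are closest.

    Candidates are the predicted probabilities themselves, taken in ascending
    order as in steps 4.5.2 to 4.5.4.
    """
    best_theta, best_gap = None, None
    for theta in sorted(probs):
        gap = abs(sensitivity(y_true, probs, theta) - specificity(y_true, probs, theta))
        if best_gap is None or gap < best_gap:
            best_theta, best_gap = theta, gap
    return best_theta


# Toy data: 1 = fatal accident, 0 = non-fatal accident.
y = [0, 0, 0, 1, 1, 1]
p = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]
theta_opt = crossing_threshold(y, p)
```

Choosing the crossing point balances the two error types, which matters when fatal accidents are much rarer than non-fatal ones and the default threshold of 0.5 would rarely predict a fatality.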
CN201910770584.3A 2019-08-20 2019-08-20 Traffic accident severity prediction method applied to regional road network Active CN110458244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910770584.3A CN110458244B (en) 2019-08-20 2019-08-20 Traffic accident severity prediction method applied to regional road network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910770584.3A CN110458244B (en) 2019-08-20 2019-08-20 Traffic accident severity prediction method applied to regional road network

Publications (2)

Publication Number Publication Date
CN110458244A CN110458244A (en) 2019-11-15
CN110458244B true CN110458244B (en) 2021-03-30

Family

ID=68488078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910770584.3A Active CN110458244B (en) 2019-08-20 2019-08-20 Traffic accident severity prediction method applied to regional road network

Country Status (1)

Country Link
CN (1) CN110458244B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110942260B (en) * 2019-12-12 2024-02-13 长安大学 College traffic safety evaluation method based on Bayesian maximum entropy
CN111476274B (en) * 2020-03-16 2024-03-08 宜通世纪科技股份有限公司 Big data predictive analysis method, system, device and storage medium
CN111951550B (en) * 2020-08-06 2021-10-29 华南理工大学 Traffic safety risk monitoring method, device, storage medium and computer equipment
CN111931861B (en) * 2020-09-09 2021-01-05 北京志翔科技股份有限公司 Anomaly detection method for heterogeneous data set and computer-readable storage medium
CN112270994B (en) * 2020-10-14 2021-08-17 中国医学科学院阜外医院 Construction method, device, terminal and storage medium of a risk prediction model
CN112349098A (en) * 2020-11-03 2021-02-09 南京信息职业技术学院 Method for estimating accident severity by environmental elements in exit ramp area of expressway
CN112561175A (en) * 2020-12-18 2021-03-26 深圳赛安特技术服务有限公司 Traffic accident influence factor prediction method, device, equipment and storage medium
CN112837533B (en) * 2021-01-08 2021-11-19 合肥工业大学 Highway accident frequency prediction method considering risk factor time-varying characteristics
CN113762364B (en) * 2021-08-23 2022-11-04 东南大学 Unbalanced traffic accident data synthesis sampling method
CN114386844A (en) * 2022-01-11 2022-04-22 合肥工业大学 Modeling method based on relation between traffic state before accident and accident
CN115830800A (en) * 2022-11-28 2023-03-21 广州城建职业学院 Traffic accident early warning method, system, device and storage medium
CN116882780B (en) * 2023-07-05 2024-04-05 北京大学 A method for rural spatial element extraction and local classification planning based on landscape images

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154681A (en) * 2016-12-06 2018-06-12 杭州海康威视数字技术股份有限公司 Risk Forecast Method, the apparatus and system of traffic accident occurs
CN109598929A (en) * 2018-11-26 2019-04-09 北京交通大学 A kind of multi-class the number of traffic accidents prediction technique

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130331055A1 (en) * 2012-06-12 2013-12-12 Guardity Technologies, Inc. Qualifying Automatic Vehicle Crash Emergency Calls to Public Safety Answering Points
US10783997B2 (en) * 2016-08-26 2020-09-22 International Business Machines Corporation Personalized tolerance prediction of adverse drug events
CN109447306B (en) * 2018-08-13 2021-07-02 上海海事大学 Prediction method of subway accident delay time based on maximum likelihood regression tree

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Evaluation of the safety performance of highway alignments based on fault tree analysis and safety boundaries;Yikai Chen等;《Traffic Injury Prevention》;20180301;第19卷(第4期);第409-416页 *
The Model of Severity Prediction of Traffic Crash on the Curve;Jian-feng Xi等;《Green Transportation System and Safety》;20140109;第1-6页 *
基于有序Logit和多项Logit模型的高速公路交通事故严重程度预测;李庚凭;《中国优秀硕士学位论文全文数据库 工程科技Ⅱ辑》;20190115(第01期);第C034-2019页 *
车辆碰撞中行人死亡风险及颅脑损伤类型预测研究;冯成建;《中国博士学位论文全文数据库 工程科技Ⅱ辑》;20171115(第11期);第C035-25页 *

Also Published As

Publication number Publication date
CN110458244A (en) 2019-11-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant