CN115938105A - A mileage measurement method for highway sections based on ETC big data

Info

Publication number: CN115938105A
Application number: CN202210729479.7A
Authority: CN
Inventors: 邹复民; 吴松洋; 蔡祈钦; 罗旭; 郭峰; 罗思杰; 罗永煜; 陈灏彬; 田俊山; 吴金山; 陈子瑜; 黄世彬; 于翔; 王浩琳; 许根; 任强; 林子杨
Original assignee: Fujian University of Technology
Current assignee: Fujian University of Technology
Priority date: 2023-02-21
Filing date: 2023-02-21
Publication date: 2023-04-07

Abstract

The invention discloses an ETC big data-based highway section mileage measurement method, which comprises the following steps of: acquiring the geographic position coordinates of a starting point and a finishing point of a highway section, and acquiring the total mileage of the section by using a map API (application programming interface); acquiring the running time of the vehicle on the whole road section, and calculating the average running speed of the vehicle on the whole road section by combining the mileage of the whole road section; dividing the whole road section into a plurality of sections, respectively acquiring the residence time of the vehicle in each section, and calculating the section driving mileage by taking the average speed of the whole road section as the driving speed of each section; and constructing a section mileage generation model according to the driving mileage of different sections of the vehicle. The model constructed by the method realizes the measurement of the mileage of the highway section only by relying on the ETC big data, has small measurement error and stable performance, and is beneficial to the fine management of the highway and the improvement of the application value of the ETC big data.

Description

A highway section mileage measurement method based on ETC big data

Technical Field

The present invention relates to the technical field of highway management, and in particular to a highway section mileage measurement method based on ETC big data.

Background Art

By the end of 2020, the total mileage of China's expressways had reached 161,000 kilometers, ranking first in the world. To further improve operational efficiency, by the end of 2019 China's expressway electronic toll collection (ETC) system had been networked across 29 provinces, with 24,588 ETC gantry systems built, 48,211 ETC lanes renovated, and more than 200 million ETC users nationwide.

To charge ETC network tolls correctly and reasonably, China's tolling rules for newly built expressways require that the mileage of a new road section be measured on site before ETC network tolling begins, so that the actual toll mileage can be determined and the toll calculated from it. However, given China's vast and densely interconnected expressway network, traditional measurement consumes a great deal of manpower, material, and money, and on-site surveying also carries safety hazards.

Summary of the Invention

The purpose of the present invention is to provide a highway section mileage measurement method based on ETC big data that greatly reduces the cost of manual surveying and mapping and improves safety.

The technical solution adopted by the present invention is:

A highway section mileage measurement method based on ETC big data, comprising the following steps:

Step 1, obtain the geographic coordinates of the start point and end point of the highway road section, and use a map API to obtain the total mileage of the road section;

Step 2, obtain the travel time of each vehicle over the entire road section, and calculate the vehicle's average speed over the entire road section;

Step 3, obtain the dwell time of the vehicle in each section, and calculate the section mileage by taking the average speed over the road section as the speed in each section;

Specifically, the mileage of the entire road section is first obtained as in Step 1; the travel time of each vehicle over the entire road section is then obtained from the vehicle data, and the average speed follows from the two.

The dwell time of the vehicle in each section is then obtained and multiplied by that average speed, which gives the mileage of each section.

Step 4, construct the section mileage generation model from the section mileages of the vehicles as follows:

$$\Delta D \sim N(\mu, \Gamma_{\Delta d})$$

where $N$ denotes the normal distribution and $\Delta D$ is a Gaussian random vector with mean vector $\mu = [\mu_1, \mu_2, \ldots, \mu_{n-1}]$, in which $\mu_j$ is the mean mileage of the $j$-th section, calculated from the mileages of the vehicles passing through it. The covariance matrix is

$$\Gamma_{\Delta d} = \operatorname{diag}\left(\frac{\sigma_1^2}{m}, \frac{\sigma_2^2}{m}, \ldots, \frac{\sigma_{n-1}^2}{m}\right)$$

where $\sigma_j^2$ is the variance of the $j$-th section and $m$ is the number of vehicles passing through the section. That is, the mileage of each vehicle in each section is calculated from the passing vehicles and fed into the model to obtain the mileage of each section and of the entire road section.

Furthermore, the map API in Step 1 is the Amap API.

Furthermore, in Step 2 the average speed of each vehicle is calculated with the average speed formula:

$$v^{j} = \frac{d}{t^{j}}$$

where $d$ is the total mileage of the road section and $t^{j}$ is the travel time of the $j$-th vehicle over the road section.
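The average speed calculation of Step 2 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `average_speeds`, the units (kilometers and hours), and the sample values are assumptions made for the example.

```python
def average_speeds(total_mileage_km, travel_times_h):
    """Return the average speed d / t (km/h) of each vehicle over the road section."""
    if any(t <= 0 for t in travel_times_h):
        raise ValueError("travel times must be positive")
    return [total_mileage_km / t for t in travel_times_h]


# Hypothetical road section of 30 km and three vehicles (travel times in hours).
speeds = average_speeds(30.0, [0.25, 0.5, 0.3])
print(speeds)
```

Each vehicle keeps its own speed, which is what allows the later steps to convert that vehicle's dwell time in a section into a mileage.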

Furthermore, in Step 3 a box-plot-based noise data cleaning method is used when obtaining the dwell time and mileage of the vehicle in each section.

Furthermore, the specific steps of Step 3 are as follows:

Step 301, select a road section LD whose traffic is in free flow and whose sections all have the same number of lanes; if the traffic condition does not qualify, select a time period in which traffic is in free flow, such as the small hours of the night; if the number of lanes differs between sections, split the road section into several road sections and process them independently;

Step 302, because different vehicles and different driving habits lead to different speeds, the method works with the section time ratio $r$, defined as the ratio of a vehicle's dwell time in a section to its travel time over the entire road section:

$$r = \frac{\Delta t}{\Delta t_j}$$

where $\Delta t$ is the dwell time in the section and $\Delta t_j$ is the dwell time over the entire road section;

Step 303, normalize the section dwell times to obtain the time ratio matrix $R$ of the $m$ vehicles over the entire road section:

$$R = \begin{bmatrix} r_1^1 & r_2^1 & \cdots & r_{n-1}^1 \\ \vdots & \vdots & \ddots & \vdots \\ r_1^m & r_2^m & \cdots & r_{n-1}^m \end{bmatrix}$$

where $r_j^i$ is the section time ratio of the $i$-th vehicle in section $j$, $m$ is the number of vehicles, and the subscript indexes the sections;
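Steps 302 and 303 can be sketched in a few lines. The sketch assumes that a vehicle's travel time over the road section equals the sum of its section dwell times, so that each row of the ratio matrix sums to 1; the function name `time_ratio_matrix` and the sample dwell times are hypothetical.

```python
def time_ratio_matrix(dwell_times):
    """dwell_times[i][j] is the dwell time of vehicle i in section j.

    Returns the matrix of section time ratios, one row per vehicle.
    Assumption: the whole-road travel time is the sum of the section dwell times.
    """
    rows = []
    for vehicle_times in dwell_times:
        total = sum(vehicle_times)
        if total <= 0:
            raise ValueError("total travel time must be positive")
        rows.append([t / total for t in vehicle_times])
    return rows


# Hypothetical dwell times in seconds for two vehicles over three sections.
dwell = [[600.0, 900.0, 300.0], [500.0, 1000.0, 500.0]]
R = time_ratio_matrix(dwell)
print(R)
```

Working with ratios rather than raw times removes the effect of each driver's overall speed, so vehicles become comparable section by section.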

Step 304, use a box plot to clean the noise data produced by vehicles that pass through a service area, obtaining the valid vehicle subset M'' of the road section; the valid subset is the set of vehicles actually used for the road section;

Step 305, obtain the section mileage ΔD by multiplying the speed of each vehicle by its dwell time in each section, where V = diag(v¹, v², …, v^n-1) represents the speed of each vehicle in the subset, Δd with superscript m and subscript n-1 represents the mileage of the m-th vehicle in the (n-1)-th section, and m is the number of vehicles in the valid vehicle subset M'';
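The mileage calculation of Step 305 can be illustrated as below. It is a sketch under the same assumption as before (whole-road travel time equals the sum of the section dwell times); `section_mileages` and the numbers are hypothetical, not taken from the patent.

```python
def section_mileages(total_mileage, dwell_times):
    """Return the mileage of each vehicle in each section.

    For vehicle i: speed = total_mileage / travel_time, and the mileage in
    section j is speed * dwell_times[i][j].
    """
    result = []
    for vehicle_times in dwell_times:
        travel_time = sum(vehicle_times)  # assumed whole-road travel time
        speed = total_mileage / travel_time
        result.append([speed * t for t in vehicle_times])
    return result


# Hypothetical 40 km road section, dwell times in minutes.
mileages = section_mileages(40.0, [[10.0, 30.0], [16.0, 24.0]])
print(mileages)
```

By construction each row sums to the total mileage of the road section, so the method redistributes a known total across sections in proportion to dwell time.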

Furthermore, the specific steps of Step 304 are as follows:

Step 3041, a vehicle passing a service area either stays in it or does not, which yields different section time ratios. The section time ratio of a vehicle that does not stay in the service area is:

$$r = \frac{\Delta t_1}{\Delta t}$$

where $\Delta t_1$ is the dwell time in the section without a stop and $\Delta t$ is the dwell time over the entire road section containing that section;

For a vehicle that stays in the service area, the section time ratio is formed from the section dwell time $\Delta t_1 + \Delta t_s$, where $\Delta t_1$ is the dwell time of the vehicle in the section excluding the time in the service area, and $\Delta t_s$ is the time spent in the service area;

Step 3042, because a stay in the service area makes the section time ratio deviate from the true overall distribution, the data must be cleaned with a box plot. The box plot quantities needed are: Q1, the first quartile; Q2, the second quartile, also called the median; Q3, the third quartile; IQR = Q3 - Q1, the interquartile range; the lower limit Lower = Q1 - 1.5 × IQR; and the upper limit Upper = Q3 + 1.5 × IQR. Outliers are noise points whose values are greater than Q3 + 1.5 × IQR or less than Q1 - 1.5 × IQR, also called outlier points or abnormal points;
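The box plot limits of Step 3042 can be computed as in the following sketch. The source does not state which quartile convention is used, so the choice of `statistics.quantiles` with `method="inclusive"` is an assumption, as are the function names and the sample ratios.

```python
from statistics import quantiles


def boxplot_bounds(values, k=1.5):
    """Return (Lower, Upper) = (Q1 - k*IQR, Q3 + k*IQR) for a list of values."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outlier_flags(values, k=1.5):
    """Flag every value that lies outside the box plot limits."""
    lower, upper = boxplot_bounds(values, k)
    return [v < lower or v > upper for v in values]


# Hypothetical section time ratios; the last vehicle stopped at the service area.
ratios = [0.20, 0.21, 0.22, 0.23, 0.24, 0.60]
print(boxplot_bounds(ratios))
print(outlier_flags(ratios))
```

A vehicle that stops inflates its time ratio in the service area section, so it shows up above the upper limit and is flagged.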

Step 3043, construct the subset $I_j$ of vehicles that stop at the service area from the box plot results:

$$I_j = \{\, i \mid r_j^i > Q3_j + 1.5\,IQR_j \ \text{or}\ r_j^i < Q1_j - 1.5\,IQR_j \,\}, \quad j \in J$$

where $J = \{j\}$ ($j \in [1, n-1]$) is the set of sections containing a service area; $r_j^i$ is the section time ratio of the $i$-th vehicle in section $j$; $Q3_j$ is the third quartile of the time ratio of section $j$ in the box plot; $IQR_j$ is the interquartile range of the time ratio of section $j$ in the box plot; and $Q1_j$ is the first quartile of the time ratio of section $j$ in the box plot;

Step 3044, construct the vehicle subset $M'$ of the road section:

$$M' = M - \bigcup_{j \in J} I_j$$

where $M$ is the original vehicle set and $M'$ is the vehicle set after the vehicles that stop at a service area have been removed;

Step 3045, because traffic conditions are complex and changeable and are affected by traffic flow, road maintenance, and emergencies, $M'$ must be cleaned further by constructing the subset $I'_j$ of abnormal vehicles in each section:

$$I'_j = \{\, i \in M' \mid r'^{\,i}_j > Q3'_j + 1.5\,IQR'_j \ \text{or}\ r'^{\,i}_j < Q1'_j - 1.5\,IQR'_j \,\}$$

where $r'^{\,i}_j$ is the section time ratio of an abnormal vehicle; $Q3'_j$ is the third quartile in the box plot; $IQR'_j$ is the interquartile range in the box plot; and $Q1'_j$ is the first quartile in the box plot;

Step 3046, the valid vehicle subset $M''$ of the road section is then obtained:

$$M'' = M' - \bigcup_{j} I'_j$$

The valid subset is the set of vehicles actually used for the road section.
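The two-stage cleaning of Steps 3043 to 3046 can be sketched as follows. This is an illustrative reading of the steps, not the patented code: it assumes both box plot limits are applied in each stage, uses the "inclusive" quartile convention, and the names `outlier_indices`, `clean_vehicles`, and the sample data are hypothetical.

```python
from statistics import quantiles


def outlier_indices(values, k=1.5):
    """Return the positions of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return {i for i, v in enumerate(values) if v < lower or v > upper}


def clean_vehicles(ratios, service_sections):
    """ratios maps vehicle id -> list of section time ratios.

    Stage 1 removes outliers in the service area sections (giving M'),
    stage 2 recomputes the box plots on M' for every section (giving M'').
    """
    n_sections = len(next(iter(ratios.values())))

    def flagged(vehicle_ids, sections):
        ids = sorted(vehicle_ids)
        bad = set()
        for j in sections:
            column = [ratios[v][j] for v in ids]
            bad |= {ids[i] for i in outlier_indices(column)}
        return bad

    m_all = set(ratios)
    m_prime = m_all - flagged(m_all, service_sections)
    m_double_prime = m_prime - flagged(m_prime, range(n_sections))
    return m_prime, m_double_prime


# Hypothetical data: two sections, section 1 has a service area, vehicle 7 stopped.
ratios = {
    0: [0.50, 0.50], 1: [0.51, 0.49], 2: [0.49, 0.51], 3: [0.50, 0.50],
    4: [0.51, 0.49], 5: [0.49, 0.51], 6: [0.50, 0.50], 7: [0.20, 0.80],
}
m_prime, m_valid = clean_vehicles(ratios, service_sections=[1])
print(sorted(m_prime), sorted(m_valid))
```

Recomputing the quartiles on the reduced set in the second stage matters because the vehicles removed in the first stage would otherwise widen the box plot limits.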

Furthermore, the gantry section mileage generation model in Step 4 is constructed using the law of large numbers as follows:

Step 401, from the section mileages obtained in Step 3, the mileages of the individual vehicles in a given section can be regarded as mutually independent and identically distributed, with mathematical expectation:

$$E(\Delta d_j^i) = \mu_j$$

where $\Delta d_j^i$ is the mileage of the $i$-th vehicle in section $j$ and $\mu_j$ is the mathematical expectation of the mileage of section $j$;

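Step 402, by the law of large numbers, the sequence

$$\frac{1}{m}\sum_{i=1}^{m}\Delta d_j^i$$

converges in probability to $\mu_j$, that is, for any $\varepsilon > 0$,

$$\lim_{m\to\infty} P\left(\left|\frac{1}{m}\sum_{i=1}^{m}\Delta d_j^i - \mu_j\right| < \varepsilon\right) = 1$$

Let $\Delta d_j^i$ have variance

$$D(\Delta d_j^i) = \sigma_j^2 > 0$$

By the central limit theorem, the standardized variable $Y_m$ of the sum $\sum_{i=1}^{m}\Delta d_j^i$ is:

$$Y_m = \frac{\sum_{i=1}^{m}\Delta d_j^i - m\mu_j}{\sqrt{m}\,\sigma_j}$$

and the distribution function $F_m(x)$ of $Y_m$ satisfies, for any $x$:

$$\lim_{m\to\infty}F_m(x) = \lim_{m\to\infty}P(Y_m \le x) = \int_{-\infty}^{x}\frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}\,dt = \Phi(x)$$

where $F_m(x)$ is the distribution function of $Y_m$ and $\Phi(x)$ is the standard normal distribution function;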
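The convergence stated in Step 402 can be checked numerically. The sketch below is only an illustration of the central limit theorem: it draws synthetic, deliberately skewed (exponential) values instead of real section mileages, and the function names and parameters are assumptions.

```python
import math
import random
from statistics import NormalDist


def standardized_sums(m, trials, mu=1.0, seed=0):
    """Simulate Y_m = (sum - m*mu) / (sqrt(m)*sigma) for exponential samples."""
    rng = random.Random(seed)
    sigma = mu  # for an exponential distribution the std equals the mean
    ys = []
    for _ in range(trials):
        total = sum(rng.expovariate(1.0 / mu) for _ in range(m))
        ys.append((total - m * mu) / (math.sqrt(m) * sigma))
    return ys


def empirical_cdf(ys, x):
    """Fraction of simulated values that are <= x."""
    return sum(y <= x for y in ys) / len(ys)


ys = standardized_sums(m=50, trials=4000)
print(empirical_cdf(ys, 1.0), NormalDist().cdf(1.0))
```

Even though the individual samples are far from Gaussian, the empirical distribution of the standardized sum should sit close to the standard normal distribution function.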

Step 403, when $m$ is large enough ($m \ge 30$), the mean of $\Delta d_j$, suitably standardized, converges in distribution to the normal distribution, so for any section $j$ the mean

$$\overline{\Delta d_j} = \frac{1}{m}\sum_{i=1}^{m}\Delta d_j^i$$

approximately follows the normal distribution with mean $\mu_j$ and variance $\sigma_j^2 / m$. Therefore, the gantry section mileage generation model MGM (Mileage Generation Model) based on ETC big data can be constructed:

$$\Delta D \sim N(\mu, \Gamma_{\Delta d})$$

where $\Delta D$ is the generated section mileage, with mean vector $\mu = [\mu_1, \mu_2, \ldots, \mu_{n-1}]$, in which $\mu_j$ is the mean mileage of the $j$-th section, calculated from the mileages of the vehicles passing through it. The covariance matrix is

$$\Gamma_{\Delta d} = \operatorname{diag}\left(\frac{\sigma_1^2}{m}, \frac{\sigma_2^2}{m}, \ldots, \frac{\sigma_{n-1}^2}{m}\right)$$

where $\sigma_j^2$ is the variance of the $j$-th section and $m$ is the number of vehicles passing through the section.
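Estimating the parameters of the model from cleaned data can be sketched as below. The sketch assumes a diagonal covariance with entries equal to the per-section variance divided by the number of vehicles, and it estimates that variance with the sample variance; `fit_mgm` and the sample mileages are hypothetical.

```python
from statistics import mean, variance


def fit_mgm(mileages):
    """mileages[i][j] is the mileage of vehicle i in section j.

    Returns (mu, gamma_diag): the per-section mean mileage and the diagonal of
    the covariance matrix, i.e. the sample variance of each section divided by m.
    """
    m = len(mileages)
    if m < 2:
        raise ValueError("at least two vehicles are needed to estimate a variance")
    columns = list(zip(*mileages))
    mu = [mean(col) for col in columns]
    gamma_diag = [variance(col) / m for col in columns]
    return mu, gamma_diag


# Hypothetical section mileages in km for three vehicles over two sections.
mu, gamma_diag = fit_mgm([[10.0, 20.0], [12.0, 22.0], [14.0, 24.0]])
print(mu, gamma_diag)
```

The example uses three vehicles only to keep the arithmetic visible; the large-sample condition in the text (m of at least 30) applies to any real use.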

With the above technical solution, the present invention measures section mileage from ETC big data with high accuracy and stable performance. It changes the traditional way of measuring highway mileage: it greatly reduces the cost of manual surveying and mapping, improves safety, and completes the basic information of the ETC system, which supports fine-grained highway management and raises the application value of ETC big data.

Brief Description of the Drawings

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments;

FIG. 1 is a schematic diagram of the structure of the highway section mileage measurement method based on ETC big data according to the present invention;

FIG. 2 is a schematic diagram of the box-plot noise data cleaning model;

FIG. 3 is a schematic diagram of LD₁;

FIG. 4 is a schematic diagram of LD₂;

FIG. 5 compares the distribution of section mileage data of LD₁ before and after data cleaning;

FIG. 6 compares the relative error of section mileage of LD₁ before and after data cleaning;

FIG. 7 compares the distribution of section mileage data of LD₂ before and after data cleaning;

FIG. 8 compares the relative error of section mileage of LD₂ before and after data cleaning;

FIG. 9 is a schematic diagram of the overall mileage error of LD₁;

FIG. 10 is a schematic diagram of the overall mileage error of LD₂;

FIG. 11 is a schematic diagram of the cumulative error probability of the individual sections in LD₁;

FIG. 12 is a schematic diagram of the cumulative error probability of the individual sections in LD₂;

FIG. 13 is a schematic diagram of the effect of different parameters on the model error in LD₁;

FIG. 14 is a schematic diagram of the effect of different parameters on the model error in LD₂.

Detailed Description

To make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings.

As shown in any one of FIGS. 1 to 14, the present invention discloses a highway section mileage measurement method based on ETC big data, comprising the following steps:

Step 1, obtain the geographic coordinates of the start point and end point of the highway road section, and use a map API to obtain the total mileage of the road section;

Step 2, obtain the travel time of each vehicle over the entire road section, and calculate the vehicle's average speed over the entire road section;

Step 3, obtain the dwell time of the vehicle in each section, and calculate the section mileage by taking the average speed over the road section as the speed in each section;

Step 4, construct the section mileage generation model ΔD～N(μ, Γ_Δd) from the section mileages of the vehicles.

The sub-steps of Step 3 (Steps 301 to 305 and Steps 3041 to 3046) and of Step 4 (Steps 401 to 403), together with the definitions of their symbols, are as set out in the Summary of the Invention.

The specific principle of the present invention is described in detail below:

Assumption 1: the geographic coordinates (latitude and longitude) of the start node and end node of the road section LD are known. If a road section LD does not satisfy Assumption 1, nodes with well-defined geographic coordinates, such as toll station entrances and exits, can be chosen as its start and end nodes. The assumption is therefore reasonable.

Therefore, from the geographic coordinates of the start and end nodes of LD, the total mileage $d$ of LD can be obtained through the Amap driving route planning API, and the average speed $v$ of a vehicle over the entire road section is:

$$v = \frac{d}{t}$$

where $t$ is the travel time of the vehicle over the entire road section.

However, factors such as the number of lanes, emergencies, and driving behavior inevitably affect vehicle speed. To obtain good results and meet the requirements of the MGM model, the following constraints are imposed on the road section under study:

Constraint 1: L1 = L2 = … = Ln-1, where Lj is the number of lanes in section j;

Constraint 2: the traffic on the selected road section LD is in free flow.

If a road section does not satisfy Constraint 1, it can be split into several road sections that are processed independently; if the traffic on a road section does not satisfy Constraint 2, a time period in which traffic is in free flow, such as the small hours of the night, can be selected. The constraints are therefore reasonable.

Assumption 2: all vehicles travel under ideal traffic conditions, without interfering with or influencing one another, and at free-flow speed.

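According to this assumption, the mileages $\Delta d_j^1, \Delta d_j^2, \ldots, \Delta d_j^m$ of the vehicles in section $j$ can be regarded as mutually independent and identically distributed, with mathematical expectation

$$E(\Delta d_j^i) = \mu_j$$

By the law of large numbers, the sequence

$$\frac{1}{m}\sum_{i=1}^{m}\Delta d_j^i$$

converges in probability to $\mu_j$, that is, for any $\varepsilon > 0$,

$$\lim_{m\to\infty} P\left(\left|\frac{1}{m}\sum_{i=1}^{m}\Delta d_j^i - \mu_j\right| < \varepsilon\right) = 1$$

Let $\Delta d_j^i$ have variance

$$D(\Delta d_j^i) = \sigma_j^2 > 0$$

By the central limit theorem, the standardized variable $Y_m$ of the sum $\sum_{i=1}^{m}\Delta d_j^i$ is:

$$Y_m = \frac{\sum_{i=1}^{m}\Delta d_j^i - m\mu_j}{\sqrt{m}\,\sigma_j}$$

From the above, when $m$ is large enough (a large sample, $m \ge 30$, is usually required), the mean of $\Delta d_j$, suitably standardized, converges in distribution to the normal distribution, so for any section $j$ the mean

$$\overline{\Delta d_j} = \frac{1}{m}\sum_{i=1}^{m}\Delta d_j^i$$

approximately follows the normal distribution with mean $\mu_j$ and variance $\sigma_j^2 / m$. Therefore, the gantry section mileage generation model MGM (Mileage Generation Model) based on ETC big data can be constructed:

$$\Delta D \sim N(\mu, \Gamma_{\Delta d})$$

where $\Delta D$ is a Gaussian random vector with mean vector $\mu = [\mu_1, \mu_2, \ldots, \mu_{n-1}]$ and covariance matrix

$$\Gamma_{\Delta d} = \operatorname{diag}\left(\frac{\sigma_1^2}{m}, \frac{\sigma_2^2}{m}, \ldots, \frac{\sigma_{n-1}^2}{m}\right)$$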

Table 1: Description of some fields of ETC transaction data

Table 1 describes some fields of the ETC transaction data. After the data cleaning and the generation model had been validated, the distribution characteristics of the ETC transaction data of LD₁ and LD₂ were analyzed; the distributions of section mileage before and after cleaning are compared in Figures 5 and 6. Before cleaning, although the mileage distribution already has the rough shape of a Gaussian, a large amount of noise data remains and skews it. Specifically, QD1, QD2, and QD3 are sections without a service area: many of their mileage data points lie to the left of the true mileage, so the fitted curve (top) has a long left tail and the distribution is negatively skewed. QD4 is a service area section: many of its mileage data points lie to the right of the true mileage, so the fitted curve (top) has a long right tail and the distribution is positively skewed. The outlier mileage data are then cleaned with the ODC algorithm. As Figures 5 and 6 show, the coordinate range of every section shrinks to a narrow interval compared with the range before cleaning, indicating that after cleaning all mileage data lie near the true mileage. The fitted curves after cleaning (right side) also show that the mileage data of every section follow a Gaussian distribution well, indicating that most of the abnormal mileage data have been removed.

The mileage data before and after cleaning are then used to estimate the mileage of each section, with MRE as the evaluation metric. Specifically, 100 groups are sampled at random from the LD₁ and LD₂ data sets, with sample sizes from 1 to 200 in each group, to study how MRE fluctuates and evolves; the results are shown in Figures 7 and 8. In every section, the smaller the sample size, the more violently MRE fluctuates. Before cleaning, the MRE of every section fluctuates over a wide range; sections without a service area converge to negative values and service area sections converge to positive values, which matches the distribution characteristics before cleaning and further reflects the skewness of the data. After cleaning, the MRE of every section fluctuates only within a small range and converges quickly to 0 after a slight oscillation. In addition, the mileage deviation after final convergence shows that, for sections with and without a service area alike, the longer the section, the larger the deviation. After cleaning, 1033 complete trajectories were obtained for LD₁ and 1302 for LD₂, and the data distribution characteristics and MRE of each section before and after cleaning were compared and analyzed.

Given adequate data quality, the sample size directly determines the size of the generation model error. To study the effect of sample size on the error, MAE is used as the evaluation metric. Specifically, samples of size 1 to 1000 are drawn at random from the LD₁ and LD₂ data sets, and the sampling is repeated 100 times to study how MAE fluctuates and evolves. As shown in Figures 9 and 11, MAE falls quickly as the sample size grows, its fluctuation range narrows, and it eventually stabilizes. According to the sample size requirement in Section 2, m = 30 is a watershed. When m < 30, MAE fluctuates violently and over a wide range, with some errors exceeding 200 m, which shows that the generation model performs poorly with small samples. When m > 30, MAE fluctuates within a small range, and its mean varies smoothly and stays below 50 m, which shows that the generation model performs much better with large samples and further confirms the large-sample requirement stated in Section 2.
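The resampling experiment described above can be sketched as follows for a single section. The sketch runs on synthetic Gaussian mileages because the ETC data set is not part of this document; the function name `mae_by_sample_size`, the true mileage of 5000 m, and the spread of 300 m are all assumptions.

```python
import random
from statistics import mean


def mae_by_sample_size(mileages, true_mileage, sizes, repeats=100, seed=0):
    """For each sample size, repeat random sampling and average the absolute
    error of the estimated (mean) section mileage against the true mileage."""
    rng = random.Random(seed)
    result = {}
    for m in sizes:
        errors = []
        for _ in range(repeats):
            sample = rng.sample(mileages, m)
            errors.append(abs(mean(sample) - true_mileage))
        result[m] = mean(errors)
    return result


# Synthetic stand-in for cleaned section mileages (meters).
data_rng = random.Random(1)
synthetic = [data_rng.gauss(5000.0, 300.0) for _ in range(2000)]
print(mae_by_sample_size(synthetic, 5000.0, sizes=[5, 30, 100, 500]))
```

On such data the averaged error shrinks roughly with the square root of the sample size, which is the behavior the text reports for the real sections.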

Each section of the road section is then tested separately many times, and the overall probability distribution of MAE is compiled. Let δ = 100 m and α = 0.02. Because the standard deviation σ is unknown, 10,000 groups of 30 samples each are drawn at random from the data set, the standard deviation of each section is calculated, and the maximum is taken, which gives σ₁ ≈ [379, 86, 279, 348] for the sections of LD₁ (unit: m, the same below) and σ₂ ≈ [115, 315, 152, 370] for the sections of LD₂. Therefore σ_max is set to 379 m for LD₁ and 370 m for LD₂. By Corollary 1, the required sample sizes for LD₁ and LD₂ are at least 78 and 75 respectively, so m1 = 78 and m2 = 75 are used. To approximate the true error probability distribution of each section, 10,000 groups of experiments (the same below) are run on each section and the cumulative distribution function (CDF) curve of the error is drawn. As shown in Figures 10 and 12, every section keeps the error within 100 m with a confidence probability of 100%, which is consistently above the preset value of 98%. At a confidence probability of 98%, every section of LD₁ keeps the error within 73 m and every section of LD₂ keeps it within 66 m, which shows that the generation model performs well. In particular, at the same sample size, QD2 of LD₁ and QD1 of LD₂ perform better still, keeping the error within 30 m and 32 m respectively with a confidence probability of 100%; further study shows that both sections are short, which is why the generation model performs better on them.
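Corollary 1 is not reproduced in this document. The sketch below therefore assumes it takes the standard normal-approximation form, in which the sample size must be at least (z·σ/δ)² with z the 1 − α/2 quantile of the standard normal distribution, and it applies the large-sample floor of 30 mentioned in the text. The function name `required_sample_size` is hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma, delta, alpha, minimum=30):
    """Smallest m with z_{1-alpha/2} * sigma / sqrt(m) <= delta, floored at `minimum`.

    Assumed form of the sample size rule; the patent's Corollary 1 is not shown.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return max(minimum, math.ceil((z * sigma / delta) ** 2))


# sigma_max values quoted in the text, with delta = 100 m and alpha = 0.02.
print(required_sample_size(379, 100, 0.02))
print(required_sample_size(370, 100, 0.02))
```

Under this assumed form the two σ_max values give 78 and 75, matching the sample sizes quoted in the text; because the required size scales with (σ/δ)², halving the error bound roughly quadruples the number of vehicles needed.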

To further study how sample size affects the generation model error under different parameters, the following experiments are carried out with the control variable method:

To study the effect of σ, α and δ are held fixed. Let α = 0.02 and δ = 100 m, and let the standard deviations of the two road sections be σ₁ and σ₂; the required sample sizes for LD₁ and LD₂ are then [78, 30, 42, 66] and [30, 53, 30, 75] respectively, and the corresponding CDF curves are shown in Figures 13(a) and (c). Under the different values of σ, the confidence probabilities that the error of LD₁ and LD₂ stays within 100 m are [100%, 100%, 99.6%, 99.9%] and [100%, 99.8%, 100%, 100%] respectively, all consistently above the preset value of 98%. At a confidence probability of 98%, LD₁ and LD₂ keep the error within [64.0, 28.8, 82.9, 77.6] and [31.9, 75.6, 54.3, 62.9] respectively, all well below the preset value of 100 m.

To study the effect of δ, α and σ are held fixed. Let α = 0.02, σ = σ_max, and δ = [50, 100, 150, 200]; the required sample sizes for LD₁ and LD₂ are then [315, 78, 35, 30] and [297, 75, 33, 30] respectively. The larger δ is, the smaller the required sample size, until for δ > 150 the large-sample requirement keeps the required sample size essentially unchanged. With these parameter settings, the CDF curves of the two road sections are shown in Figure 13(b) and Figure 14(e). As the sample size grows, the error at a given confidence probability falls. At a confidence probability of 98%, the errors of LD₁ and LD₂ are [25.7, 41.3, 58.0, 62.0] and [22.1, 36.9, 54.3, 56.1] respectively, all well below the corresponding preset values, and the confidence probability at the preset value δ reaches 100% in every case. Moreover, as δ decreases, the required sample size grows quadratically and the probability distribution curves become progressively steeper, which indicates that δ has a large effect on the generation model error.

To study the effect of the parameter α, δ and σ are held fixed. Let δ = 100 m, σ = σ_max, and α = [0.02, 0.04, 0.06, 0.08, 0.10]. The sample sizes required for LD₁ and LD₂ are then [78, 61, 51, 44, 39] and [75, 58, 48, 42, 38], respectively; their CDF curves are shown in Figure 13(c) and Figure 14(f). At the set confidence probability (1-α), the errors for LD₁ and LD₂ fall mostly within 33 to 44 m and 28 to 38 m, respectively, all well below the set threshold of 100 m. Further analysis shows that the required sample sizes span only a narrow range across the different values of α, so that the probability distribution curves are nearly identical, which indicates that the parameter α has little influence on the error of the generation model.
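The dependence of the required sample size on α, σ and δ described above can be illustrated with a short sketch. It assumes the usual normal-theory bound m ≥ (z_{α/2}·σ/δ)² together with a large-sample floor of 30; the exact formula and rounding rule used in the experiments are not stated in this passage, so the function name `required_sample_size`, the round-up rule and the value `SIGMA_DEMO` are illustrative assumptions rather than values taken from the experiments.

```python
import math
from statistics import NormalDist

# Large-sample floor; claim 8 of this document fixes it at m >= 30.
MIN_LARGE_SAMPLE = 30


def required_sample_size(alpha: float, sigma: float, delta: float) -> int:
    """Return the number of vehicles needed so that the sample mean of the
    section mileage lies within +/- delta of the true mean with probability
    1 - alpha, under the normal approximation (assumed bound)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if sigma < 0.0 or delta <= 0.0:
        raise ValueError("sigma must be non-negative and delta positive")
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    m = math.ceil((z * sigma / delta) ** 2)      # rounding up is an assumption
    return max(MIN_LARGE_SAMPLE, m)


# Hypothetical standard deviation in metres, chosen only for illustration.
SIGMA_DEMO = 380.0
sizes_by_delta = {
    delta: required_sample_size(0.02, SIGMA_DEMO, delta)
    for delta in (50, 100, 150, 200)
}
```

Under this assumed bound, halving δ roughly quadruples the required sample size, whereas changing α between 0.02 and 0.10 only moves the critical value z from about 2.33 to about 1.64, which is in line with the strong effect of δ and the weak effect of α reported above.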

The present invention adopts the above technical scheme to construct a highway section mileage generation model based on ETC transaction data and derives the functional relationship between the data sample size and the error of the generation model. ETC transaction data from September 3 to 5, 2020 were used for field verification on two road sections of the Shenhai Expressway (Caopuyuan Hub to Neikeng Hub, and Ganghou Hub to Xipu Hub). Theoretical analysis and the experimental results show that the average error of the section mileage generated by the model meets the 10 m accuracy requirement and that the model is robust and widely applicable. The model constructed by the present invention relies solely on ETC big data to measure highway section mileage, with small measurement error and stable performance, which supports the refined management of highways and increases the application value of ETC big data.

Obviously, the described embodiments are some of the embodiments of the present application rather than all of them. Where no conflict arises, the embodiments in the present application and the features in the embodiments may be combined with each other. The components of the embodiments of the present application that are generally described and shown in the drawings herein may be arranged and designed in a variety of different configurations. Therefore, the detailed description of the embodiments of the present application is not intended to limit the scope of the claimed application, but merely represents selected embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present application without creative effort fall within the scope of protection of the present application.

Claims

1. A highway section mileage measurement method based on ETC big data, characterized in that it includes the following steps:

Step 1: obtaining the geographic coordinates of the starting point and end point of the highway road section, and using a map API to obtain the total mileage of the road section;

Step 2: obtaining the driving time of the vehicle over the entire road section, and calculating the average driving speed of the vehicle over the entire road section from this time and the total mileage of the road section;

Step 3: dividing the entire road section into several sections, obtaining the residence time of the vehicle in each section, and calculating the section mileage by taking the average speed over the entire road section as the driving speed in each section;

Step 4: constructing a section mileage generation model from the mileages of the vehicles in the different sections, as follows:

ΔD ~ N(μ, Γ_Δd)

where ΔD is a Gaussian random vector with mean vector μ = [μ₁, μ₂, …, μ_{n-1}]; μ_n denotes the mean mileage of the n-th section, calculated from the mileage of the passing vehicles; the covariance matrix is

Represents the variance of each segment, m represents the number of vehicles passing through the segment; N represents normal distribution.

2. The highway section mileage measurement method based on ETC big data according to claim 1, characterized in that the map API in step 1 is the Amap API.

3. The highway section mileage measurement method based on ETC big data according to claim 1, characterized in that, in step 2, the average speed of the vehicle is calculated using the average speed formula:

Where d is the total mileage of the road section,

represents the travel time of the jth vehicle on this road section.
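As a minimal sketch of steps 2 and 3, assume that the average speed of a vehicle is the total mileage d divided by its travel time (the usual definition of average speed) and that the travel time equals the sum of the residence times in the sections. The function names and the example figures are hypothetical.

```python
def average_speed(total_mileage_m: float, travel_time_s: float) -> float:
    """Average speed over the whole road section, in metres per second."""
    if travel_time_s <= 0.0:
        raise ValueError("travel time must be positive")
    return total_mileage_m / travel_time_s


def section_mileages(total_mileage_m: float, dwell_times_s: list[float]) -> list[float]:
    """Mileage attributed to each section, taking the whole-road average speed
    as the driving speed in every section (step 3)."""
    travel_time_s = sum(dwell_times_s)  # assumption: sections cover the whole road
    speed = average_speed(total_mileage_m, travel_time_s)
    return [speed * t for t in dwell_times_s]


# Hypothetical vehicle: a 12 km road section crossed in three sections.
example = section_mileages(12_000.0, [120.0, 180.0, 100.0])
```

With these figures the vehicle averages 30 m/s, and the three sections are assigned 3600 m, 5400 m and 3000 m, which add up to the total mileage.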

4. The highway section mileage measurement method based on ETC big data according to claim 3, characterized in that, in step 3, a box-plot-based noise data cleaning method is used to obtain the residence time and mileage of the vehicle in each section, with the following specific steps:

Step 301: select a road section LD under free-flow traffic conditions in which every section has the same number of lanes; if the traffic conditions of the road section do not meet this requirement, select a time period that does, such as the second half of the night; if the number of lanes differs between sections, split the road section into several road sections and process them independently;

Step 302: because different vehicles and different driving habits lead to different driving speeds, define the section time ratio r as the ratio of the dwell time of a vehicle in a given section to its total travel time over the road section:

where Δt is the dwell time in the section and Δt_j is the dwell time over the entire road section;

Step 303: normalize the section residence times to obtain the time ratio R of the m vehicles over the entire road section;

where m denotes the m-th vehicle and n denotes the n-th section;

Step 304: use a box plot to clean the noise data generated by vehicles passing through the service area, obtaining the valid vehicle subset M" of the road section:

The valid subset of vehicles is the set of vehicles on the road segment used;

Step 305, obtaining the segment mileage ΔD:

where V = diag(v¹, v², …, v^{n-1}) represents the speeds of the vehicles in the subset;

represents the mileage of the m-th vehicle in the n-1-th section; m represents the number of vehicles in the valid vehicle subset M".
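The normalization in steps 302 and 303 can be sketched as follows, assuming that each time ratio is the residence time of a vehicle in a section divided by the sum of its residence times over all sections of the road section. The helper name `time_ratio_matrix` and the sample data are hypothetical.

```python
def time_ratio_matrix(dwell_times_s: list[list[float]]) -> list[list[float]]:
    """Row i holds the time ratios of vehicle i; every row sums to 1."""
    ratios = []
    for row in dwell_times_s:
        total = sum(row)
        if total <= 0.0:
            raise ValueError("each vehicle needs a positive total travel time")
        ratios.append([t / total for t in row])
    return ratios


# Two hypothetical vehicles; the second drives twice as fast as the first.
R = time_ratio_matrix([
    [120.0, 180.0, 100.0],
    [60.0, 90.0, 50.0],
])
```

Both rows come out as approximately [0.30, 0.45, 0.25]: dividing by the total travel time removes the effect of the individual driving speed, which is the motivation given in step 302.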

5. The highway section mileage measurement method based on ETC big data according to claim 4, characterized in that the specific steps of step 304 are as follows:

Step 3041: vehicles passing a service area are divided into those that stop in the service area and those that do not, and the section time ratio is obtained for each case. The time ratio of a vehicle that does not stop in the service area is:

where Δt₁ is the residence time in the section when the vehicle does not stop, and Δt is the residence time over the entire road section that contains the section;

The time ratio of a vehicle that stops in the service area is:

where Δt₁ is the residence time of the vehicle in the section, excluding the time spent in the service area, and Δt_s is the residence time in the service area;

Step 3042: obtain the data needed for the box plot: Q1 is the first quartile; Q2 is the second quartile, also known as the median; Q3 is the third quartile; IQR = Q3 - Q1 is the interquartile range; Q1 - 1.5×IQR is the lower limit (Lower) and Q3 + 1.5×IQR is the upper limit (Upper); values greater than Q3 + 1.5×IQR or less than Q1 - 1.5×IQR are noise points, also known as outliers or abnormal points;

Step 3043: construct the subset I_j of vehicles that stop at the service area according to the results of the box plot:

where J = {j} (j ∈ [1, n-1]) is the set of sections in which a service area is located;

is the time ratio of the i-th vehicle in section j;

is the third quartile of the time proportion of segment j in the box plot;

is the interquartile range of the time proportion of segment j in the box plot;

is the first quartile of the time proportion of segment j in the box plot;

Step 3044, construct a vehicle subset M′ of the road section that does not contain vehicles stopping at the service area:

Among them, M is the original vehicle set, and M′ is the vehicle set after eliminating the vehicles that stopped at the service area;

Step 3045: M′ is further cleaned to identify the vehicles affected by traffic abnormalities, and the abnormal vehicle subset I_j′ of the section is constructed:

where

is the time ratio of an abnormal vehicle in the section;

is the third quartile in the box plot;

is the interquartile range in the box plot;

is the first quartile in the box plot;

Step 3046, further obtaining a valid vehicle subset M" of the road section:

The valid subset of vehicles is the set of vehicles on the road segment used.
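The box-plot rule of step 3042 can be sketched as below, assuming that a vehicle is discarded when its time ratio in any inspected section lies outside [Q1 - 1.5×IQR, Q3 + 1.5×IQR]. The quartile convention (Python's "inclusive" method), the function names and the sample data are assumptions made for illustration.

```python
from statistics import quantiles


def box_plot_bounds(values: list[float]) -> tuple[float, float]:
    """Lower and upper box-plot limits, Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def remove_outliers(ratios: dict[str, list[float]], sections: list[int]) -> dict[str, list[float]]:
    """Drop every vehicle whose time ratio is an outlier in any listed section."""
    outliers: set[str] = set()
    for j in sections:
        lower, upper = box_plot_bounds([row[j] for row in ratios.values()])
        outliers |= {vid for vid, row in ratios.items() if not lower <= row[j] <= upper}
    return {vid: row for vid, row in ratios.items() if vid not in outliers}


# Hypothetical time ratios for two sections; vehicle "f" stopped in section 0.
M = {
    "a": [0.29, 0.71], "b": [0.30, 0.70], "c": [0.30, 0.70],
    "d": [0.30, 0.70], "e": [0.31, 0.69], "f": [0.60, 0.40],
}
M_prime = remove_outliers(M, sections=[0])  # section 0 holds the service area
```

In this example only vehicle "f" is removed. The same function could be applied a second time to the remaining vehicles over all sections to mirror the second cleaning pass of step 3045, although very small or nearly identical samples can make the interquartile range collapse and the rule overly strict.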

6. A highway section mileage measurement method based on ETC big data according to claim 5, characterized in that: the traffic abnormality factors in step 3045 include traffic flow, road maintenance and emergencies.

7. The highway section mileage measurement method based on ETC big data according to claim 1, characterized in that, in step 4, the law of large numbers is used to construct the gantry section mileage generation model:

Step 401: from the section mileage obtained in step 3, the mileages of the vehicles in the different sections are mutually independent and identically distributed, with mathematical expectation:

where

Step 402: according to the law of large numbers, the sequence

converges in probability to μ_j, that is

Let

… have variance

By the central limit theorem, the sum of the Δd_j

has the standardized variable Y_m:

where the distribution function F_m(x) of Y_m satisfies, for any x:

where F_m(x) is the distribution function and Φ(x) is the standard normal distribution function;

Step 403: when m is not less than the minimum value, the suitably standardized mean of Δd_j converges to the normal distribution, so that for any

the mean

approximately follows a normal distribution with mean μ_j and variance

from which the gantry section mileage generation model MGM based on ETC big data is constructed:

ΔD ~ N(μ, Γ_Δd).

8. A highway section mileage measurement method based on ETC big data according to claim 7, characterized in that: the minimum value in step 403 is 30, that is, m≥30.
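A minimal sketch of how the parameters of such a model could be estimated from cleaned data is given below. It assumes, following the central limit theorem, that the mean mileage of section j is approximately normal with mean μ_j and variance σ_j²/m, that the covariance matrix is diagonal, and that σ_j² is estimated by the sample variance; the function name and the synthetic data are hypothetical.

```python
from statistics import fmean, variance

MIN_SAMPLE = 30  # minimum number of vehicles, m >= 30 as in claim 8


def mileage_model(section_mileages: list[list[float]]) -> tuple[list[float], list[float]]:
    """Return (mean vector, diagonal of the covariance matrix of the means).

    section_mileages[i][j] is the mileage of vehicle i in section j.
    """
    m = len(section_mileages)
    if m < MIN_SAMPLE:
        raise ValueError(f"at least {MIN_SAMPLE} vehicles are required, got {m}")
    columns = list(zip(*section_mileages))           # one tuple per section
    mu = [fmean(col) for col in columns]             # estimated mean mileage
    gamma_diag = [variance(col) / m for col in columns]  # sigma_j^2 / m
    return mu, gamma_diag


# Synthetic data: 30 vehicles, two sections, mileages scattered around 3000 m and 5000 m.
data = [[3000.0 + k, 5000.0 - k] for k in range(-15, 15)]
mu, gamma_diag = mileage_model(data)
```

For this synthetic input the estimated means are 2999.5 m and 5000.5 m, and each diagonal entry is the sample variance 77.5 divided by 30. Dividing by m is what makes the spread of the estimated mileage shrink as more vehicles are observed.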