CN110471917A - An intelligent customs declaration form filling method based on historical data mining - Google Patents

Info

Publication number
CN110471917A
Authority
CN
China
Prior art keywords
field
data
fields
value
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910617724.3A
Other languages
Chinese (zh)
Other versions
CN110471917B (en)
Inventor
林友芳
万怀宇
李金富
王强
王涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University
Priority to CN201910617724.3A
Publication of CN110471917A
Application granted
Publication of CN110471917B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method for intelligently filling in customs declaration forms based on historical data mining. The method comprises: preprocessing the historical data of customs declaration forms by merging the header and body data and removing irrelevant fields; designing and implementing a value-based correlation analysis algorithm for the fields; designing a tree structure, then building and storing a dynamic tree based on the field correlations; intelligently recommending fill-in content from the generated dynamic tree; and periodically and automatically maintaining and updating the data based on newly entered data. The correlation analysis algorithm and the dynamic tree designed in the invention can dynamically fill in the remaining fields from what the user has entered so far, with high accuracy. The method has strong generalization and self-learning ability, can greatly improve the efficiency of customs declaration, and saves manpower and material resources for customs declaration agencies and declaring enterprises.

Description

An Intelligent Customs Declaration Form Filling Method Based on Historical Data Mining

Technical field:

The invention relates to the field of customs declaration information entry, and in particular to a method for intelligently filling in customs declaration forms based on historical data mining.

Background art:

The Customs of the People's Republic of China is the country's authority for supervising and administering entry and exit, and managing declarations of import and export goods is an important and basic part of its daily work. At present, a declaring company mainly collates all the relevant paper documents and then keys them into the relevant customs system. Because there are many input fields and their contents appear only weakly related to one another, data entry is inefficient and error-prone, and wastes manpower and material resources.

Summary of the invention:

The invention proposes a method for intelligently filling in customs declaration forms based on historical data mining. It takes full account of the value-based correlations between the fields of a declaration and provides an intelligent filling strategy that is more efficient and more accurate than traditional, fully manual entry.

The invention provides the following scheme: a method for intelligently filling in customs declaration forms based on historical data mining, comprising the following steps:

S1: Preprocess the historical data of customs declaration forms: merge the header and body data, and remove irrelevant fields.

S1.1: Merge the data using Spark distributed computing

Real customs declaration data is divided into header data and body data, stored in two separate data tables. The header data describes a declaration order, e.g. the port of entry or exit, the declaring unit and the transaction method; the body data describes the individual goods in an order, e.g. the commodity number, commodity name and declared unit price. The invention uses Spark distributed computing to join the two tables with the order number as the primary key, obtaining a data table that contains all fields to be filled in, and stores it in a Hive database.
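The join described in S1.1 can be sketched in a few lines. This is only an illustration in plain Python rather than Spark and Hive; the field names (`order_no`, `port`, `goods_name`, and so on) and the helper `merge_header_body` are assumptions made for the example and do not come from the patent.

```python
def merge_header_body(headers, bodies, key="order_no"):
    """Inner-join header and body records on the order number."""
    header_by_order = {h[key]: h for h in headers}
    merged = []
    for item in bodies:
        header = header_by_order.get(item[key])
        if header is None:
            continue  # a body line without a matching header is skipped
        # One output row per goods line, carrying the header fields with it.
        merged.append({**header, **item})
    return merged


# Hypothetical sample data: two orders, three goods lines.
headers = [
    {"order_no": "D001", "port": "Tianjin", "trade_mode": "FOB"},
    {"order_no": "D002", "port": "Qingdao", "trade_mode": "CIF"},
]
bodies = [
    {"order_no": "D001", "goods_code": "8471", "goods_name": "laptop"},
    {"order_no": "D001", "goods_code": "8517", "goods_name": "phone"},
    {"order_no": "D002", "goods_code": "0901", "goods_name": "coffee"},
]
merged = merge_header_body(headers, bodies)
```

Each merged row repeats the header fields of its order, which is what lets the later steps treat header and body fields uniformly.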

S1.2: Count the null values of each field and remove irrelevant fields.

During customs declaration some fields are optional for the user, so they may be empty. During preprocessing, the ratio of records with a null value in a given field to the total number of records is computed, and the field is removed if the ratio exceeds 90%. Fields with no practical recommendation value, such as times, serial numbers and manual numbers, are removed as well.

The preprocessing above yields a data table of fields that have recommendation value.
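As a minimal sketch of the S1.2 filter, assuming records are held as Python dictionaries and that both `None` and the empty string count as null (the patent does not say how nulls are encoded), the 90% rule could look like this. The function name, the `excluded` parameter and the sample fields are hypothetical.

```python
def drop_sparse_fields(records, threshold=0.9, excluded=()):
    """Drop fields whose null ratio exceeds the threshold, plus excluded fields."""
    total = len(records)
    fields = sorted({name for r in records for name in r})
    keep = []
    for name in fields:
        if name in excluded:
            continue  # e.g. times, serial numbers, manual numbers
        nulls = sum(1 for r in records if r.get(name) in (None, ""))
        if total and nulls / total > threshold:
            continue  # more than 90% of the records leave this field empty
        keep.append(name)
    return [{name: r.get(name) for name in keep} for r in records]


# Hypothetical data: "remark" is empty in 19 of 20 records (95%).
records = [
    {"port": "Tianjin", "remark": None, "entry_time": "2016-01-01"}
    for _ in range(19)
] + [{"port": "Qingdao", "remark": "fragile", "entry_time": "2016-01-02"}]
cleaned = drop_sparse_fields(records, excluded=("entry_time",))
```

Here `remark` is dropped by the ratio test and `entry_time` by the explicit exclusion list, leaving only `port`.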

S2: Design and implement a value-based correlation analysis algorithm for the fields.

Traditional correlation discovery algorithms rely only on the intrinsic relationships between fields themselves to determine how strongly certain fields are correlated. The invention combines real declaration data with the fields and determines, given that the value of one field is already fixed, how strongly that field correlates with the other fields, thereby simulating the continuous interaction with the user in a real entry scenario.

Correlation is defined as follows: given a field A and its value a, if fixing the value of some field B makes the values of the fields other than A and B that remain to be entered unique, or leaves them with the fewest options, then field B is said to have the greatest correlation with field A.

The algorithm takes as input the historical data set and the user-entered field A with its value a, and outputs the field B with the greatest correlation under the current input. It is implemented as follows:

1. Slice the historical data set by the entered field A and value a to obtain the sub-data set for that specific input;

2. De-duplicate the sub-data set;

3. Compute the correlation between field A and each remaining field to be entered, and select the field B with the greatest correlation with field A;

4. Output field B.
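The four steps above can be sketched as follows. The patent does not give a formula for the correlation, so the score used here is an assumption: for each candidate field B, the average number of distinct combinations of the other fields that remain once B's value is fixed, where a lower score means a stronger correlation. The function name and sample fields are hypothetical.

```python
def most_correlated_field(records, field_a, value_a):
    """Return the field B that leaves the fewest options once A == a is fixed."""
    # Step 1: slice the history down to the records where A == a.
    subset = [r for r in records if r.get(field_a) == value_a]
    # Step 2: de-duplicate while keeping first-seen order.
    unique = [dict(t) for t in dict.fromkeys(tuple(sorted(r.items())) for r in subset)]
    if not unique:
        return None
    candidates = [name for name in unique[0] if name != field_a]

    # Step 3 (assumed score): average number of distinct combinations of
    # the other fields per value of B; 1.0 means B determines everything.
    def avg_options(b):
        rest = [name for name in candidates if name != b]
        groups = {}
        for r in unique:
            groups.setdefault(r[b], set()).add(tuple(r[name] for name in rest))
        return sum(len(g) for g in groups.values()) / len(groups)

    # Step 4: output the field with the lowest score.
    return min(candidates, key=avg_options, default=None)


history = [
    {"agent": "A1", "goods": "tea", "currency": "USD", "port": "Tianjin"},
    {"agent": "A1", "goods": "tea", "currency": "USD", "port": "Tianjin"},  # duplicate
    {"agent": "A1", "goods": "tea", "currency": "USD", "port": "Qingdao"},
    {"agent": "A1", "goods": "silk", "currency": "EUR", "port": "Shanghai"},
    {"agent": "A1", "goods": "toys", "currency": "USD", "port": "Dalian"},
    {"agent": "A2", "goods": "tea", "currency": "JPY", "port": "Tianjin"},
]
best = most_correlated_field(history, "agent", "A1")
```

In this sample, once the agent is `A1`, knowing the port fixes both goods and currency, so `port` is selected; `currency` scores worst because `USD` still leaves three combinations open.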

S3: Design a tree structure, then build and store a dynamic tree based on the field correlations.

The invention implements intelligent declaration filling on the basis of historical data mining. After the value-based field correlations have been discovered in S2, a tree structure is designed and a tree with the historically shortest filling path is built.

S3.1: Tree structure design

The tree structure consists of nodes and edges. Nodes are divided into partition nodes (non-leaf nodes) and recommendation nodes (leaf nodes). A partition node holds the attribute name of a field and the most frequent attribute value of that field; a recommendation node is a Map structure that stores field names and their corresponding attribute values. Nodes are connected by edges, and each edge stores an attribute value of the parent node's field.
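One possible in-memory shape for this structure, offered as a sketch rather than the patent's implementation, uses two small dataclasses. The class and attribute names are assumptions; edges are represented implicitly as dictionary keys holding the parent field's value.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class RecommendNode:
    """Leaf node: maps each remaining field name to its recommended value."""
    values: Dict[str, str] = field(default_factory=dict)


@dataclass
class PartitionNode:
    """Non-leaf node: a field name plus that field's most frequent value."""
    field_name: str
    best_value: str
    # Each edge is keyed by one value of this node's field and leads to a child.
    children: Dict[str, Union["PartitionNode", RecommendNode]] = field(default_factory=dict)


# Hypothetical two-level example: agent -> port -> recommended currency.
leaf = RecommendNode({"currency": "USD"})
root = PartitionNode("agent", "A1", {
    "A1": PartitionNode("port", "Tianjin", {"Tianjin": leaf}),
})
```

Keying `children` by the edge value means that following a user's input is a single dictionary lookup per level.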

S3.2: Dynamic tree generation

First, the field data table preprocessed in S1 is read in. The field of the first-level partition node (the root node) is defined as the declaring agent, and the second- and third-level partition nodes are defined as the business unit and the commodity name respectively. Nodes at adjacent levels are connected through the individual filled-in values of the parent node's field.

For nodes at the fourth level and below, the upper-level node and the attribute value on its edge are taken as input, and the field with the greatest correlation is selected as the lower-level node. This continues until, after some node, the remaining fields each take a unique value or all fields have been entered; a recommendation node (leaf node) is then generated, which stores the remaining fields and their values, or an empty node is generated.

These steps produce a dynamic tree with the historically shortest filling path.
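The generation procedure can be sketched recursively. This is an illustration under assumptions: nodes are plain dictionaries, the three fixed levels are named `agent`, `business_unit` and `goods_name`, and the correlation score is the same assumed "fewest remaining options" measure, since the patent does not specify a formula.

```python
from collections import Counter

FIXED_LEVELS = ["agent", "business_unit", "goods_name"]  # levels 1 to 3 per the text


def pick_field(rows, remaining):
    """Pick the field whose values leave the fewest options for the others."""
    def score(b):
        rest = [name for name in remaining if name != b]
        groups = {}
        for r in rows:
            groups.setdefault(r[b], set()).add(tuple(r[name] for name in rest))
        return sum(len(g) for g in groups.values()) / len(groups)
    return min(remaining, key=score)


def build(rows, remaining, level=1):
    if level <= len(FIXED_LEVELS):
        name = FIXED_LEVELS[level - 1]
    else:
        # Leaf: nothing left to enter, or every remaining field is unique.
        if not remaining or all(len({r[f] for r in rows}) == 1 for f in remaining):
            return {"leaf": True, "values": {f: rows[0][f] for f in remaining}}
        name = pick_field(rows, remaining)
    rest = [f for f in remaining if f != name]
    counts = Counter(r[name] for r in rows)
    node = {"leaf": False, "field": name,
            "best_value": counts.most_common(1)[0][0], "children": {}}
    for value in counts:  # one edge per observed value of this field
        subset = [r for r in rows if r[name] == value]
        node["children"][value] = build(subset, rest, level + 1)
    return node


rows = [
    {"agent": "A1", "business_unit": "U1", "goods_name": "tea", "currency": "USD", "port": "Tianjin"},
    {"agent": "A1", "business_unit": "U1", "goods_name": "tea", "currency": "USD", "port": "Tianjin"},
    {"agent": "A1", "business_unit": "U1", "goods_name": "tea", "currency": "USD", "port": "Qingdao"},
    {"agent": "A1", "business_unit": "U1", "goods_name": "silk", "currency": "EUR", "port": "Shanghai"},
]
tree = build(rows, list(rows[0]))
tea = tree["children"]["A1"]["children"]["U1"]["children"]["tea"]
silk = tree["children"]["A1"]["children"]["U1"]["children"]["silk"]
```

For `tea`, the currency is already unique but the port is not, so a fourth-level partition node on `port` is created; for `silk`, everything is unique and a leaf is produced directly.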

S3.3: Dynamic tree storage

The dynamic tree generated in S3.2 is stored in a MySQL database, with the table structures shown below.

Table 1: Partition node table structure

Field name    Field type    Description
id            int           Auto-increment, primary key
Field_name    String        Name of the field to be filled in
Best_value    String        Most frequent value of the field
Name          String        Node name, unique
Level         int           Level in the tree structure
Agent_code    String        Name of the corresponding declaring agent
Date          Date          Time the record was inserted

Table 2: Leaf node table structure

Field name    Field type    Description
id            int           Auto-increment, primary key
Value         String        Concatenation of field name and field value
Level         int           Level in the tree structure
Agent_code    String        Name of the corresponding declaring agent
Date          Date          Time the record was inserted

Table 3: Edge table structure

Field name    Field type    Description
id            int           Auto-increment, primary key
Father        String        Field name of the parent node
Son           String        Field name of the child node
Value         String        One filled-in value of the parent node
Agent_code    String        Name of the corresponding declaring agent
Date          Date          Time the record was inserted
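The three tables can be tried out with a small schema sketch. To keep it runnable without a database server, the example uses Python's built-in `sqlite3` in place of MySQL; the table names, the mapping of `String` to `TEXT`, and the sample node name are assumptions, not part of the patent.

```python
import sqlite3

# Assumed SQLite rendering of Tables 1 to 3 (the patent targets MySQL).
SCHEMA = """
CREATE TABLE partition_node (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    field_name TEXT, best_value TEXT, name TEXT UNIQUE,
    level INTEGER, agent_code TEXT, date TEXT
);
CREATE TABLE leaf_node (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    value TEXT, level INTEGER, agent_code TEXT, date TEXT
);
CREATE TABLE edge (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    father TEXT, son TEXT, value TEXT, agent_code TEXT, date TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Store one partition node and the edge leading into it (hypothetical values).
conn.execute(
    "INSERT INTO partition_node (field_name, best_value, name, level, agent_code, date) "
    "VALUES (?, ?, ?, ?, ?, date('now'))",
    ("port", "Tianjin", "A1/U1/tea/port", 4, "A1"),
)
conn.execute(
    "INSERT INTO edge (father, son, value, agent_code, date) "
    "VALUES (?, ?, ?, ?, date('now'))",
    ("goods_name", "port", "tea", "A1"),
)
row = conn.execute(
    "SELECT field_name, best_value FROM partition_node WHERE agent_code = ?", ("A1",)
).fetchone()
```

Keeping `agent_code` on every table lets the tree for a single declaring agent be loaded with one filtered query per table.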

S4: Intelligently recommend fill-in content from the generated dynamic tree.

The structures of the dynamic tree stored in the MySQL database are read into memory, and the corresponding dynamic tree is rebuilt there. The tree is first searched by the declaring agent, business unit and commodity name entered by the user; a depth-first search is then performed following the BestValue attribute stored in each node, and all remaining fields and their recommended values are output in order. The user then interacts with the system to modify or enter some of the fields, and the search and recommendation are repeated with the new input values until the result meets the user's expectations.
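The recommendation loop can be sketched as a walk down the tree. This is a simplified, assumed version: nodes are plain dictionaries, the walk follows the user's value where one was entered and the node's best value otherwise, and it descends a single path rather than performing a full depth-first search. The function name and sample tree are hypothetical.

```python
def recommend(node, entered):
    """Follow user input where given, best_value otherwise; collect suggestions."""
    suggestions = {}
    while not node["leaf"]:
        name = node["field"]
        value = entered.get(name, node["best_value"])
        if value not in node["children"]:
            break  # unseen value: no history to recommend from
        if name not in entered:
            suggestions[name] = value
        node = node["children"][value]
    else:
        # Reached a recommendation node: add the remaining fields it stores.
        for name, value in node["values"].items():
            if name not in entered:
                suggestions[name] = value
    return suggestions


tree = {
    "leaf": False, "field": "agent", "best_value": "A1",
    "children": {"A1": {
        "leaf": False, "field": "port", "best_value": "Tianjin",
        "children": {
            "Tianjin": {"leaf": True, "values": {"currency": "USD"}},
            "Qingdao": {"leaf": True, "values": {"currency": "EUR"}},
        },
    }},
}
first = recommend(tree, {"agent": "A1"})
# The user corrects the port, and the recommendation is recomputed.
second = recommend(tree, {"agent": "A1", "port": "Qingdao"})
```

The second call mirrors the interaction described above: a corrected field changes the path taken, and the remaining suggestions are updated accordingly.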

S5: Periodically and automatically maintain and update the data based on newly entered data.

Each user entry is first stored in the database with a timestamp. At a fixed time every week, the new data is then added to the historical data set according to its timestamp, and a new dynamic tree is built from the updated historical data set and stored, so that the data is maintained and updated automatically.
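A minimal sketch of the weekly update, assuming each logged entry carries an `entered_at` timestamp and that `rebuild` is whatever function constructs the dynamic tree (a trivial stand-in is used here), could look like this. All names are hypothetical.

```python
from datetime import datetime, timedelta


def weekly_update(history, entry_log, last_run, now, rebuild):
    """Merge entries logged since the last run, then rebuild the tree."""
    fresh = [e["record"] for e in entry_log if last_run < e["entered_at"] <= now]
    updated = history + fresh
    return updated, rebuild(updated)


now = datetime(2019, 7, 15, 2, 0)
last_run = now - timedelta(weeks=1)
history = [{"agent": "A1", "port": "Tianjin"}]
entry_log = [
    # Logged before the previous run, so assumed to be merged already.
    {"entered_at": datetime(2019, 7, 1, 9, 0), "record": {"agent": "A1", "port": "Dalian"}},
    {"entered_at": datetime(2019, 7, 10, 9, 0), "record": {"agent": "A1", "port": "Qingdao"}},
]
# Stand-in for the real tree builder: it only reports how many rows it saw.
history, tree = weekly_update(
    history, entry_log, last_run, now, rebuild=lambda rows: {"records": len(rows)}
)
```

Filtering on the half-open interval `(last_run, now]` ensures that an entry is merged exactly once across consecutive weekly runs.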

The invention has the following technical effects:

1. The invention can dynamically fill in the remaining fields from what the user has entered so far, with high accuracy;

2. It has strong generalization and self-learning ability, can greatly improve the efficiency of customs declaration, and saves manpower and material resources for customs declaration agencies and declaring enterprises;

3. The invention updates itself automatically at regular intervals.

Description of drawings:

To explain the technical solutions of the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is an implementation flow chart of a method for intelligently filling in customs declaration forms based on historical data mining, provided by an embodiment of the invention.

Fig. 2 is a dynamic tree structure diagram of a method for intelligently filling in customs declaration forms based on historical data mining, provided by an embodiment of the invention.

Detailed description of the embodiments:

The invention provides a method for intelligently filling in customs declaration forms based on mining historical customs declaration data. The overall scheme comprises:

Preprocessing the historical data of customs declaration forms by merging the header and body data and removing irrelevant fields; designing and implementing a value-based correlation analysis algorithm for the fields; designing a tree structure, then building and storing a dynamic tree based on the field correlations; intelligently recommending fill-in content from the generated dynamic tree; and periodically and automatically maintaining and updating the data based on newly entered data. Fig. 1 is the implementation flow chart of the invention.

The correlation analysis algorithm and the dynamic tree designed in the invention can dynamically fill in the remaining fields from what the user has entered, with high accuracy. The method has strong generalization and self-learning ability, can greatly improve the efficiency of customs declaration, and saves manpower and material resources for customs declaration agencies and declaring enterprises.

An embodiment of the invention is described by way of example:

Take as an example a real data set provided by a domestic customs port, which includes all declaration data of that port for 2015 and 2016. For 2015 the header table contains 14,423,930 records and the body table 36,578,318 records; for 2016 the header table contains 14,832,866 records and the body table 38,673,224 records. The data involves more than 7,000 declaring agents. The header data has 43 fields describing a declaration order, e.g. the port of entry or exit, the declaring unit and the transaction method; the body data has 15 fields describing the individual goods in an order, e.g. the commodity number, commodity name and declared unit price.

Step S1: Preprocess the historical data of customs declaration forms: merge the header and body data, and remove irrelevant fields.

S1.1: Merge the data using Spark distributed computing

Using Spark distributed computing, the two tables are joined with the order number as the primary key to obtain a data table that contains all fields to be filled in, which is stored in a Hive database.

S1.2: Count the null values of each field and remove irrelevant fields.

Fields whose null ratio exceeds 90% are identified and removed, together with the time, serial number, manual number and similar fields, yielding a data table of fields that have recommendation value.

Step S2: Design and implement a value-based correlation analysis algorithm for the fields.

Real declaration data is combined with the fields to determine, given that the value of one field is already fixed, how strongly that field correlates with the other fields, thereby simulating the continuous interaction with the user in a real entry scenario.

The field correlation analysis algorithm is implemented as a program that takes a field name A, a field value a and the data set under the current input, and outputs the field B with the greatest correlation for that specific input.

Step S3: Design a tree structure, then build and store a dynamic tree based on the field correlations.

S3.1: Tree structure design

The tree structure consists of nodes and edges. Nodes are divided into partition nodes (non-leaf nodes) and recommendation nodes (leaf nodes). A partition node holds the attribute name of a field and the most frequent attribute value of that field; a recommendation node is a Map structure that stores field names and their corresponding attribute values. Nodes are connected by edges, and each edge stores an attribute value of the parent node's field.

S3.2: Dynamic tree generation

The data table produced by step S1 is read in, and a dynamic tree is built whose first-level partition node is the declaring agent and whose second- and third-level partition nodes are the business unit and the declared commodity. Nodes at adjacent levels are connected through the individual filled-in values of the parent node's field. The partition nodes at the remaining levels are the fields of greatest correlation obtained in step S2, and a recommendation node is created once the remaining entries are unique or empty.

S3.3: Dynamic tree storage

The partition nodes, recommendation nodes and edges of the generated dynamic tree are stored in the MySQL partition node table, leaf node table and edge table respectively.

Step S4: Intelligently recommend fill-in content from the generated dynamic tree.

After the steps above, the dynamic tree obtained through historical data mining and the correlation discovery algorithm is stored in the MySQL database. This step reads the database data into memory and rebuilds the dynamic tree. The tree is first searched by the declaring agent, business unit and commodity name entered by the user; a depth-first search is then performed following the BestValue attribute stored in each node, and all remaining fields and their recommended values are output in order. The user then interacts with the system to modify or enter some of the fields, and the search and recommendation are repeated with the new input values until the result meets the user's expectations.

Step S5: Periodically and automatically maintain and update the data based on newly entered data.

Each user entry is first stored in the database with a timestamp. At a fixed time every week, the new data is then added to the historical data set according to its timestamp, and a new dynamic tree is built from the updated historical data set and stored.

Claims (2)

1.一种基于历史数据挖掘的海关报关单智能填报方法,其特征在于,包括以下步骤:1. A method for intelligently filling customs declaration forms based on historical data mining, characterized in that it comprises the following steps: S1:对海关报关单的历史数据进行预处理,将表头与表体数据合并,去除无关字段:S1: Preprocess the historical data of the customs declaration form, merge the header and body data, and remove irrelevant fields: S1.1:使用Spark分布式计算合并数据S1.1: Use Spark distributed computing to merge data 真实报关单数据分为表头数据及表体数据,分别存储于两个数据表中,其中表头数据描述某一订单信息,如进出口岸、申报单位、成交方式;表体数据描述某一订单中商品具体信息,如商品编号、商品名称、申报单价;本发明使用Spark分布式计算方法,通过订单号作为主键将两表进行连接,得到包含所有填报信息字段的数据表,并存储于Hive数据库中;The real customs declaration data is divided into header data and body data, which are stored in two data tables respectively. The header data describes an order information, such as the import and export port, declaration unit, and transaction method; the body data describes an order Commodity specific information, such as commodity number, commodity name, declared unit price; the present invention uses the Spark distributed computing method to connect the two tables through the order number as the primary key to obtain a data table containing all the information fields to be filled in, and store it in the Hive database middle; S1.2:统计各字段空值情况,去除无关字段S1.2: Count the null values of each field and remove irrelevant fields 通过预处理过程中统计某一字段空值数据条数占总数据条数比率,如大于90%以上则去除该字段;同时将诸如时间、序号、手册号码等无实际推荐价值字段进行去除;During the preprocessing process, count the ratio of the number of empty value data items in a certain field to the total number of data items. 
If it is greater than 90%, remove the field; at the same time, remove the fields that have no actual recommendation value such as time, serial number, and manual number; 通过以上预处理,得到具有推荐价值的字段数据表;Through the above preprocessing, a field data table with recommended value is obtained; S2:设计并实现基于具体值的各字段相关性分析算法S2: Design and implement a correlation analysis algorithm for each field based on specific values 通过将真实报关数据与字段结合,判断在某一字段的值已确定情况下,该字段与其他字段之间的相关性大小,从而模拟实际录入场景中与用户的不断交互过程;By combining real customs declaration data with fields, it is judged that the value of a certain field is determined, and the correlation between the field and other fields is determined, thereby simulating the continuous interaction process with the user in the actual entry scene; 定义相关性为,给定字段A及其值a,当某字段B的值确定后,使得除A、B以外其他需要录入字段的取值唯一或选择项最少,则称字段A与字段B之间相关性最大;The definition of correlation is, given field A and its value a, when the value of a certain field B is determined, so that the values of fields other than A and B that need to be entered are unique or the options are the fewest, it is called the relationship between field A and field B. the greatest correlation between 算法输入为历史数据集、用户录入字段A及字段值a,输出为当前录入情况下相关性最大字段B,算法实现如下:The input of the algorithm is the historical data set, user input field A and field value a, and the output is the most relevant field B under the current input situation. The algorithm is implemented as follows: 1、根据录入字段A及字段值a对历史数据集进行切分,得到该特定录入情况下子数据集;1. Segment the historical data set according to the input field A and field value a to obtain the sub-data set under the specific input situation; 2、对子数据集进行去重处理;2. Deduplicate the sub-dataset; 3、计算其余需要录入字段与字段A的相关性大小,选择与字段A相关性最大的字段B;3. Calculate the correlation between the remaining fields that need to be entered and field A, and select field B that has the greatest correlation with field A; 4、输出字段B;4. 
Output field B;

S3: Design a tree structure, build a dynamic tree from the field correlations, and store it

After S2 has identified the value-specific correlations between fields, design a tree structure and build a tree with the shortest historical filling path:

S3.1: Tree structure design

The tree consists of nodes and edges. Nodes are either partition nodes or recommendation nodes. A partition node holds the attribute name of one field together with that field's most frequent attribute value. A recommendation node is a Map structure that stores field names and their corresponding attribute values. Nodes are connected by edges, and each edge stores an attribute value of the parent node's field;

S3.2: Dynamic tree generation

First, read in the field data table preprocessed in S1. The partition node field of the first level is defined as the declaring agent (申报行), and those of the second and third levels as the business unit (经营单位) and the commodity name (商品名称) respectively. Nodes on adjacent levels are connected through the individual filled-in values of the parent node's field;

For the fourth and all subsequent levels, the upper-level nodes and the attribute values on their edges are taken as input, and the field with the highest correlation is selected as the next-level node. This continues until, below some node, every remaining entry field takes a single value or all fields have been entered. At that point a recommendation node is generated that stores the remaining entry fields and their values, or an empty node is generated;

The above steps produce a dynamic tree with the shortest historical filling path;

S3.3: Dynamic tree storage

Store the dynamic tree generated in S3.2 in a MySQL database;

S4: Intelligently recommend filling content based on the generated dynamic tree

Read the structures of the dynamic tree stored in the MySQL database into memory and rebuild the corresponding dynamic tree in memory. First, search the dynamic tree by the declaring agent, business unit, and commodity name entered by the user. Then perform a depth-first search guided by the BestValue attribute stored in each node, and output all remaining fields and their recommended values in order. Next, interact with the user, who modifies or enters some of the fields. Finally, repeat the search and recommendation with the new entries until the result meets the user's expectations;

S5: Periodically and automatically maintain and update the data based on newly entered data

First, store every user entry in the database with a timestamp. Then, at a fixed time each week, add the new data to the historical data set according to the timestamps, build a new dynamic tree from the updated historical data set, and store it, so that the data is maintained and updated automatically.

2. The intelligent customs declaration form filling method based on historical data mining according to claim 1, characterized in that the table structures in S3.3 are as shown in the following tables:

Table 1: Partition node table structure

Field name    Field type    Description
id            int           Auto-increment, primary key
Field_name    String        Name of the filled-in field
Best_value    String        Most frequent field value
Name          String        Node name, unique
Level         int           Level in the tree structure
Agent_code    String        Name of the corresponding declaring agent
Date          Date          Insertion time of the record
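As an illustration only, a row of the partition node table in Table 1 could be mirrored in memory roughly as follows. This is a minimal sketch, not the patented implementation: the class name `PartitionNodeRow`, the helper `make_partition_row`, and the sample field `currency` are hypothetical, and the `id` column is assumed to be assigned by the database's auto-increment. The only behavior taken from the text is that `Best_value` holds the most frequent value of the field.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class PartitionNodeRow:
    """One row of the partition node table (columns follow Table 1)."""
    field_name: str   # Field_name: the filled-in field this node splits on
    best_value: str   # Best_value: most frequent value of that field
    name: str         # Name: unique node name
    level: int        # Level: depth of the node in the tree
    agent_code: str   # Agent_code: declaring agent the node belongs to
    date: date        # Date: insertion time of the record


def make_partition_row(records, field_name, level, agent_code, name, today):
    """Build a row whose best_value is the most frequent value of field_name."""
    counts = Counter(r[field_name] for r in records)
    best_value, _ = counts.most_common(1)[0]
    return PartitionNodeRow(field_name, best_value, name, level, agent_code, today)


# Hypothetical history: three past declarations with one field each.
history = [{"currency": "USD"}, {"currency": "EUR"}, {"currency": "USD"}]
row = make_partition_row(history, "currency", 4, "A01", "A01/currency", date(2019, 7, 10))
print(row.best_value)
```

Keeping the most frequent value on the node itself means the recommendation step can pick a default branch without recounting the history.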
Table 2: Leaf node table structure

Field name    Field type    Description
id            int           Auto-increment, primary key
Value         String        Concatenation of field name + field value
Level         int           Level in the tree structure
Agent_code    String        Name of the corresponding declaring agent
Date          Date          Insertion time of the record
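The `Value` column of the leaf node table holds field names concatenated with field values, but the text does not specify the encoding. The sketch below shows one possible round trip between a recommendation node's Map and such a string. The separators `=` and `;`, the function names, and the sample fields are all assumptions made for illustration, and the scheme only works if neither separator occurs inside a name or value.

```python
PAIR_SEP = ";"  # assumed separator between name/value pairs
KV_SEP = "="    # assumed separator between a field name and its value


def encode_leaf(fields: dict) -> str:
    """Concatenate field name + field value pairs into one string."""
    return PAIR_SEP.join(f"{name}{KV_SEP}{value}" for name, value in fields.items())


def decode_leaf(value: str) -> dict:
    """Recover the Map from the stored string; an empty string is an empty node."""
    if not value:
        return {}
    return dict(pair.split(KV_SEP, 1) for pair in value.split(PAIR_SEP))


# Hypothetical remaining fields stored on a recommendation (leaf) node.
leaf = {"trade_mode": "0110", "currency": "USD"}
stored = encode_leaf(leaf)
print(stored)
print(decode_leaf(stored) == leaf)
```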
Table 3: Edge table structure
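To make the construction in S3.2 and the lookup in S4 concrete, the following is a simplified sketch under stated assumptions. The first three levels are fixed as in the claim. For deeper levels, the S2 correlation measure is replaced by a stand-in rule (pick the varying field with the fewest distinct values). The lookup is a single root-to-leaf descent that follows user input where available and `best_value` otherwise. All identifiers and sample records are hypothetical.

```python
from collections import Counter

FIXED = ("agent", "business_unit", "commodity")  # levels 1-3 as named in S3.2


class Node:
    """Partition node: a field, its most frequent value, and outgoing edges."""
    def __init__(self, field, best_value):
        self.field = field
        self.best_value = best_value
        self.children = {}  # edge value -> Node, or dict for a recommendation node


def pick_field(records, remaining):
    # Stand-in for the S2 correlation score (an assumption, not the patented rule):
    # choose the varying field with the fewest distinct values, ties by name.
    varying = [f for f in remaining if len({r[f] for r in records}) > 1]
    return min(varying, key=lambda f: (len({r[f] for r in records}), f))


def build(records, remaining, fixed=FIXED):
    if fixed:
        field, fixed = fixed[0], fixed[1:]
    elif all(len({r[f] for r in records}) == 1 for f in remaining):
        # Remaining fields are single-valued: recommendation node (empty if none left).
        return {f: records[0][f] for f in remaining}
    else:
        field = pick_field(records, remaining)
    rest = [f for f in remaining if f != field]
    counts = Counter(r[field] for r in records)
    node = Node(field, counts.most_common(1)[0][0])
    for value in counts:
        subset = [r for r in records if r[field] == value]
        node.children[value] = build(subset, rest, fixed)
    return node


def recommend(tree, entered):
    """Follow entered values where given, otherwise each node's best_value."""
    out, node = {}, tree
    while isinstance(node, Node):
        value = entered.get(node.field, node.best_value)
        out[node.field] = value
        # An unseen value falls back to the most frequent branch.
        node = node.children.get(value, node.children[node.best_value])
    return {**out, **node, **entered}


# Hypothetical history; "currency" and "trade_mode" are illustrative field names.
records = [
    {"agent": "A01", "business_unit": "U1", "commodity": "tea", "currency": "USD", "trade_mode": "0110"},
    {"agent": "A01", "business_unit": "U1", "commodity": "tea", "currency": "USD", "trade_mode": "0110"},
    {"agent": "A01", "business_unit": "U1", "commodity": "tea", "currency": "EUR", "trade_mode": "0615"},
    {"agent": "A01", "business_unit": "U2", "commodity": "silk", "currency": "USD", "trade_mode": "0110"},
]
tree = build(records, list(records[0]))
print(recommend(tree, {"agent": "A01", "business_unit": "U1", "commodity": "tea"}))
```

Calling `recommend` again with an extra entry such as `"currency": "EUR"` re-runs the descent along a different branch, which corresponds to the modify-and-search-again loop described in S4.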
CN201910617724.3A 2019-07-10 2019-07-10 Intelligent customs declaration and customs clearance filling method based on historical data mining Active CN110471917B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910617724.3A CN110471917B (en) 2019-07-10 2019-07-10 Intelligent customs declaration and customs clearance filling method based on historical data mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910617724.3A CN110471917B (en) 2019-07-10 2019-07-10 Intelligent customs declaration and customs clearance filling method based on historical data mining

Publications (2)

Publication Number Publication Date
CN110471917A true CN110471917A (en) 2019-11-19
CN110471917B CN110471917B (en) 2021-01-15

Family

ID=68507571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910617724.3A Active CN110471917B (en) 2019-07-10 2019-07-10 Intelligent customs declaration and customs clearance filling method based on historical data mining

Country Status (1)

Country Link
CN (1) CN110471917B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111047299A (en) * 2019-12-17 2020-04-21 苏州工业园区报关有限公司 Application of two-dimensional code in customs two-step declaration and customs declaration system
CN111461072A (en) * 2020-05-06 2020-07-28 深圳市慧通关网络科技有限公司 AI (Artificial intelligence) identification importing method for quickly identifying imported form data
CN112633833A (en) * 2020-12-23 2021-04-09 云汉芯城(上海)互联网科技股份有限公司 Data acquisition method and device and computer storage medium
CN113296613A (en) * 2021-03-12 2021-08-24 阿里巴巴新加坡控股有限公司 Customs clearance information processing method and device and electronic equipment
CN114356115A (en) * 2021-12-31 2022-04-15 阿里巴巴(中国)有限公司 Method, electronic device and computer-readable storage medium for intelligently entering forms

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102364512A (en) * 2011-10-11 2012-02-29 中华人民共和国宁波出入境检验检疫局 Method for realizing automatically sampling inspection of computer during product inspection and quarantine
US20130297488A1 (en) * 2012-04-12 2013-11-07 PGB Solutions, Inc. System, method, service and computer readable medium for taking and processing paperless mortgage loan applications
CN103996112A (en) * 2014-04-18 2014-08-20 青岛诚业国际物流有限公司 Custom declaration data process system and method
CN104376068A (en) * 2014-11-07 2015-02-25 北京思特奇信息技术股份有限公司 Data representation system and method based on dynamic report template
CN108197796A (en) * 2017-12-28 2018-06-22 云南路普斯数据科技有限公司 A kind of data management and statistical analysis business intelligence platform
CN109409430A (en) * 2018-10-26 2019-03-01 江苏智通交通科技有限公司 Traffic accident intelligent data analysis and comprehensive application system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
崔建高: "数据智慧:开启智慧海关建设的关键密钥", 《海关与经贸研究》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111047299A (en) * 2019-12-17 2020-04-21 苏州工业园区报关有限公司 Application of two-dimensional code in customs two-step declaration and customs declaration system
CN111461072A (en) * 2020-05-06 2020-07-28 深圳市慧通关网络科技有限公司 AI (Artificial intelligence) identification importing method for quickly identifying imported form data
CN111461072B (en) * 2020-05-06 2023-04-18 深圳市慧通关网络科技有限公司 AI (Artificial intelligence) identification importing method for quickly identifying imported form data
CN112633833A (en) * 2020-12-23 2021-04-09 云汉芯城(上海)互联网科技股份有限公司 Data acquisition method and device and computer storage medium
CN113296613A (en) * 2021-03-12 2021-08-24 阿里巴巴新加坡控股有限公司 Customs clearance information processing method and device and electronic equipment
CN113296613B (en) * 2021-03-12 2026-01-02 阿里巴巴新加坡控股有限公司 Customs declaration information processing methods, devices and electronic equipment
CN114356115A (en) * 2021-12-31 2022-04-15 阿里巴巴(中国)有限公司 Method, electronic device and computer-readable storage medium for intelligently entering forms
CN114356115B (en) * 2021-12-31 2025-09-26 杭州阿里巴巴海外互联网产业有限公司 Intelligent entry form method, electronic device and computer-readable storage medium

Also Published As

Publication number Publication date
CN110471917B (en) 2021-01-15

Similar Documents

Publication Publication Date Title
CN110471917B (en) Intelligent customs declaration and customs clearance filling method based on historical data mining
CN105718565B (en) Construction method and construction device of data warehouse model
JP6609262B2 (en) Mapping of attributes of keyed entities
CN109118296A (en) Movable method for pushing, device and electronic equipment
CN114595294B (en) Data warehouse modeling and extracting method and system
CN113064866A (en) A power business data integration system
CN111159428A (en) Method and device for automatically extracting event relation of knowledge graph in economic field
CN109344153A (en) The processing method and terminal device of business datum
CN107346502A (en) A kind of iteration product marketing forecast method based on big data
CN108090225A (en) Operation method, device, system and the computer readable storage medium of database instance
CN105956723A (en) Logistics information management method based on data mining
CN108711074A (en) Business sorting technique, device, server and readable storage medium storing program for executing
CN115456745A (en) Small and micro enterprise portrait construction method and device
CN110688433B (en) Path-based feature generation method and device
CN105574761A (en) Taxpayer benefit association network parallel generation method based on Spark
CN102214248A (en) Multi-layer frequent pattern discovery algorithm with high space extensibility and high time efficiency for mining mass data
CN106357418A (en) Method and device for extracting features on basis of complex networks
CN106649314A (en) Data query method and device
CN110517009A (en) Real-time common layer building method, device and server
CN116542523A (en) Method and system for generating customs clearance bill information risk rule
CN109241200A (en) power material clustering information processing method and system
Abdulla et al. Measure customer behaviour using C4.5 decision tree map reduce implementation in big data analytics and data visualization
Yao et al. Impact of US-China trade war on the network topology structure of Chinese stock market
Firmansyah Shariah Stock Emitent Efficiency Strategy in Digital Era: Application of DEA Super-Efficiency and Interpretive Structural Modeling
CN117893235B (en) Data analysis method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
OL01 Intention to license declared