CN108804533B - Heterogeneous big data information filtering method and device - Google Patents

Info

Publication number
CN108804533B
Authority
CN
China
Prior art keywords
data
error rate
standard
standard data
heterogeneous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810419694.0A
Other languages
Chinese (zh)
Other versions
CN108804533A (en)
Inventor
欧阳永中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan University
Original Assignee
Foshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan University
Priority to CN201810419694.0A
Publication of CN108804533A
Application granted
Publication of CN108804533B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for filtering heterogeneous big data information, addressing the incompatibility problems that often arise when operating on heterogeneous data, that is, data whose data structure types differ from one another. The method reads heterogeneous big data and splits it according to its data structure to obtain standard data, calculates the error rate of the standard data, deletes abnormal data whose error rate is greater than the error threshold to obtain filtered data, and sorts the filtered data by error rate. A unified heterogeneous data processing method can thus be applied to different data structures, which normalizes the data, reduces the error rate of the heterogeneous data, and improves its logical compatibility.

Figure 201810419694

Description

Heterogeneous big data information filtering method and device
Technical Field
The disclosure relates to the field of data information fusion processing, in particular to a method and a device for filtering heterogeneous big data information.
Background
With the development of internet technology, big data information is applied more and more widely, and operations on heterogeneous data often run into incompatibility problems. Heterogeneous data has data structure types that differ from one another. For heterogeneous data, separate processing code is usually written for each data structure, so that logic boundaries become less and less clear, calls between the data structures become disordered, and a large number of logic errors easily result.
Disclosure of Invention
To address the defects of the prior art, the present disclosure provides a method and an apparatus for filtering heterogeneous big data information. The method comprises the following steps:
step 1, reading heterogeneous big data and splitting according to a data structure to obtain standard data;
step 2, calculating the error rate of the standard data;
step 3, deleting abnormal data with the error rate larger than the error threshold value in the standard data to obtain filtered data;
step 4, sorting the filtered data according to the error rate;
step 5, deleting 10% of data at the head and the tail in the sorted queue to obtain a filtering result;
step 6, outputting the filtering result.
Further, in step 1, the data structure of the heterogeneous big data at least includes an array, a queue, a hash table, and a tree.
Further, in step 1, the step of obtaining the standard data by splitting according to the data structure includes the following sub-steps:
step 1.1, inputting according to the data structure type of the heterogeneous big data;
step 1.2, reading and splitting the data into metadata with keywords according to the data structure type;
step 1.3, combining metadata according to the same keywords to obtain standard data;
the standard data includes at least a data magnitude.
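Sub-steps 1.1 to 1.3 can be illustrated with a minimal Python sketch. The source does not specify how each structure is traversed or how the data magnitude is derived, so the following are assumptions made only for illustration: arrays and queues hold (keyword, value) records, a hash table is a `dict`, a tree is a nested `dict`, and the data magnitude of a keyword is the sum of its merged values. The function names `split_to_metadata` and `combine_by_keyword` are hypothetical.

```python
from collections import defaultdict, deque


def split_to_metadata(source):
    """Step 1.2 sketch: flatten one container into (keyword, value) pairs."""
    if isinstance(source, dict):
        pairs = []
        for key, value in source.items():
            if isinstance(value, dict):
                # A nested dict stands in for a tree: recurse and prefix the key path.
                pairs.extend((f"{key}.{k}", v) for k, v in split_to_metadata(value))
            else:
                pairs.append((str(key), value))
        return pairs
    if isinstance(source, (list, tuple, deque)):
        # ASSUMPTION: arrays and queues hold (keyword, value) records.
        return [(str(k), v) for k, v in source]
    raise TypeError(f"unsupported structure: {type(source).__name__}")


def combine_by_keyword(sources):
    """Step 1.3 sketch: merge metadata sharing a keyword into standard data."""
    merged = defaultdict(list)
    for source in sources:
        for keyword, value in split_to_metadata(source):
            merged[keyword].append(value)
    # ASSUMPTION: the data magnitude is the sum of the merged values.
    return {k: {"values": v, "magnitude": sum(v)} for k, v in merged.items()}
```

For instance, an array, a queue, and a hash table containing a small tree can be passed together as `combine_by_keyword([[("a", 1), ("b", 2)], deque([("a", 3)]), {"b": 4, "c": {"d": 5}}])`, which yields one standard-data entry per keyword.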
Further, in step 2, the sub-step of calculating the error rate of the standard data is:
step 2.1, let $x_1, x_2, x_3, \dots, x_n$ be the data magnitudes of n standard data; the arithmetic mean $X'$ is then
$$X' = \frac{1}{n}\sum_{i=1}^{n} x_i$$
step 2.2, the error rate s of the standard data, computed using the arithmetic mean $X'$, is given by:
Figure BDA0001650385710000021
where n is a positive integer greater than or equal to 0 whose value range is not limited, i ranges from 1 to n, and $x_i$ is the data magnitude of the standard data.
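Step 2 can be sketched in Python as follows. The arithmetic mean follows directly from its definition. The error-rate formula itself survives only as an image reference in this text, so the per-item formula used here, the relative absolute deviation from the mean, is an assumption chosen for illustration and may differ from the patented formula. The function names are hypothetical.

```python
def arithmetic_mean(magnitudes):
    """Step 2.1: arithmetic mean X' of the data magnitudes."""
    if not magnitudes:
        raise ValueError("at least one data magnitude is required")
    return sum(magnitudes) / len(magnitudes)


def error_rates(magnitudes):
    """Step 2.2 sketch: one error rate per standard datum."""
    mean = arithmetic_mean(magnitudes)
    if mean == 0:
        raise ValueError("relative deviation is undefined for a zero mean")
    # ASSUMPTION: error rate = |x_i - X'| / |X'|; the source formula is not
    # recoverable from the text and may be defined differently.
    return [abs(x - mean) / abs(mean) for x in magnitudes]
```

Under this assumption, magnitudes of 10, 20, and 30 have a mean of 20, so the outer two items deviate by half the mean and the middle item not at all.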
Further, in step 3, the error threshold is defined as follows: let $S_1, S_2, S_3, \dots, S_n$ be the error rates of n standard data; the error threshold $S'$ is then
Figure BDA0001650385710000022
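Step 3 can be sketched as below. The threshold formula is available only as an image reference here, so this sketch assumes, purely for illustration, that the threshold $S'$ is the arithmetic mean of the error rates $S_1, \dots, S_n$. The function name `filter_by_threshold` is hypothetical.

```python
def filter_by_threshold(items, rates):
    """Step 3 sketch: drop items whose error rate exceeds the threshold."""
    if len(items) != len(rates):
        raise ValueError("items and rates must have the same length")
    if not rates:
        return []
    # ASSUMPTION: the threshold S' is the mean of all error rates; the
    # source formula is not recoverable from the text.
    threshold = sum(rates) / len(rates)
    # Keep (item, rate) pairs so later steps can sort by the error rate.
    return [(item, rate) for item, rate in zip(items, rates) if rate <= threshold]
```

With error rates 0.1, 0.2, 0.3, and 0.8 the assumed threshold is about 0.35, so only the last item is treated as abnormal and deleted.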
Further, in step 4, the methods for sorting by error rate include at least bubble sort, insertion sort, and simple selection sort.
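Of the sorting methods named for step 4, insertion sort is shown here as a minimal Python sketch. The record layout, (item, error rate) pairs, and the ascending order are assumptions; the source does not state the sort direction. The function name `insertion_sort_by_rate` is hypothetical.

```python
def insertion_sort_by_rate(records):
    """Step 4 sketch: insertion sort of (item, error_rate) pairs, ascending."""
    ordered = list(records)  # work on a copy; the input is left untouched
    for i in range(1, len(ordered)):
        current = ordered[i]
        j = i - 1
        # Shift larger error rates one slot right until current fits.
        while j >= 0 and ordered[j][1] > current[1]:
            ordered[j + 1] = ordered[j]
            j -= 1
        ordered[j + 1] = current
    return ordered
```

Because only strictly larger rates are shifted, records with equal error rates keep their original relative order.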
The invention also provides a device for filtering heterogeneous big data information, which comprises:
a splitting unit for reading the heterogeneous big data and splitting it according to the data structure to obtain standard data;
an error rate calculation unit for calculating the error rate of the standard data;
an exception handling unit for deleting abnormal data whose error rate is greater than the error threshold from the standard data to obtain filtered data;
a sorting unit for sorting the filtered data according to the error rate;
a head and tail removing unit for deleting 10% of the data at the head and the tail of the sorted queue to obtain a filtering result;
and an output unit for outputting the filtering result.
The beneficial effects of the present disclosure are as follows: by applying a unified heterogeneous data processing method to different data structures, the disclosed method and device normalize the data, reduce the error rate of the heterogeneous data, and improve its logical compatibility.
Drawings
The foregoing and other features of the present disclosure will become more apparent from the detailed description of the embodiments given in conjunction with the drawings, in which like reference characters designate the same or similar elements throughout the several views. The drawings described below are merely some examples of the present disclosure, and those skilled in the art may derive other drawings from them without inventive effort. In the drawings:
fig. 1 is a flowchart illustrating a method for filtering heterogeneous big data information according to the present disclosure;
fig. 2 illustrates a heterogeneous big data information filtering apparatus according to an embodiment of the present disclosure.
Detailed Description
The conception, specific structure and technical effects of the present disclosure will be clearly and completely described below in conjunction with the embodiments and the accompanying drawings to fully understand the objects, aspects and effects of the present disclosure. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Fig. 1 is a flowchart illustrating a method for filtering heterogeneous big data information according to the present disclosure, and the method for filtering heterogeneous big data information according to an embodiment of the present disclosure is described below with reference to fig. 1.
The disclosure provides a method for filtering heterogeneous big data information, which specifically comprises the following steps:
step 1, reading heterogeneous big data and splitting according to a data structure to obtain standard data;
step 2, calculating the error rate of the standard data;
step 3, deleting abnormal data with the error rate larger than the error threshold value in the standard data to obtain filtered data;
step 4, sorting the filtered data according to the error rate;
step 5, deleting 10% of data at the head and the tail in the sorted queue to obtain a filtering result;
step 6, outputting the filtering result.
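Step 5 can be sketched as a slice over the sorted queue. The source does not say whether 10% is removed from each end or in total, nor how fractional counts are rounded; this sketch assumes 10% from each end, rounded down. The function name `trim_head_and_tail` and its `percent` parameter are hypothetical.

```python
def trim_head_and_tail(sorted_records, percent=10):
    """Step 5 sketch: drop `percent`% of records from each end of the queue."""
    if not 0 <= percent < 50:
        raise ValueError("percent must be in the range [0, 50)")
    # ASSUMPTION: the cut is taken per end and rounded down, using integer
    # arithmetic to avoid floating-point surprises.
    cut = len(sorted_records) * percent // 100
    return sorted_records[cut:len(sorted_records) - cut]
```

For a sorted queue of 20 records this removes 2 from the head and 2 from the tail, leaving 16. Queues with fewer than 10 records are returned unchanged under the round-down assumption.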
Further, in step 1, the data structure of the heterogeneous big data at least includes an array, a queue, a hash table, and a tree.
Further, in step 1, the step of obtaining the standard data by splitting according to the data structure includes the following sub-steps:
step 1.1, inputting according to the data structure type of the heterogeneous big data;
step 1.2, reading and splitting the data into metadata with keywords according to the data structure type;
step 1.3, combining metadata according to the same keywords to obtain standard data;
the standard data includes at least a data magnitude.
Further, in step 2, the sub-step of calculating the error rate of the standard data is:
step 2.1, let $x_1, x_2, x_3, \dots, x_n$ be the data magnitudes of n standard data; the arithmetic mean $X'$ is then
$$X' = \frac{1}{n}\sum_{i=1}^{n} x_i$$
step 2.2, the error rate s of the standard data, computed using the arithmetic mean $X'$, is given by:
Figure BDA0001650385710000032
where n is a positive integer greater than or equal to 0 whose value range is not limited, i ranges from 1 to n, and $x_i$ is the data magnitude of the standard data.
Further, in step 3, the error threshold is defined as follows: let $S_1, S_2, S_3, \dots, S_n$ be the error rates of n standard data; the error threshold $S'$ is then
Figure BDA0001650385710000033
Further, in step 4, the methods for sorting by error rate include at least bubble sort, insertion sort, and simple selection sort.
The present invention also provides a device for filtering heterogeneous big data information. As shown in fig. 2, the device includes:
a splitting unit for reading the heterogeneous big data and splitting it according to the data structure to obtain standard data;
an error rate calculation unit for calculating the error rate of the standard data;
an exception handling unit for deleting abnormal data whose error rate is greater than the error threshold from the standard data to obtain filtered data;
a sorting unit for sorting the filtered data according to the error rate;
a head and tail removing unit for deleting 10% of the data at the head and the tail of the sorted queue to obtain a filtering result;
and an output unit for outputting the filtering result.
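The six units of the device can be pictured as interchangeable stages chained in order. The sketch below is one possible software arrangement, not the structure shown in fig. 2: the class name `FilteringDevice`, its field names, and the placeholder units passed in at the bottom are all hypothetical and stand in for real implementations of each unit.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Scored = List[Tuple[str, float]]  # (keyword, error rate) pairs


@dataclass
class FilteringDevice:
    """Chains the six units in the order the device description lists them."""
    splitting_unit: Callable[[Any], Dict[str, float]]       # raw input -> keyword: magnitude
    error_rate_unit: Callable[[Dict[str, float]], Scored]   # magnitudes -> scored records
    exception_unit: Callable[[Scored], Scored]               # drops abnormal records
    sorting_unit: Callable[[Scored], Scored]                 # orders by error rate
    trimming_unit: Callable[[Scored], Scored]                # removes head and tail
    output_unit: Callable[[Scored], Any]                     # emits the filtering result

    def run(self, raw: Any) -> Any:
        standard = self.splitting_unit(raw)
        scored = self.error_rate_unit(standard)
        kept = self.exception_unit(scored)
        ordered = self.sorting_unit(kept)
        trimmed = self.trimming_unit(ordered)
        return self.output_unit(trimmed)


# Placeholder units for illustration only; none reflects the patented formulas.
device = FilteringDevice(
    splitting_unit=dict,
    error_rate_unit=lambda d: [(k, abs(v)) for k, v in d.items()],
    exception_unit=lambda s: [r for r in s if r[1] <= 5],
    sorting_unit=lambda s: sorted(s, key=lambda r: r[1]),
    trimming_unit=lambda s: s,
    output_unit=list,
)
```

Keeping each unit behind a plain callable means a different sorting method or threshold rule can be swapped in without touching the other stages.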
The heterogeneous big data information filtering device can run on computing equipment such as desktop computers, notebooks, palmtop computers, and cloud servers. The equipment on which the device runs may comprise a processor and a memory. Those skilled in the art will appreciate that this is only an example of a filtering device for heterogeneous big data information and does not limit it; the device may include more or fewer components, combine some components, or use different components. For example, it may further include input and output devices, network access devices, a bus, and the like. The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the equipment running the filtering device and connects the various parts of that equipment through various interfaces and lines.
The memory can be used to store computer programs and/or modules. The processor implements the various functions of the filtering device for heterogeneous big data information by running or executing the computer programs and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function or an image playing function); the data storage area may store data created according to the use of the cellular phone (such as audio data or a phonebook). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage device.
While the present disclosure has been described in considerable detail and with particular reference to a few illustrative embodiments, it is not intended to be limited to any such details or embodiments. It is to be construed with reference to the appended claims, giving those claims a broad interpretation in view of the prior art so as to effectively cover the intended scope of the disclosure. Furthermore, the foregoing describes the disclosure in terms of embodiments foreseen by the inventor for which an enabling description was available; insubstantial modifications of the disclosure that are not presently foreseen may nonetheless represent equivalents.

Claims (5)

1. A method for filtering heterogeneous big data information, characterized in that the filtering method comprises the following steps:
step 1, reading heterogeneous big data and splitting it according to the data structure to obtain standard data;
step 2, calculating the error rate of the standard data;
step 3, deleting abnormal data whose error rate is greater than the error threshold from the standard data to obtain filtered data;
step 4, sorting the filtered data according to the error rate;
step 5, deleting the first and last 10% of the data in the sorted queue to obtain a filtering result;
step 6, outputting the filtering result;
in step 2, the sub-steps of calculating the error rate of the standard data are:
step 2.1, let $x_1, x_2, x_3, \dots, x_n$ be the data magnitudes of n standard data; the arithmetic mean $X'$ is then
$$X' = \frac{1}{n}\sum_{i=1}^{n} x_i$$
step 2.2, the error rate s of the standard data, computed using the arithmetic mean $X'$, is given by:
Figure FDA0003263487430000012
where n is a positive integer greater than or equal to 0 whose value range is not limited, i ranges from 1 to n, and $x_i$ is the data magnitude of the standard data;
in step 1, the step of splitting according to the data structure to obtain standard data comprises the following sub-steps:
step 1.1, inputting according to the data structure type of the heterogeneous big data;
step 1.2, reading and splitting the data into metadata with keywords according to the data structure type;
step 1.3, combining metadata with the same keywords to obtain standard data;
the standard data includes at least a data magnitude.
2. The method for filtering heterogeneous big data information according to claim 1, characterized in that, in step 1, the data structure of the heterogeneous big data includes at least an array, a queue, a hash table, and a tree.
3. The method for filtering heterogeneous big data information according to claim 1, characterized in that, in step 3, the error threshold is defined as follows: let $S_1, S_2, S_3, \dots, S_n$ be the error rates of n standard data; the error threshold $S'$ is then
Figure FDA0003263487430000021
4. The method for filtering heterogeneous big data information according to claim 1, characterized in that, in step 4, the methods for sorting by error rate include at least bubble sort, insertion sort, and simple selection sort.
5. A device for filtering heterogeneous big data information, characterized in that the device comprises:
a splitting unit for reading heterogeneous big data and splitting it according to the data structure to obtain standard data;
an error rate calculation unit for calculating the error rate of the standard data;
an exception handling unit for deleting abnormal data whose error rate is greater than the error threshold from the standard data to obtain filtered data;
a sorting unit for sorting the filtered data according to the error rate;
a head and tail removing unit for deleting the first and last 10% of the data in the sorted queue to obtain a filtering result;
an output unit for outputting the filtering result;
the step of splitting according to the data structure to obtain standard data comprises the following sub-steps:
step 1.1, inputting according to the data structure type of the heterogeneous big data;
step 1.2, reading and splitting the data into metadata with keywords according to the data structure type;
step 1.3, combining metadata with the same keywords to obtain standard data;
the standard data includes at least a data magnitude;
in step 2, the sub-steps of calculating the error rate of the standard data are:
step 2.1, let $x_1, x_2, x_3, \dots, x_n$ be the data magnitudes of n standard data; the arithmetic mean $X'$ is then
$$X' = \frac{1}{n}\sum_{i=1}^{n} x_i$$
step 2.2, the error rate s of the standard data, computed using the arithmetic mean $X'$, is given by:
Figure FDA0003263487430000023
where n is a positive integer greater than or equal to 0 whose value range is not limited, i ranges from 1 to n, and $x_i$ is the data magnitude of the standard data.
CN201810419694.0A 2018-05-04 2018-05-04 Heterogeneous big data information filtering method and device Active CN108804533B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810419694.0A CN108804533B (en) 2018-05-04 2018-05-04 Heterogeneous big data information filtering method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810419694.0A CN108804533B (en) 2018-05-04 2018-05-04 Heterogeneous big data information filtering method and device

Publications (2)

Publication Number Publication Date
CN108804533A CN108804533A (en) 2018-11-13
CN108804533B true CN108804533B (en) 2021-11-30

Family

ID=64093243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810419694.0A Active CN108804533B (en) 2018-05-04 2018-05-04 Heterogeneous big data information filtering method and device

Country Status (1)

Country Link
CN (1) CN108804533B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502662A (en) * 2019-08-23 2019-11-26 南京信易达计算技术有限公司 A kind of Heterogeneous Data Processing system and method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102135979A (en) * 2010-12-08 2011-07-27 华为技术有限公司 Data cleaning method and device
CN103049652A (en) * 2012-12-13 2013-04-17 北京交通大学 Method and system for checking station operating standard time and capabilities
CN104933444A (en) * 2015-06-26 2015-09-23 南京邮电大学 Design method of multi-dimension attribute data oriented multi-layered clustering fusion mechanism
CN106203832A (en) * 2016-07-12 2016-12-07 亿米特(上海)信息科技有限公司 Intelligent electricity anti-theft analyzes system and the method for analysis
CN106250444A (en) * 2016-07-27 2016-12-21 北京集奥聚合科技有限公司 The real-time Input System of a kind of heterogeneous data source and method
CN106776778A (en) * 2016-11-24 2017-05-31 太极计算机股份有限公司 The association process method of urban transportation data based on map and beat of data conveniently

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7647291B2 (en) * 2003-12-30 2010-01-12 Microsoft Corporation B-tree compression using normalized index keys
US7516128B2 (en) * 2006-11-14 2009-04-07 International Business Machines Corporation Method for cleansing sequence-based data at query time
CN101594201B (en) * 2009-05-20 2012-05-23 清华大学 Method of Integrating Error Data Filtering in Chained Queue Management Structure
WO2013187963A2 (en) * 2012-03-30 2013-12-19 The University Of North Carolina At Chapel Hill Methods, systems, and computer readable media for rapid filtering of opaque data traffic
CN103617352A (en) * 2013-11-20 2014-03-05 宁波保税区攀峒信息科技有限公司 Breadth-first detection method and device for mistaken genetic relationship rings
CN104135521B (en) * 2014-07-29 2018-06-05 广东省环境监测中心 The data outliers identification method and system of environment automatic monitoring network
US10579589B2 (en) * 2014-11-06 2020-03-03 Sap Se Data filtering
CN106599561A (en) * 2016-12-06 2017-04-26 北京中交兴路信息科技有限公司 Trajectory data cleaning method and device

Also Published As

Publication number Publication date
CN108804533A (en) 2018-11-13

Similar Documents

Publication Publication Date Title
US20170201431A1 (en) Systems and Methods for Rule Quality Estimation
US10664443B2 (en) Method and apparatus for presenting to-be-cleaned data, and electronic device
US8533847B2 (en) Apparatus and method for screening new data without impacting download speed
TWI634421B (en) Electronic apparatus for data access and data access method therefor
CN110147203A (en) A file management method, device, electronic device and storage medium
CN110490724A (en) The storage method and device of account data
CN114374392B (en) Data compression storage method and device, terminal equipment and readable storage medium
CN115617799A (en) Method, device, equipment and storage medium for data storage
CN108804533B (en) Heterogeneous big data information filtering method and device
CN115293993B (en) Sampling method and system for filtering salt and pepper noise
JP6783325B2 (en) File save method and electronic devices
CN108616603B (en) A method and system for synchronizing internal and external network data
CN112379835B (en) OOB area data extraction method, terminal device and storage medium
CN111966472B (en) A process scheduling method and system for an industrial real-time operating system
CN112380174B (en) XFS file system analysis method, terminal device and storage medium with deleted files
CN111125715A (en) TCG data processing acceleration method and device based on solid state disk, computer equipment and storage medium
CN105447043A (en) Database and data access method thereof
CN112036133B (en) File storage method and device, electronic equipment and storage medium
CN111723266B (en) Mass data processing method and device
CN115512194A (en) Image target detection method, device and terminal equipment
CN108459928B (en) Related data association visualization method, terminal device and storage medium
CN115758366B (en) Virus scanning progress calculation method, device, electronic device and storage medium
CN111241099A (en) Industrial big data storage method and device
CN111599226A (en) A virtual pottery teaching method and system
TWI875358B (en) Semiconductor memory device and data storage method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 528000 Foshan Institute of science and technology, Xianxi reservoir West Road, Shishan town, Nanhai District, Foshan City, Guangdong Province

Patentee after: Foshan University

Country or region after: China

Address before: 528000 Foshan Institute of science and technology, Xianxi reservoir West Road, Shishan town, Nanhai District, Foshan City, Guangdong Province

Patentee before: FOSHAN University

Country or region before: China
