CN108804533B

CN108804533B - Heterogeneous big data information filtering method and device

Info

Publication number: CN108804533B
Application number: CN201810419694.0A
Authority: CN
Inventors: 欧阳永中
Original assignee: Foshan University
Current assignee: Foshan University
Priority date: 2018-05-04
Filing date: 2018-05-04
Publication date: 2021-11-30
Anticipated expiration: 2038-05-04
Also published as: CN108804533A

Abstract

The invention discloses a method and a device for filtering heterogeneous big data information, addressing the problem that heterogeneous data are often incompatible with one another in operation. Heterogeneous data have data structures of different types. The invention reads heterogeneous big data and splits it according to the data structure to obtain standard data, calculates the error rate of the standard data, deletes abnormal data whose error rate is greater than the error threshold to obtain filtered data, and sorts the filtered data by error rate. A unified processing method can thus be applied to different data structures, which normalizes the data, reduces the error rate of the heterogeneous data, and improves its logical compatibility.

Description

Heterogeneous big data information filtering method and device

Technical Field

The disclosure relates to the field of data information fusion processing, in particular to a method and a device for filtering heterogeneous big data information.

Background

With the development of internet technology, big data information is applied ever more widely, and operations on heterogeneous data frequently run into incompatibility. Heterogeneous data have data structure types that differ from one another. In practice, separate processing code is often written for each data structure type, so that logic boundaries become increasingly blurred, calls between the data structures become disordered, and large numbers of logic errors readily arise.

Disclosure of Invention

In view of the defects of the prior art, the purpose of the present disclosure is to provide a method and a device for filtering heterogeneous big data information. The method specifically includes the following steps:

step 1, reading heterogeneous big data and splitting according to a data structure to obtain standard data;

step 2, calculating the error rate of the standard data;

step 3, deleting abnormal data with the error rate larger than the error threshold value in the standard data to obtain filtered data;

step 4, sorting the filtered data according to the error rate;

step 5, deleting 10% of data at the head and the tail in the sorted queue to obtain a filtering result;

step 6, outputting the filtering result.
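Step 5 above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patent's implementation: the function name `trim_head_and_tail` is hypothetical, the 10% is assumed to apply to each end of the queue separately, and the count removed per end is assumed to round down, since the text does not say how fractional counts are handled.

```python
def trim_head_and_tail(sorted_items, percent=10):
    """Drop `percent` percent of the items from each end of a sorted list.

    Assumptions: the percentage applies to each end separately, and the
    number removed per end is rounded down (integer division).
    """
    if not 0 <= percent < 50:
        raise ValueError("percent must be in [0, 50)")
    cut = len(sorted_items) * percent // 100  # items removed at each end
    return sorted_items[cut:len(sorted_items) - cut]


# Twenty records already sorted by an illustrative error rate.
queue = [i / 100 for i in range(20)]
result = trim_head_and_tail(queue)
print(len(result))  # 16: two records removed at each end
```

Integer arithmetic is used for the cut so that the result does not depend on floating-point rounding of the fraction.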

Further, in step 1, the data structure of the heterogeneous big data at least includes an array, a queue, a hash table, and a tree.

Further, in step 1, the step of obtaining the standard data by splitting according to the data structure includes the following sub-steps:

step 1.1, receiving input according to the data structure type of the heterogeneous big data;

step 1.2, reading the data and splitting it into metadata with keywords according to the data structure type;

step 1.3, combining the metadata that share the same keyword to obtain standard data;

the standard data includes at least a data magnitude.
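A minimal Python sketch of sub-steps 1.1 to 1.3 follows. The input conventions are hypothetical: arrays and queues are assumed to hold `(keyword, value)` pairs, hash tables map keyword to value, and trees are nested dicts with `key`, `value`, and `children` fields. Taking the data magnitude of a record as the sum of its values is likewise an assumption made only for illustration.

```python
from collections import defaultdict, deque


def split_to_metadata(kind, data):
    """Steps 1.1 and 1.2: flatten one structure into (keyword, value) metadata."""
    if kind in ("array", "queue"):
        return [(k, v) for k, v in data]
    if kind == "hash_table":
        return list(data.items())
    if kind == "tree":
        pairs, stack = [], [data]
        while stack:  # iterative depth-first walk over the nested dicts
            node = stack.pop()
            pairs.append((node["key"], node["value"]))
            stack.extend(node.get("children", []))
        return pairs
    raise ValueError(f"unsupported data structure type: {kind}")


def combine_by_keyword(metadata):
    """Step 1.3: merge metadata with equal keywords into standard records."""
    merged = defaultdict(list)
    for keyword, value in metadata:
        merged[keyword].append(value)
    # Assumption: the data magnitude of a record is the sum of its values.
    return {k: {"values": v, "magnitude": sum(v)} for k, v in merged.items()}


inputs = [
    ("array", [("cpu", 3), ("mem", 5)]),
    ("queue", deque([("cpu", 4)])),
    ("hash_table", {"mem": 1, "disk": 7}),
    ("tree", {"key": "disk", "value": 2,
              "children": [{"key": "cpu", "value": 1}]}),
]
metadata = [p for kind, data in inputs for p in split_to_metadata(kind, data)]
standard = combine_by_keyword(metadata)
print(standard["cpu"]["magnitude"])  # 3 + 4 + 1 = 8
```

Once every structure has been reduced to the same keyed form, the later steps no longer need to know which structure a record came from.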

Further, in step 2, the sub-steps of calculating the error rate of the standard data are:

step 2.1, let $x_1, x_2, x_3, \ldots, x_n$ be the data magnitudes of n standard data; the arithmetic mean X' is then

$$X' = \frac{1}{n}\sum_{i=1}^{n} x_i$$

step 2.2, the error rate s of the standard data is calculated from the arithmetic mean X' as: [formula not reproduced in the source text]

wherein n is a positive integer whose value range is otherwise not limited, i ranges from 1 to n, and $x_i$ is the data magnitude of the i-th standard datum.
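The arithmetic mean of step 2.1 is straightforward to compute; the error-rate formula of step 2.2 is not reproduced in this text, so the sketch below substitutes an assumed definition, the relative deviation $|x_i - X'| / |X'|$, purely for illustration. The function name `error_rates` is hypothetical and the patent's own formula may differ.

```python
from statistics import fmean


def error_rates(magnitudes):
    """Return (mean, per-record error rates) for a list of data magnitudes.

    ASSUMPTION: s_i = |x_i - X'| / |X'|, the relative deviation from the
    arithmetic mean. The source does not reproduce its own formula.
    """
    if not magnitudes:
        raise ValueError("at least one standard datum is required")
    mean = fmean(magnitudes)  # X' = (x_1 + ... + x_n) / n
    if mean == 0:
        raise ValueError("relative deviation is undefined for a zero mean")
    return mean, [abs(x - mean) / abs(mean) for x in magnitudes]


mean, rates = error_rates([8.0, 10.0, 12.0, 30.0])
print(mean)   # 15.0
print(rates)  # the last record deviates from the mean by 100%
```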

Further, in step 3, the error threshold is defined as follows: let $S_1, S_2, S_3, \ldots, S_n$ be the error rates of the n standard data; the error threshold S' is then: [formula not reproduced in the source text]
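The deletion in step 3 can be sketched as a simple filter. Because the formula for S' does not appear in this text, the example lets the caller pass a threshold and, when none is given, falls back to the arithmetic mean of the error rates. That fallback is an assumption, and the function name `drop_abnormal` is hypothetical.

```python
def drop_abnormal(records, rates, threshold=None):
    """Keep (record, rate) pairs whose error rate does not exceed the threshold.

    ASSUMPTION: with no explicit threshold, S' is taken as the arithmetic
    mean of the error rates; the source does not reproduce its formula.
    """
    if len(records) != len(rates):
        raise ValueError("records and rates must have the same length")
    if not rates:
        return []
    if threshold is None:
        threshold = sum(rates) / len(rates)
    return [(r, s) for r, s in zip(records, rates) if s <= threshold]


records = ["a", "b", "c", "d"]
rates = [0.10, 0.20, 0.30, 1.00]
print(drop_abnormal(records, rates))  # mean rate is about 0.4, so 'd' is removed
```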

Further, in step 4, the methods for sorting by error rate include at least bubble sort, insertion sort, and simple selection sort.
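As an illustration of one of the named methods, the following Python sketch applies insertion sort to `(record, error_rate)` pairs. Ascending order is assumed, since the text does not state the sort direction, and the function name is hypothetical.

```python
def insertion_sort_by_rate(pairs):
    """Return a new list of (record, error_rate) pairs in ascending rate order."""
    ordered = list(pairs)  # copy so the caller's list is left untouched
    for i in range(1, len(ordered)):
        current = ordered[i]
        j = i - 1
        # Shift entries with a larger rate one slot right to open a gap.
        while j >= 0 and ordered[j][1] > current[1]:
            ordered[j + 1] = ordered[j]
            j -= 1
        ordered[j + 1] = current
    return ordered


pairs = [("a", 0.30), ("b", 0.10), ("c", 0.20)]
print(insertion_sort_by_rate(pairs))  # [('b', 0.1), ('c', 0.2), ('a', 0.3)]
```

The strict comparison keeps the sort stable, so records with equal error rates retain their input order.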

The invention also provides a device for filtering heterogeneous big data information, which comprises:

a splitting unit, used for reading the heterogeneous big data and splitting it according to the data structure to obtain standard data;

an error rate calculation unit, used for calculating the error rate of the standard data;

an exception handling unit, used for deleting the abnormal data whose error rate is greater than the error threshold in the standard data to obtain filtered data;

a sorting unit, used for sorting the filtered data according to the error rate;

a head and tail removing unit, used for deleting 10% of the data at the head and the tail of the sorted queue to obtain a filtering result;

and an output unit, used for outputting the filtering result.
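One way to picture how the units listed above fit together is a small Python class that chains them. Everything here is a hypothetical sketch: the class name `FilteringDevice`, its fields, and the toy callables are illustrative, and the error-rate and threshold functions are injected by the caller because this text does not reproduce their formulas.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, float]  # (record keyword, error rate)


@dataclass
class FilteringDevice:
    """Hypothetical wiring of the units; not the patent's implementation."""

    split: Callable[[object], Dict[str, float]]           # splitting unit
    error_rate: Callable[[Sequence[float]], List[float]]  # error rate calculation unit
    threshold: Callable[[Sequence[float]], float]         # supplies S' (caller's assumption)
    trim_percent: int = 10

    def run(self, raw: object) -> List[Pair]:
        standard = self.split(raw)  # {keyword: data magnitude}
        keys = list(standard)
        rates = self.error_rate([standard[k] for k in keys])
        limit = self.threshold(rates)
        # Exception handling unit: drop records whose rate exceeds the threshold.
        kept = [(k, s) for k, s in zip(keys, rates) if s <= limit]
        kept.sort(key=lambda pair: pair[1])  # sorting unit (ascending, assumed)
        cut = len(kept) * self.trim_percent // 100
        # Head and tail removing unit; the returned list plays the output unit.
        return kept[cut:len(kept) - cut]


# Toy callables, chosen only to make the example run.
device = FilteringDevice(
    split=dict,
    error_rate=lambda xs: [abs(x - sum(xs) / len(xs)) for x in xs],
    threshold=max,
)
print(device.run({"a": 1.0, "b": 2.0, "c": 6.0}))
```

Injecting the units as callables keeps the chain independent of any particular data structure, which mirrors the unified processing the disclosure describes.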

The beneficial effects of the present disclosure are as follows: the disclosed method and device for filtering heterogeneous big data information apply a unified processing method to different data structures, which normalizes the data, reduces the error rate of the heterogeneous data, and improves its logical compatibility.

Drawings

The foregoing and other features of the present disclosure will become more apparent from the detailed description of the embodiments given in conjunction with the drawings, in which like reference characters designate the same or similar elements throughout. The drawings described below are merely some examples of the present disclosure, and those skilled in the art may derive other drawings from them without inventive effort. In the drawings:

fig. 1 is a flowchart illustrating a method for filtering heterogeneous big data information according to the present disclosure;

fig. 2 illustrates a heterogeneous big data information filtering apparatus according to an embodiment of the present disclosure.

Detailed Description

The conception, specific structure, and technical effects of the present disclosure are described clearly and completely below in conjunction with the embodiments and the accompanying drawings, so that its objects, aspects, and effects can be fully understood. It should be noted that the embodiments in the present application, and the features of those embodiments, may be combined with one another where no conflict arises.

Fig. 1 is a flowchart of the method for filtering heterogeneous big data information according to the present disclosure. The method according to an embodiment of the present disclosure is described below with reference to Fig. 1.

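The disclosure provides a method for filtering heterogeneous big data information, which specifically comprises the following steps:

step 1, reading heterogeneous big data and splitting according to a data structure to obtain standard data;

step 2, calculating the error rate of the standard data;

step 3, deleting abnormal data with the error rate larger than the error threshold value in the standard data to obtain filtered data;

step 4, sorting the filtered data according to the error rate;

step 5, deleting 10% of data at the head and the tail in the sorted queue to obtain a filtering result;

step 6, outputting the filtering result.

The sub-steps of step 1 and step 2, the error threshold of step 3, and the sorting methods of step 4 are as set out in the Disclosure of Invention. The standard data includes at least a data magnitude.

The present invention also provides a device for filtering heterogeneous big data information. As shown in Fig. 2, the device includes:

a splitting unit, used for reading the heterogeneous big data and splitting it according to the data structure to obtain standard data;

an error rate calculation unit, used for calculating the error rate of the standard data;

an exception handling unit, used for deleting the abnormal data whose error rate is greater than the error threshold in the standard data to obtain filtered data;

a sorting unit, used for sorting the filtered data according to the error rate;

a head and tail removing unit, used for deleting 10% of the data at the head and the tail of the sorted queue to obtain a filtering result;

and an output unit, used for outputting the filtering result.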

The heterogeneous big data information filtering device can run on computing equipment such as desktop computers, notebook computers, handheld computers, and cloud servers. The equipment that runs the filtering device may include a processor and a memory. Those skilled in the art will appreciate that this is only an example of a filtering device for heterogeneous big data information and does not limit it; the device may include more or fewer components, combine certain components, or use different components. For example, it may further include input and output devices, network access devices, a bus, and the like. The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the equipment that runs the filtering device, and it connects the various parts of that equipment through various interfaces and lines.

The memory can be used to store computer programs and/or modules. The processor implements the various functions of the filtering device for heterogeneous big data information by running or executing the computer programs and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function or an image playing function); the data storage area may store data created through use of the equipment (such as audio data or a phonebook). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage device.

While the present disclosure has been described in considerable detail and with particular reference to several illustrative embodiments, it is not intended to be limited to any such details or embodiments. Rather, it is to be construed with reference to the appended claims, giving those claims the broadest possible interpretation in view of the prior art, so as to effectively cover the intended scope of the disclosure. Furthermore, the foregoing describes the disclosure in terms of embodiments foreseen by the inventor for which an enabling description was available; insubstantial modifications of the disclosure that are not presently foreseen may nonetheless represent equivalents thereto.

Claims

1. A method for filtering heterogeneous big data information, characterized in that the method comprises the following steps:

Step 1, read heterogeneous big data and split it according to the data structure to obtain standard data;

Step 2, calculate the error rate of standard data;

Step 3, delete the abnormal data whose error rate is greater than the error threshold in the standard data to obtain filtered data;

Step 4, sort the filtered data according to the size of the error rate;

Step 5, delete the first and last 10% of the data in the sorted queue to obtain the filtering result;

Step 6, output the filtering result;

In step 2, the sub-steps of calculating the error rate of the standard data are:

Step 2.1, let $x_1, x_2, x_3, \ldots, x_n$ be the data magnitudes of n standard data; the arithmetic mean X' is then

$$X' = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Step 2.2, the error rate s of the standard data is calculated from the arithmetic mean X' as: [formula not reproduced in the source text]

wherein n is a positive integer whose value range is otherwise not limited, i ranges from 1 to n, and $x_i$ is the data magnitude of the i-th standard datum;

In step 1, obtaining standard data by splitting according to the data structure includes the following sub-steps:

Step 1.1, receive input according to the data structure type of the heterogeneous big data;

Step 1.2, read the data and split it into metadata with keywords according to the data structure type;

Step 1.3, combine the metadata that share the same keyword to obtain standard data;

The standard data includes at least a data magnitude.

2. The method for filtering heterogeneous big data information according to claim 1, wherein, in step 1, the data structure of the heterogeneous big data at least includes an array, a queue, a hash table, and a tree.

3. The method for filtering heterogeneous big data information according to claim 1, wherein, in step 3, the error threshold is defined as follows: let $S_1, S_2, S_3, \ldots, S_n$ be the error rates of the n standard data; the error threshold S' is then: [formula not reproduced in the source text]

4. The method for filtering heterogeneous big data information according to claim 1, characterized in that, in step 4, the methods for sorting by error rate include at least bubble sort, insertion sort, and simple selection sort.

5. A device for filtering heterogeneous big data information, wherein the device comprises:

a splitting unit, used to read heterogeneous big data and split it according to the data structure to obtain standard data;

an error rate calculation unit, used to calculate the error rate of the standard data;

an exception handling unit, used to delete the abnormal data whose error rate is greater than the error threshold in the standard data to obtain filtered data;

a sorting unit, used to sort the filtered data by error rate;

a head and tail removing unit, used to delete the first and last 10% of the data in the sorted queue to obtain the filtering result;

an output unit, used to output the filtering result;

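wherein obtaining standard data by splitting according to the data structure includes the following sub-steps:

Step 1.1, receive input according to the data structure type of the heterogeneous big data;

Step 1.2, read the data and split it into metadata with keywords according to the data structure type;

Step 1.3, combine the metadata that share the same keyword to obtain standard data;

the standard data includes at least a data magnitude;

and calculating the error rate of the standard data includes the following sub-steps:

Step 2.1, let $x_1, x_2, x_3, \ldots, x_n$ be the data magnitudes of n standard data; the arithmetic mean X' is then

$$X' = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Step 2.2, the error rate s of the standard data is calculated from the arithmetic mean X' as: [formula not reproduced in the source text]

wherein n is a positive integer whose value range is otherwise not limited, i ranges from 1 to n, and $x_i$ is the data magnitude of the i-th standard datum.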