Disclosure of Invention
To solve the above problems, the invention provides an information credibility detection and evaluation system based on multi-source data. The system can mine the latent structural characteristics and the diverse, distinctive data attributes of multi-source data without supervised guidance information, so that it can effectively detect and evaluate the information credibility of multi-source research data, analyze and detect the credibility of each piece of research data, evaluate the overall credibility of the investigated object, and provide strong supervision and guidance for data mining tasks that depend on credibility detection.
The invention aims to provide an information credibility detection and evaluation system based on multi-source data, which is used for carrying out information credibility detection and evaluation on the multi-source research data, analyzing and detecting the credibility of each piece of research data, and evaluating the overall credibility of a researched object.
The technical scheme adopted by the invention for realizing the purpose is as follows:
a multi-source data-based information credibility detection and evaluation system comprises:
the data acquisition module is used for acquiring multi-source data related to the research task;
the data processing module is used for preprocessing the collected multi-source data related to the research tasks and acquiring semantic feature description vectors in the multi-source data and attribute labels of each research task;
and the reliability detection and evaluation module is used for detecting the reliability of each investigation task according to the semantic feature description vector and the attribute label information and evaluating the overall reliability of the investigated object.
The data acquisition module comprises:
a keyword acquisition unit for acquiring keywords related to the research task;
and the data mining unit is used for acquiring multi-source data information containing keywords in the research task.
The data processing module comprises:
the data screening unit is used for pre-screening the multi-source data to eliminate repeated and redundant multi-source data and delete the multi-source data irrelevant to the semantics of the keywords;
the text word segmentation unit is used for carrying out sentence word segmentation processing on the multi-source data to obtain vocabulary data after the sentence word segmentation;
and the characteristic extraction unit is used for extracting the characteristic vector in the vocabulary data and further determining the semantic characteristic description vector of the statement.
The credibility detection and evaluation module comprises:
the multi-source data credibility detection unit is used for detecting the credibility of each piece of multi-source data related to the investigation task and giving a credibility detection value;
and the reliability evaluation unit of the investigated object is used for evaluating the overall reliability of the investigated object and providing a reliability evaluation value of the investigated object.
The multi-source data credibility detection unit comprises:
the dictionary learning unit is used for learning sparse dictionaries for different research tasks, extracting semantic information of multi-source data related to the research tasks from semantic feature description vectors of the multi-source data, and discarding redundant semantic information of the multi-source data unrelated to the research tasks;
the data reconstruction unit is used for reconstructing semantic feature description vectors of multi-source data used for the research task according to the learned sparse dictionary;
the data reliability detection unit is used for quantizing the reconstructed semantic feature description vector into a reliability detection numerical value;
The investigated object credibility evaluation unit comprises:
the attribute hierarchical analysis unit is used for determining the relative importance weights of all the attribute labels of the multi-source data of the research task through the analytic hierarchy process;
and the object reliability evaluation unit is used for carrying out weighted average on the reliability values of all the multi-source data according to the relative importance weight of each piece of multi-source data to obtain the reliability evaluation value of the investigated object.
A method for detecting and evaluating information reliability based on multi-source data comprises the following steps:
1) the data acquisition module acquires multi-source data related to the research task;
2) the data processing module carries out preprocessing operation on the collected multi-source data related to the research tasks and obtains semantic feature description vectors in the multi-source data and attribute labels of each research task;
3) and the reliability detection and evaluation module detects the reliability of each investigation task according to the semantic feature description vector and the attribute label information and evaluates the overall reliability of the investigated object.
The step 1) comprises the following steps:
1.1) a keyword acquisition unit acquires keywords related to a research task;
1.2) the data mining unit acquires multi-source data information containing keywords in the research task.
The step 2) comprises the following steps:
2.1) the data screening unit pre-screens the multi-source data to eliminate repeated and redundant multi-source data and delete the multi-source data irrelevant to the semantics of the keywords;
2.2) the text word segmentation unit carries out sentence word segmentation on the multi-source research data to obtain vocabulary data after sentence word segmentation;
2.3) the feature extraction unit extracts the feature vector in the vocabulary data and further determines the semantic feature description vector of the sentence.
The step 3) comprises the following steps:
3.1) the multi-source data credibility detection unit detects the credibility of each piece of multi-source data related to the research task and provides a credibility detection value;
and 3.2) evaluating the overall reliability of the investigated object by the investigated object reliability evaluating unit, and giving a reliability evaluation value of the investigated object.
The step 3.1) and the step 3.2) comprise the following steps:
3.1.1) the dictionary learning unit learns sparse dictionaries aiming at different research tasks, extracts semantic information of multi-source data relevant to the research tasks from semantic feature description vectors of the multi-source data, and discards redundant semantic information of the multi-source data irrelevant to the research tasks;
3.1.2) the data reconstruction unit reconstructs semantic feature description vectors of the multi-source data used for the research task based on the learned sparse dictionary;
3.1.3) the data credibility detection unit quantizes the reconstructed semantic feature description vector into a credibility detection numerical value;
3.2.1) the attribute hierarchical analysis unit determines the relative importance weights of all the attribute labels of the multi-source data of the research task through the analytic hierarchy process;
3.2.2) the object credibility evaluation unit obtains the credibility evaluation numerical value of the investigated object by weighted average of the credibility numerical values of all multi-source data according to the relative importance weight of each multi-source data.
The invention has the advantages and beneficial effects that:
1. The invention fills the gap in multi-source data credibility detection and evaluation systems that work without supervised guidance information, and provides strong supervision and guidance for multi-source data mining tasks that depend on credibility detection in a big data era flooded with false information.
2. The invention can mine the latent structural characteristics and the diverse, distinctive data attributes of multi-source data without supervised guidance information, thereby effectively improving the performance and robustness of information credibility detection and evaluation for multi-source research data and serving as a reference for other big-data-driven mining and evaluation tasks.
Detailed Description
To make the advantages, technical solutions, and purposes of the embodiments of the present invention clearer, the technical solutions are set forth below with reference to the drawings, using as an example the detection and evaluation of the information credibility of multi-source data submitted by enterprises during research surveys. This example is only a part of the embodiments of the present invention, not all of them. The components of the embodiments illustrated in the drawings may be arranged and designed in a variety of different configurations. Accordingly, the following detailed description of the embodiments provided in the accompanying drawings is not intended to limit the scope of the invention as claimed, but merely represents selected embodiments of the invention. All other embodiments that a person skilled in the art can obtain from these embodiments without inventive work fall within the scope of protection of the invention.
The multi-source research data used for task evaluation contains a large amount of abnormal information and false data, which seriously affects the accuracy of the evaluation system and in turn limits the role of data mining in supervising and guiding industrial mode adjustment and the acceleration of economic development. To solve these problems, the invention provides an information credibility detection and evaluation system based on multi-source data, which can detect and evaluate the information credibility of the multi-source data used for research and evaluation, analyze and detect the credibility of each piece of research data, and evaluate the overall credibility of the investigated object.
As shown in fig. 1, the present embodiment provides an information reliability detection and evaluation system based on multi-source data, where the system includes:
the data acquisition module 11: used for collecting multi-source text data related to the research task, where the multi-source text data refers to text data about different enterprises in different cities obtained from official government websites, research questionnaires, and the enterprise internet;
the data processing module 22: used for preprocessing the collected multi-source data related to the research task to eliminate noise, obtaining the vocabulary data of each text sentence by word segmentation, and obtaining the semantic feature description vector and attribute label information of each research question;
the credibility detection and evaluation module 33: used for detecting the credibility of each research question and evaluating the overall credibility of the investigated object; the credibility value is quantized to a value between 0 and 100, and a larger value indicates higher credibility.
Here, data counts as related to the research task if it contains at least one keyword representing the investigated object or the recommendation.
For the collected multi-source text data related to enterprise research questions, the embodiment of the invention preprocesses the data to obtain the text feature description vector and the attribute label information of each research question. The preprocessing removes repeated and redundant data and deletes text that is semantically unrelated to the keywords. Each text sentence is then segmented into words; a word embedding technique maps the segmented words, according to their context, to a word embedding matrix; and a max pooling operation over the word features yields the feature representation of the sentence, which is used for the credibility detection and evaluation of the research data. The attribute label information refers to the main category to which each research question belongs. The main categories are logistics support, economic benefits, registered funds, partners, development scale, employee treatment, technical field, working time, civil investigation, government policies, innovation contributions, and talent introduction, and each main category contains 5-10 secondary subcategories.
To obtain multi-source text data about different enterprises in different cities from official government websites, research questionnaires, and the enterprise internet, the embodiment of the present invention provides the data acquisition module 11 shown in fig. 2, which includes:
the keyword acquisition unit 111: used for acquiring sentence keywords related to the research and evaluation task, so that information related to enterprise research and evaluation can be retrieved from official government websites, research questionnaires, the enterprise internet, and the like according to the acquired keywords;
the data mining unit 112: used for mining multi-source data containing the keywords in the research and evaluation task. With the keywords as a reference, the data can be retrieved through a remote data access interface or a web crawler. On the one hand, data can be acquired through data interfaces to the public data of official government websites or the open data of related websites; on the other hand, web crawler technology can be used to crawl data related to enterprise research from websites.
To extract text feature vectors from the multi-source enterprise research data, the embodiment of the invention learns the semantic feature representation of text sentences from word embedding matrices that take vocabulary context into account. As shown in fig. 3, the data processing module 22 according to the embodiment of the present invention includes:
the data screening unit 221 is used for performing pre-screening processing on the collected multi-source data related to enterprise research problems, eliminating repeated and redundant filling data, and deleting text expression information unrelated to keyword semantics;
the text word segmentation unit 222 is used for performing sentence word segmentation on the acquired multi-source data researched by the enterprise to obtain vocabulary data after sentence word segmentation;
the feature extraction unit 223 is used for extracting feature vectors from the segmented vocabulary data using the word embedding technique and then determining the feature description vector of the text sentence, where the feature description vector is a robust representation of the corresponding text sentence in the feature space.
The acquired multi-source enterprise research data is first screened in a simple way that automatically removes irrelevant items, such as entries with too little information (for example, fewer than 20 words), repeated text (duplicated information), and redundant text (for example, text that runs to 5000 words). The remaining, information-rich text sentences are then segmented into words, a word embedding matrix is learned for the segmented vocabulary using the word embedding technique, and the feature vector of each word is obtained from that matrix. Finally, a max pooling operation over the feature vectors of all the words in a text sentence yields the semantic feature representation of that sentence.
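The screening, segmentation, and pooling steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the function names, the whitespace tokenizer standing in for a real word segmenter, the substring test standing in for semantic keyword relevance, the reading of the 5000-word example as an upper length limit, and the pre-built embedding table are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def screen(texts: List[str], keywords: List[str],
           min_words: int = 20, max_words: int = 5000) -> List[str]:
    """Drop texts that are too short, too long, duplicated, or keyword-free.

    The thresholds follow the examples in the text; treating 5000 words as
    an upper limit is an assumption of this sketch.
    """
    seen = set()
    kept = []
    for text in texts:
        n_words = len(text.split())
        if n_words < min_words or n_words > max_words:
            continue  # too little information, or overly long
        if text in seen:
            continue  # exact duplicate of an earlier text
        # Crude stand-in for "semantically related to the keywords".
        if not any(k.lower() in text.lower() for k in keywords):
            continue
        seen.add(text)
        kept.append(text)
    return kept


def segment(text: str) -> List[str]:
    """Whitespace tokenizer used here in place of a real word segmenter."""
    return text.lower().split()


def sentence_vector(tokens: List[str], embeddings: Dict[str, Vector],
                    dim: int) -> Vector:
    """Element-wise max pooling over the embeddings of the known tokens."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [max(v[i] for v in vectors) for i in range(dim)]
```

In practice the embedding table would come from a trained word embedding model rather than a hand-written dictionary, and the segmenter would be language-specific.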
For the semantic feature representation of the text sentence obtained above, the embodiment of the present invention provides a credibility detection and evaluation module 33, which is used for credibility detection of each research question of an enterprise and credibility analysis of the whole researched enterprise. As shown in fig. 4, the reliability detection and evaluation module 33 according to the embodiment of the present invention includes:
the multi-source data credibility detection unit 331: used for detecting the credibility of each research question and giving a credibility detection value, which is quantized to between 0 and 100, where a larger value indicates higher credibility;
the investigated object credibility evaluation unit 332: used for evaluating the overall credibility of the investigated enterprise and giving a credibility evaluation value of the investigated object; the credibility evaluation value is between 0 and 100, and a higher value indicates higher credibility.
For the multi-source data credibility detection unit 331, the information credibility of the enterprise filled data is detected by constructing a dictionary optimization model, and powerful supervision and guidance are provided for relevant credibility assessment tasks. As shown in fig. 5, the multi-source data reliability detection unit 331 includes:
the dictionary learning unit 3311: the dictionary optimization technology can be used for constructing and learning a sparse dictionary aiming at credibility evaluation detection tasks of different investigation problems, the dictionary can mine text information highly related to the credibility evaluation detection tasks, redundant irrelevant text data is abandoned, and the complex structure of multi-source data is effectively explored;
the data reconstruction unit 3312: based on the learned sparse dictionary, text feature description vectors used for detection and evaluation before can be reconstructed, and reconstruction errors can be obtained;
the data credibility detection unit 3313: used for quantizing the reconstruction error, through an activation function, into a credibility detection value between 0 and 100, where a higher value indicates higher credibility.
For the feature description vectors of the enterprise research text sentences obtained above, a joint optimization objective over the text dictionary and the sparse features of the text sentences is constructed to learn a sparse dictionary representation relevant to credibility detection and evaluation of the multi-source enterprise research data. The sparse dictionary can mine the text sentence information that is highly relevant to information credibility detection and discard redundant, irrelevant text data. Using the optimized sparse dictionary as a reference, the text sentence information of the enterprise research data is reconstructed, and the reconstruction error is quantized into a credibility detection value by an activation function. In this way, credibility detection and evaluation can be carried out accurately and quickly for each question answered by the enterprise, with the credibility value quantized to between 0 and 100.
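The reconstruction and scoring steps can be illustrated with a short sketch. It assumes the dictionary has already been learned and is passed in as a list of atoms; the dictionary learning step itself is omitted. The text does not specify the sparse coding solver, the activation function, or the direction of the mapping, so the ISTA solver, the form 100 * exp(-scale * error), and the assumption that a smaller reconstruction error means higher credibility are choices made for this example only. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink a value toward zero; the proximal step for an L1 penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def reconstruct(atoms: List[Vector], code: Vector) -> Vector:
    """Linear combination of the dictionary atoms weighted by the code."""
    dim = len(atoms[0])
    return [sum(c * atom[i] for c, atom in zip(code, atoms)) for i in range(dim)]


def sparse_code(x: Vector, atoms: List[Vector], lam: float = 0.1,
                step: float = 0.1, iters: int = 200) -> Vector:
    """ISTA for min_a 0.5 * ||x - D a||^2 + lam * ||a||_1.

    The step size must not exceed 1 / (largest eigenvalue of D^T D);
    0.1 is safe for the unit-norm, near-orthogonal atoms assumed here.
    """
    code = [0.0] * len(atoms)
    for _ in range(iters):
        residual = [xi - ri for xi, ri in zip(x, reconstruct(atoms, code))]
        # Gradient step on the quadratic term, then soft thresholding.
        code = [
            soft_threshold(c + step * sum(a * r for a, r in zip(atom, residual)),
                           step * lam)
            for c, atom in zip(code, atoms)
        ]
    return code


def credibility_score(x: Vector, atoms: List[Vector], scale: float = 1.0) -> float:
    """Map the reconstruction error to (0, 100]; lower error scores higher."""
    recon = reconstruct(atoms, sparse_code(x, atoms))
    error = math.sqrt(sum((xi - ri) ** 2 for xi, ri in zip(x, recon)))
    return 100.0 * math.exp(-scale * error)
```

A feature vector that lies close to the span of the learned atoms reconstructs with a small error and scores near 100, while a vector the dictionary cannot represent scores lower.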
For the reliability evaluation unit 332 of the investigated object, importance weights of the data attributes are determined through an analytic hierarchy process, and the overall reliability evaluation is performed on the investigated enterprise in a data weighting mode. As shown in fig. 6, the investigated object reliability evaluation unit 332 includes:
attribute level analysis unit 3321: the method is used for determining relative importance weights of all attributes through an analytic hierarchy process for 12 main category attributes and 5-10 secondary category attributes of research data;
the object reliability evaluation unit 3322: and according to the relative importance weight of each piece of research data, carrying out weighted average on the credibility values to obtain the overall credibility value of the researched enterprise, wherein the value is quantized to be between 0 and 100, and the larger the value is, the higher the credibility is.
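The weighting and averaging performed by units 3321 and 3322 can be sketched as follows. This is an illustrative example under assumptions: the pairwise comparison matrix is supplied by the analyst, the weights are approximated with the row geometric mean method (a common approximation to the principal eigenvector used in the analytic hierarchy process), each piece of research data takes the weight of its attribute label, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def ahp_weights(pairwise: List[List[float]]) -> List[float]:
    """Approximate AHP priority weights by the row geometric mean method.

    pairwise[i][j] states how much more important attribute i is than
    attribute j, with pairwise[j][i] == 1 / pairwise[i][j].
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def overall_credibility(items: List[Tuple[str, float]],
                        attribute_weights: Dict[str, float]) -> float:
    """Weighted average of per-item scores; each item is (attribute, score)."""
    weighted_sum = sum(attribute_weights[attr] * score for attr, score in items)
    weight_total = sum(attribute_weights[attr] for attr, _ in items)
    return weighted_sum / weight_total


# Usage with three of the main categories and a made-up comparison matrix.
attributes = ["economic benefits", "government policies", "working time"]
matrix = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights = dict(zip(attributes, ahp_weights(matrix)))
items = [("economic benefits", 90.0), ("government policies", 60.0),
         ("working time", 30.0)]
score = overall_credibility(items, weights)
```

Because the per-item scores lie between 0 and 100 and the weights are positive, the weighted average also lies between 0 and 100.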
Based on the same inventive concept, the embodiment of the invention also provides a method corresponding to the information credibility detection and evaluation system based on multi-source data. Because the principle of the method is similar to that of the system for detecting and evaluating the credibility of information submitted by enterprises described above, the implementation of the method can refer to the implementation of the system, and the overlapping parts are not described again. Fig. 7 shows a flowchart of a method for detecting and evaluating information credibility based on multi-source data according to an embodiment of the present application, which includes:
S11: collecting multi-source data submitted by enterprises and related to enterprise research questions, where the multi-source text data refers to text data about different enterprises in different cities obtained from official government websites, research questionnaires, and the enterprise internet;
S22: preprocessing the collected multi-source data related to the research task to obtain the semantic feature description vector and attribute label information of each research question, where the attribute label information comprises 12 main categories, each with 5-10 secondary subcategories;
S33: detecting the credibility of each research question and evaluating the overall credibility of the investigated object.
In the embodiment of the present invention, as shown in fig. 8, the step S11 specifically includes the following steps:
S111: acquiring keywords related to enterprise research data evaluation so as to perform information retrieval related to the enterprise research evaluation;
S112: mining multi-source data containing the keywords in the enterprise research evaluation through a data interface or a web crawler.
In the embodiment of the present invention, as shown in fig. 9, the step S22 specifically includes the following steps:
S221: pre-screening the collected multi-source data related to enterprise research questions, eliminating repeated and redundant enterprise-submitted data, and deleting text that is semantically unrelated to the keywords;
S222: performing sentence word segmentation on the acquired multi-source enterprise research data to obtain the segmented vocabulary data;
S223: extracting feature vectors from the segmented vocabulary data using the word embedding technique, and then determining the feature description vector of the text sentence.
In the embodiment of the present invention, as shown in fig. 10, the step S33 specifically includes the following steps:
S331: detecting the credibility of each research question and giving a credibility detection value;
S332: evaluating the overall credibility of the investigated enterprise and giving a credibility evaluation value of the investigated object.
In an embodiment of the present invention, as shown in fig. 11, the step S331 specifically includes the following steps:
S3311: constructing and learning a sparse dictionary for the credibility detection and evaluation tasks of different research questions, mining the text information highly related to those tasks, and discarding redundant, irrelevant text data;
S3312: reconstructing, based on the learned sparse dictionary, the text feature description vectors of the multi-source data used for detection and evaluation, and obtaining the reconstruction error;
S3313: quantizing the reconstruction error into a credibility detection value through an activation function.
In the embodiment of the present invention, as shown in fig. 12, the step S332 specifically includes the following steps:
S3321: determining the relative importance weights of all attributes of the research data through the analytic hierarchy process;
S3322: carrying out a weighted average of the credibility values according to the relative importance weight of each piece of research data to obtain the overall credibility value of the investigated enterprise.