CN113704414B - Data processing method, system, storage medium and electronic equipment - Google Patents
- Publication number
- CN113704414B (application CN202111027813.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The application discloses a data processing method, a system, a storage medium and electronic equipment. The method comprises: preprocessing acquired content to be audited to obtain content segments in a preset format; storing the content segments in a preset distributed message queue; sending the content segments in the preset distributed message queue to a preset word stock retrieval model and a preset AI identification model for identification to obtain a first identification result and a second identification result; and auditing the two identification results to obtain and output an audit result. Preprocessing the content to be audited yields content segments in a uniform format, and a high-availability preset distributed message queue delivers the content segments quickly to the preset word stock retrieval model and the preset AI identification model, which facilitates rapid retrieval by the preset word stock retrieval model and improves the efficiency of auditing content information. Intelligent expansion and self-learning of the preset word stock retrieval model, combined with the preset AI identification model, form a dual recognition mechanism that improves the recognition rate of auditing content information.
Description
Technical Field
The present application relates to the field of data content auditing technologies, and in particular, to a data processing method, a system, a storage medium, and an electronic device.
Background
Interactive scenes of an internet service platform, such as user chats, e-commerce comments, posts and messages, generate content information, which includes advertisements, news and the like. Internet content is subject to corresponding laws and regulations, and once a violation occurs, the website or app may be required to rectify, be taken off the shelf or be shut down. To comply with the regulations and form a healthy content ecosystem, the content information generated by the internet service platform needs to be audited in real time.
In the prior art, content is audited by combining machine auditing with human auditing: the machine performs a preliminary audit and marks suspected content, and the audit result is released after manual auditing. Machine auditing filters the target text mainly by means of an established word stock; if the target text matches a keyword in the word stock, the target text is judged to be violating text.
Because the word stock comprises many large categories, such as various kinds of advertisements and news, and each large category further comprises many small categories, matching keywords against the word stock is slow. In addition, when keywords appear in variant forms, such as homophones and visually similar words, the recognition rate of auditing content information by keywords is low.
Therefore, the prior art has low efficiency and recognition rate in auditing content information.
Disclosure of Invention
In view of the above, the application discloses a data processing method, a system, a storage medium and electronic equipment, which aim to improve the efficiency and recognition rate of auditing content information.
In order to achieve the above purpose, the technical scheme disclosed by the application is as follows:
the first aspect of the application discloses a data processing method, which comprises the following steps:
acquiring content to be audited;
preprocessing the content to be checked to obtain a content segment in a preset format, and storing the content segment into a preset distributed message queue;
The content segments in the preset distributed message queue are respectively sent to a preset word stock retrieval model and a preset AI identification model for identification, and a first identification result and a second identification result are obtained; the preset word stock retrieval model is used for identifying sensitive words and corresponding risk types; the preset AI identification model is used for identifying the violation type;
And auditing the first identification result and the second identification result to obtain and output an auditing result.
Preferably, the preprocessing the content to be checked to obtain a content segment in a preset format includes:
If the fact that the preset characters exist in the content to be audited is monitored, removing the preset characters in the content to be audited, and obtaining the content to be audited without the preset characters;
Calculating the content to be checked without preset characters through a preset semantic algorithm to obtain an original content segment;
And carrying out grammar conversion on the original content segment to obtain the content segment with the preset format.
Preferably, the sending the content segments in the preset distributed message queue to a preset word stock retrieval model and a preset AI identification model for identification respectively to obtain a first identification result and a second identification result includes:
identifying cluster nodes corresponding to a preset word stock retrieval model in an idle state and cluster nodes corresponding to a preset AI identification model in the idle state;
The content segments in the preset distributed message queue are respectively sent to the cluster nodes corresponding to the preset word stock retrieval model in the idle state and the cluster nodes corresponding to the preset AI identification model in the idle state for identification, and a first identification result and a second identification result are obtained; the first recognition result is obtained by recognizing cluster nodes corresponding to the preset word stock retrieval model in an idle state; and the second identification result is obtained by identifying the cluster node corresponding to the preset AI identification model in the idle state.
Preferably, the auditing processing is performed on the first identification result and the second identification result, so as to obtain and output an auditing result, including:
Determining a first result type corresponding to the first identification result and a second result type corresponding to the second identification result;
Judging a first result type corresponding to the first identification result and a second result type corresponding to the second identification result;
And/or if the first result type is a risk type and the second result type is a preset violation type, marking the label corresponding to the obtained audit result as a violation label, and outputting the audit result marked with the violation label;
And/or if the first result type is the risk type and the second result type is a preset suspected violation type, marking the label corresponding to the obtained audit result as a violation label and outputting the audit result marked with the violation label;
And/or if the first result type is the risk type and the second result type is a preset compliance type, marking the label corresponding to the obtained audit result as a violation label and outputting the audit result marked with the violation label;
And/or if the first result type is a risk-free type and the second result type is the preset violation type, marking the label corresponding to the obtained audit result as a violation label and outputting the audit result marked with the violation label;
and/or if the first result type is the risk-free type and the second result type is the preset suspected violation type, marking the label corresponding to the obtained audit result as a suspected violation label, and outputting the audit result marked with the suspected violation label;
and/or if the first result type is the risk-free type and the second result type is the preset compliance type, marking the label corresponding to the obtained auditing result as a compliance label, and outputting the auditing result marked with the compliance label.
Preferably, the method further comprises:
and if the number of the queues in the preset distributed message queues is monitored to be larger than the preset number, performing dynamic capacity expansion operation on the preset distributed message queues.
A second aspect of the application discloses a data processing system, the system comprising:
The acquisition unit is used for acquiring the content to be audited;
The preprocessing unit is used for preprocessing the content to be checked to obtain content fragments in a preset format, and storing the content fragments into a preset distributed message queue;
The identification unit is used for respectively transmitting the content fragments in the preset distributed message queue to a preset word stock retrieval model and a preset AI identification model for identification to obtain a first identification result and a second identification result; the preset word stock retrieval model is used for identifying sensitive words and corresponding risk types; the preset AI identification model is used for identifying the violation type;
and the auditing unit is used for auditing the first identification result and the second identification result, obtaining an auditing result and outputting the auditing result.
Preferably, the preprocessing unit for preprocessing the content to be checked to obtain a content segment in a preset format includes:
The removing module is used for removing the preset characters in the content to be audited if the preset characters exist in the content to be audited, so as to obtain the content to be audited without the preset characters;
the computing module is used for computing the content to be checked without the preset characters through a preset semantic algorithm to obtain an original content segment;
And the conversion module is used for carrying out grammar conversion on the original content fragment to obtain a content fragment with a preset format.
Preferably, the identification unit includes:
the identification module is used for identifying cluster nodes corresponding to the preset word stock retrieval model in an idle state and cluster nodes corresponding to the preset AI identification model in the idle state;
The sending module is used for respectively sending the content fragments in the preset distributed message queue to the cluster nodes corresponding to the preset word stock retrieval model in the idle state and the cluster nodes corresponding to the preset AI identification model in the idle state for identification, so as to obtain a first identification result and a second identification result; the first recognition result is obtained by recognizing cluster nodes corresponding to the preset word stock retrieval model in an idle state; and the second identification result is obtained by identifying the cluster node corresponding to the preset AI identification model in the idle state.
A third aspect of the present application discloses a storage medium comprising stored instructions, wherein the instructions, when executed, control a device on which the storage medium is located to perform the data processing method according to any one of the first aspects.
A fourth aspect of the application discloses an electronic device comprising a memory, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by one or more processors to perform the data processing method according to any of the first aspects.
According to the above technical scheme, the application discloses a data processing method, a system, a storage medium and electronic equipment. Content to be audited is acquired and preprocessed to obtain content segments in a preset format, and the content segments are stored in a preset distributed message queue. The content segments in the preset distributed message queue are sent respectively to a preset word stock retrieval model and a preset AI identification model for identification to obtain a first identification result and a second identification result; the preset word stock retrieval model is used for identifying sensitive words and their corresponding risk types, and the preset AI identification model is used for identifying violation types. The first identification result and the second identification result are then audited, and the audit result is obtained and output. Preprocessing the content to be audited yields content segments in a uniform format, and a high-availability preset distributed message queue delivers the content segments quickly to the preset word stock retrieval model and the preset AI identification model, which facilitates rapid retrieval by the preset word stock retrieval model and improves the efficiency of auditing content information. Intelligent expansion and self-learning of the preset word stock retrieval model, combined with the preset AI identification model, form a dual recognition mechanism that improves the recognition rate of auditing content information.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of preprocessing content to be audited to obtain content segments in a preset format according to the embodiment of the present application;
FIG. 3 is a schematic flow chart of obtaining a first recognition result and a second recognition result according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a data processing system according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In the present disclosure, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As known from the background art, the existing auditing method for the content information is low in efficiency and recognition rate.
In order to solve these problems, the embodiment of the application discloses a data processing method, a system, a storage medium and electronic equipment. Content to be audited is preprocessed to obtain content segments in a uniform format, and a high-availability preset distributed message queue delivers the content segments quickly to a preset word stock retrieval model and a preset AI identification model, which facilitates rapid retrieval by the preset word stock retrieval model and improves the efficiency of auditing content information. Intelligent expansion and self-learning of the preset word stock retrieval model, combined with the preset AI identification model, form a dual recognition mechanism that improves the recognition rate of auditing content information. The specific implementation is illustrated by the following examples.
Referring to fig. 1, a flow chart of a data processing method according to an embodiment of the present application is shown, and the data processing method mainly includes the following steps:
S101: obtain the content to be audited.
In S101, the content to be audited is content information input by users and collected from the internet service platform.
Interactive scenes of the internet service platform, such as user chats, e-commerce comments, posts, messages and bullet comments (barrages), generate content information, that is, the content to be audited, which includes advertisement content, news content, chat content and the like.
S102: preprocessing the content to be audited to obtain content fragments in a preset format, and storing the content fragments into a preset distributed message queue.
In S102, preprocessing is applied to the content to be audited, so as to obtain content segments in a preset format (unified format), so that the content segments in the preset format can be conveniently and rapidly searched.
Because the preset distributed message queue has the characteristics of high performance, high availability and the like, the content fragments in the preset format are received in real time through the preset distributed message queue for caching.
Specifically, the process of preprocessing the content to be audited to obtain the content segments in the preset format is as follows:
firstly, if the existence of the preset characters in the content to be checked is monitored, the preset characters in the content to be checked are removed, and the content to be checked without the preset characters is obtained.
The preset characters comprise punctuation marks, special characters, web page tags, stop words and the like.
And then, calculating the content to be checked without preset characters through a preset semantic algorithm to obtain an original content segment.
Specifically, the preset semantic algorithm performs semantic segmentation on the content to be audited without the preset characters to obtain the original content segments.
Processing the content with the preset semantic algorithm helps avoid false alarms and erroneous interception of the content to be audited.
The preset semantic algorithm may be a natural semantic algorithm or another semantic algorithm; the specific algorithm is chosen by a technician according to actual conditions and is not specifically limited in the application.
And finally, carrying out grammar conversion on the original content fragment to obtain the content fragment with the preset format.
The original content segments undergo conversion between Simplified and Traditional Chinese and English letter-case conversion, so that they are uniformly converted into Simplified Chinese with a single English letter case; the result is the content segments in the preset format.
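The three preprocessing steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: the tag and punctuation patterns, the stop-word list, the whitespace-based segmentation (standing in for the unspecified preset semantic algorithm), the tiny Traditional-to-Simplified table and the choice of lowercase as the unified letter case are all hypothetical and are not taken from the application.

```python
import re

# Hypothetical preset characters: web page tags, punctuation and special symbols.
TAG_PATTERN = re.compile(r"<[^>]+>")
PUNCT_PATTERN = re.compile(r"[^\w\s]")
# Hypothetical stop-word list.
STOP_WORDS = {"the", "a", "an"}
# Tiny illustrative Traditional-to-Simplified table; a real system would need a full one.
TRAD_TO_SIMP = str.maketrans({"廣": "广", "貸": "贷", "買": "买"})


def remove_preset_characters(content: str) -> str:
    """Step 1: strip tags, punctuation, special characters and stop words."""
    content = TAG_PATTERN.sub(" ", content)
    content = PUNCT_PATTERN.sub(" ", content)
    kept = [tok for tok in content.split() if tok.lower() not in STOP_WORDS]
    return " ".join(kept)


def segment(content: str) -> list[str]:
    """Step 2: placeholder for the preset semantic algorithm (whitespace split)."""
    return content.split()


def convert(fragment: str) -> str:
    """Step 3: unify to Simplified Chinese and (as an assumption) lowercase English."""
    return fragment.translate(TRAD_TO_SIMP).lower()


def preprocess(content: str) -> list[str]:
    """Run the three steps and return content segments in the preset format."""
    cleaned = remove_preset_characters(content)
    return [convert(fragment) for fragment in segment(cleaned)]
```

For example, `preprocess("<b>Buy</b> the 廣告, NOW!")` drops the tag, the punctuation and the stop word, then normalizes the remaining three fragments.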
S103: the content segments in the preset distributed message queue are respectively sent to a preset word stock retrieval model and a preset AI identification model for identification, and a first identification result and a second identification result are obtained; the preset word stock retrieval model is used for identifying sensitive words and their corresponding risk types; the preset AI identification model is used for identifying the violation type.
In S103, according to the cluster performance of the preset word stock retrieval model and the cluster performance of the preset artificial intelligence (AI) identification model, intelligent scheduling is performed on the content segments in the preset distributed message queue, and the content segments in the preset distributed message queue are respectively synchronized to the preset word stock retrieval model and the preset AI identification model.
And carrying out intelligent scheduling on the content segments in the preset distributed message queue according to the cluster performance of the preset word stock retrieval model and the cluster performance of the preset AI identification model, namely identifying the content segments in the preset distributed message queue through the cluster nodes corresponding to the preset word stock retrieval model in the idle state and the cluster nodes corresponding to the preset AI identification model in the idle state.
The preset word stock retrieval model performs self-learning to form a comprehensive word stock. Specifically, the self-learning comprises supervision file self-learning, internet new noun monitoring, deformed word stock self-learning, business personnel feedback, distributed retrieval and real-time monitoring.
Supervision file self-learning: self-learning is performed on industry requirements, laws and regulations and the like, such as forbidden words specified by advertisement laws and forbidden words published by the online credit office, so that the word stock is followed up and revised in a timely manner. Operators can optimize and modify the self-learning results and enter them into the word stock.
Internet new noun monitoring: new words are mined based on a word similarity algorithm, labeled and confirmed by operators, and entered into the word stock.
Deformed word stock self-learning: based on the basic word stock, homophones, interference words and visually similar words are synchronously expanded to form a new word stock independent of the standard word stock.
Business personnel feedback: the word stock is revised in a targeted manner according to the false-positive and missed cases fed back during business use.
Distributed retrieval: matching is performed in a plurality of different word stocks, and the results are synchronized to a decision-making system. Through these mechanisms for updating the word stock in real time, the latest compliance requirements and deformed words can be accurately identified.
Real-time monitoring: policy documents, violation events and internet new nouns are monitored in real time, and the word stock is then expanded through self-learning.
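As a rough illustration of how a deformed word stock and retrieval across several word stocks might fit together, the following sketch expands a base word stock with variant spellings and then matches content segments against each word stock in turn. The words, risk types and variant table are invented for the example, and exact dictionary lookup is only an assumed matching strategy; the application does not specify one.

```python
# Hypothetical base word stock: sensitive word -> risk type.
BASE_WORD_STOCK = {"loan": "advertisement", "casino": "gambling"}

# Hypothetical variant table (homophones, interference words, look-alike spellings).
VARIANTS = {"loan": ["l0an", "lo4n"], "casino": ["cas1no"]}


def build_deformed_word_stock(base: dict[str, str],
                              variants: dict[str, list[str]]) -> dict[str, str]:
    """Expand the base word stock into a separate deformed word stock.

    Each variant inherits the risk type of the word it is derived from.
    """
    deformed = {}
    for word, risk_type in base.items():
        for variant in variants.get(word, []):
            deformed[variant] = risk_type
    return deformed


def retrieve(fragments: list[str],
             word_stocks: list[dict[str, str]]) -> list[tuple[str, str]]:
    """Match each fragment against several word stocks.

    Returns (sensitive word, risk type) pairs in the order encountered.
    """
    hits = []
    for fragment in fragments:
        for stock in word_stocks:
            if fragment in stock:
                hits.append((fragment, stock[fragment]))
                break  # one hit per fragment is enough for this sketch
    return hits
```

A call such as `retrieve(["cheap", "l0an"], [BASE_WORD_STOCK, build_deformed_word_stock(BASE_WORD_STOCK, VARIANTS)])` would report the deformed spelling with the risk type of its base word.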
The labeling, training and calling processes of the preset AI identification model are as follows:
Sample labeling: according to the business requirements for identifying content violations, the common violation classes of text, including advertisement, negative and other categories, are defined first, and business data is labeled with these classes.
Model training: the labeled business data, supplemented with publicly available open-source labeled data sets from the industry, is used for training. Features are extracted from the labeled data, which is converted into a high-dimensional vector representation by the embedding layer of a neural network and then passed through operations such as convolution, pooling and full connection to obtain the classification output of the model, which indicates the violation class of the business data.
Model call: the model is called to identify the content segments in the preset distributed message queue.
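The embedding, convolution, pooling and full-connection pipeline described above can be illustrated with a toy forward pass. This is a sketch only: the vocabulary, the two-dimensional embeddings, the filter weights, the dense-layer weights and the two class names are made-up constants rather than trained parameters, and the ReLU activation and softmax output are assumptions that the application does not state.

```python
import math

# Hypothetical vocabulary and hand-picked (untrained) parameters.
VOCAB = {"<unk>": 0, "buy": 1, "now": 2, "hello": 3}
EMBEDDING = [[0.0, 0.0], [1.0, 0.5], [0.8, 0.9], [-1.0, -0.5]]  # one row per token id
CONV_FILTERS = [                       # each filter spans 2 tokens x 2 embedding dims
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
DENSE_W = [[1.0, -1.0], [-1.0, 1.0]]   # one row per class, one column per filter
DENSE_B = [0.0, 0.0]
CLASSES = ["advertisement", "compliant"]  # hypothetical class names


def embed(tokens):
    """Embedding layer: map each token to its vector (unknown tokens use id 0)."""
    return [EMBEDDING[VOCAB.get(token, 0)] for token in tokens]


def conv1d(vectors, kernel):
    """Slide one filter over the token axis and apply ReLU to each response."""
    width = len(kernel)
    outputs = []
    for start in range(len(vectors) - width + 1):
        total = 0.0
        for offset in range(width):
            total += sum(w * x for w, x in zip(kernel[offset], vectors[start + offset]))
        outputs.append(max(0.0, total))
    return outputs


def classify(tokens):
    """Embedding -> convolution -> max pooling -> full connection -> softmax."""
    vectors = embed(tokens)
    width = len(CONV_FILTERS[0])
    while len(vectors) < width:        # zero-pad inputs shorter than the filter
        vectors.append([0.0, 0.0])
    pooled = [max(conv1d(vectors, kernel)) for kernel in CONV_FILTERS]
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(DENSE_W, DENSE_B)]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(CLASSES, exps)}
```

In a real system the parameters would be learned from the labeled data, and the class list would follow the violation classes defined during sample labeling.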
Specifically, content segments in a preset distributed message queue are respectively sent to a preset word stock retrieval model and a preset AI identification model for identification, and the process of obtaining a first identification result and a second identification result is as follows:
First, cluster nodes corresponding to a preset word stock retrieval model in an idle state and cluster nodes corresponding to a preset AI identification model in the idle state are identified.
Then, respectively transmitting the content fragments in the preset distributed message queue to a cluster node corresponding to a preset word stock retrieval model in an idle state and a cluster node corresponding to a preset AI identification model in the idle state for identification, so as to obtain a first identification result and a second identification result; the first recognition result is obtained by recognizing cluster nodes corresponding to a preset word stock retrieval model in an idle state; and the second identification result is obtained by identifying the cluster nodes corresponding to the preset AI identification model in the idle state.
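The idle-node selection and dual dispatch described above might look like the following sketch. The `ClusterNode` class, its `busy` flag and the `recognize` callables are hypothetical stand-ins for real cluster nodes, and the two recognitions run one after the other here purely for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ClusterNode:
    """Hypothetical cluster node hosting one of the two models."""
    name: str
    recognize: Callable[[str], str]
    busy: bool = False


def pick_idle_node(nodes: list[ClusterNode]) -> Optional[ClusterNode]:
    """Return the first node that is in the idle state, or None if all are busy."""
    for node in nodes:
        if not node.busy:
            return node
    return None


def run_on(node: ClusterNode, fragment: str) -> str:
    """Mark the node busy while it recognizes the fragment, then release it."""
    node.busy = True
    try:
        return node.recognize(fragment)
    finally:
        node.busy = False


def dispatch(fragment: str,
             lexicon_nodes: list[ClusterNode],
             ai_nodes: list[ClusterNode]) -> tuple[str, str]:
    """Send one content segment to an idle node of each model.

    Returns (first identification result, second identification result).
    """
    lexicon_node = pick_idle_node(lexicon_nodes)
    ai_node = pick_idle_node(ai_nodes)
    if lexicon_node is None or ai_node is None:
        raise RuntimeError("no idle cluster node available")
    return run_on(lexicon_node, fragment), run_on(ai_node, fragment)
```

A production scheduler would more likely leave the segment in the queue and retry when no node is idle; raising an error simply keeps the sketch short.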
The preset distributed message queue has the characteristics of high performance, high availability and the like, and the high-availability preset distributed message queue is adopted to rapidly schedule the content fragments in real time and send the content fragments to the preset word stock retrieval model and the preset AI identification model, so that the processing efficiency of the content fragments is improved.
The content segments in the preset format are synchronized to the preset distributed message queue; if the number of queues in the preset distributed message queue is monitored to be greater than the preset number, a dynamic capacity expansion operation is performed on the preset distributed message queue.
When the performance of the preset distributed message queue degrades, or once it has recovered and stabilized, dynamic capacity expansion can be performed according to the concurrency of the content data.
The preset number may be 50, 100, etc., and the determination of the specific preset number is set by a technician according to the actual situation, which is not particularly limited.
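One possible reading of the expansion rule is sketched below with an in-memory stand-in for the distributed message queue. It assumes that the monitored quantity is the number of queued content segments and that expansion means adding a partition; both are interpretations for illustration, and the `ExpandableQueue` class is hypothetical rather than the API of any real message queue product.

```python
from collections import deque


class ExpandableQueue:
    """In-memory stand-in for a distributed message queue with partitions."""

    def __init__(self, preset_number: int, partitions: int = 1):
        self.preset_number = preset_number
        self.partitions = [deque() for _ in range(partitions)]

    def backlog(self) -> int:
        """Total number of content segments waiting in all partitions."""
        return sum(len(partition) for partition in self.partitions)

    def put(self, fragment: str) -> None:
        """Enqueue on the shortest partition, expanding if the backlog is too large."""
        min(self.partitions, key=len).append(fragment)
        if self.backlog() > self.preset_number:
            self.partitions.append(deque())  # dynamic capacity expansion

    def get(self):
        """Dequeue from the first non-empty partition, or return None if empty."""
        for partition in self.partitions:
            if partition:
                return partition.popleft()
        return None
```

Note that this sketch preserves ordering only within a partition, and it never shrinks; a real deployment would rely on the scaling facilities of the chosen message queue.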
S104: audit the first identification result and the second identification result to obtain and output an audit result.
In S104, the first result type of the first recognition result and the second result type corresponding to the second recognition result are comprehensively checked, that is, the type corresponding to the checked result is determined through the first result type of the first recognition result and the second result type corresponding to the second recognition result, the type corresponding to the checked result is marked, and finally the marked checked result is output.
The first result type comprises a risk type and a risk-free type; the second result type comprises a preset violation type, a preset suspected violation type and a preset compliance type, where the preset compliance type indicates content that meets the regulations.
The marked auditing results comprise auditing results marked with the illegal tags, auditing results marked with suspected illegal tags and auditing results marked with the compliance tags.
Specifically, the first identification result and the second identification result are subjected to auditing treatment, and the auditing result is obtained and output as follows:
And determining a first result type corresponding to the first identification result and a second result type corresponding to the second identification result.
And judging the first result type corresponding to the first identification result and the second result type corresponding to the second identification result.
And/or if the first result type is a risk type and the second result type is a preset violation type, marking the label corresponding to the obtained audit result as a violation label, and outputting the audit result marked with the violation label.
And/or if the first result type is a risk type and the second result type is a preset suspected violation type, marking the label corresponding to the obtained audit result as a violation label, and outputting the audit result marked with the violation label.
And/or if the first result type is a risk type and the second result type is a preset compliance type, marking the label corresponding to the obtained auditing result as an illegal label, and outputting the auditing result marked with the illegal label.
And/or if the first result type is a risk-free type and the second result type is a preset violation type, marking the label corresponding to the obtained verification result as a violation label, and outputting the verification result marked with the violation label.
And/or if the first result type is a risk-free type and the second result type is a preset suspected violation type, marking the label corresponding to the obtained audit result as a suspected violation label, and outputting the audit result marked with the suspected violation label.
And/or if the first result type is a risk-free type and the second result type is a preset compliance type, marking the label corresponding to the obtained auditing result as a compliance label, and outputting the auditing result marked with the compliance label.
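The six cases above reduce to a small decision table: a risk type from the word stock retrieval model always yields a violation label, and otherwise the label follows the result type from the AI identification model. The following is a minimal sketch of that table. The enum and function names (`RiskType`, `ModelType`, `Label`, `audit_label`) are illustrative assumptions and do not appear in the application.

```python
from enum import Enum


class RiskType(Enum):
    """First result type, from the word stock retrieval model."""
    RISK = "risk"
    RISK_FREE = "risk_free"


class ModelType(Enum):
    """Second result type, from the AI identification model."""
    VIOLATION = "violation"
    SUSPECTED = "suspected_violation"
    COMPLIANT = "compliant"


class Label(Enum):
    """Label attached to the audit result."""
    VIOLATION = "violation"
    SUSPECTED = "suspected_violation"
    COMPLIANT = "compliant"


def audit_label(first: RiskType, second: ModelType) -> Label:
    """Combine the two result types into one audit label."""
    # A risk type always produces a violation label, whatever the
    # AI identification model reported (cases 1 to 3).
    if first is RiskType.RISK:
        return Label.VIOLATION
    # For a risk-free type, the label mirrors the second result type
    # (cases 4 to 6).
    return {
        ModelType.VIOLATION: Label.VIOLATION,
        ModelType.SUSPECTED: Label.SUSPECTED,
        ModelType.COMPLIANT: Label.COMPLIANT,
    }[second]
```

Under this reading, only content that both models clear receives the compliance label, so the word stock retrieval model acts as the stricter of the two checks.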
In the embodiment of the application, the content to be audited is preprocessed to obtain content segments in a uniform format, and a highly available preset distributed message queue is used to send the content segments rapidly to the preset word stock retrieval model and the preset AI identification model, which facilitates rapid retrieval by the preset word stock retrieval model and improves the efficiency of auditing content information. Through intelligent expansion and self-learning of the preset word stock retrieval model, combined with the preset AI identification model, a dual identification mechanism is formed, which improves the recognition rate of auditing content information.
Referring to fig. 2, in step S102, the process of preprocessing the content to be audited to obtain the content segment in the preset format mainly includes the following steps:
S201: if it is detected that preset characters exist in the content to be audited, remove the preset characters from the content to be audited to obtain the content to be audited without the preset characters.
S202: calculate the content to be audited without the preset characters through a preset semantic algorithm to obtain an original content segment.
S203: perform grammar conversion on the original content segment to obtain the content segment in the preset format.
The execution principle of S201-S203 is the same as that of S102 described above; refer to that description, which is not repeated here.
In the embodiment of the application, the preset characters in the content to be audited are removed, the resulting content is calculated through a preset semantic algorithm to obtain the original content segment, and the original content segment is subjected to grammar conversion to obtain the content segment in the preset format. Because the converted content segment carries no other content in complex formats, it can be retrieved conveniently and rapidly in subsequent processing.
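The three preprocessing steps can be sketched as a small pipeline. The application does not specify the preset characters, the preset semantic algorithm, or the preset format, so everything concrete below is an assumption made for illustration: `PRESET_CHARS` is a hypothetical character set, punctuation-based splitting stands in for the semantic algorithm, and a lowercase, whitespace-normalized record stands in for the preset format.

```python
from __future__ import annotations

import re

# Hypothetical preset characters; the application does not list them.
PRESET_CHARS = "*#@~^_|"


def remove_preset_chars(text: str, preset_chars: str = PRESET_CHARS) -> str:
    """S201: remove preset characters only if at least one is present."""
    if not any(ch in text for ch in preset_chars):
        return text
    return text.translate({ord(ch): None for ch in preset_chars})


def split_segments(text: str) -> list[str]:
    """S202 stand-in: split on sentence punctuation.

    This is NOT the preset semantic algorithm, only a placeholder for it.
    """
    parts = re.split(r"[.!?。！？]+", text)
    return [part.strip() for part in parts if part.strip()]


def to_preset_format(segment: str, index: int) -> dict:
    """S203 stand-in: normalize case and whitespace into a uniform record."""
    return {"index": index, "text": " ".join(segment.lower().split())}


def preprocess(content: str) -> list[dict]:
    """Run S201 to S203 in order and return the formatted segments."""
    cleaned = remove_preset_chars(content)
    return [to_preset_format(s, i) for i, s in enumerate(split_segments(cleaned))]
```

Removing the preset characters first means that a word split by inserted symbols, such as `He*llo`, is rejoined before segmentation and retrieval.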
Referring to fig. 3, in step S103, the process of sending the content segments in the preset distributed message queue to the preset word stock retrieval model and the preset AI identification model respectively for identification, to obtain a first identification result and a second identification result, mainly includes the following steps:
S301: identify the cluster nodes corresponding to the preset word stock retrieval model that are in an idle state and the cluster nodes corresponding to the preset AI identification model that are in an idle state.
S302: send the content segments in the preset distributed message queue to the idle cluster nodes corresponding to the preset word stock retrieval model and to the idle cluster nodes corresponding to the preset AI identification model respectively for identification, to obtain a first identification result and a second identification result; the first identification result is obtained through identification by the idle cluster nodes corresponding to the preset word stock retrieval model, and the second identification result is obtained through identification by the idle cluster nodes corresponding to the preset AI identification model.
The execution principle of S301-S302 is the same as that of S103 described above; refer to that description, which is not repeated here.
In the embodiment of the application, the idle cluster nodes corresponding to the preset word stock retrieval model and the idle cluster nodes corresponding to the preset AI identification model are identified, and the content segments in the preset distributed message queue are sent to those idle cluster nodes respectively for identification, thereby obtaining the first identification result and the second identification result.
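A minimal, single-process sketch of S301 and S302 follows. The application does not describe how idle nodes are detected or how segments are transported, so the `Node` class, its `idle` flag, and the in-memory `deque` are assumptions standing in for real cluster nodes and a real distributed message queue.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    """Hypothetical cluster node wrapping one model instance."""
    name: str
    recognize: Callable[[str], str]
    idle: bool = True


def pick_idle(nodes: list[Node]) -> Node | None:
    """S301: return the first node in the idle state, or None."""
    return next((node for node in nodes if node.idle), None)


def dispatch(
    queue: deque[str],
    lexicon_nodes: list[Node],
    ai_nodes: list[Node],
) -> list[tuple[str, str, str]]:
    """S302: send each queued segment to one idle node of each cluster.

    Returns (segment, first identification result, second identification
    result) tuples.
    """
    results = []
    while queue:
        lexicon_node = pick_idle(lexicon_nodes)
        ai_node = pick_idle(ai_nodes)
        if lexicon_node is None or ai_node is None:
            break  # no idle capacity; remaining segments stay queued
        segment = queue.popleft()
        results.append(
            (segment, lexicon_node.recognize(segment), ai_node.recognize(segment))
        )
    return results
```

Segments are taken off the queue only after an idle node has been found in both clusters, so a segment is never dequeued without somewhere to send it.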
Corresponding to the data processing method disclosed in fig. 1 of the foregoing embodiment, the embodiment of the present application further discloses a data processing system, whose schematic structure is shown in fig. 4. The data processing system includes:
An obtaining unit 401 is configured to obtain the content to be audited.
The preprocessing unit 402 is configured to preprocess the content to be audited, obtain a content segment in a preset format, and store the content segment in a preset distributed message queue.
The identifying unit 403 is configured to send the content segments in the preset distributed message queue to a preset word stock retrieval model and a preset AI identification model respectively for identification, so as to obtain a first identification result and a second identification result; the preset word stock retrieval model is used for identifying sensitive words and corresponding risk types; the preset AI identification model is used for identifying the violation type.
The auditing unit 404 is configured to audit the first identification result and the second identification result, and to obtain and output the audit result.
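How the four units fit together can be sketched as one class whose fields mirror units 401 to 404. This is only an illustrative composition: the class name `DataProcessingSystem`, the callable signatures, and the in-memory `deque` used in place of the preset distributed message queue are all assumptions.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DataProcessingSystem:
    """Hypothetical wiring of the four units described above."""
    acquire: Callable[[], str]                    # obtaining unit 401
    preprocess: Callable[[str], list[str]]        # preprocessing unit 402
    identify: Callable[[str], tuple[str, str]]    # identifying unit 403
    audit: Callable[[str, str], str]              # auditing unit 404
    queue: deque = field(default_factory=deque)   # stand-in message queue

    def run(self) -> list[str]:
        """Acquire, preprocess and enqueue, identify, then audit."""
        self.queue.extend(self.preprocess(self.acquire()))
        labels = []
        while self.queue:
            first, second = self.identify(self.queue.popleft())
            labels.append(self.audit(first, second))
        return labels


# Usage with toy stand-ins for each unit.
system = DataProcessingSystem(
    acquire=lambda: "good text|bad text",
    preprocess=lambda content: content.split("|"),
    identify=lambda seg: ("risk" if "bad" in seg else "risk-free", "compliant"),
    audit=lambda first, second: "violation" if first == "risk" else second,
)
labels = system.run()
```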
Further, the preprocessing unit for preprocessing the content to be audited to obtain the content segments in the preset format comprises a removing module, a calculating module and a converting module.
The removing module is used for removing the preset characters from the content to be audited if it is detected that the preset characters exist in the content to be audited, so as to obtain the content to be audited without the preset characters.
The computing module is used for calculating the content to be audited without the preset characters through a preset semantic algorithm to obtain an original content segment.
The conversion module is used for performing grammar conversion on the original content segment to obtain the content segment in the preset format.
Further, the identifying unit 403 includes an identification module and a sending module.
The identification module is used for identifying the cluster nodes corresponding to the preset word stock retrieval model in the idle state and the cluster nodes corresponding to the preset AI identification model in the idle state.
The sending module is used for sending the content segments in the preset distributed message queue to the idle cluster nodes corresponding to the preset word stock retrieval model and to the idle cluster nodes corresponding to the preset AI identification model respectively for identification, so as to obtain a first identification result and a second identification result; the first identification result is obtained through identification by the idle cluster nodes corresponding to the preset word stock retrieval model, and the second identification result is obtained through identification by the idle cluster nodes corresponding to the preset AI identification model.
Further, the auditing unit 404 includes a determining module and a judging module.
The determining module is used for determining a first result type corresponding to the first identification result and a second result type corresponding to the second identification result.
The judging module is used for judging the first result type corresponding to the first identification result and the second result type corresponding to the second identification result.
And/or, the auditing unit 404 includes a first labeling module.
The first labeling module is used for marking the label corresponding to the obtained audit result as a violation label if the first result type is a risk type and the second result type is a preset violation type, and outputting the audit result marked with the violation label.
And/or, the auditing unit 404 includes a second labeling module.
The second labeling module is used for marking the label corresponding to the obtained audit result as a violation label if the first result type is a risk type and the second result type is a preset suspected violation type, and outputting the audit result marked with the violation label.
And/or, the auditing unit 404 includes a third labeling module.
The third labeling module is used for marking the label corresponding to the obtained audit result as a violation label if the first result type is a risk type and the second result type is a preset compliance type, and outputting the audit result marked with the violation label.
And/or, the auditing unit 404 includes a fourth labeling module.
The fourth labeling module is used for marking the label corresponding to the obtained audit result as a violation label if the first result type is a risk-free type and the second result type is a preset violation type, and outputting the audit result marked with the violation label.
And/or, the auditing unit 404 includes a fifth labeling module.
The fifth labeling module is used for marking the label corresponding to the obtained audit result as a suspected violation label if the first result type is a risk-free type and the second result type is a preset suspected violation type, and outputting the audit result marked with the suspected violation label.
And/or, the auditing unit 404 includes a sixth labeling module.
The sixth labeling module is used for marking the label corresponding to the obtained audit result as a compliance label if the first result type is a risk-free type and the second result type is a preset compliance type, and outputting the audit result marked with the compliance label.
Further, the system also comprises a capacity expansion unit.
The capacity expansion unit is used for performing a dynamic capacity expansion operation on the preset distributed message queue if it is monitored that the number of queues in the preset distributed message queue is greater than the preset number.
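The threshold check performed by the capacity expansion unit can be sketched as follows. The application does not say what the expansion operation consists of, so the `QueueCluster` class and its `brokers` field are hypothetical; the default threshold of 50 is one of the example preset numbers given in the description.

```python
from dataclasses import dataclass


@dataclass
class QueueCluster:
    """Hypothetical view of the preset distributed message queue."""
    queue_count: int  # number of queues observed by monitoring
    brokers: int      # assumed capacity knob: number of broker nodes

    def expand(self, extra_brokers: int = 1) -> None:
        """Assumed expansion action: add broker nodes."""
        self.brokers += extra_brokers


def check_and_expand(cluster: QueueCluster, preset_number: int = 50) -> bool:
    """Expand only when the queue count is greater than the preset number."""
    if cluster.queue_count > preset_number:
        cluster.expand()
        return True
    return False
```

The comparison is strict, matching the "greater than the preset number" condition: a cluster with exactly 50 queues is left unchanged.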
In the embodiment of the application, the content to be audited is preprocessed to obtain content segments in a uniform format, and a highly available preset distributed message queue is used to send the content segments rapidly to the preset word stock retrieval model and the preset AI identification model, which facilitates rapid retrieval by the preset word stock retrieval model and improves the efficiency of auditing content information. Through intelligent expansion and self-learning of the preset word stock retrieval model, combined with the preset AI identification model, a dual identification mechanism is formed, which improves the recognition rate of auditing content information.
The embodiment of the application also provides a storage medium comprising stored instructions, wherein, when the instructions run, the device where the storage medium is located is controlled to execute the above data processing method.
The embodiment of the application also provides an electronic device, whose schematic structure is shown in fig. 5, specifically including a memory 501 and one or more instructions 502, where the one or more instructions 502 are stored in the memory 501 and configured to be executed by one or more processors 503 to perform the above data processing method.
The specific implementation process and derivative manner of the above embodiments are all within the protection scope of the present application.
In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to across embodiments, and each embodiment mainly describes its differences from the others. In particular, since the system embodiment is substantially similar to the method embodiment, its description is relatively brief, and the relevant parts of the method embodiment may be referred to. The system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are also intended to fall within the scope of the present application.
Claims (6)
1. A method of data processing, the method comprising:
acquiring content to be audited;
If the fact that the preset characters exist in the content to be audited is monitored, removing the preset characters in the content to be audited, and obtaining the content to be audited without the preset characters;
calculating the content to be checked without preset characters through a preset semantic algorithm to obtain an original content segment; the preset semantic algorithm is used for avoiding false alarm and false interception of the content to be audited;
carrying out grammar conversion on the original content segments to obtain content segments in a preset format, and storing the content segments into a preset distributed message queue;
The content segments in the preset distributed message queue are respectively sent to a preset word stock retrieval model and a preset AI identification model for identification, and a first identification result and a second identification result are obtained; the preset word stock retrieval model is used for identifying sensitive words and corresponding risk types; the preset AI identification model is used for identifying the violation type;
wherein the sending of the content segments in the preset distributed message queue to the preset word stock retrieval model and the preset AI identification model respectively for identification, to obtain the first identification result and the second identification result, comprises:
identifying cluster nodes corresponding to a preset word stock retrieval model in an idle state and cluster nodes corresponding to a preset AI identification model in the idle state;
The content segments in the preset distributed message queue are respectively sent to the cluster nodes corresponding to the preset word stock retrieval model in the idle state and the cluster nodes corresponding to the preset AI identification model in the idle state for identification, and a first identification result and a second identification result are obtained; the first recognition result is obtained by recognizing cluster nodes corresponding to the preset word stock retrieval model in an idle state; the second identification result is obtained by identifying cluster nodes corresponding to the preset AI identification model in an idle state;
And auditing the first identification result and the second identification result to obtain and output an auditing result.
2. The method of claim 1, wherein the auditing the first recognition result and the second recognition result to obtain and output an auditing result comprises:
Determining a first result type corresponding to the first identification result and a second result type corresponding to the second identification result;
Judging a first result type corresponding to the first identification result and a second result type corresponding to the second identification result;
And/or if the first result type is a risk type and the second result type is a preset violation type, marking the label corresponding to the obtained audit result as a violation label, and outputting the audit result marked with the violation label;
And/or if the first result type is the risk type and the second result type is a preset suspected violation type, marking the label corresponding to the obtained audit result as a violation label and outputting the audit result marked with the violation label;
And/or if the first result type is the risk type and the second result type is a preset compliance type, marking the label corresponding to the obtained audit result as a violation label and outputting the audit result marked with the violation label;
And/or if the first result type is a risk-free type and the second result type is the preset violation type, marking the label corresponding to the obtained audit result as a violation label and outputting the audit result marked with the violation label;
and/or if the first result type is the risk-free type and the second result type is the preset suspected violation type, marking the label corresponding to the obtained audit result as a suspected violation label, and outputting the audit result marked with the suspected violation label;
and/or if the first result type is the risk-free type and the second result type is the preset compliance type, marking the label corresponding to the obtained auditing result as a compliance label, and outputting the auditing result marked with the compliance label.
3. The method as recited in claim 1, further comprising:
and if the number of the queues in the preset distributed message queues is monitored to be larger than the preset number, performing dynamic capacity expansion operation on the preset distributed message queues.
4. A data processing system, the system comprising:
The acquisition unit is used for acquiring the content to be audited;
The removing module is used for removing the preset characters in the content to be audited if the preset characters exist in the content to be audited, so as to obtain the content to be audited without the preset characters;
The computing module is used for computing the content to be checked without the preset characters through a preset semantic algorithm to obtain an original content segment; the preset semantic algorithm is used for avoiding false alarm and false interception of the content to be audited;
The conversion module is used for carrying out grammar conversion on the original content segments to obtain content segments in a preset format, and storing the content segments into a preset distributed message queue;
The identification unit is used for respectively transmitting the content fragments in the preset distributed message queue to a preset word stock retrieval model and a preset AI identification model for identification to obtain a first identification result and a second identification result; the preset word stock retrieval model is used for identifying sensitive words and corresponding risk types; the preset AI identification model is used for identifying the violation type;
The identification unit comprises an identification module and a sending module;
The identification module is used for identifying cluster nodes corresponding to a preset word stock retrieval model in an idle state and cluster nodes corresponding to a preset AI identification model in the idle state;
The sending module is used for respectively sending the content segments in the preset distributed message queue to the cluster nodes corresponding to the preset word stock retrieval model in the idle state and the cluster nodes corresponding to the preset AI identification model in the idle state for identification, so as to obtain a first identification result and a second identification result; the first recognition result is obtained by recognizing cluster nodes corresponding to the preset word stock retrieval model in an idle state; the second identification result is obtained by identifying cluster nodes corresponding to the preset AI identification model in an idle state;
and the auditing unit is used for auditing the first identification result and the second identification result, obtaining an auditing result and outputting the auditing result.
5. A storage medium comprising stored instructions, wherein the instructions, when executed, control a device in which the storage medium is located to perform the data processing method of any one of claims 1 to 3.
6. An electronic device comprising a memory and one or more instructions, wherein the one or more instructions are stored in the memory and configured to perform a data processing method according to any one of claims 1 to 3 by one or more processors.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111027813.6A CN113704414B (en) | 2021-09-02 | 2021-09-02 | Data processing method, system, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113704414A (en) | 2021-11-26 |
CN113704414B (en) | 2024-08-16 |
Family
ID=78658905
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111027813.6A Active CN113704414B (en) | 2021-09-02 | 2021-09-02 | Data processing method, system, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113704414B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114238303A (en) * | 2021-12-23 | 2022-03-25 | 广东太平洋互联网信息服务有限公司 | Data cleaning method and device, electronic equipment and storage medium |
CN116306619B (en) * | 2023-05-17 | 2023-08-25 | 北京拓普丰联信息科技股份有限公司 | Document detection method and device, electronic equipment and storage medium |
CN118734263B (en) * | 2024-06-24 | 2025-02-25 | 中国标准化研究院 | Service content digital management system and operation method based on data processing |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109831751A (en) * | 2019-01-04 | 2019-05-31 | 上海创蓝文化传播有限公司 | A kind of short message content air control system and method based on natural language processing |
CN110674255A (en) * | 2019-09-24 | 2020-01-10 | 湖南快乐阳光互动娱乐传媒有限公司 | Text content review method and device |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7971223B2 (en) * | 2008-03-25 | 2011-06-28 | Seachange International, Inc. | Method and system of queued management of multimedia storage |
US9794599B2 (en) * | 2014-04-10 | 2017-10-17 | Telibrahma Convergent Communications Private Limited | Method and system for auditing multimedia content |
CN108647309B (en) * | 2018-05-09 | 2021-08-10 | 达而观信息科技(上海)有限公司 | Chat content auditing method and system based on sensitive words |
US10810373B1 (en) * | 2018-10-30 | 2020-10-20 | Oath Inc. | Systems and methods for unsupervised neologism normalization of electronic content using embedding space mapping |
CN112507936B (en) * | 2020-12-16 | 2024-04-23 | 平安银行股份有限公司 | Image information auditing method and device, electronic equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113704414A (en) | 2021-11-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |