CN107239447B - Junk information identification method, device and system - Google Patents
- Publication number
- CN107239447B CN107239447B CN201710417747.0A CN201710417747A CN107239447B CN 107239447 B CN107239447 B CN 107239447B CN 201710417747 A CN201710417747 A CN 201710417747A CN 107239447 B CN107239447 B CN 107239447B
- Authority
- CN
- China
- Prior art keywords
- text
- sample model
- junk
- content data
- content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
Abstract
The invention discloses a junk information identification method, device and system, belonging to the technical field of internet applications. The method comprises the following steps: extracting the text content of user original information; performing semantic reduction on the text content to obtain a reduced text; performing a matching operation on the reduced text against a preset sample model library through a gradient descent algorithm to obtain the junk probability that the user original information is junk information; and identifying the user original information as junk information by comparing the junk probability with a preset junk probability threshold. A junk information identification device and system are also provided. The method, device and system can identify junk information even in user original information that has undergone semantic conversion.
Description
Technical Field
The invention relates to the technical field of internet application, in particular to a junk information identification method, device and system.
Background
With the development of internet technology, network information has become increasingly abundant, and the user original information on websites is of widely mixed quality, containing more and more junk information such as useless advertisements and pornography. Therefore, user original information on a website should be screened for junk words in advance; that is, the user original information should be identified in advance, and any user original information identified as junk information should be blocked to keep the website's information clean.
However, when publishing user original information, a publisher may perform semantic conversion on it in advance so as to avoid having it recognized as junk information. For example, when advertisement information is published, Arabic numerals such as QQ numbers are converted into Chinese numerals for exactly this purpose.
At present, existing junk information identification generally relies on complete or partial matching against reference junk words, and cannot identify junk information in user original information that has undergone semantic conversion. As a result, the accuracy of junk information identification is greatly reduced and the misjudgment rate is high.
Disclosure of Invention
The invention provides a junk information identification method, device and system, and aims to solve the technical problem that junk information identification cannot be performed on user original information subjected to semantic conversion in the related art.
The embodiment of the invention provides a junk information identification method, which comprises the following steps:
extracting text content of user original information;
performing semantic reduction on the text content to obtain a reduced text;
matching the reduced text in a preset sample model library through a gradient descent algorithm to obtain the junk probability that the user original information is junk information;
and identifying the user original information as junk information by comparing the junk probability with a preset junk probability threshold value.
In addition, an embodiment of the present invention provides a spam information identifying apparatus, including:
the text content extraction module is used for extracting the text content of the original information of the user;
the semantic reduction module is used for performing semantic reduction on the text content to obtain a reduced text;
the matching operation module is used for performing a matching operation on the reduced text in a preset sample model library through a gradient descent algorithm to obtain the junk probability that the user original information is junk information;
and the junk information identification module is used for identifying the user original information as junk information by comparing the junk probability with a preset junk probability threshold value.
In addition, an embodiment of the present invention further provides a system, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform:
extracting text content of user original information;
performing semantic reduction on the text content to obtain a reduced text;
matching the reduced text in a preset sample model library through a gradient descent algorithm to obtain the junk probability that the user original information is junk information;
and identifying the user original information as junk information by comparing the junk probability with a preset junk probability threshold value.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
when the junk information of the user original information is identified, the text content of the user original information is subjected to semantic restoration, so that the junk information can be identified for the user original information subjected to semantic conversion, the accuracy of identifying the junk information is greatly improved, and the misjudgment rate of the junk information is reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a flow chart illustrating a spam identification method according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating a spam identification method according to an example embodiment.
Fig. 3 is a flow chart illustrating a spam identification method according to an example embodiment.
Fig. 4 is a flowchart illustrating a specific implementation of step S220 in the spam identification method according to the corresponding embodiment in fig. 3.
Fig. 5 is a flowchart illustrating a specific implementation of step S130 in the spam identification method according to the corresponding embodiment in fig. 1.
Fig. 6 is a block diagram illustrating a spam recognition device according to an example embodiment.
Fig. 7 is a block diagram of the semantic restoring module 120 in the spam recognition device according to the corresponding embodiment in fig. 6.
Fig. 8 is a block diagram of another spam identification device according to the corresponding embodiment of fig. 6.
Fig. 9 is a block diagram of the feature extraction module 220 in the spam recognition device according to the embodiment shown in fig. 8.
Fig. 10 is a block diagram of the matching operation module 130 in the spam recognition device according to the corresponding embodiment of fig. 6.
FIG. 11 is a block diagram illustrating a system in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Fig. 1 is a flow chart illustrating a spam identification method according to an exemplary embodiment. As shown in fig. 1, the spam recognition method may include the following steps.
In step S110, the text content of the user original information is extracted.
The user-originated information is information input by a user on the network. For example, in a forum, users leave comments on a topic.
It is understood that the user original information may include emoticons, text, and the like.
User original information varies widely in quality, and its text often contains a great deal of junk information; this junk information needs to be identified in advance from the text content of the user original information. Thus, the text content is extracted from the user original information.
When extracting text content from the user original information, the text content may be extracted by various text extraction methods, which are not limited herein.
In step S120, semantic reduction is performed on the text content to obtain a reduced text.
Semantic restoration is text processing of the text content data according to its semantics: after semantic analysis of the text content, corresponding restoration processing is performed to obtain a restored text.
It is understood that, in order to avoid published information being blocked, a user can prevent the published user original information from being identified as spam by converting its semantics.
For example, by converting the QQ number "1234567" from an arabic number to "one, two, three, four, five, six, seven", it is prevented from being recognized as spam.
For another example, through homophone/split-character conversion, a phrase such as "add me on WeChat" is rewritten with similar-sounding or visually decomposed characters, thereby avoiding recognition as spam.
Therefore, semantic restoration processing needs to be performed on the text content of the user original information.
Semantic reduction is to perform semantic analysis on the text content and extract the text meaning represented by the text content.
There are various methods for semantic analysis of the text content. Through latent semantic indexing, the text content can be represented as a feature-document matrix based on the vector space model; the rank of the matrix is then reduced through singular value decomposition, mapping the text content and the feature words into the same low-dimensional semantic space. Semantic analysis can also be performed based on external semantic knowledge, for example, extracting the meaning of the text content through a homophone/split-character dictionary. The text content may also be semantically analyzed in other ways; the method of semantic analysis is not limited herein.
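As a minimal illustration of the vector-space part of this analysis, a restored text can be compared with sample texts by cosine similarity. The sketch below uses raw character counts as stand-in features and omits the singular-value-decomposition rank reduction the patent describes, so it is an assumption-laden simplification rather than the patented method:

```python
from collections import Counter
import math

def char_vector(text: str) -> Counter:
    # Character-level bag of features stands in for the feature-document
    # matrix; a real latent-semantic-indexing pipeline would reduce the
    # matrix rank with a truncated SVD before comparing.
    return Counter(text.replace(" ", ""))

def cosine(a: Counter, b: Counter) -> float:
    # Dot product over shared features, normalized by vector lengths.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

sim = cosine(char_vector("add my wechat"), char_vector("add my wechat now"))
```

Texts that share most of their characters score close to 1, which is the property the later matching steps rely on.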
In step S130, the restored text is subjected to a matching operation in a preset sample model library through a gradient descent algorithm, so as to obtain the spam probability that the user original information is spam.
The gradient descent algorithm is an optimization algorithm in machine learning.
The sample model library is prepared in advance, and the sample model library contains the probability that each sample model is junk information.
The junk probability is the probability that the original information of the user is junk information.
In the gradient descent algorithm, the restored text of the user original information is matched step by step against the sample models in the sample model library, descending along the gradient; after the operation converges, the spam probability that the user original information is spam is obtained.
In step S140, the user original information is identified as spam by comparing the spam probability with a preset spam probability threshold.
The spam probability threshold is a preset critical value of the spam probability.
When the spam probability that the user original information is spam reaches the spam probability threshold, the user original information is identified as spam.
For example, the preset spam probability threshold is 70%, and when the spam probability of the user original information reaches 70%, the user original information is identified as spam information.
With this method, the text content of the user original information undergoes semantic restoration; the restored text is matched against the preset sample model library to obtain the spam probability that the user original information is spam; and the user original information is identified as spam according to the preset spam probability threshold. Spam can thus be identified even in user original information that has undergone semantic conversion, greatly improving the accuracy of spam identification.
Fig. 2 is a flow chart illustrating a spam identification method according to an example embodiment. As shown in fig. 2, the step S120 shown in the corresponding embodiment of fig. 1 may include the following steps.
In step S121, Chinese numerals in the text content are identified.
Chinese numerals are numbers expressed in Chinese form. They include both the formal (uppercase) and ordinary (lowercase) digit characters, such as "壹" and "一" (both meaning "one").
In a specific exemplary embodiment, the Chinese numbers in the text content are identified by comparing the text content to a preset thesaurus of numbers.
In step S122, the Chinese numerals are converted into Arabic numerals to obtain a restored text corresponding to the text content.
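The digit conversion of steps S121 and S122 can be sketched as a simple lookup table. The table below covers only the digit characters themselves; unit characters such as 十 (ten) or 百 (hundred) would need extra handling, so this is a deliberately minimal sketch:

```python
# Map both ordinary and formal (financial) Chinese digits to Arabic digits.
CN_DIGITS = {
    "零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
    "壹": "1", "贰": "2", "叁": "3", "肆": "4", "伍": "5",
    "陆": "6", "柒": "7", "捌": "8", "玖": "9",
}

def restore_digits(text: str) -> str:
    # Replace every recognized Chinese digit with its Arabic form,
    # leaving all other characters untouched.
    return "".join(CN_DIGITS.get(ch, ch) for ch in text)

print(restore_digits("QQ一二三四五六七"))  # → QQ1234567
```

Comparing each character against the table plays the role of the "preset number lexicon" mentioned above.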
With this method, the Chinese numerals in the user original information are identified and converted into Arabic numerals, and the result is matched in the preset sample model library to obtain the spam probability that the user original information is spam. Spam hidden behind Chinese numerals can thus be identified, greatly improving the accuracy of spam identification.
Optionally, step S120 shown in the embodiment corresponding to fig. 1 may further include the following steps:
and according to the preset harmonious character/combined character library, performing semantic reduction on the text content to obtain a corresponding reduced text.
The harmonic/combined word library is a dictionary containing each text and corresponding harmonic and/or combined words.
It is understood that there may be words converted from harmonious characters/combined characters in the text content of the user's original information. Therefore, the text content is subjected to semantic analysis and semantic restoration according to the preset harmonic character/combined character dictionary.
For example, the user original information is intended to read "rogue, go die", but to avoid identification as spam, the published text splits or substitutes characters (rendered in translation as "durum deaths"). The semantics of the user original information are recognized through the preset homophone/split-character library, and the conversion is reversed so that the split form is restored to the intended text.
With this method, the semantics of the user original information are recognized through the preset homophone/split-character library and the conversion is reversed, preventing junk information from escaping identification through homophone or split-character conversion and greatly improving the accuracy of junk information identification.
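A minimal sketch of this dictionary-based restoration, with hypothetical mapping entries standing in for the preset homophone/split-character library (a production dictionary would be far larger and context-aware):

```python
# Hypothetical entries: homophones and split-character forms mapped
# back to the characters they disguise.
HOMOPHONE_MAP = {
    "威信": "微信",  # homophone of "WeChat"
    "歹匕": "死",    # "die" split into its two components
    "家": "加",      # homophone of "add" (naive: replaces every 家)
}

def restore_homophones(text: str) -> str:
    # Longest entries first, so multi-character forms win over
    # single-character ones.
    for src in sorted(HOMOPHONE_MAP, key=len, reverse=True):
        text = text.replace(src, HOMOPHONE_MAP[src])
    return text
```

With these entries, a disguised "家我威信" is restored to "加我微信" ("add my WeChat") before matching.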
Fig. 3 is a flow chart illustrating a spam identification method according to an example embodiment. As shown in fig. 3, before step S130 in the corresponding embodiment of fig. 1, the spam identification method may further include the following steps.
In step S210, content data is extracted from a predetermined database.
The database is a data warehouse for storing and managing website community information according to a data structure.
For example, the various information data of a community website are stored in a predetermined database according to a data structure.
The content data is text information stored in a database in a data structure.
In step S220, feature extraction of text vectors is performed from the content data by a random forest algorithm.
In machine learning, a random forest is a classifier that contains multiple decision trees.
The text vector is a data form characterized by extracting the characteristics of the content data through a decision tree classifier.
The random forest is composed of a plurality of decision trees. Each node in a decision tree is a condition on a certain feature; the content data is classified according to these conditions and then converted into a text vector according to the classification.
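The node-condition idea can be sketched with hand-written conditions standing in for what trees in a trained random forest would actually learn; the conditions and the length threshold below are illustrative assumptions, not taken from the patent:

```python
# Each "node condition" tests one feature of the text; the Boolean
# outcomes form the components of the text vector.
CONDITIONS = [
    lambda t: "微信" in t or "wechat" in t.lower(),  # mentions WeChat
    lambda t: any(c.isdigit() for c in t),           # contains digits
    lambda t: len(t) > 20,                           # unusually long
]

def text_vector(text: str) -> list:
    # 1 where the condition holds, 0 where it does not.
    return [1 if cond(text) else 0 for cond in CONDITIONS]
```

A trained forest would derive many such conditions automatically and combine the votes of its trees; the sketch only shows how conditions turn text into a vector.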
In step S230, a data category corresponding to the content data is obtained according to the text vector and the corresponding weight vector.
The weight vector corresponds to the text vector. Each weight component in the weight vector is in one-to-one correspondence with a text component in the text vector.
When content data is classified according to different conditions, each condition corresponds to a weight; therefore, in the text vector obtained by feature extraction from the content data, each text component also has a corresponding weight component.
In a specific exemplary embodiment, the data category is an information spam degree corresponding to the content data, and the content data is classified according to different information spam degrees.
In a specific exemplary embodiment, the corresponding data category is determined from the product of the text vector and the corresponding weight vector.
In step S240, a rule engine is configured according to the content data and the corresponding data type to form a sample model library.
The rules engine is a business rules decision component.
In the rule engine, rule conditions correspond to rule actions: the engine accepts data input, interprets business rules, and makes business decisions based on them. When the rule condition of a business rule is satisfied, the corresponding rule action is triggered.
In a specific exemplary embodiment, by configuring a similarity ratio between the input text content and the content data, when the probability that the input text content is similar to the content data reaches the similarity ratio, the input text content is identified as the data category corresponding to the content data.
For example, the data category corresponding to content data B is spam, and the configured rule condition is a similarity rate of 80% with content data B. If, after calculation and analysis, the similarity rate between input text content A and content data B is 90%, text content A is identified as spam.
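A minimal sketch of such a rule, using Jaccard similarity over character sets as a stand-in for whatever similarity measure the rule engine is actually configured with (the measure and the 0.8 threshold are assumptions for illustration):

```python
def similarity(a: str, b: str) -> float:
    # Jaccard similarity: shared characters over all characters seen.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def apply_rule(text: str, sample: str, threshold: float = 0.8) -> bool:
    # Rule condition: similarity to the spam sample reaches the
    # configured rate. Rule action: label the input with the sample's
    # data category (here just a Boolean "is spam").
    return similarity(text, sample) >= threshold
```

In the example above, an input whose similarity to content data B reaches 90% satisfies the 80% rule condition and triggers the spam label.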
By utilizing the method, the sample model library is formed by extracting the characteristics of the content data in the database in advance and configuring the rule engine, and the spam probability of the text content is calculated in the sample model library when the spam is judged in the follow-up process, so that the accuracy of identifying the spam is greatly improved.
Fig. 4 is a flowchart illustrating further details of step S220 according to an exemplary embodiment. As shown in fig. 4, step S220 may include the following steps.
In step S221, the content data is semantically restored.
It can be understood that, in order to prevent published junk information from being screened out, a user may split characters or substitute homophones before publishing the user original information.
Therefore, semantic reduction processing is required to be performed on the content data.
Semantic restoration is the text processing of content data according to its semantics. For example, a string of Chinese numerals is first converted into Arabic numerals and then recognized as a QQ or WeChat number.
In a specific exemplary embodiment, the content data is: home i will send you. The 'family my WeChat sending you' is converted into 'add my WeChat sending you' through the reduction of the harmonic characters/combined characters. And restoring the harmonic characters/combined characters through a preset harmonic character/combined character dictionary, thereby screening the garbage information.
In a specific exemplary embodiment, the content data is: the unconventional Vyuting1028103172 is a little more. The QQ and the WeChat are converted into the same word through semantic reduction, namely Vyuting1028103172 is extracted and converted into a universal dimension, and the obtained content data after semantic reduction is 'ununified wechat how many teachers and you'. Because the junk information usually has the conditions of adding WeChat, QQ and the like, the obtained text vector is prevented from being overlarge by uniformly processing various WeChat and QQ signals into one dimension, and the condition that one WeChat and QQ signal is not over to cause the condition of being unable to be identified is also avoided.
In step S222, a word segmentation operation is performed on the semantic-restored content data to obtain a text word corresponding to the content data.
It will be appreciated that the content data may be composed of several words, for example "add my WeChat".
If feature extraction were performed directly on the semantically restored content data, the similarity between texts would be strongly affected. Therefore, before feature extraction, a word segmentation operation is performed on the content data, and feature extraction of the text vector is then performed on each resulting text segment.
The word segmentation operation segments a character sequence into individual words.
As described above, the content data is text information stored in the database in accordance with a data structure. The text information may be a word, a plurality of words, or other forms.
Therefore, the content data is segmented into a single text segmentation by performing a segmentation operation on the content data.
There are various ways to perform the word segmentation operation on content data. The content data can be segmented mechanically, based on character strings, into individual text segments; it can also be analyzed semantically and segmented based on the semantics; or it can be segmented in other ways, which are not limited herein.
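A minimal sketch of the string-based ("mechanical") approach is forward maximum matching against a lexicon; the lexicon entries below are hypothetical, and real segmenters add semantic disambiguation on top:

```python
def forward_max_match(text: str, lexicon: set, max_len: int = 4) -> list:
    # At each position, take the longest lexicon word that matches;
    # fall back to a single character when nothing matches.
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

print(forward_max_match("加我微信", {"加我", "微信"}))  # → ['加我', '微信']
```

The resulting segments are what step S223 feeds into feature extraction.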
In step S223, feature extraction of text vectors is performed on the text segments corresponding to the content data by a random forest algorithm, respectively.
With this method, when building the sample model library, semantic restoration and word segmentation are performed on the content data before feature extraction of the text vector, so the extracted text vectors are more accurate and the accuracy of the sample model library is improved.
Fig. 5 is a flowchart illustrating further details of step S130 according to an exemplary embodiment. As shown in fig. 5, the sample model library is divided into a plurality of sample model classes, and step S130 may include the following steps.
In step S131, a corresponding sample model class is selected from the sample model library according to the user original information.
In the sample model library, the sample models are divided into a plurality of sample model classes, and each sample model class contains a predetermined number of sample models.
In step S132, a gradient descent algorithm is used to perform matching operation on the user original information and the sample model type, so as to obtain a spam probability that the user original information is spam.
When the matching operation is performed, each stochastic gradient step uses the sample models of one sample model class. That is:
X(t+1)=X(t)+ΔX(t)
ΔX(t)=-ηg(t)
where η is the learning rate and g(t) is the gradient of X at time t.
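The update rule above can be sketched on a toy one-dimensional objective; the quadratic objective and the learning rate below are illustrative choices, not from the patent:

```python
def sgd(grad, x0: float, eta: float = 0.1, steps: int = 100) -> float:
    # X(t+1) = X(t) + ΔX(t),  with  ΔX(t) = -η · g(t)
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# Minimizing f(x) = (x - 3)^2, whose gradient is 2(x - 3),
# converges toward the optimum x = 3.
x_star = sgd(lambda x: 2 * (x - 3), x0=0.0)
```

In the patented scheme, the gradient at each step comes from matching against one sample model class rather than the whole library, which is what makes each step cheap.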
By classifying the sample model classes in the sample model library, when the number of sample models in the sample model library is large, one sample model class is selected for matching operation, so that the consumption of resources in the matching operation is reduced, and the convergence can be faster.
For example, if the gradients of the first-half and second-half sample models in the sample model library are the same, the first half can be used as one sample model class and the second half as another. One traversal of the sample model library then advances two steps toward the optimal solution, whereas matching over the whole library at once advances only one step.
Optionally, when there are duplicate sample models in the sample model library, the convergence of the matching operation can be promoted more quickly by the classification of the sample model classes.
Further, after each matching operation, content data identified as spam is stored in the sample model library as a new sample model.
With this method, the sample models in the sample model library are divided into a plurality of sample model classes, and each stochastic-gradient matching operation is performed within one class at a time. This greatly reduces the consumption of operation resources, reaches convergence faster, and improves the efficiency of spam identification.
The following is an embodiment of the apparatus of the present invention, which can be used to implement the above-mentioned embodiment of the spam information identifying method. For details that are not disclosed in the embodiments of the apparatus of the present invention, please refer to the embodiments of the spam identification method of the present invention.
FIG. 6 is a block diagram illustrating a spam identification device according to an exemplary embodiment, the device including, but not limited to: a text content extraction module 110, a semantic restoring module 120, a matching operation module 130 and a spam identification module 140.
A text content extracting module 110, configured to extract text content of the user original information;
the semantic reduction module 120 is configured to perform semantic reduction on the text content to obtain a reduced text;
the matching operation module 130 is configured to perform matching operation on the restored text in a preset sample model library through a gradient descent algorithm to obtain a spam probability that the user original information is spam;
and the spam information identifying module 140 is configured to identify the user original information as spam information by comparing the spam probability with a preset spam probability threshold.
The implementation of the functions and actions of each module in the apparatus is described in detail in the corresponding steps of the spam identification method, and is not repeated here.
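A minimal sketch of how the four modules compose. The scoring function is a toy stand-in for the real matching operation (which uses gradient descent against the sample model library), and all names and the dict-shaped input are illustrative assumptions:

```python
def extract_text(user_info):
    # Text content extraction module: pull the text field from the
    # user original information (assumed here to be a dict).
    return user_info.get("text", "")

def restore_semantics(text):
    # Semantic restoration module: placeholder normalization only.
    return text.lower()

def match_probability(text, sample_models):
    # Matching operation module: crude overlap score in [0, 1] as a
    # stand-in for the gradient descent matching operation.
    hits = sum(1 for s in sample_models if s in text)
    return min(1.0, 2 * hits / max(len(sample_models), 1))

def identify_spam(user_info, sample_models, threshold=0.5):
    # Spam identification module: compare the probability to the
    # preset spam probability threshold.
    restored = restore_semantics(extract_text(user_info))
    return match_probability(restored, sample_models) >= threshold
```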
Optionally, as shown in FIG. 7, in the spam identification apparatus of the embodiment corresponding to FIG. 6, the semantic restoration module 120 further includes, but is not limited to: a Chinese number recognition unit 121 and a number conversion unit 122.
The Chinese number recognition unit 121 is configured to recognize Chinese numbers in the text content;
the number conversion unit 122 is configured to convert the Chinese numbers into Arabic numbers to obtain a restored text corresponding to the text content.
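The number conversion unit can be sketched as a character map. This handles the digit-by-digit style common in evasive spam (e.g. phone numbers written out in Chinese digits); positional numerals such as 三百 ("three hundred") would need a fuller parser. The function name is an assumption:

```python
CN_DIGITS = {"零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
             "五": "5", "六": "6", "七": "7", "八": "8", "九": "9"}

def convert_chinese_digits(text):
    # Replace each recognized Chinese digit character with its Arabic
    # equivalent; all other characters pass through unchanged.
    return "".join(CN_DIGITS.get(ch, ch) for ch in text)
```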
Optionally, in the spam identification apparatus of the embodiment corresponding to FIG. 6, the semantic restoration module 120 further includes, but is not limited to: a homophone/combined character restoration unit.
The homophone/combined character restoration unit is configured to perform semantic restoration on the text content according to a preset homophone/combined character library to obtain a corresponding restored text.
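The homophone/combined character restoration amounts to lookups in a curated table. The entries below are hypothetical examples of the kind of evasion such a library would cover, not the patent's actual library:

```python
# Hypothetical evasion -> canonical pairs; a production library would be
# curated from spam actually observed in the wild.
HOMOPHONE_TABLE = {
    "威信": "微信",  # homophone substitution for "WeChat"
    "薇信": "微信",  # another homophone variant
    "亻言": "信",    # a character split into its components
}

def restore_homophones(text, table=HOMOPHONE_TABLE):
    # Longest-first replacement so multi-character variants are restored
    # before any shorter overlapping entries.
    for variant in sorted(table, key=len, reverse=True):
        text = text.replace(variant, table[variant])
    return text
```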
FIG. 8 is a block diagram of another spam identification apparatus according to the embodiment corresponding to FIG. 6, further including, but not limited to: a content data extraction module 210, a feature extraction module 220, a data category determination module 230, and a sample model library generation module 240.
The content data extraction module 210 is configured to extract content data from a predetermined database;
the feature extraction module 220 is configured to extract text vector features from the content data through a random forest algorithm;
the data category determination module 230 is configured to determine the data category corresponding to the content data according to the product of the text vector and the corresponding weight vector;
and the sample model library generation module 240 is configured to configure a rule engine according to the content data and the corresponding data category to form a sample model library.
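One concrete reading of the category decision, assuming the "product" is a dot product and the rule engine reduces to a plain list of (content, category) rules. The random forest feature extraction itself is elided here and the vectors are supplied directly; the binary category and all names are assumptions:

```python
def dot(u, v):
    # Product of the text vector and the corresponding weight vector.
    return sum(a * b for a, b in zip(u, v))

def data_category(text_vector, weight_vector, threshold=0.0):
    # Binary category here ("junk"/"normal"); the patent leaves the
    # exact mapping from the product to a category open.
    return "junk" if dot(text_vector, weight_vector) > threshold else "normal"

def build_sample_library(records):
    # records: (content_data, text_vector, weight_vector) triples.
    # The resulting list of rules stands in for the configured rule engine.
    return [(content, data_category(tv, wv)) for content, tv, wv in records]
```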
Optionally, as shown in FIG. 9, the feature extraction module 220 of the embodiment corresponding to FIG. 8 includes, but is not limited to: a semantic restoration unit 221, a word segmentation unit 222 and a segmentation feature extraction unit 223.
The semantic restoration unit 221 is configured to perform semantic restoration on the content data;
the word segmentation unit 222 is configured to perform word segmentation on the semantically restored content data to obtain the text tokens corresponding to the content data;
and the segmentation feature extraction unit 223 is configured to extract text vector features for each of the text tokens corresponding to the content data through a random forest algorithm.
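A sketch of the segmentation and per-token feature steps. Whitespace splitting stands in for a real Chinese word segmenter, and the one-hot indicators stand in for the random forest feature extraction; the vocabulary and function names are assumptions:

```python
def segment(text):
    # Toy word segmentation: whitespace split. Chinese text would need a
    # dictionary-based or statistical segmenter instead.
    return text.split()

def token_features(tokens, vocabulary):
    # One feature vector per token: a one-hot indicator over the vocabulary,
    # standing in for the per-token random forest features.
    return [[1 if tok == term else 0 for term in vocabulary] for tok in tokens]
```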
Optionally, as shown in FIG. 10, the sample model library is divided into a plurality of sample model classes, and the matching operation module 130 of the embodiment corresponding to FIG. 6 includes, but is not limited to: a sample model class selection unit 131 and a matching operation unit 132.
The sample model class selection unit 131 is configured to select a corresponding sample model class from the sample model library according to the user original information;
and the matching operation unit 132 is configured to perform a matching operation on the user original information against the selected sample model class through a gradient descent algorithm to obtain the spam probability that the user original information is spam.
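One concrete reading of "matching operation through a gradient descent algorithm" is a logistic model fitted by gradient descent on the selected sample model class, with the sigmoid output serving as the spam probability. The term-count features, hyperparameters, and names are illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def features(text, vocabulary):
    # Term-count features over an assumed vocabulary.
    return [text.count(term) for term in vocabulary]

def fit_by_gradient_descent(samples, labels, vocabulary, lr=0.5, epochs=200):
    # Plain gradient descent on the logistic loss over the sample class.
    w, b = [0.0] * len(vocabulary), 0.0
    for _ in range(epochs):
        for text, y in zip(samples, labels):
            x = features(text, vocabulary)
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def spam_probability(text, w, b, vocabulary):
    # Probability that the given text is spam under the fitted model.
    x = features(text, vocabulary)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

Fitting on a labeled sample class and thresholding `spam_probability` at the preset value reproduces the module's decision.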
FIG. 11 is a block diagram illustrating a system 100 according to an exemplary embodiment. Referring to FIG. 11, the system 100 may include one or more of the following components: a processing component 101, a memory 102, a power component 103, a multimedia component 104, an audio component 105, a sensor component 107 and a communication component 108. Not all of the above components are required; the system 100 may add or omit components according to its own functional requirements, which is not limited in this embodiment.
The processing component 101 generally controls overall operation of the system 100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 101 may include one or more processors 109 to execute instructions to perform all or a portion of the above-described operations. Further, the processing component 101 may include one or more modules that facilitate interaction between the processing component 101 and other components. For example, the processing component 101 may include a multimedia module to facilitate interaction between the multimedia component 104 and the processing component 101.
The memory 102 is configured to store various types of data to support operations at the system 100. Examples of such data include instructions for any application or method operating on system 100. The Memory 102 may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as an SRAM (Static Random Access Memory), an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM (Erasable Programmable Read-Only Memory), a PROM (Programmable Read-Only Memory), a ROM (Read-Only Memory), a magnetic Memory, a flash Memory, a magnetic disk, or an optical disk. Also stored in memory 102 are one or more modules configured to be executed by the one or more processors 109 to perform all or a portion of the steps of any of the methods illustrated in fig. 1, 2, 3, 4, and 5.
The power supply component 103 provides power to the various components of the system 100. The power components 103 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the system 100.
The multimedia component 104 includes a screen that provides an output interface between the system 100 and the user. In some embodiments, the screen may include an LCD (Liquid Crystal Display) and a TP (Touch Panel). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The audio component 105 is configured to output and/or input audio signals. For example, the audio component 105 includes a microphone configured to receive external audio signals when the system 100 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 102 or transmitted via the communication component 108. In some embodiments, audio component 105 also includes a speaker for outputting audio signals.
The sensor assembly 107 includes one or more sensors for providing various aspects of status assessment for the system 100. For example, the sensor assembly 107 may detect an open/closed state of the system 100, the relative positioning of the components, the sensor assembly 107 may also detect a change in position of the system 100 or a component of the system 100, and a change in temperature of the system 100. In some embodiments, the sensor assembly 107 may also include a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 108 is configured to facilitate wired or wireless communication between the system 100 and other devices. The system 100 may access a wireless network based on a communication standard, such as WiFi (Wireless-Fidelity), 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 108 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 108 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on an RFID (Radio Frequency Identification) technology, an IrDA (Infrared Data Association) technology, a UWB (Ultra-Wideband) technology, a BT (Bluetooth) technology, and other technologies.
In an exemplary embodiment, the system 100 may be implemented by one or more ASICs (Application Specific Integrated circuits), DSPs (Digital Signal processors), PLDs (Programmable Logic devices), FPGAs (Field-Programmable Gate arrays), controllers, microcontrollers, microprocessors or other electronic components for performing the above-described methods.
The specific manner in which the processor of the system in this embodiment performs operations has been described in detail in the embodiment of the spam identification method, and will not be elaborated upon here.
Optionally, the present invention further provides a system that executes all or part of the steps of the spam identification method shown in any one of FIG. 1, FIG. 2, FIG. 3, FIG. 4 and FIG. 5. The system comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform:
extracting the text content of user original information;
performing semantic restoration on the text content to obtain a restored text;
performing a matching operation on the restored text in a preset sample model library through a gradient descent algorithm to obtain the spam probability that the user original information is spam;
and identifying the user original information as spam by comparing the spam probability with a preset spam probability threshold.
The specific manner in which the processor of the system in this embodiment performs operations has been described in detail in relation to the embodiment of the spam recognition method and will not be described in detail here.
In an exemplary embodiment, a computer-readable storage medium is also provided, which may be a transitory or a non-transitory computer-readable storage medium including instructions. The storage medium includes, for example, the memory 102 of instructions executable by the processor 109 of the system 100 to perform the spam identification method described above.
It is to be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be effected therein by one skilled in the art without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
Claims (9)
1. A spam identification method, the method comprising:
extracting text content of user original information;
performing semantic restoration on the text content to obtain a restored text;
selecting a corresponding sample model class from a preset sample model library according to the user original information, wherein the sample model library comprises a plurality of sample model classes;
performing a matching operation on the restored text and the corresponding sample model class through a gradient descent algorithm to obtain the junk probability that the user original information is junk information;
identifying the user original information as junk information by comparing the junk probability with a preset junk probability threshold value;
wherein before the selecting a corresponding sample model class from a preset sample model library according to the user original information, the method further comprises:
extracting content data from a predetermined database;
extracting text vector features from the content data through a random forest algorithm;
determining the data category corresponding to the content data according to the product of the text vector and the corresponding weight vector;
and configuring a rule engine according to the content data and the corresponding data category to form the sample model library.
2. The method according to claim 1, wherein the step of semantically restoring the text content to obtain a restored text comprises:
identifying Chinese numbers in the text content;
and converting the Chinese numbers into Arabic numbers to obtain a restored text corresponding to the text content.
3. The method according to claim 1, wherein the step of semantically restoring the text content to obtain a restored text comprises:
and performing semantic restoration on the text content according to a preset homophone/combined character library to obtain a corresponding restored text.
4. The method of claim 1, wherein the step of extracting text vector features from the content data through a random forest algorithm comprises:
performing semantic restoration on the content data;
performing word segmentation on the semantically restored content data to obtain the text tokens corresponding to the content data;
and extracting text vector features for each of the text tokens corresponding to the content data through a random forest algorithm.
5. A spam recognition apparatus, comprising:
a text content extraction module, configured to extract the text content of user original information;
a semantic restoration module, configured to perform semantic restoration on the text content to obtain a restored text;
a matching operation module, configured to select a corresponding sample model class from a preset sample model library according to the user original information, the sample model library comprising a plurality of sample model classes, and to perform a matching operation on the restored text and the corresponding sample model class through a gradient descent algorithm to obtain the junk probability that the user original information is junk information;
a junk information identification module, configured to identify the user original information as junk information by comparing the junk probability with a preset junk probability threshold;
wherein the apparatus further comprises:
a content data extraction module, configured to extract content data from a predetermined database before the matching operation module selects the corresponding sample model class from the preset sample model library according to the user original information;
a feature extraction module, configured to extract text vector features from the content data through a random forest algorithm;
a data category determination module, configured to determine the data category corresponding to the content data according to the product of the text vector and the corresponding weight vector;
and a sample model library generation module, configured to configure a rule engine according to the content data and the corresponding data category to form the sample model library.
6. The apparatus of claim 5, wherein the semantic restoration module comprises:
a Chinese number recognition unit, configured to recognize Chinese numbers in the text content;
and a number conversion unit, configured to convert the Chinese numbers into Arabic numbers to obtain a restored text corresponding to the text content.
7. The apparatus of claim 5, wherein the semantic restoration module comprises:
a homophone/combined character restoration unit, configured to perform semantic restoration on the text content according to a preset homophone/combined character library to obtain a corresponding restored text.
8. The apparatus of claim 5, wherein the feature extraction module comprises:
a semantic restoration unit, configured to perform semantic restoration on the content data;
a word segmentation unit, configured to perform word segmentation on the semantically restored content data to obtain the text tokens corresponding to the content data;
and a segmentation feature extraction unit, configured to extract text vector features for each of the text tokens corresponding to the content data through a random forest algorithm.
9. A spam recognition system, said system comprising:
a processor;
a memory for storing processor-executable instructions; wherein the processor is configured to perform: extracting the text content of user original information; performing semantic restoration on the text content to obtain a restored text; selecting a corresponding sample model class from a preset sample model library according to the user original information, wherein the sample model library comprises a plurality of sample model classes; performing a matching operation on the restored text and the corresponding sample model class through a gradient descent algorithm to obtain the junk probability that the user original information is junk information; and identifying the user original information as junk information by comparing the junk probability with a preset junk probability threshold;
wherein before the selecting a corresponding sample model class from the preset sample model library according to the user original information, the processor is further configured to perform: extracting content data from a predetermined database; extracting text vector features from the content data through a random forest algorithm; determining the data category corresponding to the content data according to the product of the text vector and the corresponding weight vector; and configuring a rule engine according to the content data and the corresponding data category to form the sample model library.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710417747.0A CN107239447B (en) | 2017-06-05 | 2017-06-05 | Junk information identification method, device and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107239447A CN107239447A (en) | 2017-10-10 |
CN107239447B true CN107239447B (en) | 2020-12-18 |
Family
ID=59984879
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710417747.0A Active CN107239447B (en) | 2017-06-05 | 2017-06-05 | Junk information identification method, device and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107239447B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109255069A (en) * | 2018-07-31 | 2019-01-22 | Alibaba Group Holding Ltd. | Discrete text content risk identification method and system |
CN109344388B (en) * | 2018-08-02 | 2023-06-09 | China Central Television | Method and device for identifying spam comments and computer-readable storage medium |
CN109597987A (en) * | 2018-10-25 | 2019-04-09 | Alibaba Group Holding Ltd. | Text restoration method and device, and electronic device |
CN109829102A (en) * | 2018-12-27 | 2019-05-31 | Zhejiang University of Technology | Web advertisement identification method based on random forest |
CN111581959A (en) * | 2019-01-30 | 2020-08-25 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Information analysis method, terminal and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101699432A (en) * | 2009-11-13 | 2010-04-28 | Heilongjiang Institute of Technology | Ordering strategy-based information filtering system |
CN102567304A (en) * | 2010-12-24 | 2012-07-11 | Peking University Founder Group Co., Ltd. | Filtering method and device for network malicious information |
CN103166830A (en) * | 2011-12-14 | 2013-06-19 | China Telecom Corporation Limited | Spam email filtering system and method capable of intelligently selecting training samples |
CN104050556A (en) * | 2014-05-27 | 2014-09-17 | Harbin University of Science and Technology | Feature selection method and detection method of junk mails |
KR20160067473A (en) * | 2014-12-04 | 2016-06-14 | Soongsil University Industry-Academic Cooperation Foundation | Method for spam classification, recording medium and device for performing the method |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090300012A1 (en) * | 2008-05-28 | 2009-12-03 | Barracuda Inc. | Multilevel intent analysis method for email filtration |
CN101908055B (en) * | 2010-03-05 | 2013-02-13 | Heilongjiang Institute of Technology | Method for setting information classification threshold for optimizing lam percentage and information filtering system using same |
CN104038391B (en) * | 2014-07-02 | 2017-11-17 | NetEase (Hangzhou) Network Co., Ltd. | Spam detection method and apparatus |
CN104702492B (en) * | 2015-03-19 | 2019-10-18 | Baidu Online Network Technology (Beijing) Co., Ltd. | Junk message model training method, junk message identification method and device thereof |
CN106202330B (en) * | 2016-07-01 | 2020-02-07 | Beijing Xiaomi Mobile Software Co., Ltd. | Junk information judgment method and device |
Non-Patent Citations (1)
Title |
---|
Mobile SMS Spam Filtering based on Mixing Classifiers; Hassan Najadat; International Journal of Advanced Computing Research; 2014-12-31; pp. 1-6 * |
Also Published As
Publication number | Publication date |
---|---|
CN107239447A (en) | 2017-10-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20240078386A1 (en) | Methods and systems for language-agnostic machine learning in natural language processing using feature extraction | |
US10685186B2 (en) | Semantic understanding based emoji input method and device | |
CN107239447B (en) | Junk information identification method, device and system | |
CN112487149B (en) | Text auditing method, model, equipment and storage medium | |
US20210165955A1 (en) | Methods and systems for modeling complex taxonomies with natural language understanding | |
US10831796B2 (en) | Tone optimization for digital content | |
US20180089164A1 (en) | Entity-specific conversational artificial intelligence | |
CN109885698A (en) | A knowledge graph construction method and device, and electronic equipment | |
CN110019742B (en) | Method and device for processing information | |
US20170011114A1 (en) | Common data repository for improving transactional efficiencies of user interactions with a computing device | |
KR102193228B1 (en) | Apparatus for evaluating non-financial information based on deep learning and method thereof | |
CN107491477B (en) | Emotion symbol searching method and device | |
WO2020005731A1 (en) | Text entity detection and recognition from images | |
CN111783471A (en) | Semantic recognition method, device, equipment and storage medium of natural language | |
CN108804469B (en) | Webpage identification method and electronic equipment | |
CN111680161A (en) | Text processing method and device and computer readable storage medium | |
CN112416142A (en) | Method and device for inputting characters and electronic equipment | |
CN117312641A (en) | Method, device, equipment and storage medium for intelligently acquiring information | |
CN106095747A (en) | The recognition methods of a kind of refuse messages and system | |
US20170229118A1 (en) | Linguistic model database for linguistic recognition, linguistic recognition device and linguistic recognition method, and linguistic recognition system | |
CN113505889B (en) | Processing method and device of mapping knowledge base, computer equipment and storage medium | |
CN114881174A (en) | Content classification method and device, readable storage medium and electronic equipment | |
CN112558784A (en) | Method and device for inputting characters and electronic equipment | |
WO2025091763A1 (en) | Data configuration method, electronic device and storage medium | |
CN115730237B (en) | Junk mail detection method, device, computer equipment and storage medium |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| CB02 | Change of applicant information | Address after: 361000 Area 1F-D1, Huaxun Building A, Software Park, Xiamen Torch High-tech Zone, Xiamen City, Fujian Province; Applicant after: Xiamen Meishao Co., Ltd.; Address before: Unit G03, Room 102, 22 Guanri Road, Phase II, Xiamen Software Park, Fujian Province; Applicant before: XIAMEN MEIYOU INFORMATION SCIENCE & TECHNOLOGY Co.,Ltd.
| GR01 | Patent grant |