
CN107239447B - Junk information identification method, device and system - Google Patents


Info

Publication number
CN107239447B
Authority
CN
China
Prior art keywords
text
sample model
junk
content data
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710417747.0A
Other languages
Chinese (zh)
Other versions
CN107239447A (en
Inventor
陈方毅
陈伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meishao Co ltd
Original Assignee
Xiamen Meishao Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meishao Co ltd
Priority to CN201710417747.0A
Publication of CN107239447A
Application granted
Publication of CN107239447B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a junk information identification method, device and system, and belongs to the technical field of internet applications. The method comprises the following steps: extracting the text content of user original information; performing semantic reduction on the text content to obtain a reduced text; performing a matching operation on the reduced text in a preset sample model library through a gradient descent algorithm to obtain the junk probability that the user original information is junk information; and comparing the junk probability with a preset junk probability threshold to identify the user original information as junk information. A junk information identification device and system are also provided. The method, device and system can identify junk information in user original information that has undergone semantic conversion.

Description

Junk information identification method, device and system
Technical Field
The invention relates to the technical field of internet application, in particular to a junk information identification method, device and system.
Background
With the development of internet technology, network information is increasingly rich, and the user original information on websites is of very mixed quality, so that there is more and more junk information such as useless advertisements and pornography. Therefore, the user original information on a website should be filtered for junk words in advance; that is, the user original information should be identified in advance, and user original information identified as junk information should be blocked to keep the website's information clean.
However, publishers often perform semantic conversion on user original information before releasing it, in order to avoid having it recognized as spam. For example, when advertisement information is released, Arabic numerals such as a QQ number are converted into Chinese numerals.
At present, existing junk information identification generally works by complete or partial matching against reference junk words. It cannot identify junk information in user original information that has undergone semantic conversion, so the accuracy of junk information identification is greatly reduced and the misjudgment rate is high.
Disclosure of Invention
The invention provides a junk information identification method, device and system, and aims to solve the technical problem that junk information identification cannot be performed on user original information subjected to semantic conversion in the related art.
The embodiment of the invention provides a junk information identification method, which comprises the following steps:
extracting text content of user original information;
performing semantic reduction on the text content to obtain a reduced text;
matching the reduced text in a preset sample model library through a gradient descent algorithm to obtain the garbage probability that the user original information is garbage information;
and identifying the user original information as junk information by comparing the junk probability with a preset junk probability threshold value.
In addition, an embodiment of the present invention provides a spam information identifying apparatus, including:
the text content extraction module is used for extracting the text content of the original information of the user;
the semantic reduction module is used for carrying out semantic reduction on the text content to obtain a reduced text;
the matching operation module is used for performing matching operation on the reduced text in a preset sample model library through a gradient descent algorithm to obtain the garbage probability that the user original information is garbage information;
and the junk information identification module is used for identifying the user original information as junk information by comparing the junk probability with a preset junk probability threshold value.
In addition, an embodiment of the present invention further provides a system, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform:
extracting text content of user original information;
performing semantic reduction on the text content to obtain a reduced text;
matching the reduced text in a preset sample model library through a gradient descent algorithm to obtain the garbage probability that the user original information is garbage information;
and identifying the user original information as junk information by comparing the junk probability with a preset junk probability threshold value.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
when the junk information of the user original information is identified, the text content of the user original information is subjected to semantic restoration, so that the junk information can be identified for the user original information subjected to semantic conversion, the accuracy of identifying the junk information is greatly improved, and the misjudgment rate of the junk information is reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a flow chart illustrating a spam identification method according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating a spam identification method according to an example embodiment.
Fig. 3 is a flow chart illustrating a spam identification method according to an example embodiment.
Fig. 4 is a flowchart illustrating a specific implementation of step S220 in the spam identification method according to the corresponding embodiment in fig. 3.
Fig. 5 is a flowchart illustrating a specific implementation of step S130 in the spam identification method according to the corresponding embodiment in fig. 1.
Fig. 6 is a block diagram illustrating a spam recognition device according to an example embodiment.
Fig. 7 is a block diagram of the semantic restoring module 120 in the spam recognition device according to the corresponding embodiment in fig. 6.
Fig. 8 is a block diagram of another spam identification device according to the corresponding embodiment of fig. 6.
Fig. 9 is a block diagram of the feature extraction module 220 in the spam recognition device according to the embodiment shown in fig. 8.
Fig. 10 is a block diagram of the matching operation module 130 in the spam recognition device according to the corresponding embodiment of fig. 6.
FIG. 11 is a block diagram illustrating a system in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Fig. 1 is a flow chart illustrating a spam identification method according to an exemplary embodiment. As shown in fig. 1, the spam recognition method may include the following steps.
In step S110, the text content of the user original information is extracted.
The user-originated information is information input by a user on the network. For example, in a forum, users leave comments on a topic.
It is understood that the user original information includes expressions, text, and the like.
User original information is of mixed quality, and its text often contains a lot of junk information, so junk information needs to be identified in advance in the text content of the user original information. Thus, the text content is extracted from the user original information.
When extracting text content from the user original information, the text content may be extracted by various text extraction methods, which are not limited herein.
In step S120, semantic reduction is performed on the text content to obtain a reduced text.
Semantic reduction is text processing performed on the text content according to its semantics. After semantic analysis is carried out on the text content, the corresponding reduction processing is carried out to obtain a reduced text.
It is understood that, to prevent released spam from being screened out, a user may convert the semantics of the user original information so that it is not identified as spam.
For example, the QQ number "1234567" is converted from Arabic numerals into the Chinese numerals "one, two, three, four, five, six, seven" to prevent it from being recognized as spam.
For another example, through the conversion of a harmonic character (homophone) or a combined character, "add my WeChat, I will send it to you" is rewritten with the character for "add" replaced by a homophone meaning "home", thereby avoiding being recognized as spam.
Therefore, semantic restoration processing needs to be performed on the text content of the user original information.
Semantic reduction performs semantic analysis on the text content and extracts the meaning that the text content represents.
There are various methods for semantic analysis of the text content. With latent semantic indexing, the text content can be represented as a feature-document matrix based on a vector space model, the rank of the matrix is reduced through singular value decomposition, and the text content and the feature words are mapped to the same low-dimensional semantic space. Semantic analysis can also be performed based on external semantic knowledge, for example by extracting the meaning of the text content through a harmonic character/combined character dictionary. The text content may also be subjected to semantic analysis in other ways, and the method of semantic analysis is not limited herein.
In step S130, the reduced text is subjected to a matching operation in a preset sample model library by a gradient descent algorithm, so as to obtain a spam probability that the user original information is spam.
The gradient descent algorithm is an optimization algorithm in machine learning.
The sample model library is prepared in advance, and the sample model library contains the probability that each sample model is junk information.
The junk probability is the probability that the original information of the user is junk information.
In the gradient descent algorithm, the reduced text of the user original information is matched against the sample models in the sample model library by descending along the gradient step by step, and after the operation converges, the spam probability that the user original information is spam is obtained.
In step S140, the user original information is identified as spam by comparing the spam probability with a preset spam probability threshold.
The garbage probability threshold is a preset garbage probability critical value.
When the junk probability that the user original information is junk information reaches the junk probability threshold, the user original information is identified as junk information.
For example, the preset spam probability threshold is 70%, and when the spam probability of the user original information reaches 70%, the user original information is identified as spam information.
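The comparison in step S140 can be sketched as follows. This is a minimal illustration only: the names SPAM_THRESHOLD and is_spam are assumptions made for the example, and "reaches" is read here as "greater than or equal to".

```python
# Hypothetical threshold, taken from the 70% example in the text.
SPAM_THRESHOLD = 0.70


def is_spam(spam_probability: float, threshold: float = SPAM_THRESHOLD) -> bool:
    """Return True when the spam probability reaches the preset threshold."""
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must lie in [0, 1]")
    return spam_probability >= threshold
```

Under this reading, a probability of exactly 70% is already treated as spam; a stricter system could use a strict inequality instead.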
By utilizing the method, the text content of the user original information is subjected to semantic reduction, the reduced text obtained after the semantic reduction is subjected to matching operation in the preset sample model library to obtain the junk probability that the user original information is junk information, and the user original information is identified as the junk information according to the preset junk probability threshold, so that the junk information can be identified according to the user original information subjected to semantic conversion, and the accuracy of identifying the junk information is greatly improved.
Fig. 2 is a flow chart illustrating a spam identification method according to an example embodiment. As shown in fig. 2, the step S120 shown in the corresponding embodiment of fig. 1 may include the following steps.
In step S121, Chinese numerals in the text content are identified.
Chinese numbers are numbers expressed in Chinese characters. They include the capital (financial) form and the ordinary form, such as "壹" and "一", both of which mean "one".
In a specific exemplary embodiment, the Chinese numbers in the text content are identified by comparing the text content to a preset thesaurus of numbers.
In step S122, the Chinese numbers are converted into Arabic numbers, and a restored text corresponding to the text content is obtained.
By using the method, the Chinese numbers in the user original information are identified, then the Chinese numbers are converted into Arabic numbers, and then the Arabic numbers are subjected to matching operation in a preset sample model library to obtain the junk probability that the user original information is junk information, so that the junk information can be identified for the Chinese numbers, and the accuracy of identifying the junk information is greatly improved.
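Steps S121 and S122 can be sketched as a character-by-character lookup. The mapping table CN_DIGITS and the function restore_digits are hypothetical names chosen for this example; the sketch assumes digit-by-digit spellings (as in a QQ number) and does not handle positional forms such as "ten" or "hundred".

```python
# Hypothetical number lexicon: ordinary and capital (financial) digit forms.
CN_DIGITS = {
    "零": "0", "〇": "0",
    "一": "1", "壹": "1",
    "二": "2", "贰": "2",
    "三": "3", "叁": "3",
    "四": "4", "肆": "4",
    "五": "5", "伍": "5",
    "六": "6", "陆": "6",
    "七": "7", "柒": "7",
    "八": "8", "捌": "8",
    "九": "9", "玖": "9",
}


def restore_digits(text: str) -> str:
    """Replace every Chinese digit character with its Arabic digit."""
    # Characters that are not in the lexicon are passed through unchanged.
    return "".join(CN_DIGITS.get(ch, ch) for ch in text)


restored = restore_digits("QQ一二三四五六七")
```

Here `restored` holds the text with the number spelled in Arabic digits, which is the form that would then be passed to the matching operation.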
Optionally, step S120 shown in the embodiment corresponding to fig. 1 may further include the following steps:
performing semantic reduction on the text content according to a preset harmonic character/combined character library to obtain a corresponding reduced text.
The harmonic character/combined character library is a dictionary containing each text and its corresponding harmonic characters (homophones) and/or combined characters.
It is understood that the text content of the user original information may contain words that have been converted using harmonic characters/combined characters. Therefore, the text content is subjected to semantic analysis and semantic reduction according to the preset harmonic character/combined character library.
For example, the user intends to publish an insult along the lines of "rogue, go die", but to avoid having it identified as spam, the text content at the time of publication replaces a character with a homophone (rendered literally in translation as "awn, go die"). The semantics of the user original information are identified through the preset harmonic character/combined character library and the harmonic characters/combined characters are converted back, so that the published text is restored to "rogue, go die".
By using the method, the semantics of the original information of the user is identified through the preset harmonic character/combined character library, and the conversion of the harmonic characters/combined characters is carried out, so that the situation that part of garbage information cannot be identified as garbage information through the conversion of the harmonic characters/combined characters is avoided, and the accuracy of garbage information identification is greatly improved.
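A dictionary-based reduction of this kind can be sketched as below. The library contents and the function name restore_homophones are illustrative assumptions, not entries from the patent; a real library would be far larger and would likely need context to avoid rewriting legitimate words.

```python
from typing import Dict

# Hypothetical library: disguised spelling -> intended text.
HOMOPHONE_LIBRARY: Dict[str, str] = {
    "家我": "加我",  # "home me" -> "add me"
    "威信": "微信",  # a homophone of "WeChat"
    "扣扣": "QQ",    # a phonetic spelling of "QQ"
}


def restore_homophones(text: str, library: Dict[str, str]) -> str:
    """Replace every disguised variant found in the text with its intended form."""
    # Longer variants are replaced first so that overlapping entries
    # resolve in a predictable order.
    for variant in sorted(library, key=len, reverse=True):
        text = text.replace(variant, library[variant])
    return text
```

Plain substring replacement is the simplest possible strategy; it is used here only to make the idea concrete.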
Fig. 3 is a flow chart illustrating a spam identification method according to an example embodiment. As shown in fig. 3, before step S130 in the corresponding embodiment of fig. 1, the spam identification method may further include the following steps.
In step S210, content data is extracted from a predetermined database.
The database is a data warehouse for storing and managing website community information according to a data structure.
For example, various information data of the grapefruit community are stored in a predetermined database in accordance with a data structure.
The content data is text information stored in a database in a data structure.
In step S220, feature extraction of text vectors is performed from the content data by a random forest algorithm.
In machine learning, a random forest is a classifier that contains multiple decision trees.
The text vector is a data form characterized by extracting the characteristics of the content data through a decision tree classifier.
The random forest is composed of a plurality of decision trees. Each node in the decision tree is a condition about a certain characteristic, content data is classified according to different conditions, and then the content data is converted into a text vector according to the classification.
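One possible reading of this step is sketched below, under stated assumptions: the trees are written by hand rather than trained, each node tests whether a keyword is present, and the text vector is taken to be the list of leaf labels reached in each tree. None of these details are specified in the text; TREES, leaf_of and text_vector are hypothetical names.

```python
# A tree is either a leaf (an int label) or a tuple:
# (keyword, subtree_if_present, subtree_if_absent).
TREES = [
    ("wechat", ("add", 0, 1), 2),
    ("qq", 0, ("free", 1, 2)),
]


def leaf_of(tree, tokens):
    """Walk one decision tree and return the leaf label that is reached."""
    while not isinstance(tree, int):
        keyword, present, absent = tree
        tree = present if keyword in tokens else absent
    return tree


def text_vector(tokens, trees=TREES):
    """Encode the tokens as one leaf label per tree."""
    token_set = set(tokens)
    return [leaf_of(tree, token_set) for tree in trees]
```

In a trained random forest the conditions and leaves would come from fitting on labelled content data; the hand-written trees only show how conditions at the nodes turn content data into a vector.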
In step S230, a data category corresponding to the content data is obtained according to the text vector and the corresponding weight vector.
The weight vector corresponds to the text vector. Each weight component in the weight vector is in one-to-one correspondence with a text component in the text vector.
When content data is classified according to different conditions, each condition has a corresponding weight, so that each text component in the text vector obtained by feature extraction from the content data also has a corresponding weight component.
In a specific exemplary embodiment, the data category is an information spam degree corresponding to the content data, and the content data is classified according to different information spam degrees.
In a specific exemplary embodiment, the corresponding data category is found from the product by calculating the product between the text vector and the corresponding weight vector.
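The product of the text vector and the weight vector can be sketched as an inner product followed by a lookup. The cut-off values and category labels in CATEGORY_BOUNDS are assumptions made for illustration; the text does not say how the product is mapped to a data category.

```python
def weighted_score(text_vector, weight_vector):
    """Inner product of a text vector and its one-to-one weight vector."""
    if len(text_vector) != len(weight_vector):
        raise ValueError("text vector and weight vector must have equal length")
    return sum(t * w for t, w in zip(text_vector, weight_vector))


# Hypothetical lower bounds for each spam-degree category, highest first.
CATEGORY_BOUNDS = [(2.0, "high"), (1.0, "medium")]


def data_category(text_vector, weight_vector):
    """Map the weighted score to a spam-degree category."""
    score = weighted_score(text_vector, weight_vector)
    for lower_bound, label in CATEGORY_BOUNDS:
        if score >= lower_bound:
            return label
    return "low"
```

For instance, with weights [1.5, 0.2, 0.8], a text vector that activates the first and third components scores above the assumed upper cut-off and falls into the "high" category.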
In step S240, a rule engine is configured according to the content data and the corresponding data category to form a sample model library.
The rules engine is a business rules decision component.
In the rule engine, rule conditions correspond to rule actions. By accepting data input, business rules are interpreted, and business decisions are made based on the business rules. And when the rule condition in the business rule is satisfied, triggering to execute a corresponding rule action.
In a specific exemplary embodiment, a similarity rate between the input text content and the content data is configured; when the similarity between the input text content and the content data reaches the configured similarity rate, the input text content is identified as belonging to the data category corresponding to the content data.
For example, the data category corresponding to content data B is spam, and the rule condition configured in the rule engine is a similarity rate of 80% with content data B. If calculation and analysis show that the similarity rate between input text content A and content data B is 90%, text content A is identified as junk information.
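A rule of this form can be sketched as a condition (a similarity rate against a piece of content data) paired with an action (assigning that content data's category). The similarity measure is not specified in the text; the sketch assumes the ratio from Python's difflib.SequenceMatcher as a stand-in, and Rule, similarity and classify are hypothetical names.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Rule:
    content: str           # content data from the database
    category: str          # data category assigned to that content data
    min_similarity: float  # configured similarity rate, e.g. 0.8


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; SequenceMatcher is an assumed stand-in measure."""
    return SequenceMatcher(None, a, b).ratio()


def classify(text: str, rules: List[Rule]) -> Optional[str]:
    """Return the category of the first rule whose condition is satisfied."""
    for rule in rules:
        if similarity(text, rule.content) >= rule.min_similarity:
            return rule.category
    return None


rules = [Rule("add my wechat for free samples", "spam", 0.8)]
```

Calling `classify("add my wechat for free sample", rules)` triggers the rule because the two strings differ by a single character, while an unrelated sentence falls below the configured rate and matches no rule.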
By utilizing the method, the sample model library is formed by extracting the characteristics of the content data in the database in advance and configuring the rule engine, and the spam probability of the text content is calculated in the sample model library when the spam is judged in the follow-up process, so that the accuracy of identifying the spam is greatly improved.
FIG. 4 is a depiction of further details of step S220, shown in accordance with an exemplary embodiment. As shown in fig. 4, the sample model library is divided into a plurality of sample model classes, and the step S220 may include the following steps.
In step S221, the content data is semantically restored.
It can be understood that, in order to prevent released junk information from being screened out, a user may release the user original information after operations such as splitting characters or substituting harmonic characters (homophones).
Therefore, semantic reduction processing needs to be performed on the content data.
Semantic reduction is text processing performed on the content data according to its semantics. For example, a string of Chinese numbers is first converted into Arabic numbers, and the resulting number is then converted into a QQ or WeChat term.
In a specific exemplary embodiment, the content data is "home my WeChat, I will send it to you", in which the character for "add" has been replaced by a homophone meaning "home". Through the reduction of harmonic characters/combined characters according to a preset harmonic character/combined character dictionary, it is converted into "add my WeChat, I will send it to you", so that the junk information can be screened out.
In a specific exemplary embodiment, the content data is: "The unconventional Vyuting1028103172 is a little more." Through semantic reduction, QQ and WeChat identifiers are converted into the same word; that is, "Vyuting1028103172" is extracted and converted into a single common dimension, and the content data obtained after semantic reduction is "ununified wechat how many teachers and you". Because junk information usually asks the reader to add a WeChat or QQ contact, uniformly processing the various WeChat and QQ numbers into one dimension prevents the resulting text vector from becoming too large, and also avoids the situation in which a WeChat or QQ number that has not been seen before cannot be identified.
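Collapsing contact identifiers into one dimension can be sketched with a regular expression. The pattern (optional letters followed by at least five digits) and the placeholder token are assumptions chosen for this example; the text does not state how identifiers are detected.

```python
import re

# Hypothetical pattern: optional letters followed by a run of 5 or more digits.
CONTACT_PATTERN = re.compile(r"[A-Za-z]*\d{5,}")
# Hypothetical placeholder standing for "some WeChat/QQ identifier".
CONTACT_TOKEN = "<CONTACT>"


def normalize_contacts(text: str) -> str:
    """Replace every contact-like identifier with one shared token."""
    return CONTACT_PATTERN.sub(CONTACT_TOKEN, text)
```

Because every identifier maps to the same token, the vocabulary does not grow with each new account name, and an identifier never seen before is still treated like the known ones.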
In step S222, a word segmentation operation is performed on the semantic-restored content data to obtain a text word corresponding to the content data.
It will be appreciated that the content data may be composed of a plurality of words, for example "add my WeChat, I will send it to you".
If feature extraction were carried out directly on the content data after semantic reduction, the similarity between texts would be strongly affected. Therefore, before feature extraction, a word segmentation operation is performed on the content data, and feature extraction of text vectors is then carried out separately on each text word obtained from the segmentation.
The word segmentation operation refers to segmenting a word sequence into individual words.
As described above, the content data is text information stored in the database in accordance with a data structure. The text information may be a word, a plurality of words, or other forms.
Therefore, the content data is segmented into individual text words by performing a word segmentation operation on it.
There are various ways of performing word segmentation on content data. The content data can be mechanically segmented into text words one by one based on character strings; semantic analysis can also be carried out on the content data first, and the content data then divided into text words based on semantics; the word segmentation operation may also be performed in other manners, which are not limited herein.
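The mechanical, string-based option can be sketched with forward maximum matching, a common dictionary-driven technique that is used here as an assumed example; the text does not name a specific algorithm. The vocabulary and the function name segment are hypothetical.

```python
def segment(text, vocabulary, max_len=4):
    """Greedy forward maximum matching against a vocabulary of known words."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary for the running "add my WeChat" example.
vocabulary = {"加", "我", "微信", "发", "你"}
tokens = segment("加我微信发你", vocabulary)
```

Unknown characters are emitted as single-character tokens, so the function always terminates and never drops input.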
In step S223, feature extraction of text vectors is performed separately on each text word corresponding to the content data by a random forest algorithm.
By using the method, when the sample model library is manufactured, before the feature extraction of the text vector is carried out on the content data, the semantic reduction and the word segmentation operation are carried out on the content data in advance, so that the text vector obtained by carrying out the feature extraction on the content data is more accurate, and the accuracy of the sample model library is improved.
Fig. 5 is a depiction of further details of step S130, shown in accordance with an exemplary embodiment. As shown in fig. 5, the sample model library is divided into a plurality of sample model classes, and the step S130 may include the following steps.
In step S131, a corresponding sample model class is selected from the sample model library according to the user original information.
In the sample model library, the sample models are divided into a plurality of sample model classes, and each sample model class contains a predetermined number of sample models.
In step S132, a gradient descent algorithm is used to perform a matching operation between the user original information and the selected sample model class, so as to obtain a spam probability that the user original information is spam.
During the matching operation, each stochastic gradient step uses only the sample models in one sample model class. Namely:
X(t+1)=X(t)+ΔX(t)
ΔX(t)=-ηg(t)
where η is the learning rate, and g(t) is the gradient of X at time t.
By classifying the sample model classes in the sample model library, when the number of sample models in the sample model library is large, one sample model class is selected for matching operation, so that the consumption of resources in the matching operation is reduced, and the convergence can be faster.
For example, if the gradients of the first half and the second half of the sample models in the sample model library are the same, the first half is used as one sample model class and the second half as another. In one traversal of the sample model library, the class-based method then advances two steps towards the optimal solution, while the method that matches against the whole library at once advances only one step.
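The two-steps-versus-one argument can be checked numerically with the update rule X(t+1) = X(t) - η g(t). The objective is not given in the text, so the sketch assumes a one-dimensional squared-error loss as a stand-in; the sample values, the learning rate and all function names are illustrative.

```python
def gradient(x, samples):
    """Gradient of the mean of 0.5 * (x - y)**2 over the given samples."""
    return sum(x - y for y in samples) / len(samples)


def step(x, samples, eta):
    """One update: X(t+1) = X(t) + dX(t), with dX(t) = -eta * g(t)."""
    return x - eta * gradient(x, samples)


def full_pass(x, classes, eta):
    """One traversal using the whole library at once: a single step."""
    all_samples = [y for cls in classes for y in cls]
    return step(x, all_samples, eta)


def classwise_pass(x, classes, eta):
    """One traversal taking one step per sample model class."""
    for cls in classes:
        x = step(x, cls, eta)
    return x


# Two classes with identical gradients; the optimum of the assumed loss is 3.0.
classes = [[2.0, 4.0], [2.0, 4.0]]
x_full = full_pass(0.0, classes, eta=0.1)
x_classwise = classwise_pass(0.0, classes, eta=0.1)
```

Starting from 0, the full pass ends at 0.3 while the class-wise pass ends at 0.57, so for the same traversal of the data the class-wise variant is closer to the optimum.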
Optionally, when there are duplicate sample models in the sample model library, the convergence of the matching operation can be promoted more quickly by the classification of the sample model classes.
In addition, after each matching operation, the content data identified as spam is stored in the sample model library as a sample model.
By this method, the sample models in the sample model library are divided into a plurality of sample model classes, and each stochastic gradient matching operation is performed within one sample model class, so that the consumption of computing resources is greatly reduced, convergence is reached more quickly, and the efficiency of junk information identification is improved.
The following is an embodiment of the apparatus of the present invention, which can be used to implement the above-mentioned embodiment of the spam information identifying method. For details that are not disclosed in the embodiments of the apparatus of the present invention, please refer to the embodiments of the spam identification method of the present invention.
FIG. 6 is a block diagram illustrating a spam identification device according to an exemplary embodiment, the device including, but not limited to: a text content extracting module 110, a semantic restoring module 120, a matching operation module 130 and a spam identification module 140.
A text content extracting module 110, configured to extract text content of the user original information;
the semantic reduction module 120 is configured to perform semantic reduction on the text content to obtain a reduced text;
the matching operation module 130 is configured to perform matching operation on the restored text in a preset sample model library through a gradient descent algorithm to obtain a spam probability that the user original information is spam;
and the spam information identifying module 140 is configured to identify the user original information as spam information by comparing the spam probability with a preset spam probability threshold.
The implementation process of the functions and actions of each module in the device is specifically described in the implementation process of the corresponding step in the spam identification method, and is not described herein again.
Optionally, as shown in fig. 7, in the spam recognition apparatus of the embodiment corresponding to fig. 6, the semantic restoring module 120 further includes but is not limited to: a Chinese number recognition unit 121 and a number conversion unit 122.
A Chinese number recognition unit 121 for recognizing Chinese numbers in the text content;
the number conversion unit 122 is configured to convert the Chinese numbers into Arabic numbers, and obtain a restored text corresponding to the text content.
Optionally, in the spam recognition apparatus shown in the embodiment corresponding to fig. 6, the semantic restoring module 120 further includes but is not limited to: harmonic character/combined character restoring unit.
And the harmonic character/combined character reduction unit is used for carrying out semantic reduction on the text content according to a preset harmonic character/combined character library to obtain a corresponding reduced text.
Fig. 8 is a block diagram of another spam identification apparatus according to the embodiment corresponding to fig. 6. The apparatus further includes, but is not limited to: a content data extraction module 210, a feature extraction module 220, a data category determination module 230, and a sample model library generation module 240.
The content data extraction module 210 is configured to extract content data from a predetermined database;
the feature extraction module 220 is configured to perform feature extraction of a text vector from the content data through a random forest algorithm;
the data category determination module 230 is configured to determine a data category corresponding to the content data according to the text vector and the corresponding weight vector;
and the sample model library generation module 240 is configured to configure a rule engine according to the content data and the corresponding data category, so as to form a sample model library.
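The category determination and library formation described above may be sketched as follows. Several points are assumptions made for illustration: each candidate data category is given its own weight vector, the category with the largest inner product wins, and grouping content data by category stands in for configuring the rule engine. The text vectors are supplied directly, so the random forest step is not shown. All function names are hypothetical.

```python
from collections import defaultdict


def dot(u, v):
    """Inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def determine_category(text_vector, category_weights):
    """Return the category whose weight vector gives the largest product."""
    return max(
        category_weights,
        key=lambda c: dot(text_vector, category_weights[c]),
    )


def build_sample_model_library(samples, category_weights):
    """Group (content, text_vector) pairs by their determined category.

    This grouping is only a stand-in for the rule-engine configuration.
    """
    library = defaultdict(list)
    for content, vector in samples:
        library[determine_category(vector, category_weights)].append(content)
    return dict(library)


if __name__ == "__main__":
    weights = {"advertising": [2.0, 0.1, 0.0], "normal": [0.0, 0.2, 1.5]}
    data = [("buy now", [3, 1, 0]), ("see you at noon", [0, 1, 2])]
    print(build_sample_model_library(data, weights))
```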
Optionally, as shown in fig. 9, the feature extraction module 220 of the embodiment corresponding to fig. 8 includes, but is not limited to: a semantic reduction unit 221, a word segmentation unit 222 and a word segmentation feature extraction unit 223.
The semantic reduction unit 221 is configured to perform semantic reduction on the content data;
the word segmentation unit 222 is configured to perform word segmentation on the semantically reduced content data to obtain text word segments corresponding to the content data;
and the word segmentation feature extraction unit 223 is configured to perform feature extraction of text vectors on each of the text word segments corresponding to the content data through a random forest algorithm.
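The order of the three units can be illustrated with the simplified sketch below. Every stage is a placeholder chosen for brevity: the reduction step only strips separator characters inserted between letters, the segmentation step splits on non-word characters (Chinese text would need a proper word segmenter), and the random forest feature extraction is replaced by plain term counts over a fixed vocabulary. The sketch therefore shows the data flow, not the random forest algorithm itself; all names are hypothetical.

```python
import re
from collections import Counter


def semantic_reduce(content: str) -> str:
    """Placeholder reduction: lower-case and undo spellings like 'f-r-e-e'."""
    return re.sub(r"(?<=\w)[-_*.](?=\w)", "", content.lower())


def segment(content: str) -> list:
    """Placeholder segmentation: split into runs of word characters."""
    return re.findall(r"\w+", content)


def text_vector(content: str, vocabulary: list) -> list:
    """Term-count vector over a fixed vocabulary.

    Stand-in for the random forest feature extraction of unit 223.
    """
    counts = Counter(segment(semantic_reduce(content)))
    return [counts[word] for word in vocabulary]


if __name__ == "__main__":
    vocab = ["free", "prize", "meeting"]
    print(text_vector("F-R-E-E prize, free entry", vocab))
```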
Optionally, as shown in fig. 10, the sample model library is divided into a plurality of sample model classes, and the matching operation module 130 of the embodiment corresponding to fig. 6 includes, but is not limited to: a sample model class selecting unit 131 and a matching operation unit 132.
The sample model class selecting unit 131 is configured to select a corresponding sample model class from the sample model library according to the user original information;
and the matching operation unit 132 is configured to perform a matching operation on the user original information and the sample model class through a gradient descent algorithm to obtain a spam probability that the user original information is spam.
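One possible reading of units 131 and 132 is sketched below, purely as an illustration. It assumes that each sample model class holds labelled feature vectors, that the class is chosen by a hypothetical `kind` field of the user original information, and that the "matching operation through a gradient descent algorithm" is a logistic regression fitted by batch gradient descent. None of these choices is stated in the embodiment.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train(samples, labels, lr=0.5, epochs=500):
    """Fit logistic-regression weights by batch gradient descent."""
    n, m = len(samples[0]), len(samples)
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(dot(w, x) + b) - y  # gradient of the log loss
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / m for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


def select_class(user_info, library):
    """Unit 131 (assumed): pick the class named by the 'kind' field."""
    return library[user_info["kind"]]


def spam_probability(user_info, library):
    """Unit 132 (assumed): score the vector against the selected class.

    The model is retrained on every call for brevity; a practical
    system would train once per class and reuse the weights.
    """
    samples, labels = select_class(user_info, library)
    w, b = train(samples, labels)
    return sigmoid(dot(w, user_info["vector"]) + b)


if __name__ == "__main__":
    library = {"comment": ([[1, 0], [1, 1], [0, 1], [0, 0]], [1, 1, 0, 0])}
    print(spam_probability({"kind": "comment", "vector": [1, 0]}, library))
```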
Fig. 11 is a block diagram illustrating a system 100 according to an example embodiment. Referring to fig. 11, the system 100 may include one or more of the following components: a processing component 101, a memory 102, a power component 103, a multimedia component 104, an audio component 105, a sensor component 107 and a communication component 108. The above components are not all necessary, and the system 100 may add other components or reduce some components according to its own functional requirements, which is not limited in this embodiment.
The processing component 101 generally controls overall operation of the system 100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 101 may include one or more processors 109 to execute instructions to perform all or a portion of the above-described operations. Further, the processing component 101 may include one or more modules that facilitate interaction between the processing component 101 and other components. For example, the processing component 101 may include a multimedia module to facilitate interaction between the multimedia component 104 and the processing component 101.
The memory 102 is configured to store various types of data to support operations at the system 100. Examples of such data include instructions for any application or method operating on the system 100. The memory 102 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as an SRAM (Static Random Access Memory), an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM (Erasable Programmable Read-Only Memory), a PROM (Programmable Read-Only Memory), a ROM (Read-Only Memory), a magnetic memory, a flash memory, a magnetic disk, or an optical disk. Also stored in the memory 102 are one or more modules configured to be executed by the one or more processors 109 to perform all or a portion of the steps of any of the methods illustrated in fig. 1, 2, 3, 4, and 5.
The power component 103 provides power to the various components of the system 100. The power component 103 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the system 100.
The multimedia component 104 includes a screen that provides an output interface between the system 100 and the user. In some embodiments, the screen may include an LCD (Liquid Crystal Display) and a TP (Touch Panel). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The audio component 105 is configured to output and/or input audio signals. For example, the audio component 105 includes a microphone configured to receive external audio signals when the system 100 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 102 or transmitted via the communication component 108. In some embodiments, audio component 105 also includes a speaker for outputting audio signals.
The sensor component 107 includes one or more sensors for providing status assessments of various aspects of the system 100. For example, the sensor component 107 may detect an open/closed state of the system 100 and the relative positioning of its components. The sensor component 107 may also detect a change in position of the system 100 or of a component of the system 100, and a change in temperature of the system 100. In some embodiments, the sensor component 107 may also include a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 108 is configured to facilitate wired or wireless communication between the system 100 and other devices. The system 100 may access a wireless network based on a communication standard, such as WiFi (Wireless-Fidelity), 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 108 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 108 further includes an NFC (Near Field Communication) module to facilitate short-range communication. For example, the NFC module may be implemented based on RFID (Radio Frequency Identification) technology, IrDA (Infrared Data Association) technology, UWB (Ultra-Wideband) technology, BT (Bluetooth) technology, and other technologies.
In an exemplary embodiment, the system 100 may be implemented by one or more ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), PLDs (Programmable Logic Devices), FPGAs (Field-Programmable Gate Arrays), controllers, microcontrollers, microprocessors or other electronic components for performing the above-described methods.
The specific manner in which the processor of the system in this embodiment performs operations has been described in detail in the embodiment of the spam identification method, and will not be elaborated upon here.
Optionally, the present invention further provides a system, which executes all or part of the steps of the spam identification method shown in any one of fig. 1, fig. 2, fig. 3, fig. 4, and fig. 5. The system comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform:
extracting text content of user original information;
performing semantic reduction on the text content to obtain a reduced text;
performing a matching operation on the reduced text in a preset sample model library through a gradient descent algorithm to obtain the spam probability that the user original information is spam;
and identifying whether the user original information is spam by comparing the spam probability with a preset spam probability threshold.
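The four processor steps listed above can be tied together as in the following sketch. The reduction and scoring functions are passed in as parameters because their internals are described elsewhere; the `text` field, the `Verdict` type, the default threshold of 0.8, and the use of "greater than or equal to" for the comparison are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    reduced_text: str
    spam_probability: float
    is_spam: bool


def identify_spam(
    user_info: dict,
    reduce: Callable[[str], str],
    score: Callable[[str], float],
    threshold: float = 0.8,  # assumed value for the preset threshold
) -> Verdict:
    """Run the four steps: extract, reduce, score, compare."""
    text = user_info.get("text", "")   # step 1: extract text content
    reduced = reduce(text)             # step 2: semantic reduction
    probability = score(reduced)       # step 3: matching operation
    return Verdict(reduced, probability, probability >= threshold)  # step 4


if __name__ == "__main__":
    # Toy stand-ins for the reduction and matching steps.
    toy_reduce = lambda t: t.replace("薇信", "微信")
    toy_score = lambda t: 0.9 if "微信" in t else 0.1
    print(identify_spam({"text": "加薇信"}, toy_reduce, toy_score))
```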
The specific manner in which the processor of the system in this embodiment performs operations has been described in detail in relation to the embodiment of the spam recognition method and will not be described in detail here.
In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium that includes instructions, and may be a transitory or a non-transitory computer-readable storage medium. For example, the storage medium is the memory 102 including instructions that are executable by the processor 109 of the system 100 to perform the spam identification method described above.
It is to be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be effected therein by one skilled in the art without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (9)

1. A spam identification method, the method comprising:
extracting text content of user original information;
performing semantic reduction on the text content to obtain a reduced text;
selecting a corresponding sample model class from a preset sample model library according to the original information of the user, wherein the sample model library comprises a plurality of sample model classes;
performing a matching operation on the reduced text and the corresponding sample model class through a gradient descent algorithm to obtain the junk probability that the user original information is junk information;
identifying the user original information as junk information by comparing the junk probability with a preset junk probability threshold value;
before selecting a corresponding sample model class from a preset sample model library according to the user original information, wherein the sample model library comprises a plurality of sample model classes, the method further comprises the following steps:
extracting content data from a predetermined database;
performing feature extraction of text vectors from the content data through a random forest algorithm;
determining the data category corresponding to the content data according to the product between the text vector and the corresponding weight vector;
and configuring a rule engine according to the content data and the corresponding data category to form the sample model library.
2. The method according to claim 1, wherein the step of semantically restoring the text content to obtain a restored text comprises:
identifying Chinese numbers in the text content;
and converting the Chinese numbers into Arabic numbers to obtain a reduced text corresponding to the text content.
3. The method according to claim 1, wherein the step of semantically restoring the text content to obtain a restored text comprises:
and performing semantic reduction on the text content according to a preset homophonic character/combined character library to obtain a corresponding reduced text.
4. The method of claim 1, wherein the step of feature extraction of text vectors from the content data by a random forest algorithm comprises:
performing semantic reduction on the content data;
performing word segmentation operation on the content data after semantic reduction to obtain text word segmentation corresponding to the content data;
and performing feature extraction of text vectors on each of the text word segments corresponding to the content data through a random forest algorithm.
5. A spam recognition apparatus, comprising:
the text content extraction module is used for extracting the text content of the original information of the user;
the semantic reduction module is used for carrying out semantic reduction on the text content to obtain a reduced text;
the matching operation module is used for selecting a corresponding sample model class from a preset sample model library according to the user original information, the sample model library comprising a plurality of sample model classes, and for performing a matching operation on the reduced text and the corresponding sample model class through a gradient descent algorithm to obtain the junk probability that the user original information is junk information;
the junk information identification module is used for identifying the user original information as junk information by comparing the junk probability with a preset junk probability threshold value;
the device further comprises:
the content data extraction module is used for extracting content data from a predetermined database before the matching operation module selects the corresponding sample model class from the preset sample model library according to the user original information;
the feature extraction module is used for extracting features of text vectors from the content data through a random forest algorithm;
the data category determining module is used for determining the data category corresponding to the content data according to the product between the text vector and the corresponding weight vector;
and the sample model library generating module is used for configuring a rule engine according to the content data and the corresponding data category to form the sample model library.
6. The apparatus of claim 5, wherein the semantic reduction module comprises:
the Chinese number identification unit is used for identifying Chinese numbers in the text content;
and the number conversion unit is used for converting the Chinese numbers into Arabic numbers to obtain a reduced text corresponding to the text content.
7. The apparatus of claim 5, wherein the semantic reduction module comprises:
and the homophonic character/combined character reduction unit is used for performing semantic reduction on the text content according to a preset homophonic character/combined character library to obtain a corresponding reduced text.
8. The apparatus of claim 5, wherein the feature extraction module comprises:
the semantic restoring unit is used for performing semantic restoring on the content data;
the word segmentation unit is used for carrying out word segmentation operation on the content data after semantic reduction to obtain text words corresponding to the content data;
and the word segmentation feature extraction unit is used for performing feature extraction of text vectors on each of the text word segments corresponding to the content data through a random forest algorithm.
9. A spam recognition system, said system comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform:
extracting text content of user original information;
performing semantic reduction on the text content to obtain a reduced text;
selecting a corresponding sample model class from a preset sample model library according to the user original information, wherein the sample model library comprises a plurality of sample model classes;
performing a matching operation on the reduced text and the corresponding sample model class through a gradient descent algorithm to obtain the junk probability that the user original information is junk information;
identifying the user original information as junk information by comparing the junk probability with a preset junk probability threshold value;
wherein, before selecting the corresponding sample model class from the preset sample model library according to the user original information, the processor is further configured to perform:
extracting content data from a predetermined database;
performing feature extraction of text vectors from the content data through a random forest algorithm;
determining the data category corresponding to the content data according to the product between the text vector and the corresponding weight vector;
and configuring a rule engine according to the content data and the corresponding data category to form the sample model library.
CN201710417747.0A 2017-06-05 2017-06-05 Junk information identification method, device and system Active CN107239447B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710417747.0A CN107239447B (en) 2017-06-05 2017-06-05 Junk information identification method, device and system

Publications (2)

Publication Number Publication Date
CN107239447A CN107239447A (en) 2017-10-10
CN107239447B true CN107239447B (en) 2020-12-18

Family

ID=59984879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710417747.0A Active CN107239447B (en) 2017-06-05 2017-06-05 Junk information identification method, device and system

Country Status (1)

Country Link
CN (1) CN107239447B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109255069A (en) * 2018-07-31 2019-01-22 阿里巴巴集团控股有限公司 A kind of discrete text content risks recognition methods and system
CN109344388B (en) * 2018-08-02 2023-06-09 中央电视台 Method and device for identifying spam comments and computer-readable storage medium
CN109597987A (en) * 2018-10-25 2019-04-09 阿里巴巴集团控股有限公司 A kind of text restoring method, device and electronic equipment
CN109829102A (en) * 2018-12-27 2019-05-31 浙江工业大学 A kind of web advertisement recognition methods based on random forest
CN111581959A (en) * 2019-01-30 2020-08-25 北京京东尚科信息技术有限公司 Information analysis method, terminal and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101699432A (en) * 2009-11-13 2010-04-28 黑龙江工程学院 Ordering strategy-based information filtering system
CN102567304A (en) * 2010-12-24 2012-07-11 北大方正集团有限公司 Filtering method and device for network malicious information
CN103166830A (en) * 2011-12-14 2013-06-19 中国电信股份有限公司 Spam email filtering system and method capable of intelligently selecting training samples
CN104050556A (en) * 2014-05-27 2014-09-17 哈尔滨理工大学 Feature selection method and detection method of junk mails
KR20160067473A (en) * 2014-12-04 2016-06-14 숭실대학교산학협력단 Method for spam classfication, recording medium and device for performing the method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090300012A1 (en) * 2008-05-28 2009-12-03 Barracuda Inc. Multilevel intent analysis method for email filtration
CN101908055B (en) * 2010-03-05 2013-02-13 黑龙江工程学院 Method for setting information classification threshold for optimizing lam percentage and information filtering system using same
CN104038391B (en) * 2014-07-02 2017-11-17 网易(杭州)网络有限公司 A kind of method and apparatus of spam detection
CN104702492B (en) * 2015-03-19 2019-10-18 百度在线网络技术(北京)有限公司 Rubbish message model training method, rubbish message recognition methods and its device
CN106202330B (en) * 2016-07-01 2020-02-07 北京小米移动软件有限公司 Junk information judgment method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Mobile SMS Spam Filtering based on Mixing Classifiers;Hassan Najadat;《International Journal of Advanced Computing Research》;20141231;第1-6页 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 361000 Area 1F-D1, Huaxun Building A, Software Park, Xiamen Torch High-tech Zone, Xiamen City, Fujian Province

Applicant after: Xiamen Meishao Co., Ltd.

Address before: Unit G03, Room 102, 22 Guanri Road, Phase II, Xiamen Software Park, Fujian Province

Applicant before: XIAMEN MEIYOU INFORMATION SCIENCE & TECHNOLOGY Co.,Ltd.

GR01 Patent grant