CN112291424B - Fraud number identification method and device, computer equipment and storage medium - Google Patents

Fraud number identification method and device, computer equipment and storage medium

Info

Publication number
CN112291424B
Authority
CN
China
Prior art keywords
fraud
training
fraud number
preset
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011176102.0A
Other languages
Chinese (zh)
Other versions
CN112291424A (en
Inventor
钱沁莹
葛胜利
汲丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Data Security Solutions Co Ltd
Priority to CN202011176102.0A
Publication of CN112291424A
Application granted
Publication of CN112291424B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/22 Arrangements for supervision, monitoring or testing
    • H04M3/2281 Call monitoring, e.g. for law enforcement purposes; Call tracing; Detection or prevention of malicious calls
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W12/00 Security arrangements; Authentication; Protecting privacy or anonymity
    • H04W12/12 Detection or prevention of fraud

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Technology Law (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention is applicable to the technical field of computers, and provides a fraud number identification method, a fraud number identification device, computer equipment and a storage medium. The method comprises the following steps: acquiring communication characteristic information of a number to be identified; and processing the communication characteristic information according to a preset fraud number identification model to generate a fraud number identification result, wherein the preset fraud number identification model is generated in advance by training with a self-training classification algorithm based on semi-supervised learning. Because the self-training classification algorithm does not rely on a large amount of labeled sample data during training, it can still yield a good identification model and adapts well to the field of fraud telephone identification, where labeled sample data are scarce; the fraud number identification results obtained by processing communication characteristic information with the model are therefore highly accurate.

Description

Fraud number identification method and device, computer equipment and storage medium
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a fraud number identification method and device, computer equipment and a storage medium.
Background
In an operator's business scenarios, fraud telephone identification is one of the more important tasks. Existing fraud telephone identification solutions common in the industry fall into two broad categories: rule engines and machine learning methods. Machine learning methods are widely applied in anti-fraud scenarios because they are automated and intelligent. From a technical perspective, identifying fraudulent calls can be abstracted as a classification problem in supervised learning. In practical applications, however, the difficulty of obtaining positive and negative sample labels for supervised learning urgently needs to be solved.
Supervised learning techniques require operators to have accumulated sufficient historical labels, or to rely on expert experience to label some fraud telephone numbers. The comprehensiveness and reliability of the existing sample labels therefore have a great influence on the accuracy of the model's identification results. In short, supervised learning depends too heavily on sample labeling; fraud telephone identification based purely on supervised learning has limited use scenarios and is weak at identifying new types of fraud groups.
As can be seen, existing fraud telephone identification technology relies on labels of known fraud telephones, which limits its identification accuracy, identification efficiency and application range.
Disclosure of Invention
An embodiment of the invention aims to provide a fraud number identification method, so as to solve the technical problem that existing fraud telephone identification technology depends on labels of known fraud telephones, which limits identification accuracy, identification efficiency and application range.
The embodiment of the invention is realized as a fraud number identification method comprising the following steps:
acquiring communication characteristic information of a number to be identified, the communication characteristic information comprising one or more of base station data, call data, short message data and traffic data;
processing the communication characteristic information according to a preset fraud number identification model to generate a fraud number identification result, wherein the preset fraud number identification model is generated in advance by training with a self-training classification algorithm based on semi-supervised learning.
Another object of an embodiment of the present invention is to provide a fraud number identification apparatus, including:
a communication characteristic information acquisition unit, used for acquiring the communication characteristic information of the number to be identified, the communication characteristic information comprising one or more of base station data, call data, short message data and traffic data;
a fraud number identification unit, used for processing the communication characteristic information according to a preset fraud number identification model to generate a fraud number identification result, wherein the preset fraud number identification model is generated in advance by training with a self-training classification algorithm based on semi-supervised learning.
It is a further object of embodiments of the present invention to provide a computer device, comprising a memory and a processor, said memory having stored therein a computer program, which, when executed by said processor, causes said processor to perform the steps of said fraud number identification method as described above.
It is a further object of embodiments of the present invention to provide a computer-readable storage medium, having stored thereon a computer program, which, when executed by a processor, causes the processor to perform the steps of the fraud number identification method as described above.
According to the fraud number identification method provided by the embodiment of the invention, after the communication characteristic information of the number to be identified, such as base station data, call data, short message data and traffic data, is obtained, the communication characteristic information is directly processed according to the preset fraud number identification model to generate a fraud number identification result, wherein the preset fraud number identification model is generated by training a self-training classification algorithm based on semi-supervised learning in advance.
Drawings
FIG. 1 is a flow chart illustrating steps of a method for identifying a fraud number according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of training a fraud number recognition model according to an embodiment of the present invention;
FIG. 3 is a flowchart of steps of another method of training a fraud number recognition model according to an embodiment of the present invention;
FIG. 4 is a flow chart of steps of another fraud number identification method provided by the embodiment of the present invention;
FIG. 5(a) is an undirected connected graph of the numbers and calling devices of the normal user group;
FIG. 5(b) is an undirected connected graph of the numbers and calling devices of the abnormal user group;
FIG. 6 is a flow chart of steps of still another fraud number identification method provided by the embodiment of the present invention;
FIG. 7 is a flow chart illustrating steps of a method for identifying a victim group according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a fraud number identification apparatus according to an embodiment of the present invention;
FIG. 9 is an internal structural diagram of a computer device for executing a fraud number identification method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, a flow chart of steps of a fraud number identification method provided in the embodiment of the present invention specifically includes the following steps:
Step S102, obtaining the communication characteristic information of the number to be identified.
In the embodiment of the present invention, the communication characteristic information generally includes one or more of base station data, call data, short message data and traffic data. Specifically, the base station data includes information such as the number's home location; the call data includes information such as the calling number, the called number, the location of the calling number, the device number used by the calling number and the call duration; the short message data includes information such as the calling number, the called number, the location of the calling number and all device numbers of the calling number; and the traffic data includes information such as the monthly traffic volume and the apps that generated the traffic.
In the embodiment of the present invention, preferably, in consideration of the timeliness required for fraud telephone identification, the communication characteristic information is generally taken from all records of the month with the highest spending in the last half year and all records of the last month.
In the embodiment of the invention, the original communication characteristic information is usually stored as dictionary data, and a feature vectorization technique is used to vectorize the dictionary data structure, which saves a large amount of space for sparse matrices and categorical variables. For example, for the feature "traffic volume (MB) used by each of a user's apps in the current month", treating the traffic of each app as one feature yields tens of thousands of dimensions in total, and apps that a user rarely uses produce a large number of feature values equal to 0. Storing such a wide table as a full matrix consumes a large amount of memory, which feature vectorization effectively saves.
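The memory saving described above can be illustrated with a minimal sketch. It assumes each number's per-app traffic arrives as a dictionary; the function names `fit_vocabulary` and `vectorize` and the feature names such as `app_video_mb` are hypothetical and do not come from the patent. Only non-zero values are stored, as (column, value) pairs.

```python
from typing import Dict, List, Tuple


def fit_vocabulary(records: List[Dict[str, float]]) -> Dict[str, int]:
    """Assign a column index to every feature name seen in the records."""
    vocab: Dict[str, int] = {}
    for record in records:
        for name in sorted(record):
            vocab.setdefault(name, len(vocab))
    return vocab


def vectorize(record: Dict[str, float], vocab: Dict[str, int]) -> List[Tuple[int, float]]:
    """Return sorted (column, value) pairs, keeping only non-zero known features."""
    return sorted(
        (vocab[name], value)
        for name, value in record.items()
        if name in vocab and value != 0
    )


# Hypothetical per-app monthly traffic (MB) for two numbers.
records = [
    {"app_video_mb": 1200.0, "app_chat_mb": 35.5},
    {"app_chat_mb": 4.0, "app_maps_mb": 0.0},
]
vocab = fit_vocabulary(records)
sparse_rows = [vectorize(r, vocab) for r in records]
```

With tens of thousands of app columns, each row holds only the handful of apps a user actually used instead of one slot per app.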
Step S104, processing the communication characteristic information according to a preset fraud number identification model to generate a fraud number identification result.
In the embodiment of the invention, the preset fraud number recognition model is generated by training a self-training classification algorithm based on semi-supervised learning in advance.
In the embodiment of the invention, compared with a conventional machine learning algorithm, the self-training classification algorithm (self-training) of semi-supervised learning can train a good identification model without depending heavily on labeled samples, and therefore adapts well to the field of fraud telephone identification, where labeled samples are few. For the steps of training and generating the fraud number recognition model based on the self-training classification algorithm, refer to fig. 2 and its explanation.
According to the fraud number identification method provided by the embodiment of the invention, after the communication characteristic information of the number to be identified, such as base station data, call data, short message data and traffic data, is obtained, the communication characteristic information is directly processed according to the preset fraud number identification model to generate a fraud number identification result, wherein the preset fraud number identification model is generated by training a self-training classification algorithm based on semi-supervised learning in advance.
As shown in fig. 2, a flowchart of the steps for training and generating a fraud number recognition model provided in the embodiment of the present invention specifically includes the following steps:
Step S202, obtaining a labeled data set and an unlabeled data set.
In the embodiment of the present invention, the labeled data set comprises a plurality of labeled sample numbers carrying both fraud number identification result information and communication characteristic information, and usually comprises some positive sample numbers confirmed as fraud numbers and a plurality of negative sample numbers of normal users. The unlabeled data set comprises a plurality of unlabeled sample numbers that carry only communication characteristic information and for which it is not known whether they are fraud numbers. In the field of fraud number identification, the unlabeled data set is usually much larger than the labeled data set. A conventional machine learning algorithm, however, can be trained only on the labeled data set, and because the number of positive samples confirmed as fraud numbers is in most cases insufficient, the trained recognition model is not very accurate in actual use.
Step S204, determining the labeled data set as a training set, and training a fraud number recognition teacher model with the optimal current recognition effect based on a preset training rule.
In the embodiment of the invention, the preset training rule is usually a neural network model algorithm. Given the labeled data set as the training set, the fraud number recognition teacher model with the optimal current recognition effect can be determined with a conventional neural network model algorithm. Given the limited number of samples in the labeled data set, however, the teacher model that performs best on the current training set is not guaranteed to classify equally well in actual application.
Step S206, identifying the unlabeled data set according to the fraud number recognition teacher model, and determining the fraud result prediction probabilities of the plurality of unlabeled sample numbers.
In the embodiment of the invention, the fraud number recognition teacher model is further used to identify the unlabeled data set, and its final output layer takes a softmax form, so that the result of processing each unlabeled sample number is a fraud result prediction probability P, where P ∈ [0, 1].
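As a small illustration of why a softmax output layer yields a value in [0, 1], the sketch below converts two raw scores into a fraud probability. The two-element layout `[normal_score, fraud_score]` and the function names are assumptions made for this example; the patent does not specify the layout of the output layer.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: outputs are non-negative and sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fraud_probability(logits: List[float]) -> float:
    """Assumed layout: logits = [normal_score, fraud_score]; returns P(fraud)."""
    return softmax(logits)[1]


p_equal = fraud_probability([0.0, 0.0])     # equal scores give an even split
p_fraudish = fraud_probability([-2.0, 3.0])  # higher fraud score pushes P toward 1
```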
Step S208, updating the unlabeled sample numbers whose fraud result prediction probability exceeds the preset confidence threshold to the pseudo-label data set.
In the embodiment of the present invention, the fraud result prediction probability P describes the probability that an unlabeled sample number is a fraud number: the closer P is to 1, the more likely the number is a fraud number, and the closer P is to 0, the less likely. A confidence threshold a is therefore set, and unlabeled sample numbers whose prediction probability satisfies the preset condition are updated to the pseudo-label data set. In general, the confidence interval determined by the confidence threshold a covers both ends of the probability range, [0, a] ∪ [a, 1]. When P falls within the confidence interval, the predicted result for the unlabeled sample number can be considered credible (the number is very likely a normal number or very likely a fraud number), and such unlabeled sample numbers are updated to the pseudo-label data set.
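The selection of confident samples can be sketched as follows. The cut-offs `low` and `high` are hypothetical parameters standing in for the two ends of the confidence interval derived from the threshold a; the function name and the example numbers are likewise illustrative only.

```python
from typing import Dict, List, Tuple


def select_pseudo_labels(
    probabilities: Dict[str, float], low: float, high: float
) -> Tuple[Dict[str, int], List[str]]:
    """Split unlabeled numbers into confident pseudo labels and the rest.

    P >= high -> pseudo label 1 (fraud); P <= low -> pseudo label 0 (normal).
    Numbers in between stay unlabeled for a later iteration.
    """
    pseudo: Dict[str, int] = {}
    remaining: List[str] = []
    for number, p in probabilities.items():
        if p >= high:
            pseudo[number] = 1
        elif p <= low:
            pseudo[number] = 0
        else:
            remaining.append(number)
    return pseudo, remaining


# Hypothetical teacher-model outputs for three unlabeled numbers.
probs = {"num_a": 0.97, "num_b": 0.03, "num_c": 0.55}
pseudo, remaining = select_pseudo_labels(probs, low=0.1, high=0.9)
```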
In the embodiment of the present invention, the confidence threshold a is the key to producing pseudo labels: a threshold that is too high may result in too many false negatives (FN) among the pseudo labels, while one that is too low may introduce false positives (FP). The confidence threshold a therefore usually needs to be adjusted adaptively in order to select the model with the highest accuracy. Fraud telephone identification scenarios usually suffer from sample imbalance, that is, normal numbers far outnumber fraud telephones, so precision and recall alone cannot comprehensively measure identification accuracy. Taking this class imbalance into account, the scheme measures identification accuracy with the weighted F1-score and uses it for model selection.
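For reference, a weighted F1-score can be computed as the per-class F1 averaged with each class weighted by its share of the true labels. The sketch below follows that common definition; the patent does not spell out its exact formula, so this should be read as an assumption.

```python
from typing import List


def weighted_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Average per-class F1, weighting each class by its number of true samples."""
    total = len(y_true)
    score = 0.0
    for cls in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * (tp + fn) / total  # (tp + fn) is the class support
    return score


# Imbalanced toy example: four normal numbers (0) and one fraud number (1).
score = weighted_f1([0, 0, 0, 0, 1], [0, 0, 0, 1, 1])
```

In a model-screening loop, candidate values of the confidence threshold would each produce a model, and the one with the highest score on held-out labeled data would be kept.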
Step S210, determining the labeled data set and the pseudo-label data set as a new training set, and training and generating a fraud number recognition student model according to the preset training rule.
In the embodiment of the invention, the pseudo-label data set is added to the labeled data set to form a new, expanded training set, and the fraud number recognition student model is generated from this larger sample according to the training rule; the recognition effect of the student model should be better than that of the teacher model.
Step S212, judging whether a preset training end condition is satisfied. When it is judged that the preset training end condition is not satisfied, step S214 is performed; when it is judged that the preset training end condition is satisfied, step S216 is performed.
In the embodiment of the present invention, the preset training end condition is usually based on the number of iterations; alternatively, whether a fraud number recognition student model exists whose recognition effect is better than that of the teacher model may be used as the condition. When the training end condition is not satisfied, further iteration is needed and step S214 is executed; when it is satisfied, the fraud number recognition student model is the fraud number recognition model obtained by training.
Step S214, determining the fraud number recognition student model as a new fraud number recognition teacher model and returning to step S206.
In the embodiment of the invention, the fraud number recognition student model becomes the new fraud number recognition teacher model and processes the unlabeled data set again; at this point, the unlabeled data already updated to the pseudo-label data set are removed from the unlabeled data set.
Step S216, determining the fraud number recognition student model as the fraud number recognition model generated by training.
In the embodiment of the present invention, the detailed steps of training and generating a fraud number recognition model based on the self-training classification algorithm (self-training) are provided above. Further, considering that the self-training classification algorithm is prone to model overfitting, the present invention also provides an improved self-training classification algorithm to address this problem; refer to fig. 3 and its explanation.
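The teacher/student loop of steps S202 to S216 can be sketched end to end. This is a minimal illustration under stated assumptions, not the patented implementation: a nearest-centroid classifier stands in for the neural network training rule, `low` and `high` are hypothetical cut-offs for the confidence interval, and the end condition is a fixed iteration count.

```python
import math
from typing import Dict, List, Tuple

Features = List[float]


def train(data: List[Tuple[Features, int]]) -> Dict[int, Features]:
    """Stand-in training rule: one centroid per class (0 = normal, 1 = fraud)."""
    model: Dict[int, Features] = {}
    for label in (0, 1):
        rows = [x for x, y in data if y == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def predict_proba(model: Dict[int, Features], x: Features) -> float:
    """P(fraud): approaches 1 as x gets closer to the fraud centroid."""
    d_normal = math.dist(x, model[0])
    d_fraud = math.dist(x, model[1])
    return 1.0 / (1.0 + math.exp(d_fraud - d_normal))


def self_train(labeled, unlabeled, low: float, high: float, max_iters: int):
    teacher = train(labeled)                   # S204: initial teacher
    pseudo: List[Tuple[Features, int]] = []
    pool = list(unlabeled)
    student = teacher
    for _ in range(max_iters):                 # S212: iteration-count end condition
        remaining = []
        for x in pool:                         # S206: score the unlabeled samples
            p = predict_proba(teacher, x)
            if p >= high:                      # S208: confident fraud
                pseudo.append((x, 1))
            elif p <= low:                     # S208: confident normal
                pseudo.append((x, 0))
            else:
                remaining.append(x)
        pool = remaining                       # pseudo-labeled samples leave the pool
        student = train(labeled + pseudo)      # S210: train the student
        teacher = student                      # S214: student becomes the teacher
    return student                             # S216: final model


# One-dimensional toy data (hypothetical feature values).
labeled = [([0.0], 0), ([1.0], 0), ([9.0], 1), ([10.0], 1)]
unlabeled = [[0.5], [2.0], [8.0], [9.5], [5.2]]
model = self_train(labeled, unlabeled, low=0.2, high=0.8, max_iters=3)
```

In this toy run the ambiguous sample 5.2 never reaches either cut-off and stays unlabeled, while the other four samples are absorbed into the training set in the first iteration.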
FIG. 3 is a flowchart of another procedure for training a fraud number recognition model according to an embodiment of the present invention, which is described in detail below.
In the embodiment of the present invention, the difference from the flowchart of the steps of training the fraud number recognition model shown in fig. 2 is that step S210 specifically includes:
Step S302, determining the labeled data set and the pseudo-label data set as a new training set, and training and generating a fraud number recognition student model according to the preset training rule and a preset noise adding rule.
In the embodiment of the invention, the improved self-training classification algorithm introduces random noise into the process of training and generating the student model. Specifically, after the new training set is determined, the fraud number recognition student model is generated according to the preset training rule and the preset noise adding rule. The preset noise adding rule comprises a data noise adding rule, which adds noise to the training set, and a model noise adding rule, which adds noise to the fraud number recognition student model. Generally, the model noise adding rule comprises one or more of dropout, stochastic depth and random augmentation, while data noise mainly involves corrupting the data, for example deleting or modifying a certain proportion of the sample data. The dropout, stochastic depth and random augmentation rules are not described in detail herein.
In the embodiment of the invention, adding data noise effectively improves the generalization capability of the model, and adding model noise further improves its robustness, thereby addressing the model overfitting to which the conventional self-training classification algorithm is prone.
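The two kinds of noise can be sketched in a few lines. Both functions are illustrative assumptions: `add_data_noise` deletes a proportion of rows and perturbs one feature in a proportion of the survivors with Gaussian noise, and `dropout` is the usual inverted dropout applied to a list of activations. The parameter names and the choice of Gaussian perturbation are not taken from the patent.

```python
import random
from typing import List


def add_data_noise(
    rows: List[List[float]],
    drop_ratio: float,
    perturb_ratio: float,
    scale: float,
    rng: random.Random,
) -> List[List[float]]:
    """Data noise: drop some rows, then perturb one feature in some survivors."""
    kept = [list(row) for row in rows if rng.random() >= drop_ratio]
    for row in kept:
        if rng.random() < perturb_ratio:
            i = rng.randrange(len(row))
            row[i] += rng.gauss(0.0, scale)
    return kept


def dropout(activations: List[float], rate: float, rng: random.Random) -> List[float]:
    """Model noise (inverted dropout): zero units with probability `rate`
    and rescale the survivors so the expected activation is unchanged."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
noisy_rows = add_data_noise(rows, drop_ratio=0.2, perturb_ratio=0.5, scale=0.1, rng=rng)
masked = dropout([1.0] * 8, rate=0.5, rng=rng)
```

Noise of this kind is applied only while training the student; the teacher scores the unlabeled data without noise in the self-training variants this scheme resembles.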
As shown in FIG. 4, a flow chart of steps of another fraud number identification method provided by the embodiment of the invention is described in detail as follows.
In the embodiment of the present invention, the difference from the flow chart of steps of the fraud number identification method shown in fig. 1 is that after step S104, the method further includes:
Step S402, judging whether the calling device number of the number to be identified meets the preset fraud characteristics.
In the embodiment of the invention, in addition to identifying the number to be identified with the fraud number identification model, the invention provides a scheme that further judges whether the number is a fraud number from its calling device number. The scheme is based on a study of the undirected connected graphs formed by calling numbers and calling device numbers in the call records of normal users and abnormal users. For this purpose, the communication characteristic information of the number to be identified also comprises the calling device number.
Step S404, when it is judged that the calling device number of the number to be identified meets the preset fraud characteristics, confirming the number to be identified as a fraud number.
In the embodiment of the present invention, the calling number and the calling device number of a normal user are generally in a one-to-one relationship and, in a few cases, a one-to-few relationship. In the undirected connected graph formed by the calling numbers and calling device numbers of abnormal users, by contrast, the number of nodes can reach several thousand, that is, there are a large number of one-to-many relationships. The undirected connected graphs of the numbers and calling devices of the normal user group and the abnormal user group are shown in fig. 5. Whether the number to be identified is a fraud number can therefore be further judged by checking whether its calling device number meets the preset fraud characteristics, and this result can be combined with the result of the fraud number identification model to determine fraud numbers, further improving the accuracy of the determination.
As shown in fig. 5(a) and 5(b), the undirected connected graphs of the numbers and calling devices of the normal user group and the abnormal user group are respectively described as follows.
In the embodiment of the present invention, as shown in fig. 5(a), in the undirected connected graph formed by the calling numbers and calling device numbers in normal users' call records, the number of nodes in each connected component does not exceed 3, that is, almost every calling number corresponds to one or a few device numbers.
In the embodiment of the present invention, as shown in fig. 5(b), in the undirected connected graph formed by the calling numbers and calling device numbers in abnormal users' call records, the number of nodes can reach several thousand, and different groups present different connection patterns.
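One way to turn this observation into a check is to build the undirected graph of calling numbers and device numbers and measure the size of each connected component. The sketch below does this with a breadth-first search; the function names are hypothetical, and treating "more than 3 nodes" as the preset fraud characteristic is an assumption drawn from the normal-user figure above, not a rule stated by the patent.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def component_sizes(call_records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Map each calling number to the node count of its connected component.

    Nodes are calling numbers and device numbers; each (number, device)
    call record contributes one undirected edge.
    """
    graph = defaultdict(set)
    for number, device in call_records:
        n, d = ("num", number), ("dev", device)
        graph[n].add(d)
        graph[d].add(n)

    sizes: Dict[str, int] = {}
    seen = set()
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        component = [start]
        queue = deque([start])
        while queue:  # breadth-first search over one component
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.append(neighbour)
                    queue.append(neighbour)
        for kind, value in component:
            if kind == "num":
                sizes[value] = len(component)
    return sizes


def suspicious_numbers(call_records, max_nodes: int = 3) -> Set[str]:
    """Flag numbers whose component has more nodes than `max_nodes` (assumed rule)."""
    return {n for n, size in component_sizes(call_records).items() if size > max_nodes}


# Hypothetical call records as (calling number, device number) pairs.
records = [
    ("A", "d1"), ("A", "d2"),                               # 3 nodes: normal pattern
    ("B", "d3"), ("C", "d3"), ("C", "d4"), ("D", "d4"), ("D", "d5"),  # 6 nodes
]
sizes = component_sizes(records)
flagged = suspicious_numbers(records)
```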
Fig. 6 is a flow chart showing the steps of still another fraud number identification method provided by the embodiment of the present invention, which is described in detail as follows.
In the embodiment of the present invention, the difference from the flow chart of steps of the fraud number identification method shown in fig. 1 is that after step S104, the method further includes:
Step S602, after the number to be identified is determined to be a fraud number, determining a victim according to the communication characteristic information and a preset victim identification rule.
In the embodiment of the invention, in order to make full use of the fraud telephone identification results and to serve network environment governance and detection, the victim group and the degree of victimization can be obtained from the communication characteristic information of the fraud telephone by applying the victim group identification rule, as can the suspected victim group closely related to the victims. The specific implementation is shown in fig. 7.
As shown in fig. 7, a flowchart of steps of a method for identifying a victim group according to an embodiment of the present invention specifically includes the following steps:
Step S702, determining a sensitive number having communication interaction behavior with the fraud number according to the communication characteristic information.
In the embodiment of the invention, the number which has communication interaction with the fraud number is determined as the sensitive number according to the communication characteristic information.
Step S704, judging whether the sensitive number belongs to a victim group according to the call duration and the number of calls between the sensitive number and the fraud number, and according to whether the sensitive number has communication interaction with other fraud numbers.
In the embodiment of the present invention, whether the sensitive number belongs to the victim group is judged from the call duration and the number of calls between the sensitive number and the fraud number, together with whether the sensitive number has communication interaction with other fraud numbers. Usually, thresholds are set for the call duration and the number of calls, and a sensitive number whose call duration or number of calls exceeds its threshold is determined to belong to the victim group. A sensitive number that also has communication interaction with other fraud numbers may likewise be determined to belong to the victim group. The degree of victimization can be evaluated by setting different thresholds, and the final result can be used for subsequent network environment governance, for example anti-fraud education for victims with a higher degree of victimization.
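The victim rule above can be sketched as a simple filter. The `Contact` record, its field names and the threshold values are hypothetical; the sketch only mirrors the three conditions described: call duration over a threshold, number of calls over a threshold, or interaction with more than one fraud number.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Contact:
    """Aggregated interaction between one sensitive number and one fraud number."""
    number: str          # the sensitive number
    fraud_number: str    # the fraud number it interacted with
    total_seconds: float
    call_count: int


def identify_victims(contacts: List[Contact], max_seconds: float, max_calls: int) -> Set[str]:
    """Return sensitive numbers that satisfy any of the three victim conditions."""
    fraud_peers = defaultdict(set)
    for c in contacts:
        fraud_peers[c.number].add(c.fraud_number)

    victims: Set[str] = set()
    for c in contacts:
        long_call = c.total_seconds > max_seconds
        many_calls = c.call_count > max_calls
        multiple_fraud_peers = len(fraud_peers[c.number]) > 1
        if long_call or many_calls or multiple_fraud_peers:
            victims.add(c.number)
    return victims


# Hypothetical aggregated records.
contacts = [
    Contact("u1", "f1", total_seconds=900.0, call_count=1),  # long call
    Contact("u2", "f1", total_seconds=20.0, call_count=1),   # brief, single contact
    Contact("u3", "f1", total_seconds=30.0, call_count=1),
    Contact("u3", "f2", total_seconds=25.0, call_count=1),   # second fraud number
]
victims = identify_victims(contacts, max_seconds=300.0, max_calls=3)
```

Running the same filter with progressively stricter thresholds gives a coarse grading of the degree of victimization.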
Fig. 8 is a schematic structural diagram of a fraud number identification apparatus provided in an embodiment of the present invention, which specifically includes the following structures:
A communication characteristic information obtaining unit 810, configured to obtain communication characteristic information of the number to be identified.
In the embodiment of the present invention, the communication characteristic information generally includes one or more of base station data, call data, short message data, and traffic data. Specifically, the base station data includes information such as the number's home location; the call data includes information such as the calling number, the called number, the location of the calling number, the device number used by the calling number, and the call duration; the short message data includes information such as the calling number, the called number, the location of the calling number, and all device numbers of the calling number; and the traffic data includes information such as the monthly traffic volume and the apps corresponding to that traffic.
In the embodiment of the present invention, preferably, in consideration of the timeliness required for fraud call identification, the communication characteristic information is generally acquired from all records of the month with the highest spending in the last six months and all records of the most recent month.
In the embodiment of the present invention, the original communication characteristic information is usually stored as dictionary data. A feature vectorization technique is used to vectorize this dictionary data structure when describing the communication characteristic information, which saves considerable space for sparse matrices and categorical variables. For example, consider the feature of each app's traffic usage (MB) by a user in the current month: if the traffic volume of each app is treated as one feature and stored in matrix form, tens of thousands of dimensions result in total, and apps that the user rarely uses produce a large number of feature values equal to 0. Storing the data as such a wide-table matrix consumes a large amount of memory, which the feature vectorization technique effectively saves.
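The memory saving described above can be illustrated with a small pure-Python sketch that maps dictionary records to sparse rows, storing only non-zero entries. The function names, the feature names, and the one-hot handling of string values are assumptions for illustration; the embodiment does not name a specific library. scikit-learn's `DictVectorizer` provides comparable behavior for readers who prefer an existing tool.

```python
def fit_vocabulary(records):
    """Assign a column index to every feature seen in the dictionary records.

    String-valued (categorical) features get one column per distinct value.
    """
    vocab = {}
    for rec in records:
        for name, value in rec.items():
            key = f"{name}={value}" if isinstance(value, str) else name
            vocab.setdefault(key, len(vocab))
    return vocab


def to_sparse_row(rec, vocab):
    """Return {column_index: value}, keeping only non-zero entries."""
    row = {}
    for name, value in rec.items():
        if isinstance(value, str):
            key, val = f"{name}={value}", 1.0
        else:
            key, val = name, float(value)
        if key in vocab and val != 0.0:
            row[vocab[key]] = val
    return row


# Hypothetical per-user records: home location plus per-app traffic in MB.
records = [
    {"home_region": "Shanghai", "app_wechat_mb": 512.0, "app_maps_mb": 20.5},
    {"home_region": "Beijing", "app_video_mb": 2048.0},
]
vocab = fit_vocabulary(records)
rows = [to_sparse_row(r, vocab) for r in records]

dense_cells = len(records) * len(vocab)       # cells a wide table would hold
stored_cells = sum(len(row) for row in rows)  # cells the sparse rows hold
```

With tens of thousands of app columns and only a few apps used per user, the gap between `dense_cells` and `stored_cells` grows accordingly.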
A fraud number recognition unit 820, configured to process the communication characteristic information according to a preset fraud number recognition model, and generate a fraud number recognition result.
In the embodiment of the present invention, the preset fraud number recognition model is generated in advance by training with a self-training classification algorithm based on semi-supervised learning.
In the embodiment of the present invention, compared with conventional machine learning algorithms, the semi-supervised self-training classification algorithm can train a better recognition model with less dependence on sample labels, and is therefore better suited to fraud call identification, where labeled samples are scarce. For the steps of training and generating the fraud number recognition model based on the self-training classification algorithm, refer to fig. 2 and its accompanying description.
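The teacher/student self-training loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `CentroidClassifier` is a toy stand-in for the recognition model, the two features (calls per day, average call seconds) are hypothetical, confident predictions of either class are pseudo-labeled, and the end condition (pseudo-labels unchanged or a round limit reached) is assumed. The sketch also does not verify that each student outperforms its teacher.

```python
import math


class CentroidClassifier:
    """Toy stand-in for the recognition model: scores by distance to class centroids."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):  # 0 = normal number, 1 = fraud number
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, x):
        """Estimated probability that x is a fraud number."""
        d0 = math.dist(x, self.centroids[0])
        d1 = math.dist(x, self.centroids[1])
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)


def self_train(labeled_X, labeled_y, unlabeled_X, threshold=0.8, max_rounds=5):
    # Train the initial teacher on the labeled data set only.
    teacher = CentroidClassifier().fit(labeled_X, labeled_y)
    student, previous = teacher, None
    for _ in range(max_rounds):
        # Pseudo-label the unlabeled samples the teacher is confident about.
        pseudo = {}
        for i, x in enumerate(unlabeled_X):
            p = teacher.predict_proba(x)
            if max(p, 1 - p) > threshold:
                pseudo[i] = 1 if p > 0.5 else 0
        # Train the student on labeled plus pseudo-labeled data.
        train_X = labeled_X + [unlabeled_X[i] for i in pseudo]
        train_y = labeled_y + list(pseudo.values())
        student = CentroidClassifier().fit(train_X, train_y)
        if pseudo == previous:  # assumed end condition: pseudo-labels are stable
            break
        previous = pseudo
        teacher = student       # the student becomes the new teacher
    return student


# Hypothetical features: [calls per day, average call seconds].
labeled_X = [[2, 180], [3, 200], [60, 20], [80, 15]]
labeled_y = [0, 0, 1, 1]
unlabeled_X = [[4, 170], [70, 25], [40, 100]]
model = self_train(labeled_X, labeled_y, unlabeled_X)
```

In practice the toy classifier would be replaced by a real model, and the confidence threshold would be tuned on held-out labeled data.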
According to the fraud number identification apparatus provided by the embodiment of the present invention, after communication characteristic information of the number to be identified, such as base station data, call data, short message data, and traffic data, is obtained, the communication characteristic information is processed directly according to the preset fraud number identification model to generate a fraud number identification result, where the preset fraud number identification model is generated in advance by training with a self-training classification algorithm based on semi-supervised learning.
FIG. 9 is a diagram illustrating an internal structure of a computer device in one embodiment. As shown in fig. 9, the computer apparatus includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may further store a computer program that, when executed by the processor, causes the processor to implement the fraud number identification method. The internal memory may also have a computer program stored therein, which when executed by the processor, causes the processor to perform a fraud number identification method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, the fraud number identification apparatus provided herein may be implemented in the form of a computer program that is executable on a computer device as shown in fig. 9. The memory of the computer device may store therein various program modules constituting the fraud number identification apparatus, such as the communication characteristic information acquisition unit 810 and the fraud number identification unit 820 shown in FIG. 8. The respective program modules constitute computer programs that cause the processors to execute the steps in the fraud number identification methods of the respective embodiments of the present application described in the present specification.
For example, the computer device shown in fig. 9 may execute step S102 by the communication characteristic information acquiring unit 810 in the fraud number identification apparatus shown in fig. 8; the computer device may perform step S104 through the fraud number identification unit 820.
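The division into program modules can be sketched in Python as two cooperating classes. The class names, the in-memory data store, and the toy scoring function are hypothetical and only illustrate how unit 810 (step S102) and unit 820 (step S104) might be wired together.

```python
from typing import Callable, Dict

Features = Dict[str, float]


class CommunicationFeatureAcquisitionUnit:
    """Sketch of unit 810: returns communication characteristic information."""

    def __init__(self, store: Dict[str, Features]):
        self._store = store  # hypothetical in-memory stand-in for operator data

    def acquire(self, number: str) -> Features:
        return self._store.get(number, {})


class FraudNumberRecognitionUnit:
    """Sketch of unit 820: applies a preset model to the acquired features."""

    def __init__(self, model: Callable[[Features], float], threshold: float = 0.5):
        self._model = model
        self._threshold = threshold

    def recognize(self, features: Features) -> bool:
        return self._model(features) > self._threshold


class FraudNumberIdentificationApparatus:
    def __init__(self, acquisition, recognition):
        self.acquisition = acquisition
        self.recognition = recognition

    def identify(self, number: str) -> bool:
        features = self.acquisition.acquire(number)  # corresponds to step S102
        return self.recognition.recognize(features)  # corresponds to step S104


def toy_model(features: Features) -> float:
    """Placeholder score; a trained recognition model would be used instead."""
    return 0.9 if features.get("calls_per_day", 0.0) > 50 else 0.1


store = {"13800000001": {"calls_per_day": 80.0}, "13800000002": {"calls_per_day": 2.0}}
apparatus = FraudNumberIdentificationApparatus(
    CommunicationFeatureAcquisitionUnit(store),
    FraudNumberRecognitionUnit(toy_model),
)
```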
In one embodiment, a computer device is proposed, the computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring communication characteristic information of a number to be identified; the communication characteristic information comprises one or more of base station data, call data, short message data, and traffic data;
processing the communication characteristic information according to a preset fraud number identification model to generate a fraud number identification result; the preset fraud number identification model is generated in advance by training with a self-training classification algorithm based on semi-supervised learning.
In one embodiment, a computer readable storage medium is provided, having a computer program stored thereon, which, when executed by a processor, causes the processor to perform the steps of:
acquiring communication characteristic information of a number to be identified; the communication characteristic information comprises one or more of base station data, call data, short message data, and traffic data;
processing the communication characteristic information according to a preset fraud number identification model to generate a fraud number identification result; the preset fraud number identification model is generated in advance by training with a self-training classification algorithm based on semi-supervised learning.
It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in the sequence indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the various embodiments may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the protection scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A fraud number identification method, comprising:
acquiring communication characteristic information of a number to be identified; the communication characteristic information comprises one or more of base station data, call data, short message data, and traffic data;
processing the communication characteristic information according to a preset fraud number identification model to generate a fraud number identification result; the preset fraud number identification model is generated in advance by training with a self-training classification algorithm based on semi-supervised learning;
the step of training and generating the preset fraud number recognition model specifically comprises:
acquiring a tagged data set and a non-tagged data set; the tagged data set comprises a plurality of tagged sample numbers carrying fraud number identification result information and communication characteristic information; the non-tag data set comprises a plurality of non-tag sample numbers carrying communication characteristic information;
determining the labeled data set as a training set and training a fraud number recognition teacher model with the optimal current recognition effect based on a preset training rule;
identifying the unlabeled data set according to the fraud number identification teacher model, and determining fraud result prediction probabilities of the unlabeled sample numbers;
updating the non-label sample numbers with the fraud result prediction probability exceeding a preset confidence coefficient threshold value into a pseudo-label data set;
determining the labeled data set and the pseudo-label data set as a new training set, and training and generating a fraud number recognition student model according to a preset training rule; the recognition effect of the fraud number recognition student model is superior to that of the fraud number recognition teacher model;
judging whether a preset training end condition is met;
when the preset training end condition is judged not to be met, determining the fraud number recognition student model as a new fraud number recognition teacher model, returning to the step of performing recognition processing on the unlabeled data set according to the fraud number recognition teacher model, and determining fraud result prediction probabilities of the plurality of unlabeled sample numbers;
and when the preset training end condition is judged to be met, determining the fraud number recognition student model generated by the current training as the preset fraud number recognition model.
2. The fraud number identification method of claim 1, wherein the preset fraud number identification model is generated in advance by training based on an improved self-training classification algorithm; the improved self-training classification algorithm introduces random noise information in the process of training and generating a student model;
the step of training and generating the fraud number recognition student model according to the preset training rule specifically comprises the following steps:
training and generating a fraud number recognition student model according to the preset training rule and a preset noise-adding rule.
3. The fraud number identification method of claim 2, wherein the preset noise-adding rules comprise data noise-adding rules for adding noise to a training set and model noise-adding rules for adding noise to a fraud number recognition student model; the model noise-adding rules comprise one or more of dropout, stochastic depth, and random augmentation.
4. The fraud number identification method of claim 1, wherein the communication feature information further includes a calling device number; after the step of processing the communication characteristic information according to the preset fraud number identification model to generate a fraud number identification result, the method further comprises the following steps:
judging whether the calling equipment number of the number to be identified meets the preset fraud characteristics or not;
and when the calling equipment number of the number to be identified meets the preset fraud characteristics, determining that the number to be identified is a fraud number.
5. The fraud number identification method according to claim 1, wherein after said step of processing said communication characteristic information according to a preset fraud number identification model to generate a fraud number identification result, further comprising:
and after the number to be identified is determined to be a fraud number, determining a victim group according to the communication characteristic information and a preset victim group identification rule.
6. The fraud number identification method of claim 5, wherein the step of determining the victim group according to the communication characteristic information and the preset victim group identification rules specifically comprises:
determining a sensitive number which has communication interaction with the fraud number according to the communication characteristic information;
and judging whether the sensitive number belongs to a victim group according to the call duration and the number of calls between the sensitive number and the fraud number, and according to the judgment result of whether the sensitive number has communication interaction with other fraud numbers.
7. A computer device, characterized in that it comprises a memory and a processor, said memory having stored therein a computer program which, when executed by said processor, causes said processor to carry out the steps of the fraud number identification method of any one of claims 1 to 6.
8. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when being executed by a processor, causes the processor to carry out the steps of the fraud number identification method of any one of claims 1 to 6.
CN202011176102.0A 2020-10-29 2020-10-29 Fraud number identification method and device, computer equipment and storage medium Active CN112291424B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011176102.0A CN112291424B (en) 2020-10-29 2020-10-29 Fraud number identification method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011176102.0A CN112291424B (en) 2020-10-29 2020-10-29 Fraud number identification method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112291424A CN112291424A (en) 2021-01-29
CN112291424B true CN112291424B (en) 2021-09-14

Family

ID=74373939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011176102.0A Active CN112291424B (en) 2020-10-29 2020-10-29 Fraud number identification method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112291424B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108986788A (en) * 2018-06-06 2018-12-11 国网安徽省电力有限公司信息通信分公司 A kind of noise robust acoustic modeling method based on aposterior knowledge supervision
CN111340086A (en) * 2020-02-21 2020-06-26 同济大学 Method, system, medium and terminal for processing unlabeled data

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5884289A (en) * 1995-06-16 1999-03-16 Card Alert Services, Inc. Debit card fraud detection and control system
CN109819127B (en) * 2019-03-08 2020-03-06 周诚 Method and system for managing crank calls
CN110113757A (en) * 2019-05-07 2019-08-09 中国联合网络通信集团有限公司 Fraudulent user recognition methods and system
CN110493476B (en) * 2019-07-17 2021-08-10 中移(杭州)信息技术有限公司 Detection method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant