CN109145554A

CN109145554A - Method and system for identifying abnormal users from keystroke characteristics based on a support vector machine

Info

Publication number: CN109145554A
Application number: CN201810763718.4A
Authority: CN
Inventors: 戴大蒙; 单鹏飞; 陆岚; 夏海江
Original assignee: Cangnan Institute Of Cangnan
Current assignee: Wenzhou University
Priority date: 2018-07-12
Filing date: 2018-07-12
Publication date: 2019-01-04

Abstract

The invention discloses a method and system for identifying abnormal users from keystroke characteristics based on a support vector machine. First sample behavior feature data is acquired while sample users type a preset sample; it includes the number of keystroke input errors of each error type, the keystroke speed, the average keystroke speed, the instantaneous keystroke speed, the keystroke accuracy, and the keystroke stability. These behavioral features, which reflect individual differences in how users type and thereby characterize user identity, form the behavior feature library. Missing behavior feature data is filled in by data completion before the data is used as the preset behavior feature sample library, so that the sample users' behavior feature data is more complete and the recognition rate is greatly improved compared with the prior art.

Description

A method and system for identifying abnormal users from keystroke characteristics based on a support vector machine

Technical Field

The invention relates to the field of pattern recognition, and in particular to a method and system for identifying abnormal users from keystroke characteristics based on a support vector machine.

Background

At present, passphrases, passwords and user names are the main means of user authentication on the Internet. The biggest problem with this mechanism is that personal private information is easily leaked. With the development of machine learning, deep learning and biometric authentication technologies, new approaches to Internet identity authentication have become available.

Biometric authentication uses each person's distinctive physiological information and characteristic behavioral information. Because this biometric information is highly distinguishing and unique, it can be used effectively to screen for abnormal users.

Because people differ in keystroke habits and personality, each person develops a unique keystroke pattern when entering a password or typing a piece of text. A keystroke pattern reflects the force, speed, pausing habits and so on with which a person strikes the keys, and these characteristics are difficult to imitate. Further research by a foundation in Washington, USA, has confirmed the uniqueness of people's keystroke characteristics, so a keystroke pattern can represent a user's identity. In practice, however, collecting statistics in real time makes it very likely that some data will be lost, which tends to lower the recognition rate.

Summary of the Invention

Therefore, the present invention provides a support-vector-machine-based method and system for identifying abnormal users from keystroke characteristics, which addresses the low recognition rate of keystroke-based user identification in the prior art.

An embodiment of the present invention provides a support-vector-machine-based method for identifying abnormal users from keystroke characteristics, comprising the following steps: acquiring behavior feature data of a user to be identified while the user types a preset sample; and identifying the behavior feature data of the user to be identified according to a preset classification model and a preset behavior feature sample library, generating an identification result. The preset behavior feature sample library is established through the following steps: acquiring first sample behavior feature data of sample users typing the preset sample; and performing data completion on the behavior feature data missing from the first sample behavior feature data to form second sample behavior feature data, the second sample behavior feature data being used as the preset behavior feature sample library.

Preferably, the first sample behavior feature data includes at least one of the following: the number of keystroke input errors of each error type, the keystroke speed, the average keystroke speed, the instantaneous keystroke speed, the keystroke accuracy, and the keystroke stability.

Preferably, the step of performing data completion on the behavior feature data missing from the first sample behavior feature data specifically includes: normalizing the first sample behavior feature data; determining, from the normalized first sample behavior feature data, whether behavior feature data of a sample user is missing; taking the sample users whose behavior feature data is not missing, together with their first sample behavior feature data, as a training set, and the sample users whose behavior feature data is missing, together with their first sample behavior feature data, as a test set, and determining the weights between the behavior features without missing data and the behavior feature with missing data according to the training set, the test set and a lasso regression model; and completing the missing behavior feature data according to the weights and the training set to form the second sample behavior feature data.

Preferably, the step of completing the missing behavior feature data according to the weights and the training set specifically includes: computing the feature value of the behavior feature with missing data according to the weights and the training set; and filling in the missing behavior feature data with that feature value.

Preferably, the weights between the non-missing behavior features and the missing feature are calculated by the following formula:

where J(W) is the loss function, X is the feature values of the non-missing behavior features in the training set, N is the number of sample users, y is the training-set feature value of the behavior feature whose data is missing, w is the weights between the non-missing behavior features and the missing feature, and α is a hyperparameter.

Preferably, in the step of completing the missing behavior feature data according to the weights and the training set to form the second sample behavior feature data, the feature value of the behavior feature with missing data is calculated by the following formula:

where the computed quantity is the feature value of the missing behavior feature in the test set, w is the weights between the non-missing behavior features and the missing feature, and b is the bias.
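The two formulas themselves are not reproduced in this text. As a sketch only, a standard lasso formulation that is consistent with the symbols defined above would read as follows. The 1/(2N) scaling, the per-sample index i (x_i being the i-th row of X), and the symbol \hat{y} for the completed feature value are assumptions, not taken from the source.

```latex
J(w) = \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - w^{\top}x_i - b\right)^{2} + \alpha\,\lVert w\rVert_{1},
\qquad
\hat{y} = w^{\top}x + b
```

The first expression is minimized over w and b on the training set; the second applies the fitted w and b to the non-missing features x of a test-set user.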

Preferably, after the step of completing the missing behavior feature data according to the weights and the training set to form the second sample behavior feature data, the method further includes:

performing a correlation analysis on the second sample behavior feature data to obtain an analysis result;

screening the second sample behavior features according to the analysis result; reducing the dimensionality of the screened data by principal component analysis; and using the dimension-reduced data as the user behavior feature sample library.

Preferably, the dimensionality of the screened data is reduced by the following formula:

where x⁽ⁱ⁾ is the feature vector in the current dimensionality, x⁽ⁱ⁾_approx is the feature vector after dimensionality reduction, α is a preset threshold, and m is the number of sample users.
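The formula itself is not reproduced in this text. One widely used PCA selection rule that involves exactly these symbols keeps the smallest number of principal components for which the average squared reconstruction error, relative to the total variation of the data, does not exceed the threshold. Assuming that rule is what is meant (this is not confirmed by the source), it can be written as:

```latex
\frac{\dfrac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)} - x^{(i)}_{approx}\right\rVert^{2}}
     {\dfrac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)}\right\rVert^{2}} \le \alpha
```

With α = 0.01, for example, this would correspond to retaining 99% of the variance.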

Preferably, the step of identifying the behavior feature data of the user to be identified according to the preset classification model and the preset behavior feature sample library and generating an identification result specifically includes:

acquiring the set of sample users of the preset behavior feature sample library; taking each sample user in the set in turn as the positive set and the other sample users as the negative set; obtaining identification results from the dimension-reduced behavior feature data of the user to be identified and the support vector machine classification model; sorting the identification results, and determining the identity of the sample user serving as the positive set for the largest identification result to be the identity of the user to be identified; determining whether the identity of the user to be identified belongs to the set of sample users; and, when it does not, determining that the user to be identified is an abnormal user.

Alternatively, the step includes: acquiring the set of sample users of the user behavior feature sample library; taking each sample user in the set in turn as the positive set and the other sample users as the negative set; performing identification with the dimension-reduced behavior feature data of the user to be identified and the support vector machine classification model to obtain identification results; sorting the identification results and determining whether the largest identification result is greater than a preset value; when the largest identification result is greater than the preset value, determining the identity of the sample user serving as the positive set to be the identity of the user to be identified; and, when the largest identification result is less than the preset value, determining that the user to be identified is an abnormal user.
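The thresholded variant of the decision rule can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `identify_user`, the dictionary of per-user scores (one score per one-vs-rest classifier) and the treatment of a score exactly equal to the preset value as abnormal are all assumptions, since the source only covers the "greater than" and "less than" cases.

```python
from typing import Dict, Optional


def identify_user(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Pick the sample user whose one-vs-rest classifier scored highest.

    scores maps each sample user (the positive set of one classifier) to the
    identification result that classifier produced for the user to be identified.
    Returns the matched user, or None when the user is judged abnormal.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    # Sort the identification results from largest to smallest.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_user, best_score = ranked[0]
    # Assumption: a score equal to the threshold is treated as abnormal.
    return best_user if best_score > threshold else None


known = identify_user({"user_a": 0.91, "user_b": 0.12}, threshold=0.5)
unknown = identify_user({"user_a": 0.31, "user_b": 0.12}, threshold=0.5)
```

Here `known` names the best-matching sample user, while `unknown` is `None`, which stands for "abnormal user".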

An embodiment of the present invention further provides a support-vector-machine-based system for identifying abnormal users from keystroke characteristics, comprising: a feature extraction module for the user to be identified, configured to acquire behavior feature data of the user to be identified typing a preset sample; and a classification module for the user to be identified, configured to classify the behavior feature data of the user to be identified according to a preset classification model and a preset behavior feature sample library and to generate a classification result. The preset behavior feature sample library is established through the following steps:

acquiring first sample behavior feature data of sample users typing the preset sample; and performing data completion on the behavior feature data missing from the first sample behavior feature data to form second sample behavior feature data, the second sample behavior feature data being used as the preset behavior feature sample library.

An embodiment of the present invention further provides a computer device, comprising at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the at least one processor to carry out the above support-vector-machine-based method for identifying abnormal users from keystroke characteristics.

An embodiment of the present invention further provides a computer-readable storage medium storing computer instructions that cause a computer to carry out the above support-vector-machine-based method for identifying abnormal users from keystroke characteristics.

The technical solution of the present invention has the following advantages:

1. In the support-vector-machine-based method and system for identifying abnormal users from keystroke characteristics provided by the embodiments of the present invention, first sample behavior feature data is acquired from sample users typing a preset sample, including the number of keystroke input errors of each error type, the keystroke speed, the average keystroke speed, the instantaneous keystroke speed, the keystroke accuracy and the keystroke stability. Using as the behavior feature library those behavioral features that better reflect individual differences in typing and characterize user identity greatly improves the rate at which the identity of the user to be identified is recognized.

2. In the support-vector-machine-based method and system for identifying abnormal users from keystroke characteristics provided by the embodiments of the present invention, missing behavior feature data is filled in by data completion before the data is used as the preset behavior feature sample library, so that the sample users' behavior feature data is more complete. The behavior feature data of the user to be identified typing the preset sample is classified according to the preset classification model and the preset behavior feature sample library; the classification result determines the user's identity, from which it is decided whether the user is abnormal, so the recognition rate is greatly improved compared with the prior art.

Brief Description of the Drawings

To illustrate the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flowchart of a specific example of the support-vector-machine-based method for identifying abnormal users from keystroke characteristics provided in an embodiment of the present invention;

FIG. 2 is a flowchart of a specific example of establishing the preset behavior feature sample library provided in an embodiment of the present invention;

FIG. 3 is a flowchart of an example of performing data completion on missing behavior feature data provided in an embodiment of the present invention;

FIG. 4 is a flowchart of an example of reducing the dimensionality of the completed data in an embodiment of the present invention;

FIG. 5 is a flowchart of a specific example of identifying the behavior feature data of the user to be identified according to the preset classification model and the preset behavior feature sample library and generating an identification result in an embodiment of the present invention;

FIG. 6 is a flowchart of another specific example of identifying the behavior feature data of the user to be identified according to the preset classification model and the preset behavior feature sample library and generating an identification result in an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of the support-vector-machine-based system for identifying abnormal users from keystroke characteristics provided in an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of the computer device provided in an embodiment of the present invention.

Detailed Description

The technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

In addition, the technical features involved in the different embodiments of the present invention described below can be combined with one another as long as they do not conflict.

Example 1

An embodiment of the present invention provides a support-vector-machine-based method for identifying abnormal users from keystroke characteristics. As shown in FIG. 1, the method includes the following steps:

Step S1: Acquire behavior feature data of the user to be identified typing a preset sample.

In the embodiment of the present invention, the behavior feature data may be the behavioral features of a user typing a preset sample through an input device such as a keyboard, and may include: the number of keystroke input errors of each error type, the keystroke speed, the average keystroke speed, the instantaneous keystroke speed, the keystroke accuracy and the keystroke stability. Based on analysis of the behavior feature data, keystroke input errors are divided into the following categories:

(1) Bad case: a keystroke error, for example typing ".The" as ".";

(2) Bad ordering: when typing a string of characters, a character is entered too early, for example typing "house" as "houes";

(3) Doublet: when typing a string, the same letter is struck twice, for example typing "home" as "homee";

(4) Other: other types of keystroke errors;

(5) RED: an obvious error is made while typing but is not corrected.

The above classification of keystroke input errors is only an illustration. In practical applications the error categories can be adjusted as needed, and the present invention is not limited in this respect.
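The categories above can be illustrated with a small Python sketch that compares one typed word against its target. Several points are assumptions rather than the patent's definitions: the function name `classify_error`, reading "Bad case" as a capitalization-only difference, and reading "Bad ordering" as a swap of two adjacent characters. The RED category is left out because it depends on whether the user later corrected the error, which a single word pair cannot show.

```python
def classify_error(target: str, typed: str) -> str:
    """Classify a single typed word against the word that should have been typed."""
    if typed == target:
        return "correct"
    # Assumed reading of "Bad case": only the capitalization differs.
    if typed.lower() == target.lower():
        return "bad_case"
    # Doublet: the typed word is the target with one character struck twice.
    if len(typed) == len(target) + 1:
        for i, ch in enumerate(target):
            if typed == target[:i + 1] + ch + target[i + 1:]:
                return "doublet"
    # Assumed reading of "Bad ordering": two adjacent characters are swapped.
    if len(typed) == len(target):
        for i in range(len(target) - 1):
            swapped = target[:i] + target[i + 1] + target[i] + target[i + 2:]
            if typed == swapped:
                return "bad_ordering"
    return "other"


examples = {
    ("house", "houes"): classify_error("house", "houes"),
    ("home", "homee"): classify_error("home", "homee"),
}
```

Counting the labels returned over a whole typing session would give the per-category error counts used as behavior features.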

In the embodiment of the present invention, the keystroke speed may be the number of correct characters the user enters per minute; the average keystroke speed may be expressed as the number of correct characters the user enters within a preset time period; the instantaneous keystroke speed is the user's keystroke speed at the current moment; the keystroke accuracy is the ratio of the number of correctly entered characters to the total number of characters entered; and the keystroke stability may be expressed as the variance of the user's keystroke speed together with the variance of the user's keystroke accuracy.
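These definitions translate directly into code. The following Python sketch is illustrative only: the function names, the representation of a typing session as a list of `(correct_chars, total_chars, seconds)` segments, and the use of the population variance (rather than the sample variance) are assumptions.

```python
from statistics import pvariance
from typing import List, Tuple


def keystroke_speed(correct_chars: int, seconds: float) -> float:
    """Number of correct characters entered per minute."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return correct_chars * 60.0 / seconds


def keystroke_accuracy(correct_chars: int, total_chars: int) -> float:
    """Ratio of correctly entered characters to all characters entered."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return correct_chars / total_chars


def keystroke_stability(segments: List[Tuple[int, int, float]]) -> Tuple[float, float]:
    """Variance of speed and variance of accuracy across typing segments."""
    speeds = [keystroke_speed(correct, secs) for correct, _, secs in segments]
    accuracies = [keystroke_accuracy(correct, total) for correct, total, _ in segments]
    return pvariance(speeds), pvariance(accuracies)


speed = keystroke_speed(correct_chars=120, seconds=60.0)
accuracy = keystroke_accuracy(correct_chars=90, total_chars=100)
stability = keystroke_stability([(100, 100, 60.0), (100, 100, 60.0)])
```

A user who types at an identical speed and accuracy in every segment has a stability of zero in both components; larger variances indicate less steady typing.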

Step S2: Identify the behavior feature data of the user to be identified according to the preset classification model and the preset behavior feature sample library, and generate an identification result.

The preset classification model in the embodiment of the present invention may be a support vector machine classification model. To improve the robustness of the model, a slack variable is introduced for each sample to control its influence on the solution, that is, sample points far from the center of the sphere are penalized. The constraint is shown in formula (1):

That is, the Euclidean distance from a sample user's feature vector x_i to the center of the sphere is less than or equal to the radius plus the slack variable.
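Formula (1) is not reproduced in this text. The description matches the constraint of the standard support vector data description, which is conventionally written with squared distances. Assuming that formulation, and introducing the symbols a for the center of the sphere, R for its radius and ξ_i for the slack variable of sample i (all three names are assumptions), the constraint would read:

```latex
\lVert x_i - a \rVert^{2} \le R^{2} + \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, N
```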

When the data is not linearly separable, the support vector machine first performs its computation in the low-dimensional space, then maps the input space to a high-dimensional feature space through a kernel function, and finally constructs the optimal separating hyperplane in that high-dimensional feature space, thereby separating nonlinear data that is hard to separate in the original plane. In the embodiment of the present invention, an optimized Gaussian kernel function is used for the nonlinear classification task, as shown in formula (2):

where K(x,y) is the similarity between behavior feature vectors x and y, ‖x-y‖² is the squared Euclidean distance, and σ is the variance of the feature vectors x and y.
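Formula (2) is not reproduced in this text, and the source does not say in what way the kernel is "optimized". For reference, the plain Gaussian kernel in its most common form is given below; the 2σ² normalization in the exponent is an assumption.

```latex
K(x, y) = \exp\!\left(-\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}}\right)
```

With this form K(x, x) = 1 for every x, and K(x, y) decays toward 0 as x and y move apart.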

The decision function of the support vector machine is shown in formula (3):

$$f_{SVD}(z;\alpha,R) = I\left(\lVert \phi(z) - a \rVert^{2} \le R^{2}\right) = I\Big(K(z,z) - 2\sum_{i} \alpha_i K(z,x_i) + \sum_{i,j} \alpha_i \alpha_j K(x_i,x_j) \le R^{2}\Big) \tag{3}$$

where a = ∑_i α_i φ(x_i) is the center of the sphere in feature space; K(z,z) is the similarity of z with itself, which equals 1 for the Gaussian kernel; K(z,x_i) is the similarity between z and the i-th sample; α_i is the weight of the i-th sample; and R² is the squared radius.
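The kernel expansion in formula (3) can be evaluated directly once the sample weights and the radius are known. The Python sketch below is illustrative: the function names are hypothetical, the Gaussian kernel uses the common 2σ² normalization (an assumption), and the weights α_i and R² are taken as given rather than trained.

```python
import math
from typing import Sequence


def gaussian_kernel(x: Sequence[float], y: Sequence[float], sigma: float) -> float:
    """Gaussian similarity exp(-||x - y||^2 / (2 * sigma^2)); scaling is an assumption."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def svdd_distance_sq(z, samples, alphas, sigma: float) -> float:
    """Squared feature-space distance from z to the center sum_i alpha_i * phi(x_i)."""
    k_zz = gaussian_kernel(z, z, sigma)  # equals 1 for the Gaussian kernel
    k_zx = sum(a * gaussian_kernel(z, x, sigma) for a, x in zip(alphas, samples))
    k_xx = sum(
        ai * aj * gaussian_kernel(xi, xj, sigma)
        for ai, xi in zip(alphas, samples)
        for aj, xj in zip(alphas, samples)
    )
    return k_zz - 2.0 * k_zx + k_xx


def svdd_accepts(z, samples, alphas, radius_sq: float, sigma: float) -> bool:
    """Indicator from formula (3): True when z lies inside the sphere."""
    return svdd_distance_sq(z, samples, alphas, sigma) <= radius_sq


samples = [[0.0, 0.0], [1.0, 0.0]]
alphas = [0.5, 0.5]
inside = svdd_accepts([0.5, 0.0], samples, alphas, radius_sq=0.5, sigma=1.0)
outside = svdd_accepts([10.0, 10.0], samples, alphas, radius_sq=0.5, sigma=1.0)
```

A point between the two samples falls inside the sphere, while a distant point does not, which is the behavior the indicator function describes.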

In a preferred embodiment, as shown in FIG. 2, the preset behavior feature sample library is established through the following steps:

Step S3: Acquire first sample behavior feature data of sample users typing the preset sample.

In the embodiment of the present invention, the first sample behavior feature data may be the behavioral features of a user typing the preset sample through an input device such as a keyboard. It includes the above-mentioned number of keystroke input errors of each error type, keystroke speed, average keystroke speed, instantaneous keystroke speed, keystroke accuracy and keystroke stability, together with the corresponding feature values. It is not limited to these, however, and other embodiments may include other behavioral features.

Step S4: Perform data completion on the behavior feature data missing from the first sample behavior feature data to form second sample behavior feature data, and use the second sample behavior feature data as the preset behavior feature sample library.

In a preferred embodiment, as shown in FIG. 3, the step of performing data completion on the behavior feature data missing from the first sample behavior feature data specifically includes:

Step S5: Normalize the first sample behavior feature data.

In the embodiment of the present invention, the feature values corresponding to each of the above behavior features are normalized so that they lie in the range 0 to 1, to facilitate subsequent processing.
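The source does not name the normalization method. A minimal sketch, assuming min-max scaling of one feature column and representing a missing value as `None` (both assumptions, as is the function name), could look like this:

```python
from typing import List, Optional


def min_max_normalize(column: List[Optional[float]]) -> List[Optional[float]]:
    """Scale the known values of one feature column into the range 0 to 1.

    Missing entries (None) are passed through unchanged so that a later step
    can still detect them.
    """
    known = [v for v in column if v is not None]
    if not known:
        return list(column)
    lo, hi = min(known), max(known)
    if hi == lo:
        # A constant column carries no spread; map every known value to 0.
        return [None if v is None else 0.0 for v in column]
    return [None if v is None else (v - lo) / (hi - lo) for v in column]


speeds = min_max_normalize([10.0, 20.0, 30.0])
with_gap = min_max_normalize([10.0, None, 30.0])
```

Keeping the `None` markers in place means the check for missing data in the next step can run on the normalized values.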

Step S6: Determine, from the normalized first sample behavior feature data, whether behavior feature data of a sample user is missing. In practical applications some data may be lost. For example, when entering a string a user first types "home" as "homme", then deletes it and retypes it as "hoem"; under the error classification above, the user's Bad ordering count is incremented once but the Doublet count is not, so data is lost. Moreover, in practice, collecting statistics in real time makes it very likely that some data will be lost.

Step S7: Take the sample users whose behavior feature data is not missing, together with their first sample behavior feature data, as the training set, and the sample users whose behavior feature data is missing, together with their first sample behavior feature data, as the test set, and determine the weights between the behavior features without missing data and the behavior feature with missing data according to the training set, the test set and the lasso regression model.

In the embodiment of the present invention, Lasso regression is used to complete the missing data. The behavior feature data of the sample users with no missing data is used as the training set, and that of the sample users with missing data as the test set. The behavior feature whose data is missing in the test set is denoted x_k, and its missing feature values in the test set are the values to be estimated; the corresponding values of x_k in the training set are denoted y, and the feature values of the remaining, non-missing behavior features are denoted X. The weights between the non-missing behavior features and the missing behavior feature are calculated by formula (4):

where J(W) is the loss function, X is the feature values of the non-missing behavior features in the training set, N is the number of sample users, y is the training-set feature value of the behavior feature whose data is missing, w is the weights between the non-missing behavior features and the missing feature, and α is a hyperparameter. That is, the weights w are those that minimize the loss function J(W). The feature value of the behavior feature with missing data is then calculated by formula (5):

where the computed quantity is the feature value of the missing behavior feature in the test set, w is the weights between the non-missing behavior features and the missing feature, and b is the bias.

Step S8: Complete the missing behavior feature data according to the weights and the training set to form the second sample behavior feature data.

In the embodiment of the present invention, the feature value of the behavior feature with missing data is computed from the above weights and the training set, and the missing behavior feature data is filled in with that feature value.
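Steps S7 and S8 can be sketched in plain Python. This is an illustration under stated assumptions, not the patented implementation: the lasso objective is taken in its common form (1/(2N)) Σ (y_i - w·x_i - b)² + α‖w‖₁, it is solved by coordinate descent with an unpenalized bias, and the function names and the toy data are hypothetical.

```python
from typing import List, Tuple


def soft_threshold(value: float, alpha: float) -> float:
    """Shrink value toward zero by alpha (the lasso proximal step)."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def fit_lasso(X: List[List[float]], y: List[float], alpha: float,
              n_iter: int = 200) -> Tuple[List[float], float]:
    """Fit weights w and bias b on the training set by coordinate descent."""
    n, d = len(X), len(X[0])
    # Centering the data lets the bias be recovered in closed form afterwards.
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    yc = [v - y_mean for v in y]
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                Xc[i][j] * (yc[i] - sum(w[k] * Xc[i][k] for k in range(d) if k != j))
                for i in range(n)
            ) / n
            z = sum(Xc[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    b = y_mean - sum(w[j] * x_mean[j] for j in range(d))
    return w, b


def impute(rows: List[List[float]], w: List[float], b: float) -> List[float]:
    """Estimate the missing feature for each test-set row as w . x + b."""
    return [sum(wj * xj for wj, xj in zip(w, row)) + b for row in rows]


# Training set: users with complete data; y is the feature missing elsewhere.
train_X = [[0.0, 0.3], [0.25, 0.9], [0.5, 0.1], [0.75, 0.6], [1.0, 0.4]]
train_y = [0.8 * row[0] + 0.1 for row in train_X]
weights, bias = fit_lasso(train_X, train_y, alpha=0.001)
# Test set: a user whose value for that feature was lost.
completed = impute([[0.5, 0.5]], weights, bias)
```

In this toy data the missing feature depends only on the first known feature, so the L1 penalty drives the second weight to zero, which is the feature-selecting behavior that distinguishes lasso from ordinary least squares.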

在一较佳实施例中，如图4所示，上述步骤S8对丢失的行为特征数据进行数据补全处理的步骤之后，该击键特征异常用户识别方法还包括：In a preferred embodiment, as shown in FIG. 4 , after the above-mentioned step S8 performs data completion processing on the lost behavior feature data, the method for identifying users with abnormal keystroke features further includes:

步骤S9：对第二样本行为特征数据进行相关性分析，得到分析结果。Step S9: Perform a correlation analysis on the behavior characteristic data of the second sample to obtain an analysis result.

步骤S10：根据分析结果，对第二样本行为特征进行筛选。Step S10: According to the analysis result, screen the behavior characteristics of the second sample.

在本发明实施例中，测试的样本用户为11个人共88个样本(每人8输入8段文本)作为训练数据，实际应用中较少的数据量和较多的特征会导致过拟合，为了缓解过拟合现象，构建样本的验证数，本发明实施例中，是使用相关性热图分析进行特征筛选，要利用交叉验证来验证数据是否过拟合。In the embodiment of the present invention, the sample users tested are 11 people, a total of 88 samples (each person 8 inputs 8 paragraphs of text) as training data, and in practical applications, a small amount of data and a large number of features will lead to overfitting, In order to alleviate the over-fitting phenomenon and construct the verification number of the samples, in the embodiment of the present invention, correlation heat map analysis is used for feature screening, and cross-validation is used to verify whether the data is over-fitting.

To address overfitting, this embodiment analyzes the five error features described above. Testing with the sample users showed that some features are strongly correlated. For example, when a user mistypes "home" as "hoeem", the error counts as both Bad ordering and Doublet, and these two error types often occur together. A correlation analysis is therefore performed among the five error features. In this embodiment the threshold is set to 0.5: when the correlation α between two behavior features satisfies |α| ≥ 0.5, the two features are considered strongly correlated. When α > 0.5, the two features are highly positively correlated and can substitute for each other; when α < -0.5, the two features are considered mutually inhibiting and can likewise substitute for each other. The behavior features are screened in this way. It should be noted that the threshold is not limited to this value and may take other values in other embodiments.
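A minimal sketch of this screening rule, assuming the correlation α is the Pearson coefficient and that, of any strongly correlated pair, the feature encountered first is kept. The feature name `other_error` and all counts are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant feature is treated as uncorrelated
    return cov / sqrt(var_a * var_b)


def screen_features(columns, threshold=0.5):
    """Keep a feature only if |alpha| < threshold against every feature already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical per-sample error counts.
columns = {
    "bad_ordering": [1, 2, 3, 4, 5],
    "doublet": [2, 4, 6, 8, 11],
    "other_error": [2, 5, 1, 5, 2],
}
kept = screen_features(columns)
```

Because |α| is compared with the threshold, strongly negative correlations are screened out in the same way as strongly positive ones, matching the rule described above.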

Step S11: Perform dimensionality reduction on the screened data through principal component analysis.

In this embodiment of the present invention, principal component analysis is applied to the features screened in step S10 to reduce the dimensionality and improve running efficiency. Principal component analysis is an unsupervised statistical method that typically uses an orthogonal transformation to convert a vector with correlated components into a vector with uncorrelated components. Geometrically, it transforms the original coordinate system into an orthogonal coordinate system in which the sample points are spread along multiple directions, and it reduces the dimensionality of multidimensional variables. Specifically, the screened data may be reduced in dimensionality by formula (6):

where x⁽ⁱ⁾ is the feature vector in the current dimension, x⁽ⁱ⁾_approx is the feature vector after dimensionality reduction, α is a preset threshold, and m is the number of sample users. In this embodiment the threshold is set to 0.01, but it is not limited to this value and may take other values in other embodiments.
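Formula (6) itself is not reproduced in this text. Given the symbols defined above, it plausibly corresponds to the standard reconstruction-error criterion for choosing the number of principal components; the following is a reconstruction under that assumption, not the patent's verbatim formula:

```latex
\frac{\dfrac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)} - x^{(i)}_{\mathrm{approx}}\right\rVert^{2}}
     {\dfrac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)}\right\rVert^{2}} \le \alpha
```

Under this reading, α = 0.01 means that the smallest number of components is chosen for which the average squared reconstruction error is at most 1% of the average squared norm of the data, that is, at least 99% of the variance is retained.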

Step S12: Use the dimensionality-reduced data as the user behavior feature sample library.

In this embodiment of the present invention, the dimensionality-reduced second sample behavior feature data is used as the preset behavior feature sample library.

In a preferred embodiment, as shown in FIG. 5, step S2 of identifying the behavior feature data of the user to be identified according to the preset classification model and the preset behavior feature sample library to generate an identification result specifically includes:

Step S211: Obtain the sample user set of the preset behavior feature sample library.

There are 11 sample users in this embodiment of the present invention, so the user set is U = {u₁, u₂, ……, u_n} (n = 11).

Step S212: Take each sample user in the sample user set in turn as the positive set, with the other sample users as the negative set.

Step S213: Obtain identification results according to the dimensionality-reduced behavior feature data of the user to be identified and the support vector machine classification model. In this embodiment, let x = {t₁, t₂, ……, t_m} be a user to be identified, where each t is a behavior feature. After dimensionality reduction there are 6 behavior features, so m = 6.

Step S214: Sort the identification results, and determine the identity of the sample user who served as the positive set for the maximum identification result as the identity of the user to be identified.

In this embodiment of the present invention, classification proceeds by treating, in turn, the behavior features of the sample users of one category as one class and the behavior features of all remaining sample users as the other class, so that samples of k categories yield k support vector machine classifiers. In this embodiment there are 11 categories to separate (that is, 11 labels), namely U = {u₁, u₂, ……, u_n} (n = 11), so the training sets are extracted as follows:

1) the vectors corresponding to u₁ form the positive set, and the vectors corresponding to {u₂, ……, u_n} (n = 11) form the negative set;

2) the vectors corresponding to u₂ form the positive set, and the vectors corresponding to {u₁, u₃, ……, u_n} (n = 11) form the negative set;

3) the vectors corresponding to u₃ form the positive set, and the vectors corresponding to {u₁, u₂, u₄, ……, u_n} (n = 11) form the negative set;

4) the vectors corresponding to u₄ form the positive set, and the vectors corresponding to {u₁, u₂, u₃, u₅, ……, u_n} (n = 11) form the negative set, and so on.

5) the eleven training sets are used to train eleven classifiers, which produce eleven classification results f₁(x), f₂(x), f₃(x), ……, f₁₁(x); the largest of the eleven values is taken as the classification result, and the identity of the sample user who served as the corresponding positive set is determined as the identity of the user to be identified.
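The one-versus-rest construction in items 1) to 5) can be sketched as follows. The sketch only builds the positive/negative labels and applies the maximum rule; training the support vector machines themselves is left out, and the decision values f_k(x) are supplied as hypothetical numbers. All function and variable names are assumptions.

```python
def one_vs_rest_labels(sample_owners, positive_user):
    """Label samples of positive_user as +1 and all other samples as -1."""
    return [1 if owner == positive_user else -1 for owner in sample_owners]


def build_training_sets(sample_owners):
    """Return {user: labels}, one binary labelling per sample user."""
    users = sorted(set(sample_owners))
    return {user: one_vs_rest_labels(sample_owners, user) for user in users}


def identify(decision_values):
    """Pick the sample user whose classifier gives the largest f_k(x)."""
    return max(decision_values, key=decision_values.get)


# Hypothetical owners of four training samples and hypothetical decision values.
sample_owners = ["u1", "u1", "u2", "u3"]
training_sets = build_training_sets(sample_owners)
decision_values = {"u1": -0.3, "u2": 0.7, "u3": 0.1}
identity = identify(decision_values)
```

With 11 sample users, `build_training_sets` would return 11 labellings, one for each classifier f₁ to f₁₁.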

Step S215: Determine whether the identity of the user to be identified belongs to the sample user set.

Step S216: When the identity of the user to be identified does not belong to the sample user set, determine that the user to be identified is an abnormal user.

In this embodiment of the present invention, when the user to be identified is one of the above sample users, for example sample user u₁, but the identity produced by the classifier is sample user u₂, the keystroke behavior of the user to be identified is abnormal and the user is determined to be an abnormal user.
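A minimal sketch of this decision, assuming the expected identity of the user is known in advance (for example from the logged-in account, which the patent does not specify). The name `claimed_user` and the decision values are hypothetical.

```python
def is_abnormal_user(claimed_user, decision_values):
    """Return True when the best-scoring sample user differs from the claimed one.

    A claimed user who is not among the sample users can never match,
    so such a user is also reported as abnormal.
    """
    predicted_user = max(decision_values, key=decision_values.get)
    return predicted_user != claimed_user


# Hypothetical decision values f_k(x) for three sample users.
decision_values = {"u1": -0.4, "u2": 0.9, "u3": 0.1}
abnormal = is_abnormal_user("u1", decision_values)
```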

In another embodiment, as shown in FIG. 6, step S2 of identifying the behavior feature data of the user to be identified according to the preset classification model and the preset behavior feature sample library to generate an identification result specifically includes:

Step S221: Obtain the sample user set of the user behavior feature sample library.

In this embodiment of the present invention there are 11 sample users, so the user set is U = {u₁, u₂, ……, u_n} (n = 11).

Step S222: Take each user sample in the user set in turn as the positive set, with the other user samples as the negative set.

Step S223: Perform identification according to the dimensionality-reduced behavior feature data of the user to be identified and the support vector machine classification model to obtain identification results. In this embodiment, let x = {t₁, t₂, ……, t_m} be a user to be identified, where each t is a behavior feature. After dimensionality reduction there are 6 behavior features, so m = 6.

Step S224: Sort the identification results and determine whether the maximum value among them is greater than a preset value.

In this embodiment of the present invention, classification proceeds by treating, in turn, the behavior features of the sample users of one category as one class and the behavior features of all remaining sample users as the other class, so that samples of k categories yield k support vector machine classifiers. In this embodiment there are 11 categories to separate (that is, 11 labels), namely U = {u₁, u₂, ……, u_n} (n = 11), so the eleven training sets are extracted in the same way as in items 1) to 4) of the preceding embodiment.

5) The eleven training sets are used to train eleven classifiers, which produce eleven classification results f₁(x), f₂(x), f₃(x), ……, f₁₁(x). The eleven identification results are sorted, and it is determined whether the maximum value among them is greater than a preset value. In this embodiment the preset value is 0.8, but it is not limited to this value and may be set according to the application scenario in other embodiments.

Step S225: When the maximum value among the identification results is greater than the preset value, determine the identity of the sample user who served as the corresponding positive set as the identity of the user to be identified.

Step S226: When the maximum value among the identification results is less than the preset value, determine that the user to be identified is an abnormal user.

In this embodiment of the present invention, after the behavior features of the user to be identified have been classified by the classifier against the behavior feature sample library, a preset value must be set for the classification result. Only when the result is greater than the preset value can it be determined which sample user the identity belongs to; when the result is less than the preset value, the user is an abnormal user.
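The thresholded variant can be sketched as below. The sketch assumes the classification results are scores on a scale where 0.8 is meaningful (for example probability-like outputs), and it treats a score exactly equal to the preset value as abnormal, a case the text leaves open. The function name and the scores are hypothetical.

```python
def identify_with_threshold(decision_values, preset_value=0.8):
    """Return the best-scoring sample user, or None for an abnormal user."""
    best_user = max(decision_values, key=decision_values.get)
    if decision_values[best_user] > preset_value:
        return best_user
    return None  # no classifier is confident enough: abnormal user


# Hypothetical scores for a known user and for an unknown user.
known = identify_with_threshold({"u1": 0.05, "u2": 0.93, "u3": 0.10})
unknown = identify_with_threshold({"u1": 0.35, "u2": 0.41, "u3": 0.30})
```

Compared with the plain maximum rule, the threshold allows a user who matches none of the sample users to be rejected instead of being assigned to the nearest one.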

The method for identifying users with abnormal keystroke characteristics based on a support vector machine provided by this embodiment of the present invention acquires first sample behavior feature data of a preset sample entered by sample users, including the number of errors of each type in the typed characters, keystroke speed, average keystroke speed, instantaneous keystroke speed, keystroke accuracy, and keystroke stability. Behavior features that better reflect individual differences in users' keystrokes and characterize user identity are used as the behavior feature library. In addition, the missing behavior feature data is completed before the data is used as the preset behavior feature sample library, so that the sample users' behavior feature data is more complete and the recognition rate is greatly improved compared with the prior art.

Embodiment 2

This embodiment of the present invention provides a system for identifying users with abnormal keystroke characteristics based on a support vector machine. As shown in FIG. 7, the system includes:

A behavior feature extraction module 1 for the user to be identified, configured to acquire behavior feature data of the user to be identified entering a preset sample. This module performs the method of step S1 in Embodiment 1, which is not repeated here.

A classification module 2 for the user to be identified, configured to classify the behavior feature data of the user to be identified according to a preset classification model and a preset behavior feature sample library and to generate a classification result. This module performs the method of step S2 in Embodiment 1, which is not repeated here.

In this embodiment of the present invention, for the method of establishing the preset behavior feature library, see steps S3 to S12 described in Embodiment 1; details are not repeated here.

The system for identifying users with abnormal keystroke characteristics based on a support vector machine provided by this embodiment of the present invention acquires first sample behavior feature data of a preset sample entered by sample users, including the number of errors of each type in the typed characters, keystroke speed, average keystroke speed, instantaneous keystroke speed, keystroke accuracy, and keystroke stability. Behavior features that better reflect individual differences in users' keystrokes and characterize user identity are used as the behavior feature library. In addition, the missing behavior feature data is completed before the data is used as the preset behavior feature sample library, so that the sample users' behavior feature data is more complete and the recognition rate is greatly improved compared with the prior art.

Embodiment 3

This embodiment of the present invention provides a computer device which, as shown in FIG. 8, includes at least one processor 401, such as a CPU (Central Processing Unit), at least one communication interface 403, a memory 404, and at least one communication bus 402. The communication bus 402 implements connection and communication among these components. The communication interface 403 may include a display and a keyboard, and may optionally also include standard wired and wireless interfaces. The memory 404 may be a high-speed RAM (Random Access Memory, a volatile memory) or a non-volatile memory, such as at least one disk memory. Optionally, the memory 404 may also be at least one storage device located remotely from the processor 401. The processor 401 may be combined with the support vector machine-based system for identifying users with abnormal keystroke characteristics described with reference to FIG. 3. The memory 404 stores a set of program code, and the processor 401 calls the program code stored in the memory 404 to execute the support vector machine-based method for identifying users with abnormal keystroke characteristics, that is, the method of the embodiments shown in FIG. 1 to FIG. 6.

The communication bus 402 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The communication bus 402 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in FIG. 8, but this does not mean that there is only one bus or one type of bus.

The memory 404 may include volatile memory, such as random-access memory (RAM); it may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 404 may also include a combination of the above types of memory.

The processor 401 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.

The processor 401 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

Optionally, the memory 404 is also used to store program instructions. The processor 401 may call the program instructions to implement the support vector machine-based method for identifying users with abnormal keystroke characteristics shown in the embodiments of FIG. 1 to FIG. 6 of the present application.

This embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions that can execute the support vector machine-based method for identifying users with abnormal keystroke characteristics in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also include a combination of the above types of memory.

Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Obviously, the above embodiments are merely examples given for clarity of description and do not limit the implementations. Those of ordinary skill in the art can make other changes or modifications in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Obvious changes or modifications derived from the above remain within the protection scope of the present invention.

Claims

1. A method for identifying users with abnormal keystroke characteristics based on a support vector machine, characterized by comprising the following steps:

acquiring behavior characteristic data of a user to be identified entering a preset sample;

identifying the behavior characteristic data of the user to be identified according to a preset classification model and a preset behavior characteristic sample library to generate an identification result;

wherein the preset behavior characteristic sample library is established by the following steps:

acquiring first sample behavior characteristic data of a preset sample entered by a sample user;

and performing data completion processing on the behavior characteristic data missing from the first sample behavior characteristic data to form second sample behavior characteristic data, and taking the second sample behavior characteristic data as the preset behavior characteristic sample library.

2. The method for identifying users with abnormal keystroke characteristics based on a support vector machine of claim 1, wherein the first sample behavior characteristic data comprises at least one of: the number of errors of each type in the typed characters, the keystroke speed, the average keystroke speed, the instantaneous keystroke speed, the keystroke accuracy, and the keystroke stability.

3. The method for identifying users with abnormal keystroke characteristics based on a support vector machine according to claim 1, wherein the step of performing data completion processing on the behavior characteristic data missing from the first sample behavior characteristic data specifically comprises:

normalizing the first sample behavior characteristic data;

judging, according to the normalized first sample behavior characteristic data, whether behavior characteristic data of a sample user is missing;

taking the sample users with no missing behavior characteristic data, together with their first sample behavior characteristic data, as a training set, taking the sample users with missing behavior characteristic data, together with their first sample behavior characteristic data, as a test set, and determining the weights between the behavior characteristics with no missing data and the behavior characteristic with missing data according to the training set, the test set and a lasso regression model;

and completing the missing behavior characteristic data according to the weights and the training set to form the second sample behavior characteristic data.

4. The method according to claim 3, wherein the step of completing the missing behavior characteristic data according to the weights and the training set specifically comprises:

completing the characteristic value of the behavior characteristic with missing data according to the weights and the training set;

and completing the missing behavior characteristic data according to the characteristic value of the behavior characteristic with missing data.

5. The method of claim 4, wherein the weights between the behavior characteristics with no missing data and the behavior characteristic with missing data are calculated by the following formula:

wherein J(w) is the loss function, X is the characteristic value of the behavior characteristics with no missing data in the training set, N is the number of sample users, y is the characteristic value, in the training set, of the behavior characteristic with missing data, w is the weight between the behavior characteristics with no missing data and the behavior characteristic with missing data, and α is a hyperparameter.

6. The method of claim 5, wherein the characteristic value of the behavior characteristic with missing data is calculated by the following formula:

wherein the value given by the formula is the characteristic value of the behavior characteristic with missing data in the test set, w is the weight between the behavior characteristics with no missing data and the behavior characteristic with missing data, and b is a bias value.

7. The method of claim 6, further comprising, after the step of completing the missing behavior feature data according to the weights and training set to form a second sample behavior feature data:

performing correlation analysis on the second sample behavior characteristic data to obtain an analysis result;

screening the second sample behavior characteristics according to the analysis result;

performing dimensionality reduction on the screened data through principal component analysis;

and taking the data subjected to the dimensionality reduction processing as a user behavior characteristic sample library.

8. The method for identifying abnormal user of keystroke characteristics based on support vector machine of claim 7, wherein the screened data is processed by dimension reduction according to the following formula:

wherein x⁽ⁱ⁾ is the feature vector in the current dimension, x⁽ⁱ⁾_approx is the feature vector after the dimension reduction processing, α is a preset threshold, and m represents the number of the sample users.

9. The method for identifying users with abnormal keystroke characteristics based on a support vector machine as claimed in claim 8, wherein the step of identifying the behavior characteristic data of the user to be identified according to a preset classification model and a preset behavior characteristic sample library to generate an identification result specifically comprises:

acquiring a sample user set of the preset behavior characteristic sample library;

taking each sample user in the sample user set in turn as a positive set, with the other sample users as a negative set;

obtaining an identification result according to the behavior characteristic data of the user to be identified after the dimension reduction processing and a support vector machine classification model;

sorting the identification results, and determining the identity of the sample user who served as the positive set for the maximum identification result as the identity of the user to be identified;

judging whether the identity of the user to be identified belongs to the sample user set;

and when the identity of the user to be identified does not belong to the sample user set, judging that the user to be identified is an abnormal user.

10. The method for identifying users with abnormal keystroke characteristics based on a support vector machine as claimed in claim 9, wherein the step of identifying the behavior characteristic data of the user to be identified according to a preset classification model and a preset behavior characteristic sample library to generate an identification result specifically comprises:

acquiring a sample user set of the user behavior characteristic sample library;

taking each user sample in the user set in turn as a positive set, with the other user samples as a negative set;

identifying according to the behavior feature data of the user to be identified after the dimension reduction processing and a support vector machine classification model to obtain an identification result;

sorting according to the identification results, and judging whether the maximum value in the identification results is greater than a preset value;

when the maximum value in the identification result is larger than the preset value, determining the identity of the sample user who served as the corresponding positive set as the identity of the user to be identified;

and when the maximum value in the identification result is smaller than the preset value, judging that the user to be identified is an abnormal user.

11. A system for identifying users with abnormal keystroke characteristics based on a support vector machine, characterized by comprising:

a behavior characteristic extraction module for the user to be identified, configured to acquire behavior characteristic data of a user to be identified entering a preset sample;

a classification module for the user to be identified, configured to classify the behavior characteristic data of the user to be identified according to a preset classification model and a preset behavior characteristic sample library to generate a classification result;

12. A computer device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to cause the at least one processor to perform the method for identifying users with abnormal keystroke characteristics based on a support vector machine of any one of claims 1 to 10.

13. A computer-readable storage medium storing computer instructions for causing a computer to perform the method for identifying users with abnormal keystroke characteristics based on a support vector machine according to any one of claims 1 to 10.