CN116432125A

CN116432125A - Code Classification Method Based on Hash Algorithm

Info

Publication number: CN116432125A
Application number: CN202310637656.3A
Authority: CN
Inventors: 吴炜; 江谜; 罗川; 钟萍
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2023-06-01
Filing date: 2023-06-01
Publication date: 2023-07-14
Anticipated expiration: 2043-06-01
Also published as: CN116432125B

Abstract

The invention discloses a code classification method based on a hash algorithm, which comprises the steps of obtaining existing code data and corresponding classification results to extract corresponding abstract syntax trees; calculating to obtain the expression of each abstract syntax tree; constructing a kernel matrix based on the similarity between abstract syntax trees; constructing an abstract syntax tree classification preliminary model and training to obtain an abstract syntax tree classification model; the abstract syntax tree classification model is adopted to conduct actual code classification. The invention avoids complex mathematical calculation and massive parameter learning in the neural network, does not depend on expensive high-end hardware, can obviously reduce time expenditure on the premise of meeting code classification precision, is particularly suitable for code classification tasks in a large-scale code dataset scene, and has high reliability, good accuracy and higher efficiency.

Description

Code Classification Method Based on Hash Algorithm

技术领域technical field

本发明属于数据挖掘技术领域，具体涉及一种基于哈希算法的代码分类方法。The invention belongs to the technical field of data mining, and in particular relates to a code classification method based on a hash algorithm.

背景技术Background technique

随着互联网技术的快速发展，用户的需求也在飞速增长。为了满足用户的海量需求，开发者就需要开发出大量的应用。同时，大部分的开发者们，都倾向于将开发的应用源代码（部分或全部）分享到开源社区，以便进行交流、完善或者二次开发。因此，近年来，开源社区的源代码数据有了飞速的增长；这不仅丰富了软件的生态系统，也为日益复杂的软件开发提供了更多的选择和灵活性。With the rapid development of Internet technology, the needs of users are also growing rapidly. In order to meet the massive demands of users, developers need to develop a large number of applications. At the same time, most developers tend to share the developed application source code (part or all) with the open source community for communication, improvement or secondary development. Therefore, in recent years, the source code data of the open source community has grown rapidly; this not only enriches the software ecosystem, but also provides more choices and flexibility for increasingly complex software development.

但是，随着开源社区源代码数据的飞速增长，代码库的分类变得越来越困难。如果不能对海量且复杂的代码数据进行有效的分类，那么这将直接导致代码库的体积极速增大，从而降低代码库或开源社区的运行效率。因此，如何快速且准确地提取代码数据的特征，并根据特征对代码数据进行分类，就变得极为重要。However, with the rapid growth of source code data in the open source community, the classification of code bases has become increasingly difficult. If the massive and complex code data cannot be effectively classified, it will directly lead to a rapid increase in the size of the code base, thereby reducing the operating efficiency of the code base or the open source community. Therefore, how to quickly and accurately extract the features of the code data and classify the code data according to the features becomes extremely important.

代码数据是程序语言的一种，而针对代码数据的分类，其关键就在于如何高效且准确的提取得到代码数据中的语法和语义信息，从而实现代码的分类。传统的代码分类方案，其将代码视为自然语言，并基于令牌进行建模和分类；但是由于代码数据与自然语言数据之间具有的本质性差别，这使得这类分类方案的准确性较差。Code data is a kind of programming language, and for the classification of code data, the key lies in how to efficiently and accurately extract the grammatical and semantic information in the code data, so as to realize the code classification. Traditional code classification schemes treat codes as natural language and model and classify them based on tokens; however, due to the essential difference between code data and natural language data, this type of classification scheme is less accurate. Difference.

近年来，抽象语法树技术已经开始逐步应用于提取代码数据的语义信息和结构信息，并结合机器学习模型来完成对代码的分类。但是，这类方案依旧存在一些缺陷：由于代码程序所对应的抽象语法树的规模通常较大且深度较深，基于该类抽象语法树的深度学习模型非常容易出现远端特征难以提取的问题，从而降低了代码分类的精确和可靠性；同时，现有的抽象语法树与机器学习组合的代码分类方案，其在运行时存在分类方案极其复杂，效率较差且成本较高的缺陷。In recent years, abstract syntax tree technology has been gradually applied to extract semantic information and structural information of code data, and combine machine learning models to complete code classification. However, there are still some defects in this type of scheme: because the abstract syntax tree corresponding to the code program is usually large in scale and deep in depth, the deep learning model based on this type of abstract syntax tree is very prone to the problem that remote features are difficult to extract, Therefore, the accuracy and reliability of code classification are reduced; at the same time, the existing code classification scheme combined with abstract syntax tree and machine learning has the defects of extremely complex classification scheme, poor efficiency and high cost at runtime.

发明内容Contents of the invention

本发明的目的在于提供一种可靠性高、准确性好且效率较高的基于哈希算法的代码分类方法。The purpose of the present invention is to provide a code classification method based on hash algorithm with high reliability, good accuracy and high efficiency.

本发明提供的这种基于哈希算法的代码分类方法，包括如下步骤：This hash algorithm-based code classification method provided by the present invention comprises the following steps:

S1. 获取已有的代码数据和对应的分类结果，并提取对应的抽象语法树；S1. Obtain the existing code data and corresponding classification results, and extract the corresponding abstract syntax tree;

S2. 针对步骤S1获取的抽象语法树，基于其中各个节点的属性信息和哈希算法，计算得到各个抽象语法树的表达；S2. For the abstract syntax tree obtained in step S1, calculate the expression of each abstract syntax tree based on the attribute information and hash algorithm of each node;

S3. 根据步骤S2得到的各个抽象语法树的表达，计算任意两个抽象语法树之间的相似度，从而构建核矩阵；S3. According to the expression of each abstract syntax tree obtained in step S2, calculate the similarity between any two abstract syntax trees, thereby constructing a kernel matrix;

S4. 基于支持向量机，构建抽象语法树分类初步模型；S4. Based on the support vector machine, construct a preliminary model of abstract syntax tree classification;

S5. 采用步骤S3得到的核矩阵，对步骤S4构建的抽象语法树分类初步模型进行训练，得到抽象语法树分类模型；S5. adopt the kernel matrix that step S3 obtains, train the abstract syntax tree classification preliminary model that step S4 builds, obtain the abstract syntax tree classification model;

S6. 采用步骤S5得到的抽象语法树分类模型，进行实际的代码的分类。S6. Use the abstract syntax tree classification model obtained in step S5 to classify the actual codes.

所述的步骤S2，具体包括如下步骤：The step S2 specifically includes the following steps:

A. 根据步骤S1获取的抽象语法树，提取对应的节点的属性信息，并初始化抽象语法树中各个节点的向量表达；A. According to the abstract syntax tree obtained in step S1, extract the attribute information of the corresponding node, and initialize the vector expression of each node in the abstract syntax tree;

B. 针对抽象语法树中的每个节点，提取以当前节点为根节点的各个深度的子树，并将根节点的向量表达与对应的所有子节点的向量表达进行拼接，以表示每个子树；B. For each node in the abstract syntax tree, extract the subtrees of each depth with the current node as the root node, and splice the vector expression of the root node with the vector expressions of all corresponding child nodes to represent each subtree ;

C. 为步骤B得到的每棵子树赋予唯一的整数识别号，并将所有子树的整数识别号组成的集合用于表示整棵抽象语法树；C. Assign a unique integer identification number to each subtree obtained in step B, and use the set of integer identification numbers of all subtrees to represent the entire abstract syntax tree;

D. 基于哈希算法，计算当前的抽象语法树的向量表达；D. Based on the hash algorithm, calculate the vector expression of the current abstract syntax tree;

E. 重复步骤A~D直至得到每一棵抽象语法树的向量表达。E. Repeat steps A~D until the vector representation of each abstract syntax tree is obtained.

所述的步骤A，具体包括如下步骤：Described step A specifically comprises the following steps:

针对当前的抽象语法树，提取得到抽象语法树中所有节点的属性信息，得到属性信息集合A；For the current abstract syntax tree, extract the attribute information of all nodes in the abstract syntax tree, and obtain the attribute information set A ;

根据提取得到的属性信息，采用如下算式构造每个节点的属性特征向量a：

其中，属性特征向量a的每一个维度表示抽象语法树中的一个节点；属性特征向量a中的第zz个元素/>

满足/>

；According to the extracted attribute information, use the following formula to construct the attribute feature vector a of each node :

Among them, each dimension of the attribute feature vector a represents a node in the abstract syntax tree; the zzth element in the attribute feature vector a />

meet />

;

若当前节点不是第i个节点的父节点，则

，否则/>

取值为第i个节点的属性；/>

为当前的抽象语法树中节点的总数。If the current node is not the parent node of the i-th node, then

, otherwise />

The value is the property of the i- th node; />

is the total number of nodes in the current abstract syntax tree.

所述的步骤B，具体包括如下步骤：Described step B specifically comprises the following steps:

针对抽象语法树中的每个节点，将节点的关系矩阵进行行压缩存储，通过索引操作提取以当前节点为根节点的各个深度的子树，并将子树中节点的向量表达按照抽象语法树的层次遍历次序进行拼接，以表示每棵子树。For each node in the abstract syntax tree, the relationship matrix of the node is compressed and stored, and the subtrees of each depth with the current node as the root node are extracted through the index operation, and the vector expression of the nodes in the subtree is expressed according to the abstract syntax tree The hierarchical traversal order is spliced to represent each subtree.

所述的步骤C，具体包括如下步骤：Described step C specifically comprises the following steps:

采用str2hash函数为步骤B得到的每棵子树赋予唯一的整数识别号，从而将以子树为元素的集合表示为整数集合；进而，将所有子树的整数识别号组成的集合用于表示整棵抽象语法树。The str2hash function is used to assign a unique integer identification number to each subtree obtained in step B, so that the set with subtrees as elements is represented as an integer set; furthermore, the set of integer identification numbers of all subtrees is used to represent the whole tree Abstract syntax tree.

所述的步骤D，具体包括如下步骤：Described step D specifically comprises the following steps:

将所有抽象语法树组成的集合表示为U，S为表示任意一个抽象语法树的集合，则

；生成K个随机置换的集合/>

，/>

为集合/>

中的第k个随机置换；Denote the set of all abstract syntax trees as U , and S is the set representing any abstract syntax tree, then

; Generate a set of K random permutations />

, />

for collection />

The kth random permutation in ;

采用如下算式计算得到第K维哈希码

，从而得到每棵抽象语法树的向量表达：/>

式中/>

为求最小hash值所对应的元素操作；/>

为生成S的一个随机置换。Use the following formula to calculate the K -th dimension hash code

, so as to obtain the vector representation of each abstract syntax tree: />

In the formula />

To find the element operation corresponding to the minimum hash value; />

is to generate a random permutation of S.

所述的步骤S3，具体包括如下步骤：The step S3 specifically includes the following steps:

根据步骤S2得到的各个抽象语法树的向量表达，计算任意两棵抽象语法树之间的海明相似度；Calculate the Hamming similarity between any two abstract syntax trees according to the vector expression of each abstract syntax tree obtained in step S2;

采用如下算式计算得到核矩阵：

式中

为核矩阵中第i1行第j1列的元素，用于表示树/>

与树/>

之间的海明相似度；

为抽象语法树/>

的第/>

维值；/>

为抽象语法树/>

的第/>

维值；/>

为判断函数，且取值规则为：若AA=BB则/>

，否则/>

。The kernel matrix is calculated by the following formula:

In the formula

is the element of row i 1 and column j 1 in the kernel matrix, used to represent the tree />

with tree />

Hamming similarity between

abstract syntax tree />

No. />

dimension value; />

abstract syntax tree />

No. />

dimension value; />

It is a judgment function, and the value rule is: if AA = BB then />

, otherwise />

.

所述的步骤S4，具体包括如下内容：采用一对一支持向量机，将M类分类问题转换为

个二分类问题；将第/>

类数据视为+1类，将第/>

类数据视为-1类，并对第

类数据和第/>

类数据，采用支持向量机来训练代码分类器；The described step S4 specifically includes the following content: using a one-to-one support vector machine, the M class classification problem is converted into

a binary classification problem; the second />

class data as +1 class, class />

Class data is treated as -1 class, and for the first

class data and section />

Class data, using support vector machines to train code classifiers;

求解对应的二次规划问题，得到代码分类器；Solve the corresponding quadratic programming problem to obtain the code classifier;

基于抽象语法树之间的海明相似度，计算得到最终的抽象语法树的分类结果；抽象语法树的分类结果对应着代码数据的分类结果。Based on the Hamming similarity between the abstract syntax trees, the final classification result of the abstract syntax tree is calculated; the classification result of the abstract syntax tree corresponds to the classification result of the code data.

所述的步骤S4，具体包括如下步骤：The step S4 specifically includes the following steps:

采用一对一支持向量机，将M类分类问题转换为

个二分类问题；将第

类数据视为+1类，将第/>

类数据视为-1类，并对第/>

类数据和第/>

类数据，采用支持向量机来训练代码分类器：/>

式中/>

为超平面法向量；/>

为代码片段的向量表达；/>

为输入空间到特征空间的非线性映射；/>

为对应的截距；Using a one-to-one support vector machine, the M -class classification problem is transformed into

a binary classification problem;

class data as +1 class, class />

Class data is treated as -1 class, and for the />

class data and section />

class data, using support vector machines to train code classifiers: />

In the formula />

is the hyperplane normal vector; />

is the vector representation of the code fragment; />

is the nonlinear mapping from the input space to the feature space; />

is the corresponding intercept;

求解如下二次规划问题：

；Solve the following quadratic programming problem:

;

约束条件1：

；Constraint 1:

;

约束条件2：

;Constraint 2:

;

约束条件3：

式中/>

为松弛变量；C为惩罚因子，且C>0；t为/>

类数据和

类数据的并集中的样本索引变量；/>

为样本t的松弛变量；/>

为样本t的类别；Constraint 3:

In the formula />

is the slack variable; C is the penalty factor, and C >0; t is />

class data and

The sample index variable in the union of class data; />

is the slack variable of sample t ; />

is the category of sample t ;

得到代码分类器：对于一个代码片段所对应的抽象语法树

，该代码片段的类别标签预测结果为/>

式中/>

为新数据的类别；/>为拉格朗日乘子，且/>

；/>

为抽象语法树/>

和抽象语法树/>

之间的海明相似度；/>

为二值函数，且若Z>0则/>

，若Z<0则/>

；/>

表示抽象语法树/>

所对应的代码片段属于第/>

类，

表示抽象语法树/>

所对应的代码片段属于第/>

类；Get the code classifier: the abstract syntax tree corresponding to a code fragment

, the category label prediction result of this code snippet is />

In the formula />

is the category of the new data; /> is the Lagrangian multiplier, and />

;/>

abstract syntax tree />

and Abstract Syntax Tree />

Hamming similarity between

is a binary function, and if Z > 0 then />

, if Z <0 then />

;/>

Represents an abstract syntax tree />

The corresponding code fragment belongs to the />

kind,

Represents an abstract syntax tree />

The corresponding code fragment belongs to the />

kind;

最后，得到最终的分类决策函数为：

式中/>

表示取/>

为最大值时所对应的类别为最终预测的类别。Finally, the final classification decision function is obtained as:

In the formula />

means to take />

The category corresponding to the maximum value is the final predicted category.

本发明提供的这种基于哈希算法的代码分类方法，通过随机生成若干组哈希函数来高效表达代码片段的抽象语法树，不仅取得了关于代码数量的线性时间和空间复杂度，同时也保留了抽象语法树的结构和语义信息，在保证相似度的前提下对数据进行降维；本发明能够生成用于支持向量机的核矩阵，并将核矩阵输入到支持向量机中训练得到代码分类器，从而有效地分类出各种类型的代码；因此本发明避免了神经网络中复杂的数学计算和海量的参数学习，同时不再依赖昂贵的高端硬件，能够在满足代码分类精度的前提下，明显降低时间开销，尤其适用于大规模代码数据集场景下的代码分类任务，而且本发明的可靠性高、准确性好且效率较高。The code classification method based on the hash algorithm provided by the present invention efficiently expresses the abstract syntax tree of code fragments by randomly generating several groups of hash functions, which not only achieves the linear time and space complexity of the number of codes, but also retains The structure and semantic information of the abstract syntax tree are obtained, and the dimensionality reduction is performed on the data under the premise of ensuring the similarity; the present invention can generate a kernel matrix for a support vector machine, and input the kernel matrix into the support vector machine for training to obtain code classification device, thereby effectively classifying various types of codes; therefore, the present invention avoids complex mathematical calculations and massive parameter learning in the neural network, and does not rely on expensive high-end hardware at the same time, and can meet the premise of code classification accuracy. The time cost is obviously reduced, and it is especially suitable for the code classification task in the scene of a large-scale code data set, and the present invention has high reliability, good accuracy and high efficiency.

附图说明Description of drawings

图1为本发明方法的方法流程示意图。Fig. 1 is a schematic flow chart of the method of the present invention.

图2为本发明方法中计算得到抽象语法树的表达的过程示意图。Fig. 2 is a schematic diagram of the process of calculating and obtaining the expression of the abstract syntax tree in the method of the present invention.

具体实施方式Detailed ways

如图1所示为本发明方法的方法流程示意图：本发明提供的这种基于哈希算法的代码分类方法，包括如下步骤：As shown in Figure 1, it is a schematic flow diagram of the method of the present invention: this hash algorithm-based code classification method provided by the present invention includes the following steps:

S2. 针对步骤S1获取的抽象语法树，基于其中各个节点的属性信息和哈希算法，计算得到各个抽象语法树的表达（该部分的示例图，如图2所示）；具体包括如下步骤：S2. For the abstract syntax tree obtained in step S1, calculate the expression of each abstract syntax tree based on the attribute information and hash algorithm of each node (the example diagram of this part is shown in Figure 2); specifically include the following steps:

A. 根据步骤S1获取的抽象语法树，提取对应的节点的属性信息，并初始化抽象语法树中各个节点的向量表达；具体包括如下步骤：A. According to the abstract syntax tree obtained in step S1, extract the attribute information of the corresponding node, and initialize the vector expression of each node in the abstract syntax tree; specifically include the following steps:

一般场合下，每个代码片段都可以建模为一棵抽象语法树

，其中V为该树的节点集，E为该树的边集，A为该树中节点的属性集合，f为将节点表示成/>

维的属性特征向量，且/>

；同时，该抽象语法树对应一个类别标签y，表示该代码的某种功能；In general, each code fragment can be modeled as an abstract syntax tree

, where V is the node set of the tree, E is the edge set of the tree, A is the attribute set of the nodes in the tree, f represents the node as />

attribute feature vector of dimension, and />

; At the same time, the abstract syntax tree corresponds to a category label y , indicating a certain function of the code;

满足/>

meet />

;

若当前节点不是第i个节点的父节点，则

，否则/>

取值为第i个节点的属性；/>

为当前的抽象语法树中节点的总数；If the current node is not the parent node of the i-th node, then

, otherwise />

The value is the property of the i- th node; />

is the total number of nodes in the current abstract syntax tree;

B. 针对抽象语法树中的每个节点，提取以当前节点为根节点的各个深度的子树，并将根节点的向量表达与对应的所有子节点的向量表达进行拼接，以表示每个子树；具体包括如下步骤：B. For each node in the abstract syntax tree, extract the subtrees of each depth with the current node as the root node, and splice the vector expression of the root node with the vector expressions of all corresponding child nodes to represent each subtree ; Concretely include the following steps:

针对抽象语法树中的每个节点，将节点的关系矩阵进行行压缩存储，并通过索引操作提取以当前节点为根节点的各个深度的子树，并将子树中节点的向量表达按照抽象语法树的层次遍历次序进行拼接，以表示每棵子树；For each node in the abstract syntax tree, the relationship matrix of the node is compressed and stored, and the subtrees of each depth with the current node as the root node are extracted through the index operation, and the vector expression of the nodes in the subtree is expressed according to the abstract syntax The hierarchical traversal order of the tree is spliced to represent each subtree;

C. 为步骤B得到的每棵子树赋予唯一的整数识别号，并将所有子树的整数识别号组成的集合用于表示整棵抽象语法树；具体包括如下步骤：C. Assign a unique integer identification number to each subtree obtained in step B, and use the set of integer identification numbers of all subtrees to represent the entire abstract syntax tree; specifically include the following steps:

采用str2hash函数为步骤B得到的每棵子树赋予唯一的整数识别号，从而将以子树为元素的集合表示为整数集合；进而，将所有子树的整数识别号组成的集合用于表示整棵抽象语法树；The str2hash function is used to assign a unique integer identification number to each subtree obtained in step B, so that the set with subtrees as elements is represented as an integer set; furthermore, the set of integer identification numbers of all subtrees is used to represent the whole tree abstract syntax tree;

D. 基于哈希算法，计算当前的抽象语法树的向量表达；具体包括如下步骤：D. Based on the hash algorithm, calculate the vector expression of the current abstract syntax tree; specifically include the following steps:

；生成K个随机置换的集合/>

，/>

为集合/>

中的第k个随机置换；Denote the set of all abstract syntax trees as U , and S is the set representing any abstract syntax tree, then

; Generate a set of K random permutations />

, />

for collection />

The kth random permutation in ;

采用如下算式计算得到第K维哈希码

，从而得到每棵抽象语法树的向量表达：/>

式中/>

为求最小hash值所对应的元素操作；/>

为生成S的一个随机置换，/>

表示S的一个MinHash值；Use the following formula to calculate the K -th dimension hash code

, so as to obtain the vector representation of each abstract syntax tree: />

In the formula />

To find the element operation corresponding to the minimum hash value; />

To generate a random permutation of S , />

Indicates a MinHash value of S;

E. 重复步骤A~D直至得到每一棵抽象语法树的向量表达；具体实施时，重复步骤A~D共N次，N为代码数据的个数，以得到每棵抽象语法树的向量表达；E. Repeat steps A~D until the vector expression of each abstract syntax tree is obtained; in practice, repeat steps A~D a total of N times, N is the number of code data, to obtain the vector expression of each abstract syntax tree ;

在具体重复时，在重复步骤A时，在初始化抽象语法树中各个节点的向量表达的过程中，生成

个/>

维向量；When repeating specifically, when repeating step A, in the process of initializing the vector representation of each node in the abstract syntax tree, generate

a />

dimension vector;

在重复步骤C时，生成一个由整数识别号组成的集合，集合中元素的个数为抽象语法树中子树的个数；When step C is repeated, a set consisting of integer identification numbers is generated, and the number of elements in the set is the number of subtrees in the abstract syntax tree;

在重复步骤D时，对每棵抽象语法树生成设定维度的向量表达；When step D is repeated, a vector representation of a set dimension is generated for each abstract syntax tree;

S3. 根据步骤S2得到的各个抽象语法树的表达，计算任意两个抽象语法树之间的相似度，从而构建核矩阵；具体包括如下步骤：S3. According to the expression of each abstract syntax tree obtained in step S2, calculate the similarity between any two abstract syntax trees, thereby constructing a kernel matrix; specifically include the following steps:

采用如下算式计算得到核矩阵：

式中

为核矩阵中第i1行第j1列的元素，用于表示树/>

与树/>

之间的海明相似度；

为抽象语法树/>

的第/>

维值；/>

为抽象语法树/>

的第/>

维值；/>

为判断函数，且取值规则为：若AA=BB则/>

，否则/>

；The kernel matrix is calculated by the following formula:

In the formula

with tree />

Hamming similarity between

abstract syntax tree />

No. />

dimension value; />

abstract syntax tree />

No. />

dimension value; />

It is a judgment function, and the value rule is: if AA = BB then />

, otherwise />

;

S4. 基于支持向量机，构建抽象语法树分类初步模型；具体包括如下内容：S4. Based on the support vector machine, build a preliminary model of abstract syntax tree classification; specifically include the following:

由于所需要划分的类别不止两类，因此采用一对一支持向量机，将M类分类问题转换为

个二分类问题；将第/>

类数据视为+1类，将第/>

类数据视为-1类，并对第/>

类数据和第/>

类数据，采用支持向量机来训练代码分类器；Since there are more than two categories that need to be divided, a one-to-one support vector machine is used to convert the M -class classification problem into

a binary classification problem; the second />

class data as +1 class, class />

Class data is treated as -1 class, and for the />

class data and section />

Class data, using support vector machines to train code classifiers;

基于抽象语法树之间的海明相似度，计算得到最终的抽象语法树的分类结果；抽象语法树的分类结果对应着代码数据的分类结果；Based on the Hamming similarity between the abstract syntax trees, the final classification result of the abstract syntax tree is calculated; the classification result of the abstract syntax tree corresponds to the classification result of the code data;

具体实施时，采用如下步骤构建模型：In the specific implementation, the following steps are used to build the model:

采用一对一支持向量机，将M类分类问题转换为

个二分类问题；将第

类数据视为+1类，将第/>

类数据视为-1类，并对第/>

类数据和第/>

类数据，采用支持向量机来训练代码分类器：/>

式中/>

为超平面法向量；/>

为代码片段的向量表达；/>

为输入空间到特征空间的非线性映射；/>

a binary classification problem;

class data as +1 class, class />

Class data is treated as -1 class, and for the />

class data and section />

class data, using support vector machines to train code classifiers: />

In the formula />

is the hyperplane normal vector; />

is the vector representation of the code fragment; />

is the nonlinear mapping from the input space to the feature space; />

is the corresponding intercept;

求解如下二次规划问题：

；Solve the following quadratic programming problem:

;

约束条件1：

;Constraint 1:

;

约束条件2：

;Constraint 2:

;

约束条件3：

式中/>

为松弛变量；C为惩罚因子，且C>0；t为/>

类数据和

类数据的并集中的样本索引变量；/>

为样本t的松弛变量；/>

为样本t的类别；Constraint 3:

In the formula />

is the slack variable; C is the penalty factor, and C >0; t is />

class data and

The sample index variable in the union of class data; />

is the slack variable of sample t ; />

is the category of sample t ;

得到代码分类器：对于一个代码片段所对应的抽象语法树

，该代码片段的类别标签预测结果为/>

式中/>

为新数据的类别；/>

为拉格朗日乘子，且/>

；/>

为抽象语法树/>

和抽象语法树/>

之间的海明相似度；/>

为二值函数，且若Z>0则/>

，若Z<0则/>

；/>

表示抽象语法树/>

所对应的代码片段属于第/>

类，

表示抽象语法树/>

所对应的代码片段属于第/>

, the category label prediction result of this code snippet is />

In the formula />

is the category of the new data; />

is the Lagrangian multiplier, and />

;/>

abstract syntax tree />

and Abstract Syntax Tree />

Hamming similarity between

is a binary function, and if Z > 0 then />

, if Z <0 then />

;/>

Represents an abstract syntax tree />

The corresponding code fragment belongs to the />

kind,

Represents an abstract syntax tree />

The corresponding code fragment belongs to the />

kind;

最后，得到最终的分类决策函数为：

式中/>

表示取/>

为最大值时所对应的类别为最终预测的类别；Finally, the final classification decision function is obtained as:

In the formula />

means to take />

The category corresponding to the maximum value is the final predicted category;

本发明提供的这种分类方法，具有很强的扩展性，适用于各种编程语言的代码分类。本发明方法的主要目标是对给定的代码片段按照所实现的功能进行分类，比如对github，StackOverflow等网站上的源代码按照其功能进行分类。The classification method provided by the present invention has strong expansibility and is suitable for code classification of various programming languages. The main goal of the method of the present invention is to classify given code fragments according to their functions, such as classifying source codes on github, StackOverflow and other websites according to their functions.

以下集合具体实施例，对本发明方法进行进一步说明：The following specific examples are assembled to further illustrate the method of the present invention:

本实施例采用公开数据集进行代码分类的测试；公开数据集为OJ数据集和GCJ数据集。其中，OJ数据集由从OJ编程练习系统收集的C程序组成，数据集包括学生提交的编程任务和相应的源代码，一共有52000条数据，分别包含104类，每个类别均有500条数据；GCJ数据集是从Google每年举办的在线程序竞赛中收集的，由8647 个Java文件组成，分为6类。OJ数据集和GCJ数据集的具体说明如表1所示：In this embodiment, a public data set is used to test code classification; the public data sets are OJ data set and GCJ data set. Among them, the OJ data set is composed of C programs collected from the OJ programming practice system. The data set includes programming tasks submitted by students and the corresponding source code. There are a total of 52,000 pieces of data, including 104 categories, and each category has 500 pieces of data. ; The GCJ dataset is collected from the online programming competition held by Google every year, and consists of 8647 Java files, which are divided into 6 categories. The specific description of OJ dataset and GCJ dataset is shown in Table 1:

表1 OJ数据集和GCJ数据集说明示意表Table 1 Description of OJ dataset and GCJ dataset

然后，分别采用现有的InferCode分类方法、DACL分类方法和本申请的分类方法，对以上公开数据集中的代码进行分类；Then, use the existing InferCode classification method, DACL classification method and the classification method of this application to classify the codes in the above public data sets;

其中，InferCode是基于深度学习的代码分类方法，利用自监督学习从代码的抽象语法树中学习代码表示，采用预训练-微调范式对生成的预训练模型进行微调从而实现代码分类，是现今较为热门的代码分类方法；DACL分类方法采用数据增强和课程式学习的微调模式来桥接已有的预训练模型和代码分类任务，这种代码数据增强方法目前已经得到了广泛的应用和研究；Among them, InferCode is a code classification method based on deep learning. It uses self-supervised learning to learn code representation from the abstract syntax tree of the code, and uses the pre-training-fine-tuning paradigm to fine-tune the generated pre-training model to achieve code classification. It is more popular nowadays. The code classification method; the DACL classification method uses data enhancement and course-based learning fine-tuning mode to bridge the existing pre-training model and code classification tasks. This code data enhancement method has been widely used and researched;

分类结束后，得到对应的分类方法的性能数据如表2所示：After the classification, the performance data of the corresponding classification method is obtained as shown in Table 2:

表2 分类方法性能对比示意表Table 2 Schematic diagram of performance comparison of classification methods

由表2可以看到，在这两个不同的数据集上，本申请方法相较于其他方法具有非常快的运行时间，这是因为本申请提出了一种基于哈希算法的代码分类方法，通过随机生成若干组哈希函数来高效表达代码片段的抽象语法树，进而对代码片段进行分类，一方面，由于此方法避免了复杂的数学计算和海量参数的学习过程，取得了关于代码数量的线性时间和空间复杂度，明显降低了时间和存储开销，由上述性能评估结果可知，随着数据集规模的增大，本申请方法在运行时间上表现出更好的性能，另一方面，此方法保留了抽象语法树的结构和语义信息，因此在准确度方面也具有很好的竞争力。As can be seen from Table 2, on these two different data sets, the method of this application has a very fast running time compared with other methods, because this application proposes a code classification method based on a hash algorithm, By randomly generating several groups of hash functions to efficiently express the abstract syntax tree of code fragments, and then classify code fragments, on the one hand, because this method avoids complex mathematical calculations and the learning process of massive parameters, it has achieved a large number of codes. Linear time and space complexity, which significantly reduces the time and storage overhead. From the above performance evaluation results, it can be seen that as the size of the data set increases, the method of this application shows better performance in terms of running time. On the other hand, this The method preserves the structural and semantic information of the abstract syntax tree, so it is also very competitive in terms of accuracy.

同时，本申请方法在准确度方便也表现出较为优异的性能；其中，本申请方法在OJ数据集上的分类准确度要明显优于现有的两种方法；同时，虽然本申请方法在GCJ数据集上的分类准确度要略低于现有的两种方法，但是由于本申请方法的耗时极短（耗时约为InferCode分类方法的十分之一，DACL分类方法的二十分之一），而且分类的准确度也基本达到了工业上大范围应用的要求（超过了95%）；因此本申请方法的综合性能无疑要明显优于现有的分类方法。At the same time, the method of this application also shows excellent performance in terms of accuracy and convenience; among them, the classification accuracy of the method of this application on the OJ data set is obviously better than the two existing methods; at the same time, although the method of this application is in GCJ The classification accuracy on the data set is slightly lower than the existing two methods, but due to the extremely short time-consuming of the application method (about one-tenth of the InferCode classification method and one-twentieth of the DACL classification method) 1), and the classification accuracy has basically met the requirements of large-scale industrial applications (more than 95%); therefore, the overall performance of the method of this application is undoubtedly significantly better than the existing classification methods.

Claims

1. a code classification method based on hash algorithm, it is characterized in that comprising the steps:

S1. Obtain the existing code data and corresponding classification results, and extract the corresponding abstract syntax tree;

S2. For the abstract syntax tree obtained in step S1, calculate the expression of each abstract syntax tree based on the attribute information and hash algorithm of each node;

S3. According to the expression of each abstract syntax tree obtained in step S2, calculate the similarity between any two abstract syntax trees, thereby constructing a kernel matrix;

S4. Based on the support vector machine, construct a preliminary model of abstract syntax tree classification;

S5. adopt the kernel matrix that step S3 obtains, train the abstract syntax tree classification preliminary model that step S4 builds, obtain the abstract syntax tree classification model;

S6. Use the abstract syntax tree classification model obtained in step S5 to classify the actual codes.

2. the code classification method based on hash algorithm according to claim 1, is characterized in that described step S2, specifically comprises the steps:

A. According to the abstract syntax tree obtained in step S1, extract the attribute information of the corresponding node, and initialize the vector expression of each node in the abstract syntax tree;

B. For each node in the abstract syntax tree, extract the subtrees of each depth with the current node as the root node, and splice the vector expression of the root node with the vector expressions of all corresponding child nodes to represent each subtree ;

C. Assign a unique integer identification number to each subtree obtained in step B, and use the set of integer identification numbers of all subtrees to represent the entire abstract syntax tree;

D. Based on the hash algorithm, calculate the vector expression of the current abstract syntax tree;

E. Repeat steps A~D until the vector representation of each abstract syntax tree is obtained.

3. the code classification method based on hash algorithm according to claim 2, is characterized in that described step A, specifically comprises the steps:

For the current abstract syntax tree, extract the attribute information of all nodes in the abstract syntax tree, and obtain the attribute information set A ;

According to the extracted attribute information, use the following formula to construct the attribute feature vector a of each node :

meet />

;

If the current node is not the parent node of the i-th node, then

, otherwise />

The value is the property of the i- th node; />

is the total number of nodes in the current abstract syntax tree.

4. the code classification method based on hash algorithm according to claim 3, is characterized in that described step B, specifically comprises the steps:

For each node in the abstract syntax tree, the relationship matrix of the node is compressed and stored, and the subtrees of each depth with the current node as the root node are extracted through the index operation, and the vector expression of the nodes in the subtree is expressed according to the abstract syntax tree The hierarchical traversal order is spliced to represent each subtree.

5. the code classification method based on hash algorithm according to claim 4, is characterized in that described step C, specifically comprises the steps:

The str2hash function is used to assign a unique integer identification number to each subtree obtained in step B, so that the set with subtrees as elements is represented as an integer set; furthermore, the set of integer identification numbers of all subtrees is used to represent the whole tree Abstract syntax tree.

6. the code classification method based on hash algorithm according to claim 5, is characterized in that described step D, specifically comprises the steps:

Denote the set of all abstract syntax trees as U , and S is the set representing any abstract syntax tree, then

; Generate a set of K random permutations />

, />

for collection />

The kth random permutation in ;

Use the following formula to calculate the K -th dimension hash code

, so as to obtain the vector representation of each abstract syntax tree:

In the formula />

To find the element operation corresponding to the minimum hash value;

is to generate a random permutation of S.

7. the code classification method based on hash algorithm according to claim 6, is characterized in that described step S3, specifically comprises the following steps:

Calculate the Hamming similarity between any two abstract syntax trees according to the vector expression of each abstract syntax tree obtained in step S2;

The kernel matrix is calculated by the following formula:

In the formula

with tree />

Hamming similarity between

abstract syntax tree />

No. />

dimension value; />

abstract syntax tree />

No. />

dimension value; />

It is a judgment function, and the value rule is: if AA = BB then />

, otherwise />

.

8. the code classification method based on hash algorithm according to claim 7, is characterized in that described step S4, specifically comprises the following content:

Using a one-to-one support vector machine, the M -class classification problem is transformed into

a binary classification problem; the second />

class data as +1 class, class />

Class data is treated as -1 class, and for the />

class data and section />

Class data, using support vector machines to train code classifiers;

Solve the corresponding quadratic programming problem to obtain the code classifier;

Based on the Hamming similarity between the abstract syntax trees, the final classification result of the abstract syntax tree is calculated; the classification result of the abstract syntax tree corresponds to the classification result of the code data.

9. the code classification method based on hash algorithm according to claim 8, is characterized in that described step S4, specifically comprises the following steps: