[go: up one dir, main page]

CN116432125A - Code Classification Method Based on Hash Algorithm - Google Patents

Code Classification Method Based on Hash Algorithm Download PDF

Info

Publication number
CN116432125A
CN116432125A CN202310637656.3A CN202310637656A CN116432125A CN 116432125 A CN116432125 A CN 116432125A CN 202310637656 A CN202310637656 A CN 202310637656A CN 116432125 A CN116432125 A CN 116432125A
Authority
CN
China
Prior art keywords
abstract syntax
syntax tree
code
classification
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310637656.3A
Other languages
Chinese (zh)
Other versions
CN116432125B (en
Inventor
吴炜
江谜
罗川
钟萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202310637656.3A priority Critical patent/CN116432125B/en
Publication of CN116432125A publication Critical patent/CN116432125A/en
Application granted granted Critical
Publication of CN116432125B publication Critical patent/CN116432125B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • G06F8/427Parsing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a code classification method based on a hash algorithm, which comprises the steps of obtaining existing code data and corresponding classification results to extract corresponding abstract syntax trees; calculating to obtain the expression of each abstract syntax tree; constructing a kernel matrix based on the similarity between abstract syntax trees; constructing an abstract syntax tree classification preliminary model and training to obtain an abstract syntax tree classification model; the abstract syntax tree classification model is adopted to conduct actual code classification. The invention avoids complex mathematical calculation and massive parameter learning in the neural network, does not depend on expensive high-end hardware, can obviously reduce time expenditure on the premise of meeting code classification precision, is particularly suitable for code classification tasks in a large-scale code dataset scene, and has high reliability, good accuracy and higher efficiency.

Description

基于哈希算法的代码分类方法Code Classification Method Based on Hash Algorithm

技术领域technical field

本发明属于数据挖掘技术领域,具体涉及一种基于哈希算法的代码分类方法。The invention belongs to the technical field of data mining, and in particular relates to a code classification method based on a hash algorithm.

背景技术Background technique

随着互联网技术的快速发展,用户的需求也在飞速增长。为了满足用户的海量需求,开发者就需要开发出大量的应用。同时,大部分的开发者们,都倾向于将开发的应用源代码(部分或全部)分享到开源社区,以便进行交流、完善或者二次开发。因此,近年来,开源社区的源代码数据有了飞速的增长;这不仅丰富了软件的生态系统,也为日益复杂的软件开发提供了更多的选择和灵活性。With the rapid development of Internet technology, the needs of users are also growing rapidly. In order to meet the massive demands of users, developers need to develop a large number of applications. At the same time, most developers tend to share the developed application source code (part or all) with the open source community for communication, improvement or secondary development. Therefore, in recent years, the source code data of the open source community has grown rapidly; this not only enriches the software ecosystem, but also provides more choices and flexibility for increasingly complex software development.

但是,随着开源社区源代码数据的飞速增长,代码库的分类变得越来越困难。如果不能对海量且复杂的代码数据进行有效的分类,那么这将直接导致代码库的体积极速增大,从而降低代码库或开源社区的运行效率。因此,如何快速且准确地提取代码数据的特征,并根据特征对代码数据进行分类,就变得极为重要。However, with the rapid growth of source code data in the open source community, the classification of code bases has become increasingly difficult. If the massive and complex code data cannot be effectively classified, it will directly lead to a rapid increase in the size of the code base, thereby reducing the operating efficiency of the code base or the open source community. Therefore, how to quickly and accurately extract the features of the code data and classify the code data according to the features becomes extremely important.

代码数据是程序语言的一种,而针对代码数据的分类,其关键就在于如何高效且准确的提取得到代码数据中的语法和语义信息,从而实现代码的分类。传统的代码分类方案,其将代码视为自然语言,并基于令牌进行建模和分类;但是由于代码数据与自然语言数据之间具有的本质性差别,这使得这类分类方案的准确性较差。Code data is a kind of programming language, and for the classification of code data, the key lies in how to efficiently and accurately extract the grammatical and semantic information in the code data, so as to realize the code classification. Traditional code classification schemes treat codes as natural language and model and classify them based on tokens; however, due to the essential difference between code data and natural language data, this type of classification scheme is less accurate. Difference.

近年来,抽象语法树技术已经开始逐步应用于提取代码数据的语义信息和结构信息,并结合机器学习模型来完成对代码的分类。但是,这类方案依旧存在一些缺陷:由于代码程序所对应的抽象语法树的规模通常较大且深度较深,基于该类抽象语法树的深度学习模型非常容易出现远端特征难以提取的问题,从而降低了代码分类的精确和可靠性;同时,现有的抽象语法树与机器学习组合的代码分类方案,其在运行时存在分类方案极其复杂,效率较差且成本较高的缺陷。In recent years, abstract syntax tree technology has been gradually applied to extract semantic information and structural information of code data, and combine machine learning models to complete code classification. However, there are still some defects in this type of scheme: because the abstract syntax tree corresponding to the code program is usually large in scale and deep in depth, the deep learning model based on this type of abstract syntax tree is very prone to the problem that remote features are difficult to extract, Therefore, the accuracy and reliability of code classification are reduced; at the same time, the existing code classification scheme combined with abstract syntax tree and machine learning has the defects of extremely complex classification scheme, poor efficiency and high cost at runtime.

发明内容Contents of the invention

本发明的目的在于提供一种可靠性高、准确性好且效率较高的基于哈希算法的代码分类方法。The purpose of the present invention is to provide a code classification method based on hash algorithm with high reliability, good accuracy and high efficiency.

本发明提供的这种基于哈希算法的代码分类方法,包括如下步骤:This hash algorithm-based code classification method provided by the present invention comprises the following steps:

S1. 获取已有的代码数据和对应的分类结果,并提取对应的抽象语法树;S1. Obtain the existing code data and corresponding classification results, and extract the corresponding abstract syntax tree;

S2. 针对步骤S1获取的抽象语法树,基于其中各个节点的属性信息和哈希算法,计算得到各个抽象语法树的表达;S2. For the abstract syntax tree obtained in step S1, calculate the expression of each abstract syntax tree based on the attribute information and hash algorithm of each node;

S3. 根据步骤S2得到的各个抽象语法树的表达,计算任意两个抽象语法树之间的相似度,从而构建核矩阵;S3. According to the expression of each abstract syntax tree obtained in step S2, calculate the similarity between any two abstract syntax trees, thereby constructing a kernel matrix;

S4. 基于支持向量机,构建抽象语法树分类初步模型;S4. Based on the support vector machine, construct a preliminary model of abstract syntax tree classification;

S5. 采用步骤S3得到的核矩阵,对步骤S4构建的抽象语法树分类初步模型进行训练,得到抽象语法树分类模型;S5. adopt the kernel matrix that step S3 obtains, train the abstract syntax tree classification preliminary model that step S4 builds, obtain the abstract syntax tree classification model;

S6. 采用步骤S5得到的抽象语法树分类模型,进行实际的代码的分类。S6. Use the abstract syntax tree classification model obtained in step S5 to classify the actual codes.

所述的步骤S2,具体包括如下步骤:The step S2 specifically includes the following steps:

A. 根据步骤S1获取的抽象语法树,提取对应的节点的属性信息,并初始化抽象语法树中各个节点的向量表达;A. According to the abstract syntax tree obtained in step S1, extract the attribute information of the corresponding node, and initialize the vector expression of each node in the abstract syntax tree;

B. 针对抽象语法树中的每个节点,提取以当前节点为根节点的各个深度的子树,并将根节点的向量表达与对应的所有子节点的向量表达进行拼接,以表示每个子树;B. For each node in the abstract syntax tree, extract the subtrees of each depth with the current node as the root node, and splice the vector expression of the root node with the vector expressions of all corresponding child nodes to represent each subtree ;

C. 为步骤B得到的每棵子树赋予唯一的整数识别号,并将所有子树的整数识别号组成的集合用于表示整棵抽象语法树;C. Assign a unique integer identification number to each subtree obtained in step B, and use the set of integer identification numbers of all subtrees to represent the entire abstract syntax tree;

D. 基于哈希算法,计算当前的抽象语法树的向量表达;D. Based on the hash algorithm, calculate the vector expression of the current abstract syntax tree;

E. 重复步骤A~D直至得到每一棵抽象语法树的向量表达。E. Repeat steps A~D until the vector representation of each abstract syntax tree is obtained.

所述的步骤A,具体包括如下步骤:Described step A specifically comprises the following steps:

针对当前的抽象语法树,提取得到抽象语法树中所有节点的属性信息,得到属性信息集合AFor the current abstract syntax tree, extract the attribute information of all nodes in the abstract syntax tree, and obtain the attribute information set A ;

根据提取得到的属性信息,采用如下算式构造每个节点的属性特征向量a

Figure SMS_1
其中,属性特征向量a的每一个维度表示抽象语法树中的一个节点;属性特征向量a中的第zz个元素/>
Figure SMS_2
满足/>
Figure SMS_3
;According to the extracted attribute information, use the following formula to construct the attribute feature vector a of each node :
Figure SMS_1
Among them, each dimension of the attribute feature vector a represents a node in the abstract syntax tree; the zzth element in the attribute feature vector a />
Figure SMS_2
meet />
Figure SMS_3
;

若当前节点不是第i个节点的父节点,则

Figure SMS_4
,否则/>
Figure SMS_5
取值为第i个节点的属性;/>
Figure SMS_6
为当前的抽象语法树中节点的总数。If the current node is not the parent node of the i-th node, then
Figure SMS_4
, otherwise />
Figure SMS_5
The value is the property of the i- th node; />
Figure SMS_6
is the total number of nodes in the current abstract syntax tree.

所述的步骤B,具体包括如下步骤:Described step B specifically comprises the following steps:

针对抽象语法树中的每个节点,将节点的关系矩阵进行行压缩存储,通过索引操作提取以当前节点为根节点的各个深度的子树,并将子树中节点的向量表达按照抽象语法树的层次遍历次序进行拼接,以表示每棵子树。For each node in the abstract syntax tree, the relationship matrix of the node is compressed and stored, and the subtrees of each depth with the current node as the root node are extracted through the index operation, and the vector expression of the nodes in the subtree is expressed according to the abstract syntax tree The hierarchical traversal order is spliced to represent each subtree.

所述的步骤C,具体包括如下步骤:Described step C specifically comprises the following steps:

采用str2hash函数为步骤B得到的每棵子树赋予唯一的整数识别号,从而将以子树为元素的集合表示为整数集合;进而,将所有子树的整数识别号组成的集合用于表示整棵抽象语法树。The str2hash function is used to assign a unique integer identification number to each subtree obtained in step B, so that the set with subtrees as elements is represented as an integer set; furthermore, the set of integer identification numbers of all subtrees is used to represent the whole tree Abstract syntax tree.

所述的步骤D,具体包括如下步骤:Described step D specifically comprises the following steps:

将所有抽象语法树组成的集合表示为US为表示任意一个抽象语法树的集合,则

Figure SMS_7
;生成K个随机置换的集合/>
Figure SMS_8
,/>
Figure SMS_9
为集合/>
Figure SMS_10
中的第k个随机置换;Denote the set of all abstract syntax trees as U , and S is the set representing any abstract syntax tree, then
Figure SMS_7
; Generate a set of K random permutations />
Figure SMS_8
, />
Figure SMS_9
for collection />
Figure SMS_10
The kth random permutation in ;

采用如下算式计算得到第K维哈希码

Figure SMS_11
,从而得到每棵抽象语法树的向量表达:/>
Figure SMS_12
式中/>
Figure SMS_13
为求最小hash值所对应的元素操作;/>
Figure SMS_14
为生成S的一个随机置换。Use the following formula to calculate the K -th dimension hash code
Figure SMS_11
, so as to obtain the vector representation of each abstract syntax tree: />
Figure SMS_12
In the formula />
Figure SMS_13
To find the element operation corresponding to the minimum hash value; />
Figure SMS_14
is to generate a random permutation of S.

所述的步骤S3,具体包括如下步骤:The step S3 specifically includes the following steps:

根据步骤S2得到的各个抽象语法树的向量表达,计算任意两棵抽象语法树之间的海明相似度;Calculate the Hamming similarity between any two abstract syntax trees according to the vector expression of each abstract syntax tree obtained in step S2;

采用如下算式计算得到核矩阵:

Figure SMS_16
式中
Figure SMS_17
为核矩阵中第i1行第j1列的元素,用于表示树/>
Figure SMS_19
与树/>
Figure SMS_21
之间的海明相似度;
Figure SMS_23
为抽象语法树/>
Figure SMS_25
的第/>
Figure SMS_27
维值;/>
Figure SMS_15
为抽象语法树/>
Figure SMS_18
的第/>
Figure SMS_20
维值;/>
Figure SMS_22
为判断函数,且取值规则为:若AA=BB则/>
Figure SMS_24
,否则/>
Figure SMS_26
。The kernel matrix is calculated by the following formula:
Figure SMS_16
In the formula
Figure SMS_17
is the element of row i 1 and column j 1 in the kernel matrix, used to represent the tree />
Figure SMS_19
with tree />
Figure SMS_21
Hamming similarity between
Figure SMS_23
abstract syntax tree />
Figure SMS_25
No. />
Figure SMS_27
dimension value; />
Figure SMS_15
abstract syntax tree />
Figure SMS_18
No. />
Figure SMS_20
dimension value; />
Figure SMS_22
It is a judgment function, and the value rule is: if AA = BB then />
Figure SMS_24
, otherwise />
Figure SMS_26
.

所述的步骤S4,具体包括如下内容:采用一对一支持向量机,将M类分类问题转换为

Figure SMS_28
个二分类问题;将第/>
Figure SMS_29
类数据视为+1类,将第/>
Figure SMS_30
类数据视为-1类,并对第
Figure SMS_31
类数据和第/>
Figure SMS_32
类数据,采用支持向量机来训练代码分类器;The described step S4 specifically includes the following content: using a one-to-one support vector machine, the M class classification problem is converted into
Figure SMS_28
a binary classification problem; the second />
Figure SMS_29
class data as +1 class, class />
Figure SMS_30
Class data is treated as -1 class, and for the first
Figure SMS_31
class data and section />
Figure SMS_32
Class data, using support vector machines to train code classifiers;

求解对应的二次规划问题,得到代码分类器;Solve the corresponding quadratic programming problem to obtain the code classifier;

基于抽象语法树之间的海明相似度,计算得到最终的抽象语法树的分类结果;抽象语法树的分类结果对应着代码数据的分类结果。Based on the Hamming similarity between the abstract syntax trees, the final classification result of the abstract syntax tree is calculated; the classification result of the abstract syntax tree corresponds to the classification result of the code data.

所述的步骤S4,具体包括如下步骤:The step S4 specifically includes the following steps:

采用一对一支持向量机,将M类分类问题转换为

Figure SMS_34
个二分类问题;将第
Figure SMS_36
类数据视为+1类,将第/>
Figure SMS_38
类数据视为-1类,并对第/>
Figure SMS_39
类数据和第/>
Figure SMS_40
类数据,采用支持向量机来训练代码分类器:/>
Figure SMS_41
式中/>
Figure SMS_42
为超平面法向量;/>
Figure SMS_33
为代码片段的向量表达;/>
Figure SMS_35
为输入空间到特征空间的非线性映射;/>
Figure SMS_37
为对应的截距;Using a one-to-one support vector machine, the M -class classification problem is transformed into
Figure SMS_34
a binary classification problem;
Figure SMS_36
class data as +1 class, class />
Figure SMS_38
Class data is treated as -1 class, and for the />
Figure SMS_39
class data and section />
Figure SMS_40
class data, using support vector machines to train code classifiers: />
Figure SMS_41
In the formula />
Figure SMS_42
is the hyperplane normal vector; />
Figure SMS_33
is the vector representation of the code fragment; />
Figure SMS_35
is the nonlinear mapping from the input space to the feature space; />
Figure SMS_37
is the corresponding intercept;

求解如下二次规划问题:

Figure SMS_43
;Solve the following quadratic programming problem:
Figure SMS_43
;

约束条件1:

Figure SMS_44
;Constraint 1:
Figure SMS_44
;

约束条件2:

Figure SMS_45
;Constraint 2:
Figure SMS_45
;

约束条件3:

Figure SMS_46
式中/>
Figure SMS_47
为松弛变量;C为惩罚因子,且C>0;t为/>
Figure SMS_48
类数据和
Figure SMS_49
类数据的并集中的样本索引变量;/>
Figure SMS_50
为样本t的松弛变量;/>
Figure SMS_51
为样本t的类别;Constraint 3:
Figure SMS_46
In the formula />
Figure SMS_47
is the slack variable; C is the penalty factor, and C >0; t is />
Figure SMS_48
class data and
Figure SMS_49
The sample index variable in the union of class data; />
Figure SMS_50
is the slack variable of sample t ; />
Figure SMS_51
is the category of sample t ;

得到代码分类器:对于一个代码片段所对应的抽象语法树

Figure SMS_56
,该代码片段的类别标签预测结果为/>
Figure SMS_58
式中/>
Figure SMS_60
为新数据的类别;/>为拉格朗日乘子,且/>
Figure SMS_63
;/>
Figure SMS_65
为抽象语法树/>
Figure SMS_67
和抽象语法树/>
Figure SMS_52
之间的海明相似度;/>
Figure SMS_55
为二值函数,且若Z>0则/>
Figure SMS_57
,若Z<0则/>
Figure SMS_59
;/>
Figure SMS_62
表示抽象语法树/>
Figure SMS_64
所对应的代码片段属于第/>
Figure SMS_66
类,
Figure SMS_68
表示抽象语法树/>
Figure SMS_53
所对应的代码片段属于第/>
Figure SMS_54
类;Get the code classifier: the abstract syntax tree corresponding to a code fragment
Figure SMS_56
, the category label prediction result of this code snippet is />
Figure SMS_58
In the formula />
Figure SMS_60
is the category of the new data; /> is the Lagrangian multiplier, and />
Figure SMS_63
;/>
Figure SMS_65
abstract syntax tree />
Figure SMS_67
and Abstract Syntax Tree />
Figure SMS_52
Hamming similarity between
Figure SMS_55
is a binary function, and if Z > 0 then />
Figure SMS_57
, if Z <0 then />
Figure SMS_59
;/>
Figure SMS_62
Represents an abstract syntax tree />
Figure SMS_64
The corresponding code fragment belongs to the />
Figure SMS_66
kind,
Figure SMS_68
Represents an abstract syntax tree />
Figure SMS_53
The corresponding code fragment belongs to the />
Figure SMS_54
kind;

最后,得到最终的分类决策函数为:

Figure SMS_69
式中/>
Figure SMS_70
表示取/>
Figure SMS_71
为最大值时所对应的类别为最终预测的类别。Finally, the final classification decision function is obtained as:
Figure SMS_69
In the formula />
Figure SMS_70
means to take />
Figure SMS_71
The category corresponding to the maximum value is the final predicted category.

本发明提供的这种基于哈希算法的代码分类方法,通过随机生成若干组哈希函数来高效表达代码片段的抽象语法树,不仅取得了关于代码数量的线性时间和空间复杂度,同时也保留了抽象语法树的结构和语义信息,在保证相似度的前提下对数据进行降维;本发明能够生成用于支持向量机的核矩阵,并将核矩阵输入到支持向量机中训练得到代码分类器,从而有效地分类出各种类型的代码;因此本发明避免了神经网络中复杂的数学计算和海量的参数学习,同时不再依赖昂贵的高端硬件,能够在满足代码分类精度的前提下,明显降低时间开销,尤其适用于大规模代码数据集场景下的代码分类任务,而且本发明的可靠性高、准确性好且效率较高。The code classification method based on the hash algorithm provided by the present invention efficiently expresses the abstract syntax tree of code fragments by randomly generating several groups of hash functions, which not only achieves the linear time and space complexity of the number of codes, but also retains The structure and semantic information of the abstract syntax tree are obtained, and the dimensionality reduction is performed on the data under the premise of ensuring the similarity; the present invention can generate a kernel matrix for a support vector machine, and input the kernel matrix into the support vector machine for training to obtain code classification device, thereby effectively classifying various types of codes; therefore, the present invention avoids complex mathematical calculations and massive parameter learning in the neural network, and does not rely on expensive high-end hardware at the same time, and can meet the premise of code classification accuracy. The time cost is obviously reduced, and it is especially suitable for the code classification task in the scene of a large-scale code data set, and the present invention has high reliability, good accuracy and high efficiency.

附图说明Description of drawings

图1为本发明方法的方法流程示意图。Fig. 1 is a schematic flow chart of the method of the present invention.

图2为本发明方法中计算得到抽象语法树的表达的过程示意图。Fig. 2 is a schematic diagram of the process of calculating and obtaining the expression of the abstract syntax tree in the method of the present invention.

具体实施方式Detailed ways

如图1所示为本发明方法的方法流程示意图:本发明提供的这种基于哈希算法的代码分类方法,包括如下步骤:As shown in Figure 1, it is a schematic flow diagram of the method of the present invention: this hash algorithm-based code classification method provided by the present invention includes the following steps:

S1. 获取已有的代码数据和对应的分类结果,并提取对应的抽象语法树;S1. Obtain the existing code data and corresponding classification results, and extract the corresponding abstract syntax tree;

S2. 针对步骤S1获取的抽象语法树,基于其中各个节点的属性信息和哈希算法,计算得到各个抽象语法树的表达(该部分的示例图,如图2所示);具体包括如下步骤:S2. For the abstract syntax tree obtained in step S1, calculate the expression of each abstract syntax tree based on the attribute information and hash algorithm of each node (the example diagram of this part is shown in Figure 2); specifically include the following steps:

A. 根据步骤S1获取的抽象语法树,提取对应的节点的属性信息,并初始化抽象语法树中各个节点的向量表达;具体包括如下步骤:A. According to the abstract syntax tree obtained in step S1, extract the attribute information of the corresponding node, and initialize the vector expression of each node in the abstract syntax tree; specifically include the following steps:

一般场合下,每个代码片段都可以建模为一棵抽象语法树

Figure SMS_72
,其中V为该树的节点集,E为该树的边集,A为该树中节点的属性集合,f为将节点表示成/>
Figure SMS_73
维的属性特征向量,且/>
Figure SMS_74
;同时,该抽象语法树对应一个类别标签y,表示该代码的某种功能;In general, each code fragment can be modeled as an abstract syntax tree
Figure SMS_72
, where V is the node set of the tree, E is the edge set of the tree, A is the attribute set of the nodes in the tree, f represents the node as />
Figure SMS_73
attribute feature vector of dimension, and />
Figure SMS_74
; At the same time, the abstract syntax tree corresponds to a category label y , indicating a certain function of the code;

针对当前的抽象语法树,提取得到抽象语法树中所有节点的属性信息,得到属性信息集合AFor the current abstract syntax tree, extract the attribute information of all nodes in the abstract syntax tree, and obtain the attribute information set A ;

根据提取得到的属性信息,采用如下算式构造每个节点的属性特征向量a

Figure SMS_75
其中,属性特征向量a的每一个维度表示抽象语法树中的一个节点;属性特征向量a中的第zz个元素/>
Figure SMS_76
满足/>
Figure SMS_77
;According to the extracted attribute information, use the following formula to construct the attribute feature vector a of each node :
Figure SMS_75
Among them, each dimension of the attribute feature vector a represents a node in the abstract syntax tree; the zzth element in the attribute feature vector a />
Figure SMS_76
meet />
Figure SMS_77
;

若当前节点不是第i个节点的父节点,则

Figure SMS_78
,否则/>
Figure SMS_79
取值为第i个节点的属性;/>
Figure SMS_80
为当前的抽象语法树中节点的总数;If the current node is not the parent node of the i-th node, then
Figure SMS_78
, otherwise />
Figure SMS_79
The value is the property of the i- th node; />
Figure SMS_80
is the total number of nodes in the current abstract syntax tree;

B. 针对抽象语法树中的每个节点,提取以当前节点为根节点的各个深度的子树,并将根节点的向量表达与对应的所有子节点的向量表达进行拼接,以表示每个子树;具体包括如下步骤:B. For each node in the abstract syntax tree, extract the subtrees of each depth with the current node as the root node, and splice the vector expression of the root node with the vector expressions of all corresponding child nodes to represent each subtree ; Concretely include the following steps:

针对抽象语法树中的每个节点,将节点的关系矩阵进行行压缩存储,并通过索引操作提取以当前节点为根节点的各个深度的子树,并将子树中节点的向量表达按照抽象语法树的层次遍历次序进行拼接,以表示每棵子树;For each node in the abstract syntax tree, the relationship matrix of the node is compressed and stored, and the subtrees of each depth with the current node as the root node are extracted through the index operation, and the vector expression of the nodes in the subtree is expressed according to the abstract syntax The hierarchical traversal order of the tree is spliced to represent each subtree;

C. 为步骤B得到的每棵子树赋予唯一的整数识别号,并将所有子树的整数识别号组成的集合用于表示整棵抽象语法树;具体包括如下步骤:C. Assign a unique integer identification number to each subtree obtained in step B, and use the set of integer identification numbers of all subtrees to represent the entire abstract syntax tree; specifically include the following steps:

采用str2hash函数为步骤B得到的每棵子树赋予唯一的整数识别号,从而将以子树为元素的集合表示为整数集合;进而,将所有子树的整数识别号组成的集合用于表示整棵抽象语法树;The str2hash function is used to assign a unique integer identification number to each subtree obtained in step B, so that the set with subtrees as elements is represented as an integer set; furthermore, the set of integer identification numbers of all subtrees is used to represent the whole tree abstract syntax tree;

D. 基于哈希算法,计算当前的抽象语法树的向量表达;具体包括如下步骤:D. Based on the hash algorithm, calculate the vector expression of the current abstract syntax tree; specifically include the following steps:

将所有抽象语法树组成的集合表示为US为表示任意一个抽象语法树的集合,则

Figure SMS_81
;生成K个随机置换的集合/>
Figure SMS_82
,/>
Figure SMS_83
为集合/>
Figure SMS_84
中的第k个随机置换;Denote the set of all abstract syntax trees as U , and S is the set representing any abstract syntax tree, then
Figure SMS_81
; Generate a set of K random permutations />
Figure SMS_82
, />
Figure SMS_83
for collection />
Figure SMS_84
The kth random permutation in ;

采用如下算式计算得到第K维哈希码

Figure SMS_85
,从而得到每棵抽象语法树的向量表达:/>
Figure SMS_86
式中/>
Figure SMS_87
为求最小hash值所对应的元素操作;/>
Figure SMS_88
为生成S的一个随机置换,/>
Figure SMS_89
表示S的一个MinHash值;Use the following formula to calculate the K -th dimension hash code
Figure SMS_85
, so as to obtain the vector representation of each abstract syntax tree: />
Figure SMS_86
In the formula />
Figure SMS_87
To find the element operation corresponding to the minimum hash value; />
Figure SMS_88
To generate a random permutation of S , />
Figure SMS_89
Indicates a MinHash value of S;

E. 重复步骤A~D直至得到每一棵抽象语法树的向量表达;具体实施时,重复步骤A~D共N次,N为代码数据的个数,以得到每棵抽象语法树的向量表达;E. Repeat steps A~D until the vector expression of each abstract syntax tree is obtained; in practice, repeat steps A~D a total of N times, N is the number of code data, to obtain the vector expression of each abstract syntax tree ;

在具体重复时,在重复步骤A时,在初始化抽象语法树中各个节点的向量表达的过程中,生成

Figure SMS_90
个/>
Figure SMS_91
维向量;When repeating specifically, when repeating step A, in the process of initializing the vector representation of each node in the abstract syntax tree, generate
Figure SMS_90
a />
Figure SMS_91
dimension vector;

在重复步骤C时,生成一个由整数识别号组成的集合,集合中元素的个数为抽象语法树中子树的个数;When step C is repeated, a set consisting of integer identification numbers is generated, and the number of elements in the set is the number of subtrees in the abstract syntax tree;

在重复步骤D时,对每棵抽象语法树生成设定维度的向量表达;When step D is repeated, a vector representation of a set dimension is generated for each abstract syntax tree;

S3. 根据步骤S2得到的各个抽象语法树的表达,计算任意两个抽象语法树之间的相似度,从而构建核矩阵;具体包括如下步骤:S3. According to the expression of each abstract syntax tree obtained in step S2, calculate the similarity between any two abstract syntax trees, thereby constructing a kernel matrix; specifically include the following steps:

根据步骤S2得到的各个抽象语法树的向量表达,计算任意两棵抽象语法树之间的海明相似度;Calculate the Hamming similarity between any two abstract syntax trees according to the vector expression of each abstract syntax tree obtained in step S2;

采用如下算式计算得到核矩阵:

Figure SMS_93
式中
Figure SMS_95
为核矩阵中第i1行第j1列的元素,用于表示树/>
Figure SMS_97
与树/>
Figure SMS_99
之间的海明相似度;
Figure SMS_101
为抽象语法树/>
Figure SMS_103
的第/>
Figure SMS_104
维值;/>
Figure SMS_92
为抽象语法树/>
Figure SMS_94
的第/>
Figure SMS_96
维值;/>
Figure SMS_98
为判断函数,且取值规则为:若AA=BB则/>
Figure SMS_100
,否则/>
Figure SMS_102
;The kernel matrix is calculated by the following formula:
Figure SMS_93
In the formula
Figure SMS_95
is the element of row i 1 and column j 1 in the kernel matrix, used to represent the tree />
Figure SMS_97
with tree />
Figure SMS_99
Hamming similarity between
Figure SMS_101
abstract syntax tree />
Figure SMS_103
No. />
Figure SMS_104
dimension value; />
Figure SMS_92
abstract syntax tree />
Figure SMS_94
No. />
Figure SMS_96
dimension value; />
Figure SMS_98
It is a judgment function, and the value rule is: if AA = BB then />
Figure SMS_100
, otherwise />
Figure SMS_102
;

S4. 基于支持向量机,构建抽象语法树分类初步模型;具体包括如下内容:S4. Based on the support vector machine, build a preliminary model of abstract syntax tree classification; specifically include the following:

由于所需要划分的类别不止两类,因此采用一对一支持向量机,将M类分类问题转换为

Figure SMS_105
个二分类问题;将第/>
Figure SMS_106
类数据视为+1类,将第/>
Figure SMS_107
类数据视为-1类,并对第/>
Figure SMS_108
类数据和第/>
Figure SMS_109
类数据,采用支持向量机来训练代码分类器;Since there are more than two categories that need to be divided, a one-to-one support vector machine is used to convert the M -class classification problem into
Figure SMS_105
a binary classification problem; the second />
Figure SMS_106
class data as +1 class, class />
Figure SMS_107
Class data is treated as -1 class, and for the />
Figure SMS_108
class data and section />
Figure SMS_109
Class data, using support vector machines to train code classifiers;

求解对应的二次规划问题,得到代码分类器;Solve the corresponding quadratic programming problem to obtain the code classifier;

基于抽象语法树之间的海明相似度,计算得到最终的抽象语法树的分类结果;抽象语法树的分类结果对应着代码数据的分类结果;Based on the Hamming similarity between the abstract syntax trees, the final classification result of the abstract syntax tree is calculated; the classification result of the abstract syntax tree corresponds to the classification result of the code data;

具体实施时,采用如下步骤构建模型:In the specific implementation, the following steps are used to build the model:

采用一对一支持向量机,将M类分类问题转换为

Figure SMS_111
个二分类问题;将第
Figure SMS_113
类数据视为+1类,将第/>
Figure SMS_115
类数据视为-1类,并对第/>
Figure SMS_116
类数据和第/>
Figure SMS_117
类数据,采用支持向量机来训练代码分类器:/>
Figure SMS_118
式中/>
Figure SMS_119
为超平面法向量;/>
Figure SMS_110
为代码片段的向量表达;/>
Figure SMS_112
为输入空间到特征空间的非线性映射;/>
Figure SMS_114
为对应的截距;Using a one-to-one support vector machine, the M -class classification problem is transformed into
Figure SMS_111
a binary classification problem;
Figure SMS_113
class data as +1 class, class />
Figure SMS_115
Class data is treated as -1 class, and for the />
Figure SMS_116
class data and section />
Figure SMS_117
class data, using support vector machines to train code classifiers: />
Figure SMS_118
In the formula />
Figure SMS_119
is the hyperplane normal vector; />
Figure SMS_110
is the vector representation of the code fragment; />
Figure SMS_112
is the nonlinear mapping from the input space to the feature space; />
Figure SMS_114
is the corresponding intercept;

求解如下二次规划问题:

Figure SMS_120
;Solve the following quadratic programming problem:
Figure SMS_120
;

约束条件1:

Figure SMS_121
;Constraint 1:
Figure SMS_121
;

约束条件2:

Figure SMS_122
;Constraint 2:
Figure SMS_122
;

约束条件3:

Figure SMS_123
式中/>
Figure SMS_124
为松弛变量;C为惩罚因子,且C>0;t为/>
Figure SMS_125
类数据和
Figure SMS_126
类数据的并集中的样本索引变量;/>
Figure SMS_127
为样本t的松弛变量;/>
Figure SMS_128
为样本t的类别;Constraint 3:
Figure SMS_123
In the formula />
Figure SMS_124
is the slack variable; C is the penalty factor, and C >0; t is />
Figure SMS_125
class data and
Figure SMS_126
The sample index variable in the union of class data; />
Figure SMS_127
is the slack variable of sample t ; />
Figure SMS_128
is the category of sample t ;

得到代码分类器:对于一个代码片段所对应的抽象语法树

Figure SMS_134
,该代码片段的类别标签预测结果为/>
Figure SMS_136
式中/>
Figure SMS_138
为新数据的类别;/>
Figure SMS_140
为拉格朗日乘子,且/>
Figure SMS_142
;/>
Figure SMS_144
为抽象语法树/>
Figure SMS_145
和抽象语法树/>
Figure SMS_129
之间的海明相似度;/>
Figure SMS_131
为二值函数,且若Z>0则/>
Figure SMS_133
,若Z<0则/>
Figure SMS_135
;/>
Figure SMS_137
表示抽象语法树/>
Figure SMS_139
所对应的代码片段属于第/>
Figure SMS_141
类,
Figure SMS_143
表示抽象语法树/>
Figure SMS_130
所对应的代码片段属于第/>
Figure SMS_132
类;Get the code classifier: the abstract syntax tree corresponding to a code fragment
Figure SMS_134
, the category label prediction result of this code snippet is />
Figure SMS_136
In the formula />
Figure SMS_138
is the category of the new data; />
Figure SMS_140
is the Lagrangian multiplier, and />
Figure SMS_142
;/>
Figure SMS_144
abstract syntax tree />
Figure SMS_145
and Abstract Syntax Tree />
Figure SMS_129
Hamming similarity between
Figure SMS_131
is a binary function, and if Z > 0 then />
Figure SMS_133
, if Z <0 then />
Figure SMS_135
;/>
Figure SMS_137
Represents an abstract syntax tree />
Figure SMS_139
The corresponding code fragment belongs to the />
Figure SMS_141
kind,
Figure SMS_143
Represents an abstract syntax tree />
Figure SMS_130
The corresponding code fragment belongs to the />
Figure SMS_132
kind;

最后,得到最终的分类决策函数为:

Figure SMS_146
式中/>
Figure SMS_147
表示取/>
Figure SMS_148
为最大值时所对应的类别为最终预测的类别;Finally, the final classification decision function is obtained as:
Figure SMS_146
In the formula />
Figure SMS_147
means to take />
Figure SMS_148
The category corresponding to the maximum value is the final predicted category;

S5. 采用步骤S3得到的核矩阵,对步骤S4构建的抽象语法树分类初步模型进行训练,得到抽象语法树分类模型;S5. adopt the kernel matrix that step S3 obtains, train the abstract syntax tree classification preliminary model that step S4 builds, obtain the abstract syntax tree classification model;

S6. 采用步骤S5得到的抽象语法树分类模型,进行实际的代码的分类。S6. Use the abstract syntax tree classification model obtained in step S5 to classify the actual codes.

本发明提供的这种分类方法,具有很强的扩展性,适用于各种编程语言的代码分类。本发明方法的主要目标是对给定的代码片段按照所实现的功能进行分类,比如对github,StackOverflow等网站上的源代码按照其功能进行分类。The classification method provided by the present invention has strong expansibility and is suitable for code classification of various programming languages. The main goal of the method of the present invention is to classify given code fragments according to their functions, such as classifying source codes on github, StackOverflow and other websites according to their functions.

以下集合具体实施例,对本发明方法进行进一步说明:The following specific examples are assembled to further illustrate the method of the present invention:

本实施例采用公开数据集进行代码分类的测试;公开数据集为OJ数据集和GCJ数据集。其中,OJ数据集由从OJ编程练习系统收集的C程序组成,数据集包括学生提交的编程任务和相应的源代码,一共有52000条数据,分别包含104类,每个类别均有500条数据;GCJ数据集是从Google每年举办的在线程序竞赛中收集的,由8647 个Java文件组成,分为6类。OJ数据集和GCJ数据集的具体说明如表1所示:In this embodiment, a public data set is used to test code classification; the public data sets are OJ data set and GCJ data set. Among them, the OJ data set is composed of C programs collected from the OJ programming practice system. The data set includes programming tasks submitted by students and the corresponding source code. There are a total of 52,000 pieces of data, including 104 categories, and each category has 500 pieces of data. ; The GCJ dataset is collected from the online programming competition held by Google every year, and consists of 8647 Java files, which are divided into 6 categories. The specific description of OJ dataset and GCJ dataset is shown in Table 1:

表1 OJ数据集和GCJ数据集说明示意表Table 1 Description of OJ dataset and GCJ dataset

Figure SMS_149
Figure SMS_149

然后,分别采用现有的InferCode分类方法、DACL分类方法和本申请的分类方法,对以上公开数据集中的代码进行分类;Then, use the existing InferCode classification method, DACL classification method and the classification method of this application to classify the codes in the above public data sets;

其中,InferCode是基于深度学习的代码分类方法,利用自监督学习从代码的抽象语法树中学习代码表示,采用预训练-微调范式对生成的预训练模型进行微调从而实现代码分类,是现今较为热门的代码分类方法;DACL分类方法采用数据增强和课程式学习的微调模式来桥接已有的预训练模型和代码分类任务,这种代码数据增强方法目前已经得到了广泛的应用和研究;Among them, InferCode is a code classification method based on deep learning. It uses self-supervised learning to learn code representation from the abstract syntax tree of the code, and uses the pre-training-fine-tuning paradigm to fine-tune the generated pre-training model to achieve code classification. It is more popular nowadays. The code classification method; the DACL classification method uses data enhancement and course-based learning fine-tuning mode to bridge the existing pre-training model and code classification tasks. This code data enhancement method has been widely used and researched;

分类结束后,得到对应的分类方法的性能数据如表2所示:After the classification, the performance data of the corresponding classification method is obtained as shown in Table 2:

表2 分类方法性能对比示意表Table 2 Schematic diagram of performance comparison of classification methods

Figure SMS_150
Figure SMS_150

由表2可以看到,在这两个不同的数据集上,本申请方法相较于其他方法具有非常快的运行时间,这是因为本申请提出了一种基于哈希算法的代码分类方法,通过随机生成若干组哈希函数来高效表达代码片段的抽象语法树,进而对代码片段进行分类,一方面,由于此方法避免了复杂的数学计算和海量参数的学习过程,取得了关于代码数量的线性时间和空间复杂度,明显降低了时间和存储开销,由上述性能评估结果可知,随着数据集规模的增大,本申请方法在运行时间上表现出更好的性能,另一方面,此方法保留了抽象语法树的结构和语义信息,因此在准确度方面也具有很好的竞争力。As can be seen from Table 2, on these two different data sets, the method of this application has a very fast running time compared with other methods, because this application proposes a code classification method based on a hash algorithm, By randomly generating several groups of hash functions to efficiently express the abstract syntax tree of code fragments, and then classify code fragments, on the one hand, because this method avoids complex mathematical calculations and the learning process of massive parameters, it has achieved a large number of codes. Linear time and space complexity, which significantly reduces the time and storage overhead. From the above performance evaluation results, it can be seen that as the size of the data set increases, the method of this application shows better performance in terms of running time. On the other hand, this The method preserves the structural and semantic information of the abstract syntax tree, so it is also very competitive in terms of accuracy.

同时,本申请方法在准确度方便也表现出较为优异的性能;其中,本申请方法在OJ数据集上的分类准确度要明显优于现有的两种方法;同时,虽然本申请方法在GCJ数据集上的分类准确度要略低于现有的两种方法,但是由于本申请方法的耗时极短(耗时约为InferCode分类方法的十分之一,DACL分类方法的二十分之一),而且分类的准确度也基本达到了工业上大范围应用的要求(超过了95%);因此本申请方法的综合性能无疑要明显优于现有的分类方法。At the same time, the method of this application also shows excellent performance in terms of accuracy and convenience; among them, the classification accuracy of the method of this application on the OJ data set is obviously better than the two existing methods; at the same time, although the method of this application is in GCJ The classification accuracy on the data set is slightly lower than the existing two methods, but due to the extremely short time-consuming of the application method (about one-tenth of the InferCode classification method and one-twentieth of the DACL classification method) 1), and the classification accuracy has basically met the requirements of large-scale industrial applications (more than 95%); therefore, the overall performance of the method of this application is undoubtedly significantly better than the existing classification methods.

Claims (9)

1.一种基于哈希算法的代码分类方法,其特征在于包括如下步骤:1. a code classification method based on hash algorithm, it is characterized in that comprising the steps: S1. 获取已有的代码数据和对应的分类结果,并提取对应的抽象语法树;S1. Obtain the existing code data and corresponding classification results, and extract the corresponding abstract syntax tree; S2. 针对步骤S1获取的抽象语法树,基于其中各个节点的属性信息和哈希算法,计算得到各个抽象语法树的表达;S2. For the abstract syntax tree obtained in step S1, calculate the expression of each abstract syntax tree based on the attribute information and hash algorithm of each node; S3. 根据步骤S2得到的各个抽象语法树的表达,计算任意两个抽象语法树之间的相似度,从而构建核矩阵;S3. According to the expression of each abstract syntax tree obtained in step S2, calculate the similarity between any two abstract syntax trees, thereby constructing a kernel matrix; S4. 基于支持向量机,构建抽象语法树分类初步模型;S4. Based on the support vector machine, construct a preliminary model of abstract syntax tree classification; S5. 采用步骤S3得到的核矩阵,对步骤S4构建的抽象语法树分类初步模型进行训练,得到抽象语法树分类模型;S5. adopt the kernel matrix that step S3 obtains, train the abstract syntax tree classification preliminary model that step S4 builds, obtain the abstract syntax tree classification model; S6. 采用步骤S5得到的抽象语法树分类模型,进行实际的代码的分类。S6. Use the abstract syntax tree classification model obtained in step S5 to classify the actual codes. 2.根据权利要求1所述的基于哈希算法的代码分类方法,其特征在于所述的步骤S2,具体包括如下步骤:2. the code classification method based on hash algorithm according to claim 1, is characterized in that described step S2, specifically comprises the steps: A. 根据步骤S1获取的抽象语法树,提取对应的节点的属性信息,并初始化抽象语法树中各个节点的向量表达;A. According to the abstract syntax tree obtained in step S1, extract the attribute information of the corresponding node, and initialize the vector expression of each node in the abstract syntax tree; B. 针对抽象语法树中的每个节点,提取以当前节点为根节点的各个深度的子树,并将根节点的向量表达与对应的所有子节点的向量表达进行拼接,以表示每个子树;B. For each node in the abstract syntax tree, extract the subtrees of each depth with the current node as the root node, and splice the vector expression of the root node with the vector expressions of all corresponding child nodes to represent each subtree ; C. 为步骤B得到的每棵子树赋予唯一的整数识别号,并将所有子树的整数识别号组成的集合用于表示整棵抽象语法树;C. Assign a unique integer identification number to each subtree obtained in step B, and use the set of integer identification numbers of all subtrees to represent the entire abstract syntax tree; D. 基于哈希算法,计算当前的抽象语法树的向量表达;D. Based on the hash algorithm, calculate the vector expression of the current abstract syntax tree; E. 重复步骤A~D直至得到每一棵抽象语法树的向量表达。E. Repeat steps A~D until the vector representation of each abstract syntax tree is obtained. 3.根据权利要求2所述的基于哈希算法的代码分类方法,其特征在于所述的步骤A,具体包括如下步骤:3. the code classification method based on hash algorithm according to claim 2, is characterized in that described step A, specifically comprises the steps: 针对当前的抽象语法树,提取得到抽象语法树中所有节点的属性信息,得到属性信息集合AFor the current abstract syntax tree, extract the attribute information of all nodes in the abstract syntax tree, and obtain the attribute information set A ; 根据提取得到的属性信息,采用如下算式构造每个节点的属性特征向量a
Figure QLYQS_1
其中,属性特征向量a的每一个维度表示抽象语法树中的一个节点;属性特征向量a中的第zz个元素/>
Figure QLYQS_2
满足/>
Figure QLYQS_3
According to the extracted attribute information, use the following formula to construct the attribute feature vector a of each node :
Figure QLYQS_1
Among them, each dimension of the attribute feature vector a represents a node in the abstract syntax tree; the zzth element in the attribute feature vector a />
Figure QLYQS_2
meet />
Figure QLYQS_3
;
若当前节点不是第i个节点的父节点,则
Figure QLYQS_4
,否则/>
Figure QLYQS_5
取值为第i个节点的属性;/>
Figure QLYQS_6
为当前的抽象语法树中节点的总数。
If the current node is not the parent node of the i-th node, then
Figure QLYQS_4
, otherwise />
Figure QLYQS_5
The value is the property of the i- th node; />
Figure QLYQS_6
is the total number of nodes in the current abstract syntax tree.
4.根据权利要求3所述的基于哈希算法的代码分类方法,其特征在于所述的步骤B,具体包括如下步骤:4. the code classification method based on hash algorithm according to claim 3, is characterized in that described step B, specifically comprises the steps: 针对抽象语法树中的每个节点,将节点的关系矩阵进行行压缩存储,通过索引操作提取以当前节点为根节点的各个深度的子树,并将子树中节点的向量表达按照抽象语法树的层次遍历次序进行拼接,以表示每棵子树。For each node in the abstract syntax tree, the relationship matrix of the node is compressed and stored, and the subtrees of each depth with the current node as the root node are extracted through the index operation, and the vector expression of the nodes in the subtree is expressed according to the abstract syntax tree The hierarchical traversal order is spliced to represent each subtree. 5.根据权利要求4所述的基于哈希算法的代码分类方法,其特征在于所述的步骤C,具体包括如下步骤:5. the code classification method based on hash algorithm according to claim 4, is characterized in that described step C, specifically comprises the steps: 采用str2hash函数为步骤B得到的每棵子树赋予唯一的整数识别号,从而将以子树为元素的集合表示为整数集合;进而,将所有子树的整数识别号组成的集合用于表示整棵抽象语法树。The str2hash function is used to assign a unique integer identification number to each subtree obtained in step B, so that the set with subtrees as elements is represented as an integer set; furthermore, the set of integer identification numbers of all subtrees is used to represent the whole tree Abstract syntax tree. 6.根据权利要求5所述的基于哈希算法的代码分类方法,其特征在于所述的步骤D,具体包括如下步骤:6. the code classification method based on hash algorithm according to claim 5, is characterized in that described step D, specifically comprises the steps: 将所有抽象语法树组成的集合表示为US为表示任意一个抽象语法树的集合,则
Figure QLYQS_7
;生成K个随机置换的集合/>
Figure QLYQS_8
,/>
Figure QLYQS_9
为集合/>
Figure QLYQS_10
中的第k个随机置换;
Denote the set of all abstract syntax trees as U , and S is the set representing any abstract syntax tree, then
Figure QLYQS_7
; Generate a set of K random permutations />
Figure QLYQS_8
, />
Figure QLYQS_9
for collection />
Figure QLYQS_10
The kth random permutation in ;
采用如下算式计算得到第K维哈希码
Figure QLYQS_11
,从而得到每棵抽象语法树的向量表达:
Figure QLYQS_12
式中/>
Figure QLYQS_13
为求最小hash值所对应的元素操作;
Figure QLYQS_14
为生成S的一个随机置换。
Use the following formula to calculate the K -th dimension hash code
Figure QLYQS_11
, so as to obtain the vector representation of each abstract syntax tree:
Figure QLYQS_12
In the formula />
Figure QLYQS_13
To find the element operation corresponding to the minimum hash value;
Figure QLYQS_14
is to generate a random permutation of S.
7.根据权利要求6所述的基于哈希算法的代码分类方法,其特征在于所述的步骤S3,具体包括如下步骤:7. the code classification method based on hash algorithm according to claim 6, is characterized in that described step S3, specifically comprises the following steps: 根据步骤S2得到的各个抽象语法树的向量表达,计算任意两棵抽象语法树之间的海明相似度;Calculate the Hamming similarity between any two abstract syntax trees according to the vector expression of each abstract syntax tree obtained in step S2; 采用如下算式计算得到核矩阵:
Figure QLYQS_16
式中
Figure QLYQS_19
为核矩阵中第i1行第j1列的元素,用于表示树/>
Figure QLYQS_20
与树/>
Figure QLYQS_22
之间的海明相似度;
Figure QLYQS_24
为抽象语法树/>
Figure QLYQS_26
的第/>
Figure QLYQS_27
维值;/>
Figure QLYQS_15
为抽象语法树/>
Figure QLYQS_17
的第/>
Figure QLYQS_18
维值;/>
Figure QLYQS_21
为判断函数,且取值规则为:若AA=BB则/>
Figure QLYQS_23
,否则/>
Figure QLYQS_25
The kernel matrix is calculated by the following formula:
Figure QLYQS_16
In the formula
Figure QLYQS_19
is the element of row i 1 and column j 1 in the kernel matrix, used to represent the tree />
Figure QLYQS_20
with tree />
Figure QLYQS_22
Hamming similarity between
Figure QLYQS_24
abstract syntax tree />
Figure QLYQS_26
No. />
Figure QLYQS_27
dimension value; />
Figure QLYQS_15
abstract syntax tree />
Figure QLYQS_17
No. />
Figure QLYQS_18
dimension value; />
Figure QLYQS_21
It is a judgment function, and the value rule is: if AA = BB then />
Figure QLYQS_23
, otherwise />
Figure QLYQS_25
.
8.根据权利要求7所述的基于哈希算法的代码分类方法,其特征在于所述的步骤S4,具体包括如下内容:8. the code classification method based on hash algorithm according to claim 7, is characterized in that described step S4, specifically comprises the following content: 采用一对一支持向量机,将M类分类问题转换为
Figure QLYQS_28
个二分类问题;将第/>
Figure QLYQS_29
类数据视为+1类,将第/>
Figure QLYQS_30
类数据视为-1类,并对第/>
Figure QLYQS_31
类数据和第/>
Figure QLYQS_32
类数据,采用支持向量机来训练代码分类器;
Using a one-to-one support vector machine, the M -class classification problem is transformed into
Figure QLYQS_28
a binary classification problem; the second />
Figure QLYQS_29
class data as +1 class, class />
Figure QLYQS_30
Class data is treated as -1 class, and for the />
Figure QLYQS_31
class data and section />
Figure QLYQS_32
Class data, using support vector machines to train code classifiers;
求解对应的二次规划问题,得到代码分类器;Solve the corresponding quadratic programming problem to obtain the code classifier; 基于抽象语法树之间的海明相似度,计算得到最终的抽象语法树的分类结果;抽象语法树的分类结果对应着代码数据的分类结果。Based on the Hamming similarity between the abstract syntax trees, the final classification result of the abstract syntax tree is calculated; the classification result of the abstract syntax tree corresponds to the classification result of the code data.
9.根据权利要求8所述的基于哈希算法的代码分类方法,其特征在于所述的步骤S4,具体包括如下步骤:9. the code classification method based on hash algorithm according to claim 8, is characterized in that described step S4, specifically comprises the following steps: 采用一对一支持向量机,将M类分类问题转换为
Figure QLYQS_33
个二分类问题;将第/>
Figure QLYQS_34
类数据视为+1类,将第/>
Figure QLYQS_35
类数据视为-1类,并对第/>
Figure QLYQS_36
类数据和第/>
Figure QLYQS_37
类数据,采用支持向量机来训练代码分类器:
Using a one-to-one support vector machine, the M -class classification problem is transformed into
Figure QLYQS_33
a binary classification problem; the second />
Figure QLYQS_34
class data as +1 class, class />
Figure QLYQS_35
Class data is treated as -1 class, and for the />
Figure QLYQS_36
class data and section />
Figure QLYQS_37
Class data, using support vector machines to train code classifiers:
Figure QLYQS_38
式中/>
Figure QLYQS_39
为超平面法向量;/>
Figure QLYQS_40
为代码片段的向量表达;
Figure QLYQS_41
为输入空间到特征空间的非线性映射;/>
Figure QLYQS_42
为对应的截距;
Figure QLYQS_38
In the formula />
Figure QLYQS_39
is the hyperplane normal vector; />
Figure QLYQS_40
is the vector representation of the code fragment;
Figure QLYQS_41
is the nonlinear mapping from the input space to the feature space; />
Figure QLYQS_42
is the corresponding intercept;
求解如下二次规划问题:
Figure QLYQS_43
Solve the following quadratic programming problem:
Figure QLYQS_43
;
约束条件1:
Figure QLYQS_44
Constraint 1:
Figure QLYQS_44
;
约束条件2:
Figure QLYQS_45
Constraint 2:
Figure QLYQS_45
;
约束条件3:
Figure QLYQS_46
式中/>
Figure QLYQS_47
为松弛变量;C为惩罚因子,且C>0;t为/>
Figure QLYQS_48
类数据和/>
Figure QLYQS_49
类数据的并集中的样本索引变量;/>
Figure QLYQS_50
为样本t的松弛变量;/>
Figure QLYQS_51
为样本t的类别;
Constraint 3:
Figure QLYQS_46
In the formula />
Figure QLYQS_47
is the slack variable; C is the penalty factor, and C >0; t is />
Figure QLYQS_48
class data and />
Figure QLYQS_49
The sample index variable in the union of class data; />
Figure QLYQS_50
is the slack variable of sample t ; />
Figure QLYQS_51
is the category of sample t ;
得到代码分类器:对于一个代码片段所对应的抽象语法树
Figure QLYQS_58
,该代码片段的类别标签预测结果为/>
Figure QLYQS_60
式中/>
Figure QLYQS_63
为新数据的类别;/>
Figure QLYQS_65
为拉格朗日乘子,且/>
Figure QLYQS_66
;/>
Figure QLYQS_67
为抽象语法树/>
Figure QLYQS_68
和抽象语法树/>
Figure QLYQS_52
之间的海明相似度;/>
Figure QLYQS_54
为二值函数,且若Z>0则/>
Figure QLYQS_55
,若Z<0则
Figure QLYQS_57
;/>
Figure QLYQS_59
表示抽象语法树/>
Figure QLYQS_61
所对应的代码片段属于第/>
Figure QLYQS_62
类,
Figure QLYQS_64
表示抽象语法树/>
Figure QLYQS_53
所对应的代码片段属于第/>
Figure QLYQS_56
类;
Get the code classifier: the abstract syntax tree corresponding to a code fragment
Figure QLYQS_58
, the category label prediction result of this code snippet is />
Figure QLYQS_60
In the formula />
Figure QLYQS_63
is the category of the new data; />
Figure QLYQS_65
is the Lagrangian multiplier, and />
Figure QLYQS_66
;/>
Figure QLYQS_67
abstract syntax tree />
Figure QLYQS_68
and Abstract Syntax Tree />
Figure QLYQS_52
Hamming similarity between
Figure QLYQS_54
is a binary function, and if Z > 0 then />
Figure QLYQS_55
, if Z <0 then
Figure QLYQS_57
;/>
Figure QLYQS_59
Represents an abstract syntax tree />
Figure QLYQS_61
The corresponding code fragment belongs to the />
Figure QLYQS_62
kind,
Figure QLYQS_64
Represents an abstract syntax tree />
Figure QLYQS_53
The corresponding code fragment belongs to the />
Figure QLYQS_56
kind;
最后,得到最终的分类决策函数为:
Figure QLYQS_69
式中
Figure QLYQS_70
表示取/>
Figure QLYQS_71
为最大值时所对应的类别为最终预测的类别。
Finally, the final classification decision function is obtained as:
Figure QLYQS_69
In the formula
Figure QLYQS_70
means to take />
Figure QLYQS_71
The category corresponding to the maximum value is the final predicted category.
CN202310637656.3A 2023-06-01 2023-06-01 Code Classification Method Based on Hash Algorithm Active CN116432125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310637656.3A CN116432125B (en) 2023-06-01 2023-06-01 Code Classification Method Based on Hash Algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310637656.3A CN116432125B (en) 2023-06-01 2023-06-01 Code Classification Method Based on Hash Algorithm

Publications (2)

Publication Number Publication Date
CN116432125A true CN116432125A (en) 2023-07-14
CN116432125B CN116432125B (en) 2023-09-05

Family

ID=87087463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310637656.3A Active CN116432125B (en) 2023-06-01 2023-06-01 Code Classification Method Based on Hash Algorithm

Country Status (1)

Country Link
CN (1) CN116432125B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116933106A (en) * 2023-07-20 2023-10-24 中国海洋大学 Code blocking method, storage medium and device based on unsupervised clustering

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975392A (en) * 2016-04-29 2016-09-28 国家计算机网络与信息安全管理中心 Duplicated code detection method and device based on abstract syntax tree
CN108664512A (en) * 2017-03-31 2018-10-16 华为技术有限公司 Text object sorting technique and device
WO2019156706A1 (en) * 2018-02-12 2019-08-15 Oracle International Corporation Automated code generation
CN110795732A (en) * 2019-10-10 2020-02-14 南京航空航天大学 SVM-based dynamic and static combination detection method for malicious codes of Android mobile network terminal
US20220283803A1 (en) * 2021-03-04 2022-09-08 Oracle International Corporation Language agnostic code classification
CN115331754A (en) * 2022-08-19 2022-11-11 中南大学 Molecular Classification Method Based on Hash Algorithm
CN115587318A (en) * 2022-10-24 2023-01-10 中国人民解放军战略支援部队信息工程大学 A Source Code Classification Method Based on Neural Network
CN115630368A (en) * 2022-10-20 2023-01-20 昆明理工大学 Java vulnerability classification method based on natural language processing and deep forest
CN116028112A (en) * 2023-01-30 2023-04-28 西安交通大学 Small program clone detection method based on complex network analysis

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975392A (en) * 2016-04-29 2016-09-28 国家计算机网络与信息安全管理中心 Duplicated code detection method and device based on abstract syntax tree
CN108664512A (en) * 2017-03-31 2018-10-16 华为技术有限公司 Text object sorting technique and device
WO2019156706A1 (en) * 2018-02-12 2019-08-15 Oracle International Corporation Automated code generation
CN110795732A (en) * 2019-10-10 2020-02-14 南京航空航天大学 SVM-based dynamic and static combination detection method for malicious codes of Android mobile network terminal
US20220283803A1 (en) * 2021-03-04 2022-09-08 Oracle International Corporation Language agnostic code classification
CN115331754A (en) * 2022-08-19 2022-11-11 中南大学 Molecular Classification Method Based on Hash Algorithm
CN115630368A (en) * 2022-10-20 2023-01-20 昆明理工大学 Java vulnerability classification method based on natural language processing and deep forest
CN115587318A (en) * 2022-10-24 2023-01-10 中国人民解放军战略支援部队信息工程大学 A Source Code Classification Method Based on Neural Network
CN116028112A (en) * 2023-01-30 2023-04-28 西安交通大学 Small program clone detection method based on complex network analysis

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FARHAN ULLAH, ET AL.: "programmers‘ de-anonybization using a hybrid approach of abstract syntax tree and deep learning", TECHNOLOGICAL FORECASTING & SOCIAL CHANGE *
李杨阳等: "人工智能技术在嵌入式代码审查中的应用与展望", 空间控制技术与应用, no. 03 *
胡海峰等: "哈希快速多标记学习算法", 信号处理, no. 08 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116933106A (en) * 2023-07-20 2023-10-24 中国海洋大学 Code blocking method, storage medium and device based on unsupervised clustering
CN116933106B (en) * 2023-07-20 2024-01-26 中国海洋大学 Code chunking method, storage medium and device based on unsupervised clustering

Also Published As

Publication number Publication date
CN116432125B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
Bui et al. Infercode: Self-supervised learning of code representations by predicting subtrees
CN109445834B (en) Program code similarity rapid comparison method based on abstract syntax tree
CN111898364B (en) Neural network relation extraction method, computer equipment and readable storage medium
CN109885824B (en) Hierarchical Chinese named entity recognition method, hierarchical Chinese named entity recognition device and readable storage medium
CN110532353B (en) Text entity matching method, system and device based on deep learning
CN110941716B (en) An automatic construction method of information security knowledge graph based on deep learning
CN110569033B (en) Method for generating basic codes of digital transaction type intelligent contracts
Sethi et al. DLPaper2Code: Auto-generation of code from deep learning research papers
CN111124487B (en) Code clone detection method and device and electronic equipment
CN115495755B (en) Codebert and R-GCN-based source code vulnerability multi-classification detection method
CN114547619B (en) Vulnerability restoration system and restoration method based on tree
CN116151132A (en) Intelligent code completion method, system and storage medium for programming learning scene
CN116861269A (en) Multi-source heterogeneous data fusion and analysis method in engineering field
CN116432125B (en) Code Classification Method Based on Hash Algorithm
CN114153839A (en) Integration method, device, equipment and storage medium of multi-source heterogeneous data
CN118627080A (en) A software vulnerability detection method based on data quality enhancement and code attribute graph simplification
CN115408506B (en) NL2SQL method combining semantic analysis and semantic component matching
Ma et al. Acceleration algorithms in gnns: A survey
CN118278519B (en) Knowledge graph completion method and related equipment
CN112256838B (en) Similar domain name searching method and device and electronic equipment
CN117573096B (en) Intelligent code completion method integrating abstract syntax tree structure information
CN113378946A (en) Robust multi-label feature selection method considering feature label dependency
CN112148879B (en) Computer readable storage medium for automatically labeling code with data structure
CN109815475B (en) Text matching method and device, computing equipment and system
CN113342982B (en) Enterprise industry classification method integrating Roberta and external knowledge base

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant