CN112115326B

CN112115326B - A multi-label classification and vulnerability detection method for Ethereum smart contracts

Info

Publication number: CN112115326B
Application number: CN202010836902.4A
Authority: CN
Inventors: 王伟; 李浥东; 宋晶晶
Original assignee: Beijing Jiaotong University
Current assignee: Beijing Jiaotong University
Priority date: 2020-08-19
Filing date: 2020-08-19
Publication date: 2022-07-29
Anticipated expiration: 2040-08-19
Also published as: CN112115326A

Abstract

The invention provides a multi-label classification and vulnerability detection method for Ethereum smart contracts. The method comprises the following steps: forming a sample data set from verified smart contracts, extracting features from the samples in the data set, and representing each sample as a feature vector; training several multi-label classification models on the feature vectors, evaluating the classification performance of each model, and selecting the best-performing model; and inputting the Ethereum smart contract to be classified into the selected model, which outputs the vulnerability detection result for that contract. By extracting static features and applying machine learning algorithms, the method detects Ethereum smart contract vulnerabilities automatically and efficiently, and is better suited to large-batch contract vulnerability detection.

Description

A multi-label classification and vulnerability detection method for Ethereum smart contracts

Technical Field

The invention relates to the technical field of vulnerability detection for blockchain distributed applications, and in particular to a multi-label classification and vulnerability detection method for Ethereum smart contracts.

Background

With social and economic development and a new round of technological change, blockchain, as an emerging technology, integrates multiple techniques, including encryption algorithms, consensus mechanisms, distributed data storage and peer-to-peer transmission, to guarantee that transaction data is tamper-proof and stored in a decentralized manner, thereby creating a trusted transaction environment.

As an open public-chain platform, Ethereum implements smart contracts by supporting the decentralized Ethereum Virtual Machine, and uses smart contracts to process peer-to-peer transactions. Ethereum smart contracts are widely used in many fields, such as financial services, infrastructure, the Internet of Things and healthcare, which has gradually made the industrial value of blockchain technology clear. However, the smart contract security incidents that have broken out frequently in recent years have not only caused huge economic losses but also seriously reduced trust in blockchain smart contracts.

At present, the main methods of smart contract vulnerability detection are formal verification, symbolic execution or symbolic analysis, and fuzzing. However, formal verification cannot be fully automated. Symbolic execution or symbolic analysis often needs to explore all executable paths in a contract or symbolically analyze the dependency graph of the contract, so it is slow and inefficient and is not suitable for large-batch contract vulnerability detection. The test cases generated by fuzzing are highly random, which easily leads to low code coverage; fuzzing often fails to detect all vulnerabilities in smart contract code and also suffers from a long detection cycle. Faced with the ever-increasing number of smart contracts, existing methods are overwhelmed.

Machine learning is an interdisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. In recent years, machine learning algorithms have been widely applied in many fields.

Therefore, using machine learning to develop an accurate and efficient vulnerability detection method for Ethereum smart contracts is of great practical significance.

Summary of the Invention

Embodiments of the present invention provide a vulnerability detection method for Ethereum smart contracts, so as to classify Ethereum smart contracts and detect their vulnerabilities accurately and efficiently.

To achieve the above object, the present invention adopts the following technical solutions.

A vulnerability detection method for Ethereum smart contracts, comprising:

forming a sample data set from verified smart contracts, extracting features from the samples in the sample data set, and representing each sample as a feature vector;

training several multi-label classification models on the feature vectors of the samples, evaluating the classification performance of each multi-label classification model, and selecting the multi-label classification model with the best classification performance;

inputting the Ethereum smart contract to be classified into the selected multi-label classification model, which outputs the vulnerability detection result for that contract.

Preferably, forming a sample data set from verified smart contracts comprises:

crawling a number of verified smart contracts from the Etherscan website, and forming the sample data set from all of the crawled smart contract data.

Preferably, extracting features from the samples in the sample data set and representing each sample as a feature vector comprises:

labeling whether each sample in the sample data set is vulnerable, and further classifying the vulnerable samples by vulnerability type;

compiling the contract source code of each labeled sample and parsing the resulting bytecode, thereby converting the Solidity high-level source into an opcode stream; abstracting the opcode stream using the set opcode abstraction rules; splitting the abstracted opcode stream into a series of bigram feature segments using the n-gram algorithm; and extracting a total of 1619 bigram feature dimensions from all bigram feature segments;

computing the value of each bigram feature using a defined feature calculation formula, combining all feature values into a feature set, and formatting the feature set as vectors to obtain the set of sample feature vectors, where each feature vector represents one sample and contains the class labels and feature data of that sample.

Preferably, labeling whether each sample in the sample data set is vulnerable and further classifying the vulnerable samples comprises:

scanning the contract source code of the samples in the sample data set with three tools, Oyente, Securify and Mythril, to determine whether the contract source code is vulnerable and which vulnerabilities it has, thereby obtaining preliminary labels for the samples; then, for the contract source code of each sample labeled as vulnerable, writing contract transaction test cases and deploying and debugging the transactions in the Remix IDE to verify whether the contract is actually vulnerable, thereby obtaining the refined labels of the samples.

Preferably, the set opcode abstraction rules comprise the opcode abstraction rules shown in Table 2:

Table 2 Opcode abstraction rules

Preferably, training several multi-label classification models on the feature vectors of the samples, evaluating the classification performance of each multi-label classification model, and selecting the multi-label classification model with the best classification performance comprises:

training multi-label classification models on the feature vector data of the samples using machine learning classification algorithms, the models comprising five multi-label classification models: XGBoost, AdaBoost, random forest, support vector machine and k-nearest neighbors; inputting the feature vector of each sample into each multi-label classification model, each model outputting a classification result that indicates whether the sample is vulnerable and which vulnerabilities it has, the vulnerabilities including integer underflow, integer overflow, transaction order dependency, timestamp dependency, unchecked return value and reentrancy;

using the micro-F1, macro-F1 and F1-score metrics, comparing the classification results output by each multi-label classification model with the refined labels of the samples, evaluating the classification performance of each model according to the comparison, and selecting the trained multi-label classification model with the best classification performance.

Preferably, the classification result indicating whether a sample is vulnerable and which vulnerabilities it has is represented by a classification label. Each item in the classification label stands for one vulnerability type: a value of 1 means the sample has that vulnerability, and a value of 0 means it does not.

As can be seen from the technical solutions provided by the above embodiments, the multi-label classification and vulnerability detection method proposed by the embodiments of the present invention extracts static features and applies machine learning algorithms to detect six types of contract vulnerabilities automatically, accurately and efficiently, and is better suited to large-batch contract vulnerability detection.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the description or be learned by practice of the present invention.

Description of the Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a processing flowchart of a machine-learning-based method for detecting vulnerabilities in Ethereum smart contracts, provided by an embodiment of the present invention.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, where the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.

Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the description of the present invention refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Furthermore, "connected" or "coupled" as used herein may include wireless connection or coupling. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It should also be understood that terms such as those defined in general dictionaries should be interpreted as having meanings consistent with their meanings in the context of the prior art and, unless defined as herein, are not to be interpreted in an idealized or overly formal sense.

To facilitate understanding of the embodiments of the present invention, several specific embodiments are further explained below with reference to the drawings; these embodiments do not limit the embodiments of the present invention.

An embodiment of the present invention provides a machine-learning-based method for automatically detecting vulnerabilities in Ethereum smart contracts. The method makes full use of the advantages of machine learning algorithms to improve detection efficiency. The invention adopts a relatively comprehensive feature set and can therefore describe the static characteristics of a contract effectively.

The processing flow of the machine-learning-based Ethereum smart contract vulnerability detection method provided by the present invention is shown in FIG. 1 and includes the following processing steps:

Step S110: Crawl verified smart contract data from the Etherscan website, and form the sample data set from the smart contract data.

Etherscan is one of the best-known Ethereum block explorers; its domain name is etherscan.io. It is essentially a search engine that lets users look up, confirm and verify transactions on the Ethereum distributed smart contract platform. Etherscan hosts a large amount of open-source smart contract code, including the contract source code, the corresponding token name, the Solidity version and other useful information, which is convenient for blockchain researchers. However, Etherscan does not provide a public interface for downloading smart contracts, so a script has to be written to crawl them. In practice, we crawled 49,502 Ethereum smart contracts.

The embodiment of the present invention crawls verified smart contract data from the Etherscan website and forms the sample data set from the smart contract data.

Step S120: Label whether each sample in the sample data set is vulnerable, and further classify the vulnerable samples by vulnerability type.

(1) Sample labeling

First, the contract source code of the samples in the sample data set is scanned with three tools, Oyente, Securify and Mythril, to determine whether the contract source code is vulnerable and which vulnerabilities it has, giving preliminary labels for the samples. Then, for the contract source code of each sample labeled as vulnerable, contract transaction test cases are written and the transactions are deployed and debugged in the Remix IDE to verify manually whether the contract is actually vulnerable. This yields the final sample labels shown in Table 1.

Table 1 Refined category labels of vulnerable samples

No.  Vulnerability type                                  Number of samples
1    Integer overflow vulnerability (C1)                 22128
2    Integer underflow vulnerability (C2)                9699
3    Transaction order dependency vulnerability (C3)     1436
4    Unchecked return value vulnerability (C4)           192
5    Timestamp dependency vulnerability (C5)             477
6    Reentrancy vulnerability (C6)                       100
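The two-stage labeling of step S120 can be sketched roughly as follows. This is only an illustration under stated assumptions: the text does not say how the findings of the three tools are merged, so a simple union is assumed here, and the category keys and the helper names `preliminary_label` and `final_label` are hypothetical.

```python
# Vulnerability categories in the order of Table 1 (C1..C6).
# The string keys are hypothetical identifiers chosen for this sketch.
CATEGORIES = [
    "integer_overflow",      # C1
    "integer_underflow",     # C2
    "transaction_order",     # C3
    "unchecked_return",      # C4
    "timestamp_dependency",  # C5
    "reentrancy",            # C6
]


def preliminary_label(tool_reports):
    """Merge per-tool findings into one 6-item label.

    Assumed merge rule (not stated in the source): a category is flagged
    if at least one tool reports it.
    """
    flagged = set()
    for findings in tool_reports.values():
        flagged.update(findings)
    return [1 if category in flagged else 0 for category in CATEGORIES]


def final_label(preliminary, manually_confirmed):
    """Keep only the preliminary flags confirmed by manual verification."""
    return [
        bit if CATEGORIES[i] in manually_confirmed else 0
        for i, bit in enumerate(preliminary)
    ]


# Made-up findings for one contract, for illustration only.
reports = {
    "oyente": {"integer_overflow", "reentrancy"},
    "securify": {"transaction_order"},
    "mythril": {"integer_overflow"},
}
prelim = preliminary_label(reports)
final = final_label(prelim, {"integer_overflow", "transaction_order"})
print(prelim)
print(final)
```

In this made-up case the reentrancy flag raised by one tool is dropped because the manual check did not confirm it, which mirrors the role of the Remix IDE verification described above.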

Step S130: Extract features from the samples to obtain static opcode features.

The contract source code of each sample is compiled and the resulting bytecode is parsed, converting the Solidity high-level source into an opcode stream. The opcode stream is then abstracted using the opcode abstraction rules shown in Table 2, which avoids the curse of dimensionality caused by an excessive number of features. Next, the n-gram algorithm splits the abstracted opcode stream into a series of bigram feature segments, and a total of 1619 bigram feature dimensions are extracted from all bigram feature segments to characterize the behavior of the samples. Finally, the value of each bigram feature is computed using a defined feature calculation formula, and all feature values are combined into a feature set.
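The abstraction and bigram steps of S130 can be illustrated with a short sketch. The contents of Table 2 are not reproduced in this text, so the abstraction rule below (collapsing the numbered PUSH, DUP, SWAP and LOG opcode families to their family name) is a hypothetical stand-in, and the function names are illustrative only.

```python
import re
from collections import Counter

# Hypothetical abstraction rule standing in for Table 2: PUSH1..PUSH32,
# DUP1..DUP16, SWAP1..SWAP16 and LOG0..LOG4 collapse to their family name.
_FAMILY = re.compile(r"^(PUSH|DUP|SWAP|LOG)\d+$")


def abstract_opcodes(opcodes):
    """Map each opcode to its abstract token; other opcodes pass through."""
    abstracted = []
    for op in opcodes:
        match = _FAMILY.match(op)
        abstracted.append(match.group(1) if match else op)
    return abstracted


def bigrams(tokens):
    """Split a token stream into overlapping pairs of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))


# A short made-up opcode stream (operands already stripped).
stream = ["PUSH1", "PUSH1", "MSTORE", "CALLVALUE",
          "DUP1", "ISZERO", "PUSH2", "JUMPI"]
abstracted = abstract_opcodes(stream)
counts = Counter(bigrams(abstracted))
for pair, n in counts.items():
    print(pair, n)
```

Abstracting before forming bigrams keeps the number of distinct pairs small: without it, every PUSH width would produce its own set of bigrams.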

Table 2 Opcode abstraction rules

Step S140: Vectorize the feature set, and use feature vectors to represent the samples.

The above feature set is formatted as vectors to obtain the set of sample feature vectors. Each feature vector represents one sample and contains the class labels and feature data of that sample.
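A minimal sketch of the vectorization in step S140 follows. The feature calculation formula is not given in this text, so relative bigram frequency is assumed purely for illustration; `build_vocabulary` and `to_vector` are hypothetical names, and the toy vocabulary has 3 dimensions rather than the 1619 described above.

```python
from collections import Counter


def build_vocabulary(token_streams):
    """Assign a fixed index to every distinct bigram seen in the corpus."""
    seen = {pair for tokens in token_streams for pair in zip(tokens, tokens[1:])}
    return {pair: index for index, pair in enumerate(sorted(seen))}


def to_vector(tokens, vocabulary):
    """Encode one sample as relative bigram frequencies (assumed formula)."""
    counts = Counter(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    vector = [0.0] * len(vocabulary)
    for pair, n in counts.items():
        if pair in vocabulary:
            vector[vocabulary[pair]] = n / total
    return vector


# Two made-up abstracted opcode streams with their 6-item labels.
corpus = [
    (["PUSH", "PUSH", "MSTORE"], [1, 0, 0, 0, 0, 0]),
    (["PUSH", "MSTORE", "PUSH", "MSTORE"], [0, 1, 0, 0, 0, 0]),
]
vocabulary = build_vocabulary(tokens for tokens, _ in corpus)

# Each row pairs the class labels with the feature data of one sample.
rows = [(label, to_vector(tokens, vocabulary)) for tokens, label in corpus]
for label, features in rows:
    print(label, features)
```

Fixing the vocabulary once over the whole corpus is what makes every sample vector the same length, which the classifiers in the next step require.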

Step S150: Based on the set of sample feature vectors, train several multi-label classification models, evaluate the classification performance of each model, and select the trained multi-label classification model with the best classification performance.

Multi-label classification models are trained on the feature vectors in the sample feature vector set using machine learning classification algorithms, and are then used to determine whether a sample is vulnerable and which vulnerabilities it has. The present invention uses five multi-label classification models: XGBoost, AdaBoost, random forest (RF), support vector machine (SVM) and k-nearest neighbors (KNN). The sample feature vector set is input into each multi-label classification model, and each model outputs a classification result indicating whether a sample is vulnerable and which vulnerabilities it has. The vulnerabilities include integer underflow, integer overflow, transaction order dependency, timestamp dependency, unchecked return value and reentrancy. The classification result can be represented by a classification label. Each item in the label stands for one vulnerability type: a value of 1 means the contract has that vulnerability, and a value of 0 means it does not. For example, if a smart contract is assigned the label [0, 1, 1, 0, 1, 0], the contract has the integer underflow, transaction order dependency and timestamp dependency vulnerabilities, and does not have the integer overflow, unchecked return value or reentrancy vulnerabilities.
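The label example above can be checked with a small decoding sketch. The item order is assumed to follow C1 to C6 in Table 1, which is consistent with the example; the `decode` helper is a hypothetical name.

```python
# Item order assumed to follow C1..C6 of Table 1.
VULNERABILITIES = [
    "integer overflow",              # C1
    "integer underflow",             # C2
    "transaction order dependency",  # C3
    "unchecked return value",        # C4
    "timestamp dependency",          # C5
    "reentrancy",                    # C6
]


def decode(label):
    """Split a 0/1 label into the vulnerabilities present and absent."""
    if len(label) != len(VULNERABILITIES):
        raise ValueError("label must have one item per vulnerability type")
    present = [name for name, bit in zip(VULNERABILITIES, label) if bit == 1]
    absent = [name for name, bit in zip(VULNERABILITIES, label) if bit == 0]
    return present, absent


present, absent = decode([0, 1, 1, 0, 1, 0])
print("has:", present)
print("does not have:", absent)
```

Because one contract can carry several 1 items at once, the task is multi-label rather than multi-class: the six items are predicted jointly instead of choosing a single category.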

Then, using the micro-F1, macro-F1 and F1-score metrics, the classification results output by each multi-label classification model are compared with the refined labels of the samples. The classification performance of each model is evaluated according to the comparison, and the trained multi-label classification model with the best classification performance is selected.
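The evaluation and selection can be sketched in plain Python as follows. Micro-F1 pools the true-positive, false-positive and false-negative counts over all labels before computing F1, while macro-F1 averages the per-label F1 scores. The text does not state how the metrics are combined into a single ranking, so ranking by micro-F1 is an assumption here, and the predictions of `model_a` and `model_b` are made-up data with three labels for brevity.

```python
def label_counts(y_true, y_pred, j):
    """Count true positives, false positives and false negatives for label j."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred[j] == 1 and truth[j] == 1:
            tp += 1
        elif pred[j] == 1:
            fp += 1
        elif truth[j] == 1:
            fn += 1
    return tp, fp, fn


def f1(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro-F1, macro-F1) for 0/1 multi-label predictions."""
    n_labels = len(y_true[0])
    per_label = [label_counts(y_true, y_pred, j) for j in range(n_labels)]
    macro = sum(f1(*counts) for counts in per_label) / n_labels
    micro = f1(
        sum(c[0] for c in per_label),
        sum(c[1] for c in per_label),
        sum(c[2] for c in per_label),
    )
    return micro, macro


# Made-up ground truth and predictions.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
predictions = {
    "model_a": [[1, 0, 1], [0, 1, 0], [1, 0, 0]],
    "model_b": [[1, 1, 0], [0, 1, 0], [0, 1, 0]],
}
scores = {name: micro_macro_f1(y_true, pred) for name, pred in predictions.items()}

# Assumed selection rule: highest micro-F1 wins.
best = max(scores, key=lambda name: scores[name][0])
print(scores)
print("selected:", best)
```

Macro-F1 weights rare classes such as C4 and C6 in Table 1 equally with frequent ones, so reporting it alongside micro-F1 shows whether a model performs well only on the common vulnerability types.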

Experiments show, as given in Table 3, that the XGBoost-based multi-label classification model detects smart contract vulnerabilities best. As shown in Table 4, the XGBoost multi-label classification model takes about 4 seconds to check one contract, whereas Oyente takes about 28 seconds and Securify about 18 seconds. The XGBoost multi-label classification model is therefore both accurate and efficient at contract vulnerability detection, and is better suited to detecting smart contract vulnerabilities in large batches.
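To make the batch-detection argument concrete, the approximate per-contract times quoted above can be scaled up to a large batch. The sketch below assumes contracts are processed one after another and uses the 49,502 contracts crawled in step S110 as an illustrative batch size; the reported times are approximate, so the totals are rough estimates only.

```python
# Approximate per-contract detection times quoted in the text (seconds).
SECONDS_PER_CONTRACT = {"XGBoost model": 4, "Oyente": 28, "Securify": 18}

# Illustrative batch size: the number of contracts crawled in step S110.
N_CONTRACTS = 49_502


def batch_hours(seconds_per_contract, n_contracts=N_CONTRACTS):
    """Total hours if the contracts are processed sequentially."""
    return seconds_per_contract * n_contracts / 3600


baseline = SECONDS_PER_CONTRACT["XGBoost model"]
for tool, seconds in SECONDS_PER_CONTRACT.items():
    ratio = seconds / baseline
    print(f"{tool}: about {batch_hours(seconds):.1f} h, {ratio:.1f}x the model's time")
```

On these figures, Oyente needs about 7 times and Securify about 4.5 times as long per contract as the XGBoost model, and the gap grows linearly with the batch size.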

Table 3 Classification performance comparison of the five classification models

Table 4 Vulnerability detection time comparison of the XGBoost multi-label classification model, Oyente and Securify

Step S160: Input the Ethereum smart contract to be classified into the trained multi-label classification model with the best classification performance. The model outputs the vulnerability detection result for the contract, which indicates which vulnerabilities the contract has or does not have. The vulnerabilities include integer underflow, integer overflow, transaction order dependency, timestamp dependency, unchecked return value and reentrancy.

In summary, the multi-label classification and vulnerability detection method proposed by the embodiments of the present invention extracts static features and applies machine learning algorithms to detect six types of contract vulnerabilities automatically, accurately and efficiently, and is better suited to large-batch contract vulnerability detection.

The present invention is the first to propose static features for Ethereum smart contracts and the first to apply such static features to contract vulnerability detection. It uses machine learning algorithms to detect Ethereum smart contract vulnerabilities automatically and efficiently, and is better suited to large-batch contract vulnerability detection.

Those of ordinary skill in the art will understand that the drawing is only a schematic diagram of one embodiment, and that the modules or processes in the drawing are not necessarily required to implement the present invention.

From the description of the above embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the present invention.

The embodiments in this specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, apparatus or system embodiments are described relatively briefly because they are basically similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of an embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.

The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims

1. A vulnerability detection method for an Ethernet intelligent contract is characterized by comprising the following steps:

forming a sample data set by using the verified intelligent contract, extracting the characteristics of the samples in the sample data set, and expressing the samples by using the characteristic vector;

training various multi-label classification models based on the feature vectors of the samples, evaluating the classification effect of the multi-label classification models, and selecting the multi-label classification model with the best classification effect;

inputting the Ether house intelligent contracts to be classified into a selected multi-label classification model, and outputting the vulnerability detection results of the Ether house intelligent contracts to be classified by the multi-label classification model;

the performing feature extraction on the samples in the sample data set, using the feature vector to represent the samples, includes:

calibrating whether the samples in the sample data set have the holes or not, and further performing detailed classification on the samples with the holes;

compiling the contract source code of each labeled sample and parsing the resulting bytecode, thereby converting the smart contract written in the Solidity high-level language into an opcode stream; abstracting the opcode stream of the sample according to a set opcode abstraction rule; segmenting the abstracted opcode stream into a series of bigram feature segments using an n-gram algorithm; and extracting 1619-dimensional bigram features from all bigram feature segments;

calculating the feature value of each bigram feature according to a defined feature calculation formula, forming a feature set from all the feature values, and formatting the feature set into a vector format to obtain a feature vector set of the samples, wherein each feature vector represents one sample and comprises the classification and feature data of that sample;

wherein labeling whether each sample in the sample data set has a vulnerability and further classifying the samples that have vulnerabilities comprises:

scanning the contract source code of each sample in the sample data set with the Oyente, Securify and Mythril tools to determine whether the contract source code has a vulnerability and of which types, thereby obtaining a preliminary labeling result for the sample; and, for each sample labeled as vulnerable, writing contract transaction test cases for its contract source code, deploying and debugging the transactions in the Remix IDE, and verifying whether the contract actually has the vulnerability, thereby obtaining a refined labeling result for the sample;

wherein training a plurality of multi-label classification models based on the feature vectors of the samples, evaluating the classification performance of each multi-label classification model, and selecting the multi-label classification model with the best classification performance comprises:

training multi-label classification models on the feature vector data of the samples using machine learning classification algorithms, the multi-label classification models being of five types: XGBoost, AdaBoost, random forest, support vector machine and k-nearest neighbors; inputting the feature vector of each sample into each multi-label classification model, each multi-label classification model outputting a classification result indicating whether the sample has a vulnerability and of which types, wherein the vulnerabilities of the samples comprise integer underflow vulnerabilities, integer overflow vulnerabilities, transaction-ordering dependence vulnerabilities, timestamp dependence vulnerabilities, return value vulnerabilities and reentrancy vulnerabilities;

and comparing the classification result output by each multi-label classification model for each sample with the refined vulnerability labeling result of that sample using the micro-F1, macro-F1 and F1-score evaluation metrics, evaluating the classification performance of each multi-label classification model according to the comparison, and selecting the trained multi-label classification model with the best classification performance.
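The opcode abstraction and bigram segmentation steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstraction rule shown (collapsing numbered opcode families such as PUSH1 to PUSH) and the feature value (relative bigram frequency) are assumptions chosen for illustration, since neither the set abstraction rule nor the feature calculation formula is reproduced in this section. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumed abstraction rule (hypothetical): drop the numeric suffix of opcode
# families, e.g. PUSH1..PUSH32 -> PUSH, DUP1..DUP16 -> DUP, SWAP1 -> SWAP.
_FAMILY = re.compile(r"^(PUSH|DUP|SWAP|LOG)\d+$")


def abstract_opcode(op):
    """Map one opcode mnemonic to its abstracted form."""
    match = _FAMILY.match(op)
    return match.group(1) if match else op


def bigram_features(opcodes):
    """Return {bigram: value} for an opcode stream.

    The value is the relative frequency of the bigram in the stream,
    an assumed stand-in for the feature calculation formula.
    """
    abstracted = [abstract_opcode(op) for op in opcodes]
    pairs = list(zip(abstracted, abstracted[1:]))  # adjacent opcode pairs
    if not pairs:
        return {}
    counts = Counter(pairs)
    return {pair: n / len(pairs) for pair, n in counts.items()}


def to_vector(features, vocabulary):
    """Lay the features out in a fixed bigram order (0.0 when absent)."""
    return [features.get(pair, 0.0) for pair in vocabulary]


# Toy opcode stream; a real stream would come from compiled contract bytecode.
stream = ["PUSH1", "PUSH1", "MSTORE", "CALLVALUE", "DUP1", "ISZERO"]
features = bigram_features(stream)
vocabulary = sorted(features)
vector = to_vector(features, vocabulary)
```

In the claimed method the vocabulary would be the fixed set of 1619 bigram features extracted from the whole sample data set, rather than the bigrams of a single contract as in this toy example.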

2. The method of claim 1, wherein forming a sample data set from verified smart contracts comprises:

crawling a certain amount of verified smart contract data from the Etherscan website, and forming the sample data set from all of the crawled smart contract data.

3. The method of claim 1, wherein the set opcode abstraction rule comprises:

4. The method of claim 1, wherein the classification of whether a sample has a vulnerability and of which types is represented by a classification label, each entry of the classification label corresponds to one vulnerability type, an entry value of 1 indicates that the sample has that vulnerability, and an entry value of 0 indicates that the sample does not have that vulnerability.
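As an illustration of the 0/1 classification label of claim 4 and of the micro-F1 and macro-F1 metrics named in claim 1, the following Python sketch encodes vulnerability sets as label vectors and scores predictions against them. The ordering of the six entries and all identifier names are assumptions made for this example; the claims do not fix an entry order.

```python
# Assumed entry order for the classification label (hypothetical).
VULNERABILITIES = [
    "integer_underflow",
    "integer_overflow",
    "transaction_ordering_dependence",
    "timestamp_dependence",
    "return_value",
    "reentrancy",
]


def encode_labels(found):
    """Return a 0/1 vector with one entry per vulnerability type."""
    return [1 if name in found else 0 for name in VULNERABILITIES]


def _f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Compute (micro-F1, macro-F1) over lists of 0/1 label vectors."""
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                tp[j] += 1
            elif p:
                fp[j] += 1
            elif t:
                fn[j] += 1
    # Micro: pool the counts of all labels, then compute one F1.
    micro = _f1(sum(tp), sum(fp), sum(fn))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(_f1(tp[j], fp[j], fn[j]) for j in range(n_labels)) / n_labels
    return micro, macro


# Two toy contracts: refined labeling results versus a model's predictions.
y_true = [
    encode_labels({"reentrancy"}),
    encode_labels({"integer_overflow", "timestamp_dependence"}),
]
y_pred = [
    encode_labels({"reentrancy"}),
    encode_labels({"integer_overflow"}),
]
micro, macro = micro_macro_f1(y_true, y_pred)
```

Micro-F1 weights every label decision equally, so frequent vulnerability types dominate it, while macro-F1 weights every vulnerability type equally, so a model that misses a rare type is penalized more heavily.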