CN118211221A - Multi-programming language software back door detection method based on intermediate representation - Google Patents

Info

Publication number
CN118211221A
Authority
CN
China
Prior art keywords
variables
backdoor
statements
vfg
data structure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410341938.3A
Other languages
Chinese (zh)
Inventor
吴志勇
宋晓斌
岳贯集
饶金龙
马陈城
刘磊
黄天纵
朱怀东
张俊
俞仁涵
刘茂强
王耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
UNIT 61660 OF PLA
Original Assignee
UNIT 61660 OF PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by UNIT 61660 OF PLA
Priority to CN202410341938.3A
Publication of CN118211221A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562: Static detection
    • G06F21/563: Static detection by source code analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566: Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention relates to a multi-programming-language software backdoor detection method based on intermediate representation, belonging to the field of software security. Conventional techniques detect deeply hidden software backdoors with poor efficiency and precision. By establishing a unified intermediate representation model, the invention enables cross-language analysis of the internal logic of software written in several programming languages, so that deeper and better-concealed backdoors can be found. This compensates for the loss of precision that arises when conventional analysis must process each language separately.

Description

A multi-programming-language software backdoor detection method based on intermediate representation

Technical Field

The invention belongs to the field of software security, and in particular relates to a multi-programming-language software backdoor detection method based on intermediate representation.

Background

A software backdoor is a piece of code maliciously implanted in a software system that allows an attacker to bypass normal security monitoring mechanisms and thereby steal information or execute instructions. Static code analysis is an important means of detecting software backdoors: it scans the source code to locate suspicious backdoor fragments.

Static code analysis does not depend on executing the code; it analyzes a program without running it in order to find potential software backdoors. It is already widely used in software development and security, in techniques such as source code security scanning, code obfuscation, taint analysis, data flow analysis and control flow analysis.

However, for software composed of multiple programming languages, conventional static code analysis must examine each language separately and therefore has difficulty finding deeply hidden backdoors. Researchers have proposed solutions such as symbolic-execution-based and deep-learning-based methods, but limited detection efficiency and precision restrict their effectiveness in practice.

Summary of the Invention

(I) Technical problem to be solved

The technical problem to be solved by the present invention is to provide a backdoor detection method for software comprising multiple programming languages that discovers backdoor logic hidden deep inside the software.

(II) Technical solution

To solve the above technical problem, the present invention provides a multi-programming-language software backdoor detection method based on intermediate representation, comprising the following steps:

Step 1: Collect the source code files of each programming language in the software under test;

Step 2: Map the source code of every programming language to a unified intermediate representation (IR);

Step 3: On the basis of the IR, build the data flow model VFG;

Step 4: Traverse the nodes of the VFG to locate possible backdoor implantation points and backdoor trigger points;

Step 5: Perform flow tracking in the VFG, searching for a feasible program path between a backdoor implantation point and a backdoor trigger point, and thereby determine whether a software backdoor exists.

Preferably, when source code files are collected in step 1, if the software uses both Java and Python, the source code files with the suffixes ".java" and ".py" are collected.

Preferably, in step 2 the source code is uniformly converted into the intermediate representation IR. The intermediate representation is an abstraction of program logic that hides language-specific details while retaining the structural characteristics of the program instructions.

Preferably, the specific conversion in step 2 is as follows:

For declaration statements in any language, a DeclareStatement data structure is created, storing the type, name and initialization expression of the declared variable;

For assignment statements in any language, an AssignStatement data structure is created, storing the left value, the right value and the assignment operator used;

For loop statements in any language, a LoopStatement data structure is created, storing the loop exit condition, the initialization of the iteration variable and the statements inside the loop body;

For branch statements in any language, a BranchStatement data structure is created, storing the branch condition, the statements executed when the condition is true and the statements executed when it is false;

For function call statements in any language, a CallStatement data structure is created, storing the function name and the argument list;

For return statements in any language, a ReturnStatement data structure is created, storing the returned expression;

For parameter declarations in any language, an ArgStatement data structure is created, storing the name and type of the parameter variable;

For binary expressions in any language, a BinaryExpression data structure is created, storing the left operand, the right operand and the operator;

For unary expressions in any language, a UnaryExpression data structure is created, storing the operand and the operator.

Preferably, the data flow model VFG constructed in step 3 represents the value transfer relationships in the program.

Preferably, the VFG is constructed in step 3 as follows:

31. Using the type of each data structure, determine the variables defined by each program statement: a DeclareStatement defines the variable it declares; an AssignStatement defines its left value; an ArgStatement defines its parameter;

32. Using the type of each data structure, determine the variables used by each program statement: a DeclareStatement uses the variables in its initialization expression; an AssignStatement uses the variables in its right value; a LoopStatement uses the loop iteration variable; a BranchStatement uses the variables in its condition expression; a CallStatement uses its argument variables; a ReturnStatement uses the variables in the returned expression;

33. Establish the value transfer relationships within a function: based on the results of steps 31 and 32, identify each statement's variable definitions and uses, and link each variable definition to the uses of the variable with the same name;

34. Establish the value transfer relationships between functions: for each function call site (CallStatement), find the called function by its name and associate the actual argument list at the call site with the formal parameter list of that function;

35. Create one node in the VFG for each program statement, then, based on the results of steps 33 and 34, create directed edges between nodes, each running from a variable's definition point to one of its use points, to represent the transfer of data.

Preferably, in step 4, a possible backdoor implantation point found by traversing the VFG nodes is an instruction that may allow an attacker to supply data from outside, and is identified by the functions called and the variables used in the program; a possible backdoor trigger point found by traversing the VFG nodes is an instruction through which an attacker may carry out a specific backdoor behavior, and is likewise identified by the functions called and the variables used in the program.

Preferably, in step 5, the VFG is searched for a feasible program path between an implantation point and a trigger point: the search starts from the implantation point, and if a path reaching the trigger point exists, a software backdoor is considered possible.

The present invention also provides a system for implementing the method.

The present invention also provides a network security analysis method implemented on the basis of the method.

(III) Beneficial effects

By establishing a unified intermediate representation model, the present invention enables cross-language analysis of the internal logic of software comprising multiple programming languages. It can therefore discover deeper and better-concealed backdoors, and it compensates for the loss of precision that conventional analysis suffers when it must process each language separately.

Brief Description of the Drawings

FIG. 1 is the overall flow chart of the multi-programming-language software backdoor analysis method based on intermediate representation of the present invention. It has five stages: collecting the source code files of the software, converting the source code into the intermediate representation, constructing the data flow model, locating implantation points and trigger points, and searching for feasible backdoor execution paths;

FIG. 2 is a screenshot of the Java code of the software under test;

FIG. 3 is a screenshot of the Python code of the software under test;

FIG. 4 is the VFG model corresponding to the code of the embodiment. For clarity, only the instruction nodes related to the backdoor are retained in the figure. The graph divides into two parts, Java data reception and the Python console; the overall flow of the backdoor runs through both parts and thus spans modules written in two languages. Because the VFG is built on the intermediate representation, the two parts connect directly at the script call point and show no syntactic difference, which makes unified analysis straightforward.

Detailed Description

To make the purpose, content and advantages of the present invention clearer, specific embodiments of the invention are described in further detail below with reference to the accompanying drawings and examples.

The present invention provides a multi-programming-language software backdoor detection method based on intermediate representation, that is, a backdoor detection method for software comprising multiple programming languages. A unified intermediate representation model is established for the various programming languages, and a unified model structure is built on top of it. This enables cross-language data flow tracking and the discovery of backdoor logic hidden deep inside complex software that mixes several programming languages.

Referring to FIG. 1, the technical solution of the present invention comprises the following steps:

Step 1: Collect the source code files of each programming language in the software under test;

Step 2: Map the source code of every programming language to a unified intermediate representation (IR);

Step 3: On the basis of the IR, build the data flow model VFG (Value Flow Graph);

Step 4: Traverse the nodes of the VFG to locate possible backdoor implantation points and backdoor trigger points;

Step 5: Perform flow tracking in the VFG, searching for a feasible program path between a backdoor implantation point and a backdoor trigger point, and on that basis determine whether a software backdoor exists.

When source code files are collected in step 1, if the software uses both Java and Python, the source code files with the suffixes ".java" and ".py" are collected.
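
As an illustration of step 1, the suffix-based collection could be sketched as follows. This is a minimal sketch, not the patent's implementation: the function name `collect_sources` and the `SUFFIX_TO_LANGUAGE` table are assumptions introduced here, and only the two suffixes named in the text are covered.

```python
from pathlib import Path

# Assumed suffix table; the text names only ".java" and ".py".
SUFFIX_TO_LANGUAGE = {".java": "java", ".py": "python"}


def collect_sources(root):
    """Group the source files found under `root` by language, using the file suffix."""
    found = {}
    for path in sorted(Path(root).rglob("*")):
        language = SUFFIX_TO_LANGUAGE.get(path.suffix)
        if language is not None and path.is_file():
            found.setdefault(language, []).append(path)
    return found
```

Calling `collect_sources("path/to/project")` returns a dictionary such as `{"java": [...], "python": [...]}`; files with other suffixes are ignored.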

In step 2 the source code is uniformly converted into the intermediate representation IR. The intermediate representation is an abstraction of program logic that hides language-specific details while retaining the structural characteristics of the program instructions: statement types such as variable declarations, function calls and function returns remain clearly distinguished in the intermediate representation, and this step only eliminates the differences between languages.

The specific conversion is as follows:

For declaration statements in any language, a DeclareStatement data structure is created, storing the type, name and initialization expression of the declared variable;

For assignment statements in any language, an AssignStatement data structure is created, storing the left value, the right value and the assignment operator used;

For loop statements in any language, a LoopStatement data structure is created, storing the loop exit condition, the initialization of the iteration variable and the statements inside the loop body;

For branch statements in any language, a BranchStatement data structure is created, storing the branch condition, the statements executed when the condition is true and the statements executed when it is false;

For function call statements in any language, a CallStatement data structure is created, storing the function name and the argument list;

For return statements in any language, a ReturnStatement data structure is created, storing the returned expression;

For parameter declarations in any language, an ArgStatement data structure is created, storing the name and type of the parameter variable;

For binary expressions in any language, a BinaryExpression data structure is created, storing the left operand, the right operand and the operator;

For unary expressions in any language, a UnaryExpression data structure is created, storing the operand and the operator.

Applying the above conversion in order to the statements and expressions of the code under test maps the same logic in different programming languages onto unified data structures. The Statement-type data structures correspond to concrete program statements, and the Expression-type data structures correspond to the finer-grained expressions inside statements. Together these two kinds of data structure constitute the intermediate representation model adopted by the present invention; through the above conversion, multiple programming languages can be translated into the intermediate representation.
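
The following sketch shows what four of these records and a front end for one language might look like. It is an assumption-laden illustration: the field names, the function `lower_python`, and the choice to store expressions as source text are introduced here and are not taken from the patent. The front end uses Python's standard `ast` module (with `ast.unparse`, available from Python 3.9) and handles only a small subset of statements; the remaining record types would follow the same pattern.

```python
import ast
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ArgStatement:
    name: str
    type: Optional[str] = None  # None when the source gives no type


@dataclass
class AssignStatement:
    left: str
    right: str
    operator: str = "="


@dataclass
class CallStatement:
    function_name: str
    arguments: List[str] = field(default_factory=list)


@dataclass
class ReturnStatement:
    expression: str


def lower_python(source):
    """Lower a small subset of Python source into language-neutral IR records."""
    ir = []

    def visit(statements):
        for node in statements:
            if isinstance(node, ast.FunctionDef):
                # Each formal parameter becomes an ArgStatement.
                for arg in node.args.args:
                    annotation = ast.unparse(arg.annotation) if arg.annotation else None
                    ir.append(ArgStatement(arg.arg, annotation))
                visit(node.body)
            elif isinstance(node, ast.Assign) and len(node.targets) == 1:
                ir.append(AssignStatement(ast.unparse(node.targets[0]),
                                          ast.unparse(node.value)))
            elif isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
                call = node.value
                ir.append(CallStatement(ast.unparse(call.func),
                                        [ast.unparse(a) for a in call.args]))
            elif isinstance(node, ast.Return):
                ir.append(ReturnStatement(ast.unparse(node.value) if node.value else ""))

    visit(ast.parse(source).body)
    return ir
```

A Java front end would emit the same record types from a Java parser, which is what lets later stages ignore the source language.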

In step 3, the data flow model VFG is constructed to represent the value transfer relationships in the program. It is constructed as follows:

31. Using the type of each data structure, determine the variables defined by each program statement: a DeclareStatement defines the variable it declares; an AssignStatement defines its left value; an ArgStatement defines its parameter;

32. Using the type of each data structure, determine the variables used by each program statement: a DeclareStatement uses the variables in its initialization expression; an AssignStatement uses the variables in its right value; a LoopStatement uses the loop iteration variable; a BranchStatement uses the variables in its condition expression; a CallStatement uses its argument variables; a ReturnStatement uses the variables in the returned expression;

33. Establish the value transfer relationships within a function: based on the results of steps 31 and 32, identify each statement's variable definitions and uses, and link each variable definition to the uses of the variable with the same name;

34. Establish the value transfer relationships between functions: for each function call site (CallStatement), find the called function by its name and associate the actual argument list at the call site with the formal parameter list of that function;

35. Create one node in the VFG for each program statement, then, based on the results of steps 33 and 34, create directed edges between nodes, each running from a variable's definition point to one of its use points, to represent the transfer of data.

After these steps, the VFG has been generated from the IR. Because the logic of a software backdoor is usually tied to value transfer and system instructions, the subsequent steps search the VFG for the data transfer patterns characteristic of software backdoors.
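
Steps 33 and 35 can be sketched for a single function as below. The sketch makes simplifying assumptions that the patent does not state: statements are straight-line (no branches or loops), and each use is linked to the most recent earlier definition of the same name. The `Node` record and `build_vfg` are hypothetical names; the `defs` and `uses` sets stand for the results of steps 31 and 32. Step 34 would add edges from a CallStatement node to the callee's ArgStatement nodes in the same edge format.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Node:
    label: str                                   # text of the IR statement
    defs: Set[str] = field(default_factory=set)  # variables defined (step 31)
    uses: Set[str] = field(default_factory=set)  # variables used (step 32)


def build_vfg(nodes):
    """Return directed edges (definition_index, use_index) for straight-line code."""
    edges = []
    last_def = {}  # variable name -> index of its most recent defining node
    for index, node in enumerate(nodes):
        # Link uses first, so that `x = x + 1` reads the previous definition of x.
        for name in sorted(node.uses):
            if name in last_def:
                edges.append((last_def[name], index))
        for name in node.defs:
            last_def[name] = index
    return edges
```

For the three statements `data = read()`, `cmd = data`, `run(cmd)`, this yields the edges `(0, 1)` and `(1, 2)`.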

In step 4, a possible backdoor implantation point found by traversing the VFG nodes is an instruction that may allow an attacker to supply data from outside, such as a user input point, a network interface or a file read; such points can be identified by the functions called and the variables used in the program. A possible backdoor trigger point found by traversing the VFG nodes is an instruction through which an attacker may carry out a specific backdoor behavior, such as executing a system command, creating a file or transmitting data remotely; these can likewise be identified by the functions called and the variables used in the program.
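
One simple way to identify such points is to match the called function's name against two lists, as in the sketch below. The lists are hypothetical examples chosen for illustration; the patent does not enumerate them, and a practical tool would need curated, per-language lists.

```python
# Hypothetical name lists, for illustration only.
SOURCE_FUNCTIONS = {"input", "recv", "read", "readLine"}
SINK_FUNCTIONS = {"exec", "system", "open", "send"}


def classify(call_nodes):
    """Split (node_index, function_name) pairs into candidate sources and sinks."""
    sources, sinks = [], []
    for index, function_name in call_nodes:
        # Compare on the last component, so "os.system" matches "system".
        short_name = function_name.rsplit(".", 1)[-1]
        if short_name in SOURCE_FUNCTIONS:
            sources.append(index)
        if short_name in SINK_FUNCTIONS:
            sinks.append(index)
    return sources, sinks
```

Because the matching runs on IR call nodes rather than on source text, the same pass covers every language that has been lowered to the IR.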

Step 5: Search the VFG for a feasible program path between an implantation point and a trigger point, that is, a path along which a search starting from the implantation point reaches the trigger point. If such a path exists, a software backdoor is considered possible and the problem is reported.
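
The path search amounts to graph reachability. The sketch below uses breadth-first search over an edge list of `(from_index, to_index)` pairs; the patent does not name a search algorithm, so BFS and the function name `find_path` are assumptions made here.

```python
from collections import deque


def find_path(edges, source, sink):
    """Return one shortest source-to-sink path as a list of nodes, or None."""
    successors = {}
    for start, end in edges:
        successors.setdefault(start, []).append(end)

    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == sink:
            # Walk the predecessor links back to the source.
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for nxt in successors.get(current, []):
            if nxt not in previous:
                previous[nxt] = current
                queue.append(nxt)
    return None
```

Returning the path itself, rather than a yes/no answer, lets a report show the chain of statements that connects the implantation point to the trigger point.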

Example 1

This embodiment illustrates the multi-programming-language software backdoor analysis method based on intermediate representation.

The software under test consists of two pieces of code, one in Java and one in Python: the Java code shown in FIG. 2 and the Python code (the process_input.py script) shown in FIG. 3.

The code in FIG. 2 creates a network connection through a socket and passes the input received over the network to the local process_input.py script for processing. The Java code by itself does not contain the complete backdoor logic.

The code in FIG. 3 receives the parameters passed in from the console. If they include the download parameter, a file download is performed, and the file to download is likewise determined by the parameters.

Together, the two pieces of code make up the complete logic of the example software. The Java code establishes a network connection and receives data, then passes the external input to the Python script through a command-line system call; the Python script performs the corresponding operations, including a file download when the download parameter is given. An attacker can use this mechanism to download files to the server.

First, the intermediate representation IR is built for the two pieces of code; then the data flow model VFG is built on the IR to represent value transfer across the multi-language software. The resulting VFG model is shown in FIG. 4.

Finally, flow tracking is performed on the VFG model. The backdoor implantation point is located at "socket", the input point for external data, recorded as the source; the backdoor trigger point is located at "canisrufus.commit", the execution point of the system call, recorded as the sink. Searching for an execution path between the two finds the feasible path [socket -> … -> command -> … -> exec("python") -> … -> args.download -> … -> canisrufus.commit]. The path contains no operation that checks whether the external input data is compliant, so the multi-language software is judged to contain a suspected backdoor.
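
The final judgment, a source-to-sink path with no validation on it, can be sketched as a filter over a found path. The function `is_suspicious` and the validator name `validate_input` are hypothetical. The path below lists only the waypoints named in the embodiment; the elided intermediate nodes are not reconstructed.

```python
def is_suspicious(path, validators):
    """Flag a source-to-sink path only if no node on it validates the input."""
    return not any(node in validators for node in path)


# Named waypoints from the embodiment; elided intermediate nodes are omitted.
path = ["socket", "command", 'exec("python")', "args.download", "canisrufus.commit"]

# "validate_input" is a hypothetical validator name, not from the patent.
flagged = is_suspicious(path, validators={"validate_input"})
```

Here `flagged` is true because none of the waypoints is a validator; inserting a validator node anywhere on the path would clear the flag.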

The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art can make improvements and modifications without departing from the technical principles of the present invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. A method for detecting backdoors in multi-programming-language software based on an intermediate representation, characterized in that it comprises the following steps:
Step 1: collect the source code files of each programming language used in the software under test;
Step 2: map the source code of each programming language to a unified intermediate representation (IR);
Step 3: build the data flow model VFG on the basis of the IR;
Step 4: traverse the nodes of the VFG to locate possible backdoor implantation points and backdoor trigger points;
Step 5: perform flow tracking in the VFG, searching for a feasible program path between a backdoor implantation point and a backdoor trigger point, and thereby determine whether a software backdoor exists;
wherein, when source code files are collected in Step 1, if the software uses both Java and Python, the source code files with the suffixes ".java" and ".py" are collected.

2. The method according to claim 1, characterized in that in Step 2 the source code is uniformly converted into the intermediate representation IR, which is an abstraction of the program logic that hides language-specific details and preserves the structural characteristics of the program instructions.

3. The method according to claim 2, characterized in that the conversion in Step 2 is performed as follows:
for declaration statements in the code of any programming language, a DeclareStatement data structure is created, which stores the type, name and initialization expression of the declared variable;
for assignment statements, an AssignStatement data structure is created, which stores the left-hand value, the right-hand value and the assignment operator used;
for loop statements, a LoopStatement data structure is created, which stores the loop exit condition, the initialization of the iteration variable and the statements inside the loop body;
for branch statements, a BranchStatement data structure is created, which stores the branch condition, the statements executed when the condition is true and the statements executed when the condition is false;
for function call statements, a CallStatement data structure is created, which stores the function name and the argument list of the call;
for return statements, a ReturnStatement data structure is created, which stores the returned expression;
for parameter declarations, an ArgStatement data structure is created, which stores the name and type of the parameter variable;
for binary expressions, a BinaryExpression data structure is created, which stores the left operand, the right operand and the operator;
for unary expressions, a UnaryExpression data structure is created, which stores the operand and the operator.

4. The method according to claim 3, characterized in that the data flow model VFG constructed in Step 3 represents the value-transfer relationships in the program.

5. The method according to claim 4, characterized in that the VFG is constructed in Step 3 as follows:
31. Using the type of the data structure, determine the variables defined by each program statement: a DeclareStatement defines the variable it declares; an AssignStatement defines its left-hand value; an ArgStatement defines a parameter;
32. Using the type of the data structure, determine the variables used by each program statement: a DeclareStatement uses the variables in its initialization expression; an AssignStatement uses the variables in its right-hand value; a LoopStatement uses the loop iteration variables; a BranchStatement uses the variables in its condition expression; a CallStatement uses each of its argument variables; a ReturnStatement uses the variables in the returned expression;
33. Establish the value-transfer relationships within a function: based on the results of steps 31 and 32, identify which variables each program statement defines and uses, and, by variable name, associate each variable definition with the uses of that variable; these associations are the intra-function value-transfer relationships;
34. Establish the value-transfer relationships between functions: for each function call site (CallStatement), find the called function by its function name and associate the actual argument list at the call site with the formal parameter list of that function; these associations are the inter-function value-transfer relationships;
35. Create one node in the VFG for each program statement, then, based on the results of steps 33 and 34, create directed edges between the nodes, each running from the definition point of a variable to a use point of that variable, to represent the transfer of data.

6. The method according to claim 5, characterized in that in Step 4 a possible backdoor implantation point found by traversing the VFG nodes is an instruction that may allow an attacker to supply data from outside, identified by the functions called and the variables used in the program; and a possible backdoor trigger point found by traversing the VFG nodes is an instruction with which an attacker may carry out the actual backdoor behavior, likewise identified by the functions called and the variables used in the program.

7. The method according to claim 5, characterized in that in Step 5 the VFG is searched for a feasible program path between the implantation point and the trigger point, that is, the search starts from the implantation point in the VFG, and if a path that reaches the trigger point exists, a software backdoor is considered possibly present.

8. The method according to claim 5, characterized in that the method is applied in static code analysis.

9. A system for implementing the method according to any one of claims 1 to 8.

10. A network security analysis method implemented on the basis of the method according to any one of claims 1 to 8.

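The pipeline recited in claims 3, 5, 6 and 7 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the applicant's implementation: only three of the IR structures are modeled, the field names (`init_vars`, `right_vars`, `callee`) are hypothetical, right-hand sides are flattened to the variables they read plus an optional callee name, the code is treated as straight-line within a single function (each use is linked to the most recent definition of the same name), and the function names `read_socket` and `exec_command` are invented placeholders for an externally controlled input and a sensitive operation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DeclareStatement:
    var_type: str
    name: str
    init_vars: list = field(default_factory=list)  # variables read by the initializer


@dataclass
class AssignStatement:
    left: str
    right_vars: list
    op: str = "="
    callee: Optional[str] = None  # assumption: function called on the right-hand side


@dataclass
class CallStatement:
    func_name: str
    args: list


def defined_vars(stmt):
    """Step 31: variables a statement defines, chosen by its IR type."""
    if isinstance(stmt, DeclareStatement):
        return [stmt.name]
    if isinstance(stmt, AssignStatement):
        return [stmt.left]
    return []


def used_vars(stmt):
    """Step 32: variables a statement uses, chosen by its IR type."""
    if isinstance(stmt, DeclareStatement):
        return list(stmt.init_vars)
    if isinstance(stmt, AssignStatement):
        return list(stmt.right_vars)
    if isinstance(stmt, CallStatement):
        return list(stmt.args)
    return []


def build_vfg(stmts):
    """Steps 33 and 35: one node per statement, edges from definition to use."""
    edges = {i: set() for i in range(len(stmts))}
    last_def = {}  # variable name -> index of its most recent definition
    for i, stmt in enumerate(stmts):
        for var in used_vars(stmt):
            if var in last_def:
                edges[last_def[var]].add(i)
        for var in defined_vars(stmt):
            last_def[var] = i
    return edges


SOURCE_FUNCS = {"read_socket"}   # hypothetical implantation-point functions
SINK_FUNCS = {"exec_command"}    # hypothetical trigger-point functions


def callee_of(stmt):
    if isinstance(stmt, CallStatement):
        return stmt.func_name
    return getattr(stmt, "callee", None)


def reachable(edges, start, goal):
    """Step 5: breadth-first search along value-flow edges."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in edges[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False


def may_contain_backdoor(stmts):
    """Steps 4 and 5: locate candidate points, then test for a connecting path."""
    edges = build_vfg(stmts)
    sources = [i for i, s in enumerate(stmts) if callee_of(s) in SOURCE_FUNCS]
    sinks = [i for i, s in enumerate(stmts) if callee_of(s) in SINK_FUNCS]
    return any(reachable(edges, s, t) for s in sources for t in sinks)


# External data flows into the command that is executed.
suspicious = [
    AssignStatement("data", [], callee="read_socket"),
    DeclareStatement("str", "cmd", ["data"]),
    CallStatement("exec_command", ["cmd"]),
]
# The executed command is initialized from a constant, not from external data.
benign = [
    AssignStatement("data", [], callee="read_socket"),
    DeclareStatement("str", "cmd", []),
    CallStatement("exec_command", ["cmd"]),
]
```

Because both front ends (for example Java and Python) would emit the same statement types, the graph construction and the search need to be written only once. The inter-function linking of step 34, loops and branches are omitted here; a fuller version would also connect call-site arguments to the formal parameters of the called function.
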
CN202410341938.3A 2024-03-25 2024-03-25 Multi-programming language software back door detection method based on intermediate representation Pending CN118211221A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410341938.3A CN118211221A (en) 2024-03-25 2024-03-25 Multi-programming language software back door detection method based on intermediate representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410341938.3A CN118211221A (en) 2024-03-25 2024-03-25 Multi-programming language software back door detection method based on intermediate representation

Publications (1)

Publication Number Publication Date
CN118211221A true CN118211221A (en) 2024-06-18

Family

ID=91445703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410341938.3A Pending CN118211221A (en) 2024-03-25 2024-03-25 Multi-programming language software back door detection method based on intermediate representation

Country Status (1)

Country Link
CN (1) CN118211221A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5857182A (en) * 1997-01-21 1999-01-05 International Business Machines Corporation Database management system, method and program for supporting the mutation of a composite object without read/write and write/write conflicts
CN106371887A (en) * 2016-11-08 2017-02-01 西安电子科技大学 System and method for MSVL compiling
CN115906086A (en) * 2023-02-23 2023-04-04 中国人民解放军国防科技大学 Method, system and storage medium for detecting webpage backdoor based on code attribute graph
CN117556431A (en) * 2024-01-12 2024-02-13 北京北大软件工程股份有限公司 Mixed software vulnerability analysis method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
常超;刘克胜;谭龙丹;贾文超;: "基于图模型的C程序数据流分析", 浙江大学学报(工学版), no. 05, 15 May 2017 (2017-05-15) *
葛召华等: "《计算机应用技术与课程建设研究》", 31 December 2022, 西北工业出版社, pages: 93 - 94 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination