CN111241307A - Software project and third-party library knowledge graph construction method for software system - Google Patents

Info

Publication number
CN111241307A
Authority
CN
China
Prior art keywords
party library
software project
software
knowledge
release
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010077130.0A
Other languages
Chinese (zh)
Inventor
陈碧欢
彭鑫
赵文耘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CN202010077130.0A
Publication of CN111241307A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/30: Creation or generation of source code
    • G06F8/36: Software reuse

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention belongs to the technical field of software engineering and specifically concerns a method for constructing a knowledge graph of software projects and third-party libraries for a software ecosystem. The method comprises: obtaining basic knowledge of software projects and of their release versions by crawling and parsing the projects' basic information and release notes; obtaining code clone knowledge between release versions of different software projects through code clone detection and analysis; and obtaining knowledge of defects and defect fixes in release versions by crawling and parsing the projects' defect tracking systems, then analyzing the links between defects, third-party library APIs, and code clones. The constructed knowledge graph includes software projects, software project release versions, and the relationships between defects and code clones, among others. It can support intelligent applications such as software project maturity assessment, third-party library update recommendation, and conflict detection.

Figure 202010077130

Description

Software project and third-party library knowledge graph construction method for software system

Technical Field

The invention belongs to the technical field of software engineering, and in particular relates to a method for constructing a knowledge graph of software projects and third-party libraries for a software ecosystem.

Background

The many software systems in open source communities and inside enterprises fall into related business domains or categories. They compete with and depend on one another, and they contain large amounts of duplicated code and similar functionality, which together form a complex software ecosystem. Selecting suitable projects from such an ecosystem and reusing them, as code or as third-party libraries, is an important way to improve software development efficiency and software product quality. When deciding whether to reuse a software project, developers need to consider not only multi-dimensional knowledge about that project (for example, its business classification, functional features, license, and defects) but also its relationships to similar projects (for example, differences in functional features, derivation relationships, and code clones). Moreover, once developers have reused a software project (for example, as a third-party library dependency or through secondary development), their product code must co-evolve with the version evolution of that project. Both reuse decisions and reuse evolution in a software ecosystem therefore require a large amount of software project knowledge. This knowledge, however, is usually multi-source and heterogeneous, which makes it difficult for developers to obtain comprehensive and effective knowledge support for reuse decisions and reuse evolution.

A knowledge graph represents real-world entities and the associations between them as a graph, in which nodes represent entities and edges represent the associations between entities. Knowledge graphs provide a basis for representing and understanding knowledge and thereby support higher-level intelligent applications. They are already widely used in search, finance, e-commerce, healthcare, security, and other fields; for example, Google uses a knowledge graph to improve its search engine results.
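
The node-and-edge representation described above can be sketched as a small in-memory structure. This is only an illustration: the class name, the node types, and the relation names (`has_release`, `depends_on`) are assumptions chosen for the example and are not taken from the patent.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Minimal in-memory graph: typed nodes plus labeled, directed edges."""

    def __init__(self):
        self.node_types = {}            # node id -> entity type
        self.edges = defaultdict(set)   # (source, relation) -> set of targets

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type

    def add_edge(self, source, relation, target):
        # Refuse edges whose endpoints were never declared as entities.
        if source not in self.node_types or target not in self.node_types:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[(source, relation)].add(target)

    def neighbors(self, source, relation):
        return sorted(self.edges.get((source, relation), set()))


# Hypothetical entities and relations, for illustration only.
kg = KnowledgeGraph()
kg.add_node("demo-app", "SoftwareProject")
kg.add_node("demo-app:1.0", "Release")
kg.add_node("demo-lib:2.3", "Release")
kg.add_edge("demo-app", "has_release", "demo-app:1.0")
kg.add_edge("demo-app:1.0", "depends_on", "demo-lib:2.3")
print(kg.neighbors("demo-app:1.0", "depends_on"))
```

A production system would more likely store such triples in a graph database; the dictionary form is used here only to keep the sketch self-contained.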

Summary of the Invention

The purpose of the invention is to provide a method for constructing a knowledge graph of software projects and third-party libraries for a software ecosystem, so as to automatically build a knowledge graph that supports intelligent applications such as software project maturity assessment, automatic classification and recommendation, third-party library update recommendation, defect early warning, and conflict detection.

The method provided by the invention comprises: obtaining basic knowledge of software projects and their release versions by crawling and parsing the projects' basic information and release notes; where the software project is a third-party library, obtaining API knowledge, API evolution knowledge, and API call knowledge for the library's release versions by statically analyzing their source code or binary packages; where the software project is not a third-party library, obtaining knowledge of the third-party library APIs called by its release versions by statically analyzing their source code or binary packages; obtaining code clone knowledge between release versions of different software projects through code clone detection and analysis; and obtaining knowledge of defects and defect fixes in release versions by crawling and parsing the projects' defect tracking systems, then analyzing the links between defects, third-party library APIs, and code clones. The specific steps are as follows.

(1) Software project basic knowledge extraction

The basic knowledge of a software project comprises its programming language, business classification, tags, and derivation relationships to other software projects. A crawler collects all software projects on a project hosting website or a third-party library repository website. For each software project, a web page wrapper parses the structured pages of the hosting or repository website and extracts the project's programming language, tags, and derivation relationships, and a topic model extracts the business classification from the project's basic description.
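
As a rough illustration of the wrapper idea in step (1), the sketch below pulls the language, tags, and derivation source out of a structured project page using only the Python standard library. The page markup and the CSS class names (`lang`, `topic`, `forked-from`) are invented for the example and do not describe any real hosting site; the topic-model step for business classification is not shown.

```python
from html.parser import HTMLParser


class ProjectPageWrapper(HTMLParser):
    """Collects the text of elements whose class attribute marks a known field."""

    # Hypothetical mapping from CSS class to knowledge-graph attribute.
    FIELDS = {"lang": "language", "topic": "tags", "forked-from": "derived_from"}

    def __init__(self):
        super().__init__()
        self.record = {"language": None, "tags": [], "derived_from": None}
        self._current = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self._current = self.FIELDS.get(css_class)

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current is None:
            return
        if self._current == "tags":
            self.record["tags"].append(text)
        else:
            self.record[self._current] = text


# Invented page fragment standing in for a crawled project page.
PAGE = """
<div class="repo">
  <span class="lang">Java</span>
  <a class="topic">json</a><a class="topic">parser</a>
  <span class="forked-from">upstream/demo-lib</span>
</div>
"""

wrapper = ProjectPageWrapper()
wrapper.feed(PAGE)
print(wrapper.record)
```

A wrapper of this kind is tied to one page layout, so each hosting or repository site would need its own field mapping.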

(2) Software project release version basic knowledge extraction

The basic knowledge of a software project release version comprises its new functional features, license, repository source, and developers, together with the predecessor-successor relationships between release versions and their third-party library dependencies. The repository source applies only when the software project is a third-party library; it is where developers download the library's release versions for use. This step obtains the basic knowledge by analyzing release notes, commit history, third-party library repository website pages, and third-party library dependency declaration files. It comprises the following sub-steps:

1) Traverse all release versions of the software project and establish the predecessor-successor relationships between them; locate the descriptions of new functional features in the release notes by keyword matching, and extract the new functional features of each release version with a topic model;

2) Traverse all release versions of the software project; using the release times, determine the segment of commit history that falls after the previous release and before the current release; analyze the developer of each commit in that segment and whether the commit modified the license declaration, thereby determining the developers and license information of the release version; in addition, merge in the known conflict relationships between licenses;

3) If the software project is a third-party library, traverse all of its release versions on the third-party library repository website, parse the structured repository pages with a web page wrapper, and extract the repository source information of each release version;

4) Analyze the third-party library dependency declaration file in each software project release version and extract the third-party library release versions on which it depends.
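
Sub-step 4) can be illustrated for the Maven case that the embodiment later uses. The sketch below reads the directly declared dependencies from a `pom.xml` with the standard library XML parser. The sample POM and the function name are made up for the example, and the sketch deliberately ignores property placeholders, parent POMs, and dependency management, all of which a real resolver would have to handle.

```python
import xml.etree.ElementTree as ET

# Namespace used by Maven POM files.
NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def declared_dependencies(pom_text):
    """Return (groupId:artifactId, version) pairs declared directly in a POM."""
    root = ET.fromstring(pom_text.strip())
    deps = []
    for dep in root.findall("m:dependencies/m:dependency", NS):
        group = dep.findtext("m:groupId", default="", namespaces=NS)
        artifact = dep.findtext("m:artifactId", default="", namespaces=NS)
        # The version may be absent when it is inherited; keep None in that case.
        version = dep.findtext("m:version", default=None, namespaces=NS)
        deps.append((f"{group}:{artifact}", version))
    return deps


# Invented POM for a hypothetical project release.
POM = """
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>demo-app</artifactId>
  <version>1.0</version>
  <dependencies>
    <dependency>
      <groupId>org.example</groupId>
      <artifactId>demo-lib</artifactId>
      <version>2.3</version>
    </dependency>
    <dependency>
      <groupId>org.example</groupId>
      <artifactId>demo-util</artifactId>
    </dependency>
  </dependencies>
</project>
"""

print(declared_dependencies(POM))
```

Each returned pair would become a dependency edge from the project release version to a third-party library release version in the graph.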

(3) Third-party library API knowledge extraction

Software projects often call third-party library APIs, and these APIs evolve as the libraries publish new release versions. This step obtains API-related knowledge by statically analyzing the source code or binary packages of third-party library release versions and software project release versions. It comprises the following sub-steps:

1) Where the software project is a third-party library, analyze the source code or binary package of each release version with a static analysis method to obtain the third-party library APIs that the release version provides;

2) Where the software project is a third-party library, analyze the APIs of two adjacent release versions with a code difference analysis method to determine the APIs first introduced and the APIs deprecated in a release version, the APIs that replace deprecated APIs, and the predecessor-successor relationships of APIs that changed; together these describe the evolution of the third-party library APIs;

3) Where the software project is a third-party library, build the call graph of the APIs in each release version with a program call graph analysis method and extract the call relationships among the APIs;

4) Where the software project is not a third-party library, statically analyze the source code or binary package of each release version to obtain the third-party library APIs that the release version calls.
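
The comparison of adjacent release versions in sub-step 2) can be sketched as a diff over maps from method signature to method source. The sketch assumes the APIs have already been extracted by a static analyzer, and it uses a SHA-256 hash of whitespace-normalized source as the change test, which is one possible reading of a hash-based difference analysis rather than the patent's exact procedure. Detecting deprecation markers and matching deprecated APIs to their replacements is not covered.

```python
import hashlib


def fingerprint(body):
    """Hash a method body after collapsing whitespace runs."""
    normalized = " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def diff_api(old_api, new_api):
    """Compare two releases; each argument maps a signature to its source text."""
    old_sigs, new_sigs = set(old_api), set(new_api)
    changed = sorted(
        sig for sig in old_sigs & new_sigs
        if fingerprint(old_api[sig]) != fingerprint(new_api[sig])
    )
    return {
        "introduced": sorted(new_sigs - old_sigs),
        "removed": sorted(old_sigs - new_sigs),
        "changed": changed,
    }


# Invented API surfaces of two adjacent releases of a hypothetical library.
v1 = {
    "Parser.parse(String)": "return parse(s, false);",
    "Parser.close()": "stream.close();",
}
v2 = {
    "Parser.parse(String)": "return parse(s, true);",
    "Parser.parse(String,boolean)": "return doParse(s, strict);",
}

print(diff_api(v1, v2))
```

Normalizing whitespace before hashing keeps pure reformatting from being reported as an API change.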

(4) Software project release version code clone knowledge extraction

Code clone relationships may exist between software projects. A code clone detection and analysis method obtains the code clone knowledge between release versions of different software projects, as well as the code clone evolution relationships between different release versions of the same software project.
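
The patent does not prescribe a particular clone detection algorithm at this point. As a stand-in, the sketch below scores two code fragments by the Jaccard similarity of their token 5-grams, a simple textual technique that is not the tool named in the embodiment. The tokenizer, the window size `k`, and the sample fragments are all assumptions made for illustration.

```python
import re


def shingles(code, k=5):
    """Return the set of k-token windows of a code fragment."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 0))}


def clone_similarity(code_a, code_b, k=5):
    """Jaccard similarity of the two fragments' token windows (0.0 to 1.0)."""
    a, b = shingles(code_a, k), shingles(code_b, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Invented fragments: B is A with different layout, C is unrelated code.
A = "int sum = 0; for (int i = 0; i < n; i++) { sum += a[i]; }"
B = "int sum=0;\nfor (int i=0; i<n; i++) {\n  sum += a[i];\n}"
C = "String name = user.getName(); return name.trim();"

print(clone_similarity(A, B), clone_similarity(A, C))
```

Because the comparison is on raw tokens, this sketch tolerates layout changes but would miss clones with renamed identifiers; a pair of fragments scoring above a chosen threshold would be recorded as a clone relationship between the two release versions.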

(5) Software project release version defect knowledge extraction

A software project release version fixes defects and also introduces new ones. This step obtains knowledge of defects and defect fixes in release versions by analyzing the defects recorded in the project's defect tracking system, and analyzes the links between defects, third-party library APIs, and code clones. It comprises the following sub-steps:

1) Crawl and parse the structured defect data in the project's defect tracking system to obtain the release versions affected by each defect;

2) Where the software project is a third-party library, traverse the project's commit history, find the commits whose messages contain the defect's ID and the keyword "fix", analyze which third-party library APIs changed in those commits, and establish the relationship that the defect affects those APIs;

3) Traverse the project's commit history, find the commits whose messages contain the defect's ID and the keyword "fix", analyze which code fragments changed in those commits, and, based on the code clone knowledge from step (4), establish the relationship that a code clone contains the defect.
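
Sub-steps 2) and 3) both start from commits whose messages mention a defect ID together with "fix". The sketch below shows one way to select such commits and then find which known clone fragments overlap the changed line ranges. The commit records, the defect ID format, and the clone fragment table are invented for the example; a real pipeline would read them from the version control system and from the clone detector's output.

```python
import re


def fix_commits(commits, defect_id):
    """Return the hashes of commits that mention defect_id and a 'fix' keyword."""
    # Guard both sides so that DEMO-12 does not match DEMO-123 or XDEMO-12.
    id_pattern = re.compile(r"(?<![\w-])" + re.escape(defect_id) + r"(?!\d)")
    fix_pattern = re.compile(r"\bfix", re.IGNORECASE)
    return [
        c["sha"] for c in commits
        if id_pattern.search(c["message"]) and fix_pattern.search(c["message"])
    ]


def clones_touched(changed, clone_fragments):
    """changed: (file, start, end) ranges; clone_fragments: id -> (file, start, end)."""
    hits = []
    for clone_id, (c_file, c_start, c_end) in clone_fragments.items():
        for file, start, end in changed:
            # Two inclusive line ranges overlap when each starts before the other ends.
            if file == c_file and start <= c_end and c_start <= end:
                hits.append(clone_id)
                break
    return sorted(hits)


# Invented commit log and clone table.
commits = [
    {"sha": "a1", "message": "Fix DEMO-12: null check in parser"},
    {"sha": "b2", "message": "DEMO-12 add regression test"},
    {"sha": "c3", "message": "fixes DEMO-123 overflow"},
    {"sha": "d4", "message": "Refactor; also fixed DEMO-12."},
]
clone_fragments = {
    "clone-1": ("Parser.java", 30, 42),
    "clone-2": ("Util.java", 40, 45),
    "clone-3": ("Parser.java", 46, 60),
}

print(fix_commits(commits, "DEMO-12"))
print(clones_touched([("Parser.java", 40, 45)], clone_fragments))
```

The `\bfix` pattern also accepts "fixes" and "fixed", and it would accept unrelated words such as "fixture", so a stricter keyword list may be preferable in practice.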

The high-level structure of the knowledge graph of software projects and third-party libraries constructed by the invention is shown in Figure 1. It comprises software projects, programming languages, business classifications, tags, software project release versions, repository sources, functional features, licenses, developers, third-party library APIs, defects, and code clones, together with the relationships among these entities. The invention constructs the knowledge graph automatically by crawling and analyzing the content of software project hosting websites, third-party library repository websites, and defect tracking systems, and by analyzing software project commits and APIs. The resulting graph supports intelligent applications such as software project maturity assessment, automatic classification and recommendation, third-party library update recommendation, defect early warning, and conflict detection.

Brief Description of the Drawings

FIG. 1 shows the high-level structure of the knowledge graph of software projects and third-party libraries for a software ecosystem constructed by the invention.

Detailed Description

A specific embodiment of the invention is described below for GitHub Java open source projects and Maven third-party libraries. The main process is as follows:

(1) Software project basic knowledge extraction. Retrieve the list of Java open source projects on GitHub through the GitHub API, crawl the web page of each project automatically with the Python library Scrapy, and parse the page content with the Python library Beautiful Soup to extract the basic knowledge of the software project. Likewise, use Scrapy to crawl the list of Maven third-party libraries and the web page of each library, and use Beautiful Soup to parse the page content and extract the basic knowledge of each third-party library project. Extract the business classification of each software project with an LDA topic model;

(2) Software project release version basic knowledge extraction. Locate the descriptions of new functional features in the release notes by matching the keywords "Features" and "Improvements", and extract the new functional features of each release version with an LDA topic model. Use difference analysis to determine whether a release version modified the license, and use the license identification tool Ninka to determine the license after modification. Parse web page content with Beautiful Soup to extract the repository source information of third-party library projects;
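
The keyword matching on "Features" and "Improvements" can be sketched as a scan for section headings followed by collection of the bullet items beneath them. The heading and bullet conventions assumed here (optional `#` prefix, `-` or `*` bullets) and the sample release notes are illustrative guesses, since release note formats vary widely between projects; the subsequent LDA step is not shown.

```python
import re

# A heading line consisting only of one of the two keywords, optionally
# prefixed with '#' marks and optionally followed by a colon.
HEADING = re.compile(r"^\s*#*\s*(features|improvements)\s*:?\s*$", re.IGNORECASE)


def new_feature_lines(release_notes):
    """Collect bullet items that appear under a Features or Improvements heading."""
    collected, in_section = [], False
    for line in release_notes.splitlines():
        stripped = line.strip()
        if HEADING.match(line):
            in_section = True
        elif in_section and stripped.startswith(("-", "*")):
            collected.append(stripped.lstrip("-* ").strip())
        elif stripped:
            # Any other non-blank line (for example another heading) ends the section.
            in_section = False
    return collected


# Invented release notes for a hypothetical release.
NOTES = """
## Features
- Streaming parser for large files
- Support for YAML input

## Bug Fixes
- Fixed crash on empty input

## Improvements
* Faster startup
"""

print(new_feature_lines(NOTES))
```

The collected lines would then serve as the input documents for topic extraction.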

(3) Third-party library API knowledge extraction. Extract the API knowledge of each third-party library release version with the static analysis tool Soot; extract API evolution knowledge through hash-based code difference analysis; obtain the call graph of the third-party library APIs with Soot and establish the call relationships among them; traverse the abstract syntax tree of each software project release version's source code with Java Parser to obtain the third-party library APIs that the release version calls;

(4) Software project release version code clone knowledge extraction. Use the clone detection and analysis tool SAGA to extract the code clones between release versions of different software projects and the code clone evolution relationships between different release versions of the same software project;

(5) Software project release version defect knowledge extraction. Use Scrapy to crawl the web page of each defect of the software project in the Jira defect tracking system, then parse the page content with Beautiful Soup to extract defect knowledge. Traverse the project's commit history, find the commits whose messages contain the defect's ID and the keyword "fix", analyze the third-party library APIs and code fragments changed in those commits, and establish the relationships that a defect affects a third-party library API and that a code clone contains a defect.

The knowledge graph of software projects and third-party libraries built through this process contains project knowledge, third-party library knowledge, defect knowledge, code clone knowledge, and the links between these kinds of knowledge, and so provides rich semantic knowledge about software projects. Based on this knowledge graph, intelligent applications such as software project maturity assessment, automatic classification and recommendation, third-party library update recommendation, defect early warning, and conflict detection can be implemented.
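
To make the defect early warning application concrete, the sketch below combines four kinds of relationships that the graph is described as containing: release depends on library release, release calls API, defect affects library release, and defect affects API. The dictionary layout, the function name, and all sample identifiers are hypothetical; the patent does not specify how such a query is expressed.

```python
def defect_warnings(depends_on, calls, affects_release, affects_api):
    """Return (project release, defect) pairs where the release both depends on
    an affected library release and calls an API affected by the same defect."""
    warnings = []
    for defect, lib_releases in affects_release.items():
        apis = affects_api.get(defect, set())
        for project_release, libs in depends_on.items():
            uses_affected_release = bool(libs & lib_releases)
            calls_affected_api = bool(calls.get(project_release, set()) & apis)
            if uses_affected_release and calls_affected_api:
                warnings.append((project_release, defect))
    return sorted(warnings)


# Invented graph fragments.
depends_on = {
    "demo-app:1.0": {"demo-lib:2.3"},
    "other-app:0.9": {"demo-lib:2.4"},
}
calls = {
    "demo-app:1.0": {"Parser.parse(String)"},
    "other-app:0.9": {"Parser.parse(String)"},
}
affects_release = {"DEMO-12": {"demo-lib:2.3"}}
affects_api = {"DEMO-12": {"Parser.parse(String)"}}

print(defect_warnings(depends_on, calls, affects_release, affects_api))
```

Requiring both conditions narrows the warning to projects that actually exercise the defective API, instead of flagging every project that merely depends on the affected library release.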

Claims (4)

1. A method for constructing a knowledge graph of software projects and third-party libraries for a software ecosystem, characterized by comprising the following steps:
(1) software project basic knowledge extraction
The basic knowledge of a software project comprises its programming language, business classification, tags, and derivation relationships to other software projects; all software projects on a project hosting website or a third-party library repository website are crawled by a crawler; for each software project, the structured pages of the project hosting website or third-party library repository website are parsed by a web page wrapper to extract the project's programming language, tags, and derivation relationships, and the business classification is extracted from the project's basic description by a topic model;
(2) software project release version basic knowledge extraction
The basic knowledge of a software project release version comprises its new functional features, license, repository source, and developers, together with the predecessor-successor relationships between release versions and their third-party library dependencies; the repository source applies only when the software project is a third-party library, and is where developers download the library's release versions for use; the basic knowledge of the release version is obtained by analyzing release notes, commit history, third-party library repository website pages, and third-party library dependency declaration files;
(3) third-party library API knowledge extraction
Software projects often call third-party library APIs, and these APIs evolve with the release versions of the third-party library; knowledge related to the third-party library APIs is obtained by statically analyzing the source code or binary packages of third-party library release versions and software project release versions;
(4) software project release version code clone knowledge extraction
Code clone relationships may exist between software projects; code clone knowledge between release versions of different software projects, and code clone evolution relationships between different release versions of the same software project, are obtained by a code clone detection and analysis method;
(5) software project release version defect knowledge extraction
A software project release version fixes defects and also introduces new defects; knowledge of defects and defect fixes in the release version is obtained by analyzing the defects in the project's defect tracking system, and the links between defects, third-party library APIs, and code clones are analyzed.
2. The method according to claim 1, wherein the software project release version basic knowledge extraction in step (2) specifically comprises the following sub-steps:
1) traversing all release versions of the software project and establishing the predecessor-successor relationships between them; locating the descriptions of new functional features in the release notes by keyword matching, and extracting the new functional features of the release version by a topic model;
2) traversing all release versions of the software project; determining, according to the release times, the segment of commit history after the previous release and before the current release; analyzing the developer of each commit in the segment and whether the commit modified the license declaration, thereby determining the developers and license information of the release version; and, in addition, merging in the known conflict relationships between licenses;
3) if the software project is a third-party library, traversing all of its release versions on the third-party library repository website, parsing the structured repository pages by a web page wrapper, and extracting the repository source information of each release version;
4) analyzing the third-party library dependency declaration file in the software project release version and extracting the third-party library release versions on which the release version depends.
3. The method according to claim 1, wherein the third-party library API knowledge extraction in step (3) specifically comprises the following sub-steps:
1) where the software project is a third-party library, analyzing the source code or binary package of a release version of the library by a static analysis method to obtain the third-party library APIs provided in that release version;
2) where the software project is a third-party library, analyzing the APIs of two adjacent release versions by a code difference analysis method to determine the APIs first introduced and the APIs deprecated in a release version, the APIs that replace deprecated APIs, and the predecessor-successor relationships of changed APIs, this knowledge describing the evolution of the third-party library APIs;
3) where the software project is a third-party library, building the call graph of the APIs in a release version by a program call graph analysis method and extracting the call relationships among the APIs;
4) where the software project is not a third-party library, statically analyzing the source code or binary package of the software project release version to obtain the third-party library APIs that the release version calls.
4. The software ecosystem-oriented method for constructing a knowledge graph of software projects and third-party libraries according to claim 1, wherein the extraction of defect knowledge of software project release versions in step (5) specifically comprises the following substeps:
1) obtaining the software project release versions affected by each defect by crawling and parsing the structured defect data in the defect tracking system of the software project;
2) where the software project is a third-party library, traversing the code commit history of the software project, finding the code commits that contain the ID of the defect and the keyword 'fix', analyzing the third-party library APIs changed in those code commits, and establishing the relationship that the defect affects those third-party library APIs;
3) traversing the code commit history of the software project, finding the code commits that contain the ID of the defect and the keyword 'fix', analyzing the code segments changed in those code commits, and establishing, based on the code clone knowledge of step (4), the relationship that a code clone contains the defect.
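The commit filtering in substeps 2) and 3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the commit history is assumed to be already loaded into hypothetical `Commit` records that list the APIs each commit changed, and the 'fix' keyword is interpreted here as a case-insensitive match on "fix", "fixes" or "fixed".

```python
import re
from dataclasses import dataclass, field


@dataclass
class Commit:
    sha: str
    message: str
    changed_apis: list = field(default_factory=list)  # hypothetical pre-computed data


# Assumption: the 'fix' keyword is matched as a whole word, ignoring case.
FIX_WORD = re.compile(r"\bfix(?:es|ed)?\b", re.IGNORECASE)


def is_fix_commit(message, defect_id):
    """True if the message mentions the defect ID and a 'fix' keyword."""
    # Word-character guards keep "LIB-12" from matching inside "LIB-123".
    id_pattern = re.compile(r"(?<!\w)" + re.escape(defect_id) + r"(?!\w)")
    return bool(id_pattern.search(message)) and bool(FIX_WORD.search(message))


def defect_affects_apis(commits, defect_id):
    """Collect the APIs changed by all fix commits of one defect."""
    affected = set()
    for commit in commits:
        if is_fix_commit(commit.message, defect_id):
            affected.update(commit.changed_apis)
    return sorted(affected)
```

The returned API list corresponds to the "defect affects third-party library API" relationship; the same filter could feed changed code segments into the clone lookup of substep 3).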
CN202010077130.0A 2020-01-23 2020-01-23 Software project and third-party library knowledge graph construction method for software system Pending CN111241307A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010077130.0A CN111241307A (en) 2020-01-23 2020-01-23 Software project and third-party library knowledge graph construction method for software system

Publications (1)

Publication Number Publication Date
CN111241307A true CN111241307A (en) 2020-06-05

Family

ID=70872958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010077130.0A Pending CN111241307A (en) 2020-01-23 2020-01-23 Software project and third-party library knowledge graph construction method for software system

Country Status (1)

Country Link
CN (1) CN111241307A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8635204B1 (en) * 2010-07-30 2014-01-21 Accenture Global Services Limited Mining application repositories
CN106462703A (en) * 2014-05-22 2017-02-22 软件营地株式会社 System and method for analyzing patch file
CN107608732A (en) * 2017-09-13 2018-01-19 扬州大学 A bug search and localization method based on a bug knowledge graph
CN108196880A (en) * 2017-12-11 2018-06-22 北京大学 Method and system for automatically constructing a software project knowledge graph
CN108121829A (en) * 2018-01-12 2018-06-05 扬州大学 Automated construction method for a software-defect-oriented domain knowledge graph
CN108959433A (en) * 2018-06-11 2018-12-07 北京大学 A method and system for extracting a knowledge graph from software project data and answering questions
CN109739994A (en) * 2018-12-14 2019-05-10 复旦大学 A construction method of API knowledge graph based on reference documents
CN110134613A (en) * 2019-05-22 2019-08-16 北京航空航天大学 A Software Defect Data Acquisition System Based on Code Semantics and Background Information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIN Z: "Improving software text retrieval using conceptual knowledge in source code" *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881300A (en) * 2020-07-03 2020-11-03 扬州大学 Third-party library dependency-oriented knowledge graph construction method and system
CN111949800A (en) * 2020-07-06 2020-11-17 北京大学 A method and system for establishing a knowledge graph of an open source project
CN111949307A (en) * 2020-07-06 2020-11-17 北京大学 An optimization method and system for open source project knowledge graph
CN111949307B (en) * 2020-07-06 2021-06-25 北京大学 Optimization method and system of open source project knowledge graph
CN112084309A (en) * 2020-09-17 2020-12-15 北京中科微澜科技有限公司 License selection method and system based on open source software map
CN112084309B (en) * 2020-09-17 2024-06-04 北京中科微澜科技有限公司 License selection method and system based on open source software map
CN112416367A (en) * 2020-11-19 2021-02-26 云南电网有限责任公司信息中心 Application resource change influence analysis system based on software reverse disassembly and analysis
CN113011461B (en) * 2021-02-19 2022-08-05 中国科学院软件研究所 Software demand tracking link recovery method and electronic device based on classification and enhanced through knowledge learning
CN113011461A (en) * 2021-02-19 2021-06-22 中国科学院软件研究所 Software demand tracking link recovery method and electronic device based on classification enhanced through knowledge learning
CN113139192A (en) * 2021-04-09 2021-07-20 扬州大学 Third-party library security risk analysis method and system based on knowledge graph
CN113139192B (en) * 2021-04-09 2024-04-19 扬州大学 Third party library security risk analysis method and system based on knowledge graph
CN113342331A (en) * 2021-05-21 2021-09-03 武汉大学 Evolution analysis method of ecology-oriented software service system
CN113342331B (en) * 2021-05-21 2023-10-03 武汉大学 An ecologically oriented software service system evolution analysis method
CN113986340A (en) * 2021-10-11 2022-01-28 复旦大学 A specific project software code knowledge management platform and its construction method
CN114138330A (en) * 2021-12-07 2022-03-04 中国人民解放军国防科技大学 Code clone detection optimization method and device based on knowledge graph and electronic equipment
CN114138330B (en) * 2021-12-07 2024-07-26 中国人民解放军国防科技大学 Knowledge graph-based code clone detection optimization method and device and electronic equipment
CN114461484A (en) * 2021-12-20 2022-05-10 奇安盘古(上海)信息技术有限公司 Method, apparatus, device, medium and program for determining association of application

Similar Documents

Publication Publication Date Title
CN111241307A (en) Software project and third-party library knowledge graph construction method for software system
US11714611B2 (en) Library suggestion engine
Luan et al. Aroma: Code recommendation via structural code search
CN106843840B (en) Source code version evolution annotation multiplexing method based on similarity analysis
Bernardi et al. Design pattern detection using a DSL‐driven graph matching approach
CN113139192B (en) Third party library security risk analysis method and system based on knowledge graph
CN106874764A (en) A kind of method that Android application readjustment sequences are automatically generated based on call back function modeling
Nam et al. Marble: Mining for boilerplate code to identify API usability problems
JP2020126641A (en) API mashup exploration and recommendations
Lee et al. Automatic detection and update suggestion for outdated API names in documentation
Ma et al. Impact analysis of cross-project bugs on software ecosystems
CN117195233A (en) Bill of materials SBOM+ analysis method and device for open source software supply chain
Sudhamani et al. Code similarity detection through control statement and program features
CN117725592A (en) A smart contract vulnerability detection method based on directed graph attention network
US12197586B2 (en) Systems and processes for facilitating edits to software bill of materials
Van Hattem Mastering Python
Jansen et al. Searchseco: A worldwide index of the open source software ecosystem
Tinnes et al. Learning domain-specific edit operations from model repositories with frequent subgraph mining
CN117389518A (en) Fine-grained software supply chain construction method for Python open source ecology
CN113126998A (en) Incremental source code acquisition method and device, electronic equipment and storage medium
US20240241821A1 (en) Second party software components discovery
US20060048094A1 (en) Systems and methods for decoupling inputs and outputs in a workflow process
US20230367881A1 (en) Systems and processes for creating software bill of materials for large distributed builds
CN113032779B (en) Multi-behavior joint matching method and device based on behavior parameter Boolean expression rule
CN116401145A (en) Method and device for static analysis and processing of source code

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200605