CN114186974B

CN114186974B - A multi-model fusion development task association method, device, equipment and medium

Info

Publication number: CN114186974B
Application number: CN202111542359.8A
Authority: CN
Inventors: 张洋; 蔡孟栾; 王涛; 王怀民; 吴逸文; 陈婷婷; 邬小军
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2021-12-13
Filing date: 2021-12-13
Publication date: 2024-12-06
Anticipated expiration: 2041-12-13
Also published as: CN114186974A

Abstract

The present invention discloses a multi-model fusion development task association method. An active open source project set is constructed in a collaborative development community according to preset indicators; in the active open source project set, development task report data of all projects are collected by using an API to construct an alternative task report database; in the alternative task report database, URL link information in all task reports is extracted by using a regular expression to generate a task report data set; a query task data set and a candidate task data set are constructed in the task report data set to obtain similarity scores between the query task and each candidate task; the similarity scores between the query task and each candidate task are weighted and summed to obtain a final similarity score between each pair of task reports; and a development task association model based on multi-model fusion is constructed according to the final similarity score to generate a task report association tool.

Description

Development task association method, device, equipment and medium for multi-model fusion

Technical Field

The application relates to the field of software development, in particular to a multi-model fusion development task association method, device, equipment and medium.

Background

Social programming (social coding) was first proposed by the open source community GitHub. It aims to provide a developer-friendly software development environment that helps developers connect, collaborate and develop efficiently. Social programming has greatly improved code reuse and the efficiency of resolving development tasks. Developers can report and discuss tasks on their own initiative, so task reports, an important class of software development knowledge, are often submitted by different developers at different times. In practice, two task reports often contain related information, and developers can link the related task reports together through URL links during the task discussion. In a software project, finding and associating related task reports gives developers more resources and information for solving the target task, thereby improving task resolution efficiency.

Currently, in collaborative development communities such as GitHub, associating task reports relies primarily on links added manually by developers. In practice, however, this linking process requires a great deal of time and labor. Especially in large-scale software projects, developers may need to search large amounts of historical task data to locate related tasks through their textual descriptions, and such manual association methods depend primarily on the experience and knowledge of individual developers. Therefore, how to realize automatic development task association is a technical problem to be solved.

The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.

Disclosure of Invention

The invention mainly aims to provide a multi-model fusion development task association method, device, equipment and medium, and aims to solve the technical problem that the prior art cannot realize automatic development task association.

In order to achieve the above object, the present invention provides a development task association method for multi-model fusion, the method comprising:

Constructing an active open source project set in the collaborative development community according to a preset index;

Collecting development task report data of all projects by using an API in the active open source project set to construct an alternative task report database;

Extracting URL link information in all task reports by using regular expressions in the alternative task report database to generate a task report data set;

Constructing a query task data set and a candidate task data set in the task report data set, and respectively utilizing a structural data analysis model, a text semantic representation model and a historical relevance model to obtain a similarity score between the query task and each candidate task;

and carrying out weighted summation on the similarity scores between the query task and each candidate task, obtaining a final similarity score between each task report, and constructing a development task association model based on multi-model fusion according to the final similarity score to generate a task report association tool.

Optionally, the step of constructing the active open source project set in the collaborative development community according to the preset index includes:

In the collaborative development community GitHub, basic information data of projects is collected by using an API, and popular open source projects are screened out according to the Star, Fork, Delete and Creation time indexes;

and constructing the active open source project set from the screened popular open source projects.

Optionally, the step of collecting development task report data of all projects by using an API in the active open source project set to construct an alternative task report database includes:

Collecting task report data of all projects in the active open source project set by utilizing the Issue API and the Pull Request (PR) API of GitHub, wherein the specific data collection content comprises task ID, task processing state, submitter, task title, task description, task comment, submission time, category, label, milestone and the like;

And constructing an alternative task report database according to the collected report data.

Optionally, the step of extracting URL link information in all task reports by using regular expressions in the alternative task report database to generate a task report data set includes:

extracting URL link information in all task reports by using a regular expression in the alternative task report database;

Checking the URL link information in the task reports by utilizing the cross-referenced API of GitHub, screening out the actual task report association links, and constructing an association information reference library according to the task report association links;

and removing the task report data which does not contain the link information from the associated information reference library to form a final task report data set.

Optionally, the step of constructing a query task data set and a candidate task data set in the task report data set, and obtaining the similarity score between the query task and each candidate task by using the structural data analysis model, the text semantic representation model and the historical relevance model, respectively, further includes:

Extracting text data in each task report data in the task report data set, wherein the text data comprises a task report title, a description and a comment;

deleting stop words, numbers, punctuation marks and other non-alphabetic characters in the text data;

the remaining words are converted to root form using the Snowball Stemmer technique in NLTK to reduce feature dimensions and unify similar words into a common representation to obtain pre-processed task report data.

Optionally, the step of constructing a query task data set and a candidate task data set in the task report data set, and obtaining a similarity score between the query task and each candidate task by using a structural data analysis model, a text semantic representation model and a historical relevance model, respectively, includes:

Selecting the latest 40% of samples as the query task data set according to the creation time of the task reports in the task report data set, and taking all the task report data as the candidate task data set;

Calculating a structural information similarity Score_S between the query task and each candidate task using the structural data analysis model;

Calculating a textual information similarity Score_T between the query task and each candidate task using the text semantic representation model;

Calculating a historical information similarity Score_H between the query task and each candidate task using the historical relevance model.

Optionally, the step of weighting and summing the similarity scores between the query task and each candidate task and obtaining a final similarity score between each task report, and constructing a development task association model based on multi-model fusion according to the similarity scores to generate a task report association tool includes:

Weighting and summing the similarity scores between the query task and each candidate task to obtain a final similarity score, and constructing a development task association model based on multi-model fusion according to the final similarity score;

evaluating the model by using the Top-k recall rate evaluation index and the task report data set;

and selecting an optimal sub-model weight combination according to the evaluation result to form a task report association tool.

In addition, in order to achieve the above object, the present invention also proposes a multi-model fusion development task association apparatus, the apparatus comprising:

The project construction module is used for constructing an active open source project set in the collaborative development community according to preset indexes;

The data construction module is used for collecting development task report data of all projects in the active open source project set by using an API so as to construct an alternative task report database;

the link acquisition module is used for extracting URL link information in all task reports by using a regular expression in the alternative task report database so as to generate a task report data set;

The task calculation module is used for constructing a query task data set and a candidate task data set in the task report data set, and respectively utilizing a structural data analysis model, a text semantic representation model and a historical relevance model to obtain a similarity score between the query task and each candidate task;

and the tool generation module is used for carrying out weighted summation on the similarity scores between the query task and each candidate task, obtaining a final similarity score between each task report, and constructing a development task association model based on multi-model fusion according to the final similarity score so as to generate a task report association tool.

In addition, in order to achieve the aim, the invention also provides a computer device, which comprises a memory, a processor and a multi-model fusion development task association program which is stored on the memory and can run on the processor, wherein the multi-model fusion development task association program is configured to realize the multi-model fusion development task association method.

In addition, in order to achieve the above object, the present invention also proposes a medium on which a multi-model fusion development task association program is stored, which, when executed by a processor, implements the steps of the multi-model fusion development task association method as described above.

In the method of the invention, an active open source project set is constructed in a collaborative development community according to preset indexes; development task report data of all projects in the active open source project set are collected by using an API to construct an alternative task report database; URL link information in all task reports in the alternative task report database is extracted by using a regular expression to generate a task report data set; a query task data set and a candidate task data set are constructed in the task report data set, and a structural data analysis model, a text semantic representation model and a historical relevance model are used respectively to obtain similarity scores between the query task and each candidate task; the similarity scores are weighted and summed to obtain a final similarity score between each pair of task reports; and a development task association model based on multi-model fusion is constructed according to the final similarity scores to generate a task report association tool. By combining deep learning technology, the method recommends related task reports for a new task; by screening out the optimal weights for the weighted summation of the similarity scores and finally constructing the task report association tool, it realizes automatic development task association.

Drawings

FIG. 1 is a schematic structural diagram of a multi-model fusion development task association device in a hardware running environment according to an embodiment of the present invention;

FIG. 2 is a flowchart of a development task association method for multi-model fusion according to a first embodiment of the present invention.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Referring to fig. 1, fig. 1 is a schematic diagram of a development task association device structure of multi-model fusion of a hardware running environment according to an embodiment of the present invention.

As shown in FIG. 1, the multi-model fusion development task association device may include a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.

Those skilled in the art will appreciate that the structure shown in FIG. 1 does not constitute a limitation of a multi-model fusion development task association apparatus, and may include more or fewer components than illustrated, or may combine certain components, or may be a different arrangement of components.

As shown in FIG. 1, the memory 1005, as one type of storage medium, may include an operating system, a data storage module, a network communication module, a user interface module, and a multi-model fusion development task association program.

In the multi-model fusion development task association device shown in fig. 1, the network interface 1004 is mainly used for carrying out data communication with a network server, the user interface 1003 is mainly used for carrying out data interaction with a user, and the processor 1001 and the memory 1005 in the multi-model fusion development task association device can be arranged in the multi-model fusion development task association device, and the multi-model fusion development task association device invokes a multi-model fusion development task association program stored in the memory 1005 through the processor 1001 and executes the multi-model fusion development task association method provided by the embodiment of the invention.

The embodiment of the invention provides a multi-model fusion development task association method, and referring to fig. 2, fig. 2 is a flow diagram of a first embodiment of the multi-model fusion development task association method.

In this embodiment, the development task association method for multi-model fusion includes the following steps:

Step S10, constructing an active open source project set in the collaborative development community according to a preset index.

In a specific implementation, the GitHub API is used to collect popular open source projects that have at least 10 Stars, have been forked at least once, have not been deleted, are not themselves forks, and were created after 2010 and before 2021. From these, active open source projects are screened out: those that contain at least 100 Issues or Pull Requests, have at least 3 code contributors, and have shown development activity (code commits, development task processing, contribution merging, comment submission and the like) within the last 3 months.
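The screening criteria above can be sketched as two predicates over already-collected project metadata. This is only an illustrative sketch: the dictionary keys (`stars`, `forks`, `deleted`, `is_fork`, `created_at`, `issues`, `pull_requests`, `contributors`, `last_activity`) are hypothetical names, "after 2010 and before 2021" is read as 2010-01-01 up to but not including 2021-01-01, "at least 100 Issues or Pull Requests" is read as a combined count, and "the last 3 months" is approximated as 90 days.

```python
from datetime import datetime, timedelta

def is_popular(p):
    """Popularity screen: Stars, Forks, not deleted, not a fork, creation window."""
    return (
        p["stars"] >= 10
        and p["forks"] >= 1
        and not p["deleted"]
        and not p["is_fork"]
        # Assumed reading of "after 2010 and before 2021".
        and datetime(2010, 1, 1) <= p["created_at"] < datetime(2021, 1, 1)
    )

def is_active(p, now):
    """Activity screen: task volume, contributors, recent development activity."""
    return (
        # Assumed reading: Issues and Pull Requests counted together.
        p["issues"] + p["pull_requests"] >= 100
        and p["contributors"] >= 3
        # "Last 3 months" approximated as 90 days.
        and now - p["last_activity"] <= timedelta(days=90)
    )

def select_active_projects(projects, now):
    """Keep only projects that pass both screens."""
    return [p for p in projects if is_popular(p) and is_active(p, now)]
```

Keeping the two screens separate mirrors the two-stage description (popular projects first, then active ones) and makes each threshold easy to adjust.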

Further, the step of constructing the active open source project set in the collaborative development community according to the preset index comprises: collecting basic information data of projects in the collaborative development community GitHub by using an API; screening out popular open source projects according to the Star, Fork, Delete and Creation time indexes; and constructing the active open source project set from the screened popular open source projects.

Step S20, collecting development task report data of all projects in the active open source project set by using an API to construct an alternative task report database.

In a specific implementation, based on the active open source project set, the Issue report data of all projects is collected by using the Issue API of GitHub; the specific data collection content comprises task ID, task processing state (open, closed), submitter, task title, task description, task comment, submission time, category (0 represents Issue), label, milestone and the like. The PR report data of all projects is collected by using the Pull Request (PR) API of GitHub; the specific data collection content comprises task ID, task processing state (open, closed), submitter, task title, task description, task comment, submission time, category (1 represents PR), label, milestone and the like. The alternative task report database is constructed by combining the collected Issue report data and PR report data of all projects.
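As a rough illustration of the collected fields, the sketch below maps one JSON item, as returned by GitHub's REST issue listing, to a flat task record. It assumes the item has already been fetched and parsed; in that listing, pull requests appear alongside issues and carry a `pull_request` key, which is used here to set the category. The record keys and the function name `to_task_record` are hypothetical, and comment bodies (which need a separate request) are represented only by their count.

```python
def to_task_record(item):
    """Map one parsed item from GitHub's issue listing to a flat task record."""
    milestone = item.get("milestone")
    return {
        "id": item["number"],
        "state": item["state"],                       # "open" or "closed"
        "submitter": item["user"]["login"],
        "title": item["title"],
        "description": item.get("body") or "",       # body may be null
        "created_at": item["created_at"],
        "type": 1 if "pull_request" in item else 0,   # 0 = Issue, 1 = PR
        "labels": [label["name"] for label in item.get("labels", [])],
        "milestone": milestone["title"] if milestone else None,
        "comment_count": item.get("comments", 0),
    }
```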

Further, the step of collecting development task report data of all projects by using an API in the active open source project set to construct an alternative task report database comprises: collecting task report data of all projects in the active open source project set by using the Issue API and the Pull Request (PR) API of GitHub, wherein the specific data collection content comprises task ID, task processing state, submitter, task title, task description, task comment, submission time, category, label, milestone and the like; and constructing the alternative task report database according to the collected report data.

Step S30, extracting URL link information in all task reports by using a regular expression in the alternative task report database to generate a task report data set.

In a specific implementation, for the alternative task report database, URL link information in all task reports is extracted by using a regular expression (github.com/[a-zA-Z0-9-]/issues|pull/[0-9]+); the link information in the task reports is checked by using the cross-referenced API of GitHub, and the actual task report association links are screened out, so as to construct a task association information reference library. In this work, only intra-project links are considered; task links across projects are not considered for now. Task report data that contain no link information are then removed according to the task association information reference library to form the final task report data set.
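A minimal sketch of the link extraction step follows. The pattern printed in the text appears to have lost some quantifiers and grouping during extraction, so the pattern used here is an assumption rather than the original: it matches `github.com/<owner>/<repo>/(issues|pull)/<number>`. The function name `extract_links` and its `owner` and `repo` parameters are hypothetical, and the subsequent check against GitHub's cross-reference data is not shown.

```python
import re

# Assumed pattern: owner and repository segments, then issues|pull, then a number.
LINK_RE = re.compile(
    r"github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)/(issues|pull)/([0-9]+)"
)

def extract_links(text, owner, repo):
    """Return (kind, number) pairs for links that stay inside owner/repo."""
    links = []
    for match in LINK_RE.finditer(text):
        link_owner, link_repo, kind, number = match.groups()
        # Only intra-project links are kept; cross-project links are skipped.
        if (link_owner.lower(), link_repo.lower()) == (owner.lower(), repo.lower()):
            links.append((kind, int(number)))
    return links
```

For example, calling `extract_links(comment_body, "acme", "tool")` on a comment that mentions both an `acme/tool` issue and a pull request in another repository would keep only the former.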

The step of extracting URL link information in all task reports by using a regular expression in the alternative task report database to generate a task report data set comprises: extracting URL link information in all task reports by using a regular expression in the alternative task report database; checking the URL link information in the task reports by using the cross-referenced API of GitHub, screening out the actual task report association links, and constructing an association information reference library according to the task report association links; and removing the task report data that contain no link information according to the association information reference library to form the final task report data set.

Step S40, constructing a query task data set and a candidate task data set in the task report data set, and respectively utilizing a structural data analysis model, a text semantic representation model and a historical relevance model to obtain a similarity score between the query task and each candidate task.

Further, before the step of obtaining the similarity score between the query task and each candidate task by respectively using the structural data analysis model, the text semantic representation model and the historical relevance model, the method further comprises: extracting the text data in each task report in the task report data set, including the title, description and comments of the task report; deleting stop words, numbers, punctuation marks and other non-alphabetic characters from the text data; and converting the remaining words into root form by using the Snowball Stemmer technique in NLTK, so as to reduce feature dimensions and unify similar words into one common representation, thereby obtaining the preprocessed task report data.

In implementations, the Snowball Stemmer technique in NLTK is used to convert the remaining words into their root form to reduce feature dimensions and unify similar words into a common representation.
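The preprocessing steps can be sketched as follows. This is a minimal illustration, not the original implementation: the stop-word list is a tiny placeholder, the tokenizer keeps only ASCII letters, and the helper names `snowball_stem_fn` and `preprocess` are hypothetical. The stemmer is NLTK's `SnowballStemmer("english")`; the sketch falls back to an identity function if NLTK is not installed so that the rest of the pipeline still runs.

```python
import re

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on", "and", "it"}

def snowball_stem_fn():
    """Return NLTK's English Snowball stemmer, or identity if NLTK is absent."""
    try:
        from nltk.stem.snowball import SnowballStemmer
    except ImportError:
        return lambda word: word
    return SnowballStemmer("english").stem

def preprocess(text, stem=None, stop_words=STOP_WORDS):
    """Lowercase, keep alphabetic tokens, drop stop words, stem the rest."""
    stem = stem or snowball_stem_fn()
    # Digits, punctuation and other non-alphabetic characters are discarded here.
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(word) for word in words if word not in stop_words]
```

Passing the stemmer in as a parameter keeps the cleaning logic independent of NLTK, which makes it easy to check the stop-word and character filtering on their own.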

Further, the step of constructing a query task data set and a candidate task data set in the task report data set, and respectively utilizing a structural data analysis model, a text semantic representation model and a historical relevance model to obtain a similarity score between the query task and each candidate task comprises: selecting the latest 40% of samples as the query task data set according to the creation time of the task reports in the task report data set, and taking all the task report data as the candidate task data set; calculating the structural information similarity Score_S between the query task and each candidate task by using the structural data analysis model; calculating the textual information similarity Score_T between the query task and each candidate task by using the text semantic representation model; and calculating the historical information similarity Score_H between the query task and each candidate task by using the historical relevance model.
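The query and candidate split can be sketched as below, assuming each report is a dictionary with a hypothetical `created_at` key holding a sortable value (for example an ISO 8601 timestamp string). The helper `earlier_candidates` applies the constraint, stated in the implementation details, that a candidate must have been created before the query task. The function names and the rounding of the 40% boundary are assumptions.

```python
def split_query_candidates(reports, query_fraction=0.4):
    """Return (queries, candidates): newest fraction as queries, all as candidates."""
    ordered = sorted(reports, key=lambda r: r["created_at"])
    n_query = int(round(len(ordered) * query_fraction))
    queries = ordered[len(ordered) - n_query:]
    return queries, ordered

def earlier_candidates(query, candidates):
    """Keep only candidates created strictly before the query task."""
    return [c for c in candidates if c["created_at"] < query["created_at"]]
```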

In a specific implementation, for each project, the latest 40% of samples by task report creation time are selected as the query task data set, and all task report data form the candidate task data set. The structured data of each task report are extracted and analyzed, and the structural data analysis model is used to calculate the structural information similarity Score_S between the query task and each candidate task (the candidate task must have been created earlier than the query task). The structured data variables of a task report include the task processing state state (Boolean; "0" stands for open, "1" for closed), the submitter submitter (text), the category type (Boolean; "0" stands for Issue, "1" for PR), the label label (text), the milestone milestone (text), the description complexity complexity (numerical; total number of words in the task report title and description), the comment count comment (numerical), and so on. For a given text-type variable X, all values of X in the task report data are collected and de-duplicated to obtain N distinct values, which are then encoded in sequence with the natural numbers {1, ..., N}, establishing a one-to-one mapping between the numerical codes and the text values; for a task report with multiple labels, only the first label is used as the analysis object. A feature vector {state, submitter, type, label, milestone, complexity, comment} characterizing its structural information is then constructed for each task report. Given the structural information feature vectors V_s1 and V_s2 of two task reports, their structural information similarity Score_S is calculated using cosine similarity, as follows:

Score_S = (V_s1 · V_s2) / (‖V_s1‖ ‖V_s2‖)
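The encoding and cosine comparison described for the structural model can be sketched as follows. The record keys match a hypothetical flat task record (`state`, `submitter`, `type`, `labels`, `milestone`, `title`, `description`, `comment_count`); assigning codes in sorted order and using 0 for a missing or unseen value are assumptions, since the text only states that distinct values are numbered 1 to N in sequence.

```python
import math

def build_encoder(values):
    """Map distinct text values to 1..N (sorted order is an assumption)."""
    distinct = sorted({v for v in values if v})
    return {value: code for code, value in enumerate(distinct, start=1)}

def structural_vector(report, encoders):
    """Build {state, submitter, type, label, milestone, complexity, comment}."""
    first_label = report["labels"][0] if report["labels"] else None
    text = report["title"] + " " + report["description"]
    return [
        1 if report["state"] == "closed" else 0,
        encoders["submitter"].get(report["submitter"], 0),  # 0 = unseen (assumed)
        report["type"],
        encoders["label"].get(first_label, 0),               # first label only
        encoders["milestone"].get(report["milestone"], 0),
        len(text.split()),                                   # word count
        report["comment_count"],
    ]

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Note that the natural-number codes carry no ordering meaning, so cosine similarity over them is sensitive to how codes happen to be assigned; the sketch follows the described encoding as written rather than substituting another scheme.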

The text data of each task report are extracted and analyzed, and the text semantic representation model is used to calculate the textual information similarity Score_T between the query task and each candidate task (the candidate task must have been created earlier than the query task). Based on the preprocessed task report text data (task title, task description and task comments), the text similarity between the query task and each candidate task is calculated using the BERT text representation model. The BertClient class in the bert_serving.client library is called to extract features for each sentence in the task report, wherein the feature vector dimension threshold can be set to 100, 200, 500, 1000 and the like. Given the textual information feature vectors V_t1 and V_t2 of two task reports, their textual information similarity Score_T is calculated using cosine similarity, as follows:

Score_T = (V_t1 · V_t2) / (‖V_t1‖ ‖V_t2‖)

According to the submitter information of each task report, all task reports that the submitter has historically submitted or commented on are extracted, and the historical relevance model is used to calculate the historical information similarity Score_H between the query task and each candidate task (the candidate task must have been created earlier than the query task). Specifically, the IDs of the task reports in which the submitter has historically participated (by submitting or commenting) are extracted from the task report data set and arranged in reverse order of submission time, forming a feature vector {id_1, id_2, ..., id_n} that represents the historical information of the submitter of each task report. The feature vector dimension threshold may be set to 100, 200, 500, 1000 and the like; if the submitter has no historical participation information, or the existing dimension is below the threshold, the feature vector is padded with "0". Given the historical information feature vectors V_h1 and V_h2 of two task reports, their historical information similarity Score_H is calculated using cosine similarity, as follows:

Score_H = (V_h1 · V_h2) / (‖V_h1‖ ‖V_h2‖)
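A minimal sketch of the history vector follows, under stated assumptions: each report carries a hypothetical `participants` set (its submitter plus commenters), "historical" is read as "created before the report being described", and vectors longer than the dimension threshold are truncated to the most recent entries. The function names are illustrative only.

```python
import math

def history_vector(submitter, before, reports, dim=100):
    """IDs of reports the submitter took part in, newest first, zero-padded."""
    past = [r for r in reports
            if r["created_at"] < before and submitter in r["participants"]]
    past.sort(key=lambda r: r["created_at"], reverse=True)
    ids = [r["id"] for r in past][:dim]      # truncation is an assumption
    return ids + [0] * (dim - len(ids))      # pad with 0 up to the threshold

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Under this representation a submitter with no history yields an all-zero vector, so Score_H for such a report is 0 with respect to every candidate.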

Step S50, carrying out weighted summation on the similarity scores between the query task and each candidate task, obtaining a final similarity score between each task report, and constructing a development task association model based on multi-model fusion according to the final similarity score to generate a task report association tool.

Further, the step of carrying out weighted summation on the similarity scores between the query task and each candidate task to obtain a final similarity score between each task report, and constructing a development task association model based on multi-model fusion according to the similarity scores to generate a task report association tool comprises: carrying out weighted summation on the similarity scores between the query task and each candidate task to obtain a final similarity score, and constructing the development task association model based on multi-model fusion according to the final similarity score; evaluating the model by using the Top-k recall evaluation index and the task report data set; and selecting the optimal sub-model weight combination according to the evaluation result to form the task report association tool.

In a specific implementation, the three sub-model similarity scores obtained above are weighted and summed to construct the development task association model based on multi-model fusion, the model is evaluated using evaluation indexes, and the optimal sub-model weight combination is selected to form the final task report association tool. The specific steps are as follows. The three similarity scores obtained in step S40 are weighted and summed, with scoring weights A, B and C respectively, to calculate the final similarity Score between each pair of task reports, as shown in the formula:

Score = A·Score_S + B·Score_T + C·Score_H
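The weighted fusion can be sketched as a small ranking helper. The input layout (a dictionary from candidate ID to the triple of sub-model scores) and the function names are hypothetical; the weights correspond to A, B and C in the formula.

```python
def fuse(score_s, score_t, score_h, a, b, c):
    """Final score as the weighted sum of the three sub-model scores."""
    return a * score_s + b * score_t + c * score_h

def rank_candidates(sub_scores, weights, k=10):
    """Return the IDs of the k candidates with the highest fused score."""
    a, b, c = weights
    fused = {cid: fuse(s, t, h, a, b, c) for cid, (s, t, h) in sub_scores.items()}
    return sorted(fused, key=fused.get, reverse=True)[:k]
```

For instance, with weights (0.2, 0.6, 0.2) a candidate whose textual score is high outranks one whose only strong signal is structural.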

The model is evaluated using the Top-k recall (R@k) evaluation index and the task association information reference library, where R@k checks whether the Top-k recommendations are correct. For a task report i to be queried, R@k is calculated by the following formula, where k may take values from 1 to 10 in actual evaluation:
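The R@k formula itself is not reproduced in this text, so the sketch below is one common reading offered only as an assumption: a query counts as a hit at k if at least one of its reference links appears among the top k recommendations, and the scores are averaged over queries that have reference links. The names `recall_at_k`, `mean_recall_at_k`, `rankings` and `benchmark` are hypothetical.

```python
def recall_at_k(ranked_ids, true_links, k):
    """1.0 if any reference link is among the top-k recommendations, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(true_links) else 0.0

def mean_recall_at_k(rankings, benchmark, k):
    """Average hit rate over queries that have at least one reference link."""
    queries = [q for q in rankings if benchmark.get(q)]
    if not queries:
        return 0.0
    hits = sum(recall_at_k(rankings[q], benchmark[q], k) for q in queries)
    return hits / len(queries)
```

A different definition, such as the fraction of a query's reference links recovered in the top k, would change only `recall_at_k`.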

It can be understood that the three scoring weights A, B and C are assigned different values (summing to 1); the association results for the query task reports of all projects are evaluated with the evaluation index described above; the averages of R@1, R@5 and R@10 over all projects are calculated; and the three averages are added together to form the final evaluation index. The optimal sub-model weight combination is selected according to the final evaluation index, and the three sub-models are combined on that basis to form the final task report association tool.
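The weight selection can be sketched as a grid search over all (A, B, C) combinations that sum to 1. The grid step of 0.1 is an assumption, as the text does not state the granularity, and `evaluate` is a hypothetical callable that returns the final evaluation index (the sum of the averaged R@1, R@5 and R@10) for a given weight triple.

```python
def weight_grid(step=0.1):
    """Yield every (A, B, C) on a regular grid with A + B + C = 1."""
    n = round(1 / step)
    for i in range(n + 1):
        for j in range(n + 1 - i):
            yield (i / n, j / n, (n - i - j) / n)

def best_weights(evaluate, step=0.1):
    """Return the weight triple with the highest final evaluation index."""
    return max(weight_grid(step), key=evaluate)
```

Building the triples from integer counts avoids accumulating floating-point drift, so the three weights of every grid point still sum to 1 within rounding error.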

In this embodiment, an active open source project set is constructed in a collaborative development community according to preset indexes; development task report data of all projects in the active open source project set are collected by using an API to construct an alternative task report database; URL link information in all task reports in the alternative task report database is extracted by using a regular expression to generate a task report data set; a query task data set and a candidate task data set are constructed in the task report data set, and a structural data analysis model, a text semantic representation model and a historical relevance model are used respectively to obtain similarity scores between the query task and each candidate task; the similarity scores are weighted and summed to obtain a final similarity score between each pair of task reports; and a development task association model based on multi-model fusion is constructed according to the final similarity scores to generate a task report association tool. By combining deep learning technology, related task reports are recommended for a new task; by screening out the optimal weights for the weighted summation of the similarity scores and finally constructing the task report association tool, automatic development task association is realized.

In addition, the embodiment of the invention also provides a medium, wherein the medium is stored with a multi-model fusion development task association program, and the multi-model fusion development task association program realizes the steps of the multi-model fusion development task association method when being executed by a processor.

The embodiments or specific implementation manners of the multi-model fusion development task association device of the present invention may refer to the above method embodiments, and are not described herein again.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. read-only memory/random-access memory, magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.

The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the scope of the invention. Any equivalent structure or equivalent process based on the disclosure herein, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present invention.

Claims

1. A multi-model fusion development task association method, the method comprising:

weighting and summing the similarity scores between the query task and each candidate task to obtain a final similarity score between each task report, and constructing a development task association model based on multi-model fusion according to the final similarity score to generate a task report association tool;

the step of constructing a query task data set and a candidate task data set in the task report data set, and respectively utilizing a structural data analysis model, a text semantic representation model and a historical relevance model to calculate a similarity score between the query task and each candidate task, comprises the following steps:

calculating a structural information similarity Score_S between the query task and each candidate task using a structural data parsing model;

calculating a text information similarity Score_T between the query task and each candidate task using a text semantic representation model;

calculating a historical information similarity Score_H between the query task and each candidate task using a historical relevance model.
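For illustration only, the fusion of the three scores into a ranked recommendation list for one query task might look like the sketch below. The CandidateScores type, the rank_candidates function, the top_k parameter and the example weights are hypothetical; the sketch assumes the three sub-model scores have already been computed and lie on comparable scales.

```python
from dataclasses import dataclass

@dataclass
class CandidateScores:
    report_id: int
    score_s: float  # structural information similarity (Score_S)
    score_t: float  # text information similarity (Score_T)
    score_h: float  # historical information similarity (Score_H)

def rank_candidates(candidates, a, b, c, top_k=10):
    """Return the top_k (report_id, final score) pairs, highest score first."""
    fused = [
        (cand.report_id, a * cand.score_s + b * cand.score_t + c * cand.score_h)
        for cand in candidates
    ]
    fused.sort(key=lambda pair: pair[1], reverse=True)
    return fused[:top_k]
```

For example, with weights 0.2, 0.5 and 0.3, a candidate with scores (0.6, 0.4, 1.0) would rank above a candidate with scores (0.2, 0.8, 0.0), because its high historical similarity outweighs its lower text similarity.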

2. The method of claim 1, wherein the step of constructing the active open source project set in the collaborative development community according to the preset index comprises:

3. The method of claim 1, wherein the step of collecting development task report data for all projects using an API in the active open source project set to build an alternative task report database comprises:

collecting task report data of all projects in the active open source project set by utilizing the Issue API and the Pull Request API of GitHub, wherein the collected data specifically comprise task ID, task processing state, submitter, task title, task description, task comments, submission time, category, labels, milestone and the like;
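As a hedged illustration of the collected fields, the sketch below flattens one item returned by GitHub's REST issues listing, together with its comment objects, into a task report record. The function name, the record keys and the mapping of "category" to issue versus pull request are assumptions made for this example; fetching and pagination are omitted.

```python
def to_task_report(item, comments=()):
    """Map one GitHub issue or pull request payload to a flat task report record."""
    milestone = item.get("milestone") or {}
    user = item.get("user") or {}
    return {
        "task_id": item["number"],
        "state": item["state"],
        "submitter": user.get("login"),
        "title": item["title"],
        "description": item.get("body") or "",
        "comments": [comment["body"] for comment in comments],
        "created_at": item["created_at"],
        # Assumption: category distinguishes issues from pull requests; the
        # issues listing marks pull requests with a "pull_request" key.
        "category": "pull_request" if "pull_request" in item else "issue",
        "labels": [label["name"] for label in item.get("labels", [])],
        "milestone": milestone.get("title"),
    }
```

Records in this shape can be stored directly in the alternative task report database before the link extraction step.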

4. The method of claim 1, wherein the step of extracting URL link information in all task reports using regular expressions in the alternative task report database to generate a task report data set comprises:

5. The method of claim 1, wherein the step of constructing a query task data set and a candidate task data set in the task report data set, and using a structural data parsing model, a text semantic representation model, and a historical relevance model, respectively, to calculate a similarity score between the query task and each candidate task, further comprises:

6. The method of claim 1, wherein the step of weighting and summing the similarity scores between the query task and each candidate task and obtaining a final similarity score between each task report, and constructing a multi-model fusion-based development task association model based on the similarity scores to generate task report association tools comprises:

7. A multi-model fusion development task association apparatus for implementing the multi-model fusion development task association method of any of claims 1 to 6, the apparatus comprising:

8. A multi-model fusion development task association apparatus comprising a memory, a processor and a multi-model fusion development task association program stored on the memory and executable on the processor, the multi-model fusion development task association program being configured to implement the steps of the multi-model fusion development task association method of any of claims 1 to 6.

9. A medium, wherein a multi-model fusion development task association program is stored on the medium, and the multi-model fusion development task association program, when executed by a processor, implements the steps of the multi-model fusion development task association method according to any of claims 1 to 6.