
CN104809139B - Code file querying method and device - Google Patents


Info

Publication number
CN104809139B
Authority
CN
China
Prior art keywords
code
matrix
vector
code file
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410042833.4A
Other languages
Chinese (zh)
Other versions
CN104809139A (en)
Inventor
刘博�
邬亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp filed Critical NEC Corp
Priority to CN201410042833.4A priority Critical patent/CN104809139B/en
Publication of CN104809139A publication Critical patent/CN104809139A/en
Application granted granted Critical
Publication of CN104809139B publication Critical patent/CN104809139B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a code file query method and device, belonging to the technical field of data processing. The method includes: converting the current query word vector and the code feature vector of each code file based on a conversion matrix constructed from historical query word vectors and the text feature vector and code feature vector of each code file, where the code feature vector is determined according to the code blocks of the code file; and calculating the similarity between each code file and the query word according to the current query word vector, the converted query word vector, the converted code feature vector of each code file and the text feature vector of each code file, to obtain a code file query result. The invention takes into account the influence of a code file's structure on its semantics and exploits both the content and the structure of the code file, which improves the accuracy of feature extraction; during a query, similarity is calculated from both the text feature vector and the code feature vector, which improves query precision.

Description

Code file query method and device
Technical Field
The invention relates to the technical field of data processing, in particular to a code file query method and a code file query device.
Background
In order to maintain the software system, code files related to query terms need to be queried from the code library according to some given query terms, so that targeted maintenance is performed.
To this end, Giuliano Antoniol, Gerardo Canfora, Gerardo Casazza, Andrea De Lucia and Ettore Merlo, in the paper "Recovering Traceability Links between Code and Documentation", published in IEEE Transactions on Software Engineering, vol. 28, no. 10, October 2002, proposed a method for querying code files, which specifically includes the following steps: extracting text features of each code file from the text information included in the code files in the code library; and, for given query words, calculating the similarity between the extracted text features of each code file and the query words, and outputting the code file whose text features have the greatest similarity as the query result for the query words.
However, even if two code files contain the same text information, their semantics differ if their structures differ. Because the above method extracts text features only from the text information included in a code file, its feature extraction is inaccurate, which lowers query precision.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present invention provide a code file query method and apparatus. The technical scheme is as follows:
in one aspect, a method for querying a code file is provided, where the method includes:
converting the current query word vector and the code characteristic vector of each code file based on a historical query word vector applied in a historical query process and a conversion matrix constructed by the text characteristic vector and the code characteristic vector of each code file in a code library to obtain a converted query word vector and a converted code characteristic vector of each code file, wherein the code characteristic vector of each code file is determined according to code blocks in the code files;
calculating the similarity between each code file and the query word according to the current query word vector, the converted code feature vector of each code file and the text feature vector of each code file;
and obtaining a code file query result according to the similarity between each code file and the query word.
Optionally, converting the current query word vector and the code feature vector of each code file based on a conversion matrix constructed from the historical query word vectors applied in the historical query process and the text feature vector and code feature vector of each code file, to obtain a converted query word vector and a converted code feature vector of each code file, includes:
calculating the product of the transpose matrix of the first conversion matrix and the query word vector to obtain a converted query word vector, wherein the conversion matrix comprises the first conversion matrix and a second conversion matrix;
forming a code feature matrix by the code feature vectors of each code file;
calculating the product of the transpose matrix of the code characteristic matrix and the second conversion matrix to obtain a converted code characteristic matrix;
and extracting each vector in the transpose matrix of the converted code characteristic matrix as a converted code characteristic vector of the corresponding code file.
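The conversion steps above can be sketched as follows. This is a minimal illustration in plain Python, assuming that the query word vector has as many entries as the first conversion matrix U has rows, and that the code feature matrix stores one code file per column. The helper names (`transpose`, `matmul`, `convert`) and the toy values are hypothetical and do not come from the patent.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def convert(query_vec, code_vecs, U, V):
    """Map a query word vector and per-file code feature vectors into one space.

    query_vec: list of length d_t (text dimension)
    code_vecs: list of m code feature vectors, each of length d_c
    U: d_t x k first conversion matrix; V: d_c x k second conversion matrix
    """
    # Converted query word vector: U^T q
    q_col = [[x] for x in query_vec]
    converted_query = [row[0] for row in matmul(transpose(U), q_col)]

    # Code feature matrix Y with one code file per column (d_c x m)
    Y = transpose(code_vecs)
    # Converted code feature matrix: Y^T V (m x k)
    converted_matrix = matmul(transpose(Y), V)
    # Each column of (Y^T V)^T is the converted vector of one code file
    transposed = transpose(converted_matrix)
    converted_code_vecs = [list(col) for col in zip(*transposed)]
    return converted_query, converted_code_vecs
```

With U = [[1, 0], [0, 1], [1, 1]], V = [[1, 0], [0, 2]], the query vector [1, 2, 3] and the code feature vectors [1, 2] and [3, 4], the sketch yields the converted query vector [4, 5] and the converted code feature vectors [1, 4] and [3, 8], all of which have the same dimension k = 2.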
Optionally, calculating the similarity between each code file and the query word according to the current query word vector, the converted code feature vector of each code file, and the text feature vector of each code file includes:
calculating a first similarity between the text feature vector of each code file and the query word vector;
calculating a second similarity between the converted code feature vector of each code file and the converted query word vector;
and carrying out weighted summation on the first similarity and the second similarity of each code file to obtain the similarity between each code file and the query word.
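The weighted summation above can be sketched as follows. This is only an illustration: the use of the cosine measure for both similarities, the single weight `alpha` and the complementary weight `1 - alpha` are assumptions, since the text does not fix the weights.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def combined_similarity(query, text_vec, conv_query, conv_code_vec, alpha=0.5):
    """Weighted sum of the text-level and structure-level similarities.

    alpha is a hypothetical weight in [0, 1]; the source does not fix its value.
    """
    first = cosine(text_vec, query)             # first similarity (text content)
    second = cosine(conv_code_vec, conv_query)  # second similarity (structure)
    return alpha * first + (1 - alpha) * second
```

Setting `alpha` closer to 1 makes the result depend mostly on the text content, while a value closer to 0 emphasizes the structure of the code file.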
Optionally, before converting the current query word vector and the code feature vector of each code file based on the historical query word vector applied in the historical query process and a conversion matrix constructed by the text feature vector and the code feature vector of each code file in the code library to obtain the converted query word vector and the converted code feature vector of each code file, the method further includes:
for each code file, acquiring a text feature vector of the code file according to the natural language description, the annotation and the variable name of the code file;
judging whether the code file comprises a code block of which the occurrence frequency is greater than a preset threshold value;
when the code file comprises code blocks with the occurrence times larger than the preset threshold, extracting the code blocks with the occurrence times larger than the preset threshold;
and combining the extracted occurrence times of different code blocks into a code feature vector of the code file.
Optionally, before vector conversion is performed on the current query word vector and the code feature vector of each code file based on a conversion matrix constructed by a historical query word vector applied in a historical query process, the text feature vector of each code file, and the code feature vector of each code file, and a converted query word vector and a converted code feature vector of each code file are obtained, the method further includes:
obtaining each historical query word vector applied in the historical query process to form a sample query matrix;
forming a sample text feature matrix by using the text feature vectors of each code file in the code base;
forming a sample code feature matrix from the code feature vectors of each code file in the code base;
determining, according to the sample query matrix, the sample text feature matrix and the sample code feature matrix, an objective function that takes a first conversion matrix and a second conversion matrix as independent variables;
and minimizing the objective function to obtain the solution at which the objective function attains its minimum value.
Optionally, the objective function is:
wherein … and … are regularization parameters; U is the first conversion matrix; V is the second conversion matrix; ε1(U, V) represents the distance between the converted text feature vector and code feature vector of the same code file or of similar code files; ε2(U, V) represents the difference between the distance between the converted text feature vectors and code feature vectors of similar code files and the distance between the text feature vectors and code feature vectors of dissimilar code files; g(U, V) represents the sum of the distance between text feature vectors, the distance between code feature vectors, the distance between text feature vectors and code feature vectors, the distance between the historical query word vectors and the text feature vector of each code file, the distance between the historical query word vectors and the code feature vector of each code file, and the distance between similar historical query word vectors; c(U, V) represents the distance between any converted character segment and the code feature vector of the code file containing that character segment; and r(U, V) controls the complexity of the first conversion matrix U and the second conversion matrix V.
Alternatively,
wherein m is the number of code files, the ith code file and the jth code file are similar code files, the ith code file and the ith code file are dissimilar code files, n is the number of historical query words applied in the historical query process, X is the sample text feature matrix, Y is the sample code feature matrix, Q is the sample query matrix, W and R are similarity matrices between the sample text feature matrix and the sample code feature matrix, and L is the normalized Laplacian matrix of the matrix W.
In another aspect, an apparatus for querying a code file is provided, the apparatus including:
the conversion module is used for converting the current query word vector and the code characteristic vector of each code file based on a historical query word vector applied in the historical query process and a conversion matrix constructed by the text characteristic vector and the code characteristic vector of each code file in a code library to obtain the converted query word vector and the converted code characteristic vector of each code file, and the code characteristic vector of each code file is determined according to a code block in the code file;
the similarity calculation module is used for calculating the similarity between each code file and the query word according to the current query word vector, the converted code feature vector of each code file and the text feature vector of each code file;
and the result output module is used for obtaining the code file query result according to the similarity between each code file and the query word.
Optionally, the conversion module comprises:
the first conversion unit is used for calculating the product of a transpose matrix of a first conversion matrix and the query word vector to obtain a converted query word vector, wherein the conversion matrix comprises the first conversion matrix and a second conversion matrix;
the code characteristic forming unit is used for forming a code characteristic matrix from the code characteristic vectors of each code file;
the second conversion unit is used for calculating the product of the transposed matrix of the code characteristic matrix and the second conversion matrix to obtain a converted code characteristic matrix;
and the code characteristic vector extraction unit is used for extracting each vector in the transpose matrix of the converted code characteristic matrix as a converted code characteristic vector corresponding to the code file.
Optionally, the similarity calculation module includes:
the first calculation unit is used for calculating a first similarity between the text feature vector of each code file and the query word vector;
the second calculation unit is used for calculating a second similarity between the converted code feature vector of each code file and the converted query word vector;
and the weighted summation unit is used for carrying out weighted summation on the first similarity and the second similarity of each code file to obtain the similarity between each code file and the query word.
Optionally, the apparatus further comprises:
the text characteristic vector acquisition module is used for acquiring a text characteristic vector of each code file according to the natural language description, the annotation and the variable name of the code file;
the occurrence frequency judging module is used for judging whether the code file comprises a code block of which the occurrence frequency is greater than a preset threshold value;
the code block extraction module is used for extracting the code blocks of which the occurrence times are greater than the preset threshold when the code blocks of which the occurrence times are greater than the preset threshold are included in the code file;
and the code characteristic vector composition module is used for composing the extracted occurrence times of different code blocks into the code characteristic vector of the code file.
Optionally, the apparatus further comprises:
the first matrix composition module is used for obtaining each historical query word vector applied in the historical query process to form a sample query matrix;
the second matrix composition module is used for composing the text characteristic vector of each code file in the code base into a sample text characteristic matrix;
the third matrix composition module is used for composing the code characteristic vectors of each code file in the code base into a sample code characteristic matrix;
the objective function determining module is used for determining an objective function with a first conversion matrix and a second conversion matrix as independent variables according to the sample query matrix, the sample text characteristic matrix and the sample code characteristic matrix;
and the objective function solving module is used for solving the minimum value of the objective function and obtaining a corresponding solution when the objective function is the minimum value.
Optionally, the objective function is:
wherein … and … are regularization parameters; U is the first conversion matrix; V is the second conversion matrix; ε1(U, V) represents the distance between the converted text feature vector and code feature vector of the same code file or of similar code files; ε2(U, V) represents the difference between the distance between the converted text feature vectors and code feature vectors of similar code files and the distance between the text feature vectors and code feature vectors of dissimilar code files; g(U, V) represents the sum of the distance between text feature vectors, the distance between code feature vectors, the distance between text feature vectors and code feature vectors, the distance between the historical query word vectors and the text feature vector of each code file, the distance between the historical query word vectors and the code feature vector of each code file, and the distance between similar historical query word vectors; c(U, V) represents the distance between any converted character segment and the code feature vector of the code file containing that character segment; and r(U, V) controls the complexity of the first conversion matrix U and the second conversion matrix V.
Alternatively,
wherein m is the number of code files, the ith code file and the jth code file are similar code files, the ith code file and the ith code file are dissimilar code files, n is the number of historical query words applied in the historical query process, X is the sample text feature matrix, Y is the sample code feature matrix, Q is the sample query matrix, W and R are similarity matrices between the sample text feature matrix and the sample code feature matrix, and L is the normalized Laplacian matrix of the matrix W.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
The method and device provided by the embodiments of the invention take into account the influence of a code file's structure on its semantics. By acquiring both the text feature vector and the code feature vector of each code file, they exploit both the content and the structure of the code file, which improves the accuracy of feature extraction. During a code file query, the similarity between each code file and the query word is calculated from both the text feature vector and the code feature vector to obtain the code file query result, which improves query precision.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a code file query method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of code provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram illustrating similarity calculation between a code file and a query term according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a code file querying device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a code file query method provided in an embodiment of the present invention, and referring to fig. 1, the method includes:
101. Converting the query word vector and the code feature vector of each code file, based on a conversion matrix constructed from the historical query word vectors applied in the historical query process and the text feature vector and code feature vector of each code file in the code library, to obtain a converted query word vector and a converted code feature vector of each code file, where the code feature vector of each code file is determined according to the code blocks in that code file.
In terms of text content, a code file includes several kinds of text, such as natural language descriptions, comments and variable names; in terms of structure, it includes a plurality of code blocks, each of which is itself composed of such text content.
First, a historical query word vector is a vector constructed from the query words applied in a historical query process. Specifically, it may be composed of the feature values of the text content in the query words. A query word vector describes the text content of the query words.
Second, the text feature vector of a code file describes the text content of the code file. The text content is quantized to obtain its feature values, and the text feature vector is obtained from these feature values. The similarity between the text feature vectors of different code files measures how similar the text contents of those code files are.
Third, the code feature vector of a code file describes the structure of the code file. The code blocks are quantized to obtain their feature values, and the code feature vector is obtained from these feature values. The similarity between the code feature vectors of different code files measures how similar the structures of those code files are.
In the embodiment of the invention, the query word vector and the text feature vector are both composed of feature values of text content, so the two are comparable. For each code file query, a query word vector is obtained from the query words, the similarity between the query word vector and the text feature vector of any code file is calculated, and this similarity measures how similar the text contents of the query words and the code file are.
However, the query word vector is determined from the character segments in the query words, whereas the code feature vector is determined from the code blocks in the code file, so the two are not directly comparable. To make them comparable, the query word vector and the code feature vector are converted according to the constructed conversion matrix, so that both are mapped into the same semantic space. In that semantic space, the similarity between the converted query word vector and the converted code feature vector is calculated, and this similarity measures the structural similarity between the query words and the code file.
102. Calculating the similarity between each code file and the query word according to the current query word vector, the converted code feature vector of each code file and the text feature vector of each code file.
In the embodiment of the present invention, the similarity in text content between the query word and each code file can be obtained from the current query word vector and the text feature vector of each code file. After conversion, the converted query word vector and the converted code feature vectors lie in the same semantic space, so the similarity between them can be calculated to obtain the structural similarity between the query word and each code file.
When calculating the similarity between each code file and the query word, the cosine distance between the text feature vector or code feature vector of each code file and the query word vector may be calculated and used as the similarity; alternatively, the Euclidean distance between the text feature vector or code feature vector of each code file and the query word vector may be calculated and used as the similarity. The embodiments of the present invention do not limit this.
103. Obtaining a code file query result according to the similarity between each code file and the query word.
In the embodiment of the invention, the code files similar to the query word are obtained as the code file query result according to the similarity between each code file and the query word.
Preferably, to keep the code file query result concise, a preset number of code files may be selected and output as the code file query result. The selection differs according to how the similarity is calculated.
When the cosine distance between two vectors is used as the similarity between a code file and the query word, the similarities of all code files are sorted, and a preset number of code files are selected in descending order of similarity as the code file query result for the query word.
When the Euclidean distance between two vectors is used as the similarity between a code file and the query word, the similarities of all code files are sorted, and a preset number of code files are selected in ascending order of similarity as the code file query result for the query word.
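The two selection rules can be sketched as follows. This is a minimal illustration, assuming each code file is represented by a single feature vector of the same dimension as the query word vector; the function `top_files`, the `measure` switch and the file names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine measure between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_files(query_vec, file_vecs, count, measure="cosine"):
    """Return the names of the `count` best-matching code files.

    file_vecs maps a (hypothetical) file name to its feature vector.
    """
    if measure == "cosine":
        scored = [(cosine(v, query_vec), name) for name, v in file_vecs.items()]
        scored.sort(key=lambda s: s[0], reverse=True)  # large to small
    else:
        scored = [(euclidean(v, query_vec), name) for name, v in file_vecs.items()]
        scored.sort(key=lambda s: s[0])  # small to large
    return [name for _, name in scored[:count]]
```

For the query vector [1, 0] and the files a.java = [1, 0.1], b.java = [5, 0] and c.java = [0, 1], selecting two files with the cosine measure returns b.java and a.java, while the Euclidean measure returns a.java and c.java, which shows that the selected files can differ with the measure.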
Further, when the code file query result is output, the code files similar to the query word are displayed. Specifically, the names of the code files may be displayed in name order, or the storage paths of the code files may be displayed in order of creation time, and so on; the embodiment of the present invention does not limit this.
The method provided by the embodiment of the invention takes into account the influence of a code file's structure on its semantics. By acquiring both the text feature vector and the code feature vector of each code file, it exploits both the content and the structure of the code file, which improves the accuracy of feature extraction. During a code file query, the similarity between each code file and the query word is calculated from both the text feature vector and the code feature vector to obtain the code file query result, which improves query precision.
Optionally, on the basis of the technical solution of the embodiment shown in fig. 1, before step 101 "converting the query word vector and the code feature vector of each code file in the code library based on the historical query word vector applied in the historical query process and a conversion matrix constructed by the text feature vector and the code feature vector of each code file to obtain a converted query word vector and a converted code feature vector of each code file, the code feature vector of the code file being determined according to a code block in the code file", the method further includes the following steps 100A to 100F:
100A, for each code file, acquiring text characteristic vectors of the code file according to the natural language description, the annotation and the variable name of the code file, and forming a sample text characteristic matrix by the acquired text characteristic vectors of different code files.
In the embodiment of the invention, to improve the accuracy of code file queries, the query words applied in each query and the code files in the code library can be used as samples. A machine learning algorithm is trained on these samples to build a mathematical model that captures the association between similar query words and code files, so that in subsequent queries a query word can be input into the model to output code files. Before training, vectors that describe the text content or structure of the query words and code files must be obtained.
Optionally, character segments in the natural language description, the comments and the variable name of the code file are obtained, a feature value of each character segment is obtained according to a preset corresponding relationship between the character segments and the feature values, and the feature values of different character segments form a text feature vector of the code file.
The feature value of the character segment may be a weight of the character segment or a number of occurrences of the character segment in the code file, which is not limited in the embodiment of the present invention.
For example, suppose a code file includes the three character segments "word1", "word2" and "word3", and the preset correspondence between character segments and feature values is as shown in Table 1 below; the text feature vector of the code file is then $[0, 3, 5]^T$.
TABLE 1
Character segment    | word1 | word2 | word3
Characteristic value | 0     | 3     | 5
If the text feature vectors of 3 code files are $[0, 3, 5]^T$, $[1, 1, 2]^T$ and $[0, 2, 1]^T$ respectively, and each column vector of the text feature matrix represents the text feature vector of the corresponding code file, then the sample text feature matrix is $\begin{bmatrix} 0 & 1 & 0 \\ 3 & 1 & 2 \\ 5 & 2 & 1 \end{bmatrix}$.
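The construction in this example can be sketched as follows, using the three text feature vectors given above. The sketch assumes that all code files share one ordered list of character segments and that a segment missing from a file gets the value 0; the function names and the dictionary layout are hypothetical.

```python
def text_feature_vector(segment_values, vocabulary):
    """Order the feature values of one file by a shared list of character segments."""
    # Assumption: a segment absent from the file contributes the value 0
    return [segment_values.get(segment, 0) for segment in vocabulary]


def feature_matrix(vectors):
    """Stack per-file vectors as the columns of a matrix (stored as a list of rows)."""
    return [list(row) for row in zip(*vectors)]


vocabulary = ["word1", "word2", "word3"]
files = [
    {"word1": 0, "word2": 3, "word3": 5},
    {"word1": 1, "word2": 1, "word3": 2},
    {"word1": 0, "word2": 2, "word3": 1},
]
vectors = [text_feature_vector(f, vocabulary) for f in files]
X = feature_matrix(vectors)  # rows: [0, 1, 0], [3, 1, 2], [5, 2, 1]
```

Each column of `X` is the text feature vector of one code file, so the first column is [0, 3, 5].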
100B, judging whether the code file comprises the code blocks of which the occurrence times are greater than a preset threshold value, and extracting the code blocks of which the occurrence times are greater than the preset threshold value when the code file comprises the code blocks of which the occurrence times are greater than the preset threshold value.
The more the occurrence times of the code blocks in the code file are, the greater the influence of the code blocks on the semantics of the code file can be considered, so that in order to improve the accuracy of extracting the code feature vector, whether the code file comprises the code blocks of which the occurrence times are greater than a preset threshold value is judged, so as to extract the code blocks of which the occurrence times are greater than the preset threshold value, and the code feature vector is determined according to the code blocks of which the occurrence times are greater than the preset threshold value.
Specifically, traversing the code blocks in the code file, counting the occurrence frequency of each code block, judging whether the occurrence frequency of each code block is greater than the preset threshold, and when the occurrence frequency of any code block is greater than the preset threshold, extracting the code block of which the occurrence frequency is greater than the preset threshold. The preset threshold may be determined according to the code amount and the feature extraction precision of the code file, which is not limited in the embodiment of the present invention.
In practical applications, a plurality of code blocks are organized to form a code file, each code block has a certain semantic, and the code file formed by the code blocks after being organized also has a certain semantic. The semantic influence of the frequently-appearing code blocks in the code file on the semantics of the code file can be considered to be large, the number of occurrences of the code blocks in the code file is counted, the code blocks with the large number of occurrences are selected, and the code feature vector of the code file is determined according to the code blocks with the large number of occurrences.
100C, combining the numbers of occurrences of the different extracted code blocks into the code feature vector of the code file, and combining the code feature vectors of different code files into a sample code feature matrix.
Optionally, a start key character segment and a corresponding end identifier (or an end key character segment) of the code block are preset, character segments in the code file are traversed, and when the start key character segment and the corresponding end identifier are detected, text content between the start key character segment and the end identifier is determined as one code block. When counting the occurrence frequency of the code blocks, judging whether the starting key character segments and the termination marks of the code blocks are the same, and when the starting key character segments and the termination marks of any two code blocks are the same, determining that the two code blocks are the same code blocks. Of course, a code block identifier may also be added to the code block, the code block in the code file may be identified according to the code block identifier, and when the code block identifiers of any two code blocks are the same, it is determined that the two code blocks are the same code block.
For example, if "if" is set as a start key character segment and the first "}" mark after "if" is set as the termination mark corresponding to "if", then when a "}" mark is detected for the first time after "if" is detected, the text content between "if" and that "}" mark is determined as one code block.
As another example, for the code in the code file shown in fig. 2, when the assignment flag "=" is detected, the line containing it is regarded as one code block, which gives code block 1 "endEdge = coverage. nextsetbit (startEdge)" and code block 3 "Double localH = h span scopes. getscore (startEdge, endEdge-1)". When the start key character segment "if" is detected and a "}" flag is detected for the first time after it, the text content beginning with "if (Edge = -1) { Edge = height." and ending at that "}" flag is regarded as one code block.
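The two segmentation rules in these examples can be sketched as a small scanner. This is only an illustration under simplifying assumptions: the function name `extract_code_blocks` and the sample source text are hypothetical, a line is treated as an "if" block simply because it starts with the characters `if`, and nested braces are not handled.

```python
def extract_code_blocks(source):
    """Split source text into code blocks.

    Rule 1: a line starting with "if" opens a block that ends at the first "}".
    Rule 2: any other line containing "=" is a code block on its own.
    """
    blocks = []
    pos = 0
    while pos < len(source):
        line_end = source.find("\n", pos)
        if line_end == -1:
            line_end = len(source)
        line = source[pos:line_end].strip()
        if line.startswith("if"):
            close = source.find("}", pos)
            if close == -1:  # unterminated block: stop scanning
                break
            # Collapse whitespace so identical blocks compare equal.
            blocks.append(" ".join(source[pos:close + 1].split()))
            pos = close + 1
        else:
            if "=" in line:
                blocks.append(line)
            pos = line_end + 1
    return blocks


# Hypothetical sample text, not the code of fig. 2.
sample = """a = f(x);
if (a == -1) {
    a = g(x);
}
b = h(a);
"""
print(extract_code_blocks(sample))
```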
Assuming that the preset threshold is 1 and that Table 2 below lists the code blocks occurring more than once in the code file together with their numbers of occurrences, the numbers of occurrences of code blocks 1, 2 and 3 are combined into the code feature vector $[3, 2, 5]^{T}$.
TABLE 2
Code block            | Code block 1 | Code block 2 | Code block 3
Number of occurrences | 3            | 2            | 5
If the code feature vectors of the 3 code files "D1", "D2" and "D3" are $[3, 2, 5]^{T}$, $[1, 3, 2]^{T}$ and $[4, 2, 4]^{T}$ respectively, and each column vector of the sample code feature matrix represents the code feature vector of the corresponding code file, then the sample code feature matrix is $Y = \begin{bmatrix} 3 & 1 & 4 \\ 2 & 3 & 2 \\ 5 & 2 & 4 \end{bmatrix}$.
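One way to assemble such a matrix from per-file occurrence counts is sketched below. The function name `code_feature_matrix`, the block identifiers and the rule of filling a missing count with 0 are assumptions made for this illustration; the text only specifies that each column holds the code feature vector of one code file.

```python
def code_feature_matrix(counts_per_file, block_order):
    """Build a matrix whose columns are the code feature vectors of the files.

    counts_per_file: {file name: {block identifier: number of occurrences}}
    block_order: fixed order of block identifiers, so all vectors are aligned.
    A block that a file does not contain is given the count 0 (an assumption).
    """
    files = list(counts_per_file)
    return [[counts_per_file[f].get(block, 0) for f in files] for block in block_order]


# Counts taken from the example above; the identifiers b1..b3 are hypothetical.
counts = {
    "D1": {"b1": 3, "b2": 2, "b3": 5},
    "D2": {"b1": 1, "b2": 3, "b3": 2},
    "D3": {"b1": 4, "b2": 2, "b3": 4},
}
Y = code_feature_matrix(counts, ["b1", "b2", "b3"])
for row in Y:
    print(row)
```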
100D, obtaining each historical query word vector applied in the historical query process, and forming a sample query matrix by the obtained historical query word vectors.
For a query word, acquiring character segments in the query word, acquiring a characteristic value of each character segment according to a preset corresponding relation between the character segments and the characteristic values, and forming the characteristic values of different character segments into a query word vector, wherein the query word vector is used for representing text content of the query word.
For example, if the query word includes 3 character segments "C1", "C2", and "C3", and the preset correspondence between the character segments and the feature values is shown in Table 3 below, the query word vector is $[2, 3, 3]^{T}$.
TABLE 3
Character segment | C1 | C2 | C3
Feature value     | 2  | 3  | 3
The character segment may be a character, a word composed of a plurality of characters, or a phrase composed of a plurality of words. The correspondence between the character segments and the feature values may be determined according to the part of speech, the number of occurrences, and the like of the character segment. For example, the feature value of a verb may be set to be greater than the feature value of a noun, and the feature value of a noun greater than the feature value of a function word, or the number of occurrences of the character segment in the query word may be used as the feature value.
If the query word vectors corresponding to the 4 query words applied in the historical query process are $[2, 3, 3]^{T}$, $[1, 2, 5]^{T}$, $[1, 2, 3]^{T}$ and $[4, 5, 6]^{T}$ respectively, and each column vector of the sample query matrix represents the query word vector of one query word, then the sample query matrix is $Q = \begin{bmatrix} 2 & 1 & 1 & 4 \\ 3 & 2 & 2 & 5 \\ 3 & 5 & 3 & 6 \end{bmatrix}$.
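Step 100D can be sketched as a table lookup followed by stacking the vectors as columns. The function names are hypothetical, and the sketch assumes every character segment of a query word has an entry in the preset correspondence; how unknown segments are treated is not specified in the text.

```python
def query_word_vector(segments, feature_values):
    """Map each character segment to its preset feature value."""
    return [feature_values[segment] for segment in segments]


def sample_query_matrix(vectors):
    """Stack query word vectors as the columns of a matrix (list of rows)."""
    return [list(row) for row in zip(*vectors)]


feature_values = {"C1": 2, "C2": 3, "C3": 3}  # correspondence of Table 3
q = query_word_vector(["C1", "C2", "C3"], feature_values)

# The remaining three historical vectors are those given in the example.
history = [q, [1, 2, 5], [1, 2, 3], [4, 5, 6]]
Q = sample_query_matrix(history)
for row in Q:
    print(row)
```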
100E, determining, according to the sample query matrix, the sample text feature matrix and the sample code feature matrix, an objective function with the first conversion matrix and the second conversion matrix as arguments.
In the process of constructing the conversion matrix, in order that the semantic space to which the query words and the code feature vectors are mapped highlights the association between similar query words and code files, an objective function is constructed according to the target characteristics, the historical query word vectors, and the text feature vector and code feature vector of each code file, such that the target characteristics are satisfied when the objective function takes its minimum value. The solution corresponding to that minimum value is the constructed conversion matrix.
When the semantic space meets the target characteristics, the code file queried based on a query term is a code file similar to the query term, and can be used as a code file query result of the query term.
In an embodiment of the present invention, the target characteristics of the semantic space may include:
(1) The text feature vectors and the code feature vectors of the same code file, or of code files that are similar in the original space, are similar in the semantic space.
In order to highlight the association relationship of the same code file, the text feature vector and the code feature vector of the same code file should be similar in the semantic space.
Similar code files also exist in the code library, and the similar code files mean that text feature vectors or code feature vectors of the two code files are different, but can realize the same function, namely the semantics of the two code files are the same. In order to facilitate identifying similar code files in the code library, file identifiers (labels) can be added to the code files, and the code files with the same file identifiers are similar code files.
In order to highlight the association relationship between the similar code files, the text feature vector of any one of the two similar code files and the text feature vector of the other code file should be similar in the semantic space, and the text feature vector of the any one code file and the code feature vector of the other code file should also be similar in the semantic space.
The similarity of the two vectors means that the cosine distance between the two vectors is greater than a preset threshold value, or the Euclidean distance between the two vectors is less than the preset threshold value and the like in the same space.
(2) Dissimilar code files in the original space are still dissimilar in the semantic space.
In order to highlight the association relationship between similar code files, the text feature vector of any code file and the text feature vector of another code file in the code files which are not similar in the original space should not be similar in the semantic space, and the text feature vector of any code file and the code feature vector of another code file should not be similar in the semantic space.
(3) Similar text feature vectors in the original space are still similar in the semantic space.
In order to highlight the association between similar text feature vectors, similar text feature vectors in the original space should be similar in the semantic space.
(4) Similar code feature vectors in the original space are still similar in the semantic space.
In order to highlight the association relationship between similar code feature vectors, similar code feature vectors in the original space should be similar in the semantic space.
(5) Character segments and code blocks marked as similar remain similar in the semantic space.
When a character segment is included in the text content of a code block, the character segment and the code block may be marked as similar. In order to highlight the association between character segments and code blocks, a character segment and a code block marked as similar should be similar in the semantic space.
Optionally, the conversion matrix includes a first conversion matrix and a second conversion matrix, where U denotes the first conversion matrix and V denotes the second conversion matrix. The objective function F with U and V as arguments is composed of the five terms $\epsilon_1(U, V)$, $\epsilon_2(U, V)$, $g(U, V)$, $c(U, V)$ and $r(U, V)$ described below, which are weighted by regularization parameters.
Five terms in the objective function are described below to illustrate how the semantic space satisfies the objective characteristics based on the trained first and second transformation matrices.
The first term is $\epsilon_1(U, V)$, where X is the sample text feature matrix and Y is the sample code feature matrix.
In the embodiment of the present invention, the distance between two matrices is expressed by the F-norm of their difference. In the semantic space, $X^{T}U$ is the converted text feature matrix and $Y^{T}V$ is the converted code feature matrix, so $\lVert X^{T}U - Y^{T}V \rVert_F$ is the distance between the converted text feature matrix and the converted code feature matrix. $\epsilon_1(U, V)$ therefore expresses the distance between the converted text feature vector and the converted code feature vector of the same code file or of similar code files, that is, the distance in the semantic space between the text feature vector and the code feature vector of code files that are the same or similar in the original space. When $\epsilon_1(U, V)$ is minimal, this distance is minimal, so the semantic space makes the text feature vector and the code feature vector of the same code file or of similar code files similar.
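The distance that the first term is built on can be computed as sketched below. The matrices are small hypothetical values chosen for easy checking, and the sketch computes the plain F-norm of the difference; whether the term in the objective function squares or scales this norm is not recoverable from the text.

```python
import math


def transpose(A):
    return [list(row) for row in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def frobenius_distance(A, B):
    """F-norm of A - B for two matrices of the same shape."""
    return math.sqrt(sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)))


# Hypothetical values: X is dx x m, Y is dy x m, U is dx x k, V is dy x k.
X = [[1, 0], [0, 1]]
U = [[1], [2]]
Y = [[1, 1]]
V = [[1]]

converted_text = matmul(transpose(X), U)  # m x k
converted_code = matmul(transpose(Y), V)  # m x k
distance = frobenius_distance(converted_text, converted_code)
print(converted_text, converted_code, distance)
```

Both converted matrices have m rows and k columns, which is what makes the row-by-row comparison of text and code features possible.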
The second term is $\epsilon_2(U, V)$, where m is the number of code files selected during training, i = 1, 2 … m, l = 1, 2 … m, the i-th code file and the j-th code file are similar code files, and the i-th code file and the l-th code file are dissimilar code files. ωi1 can be determined from the similarity between the text content of either one of two similar code files and the code block of the other, according to the similarity matrix W between the sample text feature matrix and the sample code feature matrix.
$\epsilon_2(U, V)$ is a loss function. It expresses the difference between the distance between the converted text feature vector and code feature vector of similar code files and the distance between the converted text feature vector and code feature vector of dissimilar code files, that is, the difference in the semantic space between these two distances for code files that are respectively similar and dissimilar in the original space. The smaller the distance for similar code files and the larger the distance for dissimilar code files, the more the association between similar code files is highlighted. When $\epsilon_2(U, V)$ is minimal, the distance between the text feature vector and the code feature vector of similar code files may be considered minimal and that of dissimilar code files maximal, so that the text feature vector and the code feature vector of code files similar in the original space are similar in the semantic space, and code files dissimilar in the original space remain dissimilar in the semantic space.
W is a similarity matrix between the sample text feature matrix and the sample code feature matrix, and is used for representing the similarity between a character segment in one code file and a code block of another code file. The similarity matrix W may be obtained by determining whether a code block of either one of two similar code files includes a character segment of the other code file. For example, the rows of W correspond to the character segments of one of the two similar code files and the columns to the code blocks of the other; when the code block corresponding to an element position in W includes the character segment corresponding to that element position, 1 is filled in the element position, and otherwise 0 is filled in.
The third term is $g(U, V)$, where Q is the sample query matrix and L is the normalized Laplacian matrix of the matrix W.
As will be appreciated by those skilled in the art, the square of the F-norm of a matrix A is equal to the trace of $A^{T}A$, so a graph mining algorithm can be adopted to obtain the distance between two matrices by calculating the trace of a matrix.
W is a similarity matrix between the sample text feature matrix and the sample code feature matrix and is used for representing the similarity between character segments in one code file and code blocks of another code file. If the character segments and the code blocks are regarded as nodes in the semantic space, an undirected graph can be constructed according to the matrix W, and the normalized Laplacian matrix L of W can be obtained from this graph using I and D, where I is the identity matrix and D is a diagonal matrix whose elements are $d_{ii} = \sum_j w_{ij}$. From the matrix L, the graph regularization term $g(U, V)$ of the graph structure formed by W after mapping to the semantic space can be calculated.
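A normalized Laplacian can be computed from I and D as sketched below. The sketch assumes the common symmetric form $L = I - D^{-1/2} W D^{-1/2}$ and a square, symmetric W; the exact formula used in the embodiment is not reproduced in the text, so this choice is an assumption. The handling of nodes with zero degree is likewise a convention chosen here.

```python
import math


def normalized_laplacian(W):
    """Symmetric normalized Laplacian I - D^(-1/2) W D^(-1/2) of a square matrix W.

    d_ii is the sum of row i of W. For a node with degree 0 the factor
    1/sqrt(d_ii) is taken as 0 (an assumed convention).
    """
    n = len(W)
    degrees = [sum(row) for row in W]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    return [
        [(1.0 if i == j else 0.0) - inv_sqrt[i] * W[i][j] * inv_sqrt[j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical undirected graph with two connected nodes.
W = [[0, 1], [1, 0]]
L = normalized_laplacian(W)
print(L)

# Hypothetical star graph: node 0 is linked to nodes 1 and 2.
L_star = normalized_laplacian([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
print(L_star)
```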
The terms of $g(U, V)$ represent the following distances in the semantic space, each taken between vectors that are similar in the original space:
the distance between the converted text feature vectors of similar code files;
the distance between the converted text feature vector and the converted code feature vector of similar code files;
the distance between the converted code feature vectors of similar code files;
the distance between the converted historical query word vectors and the converted text feature vector of each code file;
the distance between the converted historical query word vectors and the converted code feature vector of each code file;
the distance between converted similar historical query word vectors.
When $g(U, V)$ is minimal, each distance represented by $g(U, V)$ can be considered minimal, so that in the semantic space the text feature vectors and the code feature vectors of code files similar in the original space are still similar, text feature vectors similar in the original space are still similar, and code feature vectors similar in the original space are still similar.
When the gradient of the objective function is subsequently calculated, the constant factor of the third term in the gradient is 2, so the coefficient of each distance in $g(U, V)$ is set to 2 in order to simplify the formula. This coefficient has no practical significance and may be replaced by another value or omitted.
The fourth term is $c(U, V)$, which represents the distance between any converted character segment and the converted code feature vector of the code file containing that character segment, that is, the distance in the semantic space between any character segment and the code file containing it. When $c(U, V)$ is minimal, if a code file contains a character segment, the distance between the character segment and the code feature vector of the code file is minimal, so that in the semantic space character segments and code blocks marked as similar remain similar.
And R is a similarity matrix between the sample text characteristic matrix and the sample code characteristic matrix and is used for representing the similarity between the character segments and the code blocks in the code file. The similarity matrix R can be determined by determining whether any character segment is included in any code block, for example, if the similarity matrix R is set to have character segments as rows and code blocks as columns, when a code block corresponding to any element position in R includes a character segment corresponding to the element position, 1 is filled in the element position, and when a code block corresponding to any element position does not include a character segment corresponding to the element position, 0 is filled in the element position.
For example, suppose the dimension of the text feature vector of a code file is 3 and its components are the feature values of "character segment 1", "character segment 2" and "character segment 3", while the dimension of the code feature vector is 4 and its components represent "code block 1", "code block 2", "code block 3" and "code block 4" in the corresponding code file. It is judged for each of the 4 code blocks whether it includes each of the 3 character segments. In Table 4 below, "√" indicates that the corresponding code block contains the corresponding character segment, and "×" indicates that it does not. When it is determined that "code block 2" includes "character segment 2", "code block 3" includes "character segment 3", and "code block 4" includes "character segment 1" and "character segment 2", the similarity matrix R is $R = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}$.
TABLE 4
                    | Code block 1 | Code block 2 | Code block 3 | Code block 4
Character segment 1 | ×            | ×            | ×            | √
Character segment 2 | ×            | √            | ×            | √
Character segment 3 | ×            | ×            | √            | ×
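The construction of the 0/1 similarity matrix described above can be sketched as a containment test. The function name `containment_matrix` and the character segments and code block texts are hypothetical, chosen only so that the containment pattern matches Table 4; plain substring matching is used as the containment test, which is an assumption.

```python
def containment_matrix(segments, code_blocks):
    """Rows are character segments, columns are code blocks.

    An element is 1 when the code block text contains the character segment
    (plain substring test) and 0 otherwise.
    """
    return [[1 if segment in block else 0 for block in code_blocks] for segment in segments]


# Hypothetical segments and code block texts reproducing the pattern of Table 4.
segments = ["alpha", "beta", "gamma"]
code_blocks = ["x = 1;", "y = beta;", "z = gamma;", "w = alpha + beta;"]
R = containment_matrix(segments, code_blocks)
for row in R:
    print(row)
```

The same procedure applies to the matrix W, with the character segments and code blocks taken from two similar code files.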
The fifth term is $r(U, V)$, which controls the complexity of the first conversion matrix U and the second conversion matrix V to prevent overfitting.
100F, solving for the minimum value of the objective function and obtaining the solution corresponding to that minimum value.
In the embodiment of the present invention, with respect to the argument U, the function value of the objective function at the midpoint of two matrices U can be greater than the average of its function values at the two matrices, and the same holds with respect to the argument V. The objective function is therefore non-convex in the first conversion matrix and the second conversion matrix, and its minimum value cannot be solved for exactly. The first conversion matrix and the second conversion matrix are therefore optimized iteratively, for example based on a gradient descent algorithm.
The specific process of optimizing based on the gradient descent algorithm comprises the following steps: and calculating a first gradient of the target function to the first conversion matrix and a second gradient of the target function to the second conversion matrix, respectively iterating the first conversion matrix and the second conversion matrix along opposite gradient directions based on the first gradient and the second gradient until the iteration times reach the maximum iteration times or the target function converges, and outputting the first conversion matrix and the second conversion matrix at the moment. Further, the first gradient is used for optimizing the first transformation matrix, the second gradient is used for optimizing the second transformation matrix, and the value of one item of the first transformation matrix and the second transformation matrix can be fixed and the other item can be iteratively optimized each time iterative optimization is performed.
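The first gradient of the objective function F with respect to the first conversion matrix U is denoted $\partial F / \partial U$, and the second gradient of the objective function F with respect to the second conversion matrix V is denoted $\partial F / \partial V$.
The iterative formula of the first conversion matrix is $U \leftarrow U - \eta \, \partial F / \partial U$, and the iterative formula of the second conversion matrix is $V \leftarrow V - \eta \, \partial F / \partial V$, where η is the iteration step size.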
Take as an example fixing the second conversion matrix V first and iteratively optimizing the first conversion matrix U. At each iteration, the first conversion matrix U is replaced by the matrix given by the iterative formula of the first conversion matrix, and the updated first conversion matrix is substituted into the objective function to obtain an optimized objective function. It is then judged whether the optimized objective function has converged. When it has converged, the updated first conversion matrix and the second conversion matrix V are output. When it has not converged, the current first conversion matrix U is again replaced by the matrix given by the iterative formula, and the iteration of the first conversion matrix U continues until the objective function converges or the number of iterations reaches the maximum number of iterations of the first conversion matrix. The first conversion matrix U is then fixed, and the second conversion matrix V is iteratively optimized in the same way.
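The alternating procedure can be sketched on a deliberately simplified objective. The sketch below assumes $F(U, V) = \lVert X^{T}U - Y^{T}V \rVert_F^2 + \lambda(\lVert U \rVert_F^2 + \lVert V \rVert_F^2)$, which keeps only a first-term-like distance and a complexity penalty; it is not the five-term objective of the embodiment, and all names, matrices and parameter values are hypothetical.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def combine(A, B, alpha, beta):
    """Return alpha * A + beta * B element-wise."""
    return [[alpha * a + beta * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sq_norm(A):
    return sum(a * a for row in A for a in row)


def residual(U, V, X, Y):
    return combine(matmul(transpose(X), U), matmul(transpose(Y), V), 1.0, -1.0)


def objective(U, V, X, Y, lam):
    return sq_norm(residual(U, V, X, Y)) + lam * (sq_norm(U) + sq_norm(V))


def grad_U(U, V, X, Y, lam):
    # dF/dU = 2 X (X^T U - Y^T V) + 2 lam U
    return combine(matmul(X, residual(U, V, X, Y)), U, 2.0, 2.0 * lam)


def grad_V(U, V, X, Y, lam):
    # dF/dV = -2 Y (X^T U - Y^T V) + 2 lam V
    return combine(matmul(Y, residual(U, V, X, Y)), V, -2.0, 2.0 * lam)


def descend(M, grad, value, eta, max_iter, tol):
    """Step against the gradient until the objective stops changing."""
    previous = value(M)
    for _ in range(max_iter):
        M = combine(M, grad(M), 1.0, -eta)
        current = value(M)
        if abs(previous - current) < tol:
            break
        previous = current
    return M


def alternate(U, V, X, Y, lam=0.1, eta=0.1, max_iter=200, tol=1e-9):
    # Fix V and optimize U, then fix the new U and optimize V.
    U = descend(U, lambda M: grad_U(M, V, X, Y, lam),
                lambda M: objective(M, V, X, Y, lam), eta, max_iter, tol)
    V = descend(V, lambda M: grad_V(U, M, X, Y, lam),
                lambda M: objective(U, M, X, Y, lam), eta, max_iter, tol)
    return U, V


X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[1.0, 1.0], [0.0, 1.0]]
U0 = [[1.0], [0.5]]
V0 = [[0.2], [0.8]]
U1, V1 = alternate(U0, V0, X, Y)
print(objective(U0, V0, X, Y, 0.1), objective(U1, V1, X, Y, 0.1))
```

With only these two terms the objective is convex in each matrix separately and is minimized trivially by zero matrices, so the sketch shows the mechanics of the update only; in the full objective the remaining terms shape the solution.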
It should be noted that, in the embodiment of the present invention, iterative optimization of the first conversion matrix and the second conversion matrix based on a gradient descent algorithm is taken as an example for description, and in a practical application process, other machine learning algorithms may also be used to optimize the first conversion matrix and the second conversion matrix, which is not limited in the embodiment of the present invention.
The co-occurrence relation of the code characteristic vector and the text characteristic vector in the training sample is analyzed based on a machine learning algorithm, so that the direct comparison between the code block and the character segment is realized, and the query precision is further improved.
Optionally, on the basis of the technical solution of the embodiment shown in fig. 1, the step 101 "performing vector conversion on the query word vector and the code feature vector of each code file based on the conversion matrix constructed from the historical query word vectors applied in the historical query process and the text feature vector and code feature vector of each code file, to obtain the converted query word vector and the converted code feature vector of each code file" includes the following steps 1011 to 1014:
1011. and calculating the product of the transpose matrix of the first conversion matrix and the query word vector to obtain the converted query word vector, wherein the conversion matrix comprises the first conversion matrix and the second conversion matrix.
In the embodiment of the present invention, in order to map the query word vector and the code feature vector to the same semantic space, the dimensions of the query word vector and the code feature vector need to be obtained, so as to perform a corresponding transpose operation according to the dimensions of the query word vector and the code feature vector, thereby implementing a matrix multiplication operation.
Assuming that the number of code files is m and the dimension of the text feature vector is dx, the dimension of the query word vector q is dx × 1. In the process of constructing the conversion matrix, the sample text feature matrix X is mapped into the semantic space as $X^{T}U$ according to the first conversion matrix U, so the dimension of the constructed first conversion matrix U is dx × k, where k may be any value. Therefore, to implement the matrix multiplication, the first conversion matrix U is transposed, and the product $U^{T}q$ of the transposed matrix of the first conversion matrix U and the query word vector q is calculated.
For convenience of description, in the embodiment of the present invention, the obtained vector is a column vector, and when performing matrix multiplication and vector extraction in the following, the vector extracted after transposing the matrix is a row vector of the original matrix.
1012. And forming a code feature matrix by using the code feature vectors of each code file.
The process of forming the code feature matrix is similar to the process of forming the sample code feature matrix in step 100C, and is not repeated here.
1013. And calculating the product of the transpose matrix of the code characteristic matrix and the second conversion matrix to obtain the converted code characteristic matrix.
Based on the example of step 1011, the code feature matrix is denoted by Y' to distinguish it from the sample code feature matrix Y. Assuming that the dimension of the code feature vector is dy, the dimension of the code feature matrix Y' is dy × m. In the process of constructing the conversion matrix, the sample code feature matrix Y is mapped into the semantic space as $Y^{T}V$ according to the second conversion matrix V, so the dimension of the second conversion matrix is dy × k. Therefore, to implement the matrix multiplication, the code feature matrix Y' is transposed, and the product $Y'^{T}V$ of the transposed matrix of the code feature matrix Y' and the second conversion matrix is calculated; $Y'^{T}V$ is the converted code feature matrix.
1014. And extracting each vector in the transpose matrix of the converted code characteristic matrix into a converted code characteristic vector of the corresponding code file.
Each column of the code feature matrix Y' represents the code feature vector of the corresponding code file, so each row of the converted code feature matrix $Y'^{T}V$ represents the converted code feature vector of the corresponding code file. The converted code feature matrix is transposed and its vectors are then extracted, that is, each vector in the transposed matrix of the converted code feature matrix is extracted as the converted code feature vector of the corresponding code file.
Take m = 3, dx = dy = 3 and k = 2 as an example. The query word vector q is $[2, 3, 3]^{T}$, and the first conversion matrix U and the second conversion matrix V are each 3 × 2 matrices. The code feature vectors of the 3 code files "D1", "D2" and "D3" in the code library are $[3, 2, 5]^{T}$, $[1, 3, 2]^{T}$ and $[4, 2, 4]^{T}$ respectively, and each column represents the code feature vector of the corresponding code file, so the code feature matrix is $Y' = \begin{bmatrix} 3 & 1 & 4 \\ 2 & 3 & 2 \\ 5 & 2 & 4 \end{bmatrix}$. When the query word vector is converted, $U^{T}q$ is calculated, and the resulting 2 × 1 vector is the converted query word vector. When the code feature vectors are converted, $Y'^{T}V$ is calculated, and the first, second and third rows of the resulting 3 × 2 matrix are the converted code feature vectors of "D1", "D2" and "D3" respectively.
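Steps 1011 to 1014 can be sketched for this example as follows. The query word vector and the code feature matrix are those of the example, but the entries of U and V below are hypothetical placeholders chosen for easy arithmetic; they are not the trained conversion matrices of the embodiment.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


q = [[2], [3], [3]]                           # query word vector, dx x 1
Y_prime = [[3, 1, 4], [2, 3, 2], [5, 2, 4]]   # code feature matrix, dy x m

# Hypothetical conversion matrices (dx x k and dy x k with k = 2).
U = [[1, 0], [0, 1], [1, 1]]
V = [[1, 0], [0, 1], [1, 1]]

converted_query = matmul(transpose(U), q)          # step 1011: U^T q, k x 1
converted_codes = matmul(transpose(Y_prime), V)    # step 1013: Y'^T V, m x k

# Step 1014: each row of Y'^T V is the converted code feature vector of one file.
converted_by_file = dict(zip(["D1", "D2", "D3"], converted_codes))
print(converted_query, converted_by_file)
```

After conversion the query word vector and every code feature vector have k components, so they can be compared in the same semantic space.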
It should be noted that steps 1011 to 1014 are described by taking column vectors as an example. In general, when the query word vector is converted, it is judged whether the number of columns of one of the first conversion matrix and the query word vector is equal to the number of rows of the other. If the number of rows of the query word vector is equal to the number of columns of the first conversion matrix, the product of the first conversion matrix and the query word vector can be calculated directly to obtain the converted query word vector. If instead the number of rows of the first conversion matrix is equal to the number of rows of the query word vector, the first conversion matrix is transposed, and the product of the transposed matrix of the first conversion matrix and the query word vector is calculated to obtain the converted query word vector. The same applies to the conversion of the code feature vector, which is not described herein again.
Optionally, on the basis of the technical solution of the embodiment shown in fig. 1, the step 102 "calculating the similarity between each code file and the query word according to the query word vector, the converted code feature vector of each code file, and the text feature vector of each code file" includes the following steps 1021 to 1023:
1021. and calculating a first similarity between the text feature vector of each code file and the query word vector.
Optionally, the cosine distance between the text feature vector of each code file and the query word vector is calculated and taken as the first similarity. Based on the example of step 1014, the query word vector q is $[2, 3, 3]^{T}$; assuming that the text feature vector x of "D1" is $[1, 2, 1]^{T}$, the first similarity is $\frac{x^{T}q}{\lVert x \rVert \, \lVert q \rVert} = \frac{11}{\sqrt{6}\,\sqrt{22}} \approx 0.957$.
1022. And calculating a second similarity between the converted code feature vector of each code file and the converted query word vector.
Optionally, the cosine distance between the converted code feature vector of each code file and the converted query word vector is calculated and taken as the second similarity. Based on the examples of step 1014 and step 1021, the second similarity of "D1" is the cosine distance between the converted query word vector and the converted code feature vector of "D1", which is 0.944 in this example.
1023. And carrying out weighted summation on the first similarity and the second similarity of each code file to obtain the similarity between each code file and the query word.
In the embodiment of the present invention, the first similarity is used to indicate a similarity of text contents between the query term and the code file, and the second similarity is used to indicate a similarity of structures between the query term and the code file, so that the first similarity and the second similarity should be comprehensively considered when querying the code file to obtain a similarity between each code file and the query term. In practical applications, the first similarity and the second similarity may have different influences on the similarity between the code file and the query term, when the text contents of the two code files are similar but the structures of the two code files are different, the second similarity has a larger influence on the similarity between the code file and the query term, and when the structures of the two code files are similar but the text contents of the two code files are different, the first similarity has a larger influence on the similarity between the code file and the query term, the weight of the first similarity and the weight of the second similarity may be set, so as to express the influence of the first similarity and the second similarity on the similarity between the code file and the query term by the weights. And according to the set weight, carrying out weighted summation on the first similarity and the second similarity of each code file to obtain the similarity between each code file and the query word.
Based on the example of step 1021 and step 1022, assuming that the weight of the first similarity is 0.6 and the weight of the second similarity is 0.4, the similarity between code file "D1" and the query word is 0.957 × 0.6 + 0.944 × 0.4 = 0.9518.
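The weighted summation of step 1023 can be sketched in a few lines of Python. This is only an illustration: the function names `cosine_similarity` and `combined_similarity` are hypothetical, and the treatment of an all-zero vector is an assumption that the text does not address.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)

def combined_similarity(first, second, w_first=0.6, w_second=0.4):
    """Weighted sum of the first (text) and second (structure) similarity."""
    return w_first * first + w_second * second

# Values taken from the worked example for code file "D1".
score = combined_similarity(0.957, 0.944)
print(round(score, 4))  # 0.9518
```

The weights are left as parameters because the text only says that they may be set; 0.6 and 0.4 are the values assumed in the example.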
All the above-mentioned optional technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.
The process of calculating the similarity between the code file and the query term will be described below with reference to fig. 3.
Referring to fig. 3, text contents such as comments and variable names are extracted from the code file, a text feature vector is obtained according to a feature value of the text contents, code blocks with occurrence times larger than a preset threshold value are extracted from the code file, and the occurrence times of the code blocks with the occurrence times larger than the preset threshold value are combined into the code feature vector. When code file query is carried out, a query word vector is obtained according to a query word, first similarity between the query word vector and a text feature vector is calculated, the query word vector and the code feature vector are converted, second similarity between the converted query word vector and the converted code feature vector is calculated, and the first similarity and the second similarity are weighted and summed to obtain the similarity between the code file and the query word.
In this process, the text content and the code blocks whose number of occurrences is greater than the preset threshold are first extracted from the code file, so that the code file has two representations: a text feature vector based on its text content and a code feature vector based on its structure. When a query word is input, it is likewise converted into a query word vector. The text feature vector can be compared with the query word vector directly, which yields the first similarity. To compare the code feature vector with the query word vector, both are first converted so that they are mapped to the same semantic space, and the second similarity between the converted code feature vector and the converted query word vector is calculated in that space. The weighted sum of the first similarity and the second similarity gives the similarity between the code file and the query word. The code files are then sorted by their final similarity to obtain the code file query result.
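The query flow described above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the helper names (`cosine`, `convert`, `rank_code_files`), the toy matrices `U` and `V`, the second file name "D2", and all vector values are made up for illustration, and the conversion matrices are assumed to be already trained.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def convert(matrix, vector):
    """Return matrix^T * vector, with the matrix stored as a list of rows."""
    return [sum(row[j] * v for row, v in zip(matrix, vector))
            for j in range(len(matrix[0]))]

def rank_code_files(query, files, U, V, w_first=0.6, w_second=0.4):
    """files maps a file name to (text_feature_vector, code_feature_vector)."""
    q_conv = convert(U, query)  # converted query word vector
    scored = []
    for name, (text_vec, code_vec) in files.items():
        first = cosine(text_vec, query)                 # text similarity
        second = cosine(convert(V, code_vec), q_conv)   # structure similarity
        scored.append((w_first * first + w_second * second, name))
    scored.sort(reverse=True)  # highest similarity first
    return [(name, score) for score, name in scored]

# Toy data: 2 text terms, 3 code blocks, 2-dimensional semantic space.
U = [[1, 0], [0, 1]]
V = [[1, 0], [0, 1], [0, 0]]
files = {"D1": ([1, 0], [2, 0, 1]), "D2": ([0, 1], [0, 3, 1])}
result = rank_code_files([1, 0], files, U, V)
print(result)
```

In this toy setup "D1" matches the query in both text and structure, so it is ranked ahead of "D2".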
The method provided by the embodiment of the invention takes into account the influence of the structure of a code file on its semantics. By acquiring both the text feature vector and the code feature vector of each code file, it fully mines the content and structure of the code file and improves the accuracy of feature extraction. During a code file query, the similarity between each code file and the query word is calculated based on both the text feature vector and the code feature vector to obtain the code file query result, which improves query precision. Further, the first conversion matrix and the second conversion matrix are iteratively optimized with a gradient descent algorithm, and analyzing the co-occurrence relation between code feature vectors and text feature vectors in the training samples enables direct comparison between code blocks and character segments, which further improves query precision.
Fig. 4 is a schematic structural diagram of a code file querying device according to an embodiment of the present invention. Referring to fig. 4, the device includes a conversion module 401, a similarity calculation module 402, and a result output module 403, wherein:
the conversion module 401 is configured to convert, based on a historical query word vector applied in a historical query process and a conversion matrix constructed by a text feature vector and a code feature vector of each code file in a code library, a current query word vector and a code feature vector of each code file to obtain a converted query word vector and a converted code feature vector of each code file, where the code feature vector of each code file is determined according to a code block in the code file;
the similarity calculation module 402 is connected to the conversion module 401, and is configured to calculate a similarity between each code file and a query word according to the current query word vector, the converted code feature vector of each code file, and the text feature vector of each code file;
the result output module 403 is connected to the similarity calculation module 402, and is configured to obtain a code file query result according to the similarity between each code file and the query term.
Optionally, the conversion module 401 includes:
the first conversion unit is used for calculating the product of a transpose matrix of a first conversion matrix and the query word vector to obtain a converted query word vector, and the conversion matrix comprises the first conversion matrix and a second conversion matrix;
the code characteristic forming unit is used for forming a code characteristic matrix from the code characteristic vectors of each code file;
the second conversion unit is used for calculating the product of the transposed matrix of the code characteristic matrix and the second conversion matrix to obtain a converted code characteristic matrix;
and the code characteristic vector extraction unit is used for extracting each vector in the transpose matrix of the converted code characteristic matrix as a converted code characteristic vector corresponding to the code file.
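The four units of the conversion module can be illustrated with plain Python lists. The sketch below assumes that the code feature matrix stores one code file per column, which is what makes the product of its transpose with the second conversion matrix well defined; the text does not state the orientation explicitly. The function names and the numeric values are hypothetical.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def convert_query(U, q):
    """First conversion unit: U^T * q."""
    column = matmul(transpose(U), [[x] for x in q])
    return [row[0] for row in column]

def convert_code_features(code_vectors, V):
    """Remaining units: build Y, compute Y^T * V, then read the converted
    vector of each code file out of the transpose of that product."""
    Y = transpose(code_vectors)        # assumed layout: one column per file
    Y_conv = matmul(transpose(Y), V)   # converted code feature matrix
    Y_conv_t = transpose(Y_conv)       # each column belongs to one file
    return [list(col) for col in zip(*Y_conv_t)]

V = [[1, 0], [0, 1], [1, 1]]           # toy second conversion matrix
converted = convert_code_features([[2, 0, 1], [0, 3, 1]], V)
print(converted)
print(convert_query([[1, 2], [3, 4]], [1, 1]))
```

Under this layout, the converted vector of each code file equals the product of the transpose of V with that file's original code feature vector, so the batch form and a per-file conversion give the same result.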
Optionally, the similarity calculation module 402 includes:
the first calculation unit is used for calculating a first similarity between the text feature vector of each code file and the query word vector;
the second calculation unit is used for calculating a second similarity between the converted code feature vector of each code file and the converted query word vector;
and the weighted summation unit is used for carrying out weighted summation on the first similarity and the second similarity of each code file to obtain the similarity between each code file and the query word.
Optionally, the apparatus further comprises:
the text characteristic vector acquisition module is used for acquiring a text characteristic vector of each code file according to the natural language description, the annotation and the variable name of the code file;
the occurrence frequency judging module is used for judging whether the code file comprises a code block of which the occurrence frequency is greater than a preset threshold value;
the code block extraction module is used for extracting the code blocks of which the occurrence times are greater than the preset threshold when the code files comprise the code blocks of which the occurrence times are greater than the preset threshold;
and the code characteristic vector composition module is used for composing the extracted occurrence times of different code blocks into the code characteristic vector of the code file.
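The construction of the code feature vector can be sketched as follows. The sketch assumes that the code blocks of a file have already been extracted and normalized into comparable identifiers, and that a fixed block vocabulary defines the vector positions; both are assumptions, as are the function name `code_feature_vector` and the sample block labels.

```python
from collections import Counter

def code_feature_vector(blocks, block_vocabulary, threshold=1):
    """Count each code block and keep only the counts above the threshold.

    blocks: code block identifiers found in one code file (hypothetical
            normalized labels).
    block_vocabulary: ordered list of blocks defining the vector positions.
    """
    counts = Counter(blocks)
    return [counts[b] if counts[b] > threshold else 0
            for b in block_vocabulary]

blocks = ["for-loop", "if-else", "for-loop", "try-catch", "for-loop", "if-else"]
vocabulary = ["for-loop", "if-else", "try-catch"]
vector = code_feature_vector(blocks, vocabulary, threshold=1)
print(vector)  # [3, 2, 0]
```

Blocks whose number of occurrences does not exceed the threshold contribute zero, so a file with no frequent code block yields an all-zero code feature vector.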
Optionally, the apparatus further comprises:
the first matrix composition module is used for obtaining each historical query word vector applied in the historical query process to form a sample query matrix;
the second matrix composition module is used for composing the text characteristic vector of each code file in the code base into a sample text characteristic matrix;
the third matrix composition module is used for composing the code characteristic vector of each code file in the code base into a sample code characteristic matrix;
the target function determining module is used for determining a target function with a first conversion matrix and a second conversion matrix as independent variables according to the sample query matrix, the sample text characteristic matrix and the sample code characteristic matrix;
and the objective function solving module is used for solving the minimum value of the objective function and obtaining a corresponding solution when the objective function is the minimum value.
Optionally, the objective function is:
wherein the coefficients in the objective function are regularization parameters, U is the first transformation matrix, V is the second transformation matrix, ε1(U, V) is used for representing the distance between the converted text feature vector and code feature vector of the same code file or of similar code files, ε2(U, V) is used for representing the difference between the distance between the text feature vectors and code feature vectors of similar code files and the distance between the text feature vectors and code feature vectors of dissimilar code files, g(U, V) is used for representing the sum of the distance between text feature vectors, the distance between code feature vectors, the distance between text feature vectors and code feature vectors, the distance between the history query word vectors and the text feature vector of each code file, the distance between the history query word vectors and the code feature vector of each code file, and the distance between similar history query word vectors, c(U, V) is used for representing the distance between any converted character segment and the code feature vector of the code file containing the character segment, and r(U, V) is used for controlling the complexity of the first conversion matrix U and the second conversion matrix V.
Optionally,
wherein m is the number of code files, i = 1, 2, …, m, l = 1, 2, …, m, j → i denotes that the jth code file is similar to the ith code file, n is the number of history query words applied in the history query process, X is the sample text feature matrix, Y is the sample code feature matrix, Q is the sample query matrix, W and R are similarity matrices between the sample text feature matrix and the sample code feature matrix, and L is the normalized Laplacian matrix of the matrix W.
The device provided by the embodiment of the invention takes into account the influence of the structure of a code file on its semantics. By acquiring both the text feature vector and the code feature vector of each code file, it fully mines the content and structure of the code file and improves the accuracy of feature extraction. During a code file query, the similarity between each code file and the query word is calculated based on both the text feature vector and the code feature vector to obtain the code file query result, which improves query precision.
It should be noted that the code file querying device provided in the above embodiment is illustrated only with the above division into functional modules. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the code file querying device and the code file querying method provided by the above embodiments belong to the same concept; the specific implementation process of the device is detailed in the method embodiments and is not repeated here.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (12)

1. A code file query method, the method comprising:
calculating the product of a transpose matrix of a first conversion matrix included in a conversion matrix and a current query word vector to obtain a converted query word vector, wherein the conversion matrix includes the first conversion matrix and a second conversion matrix; the conversion matrix is used for converting the query word vector and the code feature vector, the first conversion matrix is used for converting the query word vector, and the second conversion matrix is used for converting the code feature vector of each code file, so that the query word vector and the code feature vector are mapped to the same semantic space;
forming a code feature matrix by using the code feature vector of each code file, wherein the code feature vector of each code file is determined according to a code block in the code file;
calculating the product of the transpose matrix of the code characteristic matrix and the second conversion matrix to obtain a converted code characteristic matrix;
extracting each vector in the transpose matrix of the converted code characteristic matrix as a converted code characteristic vector of a corresponding code file;
calculating the similarity between each code file and the query word according to the current query word vector, the converted code feature vector of each code file and the text feature vector of each code file;
and obtaining a code file query result according to the similarity between each code file and the query word.
2. The method of claim 1, wherein calculating the similarity between each code file and the query word according to the current query word vector, the converted code feature vector of each code file, and the text feature vector of each code file comprises:
calculating a first similarity between the text feature vector of each code file and the query word vector;
calculating a second similarity between the converted code feature vector of each code file and the converted query word vector;
and carrying out weighted summation on the first similarity and the second similarity of each code file to obtain the similarity between each code file and the query word.
3. The method of claim 1, wherein before computing the product of the transpose of the first transformation matrix comprised by the transformation matrix and the current query word vector to obtain the transformed query word vector, the method further comprises:
for each code file, acquiring a text feature vector of the code file according to the natural language description, the annotation and the variable name of the code file;
judging whether the code file comprises a code block of which the occurrence frequency is greater than a preset threshold value;
when the code file comprises code blocks with the occurrence times larger than the preset threshold, extracting the code blocks with the occurrence times larger than the preset threshold;
and combining the extracted occurrence times of different code blocks into a code feature vector of the code file.
4. The method of claim 1, wherein before computing the product of the transpose of the first transformation matrix comprised by the transformation matrix and the current query word vector to obtain the transformed query word vector, the method further comprises:
obtaining each historical query word vector applied in the historical query process to form a sample query matrix;
forming a sample text feature matrix by using the text feature vectors of each code file in the code base;
forming a sample code characteristic matrix by using the code characteristic vector of each code file in the code base;
determining a target function with a first conversion matrix and a second conversion matrix as independent variables according to the sample query matrix, the sample text feature matrix and the sample code feature matrix;
and solving the minimum value of the objective function to obtain a corresponding solution when the objective function is the minimum value.
5. The method of claim 4, wherein the objective function is:
wherein the coefficients in the objective function are regularization parameters, U is the first transformation matrix, V is the second transformation matrix, ε1(U, V) is used for representing the distance between the converted text feature vector and code feature vector of the same code file or of similar code files, ε2(U, V) is used for representing the difference between the distance between the text feature vectors and code feature vectors of converted similar code files and the distance between the text feature vectors and code feature vectors of dissimilar code files, g(U, V) is used for representing the sum of the distance between text feature vectors, the distance between code feature vectors, the distance between text feature vectors and code feature vectors, the distance between the history query word vectors and the text feature vector of each code file, the distance between the history query word vectors and the code feature vector of each code file, and the distance between similar history query word vectors, c(U, V) is used for representing the distance between any converted character segment and the code feature vector of the code file containing the character segment, and r(U, V) is used for controlling the complexity of the first conversion matrix U and the second conversion matrix V.
6. The method of claim 5,
wherein m is the number of code files, the ith code file and the jth code file are similar code files, the ith code file and the lth code file are dissimilar code files, n is the number of history query words applied in the history query process, X is the sample text feature matrix, Y is the sample code feature matrix, Q is the sample query matrix, W and R are similarity matrices between the sample text feature matrix and the sample code feature matrix, and L is the normalized Laplacian matrix of the matrix W.
7. An apparatus for querying a code file, the apparatus comprising:
the conversion module is used for calculating the product of a transpose matrix of a first conversion matrix included in a conversion matrix and a current query word vector to obtain a converted query word vector, wherein the conversion matrix includes the first conversion matrix and a second conversion matrix; the conversion matrix is used for converting the query word vector and the code feature vector, the first conversion matrix is used for converting the query word vector, and the second conversion matrix is used for converting the code feature vector of each code file, so that the query word vector and the code feature vector are mapped to the same semantic space; forming a code feature matrix by using the code feature vector of each code file, wherein the code feature vector of each code file is determined according to a code block in the code file; calculating the product of the transpose matrix of the code characteristic matrix and the second conversion matrix to obtain a converted code characteristic matrix; extracting each vector in the transpose matrix of the converted code characteristic matrix as a converted code characteristic vector of a corresponding code file;
the similarity calculation module is used for calculating the similarity between each code file and the query word according to the current query word vector, the converted code feature vector of each code file and the text feature vector of each code file;
and the result output module is used for obtaining the code file query result according to the similarity between each code file and the query word.
8. The apparatus of claim 7, wherein the similarity calculation module comprises:
the first calculation unit is used for calculating a first similarity between the text feature vector of each code file and the query word vector;
the second calculation unit is used for calculating a second similarity between the converted code feature vector of each code file and the converted query word vector;
and the weighted summation unit is used for carrying out weighted summation on the first similarity and the second similarity of each code file to obtain the similarity between each code file and the query word.
9. The apparatus of claim 7, further comprising:
the text characteristic vector acquisition module is used for acquiring a text characteristic vector of each code file according to the natural language description, the annotation and the variable name of the code file;
the occurrence frequency judging module is used for judging whether the code file comprises a code block of which the occurrence frequency is greater than a preset threshold value;
the code block extraction module is used for extracting the code blocks of which the occurrence times are greater than the preset threshold when the code blocks of which the occurrence times are greater than the preset threshold are included in the code file;
and the code characteristic vector composition module is used for composing the extracted occurrence times of different code blocks into the code characteristic vector of the code file.
10. The apparatus of claim 7, further comprising:
the first matrix composition module is used for obtaining each historical query word vector applied in the historical query process to form a sample query matrix;
the second matrix composition module is used for composing the text characteristic vector of each code file in the code base into a sample text characteristic matrix;
the third matrix composition module is used for composing the code characteristic vector of each code file in the code base into a sample code characteristic matrix;
the target function determining module is used for determining a target function with a first conversion matrix and a second conversion matrix as independent variables according to the sample query matrix, the sample text characteristic matrix and the sample code characteristic matrix;
and the objective function solving module is used for solving the minimum value of the objective function and obtaining a corresponding solution when the objective function is the minimum value.
11. The apparatus of claim 10, wherein the objective function is:
wherein the coefficients in the objective function are regularization parameters, U is the first transformation matrix, V is the second transformation matrix, ε1(U, V) is used for representing the distance between the converted text feature vector and code feature vector of the same code file or of similar code files, ε2(U, V) is used for representing the difference between the distance between the text feature vectors and code feature vectors of similar code files and the distance between the text feature vectors and code feature vectors of dissimilar code files, g(U, V) is used for representing the sum of the distance between text feature vectors, the distance between code feature vectors, the distance between text feature vectors and code feature vectors, the distance between the history query word vectors and the text feature vector of each code file, the distance between the history query word vectors and the code feature vector of each code file, and the distance between similar history query word vectors, c(U, V) is used for representing the distance between any converted character segment and the code feature vector of the code file containing the character segment, and r(U, V) is used for controlling the complexity of the first conversion matrix U and the second conversion matrix V.
12. The apparatus of claim 11,
wherein m is the number of code files, the ith code file and the jth code file are similar code files, the ith code file and the lth code file are dissimilar code files, n is the number of history query words applied in the history query process, X is the sample text feature matrix, Y is the sample code feature matrix, Q is the sample query matrix, W and R are similarity matrices between the sample text feature matrix and the sample code feature matrix, and L is the normalized Laplacian matrix of the matrix W.
CN201410042833.4A 2014-01-29 2014-01-29 Code file querying method and device Active CN104809139B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410042833.4A CN104809139B (en) 2014-01-29 2014-01-29 Code file querying method and device

Publications (2)

Publication Number Publication Date
CN104809139A (en) 2015-07-29
CN104809139B (en) 2019-03-19

Family

ID=53693964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410042833.4A Active CN104809139B (en) 2014-01-29 2014-01-29 Code file querying method and device

Country Status (1)

Country Link
CN (1) CN104809139B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant