
CN111159343A - Text similarity searching method, device, equipment and medium based on text embedding - Google Patents


Info

Publication number
CN111159343A
CN111159343A (application CN201911370136.0A)
Authority
CN
China
Prior art keywords
query
vector
text
word
offline model
Prior art date
Legal status (assumed, not a legal conclusion): Pending
Application number
CN201911370136.0A
Other languages
Chinese (zh)
Inventor
朱悦
张嘉锐
周喆
刘晋元
赵燕
吴洁
孙虎
李敏
袁晓夏
崔丽春
Current Assignee (listing may be inaccurate): Shanghai Science And Technology Development Co Ltd
Original Assignee: Shanghai Science And Technology Development Co Ltd
Application filed by Shanghai Science And Technology Development Co Ltd
Priority to CN201911370136.0A
Publication of CN111159343A

Classifications

    • G: Physics
    • G06: Computing or calculating; counting
    • G06F: Electric digital data processing
    • G06F16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3347: Query execution using vector based model
    • G06F16/3344: Query execution using natural language analysis


Abstract

According to the text similarity search method, apparatus, device, and medium based on text embedding provided herein, a query string is obtained; the query string is converted into a query vector by an offline model trained on text data; a search request is sent to a preset search engine with the query vector as a parameter, and the query vector is matched for relevance against the word vectors trained by the offline model; the query result is then parsed and the relevant documents are output. The text embedding technology of natural language processing is applied to a distributed search engine: by taking semantics into account, the method captures the real intent behind the sentence a user inputs and searches accordingly, so that the results best matching the user's need are returned more accurately than with traditional word-level similarity matching. The query is further accelerated by the distributed nature of Elasticsearch and its support for high-dimensional vectors.

Description

Text similarity searching method, device, equipment and medium based on text embedding
Technical Field
The present application relates to the field of text embedding search technologies, and in particular, to a text similarity search method, apparatus, device, and medium based on text embedding.
Background
In traditional information retrieval, most text search algorithms are based on word segmentation: term weights are computed with TF-IDF, the text is converted into a vector, and text similarity is computed with a cosine-similarity vector space model. This approach considers only the words themselves, not the inherent meaning of the text.
When the query word or sentence differs from the target at the character level (for example, certain domain-specific terms that mix Chinese and English), a traditional algorithm finds no similarity between the sentences, even though their semantics are essentially identical.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, it is an object of the present application to provide a text similarity search method, apparatus, device and medium based on text embedding to solve the problems in the prior art.
To achieve the above and other related objects, the present application provides a text similarity search method based on text embedding, the method comprising: acquiring a query character string; converting the query string into a query vector through an offline model trained based on text data; sending a search request to a preset search engine by taking the query vector as a parameter, and performing correlation matching on the query vector and a word vector trained by the offline model; and analyzing the query result and outputting a related document.
In an embodiment of the present application, the method for constructing the offline model includes: acquiring text data for training; and training a Doc2Vec model as the offline model, performing text embedding on the text data with natural language processing techniques so as to convert the segmented words or fields into word vectors, with the dimensionality of the vectors specified during training.
In an embodiment of the application, performing text embedding on the text data with natural language processing techniques includes: processing the text data in a one-line, one-paragraph data format; reading the text data line by line, segmenting each line into Chinese words, and filtering stop words against a Chinese stop-word list; and storing the word segmentation result.
In an embodiment of the present application, training the Doc2Vec model as the offline model includes: carrying out unsupervised learning on the text data with Python's Gensim toolkit to obtain the topic vector of the text's hidden layer; and, after the Doc2Vec model is trained, loading it offline to form the offline model so as to provide an external conversion service from input character strings to output word vectors.
In an embodiment of the present application, the preset method of the search engine includes: presetting the mapping structure of the search engine, setting the data type to vector for each field requiring text embedding, and keeping the vector dimension consistent with that of the offline model.
In an embodiment of the present application, the preset method of the search engine further includes: calculating a word vector for each field to be searched through the offline model; and indexing the text data and the word vectors to the search engine.
In an embodiment of the present application, parsing the query result and outputting the relevant documents includes: calculating the vector angle between the query vector and the word vectors trained by the offline model; obtaining the matching relevance from the vector angle; and sorting the parsed query results by relevance to output the retrieved relevant documents.
To achieve the above and other related objects, the present application provides a text similarity search apparatus based on text embedding, the apparatus comprising: an acquisition module, configured to acquire a query string; and a processing module, configured to convert the query string into a query vector through an offline model trained on text data, send a search request to a preset search engine with the query vector as a parameter, match the query vector for relevance against the word vectors trained by the offline model, parse the query result, and output the retrieved relevant documents sorted by relevance.
To achieve the above and other related objects, the present application provides a computer system comprising: a memory and a processor; the memory is configured to store computer instructions; the processor executes the computer instructions to implement the method as described above.
To achieve the above and other related objects, the present application provides a computer storage medium storing a computer program which, when executed, performs the method as described above.
In summary, the present application provides a text similarity search method, apparatus, device and medium based on text embedding, by obtaining a query string; converting the query string into a query vector through an offline model trained based on text data; sending a search request to a preset search engine by taking the query vector as a parameter, and performing correlation matching on the query vector and a word vector trained by the offline model; and analyzing the query result and outputting a related document.
Has the following beneficial effects:
the text embedding technology of natural language processing is applied to a distributed search engine, and semantics is taken into account to capture the real intent behind the sentence a user inputs and to search accordingly, so that the results best matching the user's need are returned more accurately than with the traditional word-level similarity matching method; the query is accelerated by the distributed nature of Elasticsearch and its support for high-dimensional vectors.
Drawings
Fig. 1 is a flowchart illustrating a text similarity search method based on text embedding according to an embodiment of the present application.
Fig. 2 is a flowchart illustrating an off-line model building method according to an embodiment of the present disclosure.
Fig. 3 is a schematic structural diagram of a text similarity search apparatus based on text embedding according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application is provided by way of specific examples, and other advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure herein. The present application is capable of other and different embodiments and its several details are capable of modifications and/or changes in various respects, all without departing from the spirit of the present application. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings so that those skilled in the art to which the present application pertains can easily carry out the present application. The present application may be embodied in many different forms and is not limited to the embodiments described herein.
In order to clearly explain the present application, components that are not related to the description are omitted, and the same reference numerals are given to the same or similar components throughout the specification.
Throughout the specification, when a component is referred to as being "connected" to another component, this includes not only the case of being "directly connected" but also the case of being "indirectly connected" with another element interposed therebetween. In addition, when a component is referred to as "including" a certain constituent element, unless otherwise stated, it means that the component may include other constituent elements, without excluding other constituent elements.
When an element is referred to as being "on" another element, it can be directly on the other element, or intervening elements may also be present. When a component is referred to as being "directly on" another component, there are no intervening components present.
Although the terms first, second, etc. may be used herein to describe various elements in some instances, these elements should not be limited by these terms. These terms are only used to distinguish one element from another; for example, a first interface and a second interface are so described. Also, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used in this specification, specify the presence of stated features, steps, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C". An exception to this definition occurs only when a combination of elements, functions, steps, or operations is inherently mutually exclusive in some way.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the singular forms "a", "an" and "the" include plural forms as long as the words do not expressly indicate a contrary meaning. The term "comprises/comprising" when used in this specification is taken to specify the presence of stated features, regions, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of other features, regions, integers, steps, operations, elements, and/or components.
Terms indicating "lower", "upper", and the like relative to space may be used to more easily describe a relationship of one component with respect to another component illustrated in the drawings. Such terms are intended to include not only the meanings indicated in the drawings, but also other meanings or operations of the device in use. For example, if the device in the figures is turned over, elements described as "below" other elements would then be oriented "above" the other elements. Thus, the exemplary terms "under" and "beneath" all include above and below. The device may be rotated 90 or other angles and the terminology representing relative space is also to be interpreted accordingly.
In view of the above, in order to solve the problems in the prior art, the present application provides a text similarity search method, apparatus, device, and medium based on text embedding: a text-embedding-based similarity calculation method is combined with the distributed search engine Elasticsearch to implement a similarity search algorithm.
Fig. 1 is a flowchart illustrating a text similarity search method based on text embedding according to an embodiment of the present application. As shown, the method comprises:
step S101: and acquiring a query character string.
In this embodiment, the method is suited to big-data search scenarios. Here, the query string is the input query word or phrase. For example, the search may target research data such as paper data, book data, and patent data.
In this embodiment, the query string may be sent to the server side by the user through a terminal or a browser URL; it is commonly referred to as the Query and represents the user's query intent.
Step S102: the query string is converted to a query vector by an offline model trained based on text data.
As shown in fig. 2, the method is a schematic flow chart of an offline model building method in an embodiment of the present application, and as shown in the figure, the method includes:
step S201: text data for training is acquired.
In this embodiment, text data for training a model is prepared, and scientific data is taken as an example, and may be paper data, book data, patent data, and the like.
Step S202: and training the Doc2Vec model as an offline model, performing text embedding processing on the text data by using a natural language processing technology so as to convert the segmented words or fields into word vectors, and specifying the dimensionality of the vectors during training.
In this embodiment, the method mainly uses natural language processing techniques to train the text embedding: the Doc2Vec model trains the prepared text data into an offline model, and the dimension of the generated vectors is specified when the model is trained.
Wherein performing text embedding on the text data with natural language processing techniques includes:
A. processing the text data in a one-line, one-paragraph data format;
B. reading the text data line by line, segmenting each line into Chinese words, and filtering stop words against a Chinese stop-word list;
C. storing the word segmentation result.
In brief: the data format is one paragraph per line; the text data is read line by line, each line is segmented into Chinese words and filtered against the stop-word list, and the segmentation result is stored in a new text file.
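The preprocessing described above can be sketched as follows. This is a minimal, hedged illustration: the patent does not name a particular segmenter, so `segment` below is a whitespace-split placeholder (in practice a Chinese tokenizer such as jieba's `lcut` would be substituted), and the corpus and stop-word list are invented for demonstration.

```python
# Hedged sketch of the preprocessing step: read a corpus where each line is
# one paragraph, segment it, drop stop words, and collect the result.
from typing import Callable, Iterable, List, Set

def segment(line: str) -> List[str]:
    # Placeholder tokenizer: whitespace split. A real Chinese segmenter
    # (e.g. jieba.lcut) would replace this in production.
    return line.split()

def preprocess_corpus(lines: Iterable[str], stopwords: Set[str],
                      tokenize: Callable[[str], List[str]] = segment) -> List[List[str]]:
    """One input line = one paragraph; returns filtered token lists."""
    result = []
    for line in lines:
        tokens = [t for t in tokenize(line.strip()) if t and t not in stopwords]
        result.append(tokens)
    return result

corpus = ["the quick brown fox", "a deep learning survey"]
stopwords = {"the", "a", "of"}
print(preprocess_corpus(corpus, stopwords))
```

The filtered token lists would then be written to a new text file, one segmented paragraph per line, as the patent describes.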
Generally, word embedding is a learned representation of text, which can be understood as the form in which a word is stored inside an algorithm. Word embedding digitizes text so that algorithms can be fitted to it. Representing words or documents numerically in this way is considered one of the key challenges of deep learning in natural language processing tasks.
Text embedding techniques, also from the field of natural language processing, encode words and sentences as numerical vectors designed to capture linguistic content; these vectors can be used to evaluate similarity between queries and documents. For example, two phrasings such as "research on deep learning" and "studying deep learning" may share few surface words, so a traditional algorithm finds little similarity between them, yet their semantics are nearly identical, and text embedding retrieves them well. One limitation of traditional keyword-based search is that nothing is returned when the keywords do not match; with word vectors, the best-matching answer can be obtained directly from a similar natural-language description.
Doc2vec, also called Paragraph Vector, was proposed by Tomas Mikolov on the basis of the word2vec model. It has several advantages, such as accepting sentences of different lengths as training samples without fixing sentence length. Doc2vec is an unsupervised learning algorithm that predicts a vector to represent each document, and the model's structure potentially overcomes the defects of the bag-of-words model.
The Doc2vec model is inspired by word2vec: the word vectors word2vec predicts carry word sense, for example the vector for "powerful" lies closer to "strong" than to "Paris", and Doc2vec is built on the same structure. Doc2vec thus overcomes the bag-of-words model's lack of semantics. Each sentence in the corpus serves as one training sample.
In addition, training the Doc2Vec model as the offline model includes:
A. carrying out unsupervised learning on the text data with Python's Gensim toolkit to obtain the topic vector of the text's hidden layer;
B. after training, loading the Doc2Vec model offline to form the offline model, which provides a conversion service from input character strings to output word vectors.
In brief, the Doc2Vec model performs unsupervised learning on the text data using Python's Gensim toolkit to obtain the topic vector of the text's hidden layer. Once the model is generated, it can be loaded offline and provide a conversion service to the outside: character strings in, word vectors out.
Step S103: and sending a search request to a preset search engine by taking the query vector as a parameter, and performing correlation matching on the query vector and the word vector trained by the offline model.
In the present application, the search engine is preferably an Elasticsearch.
Elasticsearch is a distributed, highly scalable, near-real-time search and data analysis engine. It conveniently gives large volumes of data the capability to be searched, analyzed, and explored, and making full use of its horizontal scalability lets the data become more valuable in a production environment. Its working principle can be divided into the following steps: first, the user submits data to the Elasticsearch database; a tokenizer then segments the corresponding sentences, and the weights and segmentation results are stored with the data; when the user searches, the results are ranked and scored according to the weights, and the returned results are presented to the user.
In an embodiment of the present application, the preset method of the search engine includes: presetting the mapping structure of the search engine, setting the data type to vector for each field requiring text embedding, and keeping the vector dimension consistent with that of the offline model.
In this embodiment, after the off-line model training is completed, the mapping design of the search engine is performed. The mapping is equivalent to a table structure in a relational database, and the data needs to be designed in advance before being written.
For example, for a field in which text embedding is required, that is, for the original field and the corresponding word-vector field, the parameter type is set to the vector type dense_vector.
For example, the mapping can set the type to dense_vector and the vector dimension to 300.
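A mapping of this shape can be sketched as follows. The index and field names are illustrative assumptions; `dense_vector` with a `dims` parameter is Elasticsearch's field type for vectors, and `dims` must match the offline model's vector dimension.

```python
# Hedged sketch of an Elasticsearch mapping for text-embedding search:
# the original text field plus a companion dense_vector field whose `dims`
# matches the offline model's vector dimension (300 here).
import json

mapping = {
    "mappings": {
        "properties": {
            "title":        {"type": "text"},
            "title_vector": {"type": "dense_vector", "dims": 300},
        }
    }
}
print(json.dumps(mapping, indent=2))
```

This body would be supplied when creating the index, before any documents are written, since the mapping must be designed in advance as the text notes.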
In an embodiment of the present application, the preset method of the search engine further includes:
A. calculating a word vector for each field to be searched through the offline model;
B. indexing the text data and the word vectors to the search engine.
In this embodiment, after the mapping design is completed, the word vectors of the fields to be retrieved are calculated by the offline model, and the original text and the word vectors are then indexed into Elasticsearch. In this way, the word vectors computed offline can later be looked up quickly from the search terms a user provides, and the word vectors of closely related words or sentences can be located in the original text.
In the present application, the purpose of calculating the word vectors of searchable fields in advance with the offline model is as follows: the fields to be searched are confined to a domain, for example medicine or physics, and word vectors are precomputed offline for the fields likely to be searched in that domain. On one hand, this bounds the mass of text data and so accelerates computation and search; on the other hand, precomputing vectors for the fields likely to be searched means the corresponding word vectors can be found quickly once the query string arrives, further speeding up computation and search.
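The offline precomputation and indexing step can be sketched as follows. The `embed` function is a hypothetical stand-in for the trained model's inference call (e.g. `model.infer_vector`), and the documents are invented; in a real pipeline each payload would be sent to Elasticsearch via a bulk index request.

```python
# Hedged sketch of the indexing step: precompute a vector for each searchable
# field with the offline model, then pair the original text with its vector
# so both can be indexed together.
def embed(tokens):
    # Stand-in for the trained model's inference call; returns a
    # fixed-size 300-dimensional vector for illustration only.
    return [float(len(t)) for t in tokens] + [0.0] * (300 - len(tokens))

docs = [{"title": "text similarity search"},
        {"title": "deep learning survey"}]

indexed = []
for doc in docs:
    tokens = doc["title"].split()
    indexed.append({"title": doc["title"], "title_vector": embed(tokens)})

print(len(indexed[0]["title_vector"]))  # 300
```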
Step S104: and analyzing the query result and outputting a related document.
In an embodiment of the present application, the step S104 specifically includes:
A. calculating a vector angle according to the query vector and the word vector trained by the offline model;
B. obtaining the matched correlation according to the vector angle;
C. and sorting the analyzed query results according to the relevance to output the searched related documents.
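Steps A through C amount to computing the cosine of the angle between the query vector and each stored vector and sorting by that relevance score. A minimal sketch with toy two-dimensional vectors (real vectors would have the trained dimensionality):

```python
# Minimal sketch of steps A-C: cosine similarity as the "vector angle"
# relevance measure, then sorting candidate documents by that score.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

query = [1.0, 0.0]
docs = {"doc_a": [0.9, 0.1], "doc_b": [0.0, 1.0]}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # doc_a ranks first: its angle to the query vector is smaller
```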
In this embodiment, through the foregoing steps the query string is used as input to the offline model, which outputs the corresponding word vector; the word vector corresponding to the user's query is then sent to Elasticsearch for querying. Elasticsearch computes the similarity between the query vector and the stored vector fields and returns the N most similar documents ranked by relevance.
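On the Elasticsearch side, this kind of vector-relevance ranking can be expressed with a `script_score` query calling the built-in `cosineSimilarity` function (available since Elasticsearch 7.3). The request body below is a hedged sketch: the field name, query vector, and result size are illustrative, and the `+ 1.0` keeps scores non-negative as Elasticsearch requires.

```python
# Hedged sketch of the query step: a script_score request body ranking
# documents by cosine similarity between the query vector and the stored
# dense_vector field.
query_vector = [0.1] * 300  # in practice, produced by the offline model

search_body = {
    "size": 10,  # return the top-N most similar documents
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'title_vector') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    },
}
print(search_body["size"])
```

A client would POST this body to the index's `_search` endpoint and read the ranked hits from the response.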
In summary, the application provides a text similarity calculation method based on text embedding, combined with the distributed search engine Elasticsearch to implement a similarity search algorithm. Its advantages are: 1) the text embedding technology of natural language processing is applied to a distributed search engine, and semantics is taken into account to capture the real intent behind the sentence a user inputs and to search accordingly, so that the results best matching the user's need are returned more accurately than with the traditional word-level similarity matching method; 2) the query is accelerated by the distributed nature of Elasticsearch and its support for high-dimensional vectors.
Fig. 3 is a block diagram of a text similarity search apparatus based on text embedding according to an embodiment of the present application. As shown, the apparatus 300 includes:
an obtaining module 301, configured to obtain a query string;
a processing module 302, configured to convert the query string into a query vector through an offline model trained based on text data; sending a search request to a preset search engine by taking the query vector as a parameter, and performing correlation matching on the query vector and a word vector trained by the offline model; and analyzing the query result and outputting a related document.
It should be noted that, because the contents of information interaction, execution process, and the like between the modules/units of the apparatus are based on the same concept as the method embodiments described in the present application, the technical effect brought by the contents is the same as the method embodiments of the present application, and specific contents can be referred to the descriptions in the method embodiments described in the foregoing description of the present application.
It should be further noted that the division of the modules of the above apparatus is only a logical division, and the actual implementation may be wholly or partially integrated into one physical entity, or may be physically separated. And these units can be implemented entirely in software, invoked by a processing element; or may be implemented entirely in hardware; and part of the modules can be realized in the form of calling software by the processing element, and part of the modules can be realized in the form of hardware.
For example, the processing module 302 may be a separate processing element, or may be integrated into a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code, and a processing element of the apparatus calls and executes the functions of the processing module 302. Other modules are implemented similarly. In addition, all or part of the modules can be integrated together or can be independently realized. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more application specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field programmable gate arrays (FPGAs). For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown, the computer device 400 includes: a memory 401, and a processor 402; the memory 401 is used for storing computer instructions; the processor 402 executes computer instructions to implement the method described in fig. 1.
In some embodiments, the number of the memories 401 in the computer device 400 may be one or more, the number of the processors 402 may be one or more, and fig. 4 is taken as an example.
In an embodiment of the present application, the processor 402 in the computer device 400 loads one or more instructions corresponding to the processes of an application program into the memory 401 according to the steps described in fig. 1, and the processor 402 executes the application program stored in the memory 401, thereby implementing the method described in fig. 1.
The Memory 401 may include a Random Access Memory (RAM), and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The memory 401 stores an operating system and operating instructions, executable modules or data structures, or a subset thereof, or an expanded set thereof, wherein the operating instructions may include various operating instructions for implementing various operations. The operating system may include various system programs for implementing various basic services and for handling hardware-based tasks.
The Processor 402 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a discrete gate or transistor logic device, a discrete hardware component, etc.
In some specific applications, the various components of the computer device 400 are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. But for clarity of explanation the various busses are shown in fig. 4 as a bus system.
In an embodiment of the present application, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the text similarity search method based on text embedding as described in fig. 1.
The computer readable storage medium is preferably a non-volatile computer storage medium.
Those of ordinary skill in the art will understand that all or part of the functions of the system and of each unit in the above embodiments may be implemented by hardware controlled by a computer program. The aforementioned computer program may be stored in a computer readable storage medium; when executed, the program performs the embodiments including the functions of the system and the units. The aforementioned storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
It should be noted that, for the system, the computer device, and the like in the above embodiments, all the related computer programs may be loaded on a computer readable storage medium, which may be a tangible device capable of holding and storing the instructions used by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, or a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., light pulses passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
In summary, the text similarity search method, device, equipment and medium based on text embedding provided by the present application acquire a query string; convert the query string into a query vector through an offline model trained on text data; send a search request to a preset search engine with the query vector as a parameter, performing relevance matching between the query vector and the word vectors trained by the offline model; and parse the query result and output the related documents.
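The flow summarized above — embed the query, match it against pre-computed document vectors, and rank by relevance — can be illustrated with a minimal Python sketch. The bag-of-words embedding, toy corpus, and cosine scoring below are illustrative stand-ins only: the application itself uses a trained Doc2Vec offline model and a preset search engine, neither of which is reproduced here.

```python
import math
from collections import Counter

def embed(text, vocab):
    """Map a string to a fixed-dimension bag-of-words vector.
    (Stands in for the trained Doc2Vec offline model.)"""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    """Relevance score derived from the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query, docs):
    """Embed the query, match it against pre-embedded documents,
    and return the documents sorted by descending relevance."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    query_vector = embed(query, vocab)
    scored = [(cosine(query_vector, embed(d, vocab)), d) for d in docs]
    return [d for score, d in sorted(scored, key=lambda p: -p[0])]

docs = ["text embedding converts words to vectors",
        "search engines index documents",
        "word vectors capture text similarity"]
print(search("text vectors", docs)[0])  # → word vectors capture text similarity
```

Here cosine similarity plays the role of the "vector angle" relevance measure used when parsing and sorting the query results.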
The application effectively overcomes various defects in the prior art and has high industrial utilization value.
The above embodiments are merely illustrative of the principles and utilities of the present application and are not intended to limit the application. Any person skilled in the art can modify or change the above-described embodiments without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical concepts disclosed in the present application shall be covered by the claims of the present application.

Claims (10)

1. A text similarity search method based on text embedding is characterized by comprising the following steps:
acquiring a query character string;
converting the query string into a query vector through an offline model trained based on text data;
sending a search request to a preset search engine by taking the query vector as a parameter, and performing correlation matching on the query vector and a word vector trained by the offline model;
and analyzing the query result and outputting a related document.
2. The method of claim 1, wherein the offline model is constructed by:
acquiring text data for training;
and training a Doc2Vec model as the offline model, performing text embedding on the text data by using natural language processing technology so as to convert segmented words or fields into word vectors, wherein the dimensionality of the vectors is specified during training.
3. The method according to claim 2, wherein the text embedding the text data by using natural language processing technology comprises:
processing the text data according to a data format of one line and one paragraph;
reading the text data line by line, performing Chinese word segmentation on each line, and filtering stop words according to a Chinese stop word list;
and storing the word segmentation result.
4. The method of claim 2, wherein training the Doc2Vec model as the offline model comprises:
performing unsupervised learning on the text data by using the Gensim toolkit for Python to obtain hidden-layer word vectors of the text body;
and after the Doc2Vec model is trained, loading it offline to form the offline model so as to provide an external service that converts an input string into an output word vector.
5. The method of claim 2, wherein presetting the search engine comprises:
presetting a mapping structure of the search engine, setting the data type of each field requiring text embedding as a vector, and keeping its dimension consistent with the vector dimension of the offline model.
6. The method of claim 2, wherein presetting the search engine further comprises:
calculating a word vector of a field to be searched through the offline model;
and indexing the text data and the word vectors into the search engine.
7. The method of claim 1, wherein parsing the query results and outputting relevant documents comprises:
calculating a vector angle according to the query vector and the word vector trained by the offline model;
obtaining the matched correlation according to the vector angle;
and sorting the analyzed query results according to the relevance to output the searched related documents.
8. An apparatus for searching text similarity based on text embedding, the apparatus comprising:
the acquisition module is used for acquiring the query character string;
the processing module is used for converting the query string into a query vector through an offline model trained on text data; sending a search request to a preset search engine with the query vector as a parameter, and performing relevance matching between the query vector and the word vectors trained by the offline model; and parsing the query result, sorting the searched related documents according to relevance, and outputting them.
9. A computer device, comprising: a memory and a processor; the memory is configured to store computer instructions; and the processor executes the computer instructions to implement the method of any one of claims 1 to 7.
10. A computer storage medium, characterized in that a computer program is stored thereon which, when executed, performs the method of any one of claims 1 to 7.
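The mapping preset recited in claim 5 — a field typed as a vector whose dimension matches the offline model — corresponds, in a search engine such as Elasticsearch (the approach taken in the text-embeddings repository cited among the non-patent references), to declaring a dense-vector field. The sketch below is an assumed illustration only: the index layout, the field names, and the dimension of 100 are not fixed by the application.

```python
import json

# Hypothetical index mapping in the style of Elasticsearch's "dense_vector"
# field type. Field names and the dimension are illustrative assumptions;
# the claim only requires that the vector dimension match the offline model.
VECTOR_DIMS = 100  # must equal the dimensionality specified when training Doc2Vec

mapping = {
    "mappings": {
        "properties": {
            "body": {"type": "text"},  # raw text field
            "body_vector": {"type": "dense_vector", "dims": VECTOR_DIMS},  # its embedding
        }
    }
}

print(json.dumps(mapping, indent=2))
```

At index time, each document's text and its word vector (computed by the offline model, per claim 6) would both be written under this mapping.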
CN201911370136.0A 2019-12-26 2019-12-26 Text similarity searching method, device, equipment and medium based on text embedding Pending CN111159343A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911370136.0A CN111159343A (en) 2019-12-26 2019-12-26 Text similarity searching method, device, equipment and medium based on text embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911370136.0A CN111159343A (en) 2019-12-26 2019-12-26 Text similarity searching method, device, equipment and medium based on text embedding

Publications (1)

Publication Number Publication Date
CN111159343A true CN111159343A (en) 2020-05-15

Family

ID=70558361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911370136.0A Pending CN111159343A (en) 2019-12-26 2019-12-26 Text similarity searching method, device, equipment and medium based on text embedding

Country Status (1)

Country Link
CN (1) CN111159343A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832306A (en) * 2017-11-28 2018-03-23 武汉大学 A kind of similar entities method for digging based on Doc2vec
CN108170684A (en) * 2018-01-22 2018-06-15 京东方科技集团股份有限公司 Text similarity computing method and system, data query system and computer product
US20190260694A1 (en) * 2018-02-16 2019-08-22 Mz Ip Holdings, Llc System and method for chat community question answering
US20190303395A1 (en) * 2018-03-30 2019-10-03 State Street Corporation Techniques to determine portfolio relevant articles
CN110263127A (en) * 2019-06-21 2019-09-20 北京创鑫旅程网络技术有限公司 Text search method and device is carried out based on user query word

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JULIE TIBSHIRANI: "text-embeddings", https://github.com/jtibshirani/text-embeddings/tree/4b91d928455213d0d156ed80605382594703367e *
岳文应 (YUE, Wenying): "Chat content filtering based on Doc2Vec and SVM", 《计算机系统应用》 (Computer Systems & Applications) *
李晓军 (LI, Xiaojun): "Chinese text classification based on semantic similarity", 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Masters' Theses Full-text Database, Information Science and Technology) *
贺益侗 (HE, Yitong): "Similar text recognition based on doc2vec and TF-IDF", 《电子制作》 (Practical Electronics) *
邹瑛 (ZOU, Ying): 《网络信息安全及管理研究》 (Research on Network Information Security and Management), 31 March 2018, 北京理工大学出版社 (Beijing Institute of Technology Press) *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111613311A (en) * 2020-06-09 2020-09-01 广东珠江智联信息科技股份有限公司 Intelligent AI (Artificial intelligence) diagnosis guide realization technology
CN111950840A (en) * 2020-06-19 2020-11-17 国网山东省电力公司 A method and system for intelligent operation and maintenance knowledge retrieval of metrological verification device
CN112445904A (en) * 2020-12-15 2021-03-05 税友软件集团股份有限公司 Knowledge retrieval method, knowledge retrieval device, knowledge retrieval equipment and computer readable storage medium
CN113596601A (en) * 2021-01-19 2021-11-02 腾讯科技(深圳)有限公司 Video picture positioning method, related device, equipment and storage medium
CN112784046A (en) * 2021-01-20 2021-05-11 北京百度网讯科技有限公司 Text clustering method, device and equipment and storage medium
CN112784046B (en) * 2021-01-20 2024-05-28 北京百度网讯科技有限公司 Text clustering method, device, equipment and storage medium
CN113010771A (en) * 2021-02-19 2021-06-22 腾讯科技(深圳)有限公司 Training method and device for personalized semantic vector model in search engine
CN113010771B (en) * 2021-02-19 2023-08-22 腾讯科技(深圳)有限公司 Training method and device for personalized semantic vector model in search engine
CN112966007A (en) * 2021-04-02 2021-06-15 新华智云科技有限公司 Search terminal control method and search terminal
CN112966007B (en) * 2021-04-02 2022-06-17 新华智云科技有限公司 Search terminal control method and search terminal
CN113254586A (en) * 2021-05-31 2021-08-13 中国科学院深圳先进技术研究院 Unsupervised text retrieval method based on deep learning
CN115905879A (en) * 2021-09-30 2023-04-04 电子湾有限公司 Artificial intelligence-based similarity in e-commerce marketplaces
CN114003798A (en) * 2021-10-29 2022-02-01 平安国际智慧城市科技股份有限公司 Data updating method, apparatus, device and storage medium for search engine
CN114003798B (en) * 2021-10-29 2025-04-08 平安国际智慧城市科技股份有限公司 Data updating method, device and equipment of search engine and storage medium
CN114548100A (en) * 2022-03-01 2022-05-27 深圳市医未医疗科技有限公司 Clinical scientific research auxiliary method and system based on big data technology
CN115618042A (en) * 2022-10-12 2023-01-17 广州广电运通信息科技有限公司 Retrieval method, equipment and storage medium for establishing image information index library based on ES
CN116340467A (en) * 2023-05-11 2023-06-27 腾讯科技(深圳)有限公司 Text processing method, device, electronic device, and computer-readable storage medium
CN116340467B (en) * 2023-05-11 2023-11-17 腾讯科技(深圳)有限公司 Text processing method, text processing device, electronic equipment and computer readable storage medium
CN117056446A (en) * 2023-08-11 2023-11-14 北京百度网讯科技有限公司 Trajectory data query method, device, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN111159343A (en) Text similarity searching method, device, equipment and medium based on text embedding
US12093648B2 (en) Systems and methods for producing a semantic representation of a document
CN117235226A (en) A question answering method and device based on large language model
JP2022115815A (en) Semantic code search based on augmented programming language corpus
WO2021017721A1 (en) Intelligent question answering method and apparatus, medium and electronic device
US20230119161A1 (en) Efficient Index Lookup Using Language-Agnostic Vectors and Context Vectors
US12541543B2 (en) Large language model-based information retrieval for large datasets
WO2021061233A1 (en) Inter-document attention mechanism
CN110727769B (en) Corpus generation method and device and man-machine interaction processing method and device
KR20160007040A (en) Method and system for searching by using natural language query
CN113505196A (en) Part-of-speech-based text retrieval method and device, electronic equipment and storage medium
WO2021225775A1 (en) Creating and interacting with data records having semantic vectors and natural language expressions produced by a machine-trained model
WO2018121198A1 (en) Topic based intelligent electronic file searching
CN114756733A (en) Similar document searching method and device, electronic equipment and storage medium
CN120045750A (en) Retrieval enhancement generation method and system based on large language model
CN119988563A (en) Intelligent question-answering method, device, electronic device, and storage medium based on multimodal information processing
CN110728135A (en) Text theme indexing method and device, electronic equipment and computer storage medium
CN117271712A (en) Retrieval method and system based on vector database and electronic equipment
CN106933824A (en) The method and apparatus that the collection of document similar to destination document is determined in multiple documents
CN114462378A (en) Method, system, computer equipment and storage medium for duplication checking of scientific and technological projects
US20250252320A1 (en) Improvement of ai predictions using context localization
CN118734870A (en) Text conversion method, device, electronic device and storage medium based on large model
CN114692610A (en) Keyword determination method and device
CN114880469B (en) Answer acquisition method, device, computer equipment and storage medium
KR101593214B1 (en) Method and system for searching by using natural language query

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200515