
CN111159343A - Text similarity searching method, device, equipment and medium based on text embedding - Google Patents


Info

Publication number
CN111159343A
CN111159343A (application CN201911370136.0A)
Authority
CN
China
Prior art keywords
query
vector
text
word
offline model
Prior art date
Legal status (assumed, not a legal conclusion): Pending
Application number
CN201911370136.0A
Other languages
Chinese (zh)
Inventor
朱悦
张嘉锐
周喆
刘晋元
赵燕
吴洁
孙虎
李敏
袁晓夏
崔丽春
Current Assignee (listing may be inaccurate): Shanghai Science And Technology Development Co Ltd
Original Assignee: Shanghai Science And Technology Development Co Ltd
Application filed by Shanghai Science And Technology Development Co Ltd
Priority to CN201911370136.0A
Publication of CN111159343A

Classifications

    • G: Physics
    • G06: Computing or calculating; counting
    • G06F: Electric digital data processing
    • G06F16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3347: Query execution using vector based model
    • G06F16/3344: Query execution using natural language analysis


Abstract

According to the text similarity search method, apparatus, device, and medium based on text embedding provided herein, a query string is obtained; the query string is converted into a query vector by an offline model trained on text data; a search request is sent to a preset search engine with the query vector as a parameter, and the query vector is matched for relevance against the word vectors trained by the offline model; the query result is then parsed and the relevant documents are output. The text embedding technology of natural language processing is applied to a distributed search engine: by taking semantics into account, the method captures the real intent behind the sentence a user inputs and searches accordingly, so that the results best matching the user's need are returned more accurately than with traditional word-level similarity matching. The query is further accelerated by the distributed nature of Elasticsearch and its support for high-dimensional vectors.

Description

Text similarity searching method, device, equipment and medium based on text embedding
Technical Field
The present application relates to the field of text embedding search technologies, and in particular, to a text similarity search method, apparatus, device, and medium based on text embedding.
Background
In traditional information retrieval, most text search algorithms are based on word segmentation: term weights are computed with TF-IDF, the text is converted into a vector, and text similarity is computed with a cosine-similarity vector space model. This approach considers only the words themselves, not the inherent meaning of the text.
When the query word or sentence differs from the target at the character level (for example, certain domain-specific terms that mix Chinese and English), a traditional algorithm finds no similarity between the sentences, even though their semantics are essentially identical.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, it is an object of the present application to provide a text similarity search method, apparatus, device and medium based on text embedding to solve the problems in the prior art.
To achieve the above and other related objects, the present application provides a text similarity search method based on text embedding, the method comprising: acquiring a query character string; converting the query string into a query vector through an offline model trained based on text data; sending a search request to a preset search engine by taking the query vector as a parameter, and performing correlation matching on the query vector and a word vector trained by the offline model; and analyzing the query result and outputting a related document.
In an embodiment of the present application, the method for constructing the offline model includes: acquiring text data for training; and training a Doc2Vec model as the offline model, performing text embedding on the text data with natural language processing techniques so as to convert the segmented words or fields into word vectors, with the dimensionality of the vectors specified during training.
In an embodiment of the application, performing text embedding on the text data with natural language processing techniques includes: processing the text data in a one-line, one-paragraph data format; reading the text data line by line, segmenting each line into Chinese words, and filtering stop words against a Chinese stop-word list; and storing the word segmentation result.
In an embodiment of the present application, training the Doc2Vec model as the offline model includes: carrying out unsupervised learning on the text data with Python's Gensim toolkit to obtain the topic vector of the text's hidden layer; and, after the Doc2Vec model is trained, loading it offline to form the offline model so as to provide an external conversion service from input character strings to output word vectors.
In an embodiment of the present application, the preset method of the search engine includes: presetting the mapping structure of the search engine, setting the data type to vector for each field requiring text embedding, and keeping the vector dimension consistent with that of the offline model.
In an embodiment of the present application, the preset method of the search engine further includes: calculating a word vector for each field to be searched through the offline model; and indexing the text data and the word vectors to the search engine.
In an embodiment of the present application, parsing the query result and outputting the relevant documents includes: calculating the vector angle between the query vector and the word vectors trained by the offline model; obtaining the matching relevance from the vector angle; and sorting the parsed query results by relevance to output the retrieved relevant documents.
To achieve the above and other related objects, the present application provides a text similarity search apparatus based on text embedding, the apparatus comprising: an acquisition module, configured to acquire a query string; and a processing module, configured to convert the query string into a query vector through an offline model trained on text data, send a search request to a preset search engine with the query vector as a parameter, match the query vector for relevance against the word vectors trained by the offline model, parse the query result, and output the retrieved relevant documents sorted by relevance.
To achieve the above and other related objects, the present application provides a computer system comprising: a memory and a processor; the memory is configured to store computer instructions; the processor executes the computer instructions to implement the method as described above.
To achieve the above and other related objects, the present application provides a computer storage medium storing a computer program which, when executed, performs the method as described above.
In summary, the present application provides a text similarity search method, apparatus, device and medium based on text embedding, by obtaining a query string; converting the query string into a query vector through an offline model trained based on text data; sending a search request to a preset search engine by taking the query vector as a parameter, and performing correlation matching on the query vector and a word vector trained by the offline model; and analyzing the query result and outputting a related document.
Has the following beneficial effects:
the text embedding technology of natural language processing is applied to a distributed search engine, and semantics is taken into account to capture the real intent behind the sentence a user inputs and to search accordingly, so that the results best matching the user's need are returned more accurately than with the traditional word-level similarity matching method; the query is accelerated by the distributed nature of Elasticsearch and its support for high-dimensional vectors.
Drawings
Fig. 1 is a flowchart illustrating a text similarity search method based on text embedding according to an embodiment of the present application.
Fig. 2 is a flowchart illustrating an off-line model building method according to an embodiment of the present disclosure.
Fig. 3 is a schematic structural diagram of a text similarity search apparatus based on text embedding according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application is provided by way of specific examples, and other advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure herein. The present application is capable of other and different embodiments and its several details are capable of modifications and/or changes in various respects, all without departing from the spirit of the present application. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings so that those skilled in the art to which the present application pertains can easily carry out the present application. The present application may be embodied in many different forms and is not limited to the embodiments described herein.
In order to clearly explain the present application, components that are not related to the description are omitted, and the same reference numerals are given to the same or similar components throughout the specification.
Throughout the specification, when a component is referred to as being "connected" to another component, this includes not only the case of being "directly connected" but also the case of being "indirectly connected" with another element interposed therebetween. In addition, when a component is referred to as "including" a certain constituent element, unless otherwise stated, it means that the component may include other constituent elements, without excluding other constituent elements.
When an element is referred to as being "on" another element, it can be directly on the other element, or intervening elements may also be present. When a component is referred to as being "directly on" another component, there are no intervening components present.
Although the terms first, second, etc. may be used herein to describe various elements in some instances, these elements should not be limited by these terms. These terms are only used to distinguish one element from another; for example, a first interface and a second interface are so described. Also, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used in this specification, specify the presence of stated features, steps, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C". An exception to this definition occurs only when a combination of elements, functions, steps, or operations is inherently mutually exclusive in some way.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the singular forms "a", "an" and "the" include plural forms as long as the words do not expressly indicate a contrary meaning. The term "comprises/comprising" when used in this specification is taken to specify the presence of stated features, regions, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of other features, regions, integers, steps, operations, elements, and/or components.
Terms indicating "lower", "upper", and the like relative to space may be used to more easily describe a relationship of one component with respect to another component illustrated in the drawings. Such terms are intended to include not only the meanings indicated in the drawings, but also other meanings or operations of the device in use. For example, if the device in the figures is turned over, elements described as "below" other elements would then be oriented "above" the other elements. Thus, the exemplary terms "under" and "beneath" all include above and below. The device may be rotated 90 or other angles and the terminology representing relative space is also to be interpreted accordingly.
In view of the above, in order to solve the problems in the prior art, the present application provides a text similarity search method, apparatus, device, and medium based on text embedding: a text-embedding-based similarity calculation method is combined with the distributed search engine Elasticsearch to implement a similarity search algorithm.
Fig. 1 is a flowchart illustrating a text similarity search method based on text embedding according to an embodiment of the present application. As shown, the method comprises:
step S101: and acquiring a query character string.
In this embodiment, the method is suited to big-data search scenarios. Here, the query string is the input query word or phrase. For example, the search may target research data such as paper data, book data, and patent data.
In this embodiment, the query string may be sent to the server side by the user through a terminal or a browser URL; it is commonly referred to as the Query and represents the user's query intent.
Step S102: the query string is converted to a query vector by an offline model trained based on text data.
As shown in fig. 2, the method is a schematic flow chart of an offline model building method in an embodiment of the present application, and as shown in the figure, the method includes:
step S201: text data for training is acquired.
In this embodiment, text data for training a model is prepared, and scientific data is taken as an example, and may be paper data, book data, patent data, and the like.
Step S202: and training the Doc2Vec model as an offline model, performing text embedding processing on the text data by using a natural language processing technology so as to convert the segmented words or fields into word vectors, and specifying the dimensionality of the vectors during training.
In this embodiment, the method mainly uses natural language processing techniques to train the text embedding: the Doc2Vec model trains the prepared text data into an offline model, and the dimension of the generated vectors is specified when the model is trained.
Wherein performing text embedding on the text data with natural language processing techniques includes:
A. processing the text data in a one-line, one-paragraph data format;
B. reading the text data line by line, segmenting each line into Chinese words, and filtering stop words against a Chinese stop-word list;
C. storing the word segmentation result.
In brief: the data format is one paragraph per line; the text data is read line by line, each line is segmented into Chinese words and filtered against the stop-word list, and the segmentation result is stored in a new text file.
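The preprocessing described above can be sketched as follows. This is a minimal, hedged illustration: the patent does not name a particular segmenter, so `segment` below is a whitespace-split placeholder (in practice a Chinese tokenizer such as jieba's `lcut` would be substituted), and the corpus and stop-word list are invented for demonstration.

```python
# Hedged sketch of the preprocessing step: read a corpus where each line is
# one paragraph, segment it, drop stop words, and collect the result.
from typing import Callable, Iterable, List, Set

def segment(line: str) -> List[str]:
    # Placeholder tokenizer: whitespace split. A real Chinese segmenter
    # (e.g. jieba.lcut) would replace this in production.
    return line.split()

def preprocess_corpus(lines: Iterable[str], stopwords: Set[str],
                      tokenize: Callable[[str], List[str]] = segment) -> List[List[str]]:
    """One input line = one paragraph; returns filtered token lists."""
    result = []
    for line in lines:
        tokens = [t for t in tokenize(line.strip()) if t and t not in stopwords]
        result.append(tokens)
    return result

corpus = ["the quick brown fox", "a deep learning survey"]
stopwords = {"the", "a", "of"}
print(preprocess_corpus(corpus, stopwords))
```

The filtered token lists would then be written to a new text file, one segmented paragraph per line, as the patent describes.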
Generally, word embedding is a learned representation of text, which can be understood as the form in which a word is stored inside an algorithm. Word embedding digitizes text so that algorithms can be fitted to it. Representing words or documents numerically in this way is considered one of the key challenges of deep learning in natural language processing tasks.
Text embedding techniques, also from the field of natural language processing, encode words and sentences as numerical vectors designed to capture linguistic content; these vectors can be used to evaluate similarity between queries and documents. For example, two phrasings such as "research on deep learning" and "studying deep learning" may share few surface words, so a traditional algorithm finds little similarity between them, yet their semantics are nearly identical, and text embedding retrieves them well. One limitation of traditional keyword-based search is that nothing is returned when the keywords do not match; with word vectors, the best-matching answer can be obtained directly from a similar natural-language description.
Doc2vec, also called Paragraph Vector, was proposed by Tomas Mikolov on the basis of the word2vec model. It has several advantages, such as accepting sentences of different lengths as training samples without fixing sentence length. Doc2vec is an unsupervised learning algorithm that predicts a vector to represent each document, and the model's structure potentially overcomes the defects of the bag-of-words model.
The Doc2vec model is inspired by word2vec: the word vectors word2vec predicts carry word sense, for example the vector for "powerful" lies closer to "strong" than to "Paris", and Doc2vec is built on the same structure. Doc2vec thus overcomes the bag-of-words model's lack of semantics. Each sentence in the corpus serves as one training sample.
In addition, training the Doc2Vec model as the offline model includes:
A. carrying out unsupervised learning on the text data with Python's Gensim toolkit to obtain the topic vector of the text's hidden layer;
B. after training, loading the Doc2Vec model offline to form the offline model, which provides a conversion service from input character strings to output word vectors.
In brief, the Doc2Vec model performs unsupervised learning on the text data using Python's Gensim toolkit to obtain the topic vector of the text's hidden layer. Once the model is generated, it can be loaded offline and provide a conversion service to the outside: character strings in, word vectors out.
Step S103: and sending a search request to a preset search engine by taking the query vector as a parameter, and performing correlation matching on the query vector and the word vector trained by the offline model.
In the present application, the search engine is preferably an Elasticsearch.
Elasticsearch is a distributed, highly scalable, near-real-time search and data analysis engine. It conveniently gives large volumes of data the capability to be searched, analyzed, and explored, and making full use of its horizontal scalability lets the data become more valuable in a production environment. Its working principle can be divided into the following steps: first, the user submits data to the Elasticsearch database; a tokenizer then segments the corresponding sentences, and the weights and segmentation results are stored with the data; when the user searches, the results are ranked and scored according to the weights, and the returned results are presented to the user.
In an embodiment of the present application, the preset method of the search engine includes: presetting the mapping structure of the search engine, setting the data type to vector for each field requiring text embedding, and keeping the vector dimension consistent with that of the offline model.
In this embodiment, after the off-line model training is completed, the mapping design of the search engine is performed. The mapping is equivalent to a table structure in a relational database, and the data needs to be designed in advance before being written.
For example, for a field in which text embedding is required, that is, for the original field and the corresponding word-vector field, the parameter type is set to the vector type dense_vector.
For example, the mapping can set the type to dense_vector and the vector dimension to 300.
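A mapping of this shape can be sketched as follows. The index and field names are illustrative assumptions; `dense_vector` with a `dims` parameter is Elasticsearch's field type for vectors, and `dims` must match the offline model's vector dimension.

```python
# Hedged sketch of an Elasticsearch mapping for text-embedding search:
# the original text field plus a companion dense_vector field whose `dims`
# matches the offline model's vector dimension (300 here).
import json

mapping = {
    "mappings": {
        "properties": {
            "title":        {"type": "text"},
            "title_vector": {"type": "dense_vector", "dims": 300},
        }
    }
}
print(json.dumps(mapping, indent=2))
```

This body would be supplied when creating the index, before any documents are written, since the mapping must be designed in advance as the text notes.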
In an embodiment of the present application, the preset method of the search engine further includes:
A. calculating a word vector for each field to be searched through the offline model;
B. indexing the text data and the word vectors to the search engine.
In this embodiment, after the mapping design is completed, the word vectors of the fields to be retrieved are calculated by the offline model, and the original text and the word vectors are then indexed into Elasticsearch. In this way, the word vectors computed offline can later be looked up quickly from the search terms a user provides, and the word vectors of closely related words or sentences can be located in the original text.
In the present application, the purpose of calculating the word vectors of searchable fields in advance with the offline model is as follows: the fields to be searched are confined to a domain, for example medicine or physics, and word vectors are precomputed offline for the fields likely to be searched in that domain. On one hand, this bounds the mass of text data and so accelerates computation and search; on the other hand, precomputing vectors for the fields likely to be searched means the corresponding word vectors can be found quickly once the query string arrives, further speeding up computation and search.
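The offline precomputation and indexing step can be sketched as follows. The `embed` function is a hypothetical stand-in for the trained model's inference call (e.g. `model.infer_vector`), and the documents are invented; in a real pipeline each payload would be sent to Elasticsearch via a bulk index request.

```python
# Hedged sketch of the indexing step: precompute a vector for each searchable
# field with the offline model, then pair the original text with its vector
# so both can be indexed together.
def embed(tokens):
    # Stand-in for the trained model's inference call; returns a
    # fixed-size 300-dimensional vector for illustration only.
    return [float(len(t)) for t in tokens] + [0.0] * (300 - len(tokens))

docs = [{"title": "text similarity search"},
        {"title": "deep learning survey"}]

indexed = []
for doc in docs:
    tokens = doc["title"].split()
    indexed.append({"title": doc["title"], "title_vector": embed(tokens)})

print(len(indexed[0]["title_vector"]))  # 300
```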
Step S104: and analyzing the query result and outputting a related document.
In an embodiment of the present application, the step S104 specifically includes:
A. calculating a vector angle according to the query vector and the word vector trained by the offline model;
B. obtaining the matched correlation according to the vector angle;
C. and sorting the analyzed query results according to the relevance to output the searched related documents.
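Steps A through C amount to computing the cosine of the angle between the query vector and each stored vector and sorting by that relevance score. A minimal sketch with toy two-dimensional vectors (real vectors would have the trained dimensionality):

```python
# Minimal sketch of steps A-C: cosine similarity as the "vector angle"
# relevance measure, then sorting candidate documents by that score.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

query = [1.0, 0.0]
docs = {"doc_a": [0.9, 0.1], "doc_b": [0.0, 1.0]}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # doc_a ranks first: its angle to the query vector is smaller
```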
In this embodiment, through the foregoing steps the query string is used as input to the offline model, which outputs the corresponding word vector; the word vector corresponding to the user's query is then sent to Elasticsearch for querying. Elasticsearch computes the similarity between the query vector and the stored vector fields and returns the N most similar documents ranked by relevance.
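On the Elasticsearch side, this kind of vector-relevance ranking can be expressed with a `script_score` query calling the built-in `cosineSimilarity` function (available since Elasticsearch 7.3). The request body below is a hedged sketch: the field name, query vector, and result size are illustrative, and the `+ 1.0` keeps scores non-negative as Elasticsearch requires.

```python
# Hedged sketch of the query step: a script_score request body ranking
# documents by cosine similarity between the query vector and the stored
# dense_vector field.
query_vector = [0.1] * 300  # in practice, produced by the offline model

search_body = {
    "size": 10,  # return the top-N most similar documents
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'title_vector') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    },
}
print(search_body["size"])
```

A client would POST this body to the index's `_search` endpoint and read the ranked hits from the response.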
In summary, the application provides a text similarity calculation method based on text embedding, combined with the distributed search engine Elasticsearch to implement a similarity search algorithm. Its advantages are: 1) the text embedding technology of natural language processing is applied to a distributed search engine, and semantics is taken into account to capture the real intent behind the sentence a user inputs and to search accordingly, so that the results best matching the user's need are returned more accurately than with the traditional word-level similarity matching method; 2) the query is accelerated by the distributed nature of Elasticsearch and its support for high-dimensional vectors.
Fig. 3 is a block diagram of a text similarity search apparatus based on text embedding according to an embodiment of the present application. As shown, the apparatus 300 includes:
an obtaining module 301, configured to obtain a query string;
a processing module 302, configured to convert the query string into a query vector through an offline model trained based on text data; sending a search request to a preset search engine by taking the query vector as a parameter, and performing correlation matching on the query vector and a word vector trained by the offline model; and analyzing the query result and outputting a related document.
It should be noted that, because the contents of information interaction, execution process, and the like between the modules/units of the apparatus are based on the same concept as the method embodiments described in the present application, the technical effect brought by the contents is the same as the method embodiments of the present application, and specific contents can be referred to the descriptions in the method embodiments described in the foregoing description of the present application.
It should be further noted that the division of the modules of the above apparatus is only a logical division, and the actual implementation may be wholly or partially integrated into one physical entity, or may be physically separated. And these units can be implemented entirely in software, invoked by a processing element; or may be implemented entirely in hardware; and part of the modules can be realized in the form of calling software by the processing element, and part of the modules can be realized in the form of hardware.
For example, the processing module 302 may be a separate processing element, or may be integrated into a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code, and a processing element of the apparatus calls and executes the functions of the processing module 302. Other modules are implemented similarly. In addition, all or part of the modules can be integrated together or can be independently realized. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more application specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field programmable gate arrays (FPGAs). For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown, the computer device 400 includes: a memory 401, and a processor 402; the memory 401 is used for storing computer instructions; the processor 402 executes computer instructions to implement the method described in fig. 1.
In some embodiments, the number of the memories 401 in the computer device 400 may be one or more, the number of the processors 402 may be one or more, and fig. 4 is taken as an example.
In an embodiment of the present application, the processor 402 in the computer device 400 loads one or more instructions corresponding to the processes of an application program into the memory 401 according to the steps described in fig. 1, and the processor 402 executes the application program stored in the memory 401, thereby implementing the method described in fig. 1.
The Memory 401 may include a Random Access Memory (RAM), and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The memory 401 stores an operating system and operating instructions, executable modules or data structures, or a subset thereof, or an expanded set thereof, wherein the operating instructions may include various operating instructions for implementing various operations. The operating system may include various system programs for implementing various basic services and for handling hardware-based tasks.
The Processor 402 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a discrete gate or transistor logic device, a discrete hardware component, etc.
In some specific applications, the various components of the computer device 400 are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. But for clarity of explanation the various busses are shown in fig. 4 as a bus system.
In an embodiment of the present application, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the text similarity search method based on text embedding as described in fig. 1.
The computer readable storage medium is preferably a non-volatile computer storage medium.
Those of ordinary skill in the art will understand that all or part of the functions of the system and of each unit in the above embodiments may be implemented by hardware controlled by a computer program. The aforementioned computer program may be stored in a computer readable storage medium; when executed, the program performs the embodiments including the functions of the system and the units. The aforementioned storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
It should be noted that, for the system, the computer device, and the like in the above embodiments, all the related computer programs may be loaded on a computer readable storage medium, which may be a tangible device capable of holding and storing the instructions used by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, or a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., light pulses passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
In summary, the text similarity search method, device, equipment and medium based on text embedding provided by the present application acquire a query string; convert the query string into a query vector through an offline model trained on text data; send a search request to a preset search engine with the query vector as a parameter, performing relevance matching between the query vector and the word vectors trained by the offline model; and parse the query result and output the related documents.
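The flow summarized above — embed the query, match it against pre-computed document vectors, and rank by relevance — can be illustrated with a minimal Python sketch. The bag-of-words embedding, toy corpus, and cosine scoring below are illustrative stand-ins only: the application itself uses a trained Doc2Vec offline model and a preset search engine, neither of which is reproduced here.

```python
import math
from collections import Counter

def embed(text, vocab):
    """Map a string to a fixed-dimension bag-of-words vector.
    (Stands in for the trained Doc2Vec offline model.)"""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    """Relevance score derived from the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query, docs):
    """Embed the query, match it against pre-embedded documents,
    and return the documents sorted by descending relevance."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    query_vector = embed(query, vocab)
    scored = [(cosine(query_vector, embed(d, vocab)), d) for d in docs]
    return [d for score, d in sorted(scored, key=lambda p: -p[0])]

docs = ["text embedding converts words to vectors",
        "search engines index documents",
        "word vectors capture text similarity"]
print(search("text vectors", docs)[0])  # → word vectors capture text similarity
```

Here cosine similarity plays the role of the "vector angle" relevance measure used when parsing and sorting the query results.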
The application effectively overcomes various defects in the prior art and has high industrial utilization value.
The above embodiments are merely illustrative of the principles and utilities of the present application and are not intended to limit the application. Any person skilled in the art can modify or change the above-described embodiments without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical concepts disclosed in the present application shall be covered by the claims of the present application.

Claims (10)

1. A text similarity search method based on text embedding is characterized by comprising the following steps:
acquiring a query character string;
converting the query string into a query vector through an offline model trained based on text data;
sending a search request to a preset search engine by taking the query vector as a parameter, and performing correlation matching on the query vector and a word vector trained by the offline model;
and analyzing the query result and outputting a related document.
2. The method of claim 1, wherein the offline model is constructed by:
acquiring text data for training;
and training a Doc2Vec model as the offline model, performing text embedding on the text data by using natural language processing technology so as to convert segmented words or fields into word vectors, wherein the dimensionality of the vectors is specified during training.
3. The method according to claim 2, wherein the text embedding the text data by using natural language processing technology comprises:
processing the text data according to a data format of one line and one paragraph;
reading the text data line by line, performing Chinese word segmentation on each line, and filtering stop words according to a Chinese stop word list;
and storing the word segmentation result.
4. The method of claim 2, wherein training the Doc2Vec model as the offline model comprises:
performing unsupervised learning on the text data by using the Gensim toolkit for Python to obtain hidden-layer word vectors of the text body;
and after the Doc2Vec model is trained, loading it offline to form the offline model so as to provide an external service that converts an input string into an output word vector.
5. The method of claim 2, wherein presetting the search engine comprises:
presetting a mapping structure of the search engine, setting the data type of each field requiring text embedding as a vector, and keeping its dimension consistent with the vector dimension of the offline model.
6. The method of claim 2, wherein presetting the search engine further comprises:
calculating a word vector of a field to be searched through the offline model;
and indexing the text data and the word vectors into the search engine.
7. The method of claim 1, wherein parsing the query results and outputting relevant documents comprises:
calculating a vector angle according to the query vector and the word vector trained by the offline model;
obtaining the matched correlation according to the vector angle;
and sorting the analyzed query results according to the relevance to output the searched related documents.
8. An apparatus for searching text similarity based on text embedding, the apparatus comprising:
the acquisition module is used for acquiring the query character string;
the processing module is used for converting the query string into a query vector through an offline model trained on text data; sending a search request to a preset search engine with the query vector as a parameter, and performing relevance matching between the query vector and the word vectors trained by the offline model; and parsing the query result, sorting the searched related documents according to relevance, and outputting them.
9. A computer device, comprising: a memory and a processor; the memory is configured to store computer instructions; and the processor executes the computer instructions to implement the method of any one of claims 1 to 7.
10. A computer storage medium, characterized in that a computer program is stored thereon which, when executed, performs the method of any one of claims 1 to 7.
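The mapping preset recited in claim 5 — a field typed as a vector whose dimension matches the offline model — corresponds, in a search engine such as Elasticsearch (the approach taken in the text-embeddings repository cited among the non-patent references), to declaring a dense-vector field. The sketch below is an assumed illustration only: the index layout, the field names, and the dimension of 100 are not fixed by the application.

```python
import json

# Hypothetical index mapping in the style of Elasticsearch's "dense_vector"
# field type. Field names and the dimension are illustrative assumptions;
# the claim only requires that the vector dimension match the offline model.
VECTOR_DIMS = 100  # must equal the dimensionality specified when training Doc2Vec

mapping = {
    "mappings": {
        "properties": {
            "body": {"type": "text"},  # raw text field
            "body_vector": {"type": "dense_vector", "dims": VECTOR_DIMS},  # its embedding
        }
    }
}

print(json.dumps(mapping, indent=2))
```

At index time, each document's text and its word vector (computed by the offline model, per claim 6) would both be written under this mapping.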
CN201911370136.0A 2019-12-26 2019-12-26 Text similarity searching method, device, equipment and medium based on text embedding Pending CN111159343A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911370136.0A CN111159343A (en) 2019-12-26 2019-12-26 Text similarity searching method, device, equipment and medium based on text embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911370136.0A CN111159343A (en) 2019-12-26 2019-12-26 Text similarity searching method, device, equipment and medium based on text embedding

Publications (1)

Publication Number Publication Date
CN111159343A true CN111159343A (en) 2020-05-15

Family

ID=70558361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911370136.0A Pending CN111159343A (en) 2019-12-26 2019-12-26 Text similarity searching method, device, equipment and medium based on text embedding

Country Status (1)

Country Link
CN (1) CN111159343A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832306A (en) * 2017-11-28 2018-03-23 武汉大学 A kind of similar entities method for digging based on Doc2vec
CN108170684A (en) * 2018-01-22 2018-06-15 京东方科技集团股份有限公司 Text similarity computing method and system, data query system and computer product
US20190260694A1 (en) * 2018-02-16 2019-08-22 Mz Ip Holdings, Llc System and method for chat community question answering
US20190303395A1 (en) * 2018-03-30 2019-10-03 State Street Corporation Techniques to determine portfolio relevant articles
CN110263127A (en) * 2019-06-21 2019-09-20 北京创鑫旅程网络技术有限公司 Text search method and device is carried out based on user query word

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JULIE TIBSHIRANI: "text-embeddings", https://github.com/jtibshirani/text-embeddings/tree/4b91d928455213d0d156ed80605382594703367e *
岳文应 (YUE, Wenying): "Chat content filtering based on Doc2Vec and SVM", 《计算机系统应用》 (Computer Systems & Applications) *
李晓军 (LI, Xiaojun): "Chinese text classification based on semantic similarity", 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Masters' Theses Full-text Database, Information Science and Technology) *
贺益侗 (HE, Yitong): "Similar text recognition based on doc2vec and TF-IDF", 《电子制作》 (Practical Electronics) *
邹瑛 (ZOU, Ying): 《网络信息安全及管理研究》 (Research on Network Information Security and Management), 31 March 2018, 北京理工大学出版社 (Beijing Institute of Technology Press) *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111613311A (en) * 2020-06-09 2020-09-01 广东珠江智联信息科技股份有限公司 Intelligent AI (Artificial intelligence) diagnosis guide realization technology
CN111950840A (en) * 2020-06-19 2020-11-17 国网山东省电力公司 A method and system for intelligent operation and maintenance knowledge retrieval of metrological verification device
CN112445904A (en) * 2020-12-15 2021-03-05 税友软件集团股份有限公司 Knowledge retrieval method, knowledge retrieval device, knowledge retrieval equipment and computer readable storage medium
CN113596601A (en) * 2021-01-19 2021-11-02 腾讯科技(深圳)有限公司 Video picture positioning method, related device, equipment and storage medium
CN112784046A (en) * 2021-01-20 2021-05-11 北京百度网讯科技有限公司 Text clustering method, device and equipment and storage medium
CN112784046B (en) * 2021-01-20 2024-05-28 北京百度网讯科技有限公司 Text clustering method, device, equipment and storage medium
CN113010771A (en) * 2021-02-19 2021-06-22 腾讯科技(深圳)有限公司 Training method and device for personalized semantic vector model in search engine
CN113010771B (en) * 2021-02-19 2023-08-22 腾讯科技(深圳)有限公司 Training method and device for personalized semantic vector model in search engine
CN112966007A (en) * 2021-04-02 2021-06-15 新华智云科技有限公司 Search terminal control method and search terminal
CN112966007B (en) * 2021-04-02 2022-06-17 新华智云科技有限公司 Search terminal control method and search terminal
CN113254586A (en) * 2021-05-31 2021-08-13 中国科学院深圳先进技术研究院 Unsupervised text retrieval method based on deep learning
CN115905879A (en) * 2021-09-30 2023-04-04 电子湾有限公司 Artificial intelligence-based similarity in e-commerce marketplaces
CN114003798A (en) * 2021-10-29 2022-02-01 平安国际智慧城市科技股份有限公司 Data updating method, apparatus, device and storage medium for search engine
CN114003798B (en) * 2021-10-29 2025-04-08 平安国际智慧城市科技股份有限公司 Data updating method, device and equipment of search engine and storage medium
CN114548100A (en) * 2022-03-01 2022-05-27 深圳市医未医疗科技有限公司 Clinical scientific research auxiliary method and system based on big data technology
CN115618042A (en) * 2022-10-12 2023-01-17 广州广电运通信息科技有限公司 Retrieval method, equipment and storage medium for establishing image information index library based on ES
CN116340467A (en) * 2023-05-11 2023-06-27 腾讯科技(深圳)有限公司 Text processing method, device, electronic device, and computer-readable storage medium
CN116340467B (en) * 2023-05-11 2023-11-17 腾讯科技(深圳)有限公司 Text processing method, text processing device, electronic equipment and computer readable storage medium
CN117056446A (en) * 2023-08-11 2023-11-14 北京百度网讯科技有限公司 Trajectory data query method, device, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN111159343A (en) Text similarity searching method, device, equipment and medium based on text embedding
US12093648B2 (en) Systems and methods for producing a semantic representation of a document
CN117235226A (en) A question answering method and device based on large language model
JP2022115815A (en) Semantic code search based on augmented programming language corpus
WO2021017721A1 (en) Intelligent question answering method and apparatus, medium and electronic device
US20230119161A1 (en) Efficient Index Lookup Using Language-Agnostic Vectors and Context Vectors
US12541543B2 (en) Large language model-based information retrieval for large datasets
WO2021061233A1 (en) Inter-document attention mechanism
CN110727769B (en) Corpus generation method and device and man-machine interaction processing method and device
KR20160007040A (en) Method and system for searching by using natural language query
CN113505196A (en) Part-of-speech-based text retrieval method and device, electronic equipment and storage medium
WO2021225775A1 (en) Creating and interacting with data records having semantic vectors and natural language expressions produced by a machine-trained model
WO2018121198A1 (en) Topic based intelligent electronic file searching
CN114756733A (en) Similar document searching method and device, electronic equipment and storage medium
CN120045750A (en) Retrieval enhancement generation method and system based on large language model
CN119988563A (en) Intelligent question-answering method, device, electronic device, and storage medium based on multimodal information processing
CN110728135A (en) Text theme indexing method and device, electronic equipment and computer storage medium
CN117271712A (en) Retrieval method and system based on vector database and electronic equipment
CN106933824A (en) The method and apparatus that the collection of document similar to destination document is determined in multiple documents
CN114462378A (en) Method, system, computer equipment and storage medium for duplication checking of scientific and technological projects
US20250252320A1 (en) Improvement of ai predictions using context localization
CN118734870A (en) Text conversion method, device, electronic device and storage medium based on large model
CN114692610A (en) Keyword determination method and device
CN114880469B (en) Answer acquisition method, device, computer equipment and storage medium
KR101593214B1 (en) Method and system for searching by using natural language query

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200515