
CN112182145B - Text similarity determination method, device, equipment and storage medium - Google Patents


Info

Publication number
CN112182145B
Authority
CN
China
Prior art keywords
similarity
word
word frequency
text
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910600981.6A
Other languages
Chinese (zh)
Other versions
CN112182145A (en)
Inventor
王艳花
邱龙泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201910600981.6A priority Critical patent/CN112182145B/en
Publication of CN112182145A publication Critical patent/CN112182145A/en
Application granted granted Critical
Publication of CN112182145B publication Critical patent/CN112182145B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present invention disclose a text similarity determination method, apparatus, device and storage medium. The method includes: obtaining a target text and a candidate text whose similarity is to be determined; determining the word sense similarity and the word frequency inverse word frequency similarity of the target text and the candidate text, where the word frequency inverse word frequency similarity includes a no-part-of-speech word frequency inverse word frequency similarity and/or a part-of-speech word frequency inverse word frequency similarity; and determining the text similarity of the target text and the candidate text according to a preset word sense weight, a preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity, where the preset word frequency inverse word frequency weight includes a preset no-part-of-speech weight and/or a preset part-of-speech weight. With this technical scheme, text similarity is determined more accurately.

Description

Text similarity determination method, device, equipment and storage medium
Technical Field
Embodiments of the present invention relate to big data mining technology, and in particular to a text similarity determination method, apparatus, device and storage medium.
Background
With the development of the internet, more and more data appear on the internet in the form of text, such as microblog messages, news headlines, forum posts, and, in e-commerce platforms, product reviews, product questions and buyers' answers to those questions. Applying machine learning to internet text data to mine valuable information from it brings real convenience to people's lives and meets needs in many areas, and is a very popular topic in current big data application technology.
Take product questions and answers in an e-commerce platform as an example. When making a purchase decision, a shopper usually asks people who have bought the product, browses the question-and-answer data under the product, or asks customer service, in order to understand the product fully. From the user's perspective, the user must browse the historical question-and-answer data to see whether a similar question has already been asked; if the number of questions is large, a satisfactory answer can be hard to find. From the customer-service perspective, the similarity between the user's question and the questions already in the question bank must be computed to find the several questions most similar to it, so that the user's question can be answered with the answers to those similar questions. In both cases, there is a need to calculate the similarity between a user's question and the questions already in the question bank.
Currently, similarity determination schemes for texts such as the above questions mainly adopt a vector space model: each word in the text is mapped into a vector space and the cosine distance between vectors is calculated; the smaller the distance, the greater the similarity of the words. There are two main kinds of vector-space-based schemes. One is based on word importance, for example the term frequency-inverse document frequency (TF-IDF) model. TF-IDF evaluates the importance of a word to a document in a text library: the importance increases in proportion to the number of times the word appears in the document, but decreases in inverse proportion to its frequency in the corpus. The flow of the TF-IDF-based similarity determination (i.e., question matching) scheme is shown in fig. 1. A typical application of this scheme is a search engine: the user inputs a search question, the question is segmented into words, a mapping between the words and documents is established, and after all matching documents are found, similarity scores between the user's question and the documents in the knowledge base are computed with the TF-IDF algorithm, the documents are ranked by score, and results are returned in rank order. The other kind is based on natural language processing (NLP); a common model is the Word2vec model. The Word2vec model is trained on Harris's distributional hypothesis (words with similar contexts have similar semantics); the neural network obtained by training maps each word into a K-dimensional space, representing the word as a vector, and the semantic similarity of words is then measured by the similarity between vectors in that space.
The scheme flow based on the semantic similarity model of NLP is shown in FIG. 2.
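As a concrete illustration of the TF-IDF-based flow above, the following sketch builds TF-IDF vectors for a small question bank and ranks candidates by cosine similarity. The library choice (scikit-learn) and the example questions are assumptions for illustration; the patent does not prescribe an implementation, and for Chinese text each question would first be word-segmented and space-joined.

```python
# Minimal sketch of the TF-IDF similarity flow described above,
# using scikit-learn (an assumed implementation, not named by the patent).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical question bank and user query (pre-segmented text).
question_bank = [
    "does this phone support fast charging",
    "how long does the battery last",
    "is the screen scratch resistant",
]
user_query = "does the phone battery support fast charging"

vectorizer = TfidfVectorizer()
bank_matrix = vectorizer.fit_transform(question_bank)   # one row per question
query_vector = vectorizer.transform([user_query])

# Cosine similarity between the query and every question in the bank;
# smaller vector distance means larger similarity.
scores = cosine_similarity(query_vector, bank_matrix)[0]
ranked = sorted(zip(scores, question_bank), reverse=True)
for score, question in ranked:
    print(f"{score:.3f}  {question}")
```

The most similar bank question is returned first, matching the "rank and return" step of the search-engine flow.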
In the course of making the invention, the inventors found that the prior art has at least the following problems. 1) The TF-IDF-based similarity determination scheme measures only the importance of a word to a document by term frequency and inverse document frequency, without considering the position or the semantics of the word, so the text similarity it determines is of low accuracy and cannot meet the practical requirement of finding the most similar text in a question-answering system. 2) The NLP-based similarity determination scheme computes similarity well only at the word level; when extended to the sentence level, because of the complexity of syntactic structure, simply representing a sentence vector by summing or concatenating the word vectors of all words in the sentence does not achieve the desired effect in practical applications.
Disclosure of Invention
Embodiments of the present invention provide a text similarity determination method, apparatus, device and storage medium, so as to determine text similarity more accurately.
In a first aspect, an embodiment of the present invention provides a text similarity determining method, including:
obtaining a target text and a candidate text whose similarity is to be determined;
determining word sense similarity and word frequency inverse word frequency similarity of the target text and the candidate text, wherein the word frequency inverse word frequency similarity comprises word frequency inverse word frequency similarity without word parts and/or word frequency inverse word frequency similarity with word parts;
determining the text similarity of the target text and the candidate text according to a preset word sense weight, a preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity, wherein the preset word frequency inverse word frequency weight comprises a preset no-part-of-speech word frequency inverse word frequency weight and/or a preset part-of-speech word frequency inverse word frequency weight.
In a second aspect, an embodiment of the present invention further provides a text similarity determining apparatus, where the apparatus includes:
a target text acquisition module, used for obtaining a target text and a candidate text whose similarity is to be determined;
The first similarity determining module is used for determining word sense similarity and word frequency inverse word frequency similarity of the target text and the candidate text, wherein the word frequency inverse word frequency similarity comprises word frequency inverse word frequency similarity without word parts and/or word frequency inverse word frequency similarity with word parts;
The second similarity determining module is configured to determine a text similarity between the target text and the candidate text according to a preset word sense weight, a preset word frequency inverse word frequency weight, and the word sense similarity and the word frequency inverse word frequency similarity, where the preset word frequency inverse word frequency weight includes a preset non-part-of-speech word frequency inverse word frequency weight and/or a preset part-of-speech word frequency inverse word frequency weight.
In a third aspect, an embodiment of the present invention further provides an apparatus, including:
One or more processors;
Storage means for storing one or more programs,
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the text similarity determination method provided by any of the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the text similarity determination method provided by any embodiment of the present invention.
According to the embodiments of the invention, the word sense similarity of the target text and the candidate text is generated together with the word frequency inverse word frequency similarity (comprising the no-part-of-speech and/or the part-of-speech similarity), and the text similarity of the target text and the candidate text is determined from the preset word sense weight, the preset word frequency inverse word frequency weight (comprising the preset no-part-of-speech and/or part-of-speech weight), the word sense similarity and the word frequency inverse word frequency similarity. By grasping the overall sentence semantics of the text through multiple dimensional features (word sense, word frequency and/or part of speech), text similarity is characterized in different dimensions, an integrated similarity is obtained, and the accuracy of the text similarity is greatly improved.
Drawings
FIG. 1 is a flow chart of a prior art similarity determination method based on a word frequency inverse word frequency model;
FIG. 2 is a flow chart of a similarity determination method based on a semantic similarity model of NLP in the prior art;
FIG. 3a is a flow chart of a text similarity determination method in accordance with a first embodiment of the present invention;
FIG. 3b is a logical frame diagram of a text similarity determination method according to a first embodiment of the present invention;
FIG. 4a is a flowchart of a text similarity determination method according to a second embodiment of the present invention;
FIG. 4b is a schematic diagram of CBOW model structure in a second embodiment of the present invention;
FIG. 4c is a schematic diagram of a Skip-Gram model structure in a second embodiment of the present invention;
FIG. 5 is a flow chart of a text similarity determination method in accordance with a third embodiment of the present invention;
FIG. 6a is a flowchart of a text similarity determination method in accordance with a fourth embodiment of the present invention;
FIG. 6b is a logical frame diagram of a text similarity determination method in accordance with a fourth embodiment of the present invention;
fig. 7 is a schematic structural diagram of a text similarity determining apparatus according to a fifth embodiment of the present invention;
Fig. 8 is a schematic structural view of an apparatus according to a sixth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Example 1
The text similarity determination method provided in this embodiment is applicable to similarity calculation for texts such as topics, messages, replies, consultations, suggestions and opinion feedback in network forums, web-based intelligent question answering, instant chat records, and the like. The method may be performed by a text similarity determination apparatus, which may be implemented in software and/or hardware and may be integrated in a device with large-scale data processing capability, such as a personal computer or a server. In this embodiment of the invention, intelligent question answering is taken as the example. Referring to fig. 3a, the method of this embodiment specifically includes the following steps:
S110, obtaining a target text and a candidate text whose similarity is to be determined.
The target text is text for which similarity needs to be calculated, and may be an existing text or a new text acquired from the outside. The candidate text is text for calculating similarity with the target text. The text may be short text or long text. Short text refers to text of a short length such as a sentence or short few sentences (small paragraphs).
In a specific implementation, content entered by a user may be received as the target text. Meanwhile, one or more candidate texts are obtained from the existing texts that can be collected. It should be understood that a text database may be built in advance from the existing texts, and the candidate texts may then be retrieved from the text database.
S120, determining word sense similarity and word frequency inverse word frequency similarity of the target text and the candidate text.
Here, word sense similarity is similarity determined from the semantic dimension of words, and word frequency inverse word frequency similarity is similarity determined from the word-importance dimension. Word frequency measures whether two sentences contain the same words with similar frequencies: if so, the sentences are similar. Inverse word frequency measures the importance of each word to a sentence: the more often a word appears in the sentence and the less often it appears in other sentences, the more important it is to that sentence. Word frequency and inverse word frequency together characterize each word in a sentence, and from these the similarity of sentences is calculated.
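The word frequency and inverse word frequency measures just described can be sketched from scratch as follows. A smoothed IDF variant is used, and the toy sentences are hypothetical; a real system would typically rely on a library implementation.

```python
import math

# From-scratch sketch of term frequency (TF) and inverse word frequency (IDF).
sentences = [
    ["battery", "lasts", "long"],
    ["battery", "charges", "fast"],
    ["screen", "is", "bright"],
]

def tf(word, sentence):
    # Term frequency: how often the word occurs within this sentence.
    return sentence.count(word) / len(sentence)

def idf(word, corpus):
    # Inverse word frequency (smoothed): rarer across sentences -> larger.
    containing = sum(1 for s in corpus if word in s)
    return math.log(len(corpus) / (1 + containing)) + 1

def tf_idf(word, sentence, corpus):
    return tf(word, sentence) * idf(word, corpus)

# "battery" appears in 2 of 3 sentences, "screen" in only 1, so "screen"
# carries the higher inverse word frequency and the higher TF-IDF weight.
print(tf_idf("battery", sentences[0], sentences))
print(tf_idf("screen", sentences[2], sentences))
```

This matches the intuition above: a word frequent in one sentence but rare elsewhere is the most important to that sentence.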
Illustratively, the word frequency inverse word frequency similarity includes a no-part-of-speech word frequency inverse word frequency similarity and/or a part-of-speech word frequency inverse word frequency similarity. The former does not distinguish the part of speech of each word in the text; the latter does. Since part of speech reflects the semantics of a word to some extent, the part-of-speech word frequency inverse word frequency similarity can be understood as a similarity determined from both the word-semantics dimension and the word-importance dimension.
In the related art, similarity can be determined only from a single dimension of the word, and when computing the similarity of texts with many words and rich sentence semantics, such single-dimension methods cannot determine text similarity accurately. Therefore, in this embodiment of the invention, text similarity is determined using similarities of at least two dimensions: the word-sense dimension and the word-importance dimension, where the word-importance dimension can be further divided into a no-part-of-speech dimension and a part-of-speech dimension. Although text similarity is still measured at word granularity, fusing the similarities of multiple dimensions characterizes the similarity of texts more completely.
In specific implementation, the target text and the candidate text can be used as input to a word sense similarity model, which computes the word sense similarity between them. Similarly, the target text and the candidate text can be used as input to a word frequency inverse word frequency similarity model, which computes the word frequency inverse word frequency similarity between them. The word sense similarity model may be an NLP-based similarity model, for example a word vector learning model such as Word2vec, GloVe or BERT. Because both the no-part-of-speech and the part-of-speech word frequency inverse word frequency similarities may be needed, the word frequency inverse word frequency similarity model can be divided into a no-part-of-speech model and a part-of-speech model. The two may be the same model (for example a TF-IDF model) differing only in input data: one receives word groups without part-of-speech tags and the other receives word groups with part-of-speech tags. They may also be two different models, in which case the part-of-speech model must perform part-of-speech-specific processing internally.
In the above process, similarity is calculated against a single candidate text; if there are multiple candidate texts, the process must be repeated in a loop to compute the similarity of every candidate text to the target text. To increase calculation speed, and thereby the efficiency of finding the candidates most similar to the target text, all candidate texts can be processed in parallel, for example by representing each candidate text in advance as vectors of the different dimensions and collecting all candidate texts into matrices. Specifically, referring to fig. 3b, this stage may be called stock calculation: all candidate texts are pre-characterized as a word sense feature matrix and word frequency inverse word frequency feature matrices (a no-part-of-speech matrix and/or a part-of-speech matrix). When the target text is determined, the same models used in the stock calculation characterize it as a word sense feature vector and word frequency inverse word frequency feature vectors (no-part-of-speech and/or part-of-speech); this is called real-time increment calculation.
Finally, word sense similarity is computed from the word sense feature vector and the word sense feature matrix, word frequency inverse word frequency similarity is computed from the corresponding feature vector and feature matrix, and the word sense similarity and word frequency inverse word frequency similarity between the target text and each candidate text are thereby determined.
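The stock/increment split described above can be sketched with NumPy: the candidate feature matrix is normalized once offline, and each incoming target vector needs only one matrix-vector product to obtain its cosine similarity to every candidate at once. The sizes and random features below are illustrative stand-ins for real word sense or TF-IDF features.

```python
import numpy as np

# Sketch of the stock/increment pattern: candidates pre-encoded as a matrix,
# the target encoded on arrival as one vector.
rng = np.random.default_rng(0)

# Stock calculation (offline): e.g. 1000 candidate texts, 64-dim features.
candidate_matrix = rng.normal(size=(1000, 64))
candidate_norms = np.linalg.norm(candidate_matrix, axis=1, keepdims=True)
candidate_unit = candidate_matrix / candidate_norms  # normalize once, offline

# Increment calculation (real time): encode the incoming target text.
target_vector = rng.normal(size=64)
target_unit = target_vector / np.linalg.norm(target_vector)

# One matrix-vector product gives the cosine similarity to every candidate.
similarities = candidate_unit @ target_unit
top5 = np.argsort(similarities)[::-1][:5]
print("most similar candidate indices:", top5)
```

The same pattern applies unchanged to the word sense matrix and to each word frequency inverse word frequency matrix.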
It should be noted that the no-part-of-speech and the part-of-speech word frequency inverse word frequency similarities need not both be calculated; either one may be selected alone. That is, this operation may use one or both of the no-part-of-speech and part-of-speech word frequency inverse word frequency similarity models. Accordingly, the similarities determined in this operation may be: the word sense similarity and the no-part-of-speech similarity; the word sense similarity and the part-of-speech similarity; or the word sense similarity together with both the no-part-of-speech and the part-of-speech similarities.
S130, determining the text similarity of the target text and the candidate text according to the preset word semantic weight, the preset word frequency inverse word frequency weight, the word semantic similarity and the word frequency inverse word frequency similarity.
The preset word sense weight and the preset word frequency inverse word frequency weight are preset weight values, used respectively to determine the proportion of the word sense similarity and of the word frequency inverse word frequency similarity in the similarity fusion, and may be preset from human experience. The text similarity is the similarity obtained after fusing the similarities of the different dimensions, and represents the integrated similarity of the target text and the candidate text.
Illustratively, the preset word frequency inverse word frequency weight includes a preset no-part-of-speech weight and/or a preset part-of-speech weight. Note that the preset weight corresponds to the similarity actually computed: if the no-part-of-speech similarity was determined, the preset no-part-of-speech weight is obtained; if the part-of-speech similarity was determined, the preset part-of-speech weight is obtained; and if both similarities were determined, both preset weights are obtained.
After obtaining the similarity between the target text and the candidate text in different dimensions, the similarity of each dimension needs to be fused so as to obtain the multi-dimensional text similarity. In specific implementation, according to the weight of the similarity in each dimension, the obtained similarity in different dimensions is weighted and summed to determine the text similarity, and a weighted and summed formula for determining the text similarity is as formula (1):
score=w1·score1+w2·score2 (1)
Here score denotes the text similarity between the target text and the candidate text, w1 and w2 denote the preset word sense weight and the preset word frequency inverse word frequency weight respectively, and score1 and score2 denote the word sense similarity and the word frequency inverse word frequency similarity respectively.
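Formula (1) can be sketched directly as a weighted sum; the weight and similarity values below are illustrative, not taken from the patent.

```python
# Sketch of formula (1): fuse per-dimension similarities with preset weights.
def fuse_similarity(scores, weights):
    assert len(scores) == len(weights)
    return sum(w * s for w, s in zip(weights, scores))

word_sense_sim = 0.82   # score1, from the word sense model
tfidf_sim = 0.64        # score2, from the word frequency inverse word frequency model
w1, w2 = 0.6, 0.4       # preset weights, e.g. chosen from human experience

text_similarity = fuse_similarity([word_sense_sim, tfidf_sim], [w1, w2])
print(f"fused text similarity: {text_similarity:.3f}")
```

When both the no-part-of-speech and the part-of-speech TF-IDF similarities are used, the same function simply takes three scores and three weights.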
In the technical scheme of this embodiment, the word sense similarity of the target text and the candidate text is generated together with the word frequency inverse word frequency similarity (comprising the no-part-of-speech and/or the part-of-speech similarity), and the text similarity of the target text and the candidate text is determined from the preset word sense weight, the preset word frequency inverse word frequency weight (comprising the preset no-part-of-speech and/or part-of-speech weight), the word sense similarity and the word frequency inverse word frequency similarity. The overall sentence semantics of the text are grasped through multiple dimensional features (word sense, word frequency and/or part of speech), so that text similarity is characterized in different dimensions, an integrated similarity is obtained, and the accuracy of the text similarity is greatly improved.
Example two
The embodiment further optimizes "determining word sense similarity and word frequency inverse word frequency similarity of the target text and the candidate text" and "determining text similarity of the target text and the candidate text according to the preset word semantic weight, the preset word frequency inverse word frequency weight, and the word sense similarity and the word frequency inverse word frequency similarity" based on the first embodiment. Wherein the explanation of the same or corresponding terms as those of the above embodiments is not repeated herein. Referring to fig. 4a, the text similarity determining method provided in the present embodiment includes:
S210, acquiring a target text and a candidate text whose similarity is to be determined.
S220, word segmentation is carried out on the target text, and each target word without part of speech corresponding to the target text is obtained.
Since the target text contains at least one sentence and similarity is determined at word granularity, the target text must first be segmented so as to split it into a number of words. Because each word produced by segmentation carries no part-of-speech tag, the words obtained by segmentation are the no-part-of-speech target words corresponding to the target text.
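The segmentation step can be sketched as follows. For Chinese text an actual segmenter (for example the jieba library) would be used; the whitespace-based tokenizer and the stop-word list below are simplified, hypothetical stand-ins for illustration.

```python
import re

# Illustrative sketch of segmentation: split the target text into
# part-of-speech-free words and drop stop words.
STOP_WORDS = {"the", "a", "does", "is", "of"}  # hypothetical stop-word list

def segment(text):
    # Stand-in tokenizer; Chinese would require a dedicated segmenter.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

target_text = "Does the battery of this phone last a long time?"
target_words = segment(target_text)
print(target_words)  # the no-part-of-speech target words
```

For the part-of-speech variant used later, each word would additionally be paired with its part-of-speech tag at this step.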
S230, determining the word sense similarity of the target text and the candidate text based on a pre-trained word sense similarity model according to each target word without part of speech and the candidate text.
As described in the above embodiment, the word sense similarity model may be a Word2vec model, which uses a shallow neural network either to predict a target word from its input context (corresponding to the CBOW structure) or to predict the context from an input target word (corresponding to the Skip-Gram structure). Training on text yields the parameters of the network's hidden layer, i.e. a trained Word2vec model. The trained Word2vec model maps each word into a vector space, characterizing the word as a corresponding feature vector.
Referring to fig. 4b, the CBOW model has three layers: an input layer, a hidden layer and an output layer. It predicts P(w_t | w_{t-k}, ..., w_{t-1}, w_{t+1}, ..., w_{t+k}), where w_t is the target word to be predicted and w_{t-k}, ..., w_{t+k} (here k=2) are its context, i.e. the two words before and the two words after the target word. The operation from the input layer to the hidden layer is the addition of the context vectors, and from the hidden layer to the output layer hierarchical Softmax or negative sampling is used. Referring to fig. 4c, the Skip-Gram model likewise has an input layer, a hidden layer and an output layer, but in contrast to the CBOW model it predicts P(w_i | w_t), where t-c ≤ i ≤ t+c and i ≠ t, and c is the window size (a constant giving the context size). Assuming a word sequence w_1, w_2, w_3, ..., w_T, the goal of Skip-Gram is to maximize the average log probability (1/T) Σ_{t=1}^{T} Σ_{-c ≤ j ≤ c, j ≠ 0} log P(w_{t+j} | w_t).
After the Word2vec model structure is determined, the model needs to be trained; the model training process may refer to the stock (offline) calculation flow of FIG. 3b.
Illustratively, the Word2vec model is pre-trained based on a long text database pre-constructed from business data corresponding to the business scenario and a short text database pre-constructed from business data corresponding to the business requirements under that scenario.
Here, the short text database is a database formed by a large number of short texts, and the collection sources of its data are related to the specific business requirements. For example, if the business requirement is to determine a reference answer from similar questions, then, because answers in different intelligent question-answering systems have different emphases, the data source of the short text database may be restricted to the question texts of the intelligent question-answering system corresponding to the target text. The long text database consists of texts with more sentences, such as articles or product descriptions, and its collection sources are likewise related to the specific business requirements. For example, if the business scenario is intelligent question answering, the long text database may be constructed by collecting, from any intelligent question-answering system, long texts with more logically connected sentences, such as expert-recommended articles, product introductions or product descriptions.
Because the target text may be either a short text or a long text, in order to enhance the text compatibility of the Word2vec model, this embodiment uses both a short text database and a long text database for model training; and in order to improve the semantic expressiveness of the Word2vec model, the more complete the vocabulary covered by the long text database, the better the training effect. In implementation, the training data is acquired first, that is, the long text database and the short text database are obtained according to the business scenario and business requirements. The data is then preprocessed, for example by word segmentation and data cleaning, to remove stop words and punctuation marks so that only valid data remains. Finally, the preprocessed long text database and short text database are input into the Word2vec model for model training, yielding a trained Word2vec model.
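The preprocessing step above (segmentation, then removal of stop words and punctuation) can be sketched as follows. This is a simplified illustration, not the patent's implementation: real Chinese word segmentation would use a dedicated segmenter, whereas here whitespace splitting stands in for it.

```python
def preprocess(texts, stopwords, punctuation=set(",.!?，。！？")):
    """Segment each text into tokens and drop stop words and
    punctuation marks, keeping only valid tokens."""
    cleaned = []
    for text in texts:
        tokens = text.split()  # placeholder for a real word segmenter
        tokens = [w for w in tokens if w not in stopwords and w not in punctuation]
        cleaned.append(tokens)
    return cleaned
```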
In the incremental calculation part, each target non-part-of-speech word is input into the trained word sense similarity model (the Word2vec model) to obtain a row vector representation of each target non-part-of-speech word. The row vectors are then averaged column-wise to obtain a single row vector of column means, which serves as the word sense feature vector of the target text. Likewise, the word sense feature vector corresponding to the candidate text can be obtained. Finally, the vector cosine of the word sense feature vector of the target text and the word sense feature vector of the candidate text is calculated, giving the word sense similarity of the target text and the candidate text.
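The column-wise averaging and cosine calculation just described can be sketched in a few lines. This is a minimal illustration with hypothetical function names; in practice the word vectors would come from the trained Word2vec model.

```python
import math

def sentence_vector(word_vectors):
    """Column-wise mean of the word row vectors, giving one row vector
    that serves as the word sense feature vector of the text."""
    n = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(v[d] for v in word_vectors) / n for d in range(dim)]

def cosine(a, b):
    """Vector cosine between two feature vectors, used as the similarity score."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```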
Illustratively, S230 comprises: inputting each target non-part-of-speech word into the word sense similarity model to generate the word sense feature vector corresponding to the target text; and determining, from the word sense feature vector and each row vector in a word sense feature matrix, the word sense similarity between the target text and the candidate text corresponding to that row vector, where the word sense feature matrix is generated from the non-part-of-speech word segmentation results of the text database and the word sense similarity model.
When each text in the text database is a candidate text, in order to improve operational efficiency, each candidate text in the text database can be preprocessed in advance by word segmentation and data cleaning to obtain the corresponding non-part-of-speech word segmentation result. These results are then input into the Word2vec model to obtain the word sense feature matrix of the text database, in which each row vector represents the feature vector of one candidate text. Finally, the vector cosine between the word sense feature vector and each row vector of the word sense feature matrix is calculated, giving the word sense similarity between the target text and each candidate text in the text database.
S240, determining the non-part-of-speech word frequency inverse word frequency similarity of the target text and the candidate text based on the word frequency inverse word frequency similarity model according to each target non-part-of-speech word, target text and candidate text.
The word frequency inverse word frequency similarity model in this embodiment adopts the term frequency-inverse document frequency (TF-IDF) model, which is constructed from the term frequency TF value and the inverse document frequency IDF value of a word. Term frequency (TF) is the frequency of occurrence of a given word in a text; the TF value is normalized to prevent a bias toward longer texts. Inverse document frequency (IDF) is a measure of the general importance of a word. The specific calculation formulas are as follows:

tfi,j = ni,j / Σk nk,j

idfi = log( |D| / |{j: ti∈dj}| )

tf-idfi,j = tfi,j · idfi

where ni,j represents the number of occurrences of word ti in text dj, Σk nk,j represents the total number of occurrences of all words in text dj, |D| is the total number of texts in the text database, and |{j: ti∈dj}| represents the number of texts in the text database containing word ti. To ensure that the denominator is not 0, |{j: ti∈dj}|+1 is typically used.
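A direct transcription of the TF and IDF definitions, as a hedged sketch (documents are represented as token lists; the +1 in the IDF denominator guards against division by zero as described):

```python
import math

def tf(word, doc):
    """tf_{i,j}: occurrences of word t_i in document d_j, divided by
    the total number of word occurrences in d_j."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """idf_i: log of |D| over the number of documents containing t_i,
    with +1 in the denominator to avoid division by zero."""
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / (containing + 1))

def tf_idf(word, doc, docs):
    """tf-idf_{i,j} = tf_{i,j} * idf_i."""
    return tf(word, doc) * idf(word, docs)
```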
According to the calculation formula of the TF-IDF value, the word frequency inverse word frequency model needs to be trained in advance on the text database to determine |D|, and the dimension of the vector characterizing a word is determined by the number of distinct words in the text database. After the trained TF-IDF model is obtained, each target non-part-of-speech word, the target text and the text database can be input into the TF-IDF model to obtain the TF-IDF value of each target non-part-of-speech word, forming the non-part-of-speech word frequency inverse word frequency feature vector of the target text. The number of columns of this vector is consistent with the vector dimension determined above: the element positions corresponding to the target non-part-of-speech words are filled with their TF-IDF values, and the remaining element positions are filled with 0. Likewise, the non-part-of-speech word frequency inverse word frequency feature vector corresponding to the candidate text can be obtained. Finally, the vector cosine of the two feature vectors is calculated, giving the non-part-of-speech word frequency inverse word frequency similarity of the target text and the candidate text.
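The zero-filled, vocabulary-length feature vector described above can be sketched as follows. This is illustrative only: the vocabulary order and the inline TF-IDF computation stand in for the trained model's internal state.

```python
import math

def tf_idf(word, doc, docs):
    # tf: normalized count in doc; idf: log(|D| / (docs containing word + 1))
    tf = doc.count(word) / len(doc)
    containing = sum(1 for d in docs if word in d)
    return tf * math.log(len(docs) / (containing + 1))

def feature_vector(words, doc, docs, vocab):
    """Row vector whose length equals the number of distinct words in the
    database: TF-IDF values at the positions of the text's words, 0 elsewhere."""
    vec = [0.0] * len(vocab)
    index = {w: i for i, w in enumerate(vocab)}
    for w in set(words):
        if w in index:
            vec[index[w]] = tf_idf(w, doc, docs)
    return vec
```

The cosine of two such vectors then gives the word frequency inverse word frequency similarity.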
Illustratively, S240 comprises: inputting each target non-part-of-speech word and the target text into the word frequency inverse word frequency similarity model to generate the non-part-of-speech word frequency inverse word frequency feature vector corresponding to the target text; and determining, from that feature vector and each row vector in a non-part-of-speech word frequency inverse word frequency feature matrix, the non-part-of-speech word frequency inverse word frequency similarity between the target text and the candidate text corresponding to that row vector, where the feature matrix is generated from the non-part-of-speech word segmentation results of the text database and the word frequency inverse word frequency similarity model.
When each text in the text database is a candidate text, in order to improve operational efficiency, each candidate text in the text database can be preprocessed in advance by word segmentation and data cleaning to obtain the corresponding non-part-of-speech word segmentation result. These results are input into the TF-IDF model to obtain the non-part-of-speech word frequency inverse word frequency feature matrix of the text database, whose number of columns is consistent with the vector dimension; element positions with no corresponding non-part-of-speech word are filled with 0, and each row vector of the matrix represents the feature vector of one candidate text. Finally, the vector cosine between the non-part-of-speech word frequency inverse word frequency feature vector and each row vector of the matrix is calculated, giving the non-part-of-speech word frequency inverse word frequency similarity between the target text and each candidate text.
S250, determining the text similarity of the target text and the candidate text according to the preset word sense weight, the preset non-part-of-speech word frequency inverse word frequency weight, the word sense similarity and the non-part-of-speech word frequency inverse word frequency similarity.
The preset word frequency inverse word frequency weight and the word frequency inverse word frequency similarity in formula (1) are determined to be the preset non-part-of-speech word frequency inverse word frequency weight w21 and the non-part-of-speech word frequency inverse word frequency similarity score21, respectively, so the text similarity of the target text and the candidate text can be obtained according to the following formula:

score = w1·score1 + w21·score21 (7)
According to the technical scheme of this embodiment, the word sense similarity and the non-part-of-speech word frequency inverse word frequency similarity of the target text and the candidate text are determined, and the text similarity is then determined according to the preset word sense weight, the preset non-part-of-speech word frequency inverse word frequency weight, the word sense similarity and the non-part-of-speech word frequency inverse word frequency similarity. The text similarity of the target text is thus determined from the two dimensions of word sense and non-part-of-speech word importance, which improves the determination accuracy of the text similarity to a certain extent.
Example III
This embodiment further optimizes, on the basis of the first embodiment, "determining the word sense similarity and the word frequency inverse word frequency similarity of the target text and the candidate text" and "determining the text similarity of the target text and the candidate text according to the preset word sense weight, the preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity". Explanations of terms that are the same as or correspond to those of the above embodiments are not repeated here. Referring to fig. 5, the text similarity determining method provided in this embodiment includes:
S310, obtaining target texts and alternative texts with similarity to be determined.
S320, word segmentation and part-of-speech tagging are carried out on the target text, and each target non-part-of-speech word and each target part-of-speech tagging word corresponding to the target text are obtained.
In this embodiment, the part of speech of each word needs to be distinguished, so after the target text is segmented, the part of speech of each target non-part-of-speech word is tagged to obtain the target part-of-speech tagged words. If a target non-part-of-speech word has two or more parts of speech, it generates two or more target part-of-speech tagged words.
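The generation of part-of-speech tagged words can be sketched minimally. The tagger itself is assumed (e.g. a tool such as jieba.posseg would supply the (word, pos) pairs); the word/pos token format here is illustrative, not from the patent.

```python
def pos_tagged_words(tagged):
    """Each (word, pos) pair becomes one part-of-speech tagged token;
    a word with two or more parts of speech yields one tagged token
    per part of speech, so the two uses are counted as distinct words."""
    return ["{}/{}".format(word, pos) for word, pos in tagged]
```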
S330, determining the word sense similarity of the target text and the candidate text based on a pre-trained word sense similarity model according to each target word without part of speech and the candidate text.
S340, determining word-part-containing word frequency inverse word frequency similarity of the target text and the candidate text based on the word frequency inverse word frequency similarity model according to each target word-part-of-speech tagged word, the target text and the candidate text.
In this embodiment, the model for determining the part-of-speech-containing word frequency inverse word frequency similarity is the same TF-IDF model, except that the introduction of parts of speech changes the model quantities Σk nk,j and |{j: ti∈dj}|, since words that differ only in part of speech are counted as distinct. The part-of-speech-containing word frequency inverse word frequency similarity is determined by inputting each target part-of-speech tagged word, the target text and the text database into the TF-IDF model to obtain the TF-IDF value of each target part-of-speech tagged word, forming the part-of-speech-containing word frequency inverse word frequency feature vector of the target text; the number of columns of this vector is consistent with the determined vector dimension, the element positions corresponding to the target part-of-speech tagged words are filled with their TF-IDF values, and the remaining element positions are filled with 0. Similarly, word segmentation and part-of-speech tagging are performed on the candidate text to obtain the part-of-speech-containing word frequency inverse word frequency feature vector corresponding to the candidate text. Finally, the vector cosine of the two feature vectors is calculated, giving the part-of-speech-containing word frequency inverse word frequency similarity of the target text and the candidate text.
Illustratively, S340 comprises: inputting each target part-of-speech tagged word and the target text into the word frequency inverse word frequency similarity model to generate the part-of-speech-containing word frequency inverse word frequency feature vector corresponding to the target text; and determining, from that feature vector and each row vector in a part-of-speech-containing word frequency inverse word frequency feature matrix, the part-of-speech-containing word frequency inverse word frequency similarity between the target text and the candidate text corresponding to that row vector, where the feature matrix is generated from the part-of-speech tagged word segmentation results of the text database and the word frequency inverse word frequency similarity model.
When each text in the text database is a candidate text, in order to improve operational efficiency, each candidate text in the text database can be preprocessed in advance by word segmentation, part-of-speech tagging and data cleaning to obtain the corresponding part-of-speech tagged word segmentation result. These results are then input into the TF-IDF model to obtain the part-of-speech-containing word frequency inverse word frequency feature matrix of the text database, whose number of columns is consistent with the number of distinct part-of-speech tagged words in the text database; element positions with no corresponding part-of-speech tagged word are filled with 0, and each row vector of the matrix represents the feature vector of one candidate text. Finally, the vector cosine between the part-of-speech-containing word frequency inverse word frequency feature vector and each row vector of the matrix is calculated, giving the part-of-speech-containing word frequency inverse word frequency similarity between the target text and each candidate text.
S350, determining the text similarity of the target text and the candidate text according to the preset word sense weight, the preset part-of-speech-containing word frequency inverse word frequency weight, the word sense similarity and the part-of-speech-containing word frequency inverse word frequency similarity.
The preset word frequency inverse word frequency weight and the word frequency inverse word frequency similarity in formula (1) are determined to be the preset part-of-speech-containing word frequency inverse word frequency weight w22 and the part-of-speech-containing word frequency inverse word frequency similarity score22, respectively, so the text similarity of the target text and the candidate text can be obtained according to the following formula:

score = w1·score1 + w22·score22 (8)
According to the technical scheme of this embodiment, the word sense similarity and the part-of-speech-containing word frequency inverse word frequency similarity of the target text and the candidate text are determined, and the text similarity is then determined according to the preset word sense weight, the preset part-of-speech-containing word frequency inverse word frequency weight, the word sense similarity and the part-of-speech-containing word frequency inverse word frequency similarity. The text similarity of the target text is thus determined from the two dimensions of word sense and part-of-speech word importance, which improves the determination accuracy of the text similarity to a certain extent.
Example IV
This embodiment further optimizes, on the basis of the first embodiment, "determining the word sense similarity and the word frequency inverse word frequency similarity of the target text and the candidate text" and "determining the text similarity of the target text and the candidate text according to the preset word sense weight, the preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity". Explanations of terms that are the same as or correspond to those of the above embodiments are not repeated here. Referring to fig. 6a, the text similarity determining method provided in this embodiment includes:
S410, obtaining target texts and alternative texts with similarity to be determined.
S420, performing word segmentation and part-of-speech tagging on the target text to obtain each target non-part-of-speech word and each target part-of-speech tagged word corresponding to the target text.
Referring to fig. 6b, the target text is subjected to word segmentation processing to obtain each target non-part-of-speech word, and to word segmentation plus part-of-speech tagging processing to obtain each target part-of-speech tagged word.
S430, determining word sense similarity of the target text and the candidate text based on a pre-trained word sense similarity model according to each target word without part of speech and the candidate text.
The word sense feature vector of the target text can be obtained from each target non-part-of-speech word and the trained Word2vec model, and word sense similarity calculation is performed between this feature vector and each row vector in the word sense feature matrix corresponding to the text database, giving the word sense similarity score1 between the target text and each candidate text in the text database. If there are n candidate texts in the text database, n values of score1 are obtained.
S440, determining the non-part-of-speech word frequency inverse word frequency similarity of the target text and the candidate text based on the word frequency inverse word frequency similarity model according to each target non-part-of-speech word, target text and candidate text.
The non-part-of-speech word frequency inverse word frequency feature vector of the target text can be obtained from each target non-part-of-speech word and the trained TF-IDF model, and non-part-of-speech word frequency inverse word frequency similarity calculation is performed between this feature vector and each row vector in the non-part-of-speech word frequency inverse word frequency feature matrix corresponding to the text database, giving the non-part-of-speech word frequency inverse word frequency similarity score21 between the target text and each candidate text. If there are n candidate texts in the text database, n values of score21 are obtained.
S450, determining word-part-containing word frequency inverse word frequency similarity of the target text and the candidate text based on the word frequency inverse word frequency similarity model according to each target word-part-of-speech tagged word, the target text and the candidate text.
The part-of-speech-containing word frequency inverse word frequency feature vector of the target text can be obtained from the target part-of-speech tagged words and the trained TF-IDF model, and part-of-speech-containing word frequency inverse word frequency similarity calculation is performed between this feature vector and each row vector in the part-of-speech-containing word frequency inverse word frequency feature matrix corresponding to the text database, giving the part-of-speech-containing word frequency inverse word frequency similarity score22 between the target text and each candidate text. If there are n candidate texts in the text database, n values of score22 are obtained.
S460, determining the text similarity of the target text and the candidate text according to the preset word sense weight, the preset non-part-of-speech word frequency inverse word frequency weight, the preset part-of-speech-containing word frequency inverse word frequency weight, the word sense similarity, the non-part-of-speech word frequency inverse word frequency similarity and the part-of-speech-containing word frequency inverse word frequency similarity.
The preset word frequency inverse word frequency weight in formula (1) is determined to be the preset non-part-of-speech word frequency inverse word frequency weight w21 and the preset part-of-speech-containing word frequency inverse word frequency weight w22, and the word frequency inverse word frequency similarity is determined to be the non-part-of-speech word frequency inverse word frequency similarity score21 and the part-of-speech-containing word frequency inverse word frequency similarity score22, so that the text similarity of the target text and the candidate text can be obtained according to the following formula:

score = w1·score1 + w21·score21 + w22·score22 (9)

If there are n candidate texts in the text database, n values of score are obtained through formula (9), where each score represents the text similarity between the target text and the corresponding candidate text.
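The weighted combination of formula (9) reduces to a one-line computation. The weight values in this sketch are purely illustrative; the patent only states that the weights are preset, not what they are.

```python
def text_similarity(score1, score21, score22, w1=0.5, w21=0.25, w22=0.25):
    """score = w1*score1 + w21*score21 + w22*score22, as in formula (9).
    The default weights here are illustrative placeholders."""
    return w1 * score1 + w21 * score21 + w22 * score22
```

Setting w22 = 0 recovers formula (7), and setting w21 = 0 recovers formula (8).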
S470, arranging the candidate texts in descending order according to the text similarity, and generating a ranking result.
After the text similarity of each candidate text is determined, a plurality of candidate texts whose similarity satisfies the business requirement may be selected from all the candidate texts. At this point, in order to further improve business operation efficiency, all the candidate texts may be arranged in descending order of text similarity to generate a ranking result.
S480, extracting a preset number of candidate texts from the ranking result to serve as similar texts of the target text.
The number of candidate texts to be selected, that is, the preset number, is determined according to the business requirement, and then the top-ranked preset number of candidate texts are extracted from the ranking result as the similar texts of the target text.
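Steps S470 and S480 together amount to a descending sort followed by a top-k selection, which can be sketched as follows (names are illustrative):

```python
def top_similar(candidates, scores, preset_number):
    """Sort candidate texts by text similarity in descending order and
    take the top preset_number as the similar texts of the target text."""
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:preset_number]]
```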
S490, when the business scene is an intelligent question-answer scene and the business requirement is an alternative answer for determining the target text, extracting an answer corresponding to each similar text from the short text database to serve as the alternative answer for the target text.
In the business scenario of intelligent question answering, the target text is a target short text, since questions and answers are usually short texts. If the business requirement is only to find similar short texts, the flow may end at S480. But if the business requirement is to determine alternative answers for the target text, the candidate texts need to be short texts in the short text database. In this case, the answer short text of each similar text is extracted from the short text database as an alternative answer for the target text, so as to provide a more accurate answer to the user in a faster and more convenient manner in intelligent customer service, or to assist human customer service by providing answers to similar questions.
According to the technical scheme of this embodiment, the word sense similarity, the non-part-of-speech word frequency inverse word frequency similarity and the part-of-speech-containing word frequency inverse word frequency similarity of the target text and the candidate text are determined, and the text similarity is then determined according to the preset word sense weight, the preset non-part-of-speech word frequency inverse word frequency weight, the preset part-of-speech-containing word frequency inverse word frequency weight and the three similarities. The text similarity of the target text is thus determined from the three dimensions of word sense, non-part-of-speech word importance and part-of-speech word importance, which improves the determination accuracy of the text similarity to a greater extent. The candidate texts are ranked according to the text similarity, and the answers of the top-ranked similar texts are determined as the alternative answers of the target text, which can improve the determination accuracy and efficiency of alternative answers in the intelligent question-answering system.
Example five
The present embodiment provides a text similarity determining apparatus, referring to fig. 7, which specifically includes:
a target text determining module 710, configured to obtain a target text and an alternative text of a similarity to be determined;
A first similarity determining module 720, configured to determine the word sense similarity and the word frequency inverse word frequency similarity of the target text and the candidate text, where the word frequency inverse word frequency similarity includes the non-part-of-speech word frequency inverse word frequency similarity and/or the part-of-speech-containing word frequency inverse word frequency similarity;
The second similarity determining module 730 is configured to determine the text similarity of the target text and the candidate text according to the preset word sense weight, the preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity, where the preset word frequency inverse word frequency weight includes the preset non-part-of-speech word frequency inverse word frequency weight and/or the preset part-of-speech-containing word frequency inverse word frequency weight.
Optionally, the first similarity determining module 720 is specifically configured to:
Word segmentation is carried out on the target text, and each target word without part of speech corresponding to the target text is obtained;
Determining word sense similarity of the target text and the candidate text based on a pre-trained word sense similarity model according to each target word without word part and the candidate text;
determining the non-part-of-speech word frequency inverse word frequency similarity of the target text and the candidate text based on the word frequency inverse word frequency similarity model according to each target non-part-of-speech word, the target text and the candidate text;
Accordingly, the second similarity determination module 730 is specifically configured to:
And determining the text similarity of the target text and the candidate text according to the preset word sense weight, the preset non-part-of-speech word frequency inverse word frequency weight, the word sense similarity and the non-part-of-speech word frequency inverse word frequency similarity.
Optionally, the first similarity determining module 720 is specifically configured to:
performing word segmentation and part-of-speech tagging on the target text to obtain each target word without part-of-speech and each target part-of-speech tagged word corresponding to the target text;
Determining word sense similarity of the target text and the candidate text based on a pre-trained word sense similarity model according to each target word without word part and the candidate text;
Determining word-part-containing word-frequency inverse word-frequency similarity of the target text and the candidate text based on the word-frequency inverse word-frequency similarity model according to each target word-part-of-speech tagged word, the target text and the candidate text;
Accordingly, the second similarity determination module 730 is specifically configured to:
and determining the text similarity of the target text and the candidate text according to the preset word sense weight, the preset part-of-speech-containing word frequency inverse word frequency weight, the word sense similarity and the part-of-speech-containing word frequency inverse word frequency similarity.
Optionally, the first similarity determining module 720 is specifically configured to:
performing word segmentation and part-of-speech tagging on the target text to obtain each target word without part-of-speech and each target part-of-speech tagged word corresponding to the target text;
Determining word sense similarity of the target text and the candidate text based on a pre-trained word sense similarity model according to each target word without word part and the candidate text;
determining the non-part-of-speech word frequency inverse word frequency similarity of the target text and the candidate text based on the word frequency inverse word frequency similarity model according to each target non-part-of-speech word, the target text and the candidate text;
Determining word-part-containing word-frequency inverse word-frequency similarity of the target text and the candidate text based on the word-frequency inverse word-frequency similarity model according to each target word-part-of-speech tagged word, the target text and the candidate text;
Accordingly, the second similarity determination module 730 is specifically configured to:
Determining the text similarity of the target text and the candidate text according to the preset word sense weight, the preset non-part-of-speech word frequency inverse word frequency weight, the preset part-of-speech-containing word frequency inverse word frequency weight, the word sense similarity, the non-part-of-speech word frequency inverse word frequency similarity and the part-of-speech-containing word frequency inverse word frequency similarity.
Optionally, the Word frequency inverse Word frequency similarity model is a Word frequency inverse Word frequency model, and the Word sense similarity model is a Word2vec model;
The Word2vec model is pre-trained based on a short text database and a long text database, wherein the long text database is pre-constructed based on service data corresponding to a service scene, and the short text database is pre-constructed based on service data corresponding to service requirements under the service scene.
Optionally, on the basis of the above device, the device further includes a similar text determining module, configured to:
When there are a plurality of candidate texts, after the text similarity between the target text and each candidate text has been determined, sorting the candidate texts in descending order of text similarity to generate a ranking result;
and extracting a preset number of candidate texts from the ranking result to serve as similar texts of the target text.
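The ranking step above reduces to a descending sort on the similarity score followed by a slice. A minimal sketch (function name and sample scores are illustrative, not from the patent):

```python
def top_n_similar(scored_candidates, n):
    """Sort (text, similarity) pairs by similarity, descending, keep the top n."""
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    return [text for text, _score in ranked[:n]]

candidates = [("how do I return an item", 0.41),
              ("when will my order arrive", 0.93),
              ("how do I change my address", 0.67)]

similar_texts = top_n_similar(candidates, 2)
# similar_texts == ["when will my order arrive", "how do I change my address"]
```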
Further, on the basis of the above device, the device further comprises an alternative answer determining module, configured to:
When the service scenario is an intelligent question-answering scenario and the service requirement is to determine alternative answers for the target text, the target text is a target short text and the candidate texts are short texts in a short text database. After a preset number of candidate texts are extracted from the ranking result to serve as similar texts of the target text, the answer corresponding to each similar text is extracted from the short text database to serve as an alternative answer to the target text.
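In the question-answering scenario, the final step is a lookup from each similar short text to its stored answer. A minimal sketch in which the short text database is modeled as a plain dict (the patent does not specify the storage layer; the questions and answers here are invented):

```python
# Hypothetical short text database: stored question -> stored answer.
short_text_db = {
    "when will my order arrive": "Most orders arrive within 3-5 business days.",
    "how do I change my address": "Update it under your account settings before dispatch.",
}

def alternative_answers(similar_texts, db):
    """Fetch the stored answer for each similar text found in the database."""
    return [db[text] for text in similar_texts if text in db]

answers = alternative_answers(
    ["when will my order arrive", "how do I change my address"], short_text_db)
```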
The text similarity determining device provided by the embodiment of the invention captures the overall sentence semantics of a text through multiple feature dimensions, namely word sense, word frequency and/or part of speech, so that the text similarity is characterized along different dimensions, a comprehensive text similarity is obtained, and the accuracy of the text similarity is greatly improved.
The text similarity determining device provided by the embodiment of the invention can execute the text similarity determining method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executing method.
It should be noted that, in the embodiment of the text similarity determining apparatus, the units and modules included are divided only according to functional logic, and the division is not limited to the above as long as the corresponding functions can be implemented; likewise, the specific names of the functional units are only for ease of distinction and are not intended to limit the protection scope of the present invention.
Example six
Referring to fig. 8, the present embodiment provides an apparatus comprising one or more processors 820 and a storage device 810 for storing one or more programs. When the one or more programs are executed by the one or more processors 820, the one or more processors 820 implement the text similarity determining method provided by the embodiment of the present invention, which includes:
Obtaining a target text and a candidate text whose similarity is to be determined;
Determining the word sense similarity and the word frequency-inverse word frequency similarity between the target text and the candidate text, wherein the word frequency-inverse word frequency similarity comprises a word frequency-inverse word frequency similarity without part of speech and/or a word frequency-inverse word frequency similarity with part of speech;
Determining the text similarity between the target text and the candidate text according to a preset word sense weight, a preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity, wherein the preset word frequency-inverse word frequency weight comprises a preset word frequency-inverse word frequency weight without part of speech and/or a preset word frequency-inverse word frequency weight with part of speech.
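The method above fuses two similarity signals by a weighted sum. A minimal sketch, assuming cosine similarity over smoothed TF-IDF vectors for the word frequency side and a placeholder score for the Word2vec side; the exact TF-IDF variant and the weight values are not fixed by the patent and are chosen here for illustration only:

```python
import math
from collections import Counter

def tfidf(doc, corpus):
    """TF-IDF weights for one tokenized doc against a tokenized corpus
    (smoothed IDF; one of several common variants)."""
    n = len(corpus)
    df = Counter(term for d in corpus for term in set(d))
    tf = Counter(doc)
    return {t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = [["order", "arrive", "when"], ["order", "return", "how"]]
target, candidate = corpus[0], corpus[1]

tfidf_sim = cosine(tfidf(target, corpus), tfidf(candidate, corpus))
sense_sim = 0.8   # placeholder: would come from the word sense (Word2vec) model

# Weighted fusion: these are the "preset" weights of the claim; the 0.6/0.4
# split is an arbitrary illustrative choice, to be tuned per application.
w_sense, w_tfidf = 0.6, 0.4
text_similarity = w_sense * sense_sim + w_tfidf * tfidf_sim
```

With both weights non-negative and summing to one, the fused score stays in the same [0, 1] range as its inputs, which keeps thresholds and rankings comparable across texts.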
Of course, those skilled in the art will appreciate that the processor 820 may also implement the technical solution of the text similarity determination method provided in any embodiment of the present invention.
The device shown in fig. 8 is merely an example and should not be construed as limiting the functionality or scope of use of embodiments of the present invention. As shown in fig. 8, the apparatus includes a processor 820, a storage device 810, an input device 830, and an output device 840. The number of processors 820 in the apparatus may be one or more; one processor 820 is shown in fig. 8 as an example. The processor 820, the storage device 810, the input device 830, and the output device 840 in the apparatus may be connected by a bus or other means; connection by the bus 850 is shown in fig. 8 as an example.
The storage device 810 is a computer readable storage medium, and may be used to store a software program, a computer executable program, and modules, such as program instructions/modules corresponding to the text similarity determining method in the embodiment of the present invention (for example, the target text obtaining module, the first similarity determining module, and the second similarity determining module in the text similarity determining device).
The storage device 810 may mainly include a storage program area that may store an operating system, an application program required for at least one function, and a storage data area that may store data created according to the use of the terminal, etc. In addition, storage 810 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, storage 810 may further include memory located remotely from processor 820, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 830 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the apparatus. The output device 840 may include a display device such as a display screen.
Example seven
The present embodiment provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are for performing a text similarity determination method comprising:
Obtaining a target text and a candidate text whose similarity is to be determined;
Determining the word sense similarity and the word frequency-inverse word frequency similarity between the target text and the candidate text, wherein the word frequency-inverse word frequency similarity comprises a word frequency-inverse word frequency similarity without part of speech and/or a word frequency-inverse word frequency similarity with part of speech;
Determining the text similarity between the target text and the candidate text according to a preset word sense weight, a preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity, wherein the preset word frequency-inverse word frequency weight comprises a preset word frequency-inverse word frequency weight without part of speech and/or a preset word frequency-inverse word frequency weight with part of speech.
Of course, the storage medium containing the computer executable instructions provided in the embodiments of the present invention is not limited to the above method operations, and may also perform the related operations in the text similarity determination method provided in any embodiment of the present invention.
From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk, or an optical disk of a computer, etc., and include several instructions for causing a device (which may be a personal computer, a server, or a network device, etc.) to perform the text similarity determination method provided by the embodiments of the present invention.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (10)

1. A method for determining text similarity, comprising:
obtaining a target text and a candidate text whose similarity is to be determined;
determining the word sense similarity and the word frequency-inverse word frequency similarity between the target text and the candidate text, wherein the word frequency-inverse word frequency similarity comprises a word frequency-inverse word frequency similarity without part of speech and/or a word frequency-inverse word frequency similarity with part of speech;
determining the text similarity between the target text and the candidate text according to a preset word sense weight, a preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity, wherein the preset word frequency-inverse word frequency weight comprises a preset word frequency-inverse word frequency weight without part of speech and/or a preset word frequency-inverse word frequency weight with part of speech; the preset word sense weight and the preset word frequency-inverse word frequency weight are respectively used to determine the proportions of the word sense similarity and the word frequency-inverse word frequency similarity in the determination of the text similarity.

2. The method according to claim 1, wherein determining the word sense similarity and the word frequency-inverse word frequency similarity between the target text and the candidate text comprises:
segmenting the target text to obtain each target word without part of speech corresponding to the target text;
determining the word sense similarity between the target text and the candidate text based on a pre-trained word sense similarity model according to each target word without part of speech and the candidate text;
determining the word frequency-inverse word frequency similarity without part of speech between the target text and the candidate text based on a word frequency-inverse word frequency similarity model according to each target word without part of speech, the target text and the candidate text;
correspondingly, determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity comprises:
determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset word frequency-inverse word frequency weight without part of speech, the word sense similarity and the word frequency-inverse word frequency similarity without part of speech.

3. The method according to claim 1, wherein determining the word sense similarity and the word frequency-inverse word frequency similarity between the target text and the candidate text comprises:
performing word segmentation and part-of-speech tagging on the target text to obtain each target word without part of speech and each target part-of-speech-tagged word corresponding to the target text;
determining the word sense similarity between the target text and the candidate text based on a pre-trained word sense similarity model according to each target word without part of speech and the candidate text;
determining the word frequency-inverse word frequency similarity with part of speech between the target text and the candidate text based on a word frequency-inverse word frequency similarity model according to each target part-of-speech-tagged word, the target text and the candidate text;
correspondingly, determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity comprises:
determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset word frequency-inverse word frequency weight with part of speech, the word sense similarity and the word frequency-inverse word frequency similarity with part of speech.

4. The method according to claim 1, wherein determining the word sense similarity and the word frequency-inverse word frequency similarity between the target text and the candidate text comprises:
performing word segmentation and part-of-speech tagging on the target text to obtain each target word without part of speech and each target part-of-speech-tagged word corresponding to the target text;
determining the word sense similarity between the target text and the candidate text based on a pre-trained word sense similarity model according to each target word without part of speech and the candidate text;
determining the word frequency-inverse word frequency similarity without part of speech between the target text and the candidate text based on a word frequency-inverse word frequency similarity model according to each target word without part of speech, the target text and the candidate text;
determining the word frequency-inverse word frequency similarity with part of speech between the target text and the candidate text based on a word frequency-inverse word frequency similarity model according to each target part-of-speech-tagged word, the target text and the candidate text;
correspondingly, determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity comprises:
determining the text similarity between the target text and the candidate text according to the preset word sense weight, the preset word frequency-inverse word frequency weight without part of speech, the preset word frequency-inverse word frequency weight with part of speech, the word sense similarity, the word frequency-inverse word frequency similarity without part of speech and the word frequency-inverse word frequency similarity with part of speech.

5. The method according to any one of claims 2 to 4, wherein the word frequency-inverse word frequency similarity model is a word frequency-inverse word frequency model, and the word sense similarity model is a Word2vec model;
the Word2vec model is pre-trained based on a short text database and a long text database, wherein the long text database is pre-constructed based on service data corresponding to a service scenario, and the short text database is pre-constructed based on service data corresponding to service requirements in the service scenario.

6. The method according to claim 1, wherein when there are a plurality of candidate texts, after determining the text similarity between the target text and the candidate texts, the method further comprises:
sorting the candidate texts in descending order of text similarity to generate a ranking result;
extracting a preset number of the candidate texts from the ranking result as similar texts of the target text.

7. The method according to claim 6, wherein when the service scenario is an intelligent question-answering scenario and the service requirement is to determine alternative answers for the target text, the target text is a target short text and the candidate texts are short texts in a short text database; after the extracting a preset number of the candidate texts from the ranking result as similar texts of the target text, the method further comprises:
extracting, from the short text database, the answer corresponding to each similar text as an alternative answer to the target text.

8. A device for determining text similarity, comprising:
a target text obtaining module, configured to obtain a target text and a candidate text whose similarity is to be determined;
a first similarity determining module, configured to determine the word sense similarity and the word frequency-inverse word frequency similarity between the target text and the candidate text, wherein the word frequency-inverse word frequency similarity comprises a word frequency-inverse word frequency similarity without part of speech and/or a word frequency-inverse word frequency similarity with part of speech;
a second similarity determining module, configured to determine the text similarity between the target text and the candidate text according to a preset word sense weight, a preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity, wherein the preset word frequency-inverse word frequency weight comprises a preset word frequency-inverse word frequency weight without part of speech and/or a preset word frequency-inverse word frequency weight with part of speech; the preset word sense weight and the preset word frequency-inverse word frequency weight are respectively used to determine the proportions of the word sense similarity and the word frequency-inverse word frequency similarity in the determination of the text similarity.

9. A device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the text similarity determination method according to any one of claims 1 to 7.

10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the text similarity determination method according to any one of claims 1 to 7.
CN201910600981.6A 2019-07-04 2019-07-04 Text similarity determination method, device, equipment and storage medium Active CN112182145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910600981.6A CN112182145B (en) 2019-07-04 2019-07-04 Text similarity determination method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910600981.6A CN112182145B (en) 2019-07-04 2019-07-04 Text similarity determination method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112182145A CN112182145A (en) 2021-01-05
CN112182145B true CN112182145B (en) 2025-01-17

Family

ID=73915404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910600981.6A Active CN112182145B (en) 2019-07-04 2019-07-04 Text similarity determination method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112182145B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505196B (en) * 2021-06-30 2024-01-30 和美(深圳)信息技术股份有限公司 Text retrieval method, device, electronic equipment and storage medium based on part of speech
CN113837594A (en) * 2021-09-18 2021-12-24 深圳壹账通智能科技有限公司 Quality evaluation method, system, device and medium for customer service in multiple scenes
CN114282967A (en) * 2021-12-21 2022-04-05 中国农业银行股份有限公司 Method and device for determining target product, electronic equipment and storage medium
CN115329742B (en) * 2022-10-13 2023-02-03 深圳市大数据研究院 Scientific research project output evaluation acceptance method and system based on text analysis
CN116167352B (en) * 2023-04-03 2023-07-21 联仁健康医疗大数据科技股份有限公司 Data processing method, device, electronic equipment and storage medium
CN116228249A (en) * 2023-05-08 2023-06-06 陕西拓方信息技术有限公司 Customer service system based on information technology

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284490A (en) * 2018-09-13 2019-01-29 武汉斗鱼网络科技有限公司 A kind of Text similarity computing method, apparatus, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2487403C1 (en) * 2011-11-30 2013-07-10 Федеральное государственное бюджетное учреждение науки Институт системного программирования Российской академии наук Method of constructing semantic model of document
CN105893410A (en) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 Keyword extraction method and apparatus
CN108595425A (en) * 2018-04-20 2018-09-28 昆明理工大学 Based on theme and semantic dialogue language material keyword abstraction method
CN109271626B (en) * 2018-08-31 2023-09-26 北京工业大学 Text semantic analysis method
CN109344236B (en) * 2018-09-07 2020-09-04 暨南大学 A problem similarity calculation method based on multiple features

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284490A (en) * 2018-09-13 2019-01-29 武汉斗鱼网络科技有限公司 A kind of Text similarity computing method, apparatus, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Improved TF-IDF combined with the cosine law for computing Chinese sentence similarity"; Zhang Junfei; Modern Computer; 30 November 2017 (No. 32); page 1, left column, paragraph 1 to page 4, right column, paragraph 2 *

Also Published As

Publication number Publication date
CN112182145A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
TWI732271B (en) Human-machine dialog method, device, electronic apparatus and computer readable medium
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN112182145B (en) Text similarity determination method, device, equipment and storage medium
CN106649818B (en) Application search intent identification method, device, application search method and server
CN111221962B (en) Text emotion analysis method based on new word expansion and complex sentence pattern expansion
CN110019732B (en) Intelligent question answering method and related device
CN104615767B (en) Training method, search processing method and the device of searching order model
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
WO2020062770A1 (en) Method and apparatus for constructing domain dictionary, and device and storage medium
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
WO2019080863A1 (en) Text sentiment classification method, storage medium and computer
CN108647205A (en) Fine granularity sentiment analysis model building method, equipment and readable storage medium storing program for executing
CN106126619A (en) A kind of video retrieval method based on video content and system
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
Hu et al. Text sentiment analysis: A review
CN109298796B (en) Word association method and device
Tiwari et al. Ensemble approach for twitter sentiment analysis
CN111090771A (en) Song searching method and device and computer storage medium
CN117851444A (en) An advanced search method based on semantic understanding
Mozafari et al. Emotion detection by using similarity techniques
CN108287875A (en) Personage's cooccurrence relation determines method, expert recommendation method, device and equipment
CN112463914B (en) Entity linking method, device and storage medium for internet service
CN117076636A (en) Information query method, system and equipment for intelligent customer service
CN113468311B (en) Knowledge graph-based complex question and answer method, device and storage medium
CN113761125A (en) Dynamic summary determination method and device, computing equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant