
CN114418016A - Efficient short text similarity determination method and device - Google Patents

Efficient short text similarity determination method and device

Info

Publication number
CN114418016A
CN114418016A (application CN202210078359.5A)
Authority
CN
China
Prior art keywords
word
texts
short
corpus
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210078359.5A
Other languages
Chinese (zh)
Other versions
CN114418016B (en)
Inventor
刘东亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Digital Service Technology Co ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202210078359.5A priority Critical patent/CN114418016B/en
Publication of CN114418016A publication Critical patent/CN114418016A/en
Application granted granted Critical
Publication of CN114418016B publication Critical patent/CN114418016B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

One aspect of the present disclosure relates to an efficient short text similarity determination method, including: segmenting the short texts in a corpus to obtain corresponding word sequences; determining a penalty based on the total number of short texts in the corpus, the penalty decreasing as the total number of short texts in the corpus increases; determining the word frequency and an adjusted inverse document frequency for each word in the word sequence, wherein the adjusted inverse document frequency is calculated based on the penalty; weighting the word frequency of each word with the adjusted inverse document frequency; combining the weighted word frequencies of the words in the word sequence to determine a word frequency vector for the short text; and determining the similarity of the short text to other short texts based on the word frequency vector. The present disclosure also relates to other related aspects.

Description

Efficient short text similarity determination method and device
Technical Field
The present application relates generally to Natural Language Processing (NLP), and more particularly to efficient short text similarity determination.
Background
Text similarity measurement is a common problem in the field of NLP. Academia and industry have studied different measurement methods for long texts and for short texts.
For long texts, there are generally two paradigms for similarity measurement. The first is to build vector representations of words or phrases, aggregate them into a vector representation of the long text, and then compute similarity on those vectors; common methods include word2vec and the bag-of-words (BoW) model. The second is to introduce a deep learning network that learns sentence or text vectors from context semantics, such as ELMo or BERT, and compute similarity directly on the constructed sentence vectors.
For short texts, there are likewise two classical paradigms. The first is to compute similarity coefficients directly at the character level, without vectorizing the sentences; common choices include Jaccard similarity, the Sørensen similarity coefficient, Levenshtein distance, and Hamming distance. The second is to segment the sentences into words and then vectorize them, as in the classical one-hot and TF-IDF algorithms.
However, similarity measures for short texts suffer from several problems: 1) character-level similarity measures ignore word order; and 2) for a large-scale corpus, the dimensionality of one-hot vectors grows linearly while data sparsity blurs similarity distinctions, and for a small corpus, the TF-IDF formula assigns overly large IDF values to low-frequency words, producing salient spikes in the text vectors and large swings in the similarity scores. Accordingly, there is a need in the art for improved, more efficient, and more accurate short text similarity determination techniques.
Disclosure of Invention
One aspect of the disclosure relates to a method for determining similarity of short texts, which includes performing word segmentation on short texts in a corpus to obtain corresponding word sequences; determining a penalty based on a total number of short texts in the corpus, the penalty decreasing as the total number of short texts in the corpus increases; determining a word frequency and an adjusted inverse document frequency for each word in the sequence of words, wherein the adjusted inverse document frequency is calculated based on the penalty; weighting the word frequency of each word by using the adjusted inverse document frequency; combining the weighted word frequencies of each word in the word sequence to determine a word frequency vector of the short text; and determining similarity of the short text to other short texts based on the word frequency vector.
According to some exemplary embodiments, determining the adjusted inverse document frequency for each word comprises determining a total number of texts in the corpus; determining the number of texts containing the word in the corpus; adjusting the number of texts in the corpus containing the word based on the penalty, so that the number of texts in the corpus containing the word is exponentially increased when the total number of short texts in the corpus is smaller than a first threshold; and determining the adjusted inverse document frequency based on the total number of texts and the adjusted number of texts containing the word in the corpus.
According to some exemplary embodiments, adjusting the number of texts in the corpus that contain the word based on the penalty further causes the number of texts in the corpus that contain the word not to be increased when the total number of short texts in the corpus is greater than a second threshold.
According to some of the example embodiments, the penalty comprises an exponential smoothing factor, and the adjusting comprises adding the exponential smoothing factor to the number of texts in the corpus containing the word.
According to some exemplary embodiments, weighting the word frequency of each word with the adjusted inverse document frequency comprises multiplying the word frequency by the adjusted inverse document frequency calculated based on the penalty.
According to some exemplary embodiments, if the word sequence obtained by segmenting the short text is different from the word sequence of the other short text in length, the word sequence of the short text or the word sequence of the other short text is padded or cut so that the lengths of the two are the same.
According to some exemplary embodiments, determining the similarity of the short text to other short texts based on the word frequency vector comprises calculating a cosine distance between the word frequency vector of the short text and the word frequency vector of the other short text.
Other aspects of the disclosure also include, among other things, apparatuses, devices, and computer-readable storage media that implement the functionality of the respective methods.
Drawings
Fig. 1 illustrates a schematic diagram of a short text similarity determination system in accordance with an aspect of the present disclosure.
Fig. 2 illustrates a schematic diagram of a word frequency determination apparatus in accordance with an aspect of the present disclosure.
Fig. 3 illustrates a flow diagram of a short text similarity determination method in accordance with an aspect of the present disclosure.
Fig. 4 illustrates a block diagram of a short text similarity determination apparatus in accordance with an aspect of the present disclosure.
Detailed Description
The Term Frequency-Inverse Document Frequency (TF-IDF) technique is a common weighting technique used in data retrieval and text mining to evaluate how important a single word is to a particular text in a text library or corpus. The importance of a word increases in proportion to the number of times it appears in the document, i.e., its term frequency (TF), but at the same time decreases in inverse proportion to how frequently it occurs across the corpus (IDF). If a word is rare overall but appears multiple times in an article, it likely reflects the characteristics of that article and is a desired keyword.
To find the keywords of a text, the text may be segmented into words and the word frequency of each word counted. Word frequency refers to the number of times a given word appears in the text; keywords appear in the text with high word frequency. However, high-frequency words with little standalone meaning, such as "of", "get", "ground", "is", and "also", are easily mistaken for keywords. Each word therefore needs a weight that reduces the influence of such insignificant high-frequency words while increasing the weight of words that are less common in the corpus at large but significant in the text at hand; this weight is the Inverse Document Frequency (IDF).
The inverse document frequency is a measure of a word's importance and is inversely proportional to how common the word is in general. For example, it may be calculated by dividing the total number of texts in the corpus by the number of texts in the corpus that contain the word, and taking the logarithm of the resulting quotient.
After TF and IDF are calculated, the two values are multiplied to obtain the word's TF-IDF value. The larger the TF-IDF value, the more important the word is to the text. Taking the several words with the highest TF-IDF values yields the keywords of the text.
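As an illustration (not part of the patent itself), the classic TF-IDF computation just described can be sketched in plain Python; the toy corpus and function names are hypothetical:

```python
import math
from collections import Counter

def tf(word, doc_words):
    # Term frequency: occurrences of `word` divided by the total word count of the text.
    return Counter(doc_words)[word] / len(doc_words)

def idf(word, corpus):
    # Inverse document frequency: log of (total texts / texts containing the word).
    # The +1 in the denominator avoids division by zero.
    n = len(corpus)
    m = sum(1 for doc in corpus if word in doc)
    return math.log(n / (m + 1))

def tf_idf(word, doc_words, corpus):
    # The TF-IDF value is the product of the two quantities.
    return tf(word, doc_words) * idf(word, corpus)

corpus = [["apple", "pie"], ["apple", "tree"], ["banana", "split"]]
```

Note how a word appearing in most documents (here "apple") receives a near-zero IDF, while rarer words are weighted up.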
According to an exemplary embodiment, the short text similarity measure of the present disclosure may be determined by generating the respective word frequency vectors of two short texts and calculating the cosine similarity of the two vectors. According to other exemplary embodiments, the measure may be determined by calculating other similarity metrics over the two word frequency vectors, including but not limited to Jaccard similarity, the Sørensen similarity coefficient, Levenshtein distance, and Hamming distance.
Fig. 1 illustrates a schematic diagram of a short text similarity determination system 100 in accordance with an aspect of the present disclosure. As shown in fig. 1, two or more short texts (e.g., short text 1 and short text 2) may each be input into the word segmentation unit 102 for segmentation to obtain a word sequence. Word segmentation may utilize various existing or future technologies. According to an exemplary embodiment, word segmentation tools applicable to a particular language may be used. For example, third-party word segmentation packages such as jieba may be utilized, as may other Chinese word segmentation toolkits such as THULAC, pkuseg, HanLP, etc.
The segmented short texts (i.e., the word sequences of the short texts) may contain different numbers of words. For example, short text 1 may be segmented into the word sequence "word 1, word 2, ..., word N" and short text 2 into the word sequence "word 1, word 2, ..., word M", where N may not equal M. In this case, the two lengths can be made equal by padding or cutting. As will be appreciated, expressions such as "word n" merely indicate that the word occupies the nth position in the short text; "word n" in different short texts need not be the same word.
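A minimal sketch of that length-equalization step (the pad token here is a hypothetical choice, not specified by the patent):

```python
def equalize(seq_a, seq_b, pad="<PAD>"):
    # Pad the shorter of the two word sequences so both have the same length.
    # Cutting the longer sequence would be the alternative mentioned in the text.
    target = max(len(seq_a), len(seq_b))
    return (seq_a + [pad] * (target - len(seq_a)),
            seq_b + [pad] * (target - len(seq_b)))

a, b = equalize(["word1", "word2", "word3"], ["word1", "word2"])
```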
The segmented short text may be input into the optimized word frequency calculation unit 104, which may calculate the word frequency of each word in each segmented short text. According to an exemplary embodiment, the word frequency in the present disclosure may be based on an optimized TF-IDF. That is, the optimized TF-IDF value of each word may be calculated, and the vector formed from these values, in the order of the words in the short text, becomes the word frequency vector corresponding to the short text. The calculation of the optimized TF-IDF value is further described below.
Word frequency vector 1 and word frequency vector 2, corresponding to short text 1 and short text 2 respectively, may be input into similarity unit 106, which may determine the similarity of the word frequency vectors. According to an exemplary embodiment, the similarity of the word frequency vectors may be measured using the cosine distance between the vectors. Finally, the similarity unit 106 may provide the similarity metric of the word frequency vectors as the similarity metric of the short texts.
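The cosine measure used by similarity unit 106 can be sketched as follows; this is a generic implementation, assuming the two vectors have already been brought to the same length, and the zero-vector convention is our own (the patent does not specify that case):

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity of two equal-length word frequency vectors.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)
```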
Short text similarity metrics according to the present application may be applied in various scenarios, including but not limited to aggregating shipping addresses commonly used across transactions, and the like.
Although the above description in conjunction with fig. 1 calculates the similarity of two short texts at a time, the present disclosure is not limited thereto and also covers embodiments that compare the similarity of more short texts at a time, as long as they follow the technical spirit disclosed herein.
Fig. 2 illustrates a schematic diagram of a word frequency determination apparatus 200 in accordance with an aspect of the present disclosure. The word frequency determining apparatus 200 may comprise or constitute the optimized word frequency calculating unit 104 described above in connection with fig. 1.
According to an exemplary embodiment, the word frequency determining apparatus 200 may include, but is not limited to, for example, a word frequency calculating unit 202, an inverse document frequency calculating unit 204, a word frequency weighting unit 206, and a word frequency vector generating unit 208, etc. Although the units described above are described herein as separate functional modules, it will be appreciated by those of ordinary skill in the art that the functions described in connection with the various units may be implemented by various techniques, such as software, firmware, general purpose hardware, special purpose hardware, or the like. Furthermore, each unit need not be implemented in separate software or hardware, but may be implemented, for example, by a general-purpose processor and memory. The functional division is only for the convenience of understanding of the ordinary skilled person in the art, and does not constitute any limitation to the implementation of the present disclosure.
According to an exemplary embodiment, the word frequency calculation unit 202 may obtain the segmented short text 230 from the text corpus 210 to calculate the original word frequency of a single word. For example, the original word frequency may be calculated as:
TF = (number of occurrences of the word in the text) / (total number of words in the text)    (1)
according to an exemplary embodiment, the inverse document frequency calculation unit 204 may calculate the inverse document frequency of the word in the corpus. For example, in general, the inverse document frequency may be calculated as:
IDF = log(n / (m + 1))    (2)
where n is the total number of texts in the corpus 220 obtained based on the text corpus 210, and m is the number of texts in the corpus that contain the word. The count of texts containing the word is increased by 1 to avoid a zero denominator.
TF-IDF-based word frequencies can depend strongly on corpus size and text length. For example, for a large-scale corpus, short text vectors are very sparse, and the presence of low-frequency words severely interferes with the similarity metric. On the other hand, for a small corpus, the IDF value jitters greatly, which blurs similarity distinctions.
To address this, an optimized IDF can be calculated. According to at least some example embodiments, calculating an optimized IDF may include introducing a penalty: when the total number of texts in the corpus is small, the IDF is given a larger penalty to mitigate jitter in IDF values; when the total number of texts is larger, the penalty is lighter and may gradually decrease as the corpus grows, making the weight coefficients of high-frequency and low-frequency words more stable.
According to an exemplary embodiment, the penalty may include a non-negative smoothing factor δm. When calculating the IDF of a particular word, the smoothing factor δm can be used to adjust the number of texts in the corpus that contain the word.
For example, when the total number of texts in the corpus is small (e.g., less than or equal to a first threshold), the smoothing factor δm may take a larger value. The number of texts containing the word is then significantly increased by the δm adjustment, so the IDF is reduced and jitter in IDF values is mitigated.
For example, according to some exemplary embodiments, when the total number of texts in the corpus is small (e.g., less than or equal to a first threshold), the smoothing factor δm may take a larger constant or other empirical value. According to further exemplary embodiments, in this regime the smoothing factor δm may take a value that drops significantly as the total number of texts increases.
On the other hand, when the total number of texts in the corpus is large (e.g., greater than the first threshold), the smoothing factor δm may take values that decrease slowly as the total number of texts in the corpus increases; the degree to which the count of texts containing the word is increased by the δm adjustment then shrinks more slowly as the corpus grows. According to some exemplary embodiments, when the total number of texts in the corpus increases beyond a certain point (e.g., above a second threshold), the smoothing factor δm may be taken directly as 0.
According to some exemplary embodiments, the smoothing factor δm may take a value that decreases exponentially with the total number of texts in the corpus.
According to some exemplary embodiments, adjusting the number of texts in the corpus that contain the word with the smoothing factor δm may include adding the smoothing factor δm to that number.
According to some exemplary embodiments, adjusting the number of texts in the corpus that contain the word with the smoothing factor δm may include multiplying that number by the smoothing factor δm.
According to some example embodiments, the first threshold for the total number of texts in the corpus may be, for example, 5. The present disclosure is not limited thereto; the first threshold may be larger or smaller.
According to some example embodiments, the second threshold for the total number of texts in the corpus may be, for example, 10. The present disclosure is not limited thereto; the second threshold may be larger or smaller.
According to some exemplary embodiments, the smoothing factor δm may be calculated as follows:
δm = e^(5 - n), n ≥ 0    (3)
where n is the total number of texts in the corpus. Accordingly, with the smoothing factor δm, the smoothed IDF may be calculated as follows:
IDF = log(n / (m + δm + 1))    (4)
According to further exemplary embodiments, the smoothing factor δm may be calculated as follows:
δm = a·e^(5 - n), m, n ≥ 0    (5)
where n is the total number of texts in the corpus, m is the number of texts in the corpus containing the word, and a is a constant greater than 0. Accordingly, with the smoothing factor δm, the smoothed IDF may be calculated as follows:
IDF = log(n / (m + a·e^(5 - n) + 1))    (6)
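A Python sketch of this smoothed IDF. The original formula images were lost in extraction, so the exact form here follows the reconstruction δm = a·e^(5 − n) added into the denominator; treat it as an illustration under that assumption rather than the patent's literal formula:

```python
import math

def smoothed_idf(n, m, a=1.0):
    # n: total number of texts in the corpus; m: texts containing the word.
    # delta_m is large for tiny corpora (n below roughly 5), penalizing the IDF,
    # and decays exponentially toward 0 as the corpus grows.
    delta_m = a * math.exp(5 - n)
    return math.log(n / (m + delta_m + 1))
```

For a large corpus (e.g., n = 1000) the penalty is negligible and the value approaches the classic log(n / (m + 1)).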
as can be seen, an optimized IDF may include giving the IDF a greater penalty to mitigate jitter in IDF values when the total number of texts in the corpus is small; and when the total number of the texts in the corpus is larger, the penalty is lightened, and the penalty can be gradually reduced along with the gradual increase of the total number of the texts in the corpus, so that the weight coefficients of the high-frequency words and the low-frequency words are more stable.
The invention provides a method for calculating the similarity of short texts by combining word segmentation and improved inverse document frequency.
Fig. 3 illustrates a flow diagram of a short text similarity determination method 300 in accordance with an aspect of the present disclosure. The short text similarity determination method 300 may include, for example, at block 302, tokenizing the short text to obtain a sequence of words.
According to an example embodiment, word segmentation may utilize various existing or future technologies. According to an exemplary embodiment, word segmentation tools applicable to a particular language may be used. For example, third-party word segmentation packages such as jieba may be utilized, as may other Chinese word segmentation toolkits such as THULAC, pkuseg, HanLP, etc.
According to an exemplary embodiment, the short text similarity determination method 300 may further include obtaining a text library and performing word segmentation on the short text in the text library as described at block 302, thereby obtaining a word segmentation list.
According to an exemplary embodiment, the short text similarity determination method 300 may optionally further include removing stop words from the participle list to construct a corpus. Stop words are specific characters or words that are automatically filtered out in natural language information retrieval to save storage space and improve search efficiency.
The short text similarity determination method 300 may further include, for example at block 304, determining a word frequency (TF) and an adjusted Inverse Document Frequency (IDF) for each word in the short text. The adjusted IDF may be determined, for example, in the manner described above in connection with fig. 1 and/or fig. 2: when the total number of texts in the corpus is small, the IDF is given a larger penalty to mitigate jitter in IDF values; when the total number of texts is larger, the penalty is lighter and gradually decreases as the corpus grows, making the weight coefficients of high-frequency and low-frequency words more stable.
The short text similarity determination method 300 may further include, for example at block 306, determining an optimized TF-IDF value for each word in the short text as the word's weighted word frequency, based on the word frequency (TF) and the adjusted Inverse Document Frequency (IDF). The optimized TF-IDF value may be the product of the word frequency (TF) of each word in the short text and the adjusted Inverse Document Frequency (IDF).
After determining the weighted word frequency of each word in the short text, the short text similarity determination method 300 may further include, for example at block 308, determining a word frequency vector for the short text. Determining the word frequency vector may include combining the optimized TF-IDF values of the words, in the order of the short text's word sequence, into the word frequency vector corresponding to the short text.
The short text similarity determination method 300 may further include comparing the similarity between two or more short texts based on the word frequency vector of the short texts, for example, at block 310. According to some example embodiments, the similarity between two or more short texts may be determined based on cosine similarity of their respective word frequency vectors. According to some example embodiments, the similarity between two or more short texts may be determined based on other similarity measures of their respective word frequency vectors.
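The steps of blocks 302 through 310 can be sketched end to end. The whitespace split below is a naive stand-in for a real segmenter such as jieba, the smoothing term follows the reconstructed formula δm = a·e^(5 − n), and all names are illustrative, not the patent's:

```python
import math
from collections import Counter

def short_text_similarity(text_a, text_b, corpus_texts, a=1.0):
    # Block 302: segmentation (whitespace split stands in for a real word segmenter).
    docs = [t.split() for t in corpus_texts]
    seq_a, seq_b = text_a.split(), text_b.split()
    n = len(docs)
    delta_m = a * math.exp(5 - n)  # penalty term (reconstructed form)
    vocab = sorted(set(seq_a) | set(seq_b))

    def weighted_vector(seq):
        # Blocks 304-308: TF weighted by the smoothed IDF, over a shared vocabulary.
        counts = Counter(seq)
        vec = []
        for w in vocab:
            tf = counts[w] / len(seq)
            m = sum(1 for d in docs if w in d)
            vec.append(tf * math.log(n / (m + delta_m + 1)))
        return vec

    u, v = weighted_vector(seq_a), weighted_vector(seq_b)
    # Block 310: cosine similarity of the two word frequency vectors.
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Identical texts map to identical vectors (similarity 1), while texts sharing no words have orthogonal vectors (similarity 0).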
Fig. 4 illustrates a block diagram of a short text similarity determination apparatus 400 in accordance with an aspect of the present disclosure. The short text similarity determination apparatus 400 may include, for example, a module 402 for tokenizing short text.
According to an exemplary embodiment, the module for tokenizing short text 402 may utilize various existing or future technologies. According to an exemplary embodiment, word segmentation tools applicable to a particular language may be used. For example, third party word segmentation packages such as jieba may be utilized, as well as other chinese word segmentation toolkits such as THULAC, pkuseg, Hanlp, etc.
According to an exemplary embodiment, the short text similarity determination apparatus 400 may further include a module (not shown) for acquiring a text library. The module for segmenting the short text 402 may segment the short text in the acquired text library, thereby obtaining a segmentation list.
According to an exemplary embodiment, the short text similarity determination apparatus 400 may optionally further include a module (not shown) for removing stop words from the participle list to construct a corpus. Stop words are specific characters or words that are automatically filtered out in natural language information retrieval to save storage space and improve search efficiency.
The short text similarity determination apparatus 400 may further include, for example, a module 404 for determining a word frequency (TF) and an adjusted Inverse Document Frequency (IDF) for each word in the short text. The adjusted IDF may be determined, for example, in the manner described above in connection with fig. 1 and/or fig. 2: when the total number of texts in the corpus is small, the IDF is given a larger penalty to mitigate jitter in IDF values; when the total number of texts is larger, the penalty is lighter and gradually decreases as the corpus grows, making the weight coefficients of high-frequency and low-frequency words more stable.
The short text similarity determination apparatus 400 may further include a module 406 for determining an optimized TF-IDF value for each word in the short text as a weighted word frequency for the word based on the word frequency (TF) and the adjusted Inverse Document Frequency (IDF). The optimized TF-IDF value may be a product of a word frequency (TF) of each word in the short text and an adjusted Inverse Document Frequency (IDF).
The short text similarity determination apparatus 400 may further include a module 408 for determining a word frequency vector for the short text after the weighted word frequency of each word has been determined. The module 408 may include, for example, a module for assembling the optimized TF-IDF values, in the order of the words in the short text, into the word frequency vector corresponding to the short text.
The short text similarity determination apparatus 400 may further include a module 410 for comparing the similarity between two or more short texts based on the word frequency vectors of the short texts, for example. According to some example embodiments, the similarity between two or more short texts may be determined based on cosine similarity of their respective word frequency vectors. According to some example embodiments, the similarity between two or more short texts may be determined based on other similarity measures of their respective word frequency vectors.
The short text similarity determination apparatus 400 may be implemented in software, firmware, and/or hardware, or any combination thereof. For example, the various modules of the apparatus 400 may be implemented in the form of processor-executable instructions stored on a computer-readable storage medium, such that when the instructions are read and executed by one or more processors of a computer, the computer performs the functions of modules 402 through 410 of the apparatus 400, and so on. As another example, the apparatus 400 may be implemented by a combination of a processor and a memory, the processor being coupled with the memory and executing programs and/or instructions stored in the memory to implement the functions of modules 402 through 410. As yet another example, the apparatus 400 may be implemented in various other dedicated or general-purpose firmware/hardware implementations.
The short text similarity calculation methods and apparatuses according to various embodiments of the present disclosure, which combine word segmentation with an improved inverse document frequency, eliminate or reduce the influence of corpus size and text length on the similarity measure between short texts. By improving the inverse document frequency calculation, the method can adaptively adjust the inverse document frequency value according to the corpus size and text length, smoothing the weight coefficients of high-frequency and low-frequency words, so that the resulting text vectors capture both the differences among low-frequency words and the similarity among high-frequency words. Similarities between short texts computed by these methods and apparatuses are smoother, more stable, and easier to interpret.
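As an illustration only, an adjusted inverse document frequency of this general shape might be sketched as follows. The specific penalty formula and the threshold values below are assumptions made for the sketch, since this excerpt describes only the qualitative behavior: a penalty that shrinks as the corpus grows, an exponential boost to document counts below a first threshold, and no adjustment above a second threshold.

```python
import math

def adjusted_idf(total_texts, texts_with_word,
                 first_threshold=1_000, second_threshold=100_000):
    """Illustrative adjusted IDF; formula and thresholds are assumed."""
    if total_texts >= second_threshold:
        # Large corpus: leave the document count unchanged (cf. claim 3).
        penalty = 0.0
    elif total_texts < first_threshold:
        # Small corpus: inflate the document count exponentially (cf. claim 2).
        penalty = math.exp(1.0 - total_texts / first_threshold)
    else:
        # In between: a mild smoothing factor that decays as the corpus grows.
        penalty = 1.0 / math.log(total_texts)
    # Add the smoothing factor to the number of texts containing the
    # word before taking the logarithm (cf. claim 4).
    return math.log(total_texts / (1.0 + texts_with_word + penalty))
```

As with the standard IDF, rarer words receive larger weights; the penalty merely smooths the document counts when the corpus is small.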
The foregoing describes merely exemplary embodiments of the present invention; the scope of the invention is not limited thereto. Any changes or substitutions that can readily be conceived by those skilled in the art within the technical scope of the present disclosure are intended to fall within the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable Logic Device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any commercially available processor, controller, microcontroller or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may reside in any form of storage medium known in the art. Some examples of storage media that may be used include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, and so forth. A software module may comprise a single instruction, or many instructions, and may be distributed over several different code segments, among different programs, and across multiple storage media. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor.
The methods disclosed herein comprise one or more steps or actions for achieving the described method. The method steps and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.
The processor may execute software stored on a machine-readable medium. A processor may be implemented with one or more general and/or special purpose processors. Examples include microprocessors, microcontrollers, DSP processors, and other circuitry capable of executing software. Software should be construed broadly to mean instructions, data, or any combination thereof, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. By way of example, a machine-readable medium may include RAM (random access memory), flash memory, ROM (read only memory), PROM (programmable read only memory), EPROM (erasable programmable read only memory), EEPROM (electrically erasable programmable read only memory), registers, a magnetic disk, an optical disk, a hard drive, or any other suitable storage medium, or any combination thereof. The machine-readable medium may be embodied in a computer program product. The computer program product may include packaging material.
In a hardware implementation, the machine-readable medium may be a part of the processing system that is separate from the processor. However, as those skilled in the art will readily appreciate, the machine-readable medium, or any portion thereof, may be external to the processing system. By way of example, a machine-readable medium may include a transmission line, a carrier wave modulated by data, and/or a computer product separate from the wireless node, all of which may be accessed by a processor through a bus interface. Alternatively or additionally, the machine-readable medium or any portion thereof may be integrated into a processor, such as a cache and/or a general register file, as may be the case.
The processing system may be configured as a general purpose processing system having one or more microprocessors that provide processor functionality, and an external memory that provides at least a portion of the machine readable medium, all linked together with other supporting circuitry through an external bus architecture. Alternatively, the processing system may be implemented with an ASIC (application specific integrated circuit) having a processor, a bus interface, a user interface (in the case of an access terminal), support circuitry, and at least a portion of a machine readable medium integrated in a single chip, or with one or more FPGAs (field programmable gate arrays), PLDs (programmable logic devices), controllers, state machines, gated logic, discrete hardware components, or any other suitable circuitry, or any combination of circuitry that is capable of performing the various functionalities described throughout this disclosure. Depending on the particular application and the overall design constraints imposed on the overall system, those skilled in the art will recognize how to better implement the functionality described with respect to the processing system.
The machine-readable medium may include several software modules. These software modules include instructions that, when executed by a device, such as a processor, cause the processing system to perform various functions. These software modules may include a transmitting module and a receiving module. Each software module may reside in a single storage device or be distributed across multiple storage devices. As an example, a software module may be loaded into RAM from a hard drive when a triggering event occurs. During execution of the software module, the processor may load some instructions into the cache to increase access speed. One or more cache lines may then be loaded into a general register file for execution by the processor. When referring to the functionality of a software module below, it will be understood that such functionality is implemented by the processor when executing instructions from the software module.
If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as Infrared (IR), radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disc), as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where a disk (disk) usually reproduces data magnetically, and a disc (disc) reproduces data optically with a laser. Thus, in some aspects, computer-readable media may comprise non-transitory computer-readable media (e.g., tangible media). Additionally, for other aspects, the computer-readable medium may comprise a transitory computer-readable medium (e.g., a signal). Combinations of the above should also be included within the scope of computer-readable media.
Accordingly, certain aspects may comprise a computer program product for performing the operations presented herein. For example, such a computer program product may include a computer-readable medium having instructions stored (and/or encoded) thereon, the instructions being executable by one or more processors to perform the operations described herein. In certain aspects, a computer program product may include packaging materials.
It is to be understood that the claims are not limited to the precise configuration and components illustrated above. Various changes, substitutions and alterations in the arrangement, operation and details of the method and apparatus described above may be made without departing from the scope of the claims.

Claims (15)

1. A short text similarity determination method comprises the following steps:
performing word segmentation on short texts in a corpus to obtain corresponding word sequences;
determining a penalty based on a total number of short texts in the corpus, the penalty decreasing as the total number of short texts in the corpus increases;
determining a word frequency and an adjusted inverse document frequency for each word in the sequence of words, wherein the adjusted inverse document frequency is calculated based on the penalty;
weighting the word frequency of each word by using the adjusted inverse document frequency;
combining the weighted word frequencies of each word in the word sequence to determine a word frequency vector of the short text; and
determining similarity of the short text to other short texts based on the word frequency vector.
2. The method of claim 1, wherein determining the adjusted inverse document frequency for each word comprises:
determining a total number of texts in the corpus;
determining the number of texts containing the word in the corpus;
adjusting the number of texts in the corpus containing the word based on the penalty, so that the number of texts in the corpus containing the word is exponentially increased when the total number of short texts in the corpus is smaller than a first threshold; and
determining the adjusted inverse document frequency based on the total number of texts and the adjusted number of texts in the corpus containing the word.
3. The method of claim 2, wherein adjusting the number of texts in the corpus containing the word based on the penalty further causes the number of texts in the corpus containing the word not to be increased when the total number of short texts in the corpus is greater than a second threshold.
4. The method of claim 2, wherein the penalty comprises an exponential smoothing factor and the adjusting comprises adding the exponential smoothing factor to the number of texts in the corpus containing the word.
5. The method of claim 1, wherein weighting the word frequency of each word with the adjusted inverse document frequency comprises multiplying the word frequency by the adjusted inverse document frequency calculated based on the penalty.
6. The method of claim 1, further comprising:
if the word sequence obtained by segmenting the short text differs in length from the word sequences of the other short texts, padding or truncating the word sequence of the short text or the word sequences of the other short texts so that the word sequences have the same length.
7. The method of claim 1, wherein determining similarity of the short text to other short texts based on the word frequency vector comprises:
calculating a cosine distance between the word frequency vector of the short text and the word frequency vectors of the other short texts.
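Purely for illustration, and not as a restatement of the claims, the claimed method steps might be sketched end to end as follows. Whitespace tokenization stands in for the word segmentation step, and the penalty term here is an assumed placeholder for the claimed penalty (any formula that decreases as the corpus grows would fit the sketch):

```python
import math
from collections import Counter

def text_similarity(text_a, text_b, corpus):
    """Illustrative pipeline: segment, weight TF by adjusted IDF,
    build order-preserving word frequency vectors, pad/truncate to a
    common length, then compare by cosine similarity."""
    docs = [t.split() for t in corpus]       # stand-in for word segmentation
    n = len(docs)
    penalty = 1.0 / math.log(n + math.e)     # assumed penalty: decays as corpus grows

    def word_freq_vector(text, length):
        words = text.split()
        tf = Counter(words)
        vec = []
        for w in words:                      # preserve the word order of the sequence
            df = sum(1 for d in docs if w in d)
            idf = math.log(n / (1.0 + df + penalty))   # adjusted inverse document frequency
            vec.append((tf[w] / len(words)) * idf)     # weighted word frequency
        return (vec + [0.0] * length)[:length]         # pad or truncate to a common length

    length = max(len(text_a.split()), len(text_b.split()))
    va = word_freq_vector(text_a, length)
    vb = word_freq_vector(text_b, length)
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(y * y for y in vb))
    return dot / (na * nb) if na and nb else 0.0       # cosine similarity
```

For example, comparing a short text against itself over any corpus yields a similarity of 1, and the measure is symmetric in its two text arguments.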
8. A short text similarity determination apparatus comprising:
a memory; and
a processor coupled to the memory, the processor configured to:
performing word segmentation on short texts in a corpus to obtain corresponding word sequences;
determining a penalty based on a total number of short texts in the corpus, the penalty decreasing as the total number of short texts in the corpus increases;
determining a word frequency and an adjusted inverse document frequency for each word in the sequence of words, wherein the adjusted inverse document frequency is calculated based on the penalty;
weighting the word frequency of each word by using the adjusted inverse document frequency;
combining the weighted word frequencies of each word in the word sequence to determine a word frequency vector of the short text; and
determining similarity of the short text to other short texts based on the word frequency vector.
9. The short text similarity determination apparatus of claim 8, wherein the processor being configured to determine the adjusted inverse document frequency for each word comprises the processor being configured to:
determining a total number of texts in the corpus;
determining the number of texts containing the word in the corpus;
adjusting the number of texts in the corpus containing the word based on the penalty, so that the number of texts in the corpus containing the word is exponentially increased when the total number of short texts in the corpus is smaller than a first threshold; and
determining the adjusted inverse document frequency based on the total number of texts and the adjusted number of texts in the corpus containing the word.
10. The short text similarity determination apparatus of claim 9, wherein the processor being configured to adjust the number of texts in the corpus containing the word based on the penalty further comprises the processor being configured such that the number of texts in the corpus containing the word is not increased when the total number of short texts in the corpus is greater than a second threshold.
11. The short text similarity determination apparatus of claim 9, wherein the penalty comprises an exponential smoothing factor and the adjustment comprises adding the exponential smoothing factor to the number of texts in the corpus containing the word.
12. The short text similarity determination apparatus of claim 8, wherein the processor being configured to weight the word frequency of each word with the adjusted inverse document frequency comprises the processor being configured to multiply the word frequency by the adjusted inverse document frequency calculated based on the penalty.
13. The short text similarity determination apparatus of claim 8, wherein the processor is further configured to:
if the word sequence obtained by segmenting the short text differs in length from the word sequences of the other short texts, pad or truncate the word sequence of the short text or the word sequences of the other short texts so that the word sequences have the same length.
14. The short text similarity determination apparatus of claim 8, wherein the processor being configured to determine the similarity of the short text to other short texts based on the word frequency vector comprises the processor being configured to:
calculating a cosine distance between the word frequency vector of the short text and the word frequency vectors of the other short texts.
15. A short text similarity determination apparatus comprising:
a module for segmenting words from short texts in a corpus to obtain corresponding word sequences;
means for determining a penalty based on a total number of short texts in the corpus, the penalty decreasing as the total number of short texts in the corpus increases;
means for determining a word frequency and an adjusted inverse document frequency for each word in the sequence of words, wherein the adjusted inverse document frequency is calculated based on the penalty;
a module for weighting the word frequency of each word with the adjusted inverse document frequency;
means for combining the weighted word frequencies of each word in the sequence of words to determine a word frequency vector for the short text; and
means for determining a similarity of the short text to other short texts based on the word frequency vector.
CN202210078359.5A 2022-01-24 2022-01-24 Efficient short text similarity determination method and device Active CN114418016B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210078359.5A CN114418016B (en) 2022-01-24 2022-01-24 Efficient short text similarity determination method and device


Publications (2)

Publication Number Publication Date
CN114418016A true CN114418016A (en) 2022-04-29
CN114418016B CN114418016B (en) 2025-10-17

Family

ID=81277043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210078359.5A Active CN114418016B (en) 2022-01-24 2022-01-24 Efficient short text similarity determination method and device

Country Status (1)

Country Link
CN (1) CN114418016B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030079185A1 (en) * 1998-10-09 2003-04-24 Sanjeev Katariya Method and system for generating a document summary
US7376635B1 (en) * 2000-07-21 2008-05-20 Ford Global Technologies, Llc Theme-based system and method for classifying documents
CN106095737A (en) * 2016-06-07 2016-11-09 杭州凡闻科技有限公司 Documents Similarity computational methods and similar document the whole network retrieval tracking
CN108121821A (en) * 2018-01-09 2018-06-05 惠龙易通国际物流股份有限公司 A kind of machine customer service method, equipment and computer storage media
CN108133752A (en) * 2017-12-21 2018-06-08 新博卓畅技术(北京)有限公司 A kind of optimization of medical symptom keyword extraction and recovery method and system based on TFIDF
CN109241400A (en) * 2018-07-12 2019-01-18 广东技术师范学院 A kind of medical information resource intelligent searching system
CN112365942A (en) * 2020-10-20 2021-02-12 哈尔滨学院 Infectious disease epidemic risk prediction analysis method


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
G. Paltoglou, M. Thelwall: "A study of information retrieval weighting schemes for sentiment analysis", Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 31 December 2010 (2010-12-31), pages 1386 - 1395 *
He Keda; Zhu Zhengtao; Cheng Yu: "Research on a text classification method based on an improved TF-IDF algorithm", Journal of Guangdong University of Technology, vol. 33, no. 05, 30 September 2016 (2016-09-30) *
Zhao Hang: "Research on web page authority ranking and classification algorithms based on link reputation analysis", China Masters' Theses Full-text Database, 31 October 2012 (2012-10-31), pages 7 - 16 *
Jing Hui; Yang Zhenyu; Yu Min: "Research on text classification based on improved TF-IDF and a compressed auto-encoder", Journal of Qilu University of Technology (Natural Science Edition), vol. 31, no. 03, 30 June 2017 (2017-06-30) *

Also Published As

Publication number Publication date
CN114418016B (en) 2025-10-17

Similar Documents

Publication Publication Date Title
CN111753167B (en) Search for processing methods, apparatus, computer equipment and media
US9213943B2 (en) Parameter inference method, calculation apparatus, and system based on latent dirichlet allocation model
US9348895B2 (en) Automatic suggestion for query-rewrite rules
CN110674621B (en) Attribute information filling method and device
CN106446122B (en) Information retrieval method, device and computing device
CN110866095B (en) Text similarity determining method and related equipment
US20200394269A1 (en) Machine Learning Based Intent Resolution for a Digital Assistant
US9116977B2 (en) Searching information
US9275064B2 (en) Caching of deep structures for efficient parsing
US11462038B2 (en) Interpreting text classification predictions through deterministic extraction of prominent n-grams
Güttel et al. A sketch-and-select Arnoldi process
CN113934842A (en) Text clustering method and device and readable storage medium
US8335749B2 (en) Generating a set of atoms
CN112948545A (en) Duplicate checking method, terminal equipment and computer readable storage medium
US20240078393A1 (en) Search-engine-augmented dialogue response generation with cheaply supervised query production
CN110852057A (en) Method and device for calculating text similarity
CN111324705B (en) System and method for adaptively adjusting associated search terms
CN114418016A (en) Efficient short text similarity determination method and device
Lopes et al. Accelerating block coordinate descent methods with identification strategies
US12282513B2 (en) Optimistic facet set selection for dynamic faceted search
CN114328855B (en) Document retrieval methods, devices, electronic devices, and readable storage media
US20130339003A1 (en) Assisted Free Form Decision Definition Using Rules Vocabulary
CN110083835A (en) A kind of keyword extracting method and device based on figure and words and phrases collaboration
CN111625579A (en) Information processing method, device and system
CN116010605A (en) Long text classification method, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 310000 Zhejiang Province, Hangzhou City, Xihu District, Xixi Road 543-569 (continuous odd numbers) Building 1, Building 2, 5th Floor, Room 518

Patentee after: Alipay (Hangzhou) Digital Service Technology Co.,Ltd.

Country or region after: China

Address before: 801-11, Section B, 8th floor, No. 556, Xixi Road, Xihu District, Hangzhou City, Zhejiang Province, 310023

Patentee before: Alipay (Hangzhou) Information Technology Co., Ltd.

Country or region before: China
