CN114418016A

CN114418016A - Efficient short text similarity determination method and device

Info

Publication number: CN114418016A
Application number: CN202210078359.5A
Authority: CN
Inventors: 刘东亚
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Digital Service Technology Co ltd
Priority date: 2022-01-24
Filing date: 2022-01-24
Publication date: 2022-04-29
Anticipated expiration: 2042-01-24
Also published as: CN114418016B

Abstract

One aspect of the present disclosure relates to an efficient short text similarity determination method, which includes segmenting short texts in a corpus to obtain corresponding word sequences; determining a penalty based on the total number of short texts in the corpus, and the penalty varies with decrease as the total number of short texts in the corpus increases; determine the word frequency and adjusted inverse document frequency for each word in the word sequence, wherein the adjusted inverse document frequency is calculated based on the penalty; use all weighting the word frequency of each word by describing the adjusted inverse document frequency; combining the weighted word frequency of each word in the word sequence to determine a word frequency vector for the short text; and determining the short text based on the word frequency vector Similarity to other short texts. The present disclosure also relates to other related aspects.

Description

Efficient short text similarity determination method and device

Technical Field

The present application relates generally to Natural Language Processing (NLP), and more particularly to efficient short text similarity determination.

Background

Text similarity measures are a common problem in the field of NLP. Different measurement methods have been studied separately in academic and industrial circles for long and short texts.

For similarity measurement methods for long texts, there are generally two paradigms: firstly, vectorization representation is performed on words or phrases, similarity is calculated after vector representation of long texts is obtained through aggregation, and the following methods are common: word2vec, Bow model, etc.; and secondly, introducing a deep learning network structure, learning vectors of sentences or texts according to context semantics, wherein Elmo, Bert and the like are common, and directly calculating the similarity through the constructed sentence vectors.

For the similarity measurement method of short text, there are also two classical paradigms in general: firstly, directly calculating similarity coefficients at a character level without vectorizing and characterizing sentences, wherein the similarity coefficients commonly include jaccard similarity, Sorensen similarity coefficient, Levenshtein distance, Hamming distance and the like; and secondly, vectorizing and characterizing the sentences after word segmentation, such as classical one-hot and TF-idf algorithms.

However, there are several problems with similarity measures for short text, including: 1) the similarity measurement at the character level ignores the word order factor; 2) for a large-scale corpus, the vector dimension of one-hot is increased linearly, meanwhile, the similarity is not distinguished obviously due to data sparsity, and for a small-magnitude corpus, the idf value of low-frequency words is larger due to the TF-idf calculation formula, salient points exist in text vectors, and the similarity vibration is larger. Accordingly, there is a need in the art for improved more efficient, more accurate short text similarity determination techniques.

Disclosure of Invention

One aspect of the disclosure relates to a method for determining similarity of short texts, which includes performing word segmentation on short texts in a corpus to obtain corresponding word sequences; determining a penalty based on a total number of short texts in the corpus, the penalty decreasing as the total number of short texts in the corpus increases; determining a word frequency and an adjusted inverse document frequency for each word in the sequence of words, wherein the adjusted inverse document frequency is calculated based on the penalty; weighting the word frequency of each word by using the adjusted inverse document frequency; combining the weighted word frequencies of each word in the word sequence to determine a word frequency vector of the short text; and determining similarity of the short text to other short texts based on the word frequency vector.

According to some exemplary embodiments, determining the adjusted inverse document frequency for each word comprises determining a total number of texts in the corpus; determining the number of texts containing the word in the corpus; adjusting the number of texts in the corpus containing the word based on the penalty, so that the number of texts in the corpus containing the word is exponentially increased when the total number of short texts in the corpus is smaller than a first threshold; and determining the adjusted inverse document frequency based on the total number of texts and the adjusted number of texts containing the word in the corpus.

According to some exemplary embodiments, adjusting the number of texts in the corpus that contain the word based on the penalty further causes the number of texts in the corpus that contain the word not to be increased when the total number of short texts in the corpus is greater than a second threshold.

According to some of the example embodiments, the penalty comprises an exponential smoothing factor, and the adjusting comprises adding the exponential smoothing factor to the number of texts in the corpus containing the word.

According to some exemplary embodiments, weighting the word frequency of each word with the adjusted inverse document frequency comprises multiplying the word frequency by the adjusted inverse document frequency based on the penalty calculation.

According to some exemplary embodiments, if the word sequence obtained by segmenting the short text is different from the word sequence of the other short text in length, the word sequence of the short text or the word sequence of the other short text is padded or cut so that the lengths of the two are the same.

According to some exemplary embodiments, determining the similarity of the short text to other short texts based on the word frequency vector comprises calculating a cosine distance between the word frequency vector of the short text and the word frequency vector of the other short text.

Other aspects of the disclosure also include, among other things, apparatuses, devices, and computer-readable storage media that implement the functionality of the respective methods.

Drawings

Fig. 1 illustrates a schematic diagram of a short text similarity determination system in accordance with an aspect of the present disclosure.

Fig. 2 illustrates a schematic diagram of a word frequency determination apparatus in accordance with an aspect of the present disclosure.

Fig. 3 illustrates a flow diagram of a short text similarity determination method in accordance with an aspect of the present disclosure.

Fig. 4 illustrates a block diagram of a short text similarity determination apparatus in accordance with an aspect of the present disclosure.

Detailed Description

The Term Frequency-Inverse Document Frequency (TF-IDF) technique is a common weighting technique used for data retrieval and text mining, and can be used to evaluate the importance degree of a single word to a certain text in a text library or corpus. The importance of a word increases in proportion to the number of times it appears in the document, i.e. the word frequency (TF), but at the same time decreases in inverse proportion to the frequency of its occurrence in the corpus (IDF). If a word is rare but it appears multiple times in the article, it is likely to reflect the characteristics of the article, and it is the desired keyword.

In order to count the keywords of the text, the text may be segmented, and then the word frequency of each word may be counted. Word frequency refers to the number of times a given word appears in the text. Keywords of text appear in the text with a high word frequency. However, high frequency words with insignificant meaning such as "of", "get", "ground", "is", "also" are easily mistaken for keywords. Thus, each word needs to be weighted to reduce the weight of such high-frequency words that are not significant, while words that are less common to the average meaning in the corpus but significant in the text increase their weight, which can be referred to as Inverse Document Frequency (IDF).

The inverse document frequency is a measure of the importance of a word and is inversely proportional to how common the word is in a general sense. For example, the inverse document frequency may be calculated by dividing the total number of texts in the corpus by the number of texts in the corpus that contain the word, and taking the logarithm of the resulting quotient.

After the TF and IDF are calculated, the two values are multiplied to obtain the TF-IDF value of the word. The larger the TF-IDF value, the higher the importance of the word to the text. And taking a plurality of words with the highest TF-IDF value in the text to obtain the key words of the text.

According to an exemplary embodiment, the short text similarity measure of the present disclosure may be determined by generating respective word frequency vectors of two short texts and calculating cosine similarity of the two word frequency vectors. According to other exemplary embodiments, the short text similarity metric of the present disclosure may be determined by generating respective word frequency vectors of two short texts and calculating other similarity metrics of the two word frequency vectors, such as including but not limited to jaccard similarity, Sorensen similarity coefficient, Levenshtein distance, and hamming distance, among others.

Fig. 1 illustrates a schematic diagram of a short text similarity determination system 100 in accordance with an aspect of the present disclosure. As shown in fig. 1, two or more short texts (e.g., short text 1 and short text 2) may be input into the word segmentation unit 102 for segmentation, respectively, to obtain a word sequence. Word segmentation may utilize various existing or future technologies. According to an exemplary embodiment, word segmentation tools applicable to a particular language may be used. For example, third party word segmentation packages such as jieba may be utilized, as well as other chinese word segmentation toolkits such as THULAC, pkuseg, Hanlp, etc.

The segmented short text (i.e., the sequence of words of the short text) may include a different number of words. For example, short text 1 may be participled as the sequence of words "word 1, word2, … …, word N" and short text 2 may be participled as the sequence of words "word 1, word2, … …, word M", where N may not equal M. In this case, the two lengths can be made equal by padding or cutting. In addition, as will be appreciated, expressions such as "word n" are merely intended to indicate that the word has the nth word order in the short text, and "word n" in different short texts need not be the same word.

The segmented short text may be input into the optimized word frequency calculation unit 104. The optimized word frequency calculating unit 104 may calculate the word frequency of each word in each participled short text. According to an exemplary embodiment, the word frequency in the present disclosure may be based on an optimized TF-IDF. That is, the optimized TF-IDF value for each word may be calculated and a vector of optimized TF-IDF values formed in an order corresponding to the word order of each word in the short text becomes a word frequency vector corresponding to the short text. The calculation method of the optimized TF-IDF value is further described below.

Word frequency vector 1 and word frequency vector 2 corresponding to short text 1 and short text 2, respectively, may be input to similarity unit 106. Similarity unit 106 may determine the similarity of the word frequency vectors. According to an exemplary embodiment, the similarity of the word frequency vectors may be measured using the cosine distance of the vectors. Finally, the similarity unit 106 may provide the similarity measure of the word frequency vector as the similarity measure of the short text.

Short text similarity metrics according to the present application may be applied to various application scenarios, such as including but not limited to transaction-common ship-to address aggregation, and the like.

Although the way of calculating the similarity of two short texts at a time is described above in conjunction with fig. 1, the present disclosure is not limited thereto, but may cover an embodiment in which the similarity of more short texts is compared at a time, as long as it is based on the technical spirit disclosed herein.

Fig. 2 illustrates a schematic diagram of a word frequency determination apparatus 200 in accordance with an aspect of the present disclosure. The word frequency determining apparatus 200 may comprise or constitute the optimized word frequency calculating unit 104 described above in connection with fig. 1.

According to an exemplary embodiment, the word frequency determining apparatus 200 may include, but is not limited to, for example, a word frequency calculating unit 202, an inverse document frequency calculating unit 204, a word frequency weighting unit 206, and a word frequency vector generating unit 208, etc. Although the units described above are described herein as separate functional modules, it will be appreciated by those of ordinary skill in the art that the functions described in connection with the various units may be implemented by various techniques, such as software, firmware, general purpose hardware, special purpose hardware, or the like. Furthermore, each unit need not be implemented in separate software or hardware, but may be implemented, for example, by a general-purpose processor and memory. The functional division is only for the convenience of understanding of the ordinary skilled person in the art, and does not constitute any limitation to the implementation of the present disclosure.

According to an exemplary embodiment, the word frequency calculation unit 202 may obtain the segmented short text 230 from the text corpus 210 to calculate the original word frequency of a single word. For example, the original word frequency may be calculated as:

according to an exemplary embodiment, the inverse document frequency calculation unit 204 may calculate the inverse document frequency of the word in the corpus. For example, in general, the inverse document frequency may be calculated as:

where n is the total number of texts in the corpus 220 obtained based on the corpus 210, and m is the number of texts in the corpus containing the word. The number of texts containing the word is increased by 1 to avoid the situation where the denominator is 0.

TF-IDF based word frequencies may have problems that depend strongly on corpus size and text length. For example, for a large-scale corpus, short text vectors are very sparse and the presence of low frequency words severely interferes with the similarity metric. On the other hand, for a small corpus, the IDF value is greatly jittered, which results in the indistinct similarity.

From this, an optimized IDF can be calculated. According to at least some example embodiments, calculating an optimized IDF may include introducing a penalty such that when the total number of texts in the corpus is small, a greater penalty is given to the IDF to mitigate jitter in IDF values; and when the total number of the texts in the corpus is larger, the penalty is lightened, and the penalty can be gradually reduced along with the gradual increase of the total number of the texts in the corpus, so that the weight coefficients of the high-frequency words and the low-frequency words are more stable.

According to an exemplary embodiment, the penalty may include a non-negative smoothing factor δ_m. The smoothing factor delta is used when calculating the IDF of a particular word_mCan be used to adjust the number of texts in the corpus that contain the word.

For example, when the total text in the corpusThe smoothing factor δ is smaller (e.g., less than or equal to the first threshold value)_mLarger values may be desirable. The number of texts in the corpus containing the word is subject to the smoothing factor delta_mThe adjustment is significantly increased, so that the IDF is reduced, and the jitter of the IDF value is relieved.

For example, according to some exemplary embodiments, the smoothing factor δ is when the total number of texts in the corpus is small (e.g., less than or equal to a first threshold value)_mLarger constants or other empirical values may be taken. According to further exemplary embodiments, the smoothing factor δ is a factor that is used when the total number of texts in the corpus is small (e.g., less than or equal to a first threshold value)_mA variance value may be taken that drops significantly as the total number of texts increases.

On the other hand, when the total number of texts in the corpus is large (e.g., larger than the first threshold), the smoothing factor δ_mValues may be taken that decrease slowly as the total number of texts in the corpus increases. The number of texts in the corpus containing the word is subject to the smoothing factor delta_mThe degree of increase in adjustment decreases more slowly as the total number of texts in the corpus increases. According to some exemplary embodiments, the smoothing factor δ is such that when the total number of texts in the corpus increases to a certain extent (e.g., above a second threshold value), the smoothing factor δ_m0 can be taken directly.

According to some exemplary embodiments, the smoothing factor δ_mMay include a value that exponentially decreases with the total number of texts in the corpus.

According to some exemplary embodiments, the smoothing factor δ is used_mAdjusting the number of texts in the corpus that contain the word may include adding the smoothing factor δ to the number of texts in the corpus that contain the word_m。

According to some exemplary embodiments, the smoothing factor δ is used_mAdjusting the number of texts in the corpus that contain the word may include multiplying the number of texts in the corpus that contain the word by the smoothing factor δ_m。

According to some example embodiments, the first threshold for the total number of texts in the corpus may comprise, for example, 5. The present disclosure is not limited to the preferred embodiments. The first threshold may be larger or smaller.

According to some example embodiments, the second threshold for the total number of texts in the corpus may comprise, for example, 10. The present disclosure is not limited to the preferred embodiments. The second threshold may be larger or smaller.

According to some exemplary embodiments, the smoothing factor δ_mThe following can be calculated:

where n is the total number of texts in the corpus. Accordingly, by the smoothing factor δ_mThe smoothed IDF may be calculated as follows:

according to further exemplary embodiments, the smoothing factor δ_mThe following can be calculated:

δ_m＝ae^(5-n)，m,n>＝0 (5)

where n is the total number of texts in the corpus, m is the number of texts in the corpus containing the word, and a is a constant greater than 0. Accordingly, by the smoothing factor δ_mThe smoothed IDF may be calculated as follows:

as can be seen, an optimized IDF may include giving the IDF a greater penalty to mitigate jitter in IDF values when the total number of texts in the corpus is small; and when the total number of the texts in the corpus is larger, the penalty is lightened, and the penalty can be gradually reduced along with the gradual increase of the total number of the texts in the corpus, so that the weight coefficients of the high-frequency words and the low-frequency words are more stable.

The invention provides a method for calculating the similarity of short texts by combining word segmentation and improved inverse document frequency.

Fig. 3 illustrates a flow diagram of a short text similarity determination method 300 in accordance with an aspect of the present disclosure. The short text similarity determination method 300 may include, for example, at block 302, tokenizing the short text to obtain a sequence of words.

According to an example embodiment, the word segmentation may utilize various existing or future technologies. According to an exemplary embodiment, word segmentation tools applicable to a particular language may be used. For example, third party word segmentation packages such as jieba may be utilized, as well as other chinese word segmentation toolkits such as THULAC, pkuseg, Hanlp, etc.

According to an exemplary embodiment, the short text similarity determination method 300 may further include obtaining a text library and performing word segmentation on the short text in the text library as described at block 302, thereby obtaining a word segmentation list.

According to an exemplary embodiment, the short text similarity determination method 300 may optionally further include removing stop words based on the participle list to construct a corpus. The stop word refers to a specific character or word automatically filtered in the information retrieval related to the natural language, so as to save storage space and improve search efficiency.

The short text similarity determination method 300 may further include determining a word frequency (TF) and an adjusted Inverse Document Frequency (IDF) for each word in the short text, for example, at block 304. The adjusted Inverse Document Frequency (IDF) may be determined based on, for example, the manner described above in connection with fig. 1 and/or fig. 2, to give a greater penalty to IDF when the total number of texts in the corpus is smaller, to mitigate jitter in IDF values; and when the total number of the texts in the corpus is larger, the penalty is lightened, and the penalty can be gradually reduced along with the gradual increase of the total number of the texts in the corpus, so that the weight coefficients of the high-frequency words and the low-frequency words are more stable.

The short text similarity determination method 300 may further include determining an optimized TF-IDF value for each word in the short text as a weighted word frequency for the word, for example, based on the word frequency (TF) and the optimized Inverse Document Frequency (IDF), at block 306. The optimized TF-IDF value may be a product of a word frequency (TF) of each word in the short text and an optimized Inverse Document Frequency (IDF).

After determining the weighted word frequency for each word in the short text, the short text similarity determination method 300 may further include determining a word frequency vector for the short text, for example, at block 308. Determining the word frequency vector for the short text may include combining the optimized TF-IDF values for each word into a vector in an order corresponding to the sequence of words of the short text into the word frequency vector corresponding to the short text.

The short text similarity determination method 300 may further include comparing the similarity between two or more short texts based on the word frequency vector of the short texts, for example, at block 310. According to some example embodiments, the similarity between two or more short texts may be determined based on cosine similarity of their respective word frequency vectors. According to some example embodiments, the similarity between two or more short texts may be determined based on other similarity measures of their respective word frequency vectors.

Fig. 4 illustrates a block diagram of a short text similarity determination apparatus 400 in accordance with an aspect of the present disclosure. The short text similarity determination apparatus 400 may include, for example, a module 402 for tokenizing short text.

According to an exemplary embodiment, the module for tokenizing short text 402 may utilize various existing or future technologies. According to an exemplary embodiment, word segmentation tools applicable to a particular language may be used. For example, third party word segmentation packages such as jieba may be utilized, as well as other chinese word segmentation toolkits such as THULAC, pkuseg, Hanlp, etc.

According to an exemplary embodiment, the short text similarity determination apparatus 400 may further include a module (not shown) for acquiring a text library. The module for segmenting the short text 402 may segment the short text in the acquired text library, thereby obtaining a segmentation list.

According to an exemplary embodiment, the short text similarity determining apparatus 400 may optionally be further configured to include a module (not shown) for removing stop words to construct a corpus based on the participle list. The stop word refers to a specific character or word automatically filtered in the information retrieval related to the natural language, so as to save storage space and improve search efficiency.

The short text similarity determination apparatus 400 may further include a module 404 for determining a word frequency (TF) and an adjusted Inverse Document Frequency (IDF) for each word in the short text, for example. The adjusted Inverse Document Frequency (IDF) may be determined based on, for example, the manner described above in connection with fig. 1 and/or fig. 2, to give a greater penalty to IDF when the total number of texts in the corpus is smaller, to mitigate jitter in IDF values; and when the total number of the texts in the corpus is larger, the penalty is lightened, and the penalty can be gradually reduced along with the gradual increase of the total number of the texts in the corpus, so that the weight coefficients of the high-frequency words and the low-frequency words are more stable.

The short text similarity determination apparatus 400 may further include a module 406 for determining an optimized TF-IDF value for each word in the short text as a weighted word frequency for the word based on the word frequency (TF) and the adjusted Inverse Document Frequency (IDF). The optimized TF-IDF value may be a product of a word frequency (TF) of each word in the short text and an adjusted Inverse Document Frequency (IDF).

The short text similarity determination apparatus 400 may further include a module 408 for determining a word frequency vector for the short text after determining a weighted word frequency for each word in the short text. The module for determining a word frequency vector for the short text 408 may include, for example, a module for constructing a vector of optimized TF-IDF values in an order corresponding to the word order of each word in the short text into a word frequency vector corresponding to the short text.

The short text similarity determination apparatus 400 may further include a module 410 for comparing the similarity between two or more short texts based on the word frequency vectors of the short texts, for example. According to some example embodiments, the similarity between two or more short texts may be determined based on cosine similarity of their respective word frequency vectors. According to some example embodiments, the similarity between two or more short texts may be determined based on other similarity measures of their respective word frequency vectors.

The short text similarity determination apparatus 400 may be implemented in software, firmware, and/or hardware, or any combination thereof. For example, the various modules of the short text similarity determination apparatus 400 may be implemented in the form of processor-executable instructions stored on a computer-readable storage medium such that, when the processor-executable instructions are read and executed by one or more processors of a computer, the computer performs the functions of the

various modules

402 and 410 of the short text similarity determination apparatus 400, and so on. For another example, the short text similarity determination apparatus 400 may be implemented by a combination of a processor and a memory, the processor being coupled with the memory and executing programs and/or instructions stored in the memory to implement the functions of the

modules

402 and 410 of the short text similarity determination apparatus 400, and the like. As another example, short text similarity determination apparatus 400 may be implemented in various other dedicated or general firmware/hardware implementations.

According to the short text similarity efficient calculation method and device combining word segmentation and the improved inverse text frequency according to various embodiments of the aspects of the disclosure, the influence of the size of the corpus and the length of the text on the similarity measurement between short texts is eliminated or reduced, and the inverse text frequency calculation method is improved, so that the method can adaptively adjust the inverse text frequency value according to the size of the corpus and the length of the text, smooth the weight coefficients of the high-frequency words and the low-frequency words, and the obtained text vector can pay attention to the difference of the low-frequency words and the similarity of the high-frequency words. The short text similarity efficient calculation method and device of the improved inverse text frequency have the advantages that the similarity between short texts calculated by the method and device is smoother, more stable and easier to explain.

What has been described above is merely exemplary embodiments of the present invention. The scope of the invention is not limited thereto. Any changes or substitutions that may be easily made by those skilled in the art within the technical scope of the present disclosure are intended to be included within the scope of the present disclosure.

The various illustrative logical blocks, modules, and circuits described in connection with the disclosure may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable Logic Device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any commercially available processor, controller, microcontroller or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

The steps of a method or algorithm described in connection with the disclosure may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may reside in any form of storage medium known in the art. Some examples of storage media that may be used include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, and so forth. A software module may comprise a single instruction, or many instructions, and may be distributed over several different code segments, among different programs, and across multiple storage media. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor.

The methods disclosed herein comprise one or more steps or actions for achieving the described method. The method steps and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.

The processor may execute software stored on a machine-readable medium. A processor may be implemented with one or more general and/or special purpose processors. Examples include microprocessors, microcontrollers, DSP processors, and other circuitry capable of executing software. Software should be construed broadly to mean instructions, data, or any combination thereof, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. By way of example, a machine-readable medium may include RAM (random access memory), flash memory, ROM (read only memory), PROM (programmable read only memory), EPROM (erasable programmable read only memory), EEPROM (electrically erasable programmable read only memory), registers, a magnetic disk, an optical disk, a hard drive, or any other suitable storage medium, or any combination thereof. The machine-readable medium may be embodied in a computer program product. The computer program product may include packaging material.

In a hardware implementation, the machine-readable medium may be a part of the processing system that is separate from the processor. However, as those skilled in the art will readily appreciate, the machine-readable medium, or any portion thereof, may be external to the processing system. By way of example, a machine-readable medium may include a transmission line, a carrier wave modulated by data, and/or a computer product separate from the wireless node, all of which may be accessed by a processor through a bus interface. Alternatively or additionally, the machine-readable medium or any portion thereof may be integrated into a processor, such as a cache and/or a general register file, as may be the case.

The processing system may be configured as a general purpose processing system having one or more microprocessors that provide processor functionality, and an external memory that provides at least a portion of the machine readable medium, all linked together with other supporting circuitry through an external bus architecture. Alternatively, the processing system may be implemented with an ASIC (application specific integrated circuit) having a processor, a bus interface, a user interface (in the case of an access terminal), support circuitry, and at least a portion of a machine readable medium integrated in a single chip, or with one or more FPGAs (field programmable gate arrays), PLDs (programmable logic devices), controllers, state machines, gated logic, discrete hardware components, or any other suitable circuitry, or any combination of circuitry that is capable of performing the various functionalities described throughout this disclosure. Depending on the particular application and the overall design constraints imposed on the overall system, those skilled in the art will recognize how to better implement the functionality described with respect to the processing system.

The machine-readable medium may include several software modules. These software modules include instructions that, when executed by a device, such as a processor, cause the processing system to perform various functions. These software modules may include a transmitting module and a receiving module. Each software module may reside in a single storage device or be distributed across multiple storage devices. As an example, a software module may be loaded into RAM from a hard drive when a triggering event occurs. During execution of the software module, the processor may load some instructions into the cache to increase access speed. One or more cache lines may then be loaded into a general register file for execution by the processor. When referring to the functionality of a software module below, it will be understood that such functionality is implemented by the processor when executing instructions from the software module.

If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as Infrared (IR), radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disc), as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and

disks, where a disk (disk) usually reproduces data magnetically, and a disk (disc) reproduces data optically with a laser. Thus, in some aspects, computer-readable media may comprise non-transitory computer-readable media (e.g., tangible media). Additionally, for other aspects, the computer-readable medium may comprise a transitory computer-readable medium (e.g., a signal). Combinations of the above should also be included within the scope of computer-readable media.

Accordingly, certain aspects may comprise a computer program product for performing the operations presented herein. For example, such a computer program product may include a computer-readable medium having instructions stored (and/or encoded) thereon, the instructions being executable by one or more processors to perform the operations described herein. In certain aspects, a computer program product may include packaging materials.

It is to be understood that the claims are not limited to the precise configuration and components illustrated above. Various changes, substitutions and alterations in the arrangement, operation and details of the method and apparatus described above may be made without departing from the scope of the claims.

Claims

1. A short text similarity determination method comprises the following steps:

performing word segmentation on short texts in a corpus to obtain corresponding word sequences;

determining a penalty based on a total number of short texts in the corpus, the penalty decreasing as the total number of short texts in the corpus increases;

determining a word frequency and an adjusted inverse document frequency for each word in the sequence of words, wherein the adjusted inverse document frequency is calculated based on the penalty;

weighting the word frequency of each word by using the adjusted inverse document frequency;

combining the weighted word frequencies of each word in the word sequence to determine a word frequency vector of the short text; and

determining similarity of the short text to other short texts based on the word frequency vector.

2. The method of claim 1, wherein determining the adjusted inverse document frequency for each word comprises:

determining a total number of texts in the corpus;

determining the number of texts containing the word in the corpus;

adjusting the number of texts in the corpus containing the word based on the penalty, so that the number of texts in the corpus containing the word is exponentially increased when the total number of short texts in the corpus is smaller than a first threshold; and

determining the adjusted inverse document frequency based on the total number of texts and the adjusted number of texts in the corpus containing the word.

3. The method of claim 2, wherein adjusting the number of texts in the corpus containing the word based on the penalty further causes the number of texts in the corpus containing the word not to be increased when the total number of short texts in the corpus is greater than a second threshold.

4. The method of claim 2, wherein the penalty comprises an exponential smoothing factor and the adjusting comprises adding the exponential smoothing factor to the number of texts in the corpus containing the word.

5. The method of claim 1, wherein weighting the word frequency of each word with the adjusted inverse document frequency comprises multiplying the word frequency by the adjusted inverse document frequency based on the penalty calculation.

6. The method of claim 1, further comprising:

and if the word sequence obtained by segmenting the short text is different from the word sequences of the other short texts in length, filling or cutting the word sequence of the short text or the word sequences of the other short texts to enable the word sequences and the word sequences to be the same in length.

7. The method of claim 1, determining similarity of the short text to other short texts based on the word frequency vector comprises:

and calculating the cosine distance between the word frequency vector of the short text and the word frequency vectors of other short texts.

8. A short text similarity determination apparatus comprising:

a memory; and

a processor coupled to the memory, the processor configured to:

9. The short text similarity determination apparatus of claim 8, wherein the processor being configured to determine the adjusted inverse document frequency for each word comprises the processor being configured to:

determining a total number of texts in the corpus;

determining the number of texts containing the word in the corpus;

10. The short text similarity determination apparatus of claim 9, wherein the processor being configured to adjust the number of texts in the corpus containing the word based on the penalty further comprises the processor being configured such that the number of texts in the corpus containing the word is not increased when the total number of short texts in the corpus is greater than a second threshold.

11. The short text similarity determination apparatus of claim 9, wherein the penalty comprises an exponential smoothing factor and the adjustment comprises adding the exponential smoothing factor to the number of texts in the corpus containing the word.

12. The short-text similarity determination apparatus of claim 8 wherein the processor being configured to weight the word frequency for each word with the adjusted inverse document frequency comprises the processor being configured to multiply the word frequency by the adjusted inverse document frequency based on the penalty calculation.

13. The short-text similarity determination apparatus of claim 8, the processor further configured to:

14. The short text similarity determination apparatus of claim 8, wherein the processor being configured to determine the similarity of the short text to other short texts based on the word frequency vector comprises the processor being configured to:

15. A short text similarity determination apparatus comprising:

a module for segmenting words from short texts in a corpus to obtain corresponding word sequences;

means for determining a penalty based on a total number of short texts in the corpus, the penalty decreasing as the total number of short texts in the corpus increases;

means for determining a word frequency and an adjusted inverse document frequency for each word in the sequence of words, wherein the adjusted inverse document frequency is calculated based on the penalty;

a module for weighting the word frequency of each word with the adjusted inverse document frequency;

means for combining the weighted word frequencies of each word in the sequence of words to determine a word frequency vector for the short text; and

means for determining a similarity of the short text to other short texts based on the word frequency vector.