Detailed Description
The Term Frequency-Inverse Document Frequency (TF-IDF) technique is a weighting technique commonly used in information retrieval and text mining, and can be used to evaluate the importance of a single word to a particular text in a text library or corpus. The importance of a word increases in proportion to the number of times it appears in the text, i.e., its term frequency (TF), but at the same time decreases in inverse proportion to the frequency of its occurrence across the corpus, as captured by the inverse document frequency (IDF). If a word is rare in the corpus but appears multiple times in a given text, it is likely to reflect the characteristics of that text and is a desired keyword.
In order to identify the keywords of a text, the text may be segmented into words, and the word frequency of each word may then be counted. Word frequency refers to the number of times a given word appears in the text; keywords of a text tend to appear in it with high word frequency. However, semantically insignificant high-frequency words such as "of", "get", "ground", "is", and "also" are easily mistaken for keywords. Thus, each word needs to be weighted so as to reduce the weight of such insignificant high-frequency words, while increasing the weight of words that are less common in the corpus at large but significant in the text. This weighting is referred to as the Inverse Document Frequency (IDF).
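As a concrete illustration, counting word frequencies over a segmented text can be sketched as follows (a minimal Python sketch; the function name is illustrative, not from the disclosure):

```python
from collections import Counter

def word_frequencies(words):
    """Map each word of a segmented text to its relative frequency."""
    counts = Counter(words)   # raw occurrence counts per word
    total = len(words)        # total number of words in the text
    return {w: c / total for w, c in counts.items()}
```

The resulting relative frequencies are the TF values that the later weighting steps build on.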
The inverse document frequency is a measure of the importance of a word and is inversely proportional to how common the word is in a general sense. For example, the inverse document frequency may be calculated by dividing the total number of texts in the corpus by the number of texts in the corpus that contain the word, and taking the logarithm of the resulting quotient.
After the TF and IDF are calculated, the two values are multiplied to obtain the TF-IDF value of the word. The larger the TF-IDF value, the higher the importance of the word to the text. The words with the highest TF-IDF values in a text may then be taken as the keywords of that text.
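The TF, IDF, and keyword-selection steps described above can be sketched as follows. This is a minimal sketch assuming the corpus is represented as a list of segmented texts; the names are illustrative, and the +1 in the IDF denominator follows the zero-denominator safeguard described later in this disclosure:

```python
import math

def tf(word, words):
    """Term frequency: occurrences of `word` over the total words in the text."""
    return words.count(word) / len(words)

def idf(word, corpus):
    """Inverse document frequency: log of total texts over texts containing
    the word (1 is added to the count to avoid a zero denominator)."""
    m = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (m + 1))

def top_keywords(words, corpus, k=3):
    """Return the k words of one segmented text with the highest TF-IDF."""
    scores = {w: tf(w, words) * idf(w, corpus) for w in set(words)}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A word common to every corpus text (high m) scores low even when frequent in the text, while a rare but locally frequent word rises to the top.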
According to an exemplary embodiment, the short text similarity measure of the present disclosure may be determined by generating respective word frequency vectors of two short texts and calculating the cosine similarity of the two word frequency vectors. According to other exemplary embodiments, the short text similarity metric of the present disclosure may be determined by generating respective word frequency vectors of two short texts and calculating other similarity metrics of the two word frequency vectors, including but not limited to Jaccard similarity, the Sørensen similarity coefficient, Levenshtein distance, and Hamming distance, among others.
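Cosine similarity over two equal-length word frequency vectors, and the set-based Jaccard alternative, can be sketched as follows (function names are illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                    # degenerate all-zero vector
    return dot / (norm_u * norm_v)

def jaccard_similarity(words1, words2):
    """Jaccard similarity between the word sets of two segmented texts."""
    s1, s2 = set(words1), set(words2)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0
```

Cosine similarity compares weighted vectors and so can use the TF-IDF weights directly, whereas Jaccard similarity discards weights and compares only word membership.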
Fig. 1 illustrates a schematic diagram of a short text similarity determination system 100 in accordance with an aspect of the present disclosure. As shown in fig. 1, two or more short texts (e.g., short text 1 and short text 2) may each be input into the word segmentation unit 102 for segmentation to obtain a word sequence. Word segmentation may utilize various existing or future technologies. According to an exemplary embodiment, word segmentation tools applicable to a particular language may be used. For example, third-party word segmentation packages such as jieba may be utilized, as well as other Chinese word segmentation toolkits such as THULAC, pkuseg, HanLP, etc.
The segmented short texts (i.e., the word sequences of the short texts) may include different numbers of words. For example, short text 1 may be segmented into the word sequence "word 1, word 2, …, word N" and short text 2 may be segmented into the word sequence "word 1, word 2, …, word M", where N may not equal M. In this case, the two lengths can be made equal by padding or cutting. In addition, as will be appreciated, expressions such as "word n" are merely intended to indicate that the word occupies the nth position in the short text; "word n" in different short texts need not be the same word.
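The padding option for equalizing two word sequences of different lengths can be sketched as follows (the empty-string pad token is an assumption, not specified by the disclosure):

```python
def equalize(seq1, seq2, pad=""):
    """Pad the shorter of two word sequences with `pad` tokens
    so that both sequences have equal length."""
    n = max(len(seq1), len(seq2))
    return (seq1 + [pad] * (n - len(seq1)),
            seq2 + [pad] * (n - len(seq2)))
```

The cutting alternative would instead truncate both sequences to the shorter length, trading lost words for the absence of artificial pad tokens.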
The segmented short text may be input into the optimized word frequency calculation unit 104. The optimized word frequency calculation unit 104 may calculate the word frequency of each word in each segmented short text. According to an exemplary embodiment, the word frequency in the present disclosure may be based on an optimized TF-IDF. That is, the optimized TF-IDF value of each word may be calculated, and a vector of the optimized TF-IDF values, formed in an order corresponding to the word order in the short text, becomes the word frequency vector corresponding to the short text. The calculation of the optimized TF-IDF value is further described below.
Word frequency vector 1 and word frequency vector 2 corresponding to short text 1 and short text 2, respectively, may be input to similarity unit 106. Similarity unit 106 may determine the similarity of the word frequency vectors. According to an exemplary embodiment, the similarity of the word frequency vectors may be measured using the cosine distance of the vectors. Finally, the similarity unit 106 may provide the similarity measure of the word frequency vector as the similarity measure of the short text.
Short text similarity metrics according to the present application may be applied to various application scenarios, including but not limited to the aggregation of ship-to addresses commonly used across transactions, and the like.
Although the above description in conjunction with fig. 1 concerns calculating the similarity of two short texts at a time, the present disclosure is not limited thereto and may also cover embodiments in which the similarity of more short texts is compared at a time, as long as they are based on the technical spirit disclosed herein.
Fig. 2 illustrates a schematic diagram of a word frequency determination apparatus 200 in accordance with an aspect of the present disclosure. The word frequency determining apparatus 200 may comprise or constitute the optimized word frequency calculating unit 104 described above in connection with fig. 1.
According to an exemplary embodiment, the word frequency determining apparatus 200 may include, but is not limited to, for example, a word frequency calculating unit 202, an inverse document frequency calculating unit 204, a word frequency weighting unit 206, and a word frequency vector generating unit 208, etc. Although the units above are described herein as separate functional modules, it will be appreciated by those of ordinary skill in the art that the functions described in connection with the various units may be implemented by various techniques, such as software, firmware, general-purpose hardware, special-purpose hardware, or the like. Furthermore, each unit need not be implemented in separate software or hardware, but may be implemented, for example, by a general-purpose processor and memory. The functional division is merely for ease of understanding by those of ordinary skill in the art and does not constitute any limitation on the implementation of the present disclosure.
According to an exemplary embodiment, the word frequency calculation unit 202 may obtain the segmented short text 230 from the text corpus 210 to calculate the original word frequency of a single word. For example, the original word frequency may be calculated as:

TF = (number of occurrences of the word in the text) / (total number of words in the text) (1)
According to an exemplary embodiment, the inverse document frequency calculation unit 204 may calculate the inverse document frequency of the word in the corpus. For example, in general, the inverse document frequency may be calculated as:

IDF = log(n / (m + 1)) (2)

where n is the total number of texts in the corpus 220 obtained based on the corpus 210, and m is the number of texts in the corpus containing the word. The number of texts containing the word is increased by 1 to avoid a denominator of 0.
TF-IDF based word frequencies may depend strongly on corpus size and text length. For example, for a large-scale corpus, short text vectors are very sparse, and the presence of low-frequency words severely interferes with the similarity metric. On the other hand, for a small corpus, the IDF value jitters greatly, which results in indistinct similarity measures.
Accordingly, an optimized IDF may be calculated. According to at least some example embodiments, calculating the optimized IDF may include introducing a penalty such that, when the total number of texts in the corpus is small, a greater penalty is applied to the IDF to mitigate jitter in IDF values; and when the total number of texts in the corpus is large, the penalty is lightened and may gradually decrease as the total number of texts in the corpus gradually increases, so that the weight coefficients of high-frequency and low-frequency words are more stable.
According to an exemplary embodiment, the penalty may include a non-negative smoothing factor δm. When calculating the IDF of a particular word, the smoothing factor δm may be used to adjust the number of texts in the corpus that contain the word.
For example, when the total number of texts in the corpus is small (e.g., less than or equal to a first threshold), the smoothing factor δm may take a larger value. The number of texts in the corpus containing the word is then significantly increased by the smoothing factor δm adjustment, so that the IDF is reduced and the jitter of the IDF value is mitigated.
For example, according to some exemplary embodiments, when the total number of texts in the corpus is small (e.g., less than or equal to the first threshold), the smoothing factor δm may take a larger constant or other empirical value. According to further exemplary embodiments, when the total number of texts in the corpus is small (e.g., less than or equal to the first threshold), the smoothing factor δm may take a varying value that drops significantly as the total number of texts increases.
On the other hand, when the total number of texts in the corpus is large (e.g., larger than the first threshold), the smoothing factor δm may take a value that decreases slowly as the total number of texts in the corpus increases. The degree to which the number of texts containing the word is increased by the smoothing factor δm adjustment then decreases slowly as the total number of texts in the corpus increases. According to some exemplary embodiments, when the total number of texts in the corpus increases beyond a certain extent (e.g., above a second threshold), the smoothing factor δm may be taken directly as 0.
According to some exemplary embodiments, the smoothing factor δm may include a value that decreases exponentially with the total number of texts in the corpus.
According to some exemplary embodiments, using the smoothing factor δm to adjust the number of texts in the corpus that contain the word may include adding the smoothing factor δm to that number.
According to some exemplary embodiments, using the smoothing factor δm to adjust the number of texts in the corpus that contain the word may include multiplying that number by the smoothing factor δm.
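The two adjustment variants above (adding δm to, or multiplying it with, the per-word document count) can be sketched as follows. The exact placement of the +1 term alongside δm is an assumption here, following the zero-denominator safeguard of the unsmoothed IDF:

```python
import math

def smoothed_idf_additive(n, m, delta):
    """IDF in which the smoothing factor is added to the count
    of texts containing the word (n = corpus size, m = that count)."""
    return math.log(n / (m + delta + 1))

def smoothed_idf_multiplicative(n, m, delta):
    """IDF in which the count of texts containing the word
    is scaled by the smoothing factor."""
    return math.log(n / (m * delta + 1))
```

In both variants a larger δm inflates the effective document count, pulling the IDF down and damping the jitter that small corpora would otherwise produce.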
According to some example embodiments, the first threshold for the total number of texts in the corpus may comprise, for example, 5. The present disclosure is not limited thereto; the first threshold may be larger or smaller.
According to some example embodiments, the second threshold for the total number of texts in the corpus may comprise, for example, 10. The present disclosure is not limited thereto; the second threshold may be larger or smaller.
According to some exemplary embodiments, the smoothing factor δm may be calculated as follows:

δm = e^(5-n), m, n >= 0 (3)

where n is the total number of texts in the corpus. Accordingly, with the smoothing factor δm, the smoothed IDF may be calculated as follows:

IDF = log(n / (m + δm + 1)) (4)
According to further exemplary embodiments, the smoothing factor δm may be calculated as follows:
δm = a·e^(5-n), m, n >= 0 (5)
where n is the total number of texts in the corpus, m is the number of texts in the corpus containing the word, and a is a constant greater than 0. Accordingly, with the smoothing factor δm, the smoothed IDF may be calculated as follows:

IDF = log(n / (m + δm + 1)) (6)
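The exponentially decaying smoothing factor of equation (5), combined with an additively smoothed IDF (the additive form is one of the embodiments described above; its exact combination with the +1 safeguard is an assumption), can be sketched as:

```python
import math

def smoothing_factor(n, a=1.0):
    """delta_m = a * e^(5 - n): decays exponentially as the corpus size n grows."""
    return a * math.exp(5 - n)

def smoothed_idf(n, m, a=1.0):
    """IDF with the count of texts containing the word (m)
    adjusted by the smoothing factor."""
    return math.log(n / (m + smoothing_factor(n, a) + 1))
```

For a tiny corpus (n near 5) the factor is large and the IDF is strongly penalized; as n grows, the factor vanishes and the smoothed IDF converges to the ordinary log(n / (m + 1)).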
as can be seen, an optimized IDF may include giving the IDF a greater penalty to mitigate jitter in IDF values when the total number of texts in the corpus is small; and when the total number of the texts in the corpus is larger, the penalty is lightened, and the penalty can be gradually reduced along with the gradual increase of the total number of the texts in the corpus, so that the weight coefficients of the high-frequency words and the low-frequency words are more stable.
The invention provides a method for calculating the similarity of short texts by combining word segmentation and improved inverse document frequency.
Fig. 3 illustrates a flow diagram of a short text similarity determination method 300 in accordance with an aspect of the present disclosure. The short text similarity determination method 300 may include, for example, at block 302, tokenizing the short text to obtain a sequence of words.
According to an exemplary embodiment, the word segmentation may utilize various existing or future technologies. According to an exemplary embodiment, word segmentation tools applicable to a particular language may be used. For example, third-party word segmentation packages such as jieba may be utilized, as well as other Chinese word segmentation toolkits such as THULAC, pkuseg, HanLP, etc.
According to an exemplary embodiment, the short text similarity determination method 300 may further include obtaining a text library and performing word segmentation on the short text in the text library as described at block 302, thereby obtaining a word segmentation list.
According to an exemplary embodiment, the short text similarity determination method 300 may optionally further include removing stop words based on the participle list to construct a corpus. Stop words refer to specific characters or words that are automatically filtered out in natural-language-related information retrieval, so as to save storage space and improve search efficiency.
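The stop-word filtering step can be sketched as follows (the example stop-word list is purely illustrative):

```python
def remove_stop_words(words, stop_words):
    """Drop stop words from a segmented word list, preserving order."""
    stop = set(stop_words)  # set membership keeps the filter O(1) per word
    return [w for w in words if w not in stop]
```

In practice the stop-word list would come from a language-specific resource rather than being hard-coded.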
The short text similarity determination method 300 may further include, for example at block 304, determining a word frequency (TF) and an adjusted Inverse Document Frequency (IDF) for each word in the short text. The adjusted Inverse Document Frequency (IDF) may be determined, for example, in the manner described above in connection with fig. 1 and/or fig. 2: a greater penalty is given to the IDF when the total number of texts in the corpus is small, mitigating jitter in IDF values; and the penalty is lightened when the total number of texts in the corpus is large, gradually decreasing as the total number of texts gradually increases, so that the weight coefficients of high-frequency and low-frequency words are more stable.
The short text similarity determination method 300 may further include determining an optimized TF-IDF value for each word in the short text as a weighted word frequency for the word, for example, based on the word frequency (TF) and the optimized Inverse Document Frequency (IDF), at block 306. The optimized TF-IDF value may be a product of a word frequency (TF) of each word in the short text and an optimized Inverse Document Frequency (IDF).
After determining the weighted word frequency of each word in the short text, the short text similarity determination method 300 may further include, for example at block 308, determining a word frequency vector for the short text. Determining the word frequency vector may include combining the optimized TF-IDF values of the words, in an order corresponding to the word sequence of the short text, into a vector that serves as the word frequency vector corresponding to the short text.
The short text similarity determination method 300 may further include comparing the similarity between two or more short texts based on the word frequency vector of the short texts, for example, at block 310. According to some example embodiments, the similarity between two or more short texts may be determined based on cosine similarity of their respective word frequency vectors. According to some example embodiments, the similarity between two or more short texts may be determined based on other similarity measures of their respective word frequency vectors.
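Blocks 302 through 310 can be sketched end to end as follows. Whitespace tokenization stands in for a real segmenter such as jieba, the exponential smoothing factor is one of the variants described above, and its additive combination with the +1 safeguard is an assumption:

```python
import math

def short_text_similarity(text1, text2, corpus_texts):
    """Sketch of method 300: segment, weight by TF and smoothed IDF,
    then compare the two weighted vectors by cosine similarity."""
    corpus = [t.split() for t in corpus_texts]   # whitespace "segmentation"
    n = len(corpus)
    delta = math.exp(5 - n)                      # exponential smoothing factor

    def weight(word, words):
        tf = words.count(word) / len(words)
        m = sum(1 for doc in corpus if word in doc)
        return tf * math.log(n / (m + delta + 1))

    words1, words2 = text1.split(), text2.split()
    vocab = sorted(set(words1) | set(words2))    # shared vector dimensions
    v1 = [weight(w, words1) for w in vocab]
    v2 = [weight(w, words2) for w in vocab]
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Building both vectors over the shared vocabulary sidesteps the explicit padding step: a word absent from one text simply contributes a zero weight in that text's vector.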
Fig. 4 illustrates a block diagram of a short text similarity determination apparatus 400 in accordance with an aspect of the present disclosure. The short text similarity determination apparatus 400 may include, for example, a module 402 for tokenizing short text.
According to an exemplary embodiment, the module 402 for tokenizing short text may utilize various existing or future technologies. According to an exemplary embodiment, word segmentation tools applicable to a particular language may be used. For example, third-party word segmentation packages such as jieba may be utilized, as well as other Chinese word segmentation toolkits such as THULAC, pkuseg, HanLP, etc.
According to an exemplary embodiment, the short text similarity determination apparatus 400 may further include a module (not shown) for acquiring a text library. The module for segmenting the short text 402 may segment the short text in the acquired text library, thereby obtaining a segmentation list.
According to an exemplary embodiment, the short text similarity determining apparatus 400 may optionally further include a module (not shown) for removing stop words based on the participle list to construct a corpus. Stop words refer to specific characters or words that are automatically filtered out in natural-language-related information retrieval, so as to save storage space and improve search efficiency.
The short text similarity determination apparatus 400 may further include, for example, a module 404 for determining a word frequency (TF) and an adjusted Inverse Document Frequency (IDF) for each word in the short text. The adjusted Inverse Document Frequency (IDF) may be determined, for example, in the manner described above in connection with fig. 1 and/or fig. 2: a greater penalty is given to the IDF when the total number of texts in the corpus is small, mitigating jitter in IDF values; and the penalty is lightened when the total number of texts in the corpus is large, gradually decreasing as the total number of texts gradually increases, so that the weight coefficients of high-frequency and low-frequency words are more stable.
The short text similarity determination apparatus 400 may further include a module 406 for determining an optimized TF-IDF value for each word in the short text as a weighted word frequency for the word based on the word frequency (TF) and the adjusted Inverse Document Frequency (IDF). The optimized TF-IDF value may be a product of a word frequency (TF) of each word in the short text and an adjusted Inverse Document Frequency (IDF).
The short text similarity determination apparatus 400 may further include a module 408 for determining a word frequency vector for the short text after the weighted word frequency of each word in the short text has been determined. The module 408 for determining the word frequency vector may, for example, construct a vector of the optimized TF-IDF values, in an order corresponding to the word order in the short text, as the word frequency vector corresponding to the short text.
The short text similarity determination apparatus 400 may further include a module 410 for comparing the similarity between two or more short texts based on the word frequency vectors of the short texts, for example. According to some example embodiments, the similarity between two or more short texts may be determined based on cosine similarity of their respective word frequency vectors. According to some example embodiments, the similarity between two or more short texts may be determined based on other similarity measures of their respective word frequency vectors.
The short text similarity determination apparatus 400 may be implemented in software, firmware, and/or hardware, or any combination thereof. For example, the various modules of the short text similarity determination apparatus 400 may be implemented in the form of processor-executable instructions stored on a computer-readable storage medium such that, when the processor-executable instructions are read and executed by one or more processors of a computer, the computer performs the functions of the various modules 402 through 410 of the short text similarity determination apparatus 400, and so on. For another example, the short text similarity determination apparatus 400 may be implemented by a combination of a processor and a memory, the processor being coupled with the memory and executing programs and/or instructions stored in the memory to implement the functions of the modules 402 through 410 of the short text similarity determination apparatus 400, and the like. As another example, the short text similarity determination apparatus 400 may be implemented in various other dedicated or general firmware/hardware implementations.
The efficient short text similarity calculation method and apparatus according to various embodiments of aspects of the present disclosure, which combine word segmentation with an improved inverse document frequency, eliminate or reduce the influence of corpus size and text length on the similarity measure between short texts. By improving the inverse document frequency calculation, the inverse document frequency value can be adaptively adjusted according to the corpus size and text length, smoothing the weight coefficients of high-frequency and low-frequency words, so that the resulting text vectors attend both to the differences among low-frequency words and to the similarity of high-frequency words. The similarity between short texts calculated by the method and apparatus is thus smoother, more stable, and easier to interpret.
What has been described above is merely exemplary embodiments of the present invention. The scope of the invention is not limited thereto. Any changes or substitutions that may be easily made by those skilled in the art within the technical scope of the present disclosure are intended to be included within the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable Logic Device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any commercially available processor, controller, microcontroller or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may reside in any form of storage medium known in the art. Some examples of storage media that may be used include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, and so forth. A software module may comprise a single instruction, or many instructions, and may be distributed over several different code segments, among different programs, and across multiple storage media. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor.
The methods disclosed herein comprise one or more steps or actions for achieving the described method. The method steps and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.
The processor may execute software stored on a machine-readable medium. A processor may be implemented with one or more general and/or special purpose processors. Examples include microprocessors, microcontrollers, DSP processors, and other circuitry capable of executing software. Software should be construed broadly to mean instructions, data, or any combination thereof, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. By way of example, a machine-readable medium may include RAM (random access memory), flash memory, ROM (read only memory), PROM (programmable read only memory), EPROM (erasable programmable read only memory), EEPROM (electrically erasable programmable read only memory), registers, a magnetic disk, an optical disk, a hard drive, or any other suitable storage medium, or any combination thereof. The machine-readable medium may be embodied in a computer program product. The computer program product may include packaging material.
In a hardware implementation, the machine-readable medium may be a part of the processing system that is separate from the processor. However, as those skilled in the art will readily appreciate, the machine-readable medium, or any portion thereof, may be external to the processing system. By way of example, a machine-readable medium may include a transmission line, a carrier wave modulated by data, and/or a computer product separate from the wireless node, all of which may be accessed by a processor through a bus interface. Alternatively or additionally, the machine-readable medium or any portion thereof may be integrated into a processor, such as a cache and/or a general register file, as may be the case.
The processing system may be configured as a general purpose processing system having one or more microprocessors that provide processor functionality, and an external memory that provides at least a portion of the machine readable medium, all linked together with other supporting circuitry through an external bus architecture. Alternatively, the processing system may be implemented with an ASIC (application specific integrated circuit) having a processor, a bus interface, a user interface (in the case of an access terminal), support circuitry, and at least a portion of a machine readable medium integrated in a single chip, or with one or more FPGAs (field programmable gate arrays), PLDs (programmable logic devices), controllers, state machines, gated logic, discrete hardware components, or any other suitable circuitry, or any combination of circuitry that is capable of performing the various functionalities described throughout this disclosure. Depending on the particular application and the overall design constraints imposed on the overall system, those skilled in the art will recognize how to better implement the functionality described with respect to the processing system.
The machine-readable medium may include several software modules. These software modules include instructions that, when executed by a device, such as a processor, cause the processing system to perform various functions. These software modules may include a transmitting module and a receiving module. Each software module may reside in a single storage device or be distributed across multiple storage devices. As an example, a software module may be loaded into RAM from a hard drive when a triggering event occurs. During execution of the software module, the processor may load some instructions into the cache to increase access speed. One or more cache lines may then be loaded into a general register file for execution by the processor. When referring to the functionality of a software module below, it will be understood that such functionality is implemented by the processor when executing instructions from the software module.
If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as Infrared (IR), radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disc), as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where a disk (disk) usually reproduces data magnetically, and a disc (disc) reproduces data optically with a laser. Thus, in some aspects, computer-readable media may comprise non-transitory computer-readable media (e.g., tangible media). Additionally, for other aspects, the computer-readable medium may comprise a transitory computer-readable medium (e.g., a signal). Combinations of the above should also be included within the scope of computer-readable media.
Accordingly, certain aspects may comprise a computer program product for performing the operations presented herein. For example, such a computer program product may include a computer-readable medium having instructions stored (and/or encoded) thereon, the instructions being executable by one or more processors to perform the operations described herein. In certain aspects, a computer program product may include packaging materials.
It is to be understood that the claims are not limited to the precise configuration and components illustrated above. Various changes, substitutions and alterations in the arrangement, operation and details of the method and apparatus described above may be made without departing from the scope of the claims.