TW575813B - System and method using external search engine as foundation for segmentation of word - Google Patents
System and method using external search engine as foundation for segmentation of word
- Publication number
- TW575813B TW91123508A
- Authority
- TW
- Taiwan
- Prior art keywords
- word
- patent application
- words
- scope
- item
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims description 17
- 230000011218 segmentation Effects 0.000 title claims description 8
- 238000012545 processing Methods 0.000 claims description 9
- 239000013598 vector Substances 0.000 description 16
- 239000003921 oil Substances 0.000 description 14
- 241000196324 Embryophyta Species 0.000 description 8
- 238000004364 calculation method Methods 0.000 description 8
- 238000007619 statistical method Methods 0.000 description 6
- 238000005516 engineering process Methods 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 239000000463 material Substances 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 239000010773 plant oil Substances 0.000 description 2
- 241000052341 Datamini Species 0.000 description 1
- 241001247287 Pentalinon luteum Species 0.000 description 1
- 230000009471 action Effects 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 238000010219 correlation analysis Methods 0.000 description 1
- 238000007418 data mining Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 235000013399 edible fruits Nutrition 0.000 description 1
- 239000004744 fabric Substances 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 238000009434 installation Methods 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 239000001397 quillaja saponaria molina bark Substances 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 229930182490 saponin Natural products 0.000 description 1
- 150000007949 saponins Chemical class 0.000 description 1
- 238000012216 screening Methods 0.000 description 1
- 230000001568 sexual effect Effects 0.000 description 1
- 239000002689 soil Substances 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
FIELD OF THE INVENTION

The present invention relates to a system and method that uses an external search engine as the foundation for word segmentation. It can be applied to knowledge management and document processing, and in particular to the document-vector computation that underlies data-mining searches and document-similarity operations.

BACKGROUND OF THE INVENTION

Data-mining and search technologies have traditionally taken keywords as input. The recent mainstream technique, however, takes an entire document as the query and searches for other documents related to it; this is document-similarity computation. Before the similarity between documents can be computed, every document must first be converted into vector form: the meaningful, effective words of an article are selected as the components of the document vector, and the number of times each component appears in the document is counted to form the vector. Converting documents into vectors quickly therefore depends on an efficient technique for the segmentation of words into effective vocabulary.

Segmentation is not limited to document-similarity analysis; the effective words it extracts also support automatic abstracting (auto abstract), document clustering, and information retrieval.

Before segmentation can be performed, however, statistics and computations must first be carried out over a document database: every candidate character group is tested, and the most frequently used words are screened out to form a thesaurus. According to the effective words in the thesaurus, the sentences of a document are then segmented into a set of meaningful, effective words, from which the document vector is formed.
The larger the document database (that is, the statistical base), the more complete the thesaurus of effective words, and the more completely and quickly the keywords of a document can be segmented out. Conversely, the smaller the database, the fewer words the thesaurus contains; the keywords of a document usually cannot be fully segmented, and similarity computation between documents loses its meaning.

Building such a thesaurus therefore requires first assembling an enormous document corpus, and the cost and time this consumes are both considerable. This is the primary problem the present invention sets out to solve.

The Internet itself is the largest database in existence, and its resources may be drawn on without restriction and without exhaustion. If network resources are used as the statistical base, that is, if an external search engine is used as the statistical source, the data and counts it returns can be fed through word-feature-value computations to build the thesaurus. In this way the thesaurus becomes more complete, while the cost of building and collecting a huge document corpus, and the computer resources it would consume, are saved.

SUMMARY OF THE INVENTION

The main purpose of the present invention is to provide a system and method that uses an external search engine as the foundation for word segmentation, so that effective words can be screened out to form a thesaurus, and documents can further be converted into vectors for document-similarity computation.

To achieve this, a threshold for the word feature value is first preset in the local system. Each word to be tested, together with all of its possible character combinations, is then submitted to an external search engine as an ordinary query. After the search engine returns its result pages, the statistics they contain are received and the word feature value is computed. If the word's feature value is greater than (or greater than or equal to) the preset threshold, the word is regarded as an effective word and added to the thesaurus.

The purpose of building the thesaurus with this system is to segment an article quickly into a document vector composed of common words, so that similarity computations can be carried out. The conversion of a document into such a vector proceeds as follows: all common words (character groups) appearing in the documents are arranged in order. A character group may be a bigram, such as the common words 「專利」 (patent), 「商標」 (trademark), and 「著作」 (work); it may also be a unigram, a trigram, or longer. The present invention uses bigrams as the basis of its demonstrations without limiting the scope of its application. The number of times each character group appears in a document is then counted as that component's value, and the component values of all character groups together constitute the vector of the document. For example, suppose the dimensions of a group of documents are arranged as (專利, 商標, 著作). If 「著作」 appears 7 times in a given document, 「商標」 appears 3 times, and 「專利」 appears 5 times, then the vector of that document is (5, 3, 7), a 3-dimensional vector.
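This vectorization step can be sketched minimally as follows, assuming segmentation has already produced a token list (the function name is illustrative, not from the patent; the term ordering and counts follow the example above):

```python
from collections import Counter

def document_vector(tokens, dimensions):
    """Count how often each dimension term occurs in the token list."""
    counts = Counter(tokens)
    # Counter returns 0 for missing terms, so absent words become 0 components.
    return [counts[term] for term in dimensions]

# Dimensions arranged as (patent, trademark, work), per the example.
dims = ["專利", "商標", "著作"]
tokens = ["專利"] * 5 + ["商標"] * 3 + ["著作"] * 7
print(document_vector(tokens, dims))  # → [5, 3, 7]
```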
Once documents have been converted into vectors, the similarity between two documents can be computed: the inner product of the two document vectors serves as the similarity value, and the larger the inner product, the greater the similarity between the two documents. For example, let one document vector be B = (b1, b2, ..., bm) and another be D = (d1, d2, ..., dm), where m is the number of dimensions. The inner-product computation multiplies the corresponding component values of the two documents and sums the products; the relevance of the two documents is their inner product B · D = b1 × d1 + b2 × d2 + ... + bm × dm.

From the above it can be seen that before documents can take part in similarity computations they must first be converted into vectors, and that the vector components are based on common vocabulary. The efficient approach is to use a thesaurus to segment the common words out of a document one by one.
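The inner-product similarity just defined is a one-line computation; a sketch, using the (5, 3, 7) vector from the earlier example against a hypothetical second document:

```python
def inner_product(b, d):
    """Similarity of two document vectors: sum of pairwise component products."""
    assert len(b) == len(d), "vectors must share the same dimensions"
    return sum(bi * di for bi, di in zip(b, d))

# (5, 3, 7) from the example; the second vector is hypothetical.
print(inner_product([5, 3, 7], [2, 0, 1]))  # → 17
```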
For example, take the sentence 「中國大陸新發現的油田」 ("a newly discovered oil field in mainland China"). All of its possible bigrams are 「中國」, 「國大」, 「大陸」, 「陸新」, 「新發」, 「發現」, 「現的」, 「的油」, and 「油田」. If the common words in the thesaurus include 「中國」, 「大陸」, 「發現」, and 「油田」, these words are segmented out one after another, while the remaining bigrams 「國大」, 「陸新」, 「新發」, 「現的」, and 「的油」 are discarded as meaningless. How to judge which candidates are common words, and what to use as the yardstick, is the crux. The present invention compares the statistics returned by an external search engine, after computation, against the preset word-feature threshold: if the computed value exceeds the threshold, the candidate is a common word; otherwise it is not.

To make use of an external search engine, the local system must connect to the Internet, obtain the data the search engine returns, and hand it to the system for computation. The local system and its external connections are described as follows.

As shown in Fig. 1, which illustrates the invention applied to a computer: the invention is first programmed as software comprising, but not limited to, computer instructions, and installed on a computer 101, which may be a desktop or a notebook computer. The computer's software 102 includes the operating system, application software, components of various kinds, databases, programs, and data; the system of the present invention is likewise computer software, residing on computer-readable media such as the storage device 103 and the memory 104, and loaded for execution into the hard disk drive 105 and the memory 106. At the local machine the user works through a keyboard or mouse 107 connected to the computer's input/output port 108. The computer instructions of the software communicate with the other hardware components through the motherboard interface 109 and are delivered to the central processing unit (CPU) 110, which executes the machine instructions; after the instructions are processed, the result is sent to the display interface card 111 for display on the screen 112.

A system user may operate the system on the local computer or over a network. A user coming from a local area network (LAN) 113 reaches the local network interface card 115 through the network equipment 114 to execute the software of these computer instructions; the system can also connect outward to the Internet through this LAN and communicate over it. A user coming from a wide area network (WAN) 116 (or from the Internet) may reach the local network interface card 115 through the network equipment 114, or log in through a modem 117 to another input/output port 118 to execute the software; the system can likewise connect outward to the Internet through this WAN and communicate over it.
The application of the present invention may reside on computer-readable media of all kinds, including but not limited to floppy disks, hard disks, optical discs, flash memory, non-volatile memory (non-volatile ROM), and random-access memory (RAM); nor is installation limited to a single computer, as load-balanced computation across multiple computers is also possible.

Fig. 2 shows one of the ways the system of the invention connects to the network. The invention is implemented as a computer system 204 installed on the local machine. The system connects through a web server 203 to the Internet 202, and communicates and exchanges messages with an external search engine 201. When the external search engine 201 returns data, the data likewise passes through the Internet 202 to the local server 203, and the local computer system 204 receives the messages and processes them.

Fig. 3 shows the main flow of the invention. First, in step 301, the threshold of the word feature value is set. The "word feature value" represents how common a word is in documents. The present invention uses mutual information (MI) as one of the statistical methods for computing the word feature value, with the formula:

MI(ab) = Log(F(ab)) - Log(F(a)) - Log(F(b))

where the word is a bigram, a is the first character of the word and b the second; F(ab) is the frequency (number of occurrences) of the bigram ab, F(a) the frequency of a, and F(b) the frequency of b.

In practice, coefficients are placed before each logarithmic term of the mutual-information formula so that the system of the invention screens words more readily. The adjusted formula is:

MI(ab) = w · Log(F(ab)) - x · Log(F(a)) - y · Log(F(b)) - z

where w, x, y, z are real coefficients, meaning that the mutual-information value is adjusted by using the coefficients as weights. Optimized coefficient values have been computed by recent researchers and are the ones used in the implementation of this invention:

MI(ab) = 0.39 · Log(F(ab)) - 0.28 · Log(F(a)) - 0.23 · Log(F(b)) - 0.32

with all coefficients close to 0.3.

In step 301 the threshold of the word feature value is set, for example mi ≥ -19. In step 302 a word is input to the system, for example 「油田」 (oil field). In step 303 the word, together with the single characters that compose it, is sent to an external search engine for querying. External search engines currently include "Google" (www.google.com), "Openfind" (www.openfind.com.tw), "AltaVista" (www.altavista.com), "Excite" (www.excite.com), and "Lycos" (www.lycos.com); the system may designate any of them, and this example uses Openfind.

In step 304 the message returned by the external search engine is received and interpreted. For example, querying the word 「油田」 on Openfind returns a page containing "Openfind found 12,928 related pages", where 12,928 is F(ab); likewise F(a) = 1,630,526 and F(b) = 1,182,678. In other words, F(ab), F(a), and F(b) are all provided by the search engine.
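A sketch of the feature-value computation with the Openfind counts just quoted. The figure stated later in the text (-18.8206) is reproduced when Log is read as the natural logarithm; the weighted variant uses the 0.39/0.28/0.23/0.32 coefficients quoted above:

```python
import math

def mi(f_ab, f_a, f_b):
    """Unweighted mutual information: MI(ab) = log F(ab) - log F(a) - log F(b)."""
    return math.log(f_ab) - math.log(f_a) - math.log(f_b)

def mi_weighted(f_ab, f_a, f_b, w=0.39, x=0.28, y=0.23, z=0.32):
    """Coefficient-adjusted variant with the optimized weights from the text."""
    return w * math.log(f_ab) - x * math.log(f_a) - y * math.log(f_b) - z

# Counts returned by the search engine for 「油田」, 「油」, 「田」.
print(round(mi(12928, 1630526, 1182678), 2))  # → -18.82
```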
In step 305 the word feature value of the word is computed. Taking mutual information as the feature value here, MI(「油田」) = Log(F(「油田」)) - Log(F(「油」)) - Log(F(「田」)) = -18.8206.

In step 306 the system judges whether the feature value exceeds the threshold; if so, it proceeds to the next step, otherwise the flow ends. Since -18.82 > -19, that is, the feature value MI(「油田」) exceeds the threshold mi, the flow proceeds to step 307, in which the word is regarded as an effective word and added to the thesaurus.

The screening of effective words is further illustrated with an embodiment. For the sentence 「中國大陸新發現的油田」, all possible bigrams, namely 「中國」, 「國大」, 「大陸」, 「陸新」, 「新發」, 「發現」, 「現的」, 「的油」, and 「油田」, are each queried on the external search engine Google. The result counts and the computed figures are shown in Fig. 4. With the word feature value based on MI, the MI values are listed as follows:

詞彙 (word)    MI
「中國」    -14.6
「國大」    -18.3
「大陸」    -14.4
「陸新」    -18.6
「新發」    -17.9
「發現」    -13.9
「現的」    -18.6
「的油」    -19.6
「油田」    -16.6
第12頁 575813Page 12 575813
「發現」    925,000
「現的」    40,200
「的油」    20,600
「油田」    77,600

If the preset threshold is 100,000, the effective words are 「中國」, 「發現」, and 「大陸」.

The third statistical method for the word feature value is information gain (IG), with the formula:

IG(t) = -Σ Pr(ci) Log(Pr(ci)) + Pr(t) Σ Pr(ci|t) Log(Pr(ci|t)) + Pr(t′) Σ Pr(ci|t′) Log(Pr(ci|t′))

where ci is a class of the documents, i = 1..m, with m the number of classes; Pr(ci) is the probability, frequency, or count with which the class appears in the whole collection; t is a bigram; Pr(t) is the frequency with which the bigram appears in the whole collection; Pr(ci|t) is the frequency with which the bigram appears within class ci; Pr(t′) is the frequency with which the bigram does not appear in the whole collection (the result count); and Pr(ci|t′) is the frequency with which the bigram does not appear within the class. This information-gain statistic is better suited to a document collection that is already classified; search engines generally allow the user to restrict a search to a particular category, as on Yahoo!. If there is no classification, the whole collection may be treated as one class.
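The MI and TF screenings described above (MI thresholds of -15 and -17, TF threshold of 100,000) can be reproduced directly from the tabulated values:

```python
MI_VALUES = {
    "中國": -14.6, "國大": -18.3, "大陸": -14.4,
    "陸新": -18.6, "新發": -17.9, "發現": -13.9,
    "現的": -18.6, "的油": -19.6, "油田": -16.6,
}
TF_VALUES = {
    "中國": 1_910_000, "國大": 35_500, "大陸": 568_000,
    "陸新": 8_530, "新發": 63_600, "發現": 925_000,
    "現的": 40_200, "的油": 20_600, "油田": 77_600,
}

def screen(values, threshold):
    """Keep the words whose feature value exceeds the threshold."""
    return {w for w, v in values.items() if v > threshold}

print(sorted(screen(MI_VALUES, -15)))      # → ['中國', '大陸', '發現']
print(sorted(screen(MI_VALUES, -17)))      # → ['中國', '大陸', '油田', '發現']
print(sorted(screen(TF_VALUES, 100_000)))  # → ['中國', '大陸', '發現']
```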
In the information-gain formula, coefficients may likewise be placed before each term to make it easier for the computer to screen words:

IG(t) = w · Σ Pr(ci) Log(Pr(ci)) + x · Pr(t) Σ Pr(ci|t) Log(Pr(ci|t)) + y · Pr(t′) Σ Pr(ci|t′) Log(Pr(ci|t′))

where w, x, y, z are real coefficients, meaning that the information gain is adjusted by using the coefficients as weights.

The fourth statistical method for the word feature value is the chi-square statistic (CHI). For a bigram ab, four counts are gathered: A, the number of times the bigram ab appears; B, the number of times the leading character a appears without the trailing character b; C, the number of times b appears without a; and D, the number of times neither character appears. The CHI value of each bigram is then computed as:

CHI(ab) = [N × (AD - CB)²] ÷ [(A + C) × (B + D) × (A + B) × (C + D)]

where N is the total number of documents. Large search engines generally provide an "advanced search" function in which the user may choose to enter "without the specified words"; in this way, the number of pages containing a but not b, and the number containing b but not a, can both be obtained.

The embodiments above use bigrams for demonstration, but the actual segmentation also applies to words of three or more characters. One way to segment such "long words" is sequential character splitting, explained as follows.
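A sketch of the chi-square computation. The counts A, B, C, D here are hypothetical; in practice they would come from the engine's "advanced search" queries as described:

```python
def chi_square(a_count, b_count, c_count, d_count):
    """CHI(ab) = N * (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    A, B, C, D = a_count, b_count, c_count, d_count
    N = A + B + C + D  # total number of documents
    numerator = N * (A * D - C * B) ** 2
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    return numerator / denominator

# Hypothetical counts: 40 pages with both characters, 60 with only the
# first, 30 with only the second, 870 with neither.
print(round(chi_square(40, 60, 30, 870), 2))  # → 185.87
```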
First, a splitting scheme is selected: for example, a string of n characters is split into two long words, one formed by characters 1 through (n-1) and one formed by characters 2 through n. The word feature value of each part is computed from the statistics returned by the external search engine, and the threshold determines which part is an effective long word. If a part is judged to be an effective long word, it is segmented out of the sentence; otherwise the splitting scheme is applied again, the part is further split and judged, and the process loops in this way until nothing remains to process.

The thesaurus that the present invention computes with an external search engine as its database has a wide range of applications, for example:
1. Dynamic thesaurus updating: keeping the thesaurus current at all times makes queries and classification more precise.
2. Problem-solution matching: with the vocabulary subjected to correlation analysis, a user who submits a "problem" query receives a "solution"-style response from the system.
3. The thesaurus as a basis for classification: once a complete thesaurus is available, the classified vocabulary can be used to parse the content of documents and classify them.
4. Automatic abstracting: the thesaurus serves as the basis for locating the key words and sentences of a document; the key points are marked out to become the document's abstract, so that users can browse the automatic abstract quickly and read documents more efficiently.
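One possible reading of this recursive splitting can be sketched as follows, with the validity test mocked by a fixed word set rather than live search-engine statistics (the memoized set-union formulation is an implementation choice, not prescribed by the patent):

```python
from functools import lru_cache

VALID = {"大陸", "油田"}  # mocked: in practice, feature value >= threshold

def is_valid(word):
    return word in VALID

@lru_cache(maxsize=None)
def extract_long_words(s):
    """Recursively split s into s[:-1] and s[1:] until valid words emerge."""
    if len(s) < 2:
        return frozenset()      # single characters are not candidates
    if is_valid(s):
        return frozenset([s])   # effective long word: segment it out
    # Otherwise split off one character at each end and recurse.
    return extract_long_words(s[:-1]) | extract_long_words(s[1:])

print(sorted(extract_long_words("中國大陸新發現的油田")))  # → ['大陸', '油田']
```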
In sum, the main technique of the present invention is the use of an external search engine as the statistical basis for screening effective words and for segmentation. This automates the process and greatly reduces the cost of building and collecting huge document databases.

The present invention is applicable to computer information systems, and its transmission is not limited to any particular medium; networks, wireless transmission devices, and the like may all be used.

The technical content and technical features of the present invention have been disclosed above; nevertheless, those familiar with this art may still make substitutions and modifications, based on the teachings and disclosure of the invention, that do not depart from its spirit. The scope of protection of the invention should therefore not be limited to what the embodiments disclose, but should encompass the various substitutions and modifications that do not depart from the invention, as covered by the scope of the claims below.
BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is an architectural diagram of the invention applied to a computer;
Fig. 2 is a diagram of the connection between the system of the invention and the network;
Fig. 3 is the main flowchart of the invention; and
Fig. 4 is a data table for an embodiment of the invention.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW91123508A TW575813B (en) | 2002-10-11 | 2002-10-11 | System and method using external search engine as foundation for segmentation of word |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW91123508A TW575813B (en) | 2002-10-11 | 2002-10-11 | System and method using external search engine as foundation for segmentation of word |
Publications (1)
Publication Number | Publication Date |
---|---|
TW575813B true TW575813B (en) | 2004-02-11 |
Family
ID=32734257
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW91123508A TW575813B (en) | 2002-10-11 | 2002-10-11 | System and method using external search engine as foundation for segmentation of word |
Country Status (1)
Country | Link |
---|---|
TW (1) | TW575813B (en) |
- 2002
  - 2002-10-11: TW application TW91123508A granted as patent TW575813B (status: not active, IP right cessation)
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2008107305A3 (en) * | 2007-03-07 | 2008-11-06 | Ibm | Search-based word segmentation method and device for language without word boundary tag |
US8131539B2 (en) * | 2007-03-07 | 2012-03-06 | International Business Machines Corporation | Search-based word segmentation method and device for language without word boundary tag |
TWI474196B (en) * | 2007-04-02 | 2015-02-21 | Microsoft Corp | Search macro suggestions relevant to search queries |
US9348912B2 (en) | 2007-10-18 | 2016-05-24 | Microsoft Technology Licensing, Llc | Document length as a static relevance feature for ranking search results |
TWI486800B (en) * | 2008-04-11 | 2015-06-01 | 微軟公司 | System and method for search results ranking using editing distance and document information |
US9495462B2 (en) | 2012-01-27 | 2016-11-15 | Microsoft Technology Licensing, Llc | Re-ranking search results |
TWI787755B (en) * | 2021-03-11 | 2022-12-21 | 碩網資訊股份有限公司 | Method for cross-device and cross-language question answering matching based on deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019091026A1 (en) | Knowledge base document rapid search method, application server, and computer readable storage medium | |
CN102831128B (en) | Method and device for sorting information of namesake persons on Internet | |
CN104615593B (en) | Hot microblog topic automatic testing method and device | |
CN104063387B (en) | Apparatus and method of extracting keywords in the text | |
Pu et al. | Subject categorization of query terms for exploring Web users' search interests | |
CN100595753C (en) | Text subject recommending method and device | |
CN102955772B (en) | A kind of similarity calculating method based on semanteme and device | |
CN102880623B (en) | Personage's searching method of the same name and system | |
CN103514213B (en) | Term extraction method and device | |
WO2008014702A1 (en) | Method and system of extracting new words | |
WO2007143914A1 (en) | Method, device and inputting system for creating word frequency database based on web information | |
CN107967290A (en) | A kind of knowledge mapping network establishing method and system, medium based on magnanimity scientific research data | |
CN100433007C (en) | Method for providing research result | |
CN102214189A (en) | Data mining-based word usage knowledge acquisition system and method | |
CN102779135A (en) | Method and device for obtaining cross-linguistic search resources and corresponding search method and device | |
CN110134847A (en) | A hotspot mining method and system based on Internet financial information | |
CN111160007B (en) | Search method and device based on BERT language model, computer equipment and storage medium | |
CN109815401A (en) | A Person Name Disambiguation Method Applied to Web Person Search | |
CN106951420A (en) | Literature search method and apparatus, author's searching method and equipment | |
CN116738065A (en) | Enterprise searching method, device, equipment and storage medium | |
CN117216275A (en) | Text processing method, device, equipment and storage medium | |
WO2015084757A1 (en) | Systems and methods for processing data stored in a database | |
Pickard | Comparing word2vec and GloVe for automatic measurement of MWE compositionality | |
TW575813B (en) | System and method using external search engine as foundation for segmentation of word | |
Gupta et al. | Text analysis and information retrieval of text data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MM4A | Annulment or lapse of patent due to non-payment of fees |