TW575813B - System and method using external search engine as foundation for segmentation of word - Google Patents
System and method using external search engine as foundation for segmentation of word
- Publication number
- TW575813B TW91123508A
- Authority
- TW
- Taiwan
- Prior art keywords
- word
- patent application
- words
- scope
- item
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims description 17
- 230000011218 segmentation Effects 0.000 title claims description 8
- 238000012545 processing Methods 0.000 claims description 9
- 239000013598 vector Substances 0.000 description 16
- 239000003921 oil Substances 0.000 description 14
- 241000196324 Embryophyta Species 0.000 description 8
- 238000004364 calculation method Methods 0.000 description 8
- 238000007619 statistical method Methods 0.000 description 6
- 238000005516 engineering process Methods 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 239000000463 material Substances 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 239000010773 plant oil Substances 0.000 description 2
- 241000052341 Datamini Species 0.000 description 1
- 241001247287 Pentalinon luteum Species 0.000 description 1
- 230000009471 action Effects 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 238000010219 correlation analysis Methods 0.000 description 1
- 238000007418 data mining Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 235000013399 edible fruits Nutrition 0.000 description 1
- 239000004744 fabric Substances 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 238000009434 installation Methods 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 239000001397 quillaja saponaria molina bark Substances 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 229930182490 saponin Natural products 0.000 description 1
- 150000007949 saponins Chemical class 0.000 description 1
- 238000012216 screening Methods 0.000 description 1
- 230000001568 sexual effect Effects 0.000 description 1
- 239000002689 soil Substances 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
FIELD OF THE INVENTION

The present invention relates to a system and method that uses an external search engine as the foundation for word segmentation. It can be applied to knowledge management and document processing, and in particular to the document-vector computation that underlies data-mining searches and document-similarity operations.

BACKGROUND OF THE INVENTION

Data-mining and search technologies have traditionally taken keywords as input. The recent mainstream technique, however, takes an entire document as the query and searches for other documents related to it; this is document-similarity computation. Before the similarity between documents can be computed, every document must first be converted into vector form: the meaningful, effective words of an article are selected as the components of the document vector, and the number of times each component appears in the document is counted to form the vector. Converting documents into vectors quickly therefore depends on an efficient technique for the segmentation of words into effective vocabulary.

Segmentation is not limited to document-similarity analysis; the effective words it extracts also support automatic abstracting (auto abstract), document clustering, and information retrieval.

Before segmentation can be performed, however, statistics and computations must first be carried out over a document database: every candidate character group is tested, and the most frequently used words are screened out to form a thesaurus. According to the effective words in the thesaurus, the sentences of a document are then segmented into a set of meaningful, effective words, from which the document vector is formed.
The larger the document database (that is, the statistical base), the more complete the thesaurus of effective words, and the more completely and quickly the keywords of a document can be segmented out. Conversely, the smaller the database, the fewer words the thesaurus contains; the keywords of a document usually cannot be fully segmented, and similarity computation between documents loses its meaning.

Building such a thesaurus therefore requires first assembling an enormous document corpus, and the cost and time this consumes are both considerable. This is the primary problem the present invention sets out to solve.

The Internet itself is the largest database in existence, and its resources may be drawn on without restriction and without exhaustion. If network resources are used as the statistical base, that is, if an external search engine is used as the statistical source, the data and counts it returns can be fed through word-feature-value computations to build the thesaurus. In this way the thesaurus becomes more complete, while the cost of building and collecting a huge document corpus, and the computer resources it would consume, are saved.

SUMMARY OF THE INVENTION

The main purpose of the present invention is to provide a system and method that uses an external search engine as the foundation for word segmentation, so that effective words can be screened out to form a thesaurus, and documents can further be converted into vectors for document-similarity computation.

To achieve this, a threshold for the word feature value is first preset in the local system. Each word to be tested, together with all of its possible character combinations, is then submitted to an external search engine as an ordinary query. After the search engine returns its result pages, the statistics they contain are received and the word feature value is computed. If the word's feature value is greater than (or greater than or equal to) the preset threshold, the word is regarded as an effective word and added to the thesaurus.

The purpose of building the thesaurus with this system is to segment an article quickly into a document vector composed of common words, so that similarity computations can be carried out. The conversion of a document into such a vector proceeds as follows: all common words (character groups) appearing in the documents are arranged in order. A character group may be a bigram, such as the common words 「專利」 (patent), 「商標」 (trademark), and 「著作」 (work); it may also be a unigram, a trigram, or longer. The present invention uses bigrams as the basis of its demonstrations without limiting the scope of its application. The number of times each character group appears in a document is then counted as that component's value, and the component values of all character groups together constitute the vector of the document. For example, suppose the dimensions of a group of documents are arranged as (專利, 商標, 著作). If 「著作」 appears 7 times in a given document, 「商標」 appears 3 times, and 「專利」 appears 5 times, then the vector of that document is (5, 3, 7), a 3-dimensional vector.
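This vectorization step can be sketched minimally as follows, assuming segmentation has already produced a token list (the function name is illustrative, not from the patent; the term ordering and counts follow the example above):

```python
from collections import Counter

def document_vector(tokens, dimensions):
    """Count how often each dimension term occurs in the token list."""
    counts = Counter(tokens)
    # Counter returns 0 for missing terms, so absent words become 0 components.
    return [counts[term] for term in dimensions]

# Dimensions arranged as (patent, trademark, work), per the example.
dims = ["專利", "商標", "著作"]
tokens = ["專利"] * 5 + ["商標"] * 3 + ["著作"] * 7
print(document_vector(tokens, dims))  # → [5, 3, 7]
```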
Once documents have been converted into vectors, the similarity between two documents can be computed: the inner product of the two document vectors serves as the similarity value, and the larger the inner product, the greater the similarity between the two documents. For example, let one document vector be B = (b1, b2, ..., bm) and another be D = (d1, d2, ..., dm), where m is the number of dimensions. The inner-product computation multiplies the corresponding component values of the two documents and sums the products; the relevance of the two documents is their inner product B · D = b1 × d1 + b2 × d2 + ... + bm × dm.

From the above it can be seen that before documents can take part in similarity computations they must first be converted into vectors, and that the vector components are based on common vocabulary. The efficient approach is to use a thesaurus to segment the common words out of a document one by one.
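The inner-product similarity just defined is a one-line computation; a sketch, using the (5, 3, 7) vector from the earlier example against a hypothetical second document:

```python
def inner_product(b, d):
    """Similarity of two document vectors: sum of pairwise component products."""
    assert len(b) == len(d), "vectors must share the same dimensions"
    return sum(bi * di for bi, di in zip(b, d))

# (5, 3, 7) from the example; the second vector is hypothetical.
print(inner_product([5, 3, 7], [2, 0, 1]))  # → 17
```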
For example, take the sentence 「中國大陸新發現的油田」 ("a newly discovered oil field in mainland China"). All of its possible bigrams are 「中國」, 「國大」, 「大陸」, 「陸新」, 「新發」, 「發現」, 「現的」, 「的油」, and 「油田」. If the common words in the thesaurus include 「中國」, 「大陸」, 「發現」, and 「油田」, these words are segmented out one after another, while the remaining bigrams 「國大」, 「陸新」, 「新發」, 「現的」, and 「的油」 are discarded as meaningless. How to judge which candidates are common words, and what to use as the yardstick, is the crux. The present invention compares the statistics returned by an external search engine, after computation, against the preset word-feature threshold: if the computed value exceeds the threshold, the candidate is a common word; otherwise it is not.

To make use of an external search engine, the local system must connect to the Internet, obtain the data the search engine returns, and hand it to the system for computation. The local system and its external connections are described as follows.

As shown in Fig. 1, which illustrates the invention applied to a computer: the invention is first programmed as software comprising, but not limited to, computer instructions, and installed on a computer 101, which may be a desktop or a notebook computer. The computer's software 102 includes the operating system, application software, components of various kinds, databases, programs, and data; the system of the present invention is likewise computer software, residing on computer-readable media such as the storage device 103 and the memory 104, and loaded for execution into the hard disk drive 105 and the memory 106. At the local machine the user works through a keyboard or mouse 107 connected to the computer's input/output port 108. The computer instructions of the software communicate with the other hardware components through the motherboard interface 109 and are delivered to the central processing unit (CPU) 110, which executes the machine instructions; after the instructions are processed, the result is sent to the display interface card 111 for display on the screen 112.

A system user may operate the system on the local computer or over a network. A user coming from a local area network (LAN) 113 reaches the local network interface card 115 through the network equipment 114 to execute the software of these computer instructions; the system can also connect outward to the Internet through this LAN and communicate over it. A user coming from a wide area network (WAN) 116 (or from the Internet) may reach the local network interface card 115 through the network equipment 114, or log in through a modem 117 to another input/output port 118 to execute the software; the system can likewise connect outward to the Internet through this WAN and communicate over it.
The application of the present invention may reside on computer-readable media of all kinds, including but not limited to floppy disks, hard disks, optical discs, flash memory, non-volatile memory (non-volatile ROM), and random-access memory (RAM); nor is installation limited to a single computer, as load-balanced computation across multiple computers is also possible.

Fig. 2 shows one of the ways the system of the invention connects to the network. The invention is implemented as a computer system 204 installed on the local machine. The system connects through a web server 203 to the Internet 202, and communicates and exchanges messages with an external search engine 201. When the external search engine 201 returns data, the data likewise passes through the Internet 202 to the local server 203, and the local computer system 204 receives the messages and processes them.

Fig. 3 shows the main flow of the invention. First, in step 301, the threshold of the word feature value is set. The "word feature value" represents how common a word is in documents. The present invention uses mutual information (MI) as one of the statistical methods for computing the word feature value, with the formula:

MI(ab) = Log(F(ab)) - Log(F(a)) - Log(F(b))

where the word is a bigram, a is the first character of the word and b the second; F(ab) is the frequency (number of occurrences) of the bigram ab, F(a) the frequency of a, and F(b) the frequency of b.

In practice, coefficients are placed before each logarithmic term of the mutual-information formula so that the system of the invention screens words more readily. The adjusted formula is:

MI(ab) = w · Log(F(ab)) - x · Log(F(a)) - y · Log(F(b)) - z

where w, x, y, z are real coefficients, meaning that the mutual-information value is adjusted by using the coefficients as weights. Optimized coefficient values have been computed by recent researchers and are the ones used in the implementation of this invention:

MI(ab) = 0.39 · Log(F(ab)) - 0.28 · Log(F(a)) - 0.23 · Log(F(b)) - 0.32

with all coefficients close to 0.3.

In step 301 the threshold of the word feature value is set, for example mi ≥ -19. In step 302 a word is input to the system, for example 「油田」 (oil field). In step 303 the word, together with the single characters that compose it, is sent to an external search engine for querying. External search engines currently include "Google" (www.google.com), "Openfind" (www.openfind.com.tw), "AltaVista" (www.altavista.com), "Excite" (www.excite.com), and "Lycos" (www.lycos.com); the system may designate any of them, and this example uses Openfind.

In step 304 the message returned by the external search engine is received and interpreted. For example, querying the word 「油田」 on Openfind returns a page containing "Openfind found 12,928 related pages", where 12,928 is F(ab); likewise F(a) = 1,630,526 and F(b) = 1,182,678. In other words, F(ab), F(a), and F(b) are all provided by the search engine.
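A sketch of the feature-value computation with the Openfind counts just quoted. The figure stated later in the text (-18.8206) is reproduced when Log is read as the natural logarithm; the weighted variant uses the 0.39/0.28/0.23/0.32 coefficients quoted above:

```python
import math

def mi(f_ab, f_a, f_b):
    """Unweighted mutual information: MI(ab) = log F(ab) - log F(a) - log F(b)."""
    return math.log(f_ab) - math.log(f_a) - math.log(f_b)

def mi_weighted(f_ab, f_a, f_b, w=0.39, x=0.28, y=0.23, z=0.32):
    """Coefficient-adjusted variant with the optimized weights from the text."""
    return w * math.log(f_ab) - x * math.log(f_a) - y * math.log(f_b) - z

# Counts returned by the search engine for 「油田」, 「油」, 「田」.
print(round(mi(12928, 1630526, 1182678), 2))  # → -18.82
```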
In step 305 the word feature value of the word is computed. Taking mutual information as the feature value here, MI(「油田」) = Log(F(「油田」)) - Log(F(「油」)) - Log(F(「田」)) = -18.8206.

In step 306 the system judges whether the feature value exceeds the threshold; if so, it proceeds to the next step, otherwise the flow ends. Since -18.82 > -19, that is, the feature value MI(「油田」) exceeds the threshold mi, the flow proceeds to step 307, in which the word is regarded as an effective word and added to the thesaurus.

The screening of effective words is further illustrated with an embodiment. For the sentence 「中國大陸新發現的油田」, all possible bigrams, namely 「中國」, 「國大」, 「大陸」, 「陸新」, 「新發」, 「發現」, 「現的」, 「的油」, and 「油田」, are each queried on the external search engine Google. The result counts and the computed figures are shown in Fig. 4. With the word feature value based on MI, the MI values are listed as follows:

詞彙 (word)    MI
「中國」    -14.6
「國大」    -18.3
「大陸」    -14.4
「陸新」    -18.6
「新發」    -17.9
「發現」    -13.9
「現的」    -18.6
「的油」    -19.6
「油田」    -16.6
第12頁 575813Page 12 575813
「發現」    925,000
「現的」    40,200
「的油」    20,600
「油田」    77,600

If the preset threshold is 100,000, the effective words are 「中國」, 「發現」, and 「大陸」.

The third statistical method for the word feature value is information gain (IG), with the formula:

IG(t) = -Σ Pr(ci) Log(Pr(ci)) + Pr(t) Σ Pr(ci|t) Log(Pr(ci|t)) + Pr(t′) Σ Pr(ci|t′) Log(Pr(ci|t′))

where ci is a class of the documents, i = 1..m, with m the number of classes; Pr(ci) is the probability, frequency, or count with which the class appears in the whole collection; t is a bigram; Pr(t) is the frequency with which the bigram appears in the whole collection; Pr(ci|t) is the frequency with which the bigram appears within class ci; Pr(t′) is the frequency with which the bigram does not appear in the whole collection (the result count); and Pr(ci|t′) is the frequency with which the bigram does not appear within the class. This information-gain statistic is better suited to a document collection that is already classified; search engines generally allow the user to restrict a search to a particular category, as on Yahoo!. If there is no classification, the whole collection may be treated as one class.
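The MI and TF screenings described above (MI thresholds of -15 and -17, TF threshold of 100,000) can be reproduced directly from the tabulated values:

```python
MI_VALUES = {
    "中國": -14.6, "國大": -18.3, "大陸": -14.4,
    "陸新": -18.6, "新發": -17.9, "發現": -13.9,
    "現的": -18.6, "的油": -19.6, "油田": -16.6,
}
TF_VALUES = {
    "中國": 1_910_000, "國大": 35_500, "大陸": 568_000,
    "陸新": 8_530, "新發": 63_600, "發現": 925_000,
    "現的": 40_200, "的油": 20_600, "油田": 77_600,
}

def screen(values, threshold):
    """Keep the words whose feature value exceeds the threshold."""
    return {w for w, v in values.items() if v > threshold}

print(sorted(screen(MI_VALUES, -15)))      # → ['中國', '大陸', '發現']
print(sorted(screen(MI_VALUES, -17)))      # → ['中國', '大陸', '油田', '發現']
print(sorted(screen(TF_VALUES, 100_000)))  # → ['中國', '大陸', '發現']
```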
In the information-gain formula, coefficients may likewise be placed before each term to make it easier for the computer to screen words:

IG(t) = w · Σ Pr(ci) Log(Pr(ci)) + x · Pr(t) Σ Pr(ci|t) Log(Pr(ci|t)) + y · Pr(t′) Σ Pr(ci|t′) Log(Pr(ci|t′))

where w, x, y, z are real coefficients, meaning that the information gain is adjusted by using the coefficients as weights.

The fourth statistical method for the word feature value is the chi-square statistic (CHI). For a bigram ab, four counts are gathered: A, the number of times the bigram ab appears; B, the number of times the leading character a appears without the trailing character b; C, the number of times b appears without a; and D, the number of times neither character appears. The CHI value of each bigram is then computed as:

CHI(ab) = [N × (AD - CB)²] ÷ [(A + C) × (B + D) × (A + B) × (C + D)]

where N is the total number of documents. Large search engines generally provide an "advanced search" function in which the user may choose to enter "without the specified words"; in this way, the number of pages containing a but not b, and the number containing b but not a, can both be obtained.

The embodiments above use bigrams for demonstration, but the actual segmentation also applies to words of three or more characters. One way to segment such "long words" is sequential character splitting, explained as follows.
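A sketch of the chi-square computation. The counts A, B, C, D here are hypothetical; in practice they would come from the engine's "advanced search" queries as described:

```python
def chi_square(a_count, b_count, c_count, d_count):
    """CHI(ab) = N * (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    A, B, C, D = a_count, b_count, c_count, d_count
    N = A + B + C + D  # total number of documents
    numerator = N * (A * D - C * B) ** 2
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    return numerator / denominator

# Hypothetical counts: 40 pages with both characters, 60 with only the
# first, 30 with only the second, 870 with neither.
print(round(chi_square(40, 60, 30, 870), 2))  # → 185.87
```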
First, a splitting scheme is selected: for example, a string of n characters is split into two long words, one formed by characters 1 through (n-1) and one formed by characters 2 through n. The word feature value of each part is computed from the statistics returned by the external search engine, and the threshold determines which part is an effective long word. If a part is judged to be an effective long word, it is segmented out of the sentence; otherwise the splitting scheme is applied again, the part is further split and judged, and the process loops in this way until nothing remains to process.

The thesaurus that the present invention computes with an external search engine as its database has a wide range of applications, for example:
1. Dynamic thesaurus updating: keeping the thesaurus current at all times makes queries and classification more precise.
2. Problem-solution matching: with the vocabulary subjected to correlation analysis, a user who submits a "problem" query receives a "solution"-style response from the system.
3. The thesaurus as a basis for classification: once a complete thesaurus is available, the classified vocabulary can be used to parse the content of documents and classify them.
4. Automatic abstracting: the thesaurus serves as the basis for locating the key words and sentences of a document; the key points are marked out to become the document's abstract, so that users can browse the automatic abstract quickly and read documents more efficiently.
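One possible reading of this recursive splitting can be sketched as follows, with the validity test mocked by a fixed word set rather than live search-engine statistics (the memoized set-union formulation is an implementation choice, not prescribed by the patent):

```python
from functools import lru_cache

VALID = {"大陸", "油田"}  # mocked: in practice, feature value >= threshold

def is_valid(word):
    return word in VALID

@lru_cache(maxsize=None)
def extract_long_words(s):
    """Recursively split s into s[:-1] and s[1:] until valid words emerge."""
    if len(s) < 2:
        return frozenset()      # single characters are not candidates
    if is_valid(s):
        return frozenset([s])   # effective long word: segment it out
    # Otherwise split off one character at each end and recurse.
    return extract_long_words(s[:-1]) | extract_long_words(s[1:])

print(sorted(extract_long_words("中國大陸新發現的油田")))  # → ['大陸', '油田']
```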
In sum, the main technique of the present invention is the use of an external search engine as the statistical basis for screening effective words and for segmentation. This automates the process and greatly reduces the cost of building and collecting huge document databases.

The present invention is applicable to computer information systems, and its transmission is not limited to any particular medium; networks, wireless transmission devices, and the like may all be used.

The technical content and technical features of the present invention have been disclosed above; nevertheless, those familiar with this art may still make substitutions and modifications, based on the teachings and disclosure of the invention, that do not depart from its spirit. The scope of protection of the invention should therefore not be limited to what the embodiments disclose, but should encompass the various substitutions and modifications that do not depart from the invention, as covered by the scope of the claims below.
BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is an architectural diagram of the invention applied to a computer;
Fig. 2 is a diagram of the connection between the system of the invention and the network;
Fig. 3 is the main flowchart of the invention; and
Fig. 4 is a data table for an embodiment of the invention.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW91123508A TW575813B (en) | 2002-10-11 | 2002-10-11 | System and method using external search engine as foundation for segmentation of word |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW91123508A TW575813B (en) | 2002-10-11 | 2002-10-11 | System and method using external search engine as foundation for segmentation of word |
Publications (1)
Publication Number | Publication Date |
---|---|
TW575813B true TW575813B (en) | 2004-02-11 |
Family
ID=32734257
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW91123508A TW575813B (en) | 2002-10-11 | 2002-10-11 | System and method using external search engine as foundation for segmentation of word |
Country Status (1)
Country | Link |
---|---|
TW (1) | TW575813B (en) |
- 2002
  - 2002-10-11: TW application TW91123508A granted as patent TW575813B (status: not active, IP right cessation)
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2008107305A3 (en) * | 2007-03-07 | 2008-11-06 | Ibm | Search-based word segmentation method and device for language without word boundary tag |
US8131539B2 (en) * | 2007-03-07 | 2012-03-06 | International Business Machines Corporation | Search-based word segmentation method and device for language without word boundary tag |
TWI474196B (en) * | 2007-04-02 | 2015-02-21 | Microsoft Corp | Search macro suggestions relevant to search queries |
US9348912B2 (en) | 2007-10-18 | 2016-05-24 | Microsoft Technology Licensing, Llc | Document length as a static relevance feature for ranking search results |
TWI486800B (en) * | 2008-04-11 | 2015-06-01 | 微軟公司 | System and method for search results ranking using editing distance and document information |
US9495462B2 (en) | 2012-01-27 | 2016-11-15 | Microsoft Technology Licensing, Llc | Re-ranking search results |
TWI787755B (en) * | 2021-03-11 | 2022-12-21 | 碩網資訊股份有限公司 | Method for cross-device and cross-language question answering matching based on deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019091026A1 (en) | Knowledge base document rapid search method, application server, and computer readable storage medium | |
CN102831128B (en) | Method and device for sorting information of namesake persons on Internet | |
CN104615593B (en) | Hot microblog topic automatic testing method and device | |
CN104063387B (en) | Apparatus and method of extracting keywords in the text | |
Pu et al. | Subject categorization of query terms for exploring Web users' search interests | |
CN100595753C (en) | Text subject recommending method and device | |
CN102955772B (en) | A kind of similarity calculating method based on semanteme and device | |
CN102880623B (en) | Personage's searching method of the same name and system | |
CN103514213B (en) | Term extraction method and device | |
WO2008014702A1 (en) | Method and system of extracting new words | |
WO2007143914A1 (en) | Method, device and inputting system for creating word frequency database based on web information | |
CN107967290A (en) | A kind of knowledge mapping network establishing method and system, medium based on magnanimity scientific research data | |
CN100433007C (en) | Method for providing research result | |
CN102214189A (en) | Data mining-based word usage knowledge acquisition system and method | |
CN102779135A (en) | Method and device for obtaining cross-linguistic search resources and corresponding search method and device | |
CN110134847A (en) | A hotspot mining method and system based on Internet financial information | |
CN111160007B (en) | Search method and device based on BERT language model, computer equipment and storage medium | |
CN109815401A (en) | A Person Name Disambiguation Method Applied to Web Person Search | |
CN106951420A (en) | Literature search method and apparatus, author's searching method and equipment | |
CN116738065A (en) | Enterprise searching method, device, equipment and storage medium | |
CN117216275A (en) | Text processing method, device, equipment and storage medium | |
WO2015084757A1 (en) | Systems and methods for processing data stored in a database | |
Pickard | Comparing word2vec and GloVe for automatic measurement of MWE compositionality | |
TW575813B (en) | System and method using external search engine as foundation for segmentation of word | |
Gupta et al. | Text analysis and information retrieval of text data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MM4A | Annulment or lapse of patent due to non-payment of fees |