[go: up one dir, main page]

CN1151558A - Information searching method and system - Google Patents

Information searching method and system Download PDF

Info

Publication number
CN1151558A
CN1151558A CN 95118142 CN95118142A CN1151558A CN 1151558 A CN1151558 A CN 1151558A CN 95118142 CN95118142 CN 95118142 CN 95118142 A CN95118142 A CN 95118142A CN 1151558 A CN1151558 A CN 1151558A
Authority
CN
China
Prior art keywords
character string
document
valid matching
search
string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 95118142
Other languages
Chinese (zh)
Inventor
久保田理惠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Publication of CN1151558A publication Critical patent/CN1151558A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

一种通过模糊检索可以检出接近人的感觉的字符串检索技术。按照本发明,提出一种使用含有字符串字形在文献内位置信息的索引文件,对于含有与指定字符串及字符排列相似的字符串的文献进行高速检索的方式。在这种方式中,可指定待检索字符串和检索精度(大于0而小于1),并能确定含有与待检索的字符串的“相似度”超过指定检索精度的具体“相似字符串”的文献及“相似字符串”在文献内的位置。

A character string retrieval technique that can detect human-like senses through fuzzy retrieval. According to the present invention, a method for high-speed retrieval of documents containing character strings similar to a specified character string and character arrangement is proposed by using an index file containing position information of character string fonts in the document. In this way, the character string to be retrieved and the retrieval precision (greater than 0 but less than 1) can be specified, and the specific "similar character string" containing the "similarity" of the character string to be retrieved exceeds the specified retrieval precision can be determined. The location of the document and the "similar string" within the document.

Description

信息检索方法和系统Information retrieval method and system

本发明涉及高速、且以所要求的容许模糊度进行检索,例如,对于以文本文件形式存储在磁盘内的大量文献进行检索的系统和方法。The present invention relates to a system and method for searching at high speed and with required allowable ambiguity, for example, searching a large number of documents stored in a disk in the form of text files.

迄今为止,一直希望以高速检索存储在磁盘内的以自然语言书写的新闻报道、专利公报、科学技术文献等大量文献资料,并提出了各种各样的检索方式。这些检索方式大致区分如下。(a)关键字检索方式Hitherto, it has been desired to search at high speed a large number of documents such as news reports written in natural language, patent publications, and scientific and technical documents stored in a disk, and various search methods have been proposed. These search methods are broadly classified as follows. (a) Keyword search method

在这种方式中,要预先针对表示每个文献及该文献内容的关键字作成索引。这时,确定关键字的方法,有根据字形分解等自动检出关键字法、人工键入关键字法、及二者组合法。但是,这种方式只能针对有关键字索引的字符串进行检索,而且根据字形分解,自动检出关键字的精度要取决于单词、语法词典的精度,所以要在词典的编订工作上花费大量人力,这是它的缺点。(b)无索引全文检索方式In this method, an index is made in advance for keywords representing each document and the content of the document. At this time, the methods for specifying keywords include automatic detection of keywords based on character shape decomposition, manual key-in, and a combination of both. However, this method can only be retrieved for character strings with keyword indexes, and according to the decomposition of glyphs, the accuracy of automatic keyword detection depends on the accuracy of words and grammar dictionaries, so it takes a lot of work to compile dictionaries Manpower, that is its shortcoming. (b) Non-indexed full-text search method

这是一种虽然不使用索引但在每次都要当指定待检索的字符串作为检索对象的文献全文扫描的方式。也是一种要使用特殊硬件以提高检索速度的方式。但是,使用特殊硬件的系统要增加费用,同时还要受到客户服务环境方面的约制,局限于某些可以使用的机型。(c)按索引全文检索方式This is a way to scan the full text of documents that specify the character string to be searched as the search object each time, although no index is used. It is also a way to use special hardware to increase retrieval speed. However, systems using specialized hardware incur additional costs and are subject to customer service environment constraints that limit the availability of certain models. (c) Full-text search by index

本发明即属于按索引全文检索方式。这是通过使用索引,试图达到对全文进行高速检索,已知的有如下所示的几种技术方法。The present invention belongs to the full-text search method by index. This is to try to achieve high-speed retrieval of the full text by using an index. There are several known technical methods as shown below.

在特开平4-205560号公报中公布将作为检索对象的字符串按检索时使用的检索单位划分,对该每个检索单位附加升序符号,对该分成的检索单位附加表示其逻辑划分的属性符号,并对作为检索对象的字符串附加表示各个字符在检索单位中所处位置的字符位置顺序符号,作成由检索单位识别符号、字符位置顺序符号和属性符号组成的字符位置信息,并将该字符位置信息存储在每种字符区内,编制成检索文件。Japanese Unexamined Patent Publication No. 4-205560 discloses that character strings to be searched are divided into search units used in the search, each search unit is assigned an ascending order symbol, and the divided search unit is assigned an attribute symbol indicating its logical division , and add the character position order symbol indicating the position of each character in the search unit to the character string as the search object, make the character position information composed of the search unit identification symbol, the character position order symbol and the attribute symbol, and put the character The location information is stored in each character area and compiled into a retrieval file.

在特开平4-215181号公报中公布:为了减少在检索处理时对字符串的查对次数,并且为了能够采用通用信息处理装置进行高速查对,将表示构成检索对象字符串的各字符组在字符串中所处位置的字符组位置信息编制成按照各种字符组的类别分组的检索文件。It is announced in Japanese Patent Laid-Open No. 4-215181 that in order to reduce the number of checks for character strings during retrieval processing, and to enable high-speed checks using a general-purpose information processing device, each character group representing a character string constituting the search object is placed in the The character group position information of the position in the character string is compiled into a search file grouped according to the categories of various character groups.

可是,也经常会遇到不但要检索与检索字符串完全一致的字符串,并且还想检索包含部分一致字符串的文献。例如,用户对检索字符串的记忆模糊,或者会遇到字符串出现各种各样的变形,而将这些变形全部列举有时又会遇到困难的情况。However, it is often encountered that not only the character string that is exactly the same as the search character string is to be retrieved, but also the documents that contain part of the same character string are expected to be retrieved. For example, the user's memory of the search character string is fuzzy, or there are various deformations of the character string, and it is sometimes difficult to enumerate all these deformations.

现有技术中典型的部分字符串的指定方法是采用标准的表达方式。按照该方法,可以指定重复出现在0次以上的任意字符、重复出现在1次以上的任意字符、在行末位置、行首位置、以及在绝对字符码范围内的任意字符等。A typical method for specifying a partial character string in the prior art is to use a standard expression. According to this method, any character that repeats 0 or more times, any character that repeats 1 or more times, any character at the end of a line, at the beginning of a line, or within an absolute character code range can be specified.

另外,在特开平63-99830号公报公布在具备检索字符串数据和被检索字符串数据部分一致功能的系统中,设有存储表示与检索字符串数据有同类语关系的数据表以及表示该检索字符串数据是否在任何一个被检索字符串数据中出现的数据表。In addition, in the system having the function of partially matching the search character string data and the search character string data disclosed in Japanese Patent Laid-Open No. 63-99830, a data table is provided to store the same language relationship with the search character string data and to indicate the search character string data. Whether the string data appears in any of the retrieved string data in the data table.

在特开平62-221027号公报公布:当将部分对象字符串从开头断开的一段字符串在词典中未能检索的查到时,通过对只将其长度加1对下一个断开的字符串进行前推检索,既可减少无效检索次数,且可以较高的速度进行有效的单词读出。It was announced in JP-P-62-221027 that: when a character string that disconnects part of the object character string from the beginning cannot be retrieved in the dictionary, the next character string to be disconnected is obtained by adding 1 to its length. The forward retrieval of strings can not only reduce the number of invalid retrievals, but also can effectively read out words at a higher speed.

在特开平4-326164号公报和特开平5-174067公报中公布:在数据库检索系统中储存了与检索对象的每个事件的自相关信息,针对每个事件求出检索关键字的自相关信息与检索对象的上述自相关信息的一致度,并按一致度的降序输出事件编号的检索手段。It is disclosed in JP-A-4-326164 and JP-A-5-174067 that the autocorrelation information of each event of the retrieval object is stored in the database retrieval system, and the autocorrelation information of the retrieval keyword is obtained for each event A search method that outputs the event numbers in descending order of the degree of coincidence with the above-mentioned autocorrelation information of the search object.

但是,在上述现有技术的字符串检索方法中,要想指定应检索字符串的模糊度是困难的,因而在检索结果中所含的字符串有多并不是用户所要求的,或者是不合逻辑的字符串。However, in the character string retrieval method of the above-mentioned prior art, it is difficult to specify the ambiguity of the character string to be retrieved, so how many character strings are included in the retrieval result is not what the user requires, or is not suitable. logical string.

本发明的目的是提供一种可任意指定应检索字符串的模糊度的字符串检索方法。It is an object of the present invention to provide a character string search method that can arbitrarily designate the fuzziness of a character string to be searched.

本发明的另一目的是提供用于实现可任意指定应检索字符串模糊度的字符串检索方法的索引结构。Another object of the present invention is to provide an index structure for realizing a character string search method in which the ambiguity of a character string to be searched can be arbitrarily specified.

本发明的又一目的是通过模糊检索提供一种可使检索接近人的感觉的字符串检索技术。Another object of the present invention is to provide a character string retrieval technology that can make retrieval close to human perception through fuzzy retrieval.

按照本发明,为了针对由多个文献构成的数据库进行全文检索,对每个文献附加一个唯一的编号(或符号),并将各文献中的每个连续的N个字符和该N个字符所在的文献编号及其在文献中的位置的信息储存在索引文件中。索引文件适合于由字形文件和位置信息文件两个文件组成。在字形文件中存储着字形·分隔符和与其对应的文献编号和文献内位置编号在位置信息文件中所处的位置。在位置信息文件中存储着文献编号和文献内位置编号。According to the present invention, in order to perform a full-text search for a database composed of multiple documents, a unique number (or symbol) is added to each document, and each continuous N character in each document and the location of the N characters Information about the document number and its position in the document is stored in the index file. The index file is suitably composed of two files, a font file and a position information file. The font/delimiter, the document number corresponding to it, and the position of the position number in the document in the position information file are stored in the font file. The document number and the position number in the document are stored in the position information file.

按照本发明,提供一种使用上述索引文件对含有与指定字符串和字符排列相似的字符串的文献进行高速检索的方式。在这种方式中,可以指定待检索的字符串和检索精度(大于0而小于1),并能确定含有与待检索的字符串的″相似度″超过指定检索精度的″相似字符串″的文献,以及″相似字符串″在文献中的位置。According to the present invention, there is provided a system for performing high-speed retrieval of documents containing character strings similar to a designated character string and character arrangement using the above-mentioned index file. In this way, the character string to be retrieved and the retrieval precision (greater than 0 but less than 1) can be specified, and the "similar character string" containing the "similarity" of the character string to be retrieved exceeds the specified retrieval precision can be determined. document, and the position of the "similar character string" in the document.

这种方式,具体地说,就是从文献中选定与待检索的字符串″相似的字符串″,并从哪些字符是连续一致、有多少多余的字符夹在其中的两个观点出发对″相似度″作数值化处理。This method, specifically, selects "similar character strings" from the literature to be retrieved, and proceeds to "similar character strings" from the two viewpoints of which characters are continuous and consistent and how many redundant characters are sandwiched therein. "similarity" is numerically processed.

这时,如″相似度″的最高值为1,表示字符串完全一致,且当字符串完全一致时,则″相似度″必定为1。当在相似字符串中夹有不是待检索的字符串的多余字符或只有待检索的字符串的一部分出现在相似字符串中时,其″相似度″即为小于1的值,但按照本发明,如果这样的″相似度″值却相当符合人的感觉,那就是有用的。At this time, if the highest value of the "similarity" is 1, it means that the character strings are completely consistent, and when the character strings are completely consistent, then the "similarity" must be 1. When there are redundant characters that are not the character string to be retrieved in the similar character strings or only a part of the character string to be retrieved appears in the similar character strings, its "similarity" is a value less than 1, but according to the present invention , if such a "similarity" value is quite in line with human perception, then it is useful.

上述索引文件由于能够高速检索文献中任意连续的N个字符,所以通过使用上述索引文件并连续与应检索字符串的N个字符顺序进行比较,就可以高速检出哪些字符是属于连续一致,和有多少多余的字符夹在其中间。Since the above-mentioned index file can retrieve any consecutive N characters in the document at high speed, by using the above-mentioned index file and continuously comparing the order of the N characters of the character string to be retrieved, it is possible to quickly detect which characters are consecutive and consistent, and How many extra characters are sandwiched in between.

图1是表示硬件结构的框图。FIG. 1 is a block diagram showing a hardware configuration.

图2是处理单元的框图。Fig. 2 is a block diagram of a processing unit.

图3是索引文件的结构图。Fig. 3 is a structural diagram of an index file.

图4是表示索引文件编制处理的流程图。FIG. 4 is a flowchart showing index file creation processing.

图5是使用索引文件的字符串检索处理流程图。FIG. 5 is a flowchart of character string search processing using an index file.

图6是使用索引进行模糊检索处理的流程图。FIG. 6 is a flowchart of fuzzy search processing using an index.

以下参照附图说明本发明的实施例。A.硬件构成Embodiments of the present invention will be described below with reference to the drawings. A. Hardware composition

参照图1,图中示出了用来实施本发明的系统结构简图。该结构为将以下各部分连接于总线101的一般结构,其中包括具有运算和输入输出控制功能的中央处理机(CPU)102、装入程序并为CPU102提供工作区的主存储器(RAM)104、用来键入命令和字符串等的键盘106、存储用来控制中央处理机的操作系统、数据库文件、检索工具、索引文件等的硬盘108、用来显示数据库检索结果的显示器110以及用来指定显示器110屏面的任意位置并将该位置信息传送给中央处理机的鼠标器112。Referring to Fig. 1, there is shown a schematic diagram of a system structure for implementing the present invention. This structure is the general structure that the following parts are connected to the bus 101, including a central processing unit (CPU) 102 with computing and input and output control functions, a main memory (RAM) 104 that loads programs and provides a work area for the CPU 102, A keyboard 106 for typing commands and character strings, a hard disk 108 for storing the operating system used to control the central processing unit, database files, retrieval tools, index files, etc., a display 110 for displaying database retrieval results, and for specifying the display 110 any position on the screen and transmit the position information to the mouse 112 of the central processing unit.

作为操作系统最好是以Windows(微软公司商标)、OS/2(IBM商标)、AIX(IBM商标)升级的X-WINDOW系统(MIT商标)等的标准支持GUI多窗口环境,但本发明也可在PC-DOS(IBM商标)、MS-DOS(微软公司注册商标)等字符基准环境下实现,并不限定特定的操作系统环境。It is preferable to support the GUI multi-window environment with standards such as the X-WINDOW system (MIT trademark) upgraded by Windows (Microsoft Corporation trademark), OS/2 (IBM trademark), AIX (IBM trademark) as the operating system, but the present invention also It can be implemented in character-based environments such as PC-DOS (trademark of IBM), MS-DOS (registered trademark of Microsoft Corporation), and is not limited to a specific operating system environment.

另外,图1示出了独立环境的系统,但一般地说,由于数据库文件要求大容量的磁盘机,如果在客户服务器系统中采用本发明,则可将数据库文件和检索工具配置在服务机,而客户机对服务机通过以太网、令牌环等作局部区域网连接,也可只将查看检索结果的显示控制部分配置在服务机侧。B.系统构成In addition, Fig. 1 has shown the system of independent environment, but generally speaking, because database file requires the disk machine of large capacity, if adopt the present invention in client server system, then database file and retrieval tool can be configured in service machine, While the client machine is connected to the server machine through a local area network such as Ethernet, Token Ring, etc., it is also possible to configure only the display control part for checking the retrieval results on the server side. B. System configuration

下面,参照图2的框图,说明本发明的系统结构。应注意图2中用各个框图表示的单元是作为数据文件或程序文件单独或整体存储在图1的硬盘108中的。Next, referring to the block diagram of FIG. 2, the system configuration of the present invention will be described. It should be noted that the units represented by various block diagrams in FIG. 2 are individually or collectively stored in the hard disk 108 in FIG. 1 as data files or program files.

本发明的主要设想是认为数据库202是存储新闻报道用的数据库、专利公报数据库等的多个文献。但应注意本发明的适用范围并不限定由多个文献组成的数据库,对单一文献内的检索也能适用。这时,单个文献的内容(例如)是以文本文件的形式按可检索方式存储的。另外,对每个文献附加唯一的文献编号。文献编号最好是从1开始的升序编号,但对专利公报数据库也可使用申请号或公布号等唯一的文献编号。为了识别每个文献,也可不用顺序编号,而使用″ABC″、″&XYZ″等符号。但是,为表示这种识别符号通常需要的字节数比数字多,所以实际上用顺序编号识别文献的方式是最理想的。The main idea of the present invention is to consider that the database 202 stores a plurality of documents such as a news report database, a patent publication database, and the like. However, it should be noted that the scope of application of the present invention is not limited to databases composed of multiple documents, and it is also applicable to searches within a single document. At this time, the content of a single document is stored in a retrievable manner, for example, in the form of a text file. In addition, a unique document number is attached to each document. The document number is preferably an ascending number starting from 1, but unique document numbers such as application numbers or publication numbers can also be used for patent publication databases. In order to identify each document, symbols such as "ABC" and "&XYZ" may be used instead of sequential numbers. However, since more bytes than numbers are usually required to represent such identifiers, the method of identifying documents with sequential numbers is actually the most ideal.

由于对于数据库202所存储的新闻报道或专利公报这样庞大的信息内容进行直接检索需要很长的处理时间,所以通常是预先针对数据库202所存储的新闻报道内容,利用索引生成·更新模块206,编成索引文件204。在后面叙述的本发明的实施例中,按这种方式编成的索引文件204是由字形文件和位置信息文件两个文件组成。在字形文件中存储着字形·分隔符和与其对应的文献编号以及文献内的位置编号在位置信息文件中所处的位置。在位置信息文件中存储着文献编号和文献内位置编号。Since it takes a long processing time to directly retrieve such huge information content as news reports or patent publications stored in the database 202, usually the index generation/update module 206 is used to compile the content of the news reports stored in the database 202 in advance. into an index file 204. In the embodiment of the present invention described later, the index file 204 compiled in this way is composed of two files: a font file and a position information file. In the font file, the font/delimiter, the document number corresponding thereto, and the position of the position number in the document in the position information file are stored. The document number and the position number in the document are stored in the position information file.

数据库202也可以将每个文献作为单独的文件管理,或者,可将整个文献顺序排列成连续的单一文件,简言之,实质上是对每个文献附加一个唯一的编号,并可按该唯一的编号访问每个文献的内容。在前者的情况下,数据库202针对与每个文献的唯一编号和存储文献的实际文件名对应设置的数据表进行管理,而在后者的情况下,数据库202是针对与每个文献的唯一编号和单一数据库文件中的位偏移量及文献大小对应设置的数据表进行管理。检索工具208是以来自检索字符输入模块210的检索字符串作为输入来检索索引文件204,并具有将含有输入检索字符串文献的文献编号(可为多个数字)和该输入检索字符串在文献中的位置(也可为多个数字)返回的功能。检索字符输入模块210最好由多窗口环境的对话框构成,并具有用键盘106将所要求的应检索字符输入到该输入框的形式。The database 202 can also manage each document as a separate file, or can arrange the entire document into a continuous single file. In short, it is essentially attaching a unique number to each document, and it can number to access the contents of each document. In the case of the former, the database 202 manages a data table corresponding to the unique number of each document and the actual file name of the stored document, while in the case of the latter, the database 202 manages the data table corresponding to the unique number of each document. Manage the data table set corresponding to the bit offset and document size in a single database file. The retrieval tool 208 uses the retrieval string from the retrieval character input module 210 as an input to retrieve the index file 204, and has the document number (can be multiple numbers) of the document containing the input retrieval string and the input retrieval string in the document The position (can also be multiple numbers) in the returned function. The search character input module 210 is preferably composed of a dialog box in a multi-window environment, and has a form in which a desired character to be searched is input into the input box using the keyboard 106 .

另外,按照本发明的特征,检索字符输入模块210可以按0~1的数值(也可按百分数的0~100的数字)输入模糊检索的相似度。为此,检索字符输入模块210显示具有在0~1之间指示任意位置的指针的滑动块或卷滚条。该滑动块的指针(例如)可指示系统设定值1,或可用鼠标器112操作,拖拉移动指针指示其他值。In addition, according to the feature of the present invention, the search character input module 210 can input the similarity of fuzzy search by the value of 0-1 (or the number of 0-100 as a percentage). For this, the search character input module 210 displays a slider or a scroll bar having a pointer indicating an arbitrary position between 0˜1. The pointer of the slider can, for example, indicate the system setting value 1, or the mouse 112 can be used to drag and move the pointer to indicate other values.

结果显示模块212根据来自检索工具208的检索结果的文献编号和检索字符在该文献中出现位置的值访问数据库202,并在适当的单独检索结果显示窗口显示与该文献中该位置对应的行。当检索结果在该窗口的一个屏面中容纳不下时,将显示卷滚条,用户可移动卷滚条依次注意查看检索结果。C.索引文件的结构和编制方法The result display module 212 accesses the database 202 according to the value of the document number of the search result from the search tool 208 and the position where the search character appears in the document, and displays the row corresponding to the position in the document in an appropriate separate search result display window. When the search results cannot be accommodated in one screen of the window, a scroll bar will be displayed, and the user can move the scroll bar to view the search results one by one. C. Structure and preparation method of index file

在本发明中,将所有的连续N个字符及其在文献内位置以及文献内分割信息编成文献,并附加索引形成文件。在文件中,文献内的典型分割信息计有“。”、“、”等文章所用的分隔符和“第1章”、“摘要”等广义的文献分隔符。C1.字符串的标准化In the present invention, all the consecutive N characters and their positions in the document and the segmentation information in the document are compiled into a document, and an index is added to form a file. In the file, the typical segmentation information in the document includes the separators used in articles such as ".", "," and generalized document separators such as "Chapter 1" and "Abstract". C1. Standardization of strings

为了编制索引文件所必需的最初处理要进行如下所述的字符串标准化处理。这就是说,特别是当准备检索的文献是日语文本文件,而且是以半角和全角方式形成的混合文件。因此,要进行将半角字符换成相应的全角字符的处理。C2.字形信息的取出The initial processing necessary for indexing files involves character string normalization as described below. That is to say, especially when the document to be retrieved is a Japanese text document, and it is a mixed document formed in half-width and full-width. Therefore, a process of replacing half-width characters with corresponding full-width characters is performed. C2. Retrieval of font information

编制索引文件的下一个步骤是对标准化后字符串的全部字符由开头处截取从头开始的连续N个字符(以下称字形),并将其连同文献编号、文献内位置编号一并存入索引文件。但应取N≥1,如为日语,取N=2为适当。The next step of indexing the file is to intercept all the characters of the standardized string from the beginning to the beginning of the N consecutive characters (hereinafter referred to as fonts), and store them in the index file together with the document number and the position number in the document . However, N≥1 should be taken. For Japanese, N=2 is appropriate.

文献内位置编号是针对文献内检索对象的全部字符附加的文献内部唯一顺序号。并将字形开头字符的文献内位置编号作为该字形的文献内位置编号。在文献结束处的后续字符总计不到N个时,要装入X‘00’等规定的填充字符使其总计为N个。The position number in the document is the unique serial number inside the document attached to all the characters of the search object in the document. And the position number in the document of the beginning character of the glyph is used as the position number in the document of the glyph. When the number of subsequent characters at the end of the document is less than N in total, prescribed padding characters such as X'00' should be loaded to make the total number N.

另外,在本实施例中,将每个单独的文献按照对检索有意义的划分方法分割成块,并将分割信息存储在索引文件内。分割信息的存储按照与前述的字形相同的形式进行。即不采用从标准化的字符串截取字形的方式存储,改用将特别规定的分隔符连同文献的文献编号和块的边界字符的文献内位置信息一并存储。In addition, in this embodiment, each individual document is divided into blocks according to a meaningful division method for retrieval, and the division information is stored in the index file. Segmentation information is stored in the same format as the aforementioned font. That is, instead of storing the glyphs from the standardized character string, store the specially specified separator together with the document number of the document and the position information within the document of the block boundary character.

由于分隔符有多种类型,所以可以有多种不同的分割方法。但是,必须规定分隔符不得与按标准化字符串读取的字形重复。在本实施例中,当通过标准化处理是将1字节代码变换成2字节代码,因而将2个字节作为1个字看待时,如该1个字的值在255以下,则不适用常用的字符编码。因此,可将0~255的任意字值单独分配给多种类型的分隔符。Since there are many types of delimiters, there can be many different splitting methods. However, it must be specified that the delimiter must not duplicate glyphs read as normalized strings. In this embodiment, when the 1-byte code is converted into a 2-byte code through the standardization process, and thus 2 bytes are regarded as 1 word, if the value of the 1 word is below 255, then it is not applicable Commonly used character encodings. Therefore, arbitrary word values from 0 to 255 can be individually assigned to various types of delimiters.

将分割信息按照与字形相同的形式存储的优点如下。The advantages of storing segment information in the same form as fonts are as follows.

—索引的生成·更新简单。无须对分割信息另行处理。—Easy creation and update of indexes. No separate processing is required for the split information.

—不会使索引的容量有明显增加。— will not significantly increase the capacity of the index.

例如,与对每个文献内位置编号附加其所属的块编号的形式相比,容量的增加非常小。C3.文献内位置编号的具体例子For example, compared with the method in which the block number to which it belongs is added to each position number in a document, the increase in capacity is very small. C3. Specific examples of position numbers within documents

例如,将在开头含有“本日は晴天なり。ただぃま、マイクのテスト中。”(今天天晴,现正试验话筒。)这样一段文章的文献存储在数据库202(图2)内。如对该段文章的各字符附加文献内位置编号,则如下所示。For example, documents containing such an article at the beginning as "Today is sunny. ただぃま、マイクのテスト." (It is sunny today, and the microphone is being tested.) are stored in the database 202 (FIG. 2). If the position number in the document is added to each character of the paragraph, it will be as follows.

字符的文献内位置编号1 2 3 4 5 6 7 8 9 10111213141516171819202122The position number of the character in the document 1 2 3 4 5 6 7 8 9 10111213141516171819202122

标准化的字符串      本日は晴天なり。ただいま、マイクのテスト中。Standardized string 今日は晚天なり.ただいま、マイクのテスト中.

分隔方式1                         |                         |Separation method 1 | |

分隔方式2                         |         |               |Separation method 2 | | | |

现设该文献的文献编号为1号,并设上述字形的字符数N=2。这样一来,对每个字形(长度为2)附加有关的文献编号和文献内位置编号如下。字形        文献编号      文献内位置编号本日        1                1日は        1                2は晴        1                3晴天        1                4天な        1                5なり        1                6り。        1                7。た        1                8分隔符1     1                8分隔符2     1                8ただ        1                9だぃ        1                10いま        1                11ま、        1                12、マ        1                13分隔符2     1                13マイ        1                14イク        1                15クの        1                16のテ        1                17テス        1                18スト        1                19ト中        1                20中。        1                21。             1                22分隔符1        1                22分隔符2        1                22C4.文献内分割信息的作用Assume that the document number of this document is No. 1, and the number of characters of the above-mentioned glyphs is N=2. In this way, the related document number and position number in the document are attached to each font (length: 2) as follows. Ghostic Literature Number Literature Internal Literature No. 1 1 は 1 2 は は 1 3 sunny day 1 4 days な 1 5 り 1 6 り. 1 7. 1 1 8 separate symbol 1 1 8 separate symbol 2 1 8 ただ 1 9 だ ぃ ぃ ま ま 11 ま, 1 12, マ 1 13 separator 2 1 13 マ イ ク 1 15 ク の 1 16 の テ 1 17テス 1 18スト 1 19ト中 1 20中. 1 21.                                                                                                                                                   

以下说明文献内分割信息(分隔)在检索中的利用价值。·仅以特定块作为对象的检索The utility value of the segmentation information (separation) in the document for retrieval will be described below.・Retrieval targeting only a specific block

例如,文献由所谓的标题·摘要·正文组成时,一般希望仅以标题、摘要等特定部分作为对象进行检索。通过对标题的结束、摘要的结束存储分隔符及其位置信息,则可实现这样的检索。·多个字符串之间有密切关联的文献检索For example, when a document consists of so-called title, abstract, and text, it is generally desirable to search only specific parts such as the title and abstract. Such a search can be realized by storing the delimiter and its position information for the end of the title and the end of the abstract. ·Retrieval of documents closely related between multiple character strings

一般都希望对于意识到在多个字符串之间从文理上有密切关联的进行检索。In general, it is desirable to search for those who are aware of the close relationship between texts and texts among a plurality of character strings.

例如,可以想像,字符串之间的关系仅在同一文献内相比,可能就不如在同一段落内的关系密切,而在同一个句子内的关系则会更为密切,通过对段落和句子的结束处添加分隔符并储存及其位置信息就可对存在同一存储块中的文献进行检索,这样就能够进行意识到关系密切的检索。C5.索引文件的结构For example, it is conceivable that the relationship between character strings may not be as close as that in the same paragraph if they are compared only in the same document, but the relationship in the same sentence will be closer. By adding a delimiter at the end and storing its location information, documents stored in the same memory block can be searched, enabling search that is aware of the close relationship. C5. Structure of the index file

字形、分隔符及其文献编号以及文献内位置编号必须按在检索时能够高效率取出的形式存储。为此,在本实施例中,索引文件由字形文件(主要存储字形、分隔符的文件)和位置信息文件(主要存储文献编号·文献内位置编号的文件)组成。在字形文件中存储着字形、分隔符以及与其对应的文献编号和文献内位置编号在位置信息文件中所处的位置。在位置信息文件中存储着文献编号、文献内位置编号。Glyphs, delimiters and their document numbers, as well as position numbers within documents must be stored in a form that can be retrieved efficiently during retrieval. For this reason, in this embodiment, the index file is composed of font files (files that mainly store fonts and separators) and location information files (files that mainly store document numbers and position numbers in documents). The glyphs, delimiters and corresponding document numbers and the position of the position numbers in the document are stored in the font file. The document number and the position number in the document are stored in the position information file.

这种字形文件和与其对应的位置信息文件的例子示于图3。An example of such a font file and its corresponding position information file is shown in FIG. 3 .

在图3中,字形文件302的项目是数据库202所有文献中的连续N个字符(这里,N=2)的字形。为了能够进行折半检索,字形文件302的项目最好是以标准化字形的开头字符代码值按升序排序。“分隔符1”、“分隔符2”、“なり”、“は晴”等就是字形文件302的一个个的项目。例如,“分隔符1”为“,”、“、”、“。”等分隔文章、句子用的符号的总括表示,被分配给2字节的值。In FIG. 3 , the items of the font file 302 are the fonts of N consecutive characters (here, N=2) in all the documents in the database 202 . In order to perform a half search, the items of the glyph file 302 are preferably sorted in ascending order by the initial character code values of the standardized glyphs. “Separator 1”, “Separator 2”, “なり”, “は晴” and the like are individual items of the font file 302 . For example, "separator 1" is a general representation of symbols for separating sentences and sentences such as ",", ",", ".", and is assigned a 2-byte value.

图3的位置信息文件304存储与字形文件302的每个项目相对应的、至少是一个文献编号和与该每个文献编号有关的、至少是一个文献位置编号。The position information file 304 of FIG. 3 stores at least one document number corresponding to each item of the font file 302 and at least one document position number related to each document number.

为了使字形文件302的项目与位置信息文件304的项目彼此对应,图中虽未列出,但在字形文件302的每个项目中应具有与之相对应的、在位置信息文件304中的项目信息、从位置信息文件304的最前头开始的位移量和相应的位置信息文件304项目大小的信息。即在图3中,例如,字形文件302根据与“分隔符2”有关并存储在其中的位移量信息,从位置信息文件304的开头进行查找,并从查到的位置读取仅在项目大小的信息中指定的字节数。从而,可以一并读取与“分隔符2”有关的文献编号1中的8、13、22…这样的文献内位置编号值以及与文献编号2有关的文献内位置编号值…(如果有的话)、及与文献编号n有关的文献内位置编号值。In order to make the item of the font file 302 and the item of the position information file 304 correspond to each other, although it is not listed in the figure, each item of the font file 302 should have a corresponding item in the position information file 304 information, the displacement amount from the head of the position information file 304, and the corresponding position information file 304 item size information. That is, in FIG. 3, for example, the font file 302 searches from the beginning of the position information file 304 according to the displacement information related to the "delimiter 2" and stored therein, and reads only the item size from the found position. The number of bytes specified in the message for . Thus, the position number values in the document such as 8, 13, 22... in the document number 1 related to "delimiter 2" and the position number values in the document related to the document number 2... (if any words), and the position number value in the document related to the document number n.

与文献编号i有关的文献内位置编号值一般是以,例如,(文献编号i:4字节)(文献内位置编号数目k:4字节)(第1个文献内位置编号:4字节)…(第k个文献内位置编号:4字节)的形式存储的。在这个例子中,作为存储文献内位置编号的字段,虽然存储的是文献的绝对位置,要采用4个字节,但实际上是从前一个文献内位置编号的位移量起算存储,所以可节省1~3个字节。C6.索引文件的编制处理The value of the position number in the document related to the document number i is generally, for example, (document number i: 4 bytes) (the number of position numbers in the document k: 4 bytes) (the first position number in the document: 4 bytes )...(position number in the kth document: 4 bytes) is stored. In this example, as the field for storing the position number in the document, although the absolute position of the document is stored, 4 bytes are used, but it is actually stored from the displacement of the position number in the previous document, so it can save 1 ~3 bytes. C6. Preparation and processing of index files

下面参照图4说明索引文件的编制处理。这种处理是在最初建立数据库202时,或者向数据库202追加或从数据库202中删除时,利用图2的索引生成·更新模块206进行的处理。Next, the creation process of the index file will be described with reference to FIG. 4 . Such processing is performed by the index creation/update module 206 in FIG. 2 when the database 202 is initially created, or when it is added to or deleted from the database 202 .

在图4中,首先在步骤402,进行确保存储区的处理。这种处理,例如,可通过调用操作系统的功能在RAM104上获得规定大小的工作区。In FIG. 4, first, in step 402, processing for securing a storage area is performed. In this processing, for example, a work area of a predetermined size can be obtained on the RAM 104 by calling a function of the operating system.

在步骤404,从数据库202将一个文献读入在上述步骤402获得的适当存储区。At step 404, a document is read from database 202 into the appropriate storage area obtained at step 402 above.

在步骤406,对在步骤404读入的文献进行标准化处理。In step 406, normalize the documents read in in step 404.

在步骤408,通过扫描标准化后的文献,作成字形·分隔符,并将字形·分隔符连同该文献的文献编号及字形·分隔符的文献内位置编号存储在步骤402获得的存储区内。In step 408, the font-separator is created by scanning the standardized document, and the font-separator is stored in the storage area obtained in step 402 together with the document number of the document and the position number of the font-separator in the document.

在步骤408的处理中,随着将字形、文献编号及文献内位置编号存储在步骤402预先获得的存储区内,该获得的存储区的空闲区域可能还没有存满。因此,在步骤410,对所获得的存储区是否存满进行检查处理,如果已存满,则在步骤412,根据,例如,字形及分隔符的编码值、文献编号、文献内位置编号将存储区所存储的字形、分隔符和文献的文献编号以及字形、分隔符的文献内位置信息进行分类,并将其作为中间文件写入磁盘108(图1),因此,写成中间文件的数据所占的存储区在以下的处理中还可开放使用。而后面的处理则进入步骤414。In the process of step 408, as the font, document number and position number in the document are stored in the pre-obtained storage area in step 402, the free area of the obtained storage area may not be full yet. Therefore, in step 410, check whether the obtained storage area is full, if it is full, then in step 412, according to, for example, the encoding value of font and separator, document number, position number in the document will store The document numbers of the fonts stored in the district, delimiters and documents and the position information in the documents of fonts and delimiters are classified, and it is written into the disk 108 (Fig. 1) as an intermediate file. The storage area of is also open for use in the following processing. And the subsequent processing enters step 414 .

如在步骤410判断出存储区尚有余裕,则处理直接进入步骤414。If it is determined in step 410 that there is still room in the storage area, then the process directly proceeds to step 414 .

在步骤414,判断在数据库202内是否还留有在步骤404还未读完的文献。如果是,则处理返回步骤404。In step 414, it is judged whether there are still documents that have not been read in step 404 in the database 202. If so, processing returns to step 404 .

在步骤414,如判断出数据库202的全部文献已读入处理完毕,则仍根据字形、分隔符的编码值、文献编号、文献内位置编号将留在步骤402获得的存储区内尚未写出的字形、分隔符和文献的文献编号以及字形·分隔符的文献内位置信息进行分类,并将其作成中间文件,写入磁盘108(图1)。In step 414, if it is judged that all the documents of the database 202 have been read in and processed, then still according to the code value of the font, the separator, the document number, the position number in the document will stay in the storage area obtained in step 402 and have not written yet The document numbers of the fonts, delimiters, and documents, and the position information within the document of the fonts and delimiters are classified into intermediate files and written to the disk 108 (FIG. 1).

通过在步骤412和步骤416的中间文件写入处理,在磁盘108内存有多个中间文件,由于该各中间文件已预先作过分类,所以在步骤418中利用众所周知的合并分类技术进行处理,并从上述多个中间文件编制成图3所示的字形文件302和位置信息文件304,并将其存储在磁盘108内。在原来的多个中间文件中,字形有可能重复出现几次,因此,将重复的同一字形的项目合并为一个,并对与其有关的文献编号和文献内位置编号进行相关的处理。D.使用索引文件的检索处理Through the intermediate file writing process in step 412 and step 416, there are a plurality of intermediate files in the disk 108. Since these intermediate files have been classified in advance, they are processed in step 418 using well-known merge and classification techniques, and A font file 302 and a position information file 304 shown in FIG. In multiple original intermediate files, fonts may appear repeatedly several times. Therefore, the repeated items of the same font are merged into one, and related processing is performed on the document number and the position number in the document. D. Retrieval processing using index files

以下参照图5的流程图,说明使用上述编成的索引文件进行检索处理的例子。首先,在步骤502,显示,例如,具有输入框的对话框,对用户提示要进行输入处理,将检索字符串输入到该输入框中。Hereinafter, an example of search processing using the index file created above will be described with reference to the flowchart of FIG. 5 . First, in step 502, for example, a dialog box with an input box is displayed, prompting the user to perform input processing, and a search character string is input into the input box.

用户向输入框输入检索字符串,敲OK按钮,于是按需要进行检索字符串的标准化处理,然后从该检索字符串的开头数N个字符的字形使用上述索引文件进行检索处理。这里所说的N个字符的字形的长度是与上述索引文件的字符串字形长度N相同,因此,可以将取检索字符串的部分字符串的N个字符的字形作为关键字,对上述索引文件进行高速折半检索。日语文献的适当N值的一个例子是取N=2。The user enters the search character string into the input box and presses the OK button, then standardizes the search character string as required, and then uses the above-mentioned index file to perform search processing on the glyphs of N characters from the beginning of the search character string. The length of the font of N characters mentioned here is the same as the character string font length N of the above-mentioned index file, therefore, the font of N characters of the partial character string of the retrieval character string can be used as a key to the above-mentioned index file Perform high-speed half search. An example of a suitable value of N for Japanese documents is to take N=2.

在步骤506,如经判断发现找不到检索字符串开头的N个字符的字形,则在步骤508中在信息框中适当地显示出找不到检索字符串的信息,处理结束。In step 506, if it is judged that the font of N characters at the beginning of the search character string cannot be found, then in step 508, the message that the search character string cannot be found is appropriately displayed in the message box, and the process ends.

在步骤506,如经判断已经找到检索字符串开头的N个字符的字形,则由于从索引文件返回一个以上的文献编号和该文献编号的至少一个文献内位置编号,所以为了进行后面的处理要在步骤510中将该信息先存入主存储器或磁盘上的规定缓冲区内。In step 506, if it is judged that the font of N characters at the beginning of the search character string has been found, then more than one document number and at least one position number in the document of the document number are returned from the index file, so in order to carry out subsequent processing In step 510, the information is first stored in the main memory or in a specified buffer on the disk.

在步骤512,判断检索字符串是否已按N个字符的字形的部分字符串全部检索完毕,如果是,则处理进入步骤520。如果不是,则在步骤514,根据检索字符串的下一个N个字符的字形使用上述索引文件进行检索处理。检索字符串的长度一般不限于N的倍数,因此,当检索一个个的N个字符的字形的处理一直进行到靠近检索字符串末端的部分字符串时,索引文件关键字的字符串有时会比N个字符的字形短。遇到在这种情况,可取检索字符串最后的N个字符中的部分字符串。这一来,结果就会与在这之前所取的N个字符有所重复。当检索字符串不满N个字符时,折半检索有多个候选结果,其后的处理就是通过顺序检索找出多个候选结果。In step 512, it is judged whether the search character string has been retrieved according to the partial character string of N characters font, if yes, then the process proceeds to step 520. If not, then in step 514, use the above-mentioned index file to perform retrieval processing according to the glyphs of the next N characters of the retrieval character string. The length of the search character string is generally not limited to a multiple of N. Therefore, when the processing of retrieving glyphs of N characters one by one is carried out to a part of the character string near the end of the search character string, the character string of the index file keyword is sometimes longer than Glyphs of N characters are short. In this case, it is advisable to retrieve part of the last N characters of the string. In this way, the result will be repeated with the N characters taken before. When the search string is less than N characters, there are multiple candidate results for the half search, and the subsequent processing is to find multiple candidate results through sequential search.

在步骤516中,与步骤506相同,是判断是否在索引文件中找到与检索字符串的N个字符的字形。但是,步骤516与步骤506有本质上的差异,在步骤516,找到和未找到的意思是指:所找的是具有与检索字符串开头前N个字符相关的、在某个文献编号中的某个文献内位置编号只加N的文献内位置编号中的字形。In step 516, the same as step 506, it is judged whether the glyphs of N characters of the search character string are found in the index file. However, step 516 is substantially different from step 506. In step 516, the meanings of finding and not finding refer to: what you are looking for is related to the first N characters of the search character string, in a certain document number The position number in a certain document only adds the glyph in the position number in the document of N.

在步骤516,如经判断,发现找不到检索字符串的该N个字符的字形,则在步骤508中在信息框中显示找不到检索字符串的信息,于是处理结束。In step 516, if it is judged that the glyphs of the N characters of the search character string cannot be found, then in step 508, the message that the search character string cannot be found is displayed in the message box, and the process ends.

在步骤516,如经判断,发现已找到检索字符串开头N个字符的字形,则只要顺序地循环从索引文件检索结果返回的文献编号及在该文献编号中至少一个文献内位置编号中的一个开头N个字符的字形的信息和同一文献内的位置编号,在步骤518,为进行以后的处理,先将信息存储在主存储器或磁盘上的规定缓冲区内。In step 516, if it is judged that the font of the first N characters of the search character string has been found, then only one of the document number returned from the index file retrieval result and at least one position number in the document in the document number is sequentially circulated. The font information of the first N characters and the position number in the same document, in step 518, for subsequent processing, the information is first stored in the main memory or in the specified buffer on the magnetic disk.

在步骤512,如判断检索字符串已全部检索完毕,则进入步骤520,由存储在缓冲区内的文献编号及文献内位置编号确定存在检索字符串的文献编号及其位置,在步骤522,用该文献编号及文献内位置编号访问数据库202的存储内容,并将存在文献检索字符串的文献中的该行在另一个窗口内恰当显示。In step 512, if it is judged that the search character string has all been retrieved, then enter step 520, and determine the document number and its position where the search character string exists by the document number stored in the buffer and the position number in the document, and in step 522, use The document number and the position number in the document access the stored content of the database 202, and display the line in the document in which the document search character string exists appropriately in another window.

为了检查检索字符串是否出现在文献内的特定的块(例:第3个块),应计算上述检索字符串在文献内出现位置之前出现的上述文献内的分隔位置,借以检查上述检索字符串在上述文献内位于哪一个块(第几个块),也可与指定块的编号进行比较。E.模糊检索处理In order to check whether the search string appears in a specific block (example: the 3rd block) in the document, the separation position in the above document where the above search string occurs before the occurrence position in the document should be calculated to check the above search string Which block (how many blocks) is located in the above-mentioned document can also be compared with the number of the designated block. E. Fuzzy retrieval processing

图5所示使用索引文件进行的处理,可以说进行的是严谨的检索处理,但按照本发明,是包括使用索引文件,能够按照指定的字符串和与字符排列相似的字符串,对数据库每个文件高速执行称之为模糊检索的处理。特别是,在这种方式中,可指定待检索的字符串和检索精度(大于0而小于1),并能确定所含与待检索的字符串的具体″相似度″超过指定检索精度的″相似字符串″的文献及″相似字符串″在文献内的位置。E1.凭人的感觉确定字符串的相似The processing shown in Figure 5 using the index file can be said to be rigorous retrieval processing, but according to the present invention, it includes the use of the index file, which can search the database for each high-speed execution of a process called fuzzy retrieval. In particular, in this way, the character string to be retrieved and the retrieval precision (greater than 0 but less than 1) can be specified, and it can be determined that the specific "similarity" contained in the string to be retrieved exceeds the specified retrieval precision. The document of "similar character string" and the position of "similar character string" in the document. E1. Determine the similarity of character strings based on human feeling

凭懂日语的人的感觉,发现字符的排列相似、且含义相近的日语字符串有下列几种情况。(1)片假名的表达方式不同小字体和大字体    “ソフトウエフ”“ソフトウエア”(软件)有无长音“-”     “コンパイラ-”“コンパイラ”(编程器)有无中置圆点“·”“アイビ-エム”“アイ·ビ-·エム”(IBM)其他              “ビルデインゲ”“ビルヂング”(建筑物)(2)在汉字词组与汉字词组之间插入助词等According to the feelings of people who understand Japanese, there are the following situations in which Japanese character strings with similar character arrangement and similar meaning are found. (1) Katakana expressions are different. Small fonts and large fonts. "ソフトウエフ" and "ソフトウエア" (software) have a long sound "-" ""アイビ-エム","アイ·ビ-·エム"(IBM)Others

“在宅起诉”“在宅のまま起诉”"Suing at Home" "Suing at Home のまま"

“政界再编”“政界の再编”(3)汉字词组复合词和缺一部分的词组复合词"Politics Reedited" "Politics の Reedited" (3) Phrase Compound Words with Chinese Characters and Partial Phrase Compound Words

“国立民族博物馆”“国立博物馆”“民族博物馆”(4)因用省略语等而缺少部分字符"National Museum of Ethnology", "National Museum of Ethnology", "Museum of Ethnology" (4) Some characters are missing due to abbreviation, etc.

“ソフトウエア开发”“ソフト开发”(软件开发)(5)外来语错拼"ソフトウエエアdevelopment" "ソフトdevelopment" (software development) (5) misspelling of foreign words

“カリフオルニア”“カリフオリニア”(加利福尼亚)"California" "California" (California)

以上情况的共同的特点是字符大体上连续一致,但有缺少或多出字符。The common feature of the above situations is that the characters are generally consistent, but there are missing or extra characters.

如从哪方面相似的观点出发来研究几个词,与“ソフトメ-カ-”相似的按序排列计有“ソフトのメ-カ-”、“ソフト开发メ-カ-”、“ソフトの开发メ-カ-”,而如果与“政治资金规正法案”相比,感觉到与其相似的按序排列则有“政治资金规正法”、“政治资金规正”、“政治资金”。If we study a few words from the similar point of view, the similar order to "ソフトメ-カ-" includes "ソフトのメ-カ-", "ソフトの发展メ-カ-", "ソフトの攻略メ-カ-”, and if compared with the “Political Funds Regulation Act”, there are “Political Funds Regulation Act”, “Political Funds Regulation”, and “Political Funds” in order of similarity.

另外,字符虽可以说一致,但要说“ソフトクリ-ム制造机械の制造を主业务とする机械メ-カ-”(以奶品制造机械为主业务的机械制造厂)与“ソフトメ-カ-”是相似的字符串,那就是不合逻辑的感觉了。In addition, although the characters can be said to be the same, but to say "ソフトクリ-ム Manufacturing Machinery の Manufacturing を Main Business とする机场メ-カ-" (a machinery manufacturing factory with dairy manufacturing machinery as its main business) and "ソフトメ-カ-" is a similar string, that is the illogical feeling.

人是否会感到字符串相似的感觉可归纳为,(A)连续一致的字符越多越感到相似,(B)中间夹杂的不一致字符越多越感到不相似,(C)中间夹杂的不一致字符过多就感觉不到是一个字符串。Whether people feel the string similarity can be summarized as follows: (A) the more consistent characters, the more similar they feel; (B) the more inconsistent characters in the middle, the more dissimilar they feel; (C) the more inconsistent characters in the middle It doesn't feel like a string at most.

这时,必须考虑输入字符串在文献中的接近位置重复出现的具体情况。如举个例子,就是输入字符串为“理学部长に就任”,而文献中为“理学部部长に就任”。重复出现的“部”这个字符中的一个虽是多余的字符,但与“理学部の长に就任”的这个无关系的字终“の”相比,就应当认为前者是接近的一致字符的想法是妥当的。E2.索引文件的结构和一致度At this time, it is necessary to consider the specific situation that the input string repeats at close positions in the document. For example, the input character string is "Minister of Scienceにtake office", but in the document it is "Minister of Scienceにtake office". Although one of the repeated characters of "部" is a superfluous character, compared with the irrelevant character ending "の" of "论学部の长に职交", it should be considered that the former is a close consistent character. is appropriate. E2. Structure and consistency of index files

图3所示的索引文件结构,是在N个字符的字形上附加文献编号及文献内位置编号的索引,检索处理是以一个字形为最小单位进行的检索处理,检出该文献编号及文献内位置编号。但在检索不满N个字符的字符串时,必须将从待检索的字符串从头开始,以全部字形的字符为最小单位,进行检索处理,其个数有时是相当多的。与输入的字符串中的字符在N个以上时的检索次数顶多是以输入字符串中的字符个数为最小单位的检索相比,可以说不满N个字符的输入字符串的检索负荷更大。The index file structure shown in Figure 3 is an index in which a document number and a position number in a document are added to a glyph of N characters. position number. However, when retrieving a character string that is less than N characters, it is necessary to start from the beginning of the character string to be retrieved, and use the characters of all glyphs as the minimum unit to perform retrieval processing, and its number is quite large sometimes. Compared with the retrieval times when the input string contains more than N characters, at most the number of characters in the input string is the minimum unit of retrieval, it can be said that the retrieval load of the input string that is less than N characters is greater. big.

因此,应舍去不满N个字符的部分的一致,而根据N个字符连续一致的部分来确定相似字符串,可以认为这是确为保持高速性的这种说法是适当的。E3.相似字符串和相似度的确定规则Therefore, it is appropriate to say that the matching of parts less than N characters should be discarded, and similar character strings should be determined based on parts of N characters that are consecutive and matching. E3. Determination rules for similar character strings and similarity

从与输入字符串有M个字符以上连续一致的字符串中,收集与输入字符串有相同顺序关系相互间比较接近的作为相似字符串,根据一致的字符数、不一致的字符数来计算相似度就是规则的概要。From the strings that have more than M consecutive characters with the input string, collect the similar strings that have the same sequence relationship with the input string and are relatively close to each other as similar strings, and calculate the similarity based on the number of consistent characters and the number of inconsistent characters That's a summary of the rules.

首先,定义在说明中使用的术语。First, define the terms used in the description.

一致字符串:consistent string:

待检索的字符串与文献原文有M个字符以上连续一致的部分。从相同的字符开始选定其中最大的长度。The character string to be retrieved has more than M consecutive characters consistent with the original text of the document. Select the largest length among them starting from the same character.

(例)待检索的字符串:政治资金规正法案(Example) String to be searched: Political Fund Regulation Act

文献原文:…资金规正のために法のカで…The original text of the document: ...Fund regulation のために法のカで...

设M=2。因此,“资金规正”为一致字符串。这时,因为要选择最长的,所以“资金”和“资金规”不能称作一致字符串。而“法”因为不满2个字符所以不属于一致字符串。Let M=2. Therefore, "funding regularization" is a consistent string. At this time, "fund" and "fund rule" cannot be called a consistent character string because the longest one should be selected. And "law" does not belong to the consistent character string because it is less than 2 characters.

有效一致字符串:Valid consistent strings:

构成相似字符串的一致字符串。Consistent strings that form similar strings.

最大不一致字符串长度L:Maximum inconsistent string length L:

相似字符串中含有的不一致字符连续达L个字符。L为1以上的常数。Inconsistent characters contained in similar character strings are continuous up to L characters. L is a constant of 1 or more.

以下说明″相似字符串″的选定方法和″相似度″的数值化方法。(1)第1个有效一致字符串的确定The selection method of "similar character string" and the numerical method of "similarity" will be described below. (1) Determination of the first effective consistent string

按照在文献中的顺序取第1个一致字符串作为第1个有效一致字符串。Take the first consistent string as the first valid consistent string according to the order in the literature.

其中,in,

第i个有效一致字符串在文献中的开始位置标记为s(D,i)The starting position of the i-th valid consistent string in the document is marked as s(D, i)

第i个有效一致字符串在文献中的结束位置标记为e(D,i)The end position of the i-th effective consistent string in the document is marked as e(D, i)

第i个有效一致字符串在待检索的字符串中的开始位置标记为s(C,i)The starting position of the i-th valid consistent string in the string to be retrieved is marked as s(C, i)

第i个有效一致字符串在待检索的字符串中的结束位置标记为e(C,i)。(2)下一个有效一致字符串的确定The end position of the i-th effective consistent string in the string to be retrieved is marked as e(C, i). (2) Determination of the next valid consistent string

在确定第i个有效一致字符串时,按如下方法确定第i+1个有效一致字符串。When determining the i-th valid consistent character string, determine the i+1-th valid consistent character string as follows.

当最初的一致字符串满足以下a)、b)两个条件时,取作第i+1个有效一致字符串。a)e(D,i)+1<=s(D,i+1)<=e(D,i)+L+1When the initial consistent character string satisfies the following two conditions a) and b), it is taken as the i+1th effective consistent character string. a) e(D, i)+1<=s(D, i+1)<=e(D, i)+L+1

上式是指:当第i个有效一致字符串与第i+1个有效一致字符串之间夹入的多余字符允许在L个字符以下The above formula means: when the redundant characters inserted between the i-th effective consistent string and the i+1 effective consistent string are allowed to be less than L characters

(参照后文的例3)b)s(C,i+1)>e(C,i)-(M-1)(Refer to Example 3 below) b) s(C, i+1)>e(C, i)-(M-1)

在未选定符合条件的有效一致字符串之前反复进行。(3)″相似字符串″及其″相似程度″(相似度)的确定Iterates until no valid consistent string matching the criteria is selected. (3) Determination of "similar character strings" and their "similarity" (similarity)

如未选出上述的有效一致字符串,则以第1个有效一致字符串的起始字符到最后的有效一致字符串的最后字符作为″相似字符串″,并按下式计算″相似度″。相似度=If the above-mentioned valid and consistent character string is not selected, then use the initial character of the first valid consistent character string to the last character of the last valid consistent character string as "similar character string", and calculate "similarity" according to the formula . Similarity =

(待检索字符串中属于有效一致字符串的字符数(The number of characters in the string to be retrieved that are valid consistent strings

/待检索的字符串的字符数,/ The number of characters of the string to be retrieved,

″相似字符串″中属于有效一致字符串的字符数The number of characters in "similar strings" that are valid consistent strings

/″相似字符串″的字符数)最小值E4.″相似字符串″中属于有效一致字符串字符数的计算方法/The number of characters of "similar character string") the minimum value E4. The calculation method of the number of characters belonging to an effective consistent character string in "similar character string"

当有2个字符与待检索字符串中的相应字符相同时,第1个字符按1计算,第2个字符按0.5计算。其他场合一个字符按1计算。(参照后文的例4)E5.″相似字符串″的确定顺序When there are 2 characters identical to the corresponding characters in the string to be retrieved, the first character is counted as 1, and the second character is counted as 0.5. In other cases, a character is counted as 1. (Refer to example 4 below) E5. Determination sequence of "similar character string"

第1个″相似字符串″从文献的开头开始进行比较确定。在确定第i个″相似字符串″时的过程中,从第i个″相似字符串″的开头字符向后找,在构成第i个″相似字符串″的有效一致字符串中找出不属于有效一致字符串的字符,然后开始进行比较,从而找出第i+1个相似字符串。The first "similar character string" is compared and determined from the beginning of the document. In the process of determining the i-th "similar character string", look backward from the beginning character of the i-th "similar character string", and find out different characters in the effective consistent character strings that constitute the i-th "similar character string". Characters that belong to a valid consistent string, and then start to compare, so as to find the i+1th similar string.

通过对常数L、M进行适当赋值,根据字符的排列是否相似,可以计算出与人的一般判断相当一致的″相似度″。By properly assigning constants L and M, and according to whether the arrangement of characters is similar, the "similarity" that is quite consistent with the general judgment of people can be calculated.

另外,当″相似度″为最高值1时,表示字符串完全一致,当字符串如完全一致,则″相似度″必为1。E6.模糊检索的流程图In addition, when the "similarity" is the highest value 1, it means that the character strings are completely consistent, and if the character strings are completely consistent, then the "similarity" must be 1. E6. Flow chart of fuzzy retrieval

以上的处理如用流程图表示,则如图6所示。在图6中,首先在步骤602,提示输入检索字符串。在步骤604中,提示输入0~1的相似度。在步骤602和步骤604中的字符串和数值的输入操作,通常是在对话框上使用输入框和卷滚条进行。If the above processing is represented by a flowchart, it is as shown in FIG. 6 . In FIG. 6, first at step 602, a prompt is prompted to input a search character string. In step 604, a similarity of 0-1 is prompted to be input. The input operation of character strings and values in step 602 and step 604 is usually performed on a dialog box using an input box and a scroll bar.

在步骤606,将有效一致字符串的编号i设置为1,在步骤608,进行有效一致字符串的检索。此时,假定满足有效一致字符串长度设定在M以上的条件,则在图4的处理中,按照M个字符的字形编制索引文件是有利的。之所以如此,是因为如果预先准备这样的索引文件,那末就可通过对索引文件的折半检索,高速执行任意的M字形的检索。接着,再利用索引文件从M个字符的字形的开始位置错开1个字符,在索引文件中进行M个字符的字形检索,如果该结果检出的文献编号与前一次M个字符的字形检索相同,而且,文献内位置编号又是顺序的,即可得到M+1长度的有效一致字符串。采用上述的方法,如果文献编号与前一次M个字符的字形检索相同,而且文献内位置编号又是顺序的,则每当满足一次上述的条件,有效一致字符串的长度就加1。但是,如使用索引文件进行的M个字符的字形检索什麽也没找到、或返回的文献编号不一致、或者文献内位置编号不是顺序的,那就到了有效一致字符串的结束位置。In step 606, the number i of the valid consistent character string is set to 1, and in step 608, a search for the valid consistent character string is performed. At this time, assuming that the condition that the effective consistent character string length is set to be greater than M is met, then in the processing of FIG. 4 , it is advantageous to compile the index file according to the glyphs of M characters. The reason for this is that if such an index file is prepared in advance, an arbitrary M-shaped search can be performed at high speed by half-searching the index file. Then, use the index file to stagger 1 character from the start position of the font of M characters, and search for the font of M characters in the index file. , moreover, the position numbers in the document are sequential, and an effective consistent character string of length M+1 can be obtained. Using the above method, if the document number is the same as the font search of the previous M characters, and the position numbers in the document are sequential, then every time the above condition is met, the length of the valid consistent character string will be increased by 1. However, if nothing is found in the font search of M characters using the index file, or the document numbers returned are inconsistent, or the position numbers in the documents are not sequential, then the end position of the valid consistent character string has been reached.

随着情况的不同,有时侯会完全找不到有效一致字符串,在这种情况下,根据步骤610的判断,进入步骤626,显示没有找到,并结束处理。Depending on the situation, sometimes the valid consistent character string cannot be found at all. In this case, according to the judgment of step 610, enter step 626, display that it is not found, and end the process.

在步骤610,如判断找到了有效一致字符串,则进入步骤612进行处理;在文献中,从s(D,i)到e(D,i);在检索字符串中,从s(C,i)到e(C,i),都作出有效字符串的标记。In step 610, if it is judged that an effective consistent character string is found, then enter step 612 for processing; in the document, from s(D, i) to e(D, i); in the retrieval character string, from s(C, i) to e(C, i), all mark valid character strings.

在步骤614,如发现满足下列条件:a)e(D,i)+1<=s(D,i+1)<=e(D,i)+L+1In step 614, if the following conditions are found: a) e(D, i)+1<=s(D, i+1)<=e(D, i)+L+1

且,b)s(C,i+1)>e(C,i)-(M-1)And, b) s(C, i+1)>e(C, i)-(M-1)

则继续利用索引文件检索第i+1个有效一致字符串,如已找到则,返回步骤612,对于该第i+1个有效一致字符串,在文献中从s(D,i+1)到e(D,i+1);在检索字符串中从s(C,i+1)到e(C,i+1),都作出有效字符串的标记。(在步骤618的加i,表示针对下一个有效一致字符串)Then continue to use the index file to retrieve the i+1th effective consistent character string, if found, return to step 612, for the i+1th effective consistent character string, in the document from s(D, i+1) to e(D, i+1); In the retrieval string, from s(C, i+1) to e(C, i+1), mark valid character strings. (adding i in step 618 represents the next effective consistent character string)

另一方面,在步骤616,如未找到早先找到的有效一致字符串,则在步骤620进行相似度的计算。其方法如上所述,例如,用下式计算,On the other hand, in step 616, if no valid consistent character string found earlier is found, then in step 620, the calculation of the similarity is performed. The method is as described above, for example, with the following formula,

相似度=Similarity =

(待检索的字符串中属于有效一致字符串的字符数(The number of characters in the string to be retrieved that are valid consistent strings

/待检索的字符串的字符数,/ The number of characters of the string to be retrieved,

″相似字符串″中属于有效一致字符串的字符数The number of characters in "similar strings" that are valid consistent strings

/″相似字符串″的字符数)最小值这时,″相似字符串″为从最初有效一致字符串的开始位置到最后有效一致字符串的最后位置之间的字符串。/Number of characters of "similar character strings") Minimum value At this time, "similar character strings" are character strings from the start position of the first valid consistent character string to the last position of the last valid consistent character string.

在步骤622中,根据在步骤620计算的相似度和在步骤604输入的相似度,进行结果的选定,仅当结果是大于步骤604输入的相似度时,方才在步骤624进行结果显示。In step 622, the result is selected according to the similarity calculated in step 620 and the similarity input in step 604, and only when the result is greater than the similarity input in step 604, the result is displayed in step 624.

在步骤624进行的处理操作是根据在步骤608、步骤614的索引文件检索结果返回的文献编号和文献内位置编号,访问存储在数据库内的文献内容,并显示该位置所在行。The processing operation in step 624 is to access the document content stored in the database according to the document number and position number in the document returned by the index file retrieval results in steps 608 and 614, and display the row where the position is located.

另外,一个检索字符串的″相似字符串″,有可能在多个文献中同时找到,在一个文献中也可在多处找到。因此,必须注意,步骤606~622适用于这样的多个″相似字符串″,而在步骤624中却是仅在多个″相似字符串″之中选定满足相似度条件的进行显示。E7.确定″相似字符串″和相似度的例子In addition, a "similar character string" of a search character string may be found in multiple documents at the same time, and may also be found in multiple places in one document. Therefore, it must be noted that steps 606-622 are applicable to such a plurality of "similar character strings", but in step 624, only those satisfying the similarity condition are selected for display among the plurality of "similar character strings". E7. Example of Determining "Similar Strings" and Similarity

所示的例子设M=2,L=3。(例1)The example shown assumes M=2, L=3. (example 1)

               1 2 3 4 5 6待检索字符串C:    アイビ-エム#(アイビ-エム#为IBM公司商标)        1 2 3 4 5 6 String to be retrieved C: アイビ-エム# (アイビ-エム# is a trademark of IBM Corporation)

               1 2 3 4 5 6 7 8…文献          D:  アイ·ビ-·エム…                                                                                   

开头最长一致字符串为″アイ″,因此The longest consistent string at the beginning is "アイ", so

第1个有效一致字符串为″アイ″s(C,1)=1e(C,1)=2The first valid consistent character string is "アイ" s(C, 1)=1e(C, 1)=2

                             s(D,1)=1e(D,1)=2s(D,1)=1e(D,1)=2

因e(C,1)-(M-1)=1,所以将从待检索的字符串的第2个字符以后开始的字符串与从文献的3、4、5或6开始的字符串进行比较,以检索第2个有效一致字符串(因e(D,1)+1=3、e(D,1)+L+1=6)。Because e(C, 1)-(M-1)=1, the character string starting after the second character of the character string to be retrieved is compared with the character string starting from 3, 4, 5 or 6 of the document Compare to retrieve the second valid consistent character string (because e(D, 1)+1=3, e(D, 1)+L+1=6).

第2个有效一致字符串为″ビ-″s(C,2)=3、e(C,2)=4s(D,2)=4e(D,2)=5The second valid consistent character string is "ビ-" s(C, 2)=3, e(C, 2)=4s(D, 2)=4e(D, 2)=5

因e(C,2)-(M-1)=3,所以将从待检索的字符串的第4个字符以后开始的字符串与从文献的5、6、7或8开始的字符串进行比较,以检索第3个有效一致字符串(因e(D,2)+1=6、e(D,2)+L+1=9)。Because e(C, 2)-(M-1)=3, the character string starting after the 4th character of the character string to be retrieved is compared with the character string starting from 5, 6, 7 or 8 of the document Compare to retrieve the third valid consistent character string (because e(D, 2)+1=6, e(D, 2)+L+1=9).

第3个有效一致字符串为″エム″s(C,3)=5e(C,3)=6The third valid consistent character string is "エム" s(C, 3)=5e(C, 3)=6

                             s(D,3)=7e(D,3)=8s(D, 3)=7e(D, 3)=8

因为已经到了待检索的字符串的尾端,所以第3个有效一致字符串是最后的一个。Since the end of the string to be retrieved has been reached, the third valid consistent string is the last one.

            アイビ-エム      アイビ-エム

            1   2   31 2 3

            アイ·ビ-·エム…      アイ·ビ-·エム…

            1     2    31 2 3

编号为有效一致字符串的编号。因此,″相似字符串″为从s(D,1)到e(D,3)的″アイ·ビ-·エム″。″相似度″=(6/6,6/8)的最小值=6/8=0.75(例2)number is the number of a valid consistent string. Therefore, the "similar character string" is "アイ·ビ-·エム" from s(D, 1) to e(D, 3). The minimum value of "similarity"=(6/6, 6/8)=6/8=0.75 (example 2)

               1 2 3 4 5 6 7 8 9 10待检索字符串C:     ソフトウエアメ-カ-                                                                             

               1 2 3 4 5 6 7 8 9…文献    D:ソフト开发メ-カ-…                                                                                 

  ソフトウエアメ-カ-ソフトウエエア-カ-

  1          21 2

  ソフト开发メ-カ-…Soft developed メ-カ-…

  1          2″相似字符串″=″ソフト开发メ-カ-″相似度=(7/10,7/9)的最小值=0.7(例3)1 2 "similar character string" = "soft development メ-カ-" similarity = (7/10, 7/9) minimum value = 0.7 (Example 3)

               1 2 3 4待检索字符串C:    在宅起诉                                           

               1 2 3 4 5 6 7 8 9…文献        D:  在宅のままで起诉にふみきつた。开头最长一致字符串为″在宅″,因此第1个有效一致字符串为″在宅″s(C,1)=1e(C,1)=2                                                                         The longest consistent string at the beginning is "在屋", so the first valid consistent string is "在屋" s(C, 1)=1e(C, 1)=2

                         s(D,1)=1e(D,1)=2s(D,1)=1e(D,1)=2

将从待检索的字符串的第2个字符以后开始的字符串(因e(C,1)-(M-1)=1)与从文献的3、4、5或6开始的字符串进行比较(因e(D,1)+1=3、e(D,1)+L+1=6),检索第2个有效一致字符串。The character string starting from the second character of the character string to be retrieved (because e(C, 1)-(M-1)=1) is compared with the character string starting from 3, 4, 5 or 6 of the document Compare (because e(D, 1)+1=3, e(D, 1)+L+1=6), and search for the second valid matching character string.

由于找不到第2个有效一致字符串,而且因为已经到了等检索的字符串的尾端,所以只有第1个是有效一致字符串。Since the second valid consistent string cannot be found, and because the end of the string to be retrieved has been reached, only the first one is a valid consistent string.

在宅起诉sue at home

11

在宅のままで起诉にふみきつた。Sue at the house のままでにふみきつた.

                   1因此,第1个″相似字符串″为从s(D,1)到e(D,3)的″在宅″。相似度=(2/4,2/2)的最小值=0.5        1 Therefore, the first "similar character string" is "at home" from s(D, 1) to e(D, 3). Similarity = Minimum of (2/4, 2/2) = 0.5

″在″后面的开头的非有效一致字符为″の″。从″の″后面检索第2个″相似字符串″,则The leading non-valid consistent character after "behind" is "の". Retrieve the second "similar character string" after "の", then

在宅起诉sue at home

11

在宅のままで起诉にふみきつた。Sue at the house のままでにふみきつた.

                   1 1

但是,在文献中,″在宅″和″起诉″相距4个字符,在这个例子中因L=3,所以上述″起诉″不能看作是有效一致字符串。However, in the literature, "at home" and "suit" are separated by 4 characters. In this example, since L=3, the above-mentioned "suit" cannot be regarded as a valid consistent character string.

(例4)(Example 4)

           1 2 3待检索字符串C:银行员    1 2 3 String to be retrieved C: Banker

         1 2 3 4 5 6 7 8 91 2 3 4 5 6 7 8 9

文献    D:A银行行员のBさんDocument D: A bank clerk の Bさん

开头的最长一致字符串为″银行″,因此The longest consistent string at the beginning is "bank", so

第1个有效一致字符串为″银行″s(C,1)=1e(C,1)=2The first valid consistent character string is "bank" s(C, 1)=1e(C, 1)=2

                             s(D,1)=2e(D,1)=3s(D,1)=2e(D,1)=3

                                    (式7)(Formula 7)

将从待检索字符串的第2个字符以后开始的字符串(因e(C,1)-(M-1)=1)与从文献的4、5、6或7开始的字符串进行比较(因e(D,1)+1=4、e(D,1)+L+1=7),检索第2个有效一致字符串。Compare the character string starting from the second character of the character string to be retrieved (because e(C, 1)-(M-1)=1) with the character string starting from 4, 5, 6 or 7 of the document (Because e(D, 1)+1=4, e(D, 1)+L+1=7), the second valid matching character string is retrieved.

第2个有效一致字符串为″行员″s(C,2)=2e(C,2)=3The second effective consistent character string is "member" s(C, 2)=2e(C, 2)=3

                             s(D,2)=4e(D,2)=5s(D,2)=4e(D,2)=5

因为已经到了待检索字符串的尾端,所以有效一致字符串有两个。Because the end of the string to be retrieved has been reached, there are two valid consistent strings.

银行员banker

11

22

A银行行员のBさんA bank clerk の Bさん

1     21 2

1.1. 0.5 1→3.5″相似字符串″为从s(D,1)到e(D,2)的”银行行员″。″相似度″=(3/3,3.5/4)的最小值=3.5/4=0.875E8.接近人的感觉的模糊例子ソフトウエアメ-カ-ソフトウエアのメ-カ-        0.9091.1. 0.5 1→3.5 "similar character string" is "bank clerk" from s(D, 1) to e(D, 2). The minimum value of "similarity" = (3/3, 3.5/4) = 3.5/4 = 0.875E8. Fuzzy example close to people's feeling

              ソフトウエア开发メ-カ-      0.833    ソフトウエエアDevelopment メ-カ- 0.833

              ソフトウエアの开发メ-カ-    0.769      ソフトウエアのDevelopment メ-カ- 0.769

这个例子表示随着多余字符的夹入,″相似度″降低。This example shows that "similarity" decreases with the inclusion of redundant characters.

              ニツトウエアメ-カ-    0.800    ニツトウエエアメ-カ- 0.800

              ソフトメ-カ-          0.700        ソフトメ-カ-   0.700

              ソフトウエア          0.600          ソフトウエエア 0.600

这个例子表示随着一致字符的减少,″相似性″降低。理学部长选举    理学部长选举      1.000This example shows that "similarity" decreases as the number of identical characters decreases. Science Minister Election Science Minister Election 1.000

            理学部部长选举    0.929     Election of Minister of Science 0.929

            理学部の长选举    0.857E9.索引的结构和″相似字符串″检索的关系The relationship between the structure of the index and the retrieval of "similar strings"

通过M值的适当设定,可以用本发明的索引结构高速实现搜索″相似字符串″的模糊检索。常数N、M的确定方法N:存储在索引内的字形的字符数M:模糊检索的有效一致字符串的最小长度L:模糊检索中,″相似字符串″中的非有效一致字符串的最大长度。By properly setting the value of M, the index structure of the present invention can be used to realize the fuzzy retrieval of "similar character strings" at high speed. The determination method of constant N, M N: the number of characters of the font stored in the index M: the minimum length of the effective consistent string of fuzzy retrieval L: in the fuzzy retrieval, the maximum length of the non-effective consistent string in "similar character string" length.

如N取得大,则因字形种类数目增加,检到的字形数据量减少,所以检索速度较快,但却会使索引文件的容量增加。在一般的日语文献中,以N=2可以获得充分的检索速度。If N is made large, the number of font types increases and the amount of font data detected decreases, so the retrieval speed is faster, but the capacity of the index file will increase. In general Japanese documents, sufficient retrieval speed can be obtained with N=2.

如果按照M≥N的条件确M,则在模糊检索中可以获得充分的检索速度。如从M越小、模糊检索可以越细致的角度考虑,取M=N被认为是令人满意的。E10.确定相似性的第2实施例If M is determined according to the condition of M≥N, sufficient retrieval speed can be obtained in fuzzy retrieval. Considering that the smaller M is, the finer the fuzzy retrieval can be, it is considered satisfactory to take M=N. E10. Second embodiment of determining similarity

在第2实施例的模糊检索处理中,特别考虑到所说的“中间夹杂的不一致字符越多越感到不相似”、“中间夹杂的不一致字符过多就感觉不到是相同字符串”,兼顾这两个方面。如果遇到文献中输入的字符串是按一致字符串、不一致字符串、一致字符串的顺序排列,而相似字符串又是在后面一个一致字符串之前摘取的,从而就会使相似程度降低,这是不合逻辑的。例如,当输入字符串为“在宅起诉”、而文献1为“在宅のままで起诉”、文献2为“在宅”时,按所说的″文献1“在宅のままで起诉”、文献2“在宅”都有相似字符串,但相似度却是以“在宅”者为高″那样的规则,结果是与人的感觉相反。如判断为“在宅のままで起诉”的相似度比“在宅”高、或在文献1中有“在宅”和“起诉”两个相似字符串,那就合乎逻辑了。In the fuzzy retrieval process of the second embodiment, it is particularly considered that "the more inconsistent characters are mixed in, the more dissimilar they feel" and "the more inconsistent characters are mixed in, the less likely they are the same character string", taking into account both both aspects. If the strings entered in the literature are arranged in the order of consistent strings, inconsistent strings, and consistent strings, and similar strings are extracted before the following consistent strings, the degree of similarity will be reduced. , which is illogical. For example, when the input character string is "Zaizhai sue", and document 1 is "Zaizhai の ままで sue", and document 2 is "Zai Zhai", as said "Document 1 "Zaizhai の ままで sue", Document 2" "Zai Zhai" has similar character strings, but the similarity is based on the rule that "Zai Zhai" is the highest", and the result is contrary to people's feelings. If it is judged that the similarity of "Zai Zhai の ままで sue" is higher than that of "Zai Zhai", or there are two similar strings of "Zai Zhai" and "Sue" in Document 1, it is logical.

以下,说明第2实施例的处理。如参照图6的流程图,在本实施例中,步骤602~612是相同的,而表示第i+1个有效一致字符串检索条件的步骤614则有以下的变动。Next, the processing of the second embodiment will be described. Referring to the flow chart of FIG. 6 , in this embodiment, steps 602 to 612 are the same, but step 614 representing the i+1th effective matching character string retrieval condition has the following changes.

s(C,i+1)>e(C,i)-(M-1)  …(式1)s(C, i+1)>e(C, i)-(M-1) ... (Formula 1)

s(D,i+1)>e(D,i)    …(式2)s(D, i+1) > e(D, i) ... (Formula 2)

而且,and,

s(D,i+1)-e(D,i)-1s(D, i+1)-e(D, i)-1

+max(e(C,i)-s(C,i+1)+1.0)≤L…(式3)+max(e(C, i)-s(C, i+1)+1.0)≤L...(Formula 3)

s(C,i)、e(C,i)、s(D,i)、e(D,i)等的定义仍如前述。The definitions of s(C, i), e(C, i), s(D, i), e(D, i) etc. are still as above.

式1容许象前述的“理学部部长”的“部”那样的重复出现的字符在M-1个以下,此外,这意味着凡是与输入字符串中的字符顺序相同的顺序出现的字符串都是有效的。Formula 1 allows repeated characters like the "Department" of the aforementioned "Department of Science" to be less than M-1. In addition, this means that all character strings that appear in the same order as the characters in the input string are Effective.

式2意味着在文献中有效一致字符串不重复。Equation 2 means that valid consistent strings do not repeat in the literature.

式3意味着夹在中间的不一致字符和象“理学部部长”的“部”那样的重复出现的字符加在一起容许在L个字符以下。Formula 3 means that the inconsistency characters sandwiched in between and the recurring characters like "Department" of "Department of Science" are allowed to be less than L characters.

在本实施例中,象前一实施例一样,计算在文献内的与检索字符串相似的各个字符串中有效一致字符串所占的比例,其中,凡是所占比例小的都不选入相似度,而对相似字符串加分,通过除以满分,(完全一致时的给分)求出比值,进行计算。按以下规则给各字符加分,经过累加,算出相似字符串的分数。因此,在图6的步骤620中进行如下处理。属于第1个有效一致字符串的字符    …1分属于第i个(i>1)有效一致字符串的字符In this embodiment, as in the previous embodiment, the proportion of effective and consistent character strings among the character strings similar to the search character string in the document is calculated, and those with a small proportion are not selected as similar characters. degree, and add points to similar character strings, by dividing by the full score, (the points given when they are completely consistent) to find the ratio and calculate. Add points to each character according to the following rules, and calculate the scores of similar character strings after accumulation. Therefore, the following processing is performed in step 620 of FIG. 6 . Characters belonging to the 1st valid consistent string ... 1 point characters belonging to the i-th (i>1) valid consistent string

在检索字符串中的位置≥e(C,i-1)+1(式4)…1分Position in the search string ≥ e(C, i-1)+1 (Formula 4)...1 point

在检索字符串中的位置≤e(C,i-1)+1(式5)…-1/(2*L)分不属于有效一致字符串的字符    …-1/L分The position in the search string ≤ e(C, i-1)+1 (formula 5)...-1/(2*L) points are not characters in a valid consistent string ...-1/L points

在本实施例中,在确定第i个相似字符串的过程中,从第i个相似字符串开头字符向后找,在构成第i个相似字符串中找出不属于有效一致字符串的字符,然后开始进行比较,从而找出第i+1个相似字符串。In this embodiment, in the process of determining the i-th similar character string, look backward from the beginning character of the i-th similar character string, and find characters that do not belong to an effective consistent character string in forming the i-th similar character string , and then start the comparison to find the i+1th similar string.

不属于有效一致字符串的字符的负分,是为了考虑到要兼顾“中间夹杂的不一致字符越多越感到不相似”、“中间夹杂的不一致字符过多就感觉不到是相同字符串”这两个方面而设定的。一个非一致字符串的负分总计最大为1/L*L=1,所以取下一个一致字符串的正分的最小值,当N≥1(对日语特别推荐2)时负分值不会超过其正分。另外,式5表示前述的“理学部部长”的“部”那样的重复出现的字符,式4表示不是重复出现字符的单纯一致字符。对于式5所表示的字符,要增加较比单纯非一致字符小的负分,以针对出现重复字符的情况。E11.在第2实施例中的确定相似字符串和相似度的例子The negative points for characters that do not belong to a valid consistent string are to take into account that "the more inconsistent characters are mixed in, the more dissimilar they feel" and "the more inconsistent characters are mixed in, the less they feel that they are the same string". set by two aspects. The maximum negative score of a non-consistent character string is 1/L*L=1, so take the minimum value of the positive score of the next consistent character string, and when N≥1 (especially recommended 2 for Japanese), the negative score will not exceed its positive score. In addition, Equation 5 represents a repeated character such as "Department" of the above-mentioned "Minister of Faculty of Science", and Equation 4 represents a simple coincident character that is not a repeated character. For the characters represented by formula 5, a negative score smaller than that of simple non-consistent characters should be added to address the situation of repeated characters. E11. Example of determining similar character strings and similarity in the second embodiment

作为示例,仍设N=2,L=3As an example, still set N=2, L=3

(例5)(Example 5)

输入字符串C:    アイビ-エムInput string C: アイビ-エム

                1 2 3 4 5 6 7 8…                                                  

文献的一部分D:…アイ·ビ-·エム…Part of the document D: ...アイ·ビ-·エム...

最初的最长一致字符串为“アイ”,因此第1个有效一致字符串为“アイ”The initial longest consistent string is "アイ", so the first valid consistent string is "アイ"

s(C,1)=1e(C,1)=2s(C,1)=1e(C,1)=2

s(D,1)=1e(D,1)=2由式1、式2、式3可知第2个有效一致字符串为“ビ-”s(D, 1) = 1e(D, 1) = 2 From Formula 1, Formula 2, and Formula 3, it can be seen that the second valid consistent character string is "ビ-"

s(C,2)=3e(C,2)=4s(C,2)=3e(C,2)=4

s(D,2)=4e(D,2)=5由式1、式2、式3可知第3个有效一致字符串为“エム”s(D, 2)=4e(D, 2)=5 From Formula 1, Formula 2, and Formula 3, it can be seen that the third valid consistent character string is "エム"

s(C,3)=5e(C,3)=6s(C,3)=5e(C,3)=6

s(D,3)=7e(D,3)=8因为已经到了准备检索的字符串的末尾,所以有效一致字符串有3个。s(D, 3)=7e(D, 3)=8 Since the end of the character string to be retrieved has been reached, there are 3 valid consistent character strings.

C:アイビ-エムC: アイビ-エム

   1    2    31 2 3

D:アイ·ビ-·エムD: アイ·ビ-·エム

   1    2    31 2 3

分数    1.-1.1.1.1.1.Score 1.-1.1.1.1.1.

            -1/3    -1/3-1/3 -1/3

相似字符串为从s(D,1)到e(D,3)的“アイ·ビ-·エム”。相似度=((1*6+(-1/3*2)/6)=0.88(例6)The similar character string is "アイ·ビ-·エム" from s(D, 1) to e(D, 3). Similarity = ((1*6+(-1/3*2)/6) = 0.88 (Example 6)

                1 2 3 4 5 6 7 8 9 10                    1 2 3 4 5 6 7 8 9 10

输入字符串C:    ソフトウエアメ-カ-Input string C: ソフトウエアメ-カ-

                1 2 3 4 5 6 7 8 9…                                  1 2 3 4 5 6 7 8 9…

文献的一部分D: …ソフト开发メ-カ-…Part D of the document: ...ソフト develops メ-カ-...

C:  ソフトウエアメ-カ-C: ソフトウエアメ-カ-

     1           21 2

D:  ソフト开发メ-カ-…D: Soft developed メ-カ-…

     1           2相似字符串=″ソフト开发メ-カ-″相似度=((1*7+(-1/3)*2)/10)=0.63(例7)1 2 Similar character strings = "Soft development メ-カ-" similarity = ((1*7+(-1/3)*2)/10) = 0.63 (Example 7)

                 1 2 3 41 2 3 4

输入字符串C:    在宅起诉Enter the string C: sue at home

                   1 2 3 4 5 6 7 8 91011121314…                  1 2 3 4 5 6 7 8 91011121314…

文献的一部分D:    在宅のままで起诉にふみきつた。Part D of the document: In the house の ま ま で sue にふみきつた.

最初的一致字符串为“在宅”,因此第1个有效一致字符串为“在宅”,因下一个一致字符串为“起诉”不满足式3,所以只有第1个是有效一致字符串。The initial consistent character string is "in the house", so the first valid consistent character string is "in the house", because the next consistent character string is "sued" which does not satisfy formula 3, so only the first one is a valid consistent character string.

C:    在宅起诉C: sue at home

       1 1

D:    在宅のままで起诉にふみきつた。D: Sue にふみきつた in the house のままで.

       1相似字符串为“在宅”。相似度=2/4=0.51 A similar character string is "at home". Similarity = 2/4 = 0.5

″在″后面的开头的非有效一致字符为″の″。应从″の″向后检索第2个相似字符串。The leading non-valid consistent character after "behind" is "の". The second similar character string should be retrieved backward from "の".

C:    在宅起诉C: sue at home

            1 1

D:    在宅のままで起诉にふみきつた。D: Sue にふみきつた in the house のままで.

                   1因此,第2个相似字符串为“起诉”。(例8)          1 Therefore, the second similar character string is "prosecution". (Example 8)

                1 2 3 4 5 6 71 2 3 4 5 6 7

输入字符串C:    理学部长に就任Input character string C:  

                 1 2 3 4 5 6 7 81 2 3 4 5 6 7 8

文献的一部分D:…理学部部长に就任…有效一致字符串为“理学部”、“部长に就任”两个。A part D of the document: ... the head of the Faculty of Science に taking office... The effective consistent character strings are "Faculty of Science" and "Minister に taking office".

C: 理学部长に就任C: The Minister of Science takes office

    1 1

>2>2

D: 理学部部长に就任D: The head of the Faculty of Science takes office

  1       21 2

1.1.1.       1.1.1.11.1.1. 1.1.1.1

          -1/6相似字符串为“理学部部长に就任”。第2个“部”满足式5。因此,相似度=((1*7+(-1/6)*1)/7)=0.97。E12.第2实施例的结果汇总-1/6 The similar character string is "Minister of the Faculty of Science takes office". The second "part" satisfies Equation 5. Therefore, similarity=((1*7+(−1/6)*1)/7)=0.97. E12. Summary of Results of the Second Example

输入字符串           文献中             相似性input string in the literature similarity

ソフトメ-カ-      ソフトのメ-カ-         0.95ソフトメ-カ- ソフトのメ-カ- 0.95

                  ソフトの开发メ-カ-     0.85政治资金规正法案    政治资金规正法         0.87    ソフトの发展メ-カ- 0.85 Political Fund Regulation Act Political Fund Regulation Act 0.87

                  政治资金               0.50理学部长に就任      理学部部长に就任       0.97                                                                                                 

                  理学部の长に就任       0.95                                                                0.95

如上所述,按照本发明,可以获得对文本文件或数据库使用特有的索引结构高速实现凭人的感觉的模糊检索的效果。As described above, according to the present invention, it is possible to obtain the effect of high-speed realization of fuzzy retrieval based on human perception by using a unique index structure for text files or databases.

Claims (56)

1. An information retrieval method for finding out, by computer processing, a similarity of document character strings similar to a retrieval character string among documents stored in a retrievable manner, the method comprising the steps of:
(a) inputting a search string;
(b) a step of extracting a partial character string having a length of M characters or more ((M is a predetermined integer of 2 or more) from the beginning of the search character string and detecting a start position and an end position matching the extracted partial character string in the document (hereinafter, a partial character string having a length of M characters or more determined by the start position and the end position is referred to as an effective matching character string);
(c) searching for the valid matching character string if a response indicating that no valid matching character string is detected occurs in step (b), by shifting a character from the start position of the partial character string of the search character string, and then by selecting a partial character string having a length of at least M characters (M is a predetermined integer of at least 2);
(d) searching for a valid matching character string having a starting position within a distance of L characters (L is a predetermined integer of 1 or more) from a partial character string starting position of the search character string and a search starting position in the document, respectively, by a length corresponding to the valid matching character string just detected, if a response for detecting the valid matching character string occurs;
(e) continuing the step (d) as long as the valid matching character string is detected;
(f) and calculating a similarity between the search string and a string from the start position of the first valid matching string of the document to the end position of the last valid matching string of the document based on information existing in the valid matching strings at least in the string from the start position of the first valid matching string of the document to the end position of the last valid matching string of the document.
2. The information retrieval method according to claim 1, characterized in that: m is 2, and L is 3 or more.
3. The information retrieval method according to claim 1, characterized in that: the calculation of the similarity takes a small value among the proportion of valid matching character strings in the search character string and the proportion of valid matching character strings between the start position of the first valid matching character string of the document and the end position of the last valid matching character string of the document.
4. The information retrieval method according to claim 1, characterized in that: for each character in the character string from the start position of the first valid matching character string of the above-mentioned document to the end position of the last valid matching character string of the above-mentioned document, a score is added when it belongs to the valid matching character string, and a score is subtracted when it does not belong to the valid matching character string, and the calculation of the above-mentioned similarity uses a value obtained by dividing the result score by the value of the perfect matching score.
5. An information retrieval method for finding, by computer processing, a position where a retrieval string appears in a document stored in a retrievable manner, the method comprising the steps of:
(a) inputting a search string;
(b) inputting similarity;
(c) a step of extracting a partial character string having a length of one or more than M characters (M is a predetermined integer of 2 or more) from the head of the search character string, and detecting a start position and an end position that match the partial character string in the above document, hereinafter, a partial character string having a length of one or more than M characters, which is determined by the start position and the end position, is referred to as a valid matching character string one;
(d) searching for the valid matching character string if a response indicating that no valid matching character string is detected occurs in step (c), by shifting a character from a start position of a partial character string of the search character string, and by selecting a partial character string of a predetermined integer having a length of at least M characters and having a length of at least M2;
(e) a step of searching for a valid matching character string from a partial character string start position of the search character string and a search start position in the document, if a response to the detection of the valid matching character string occurs, by a length corresponding to the valid matching character string just detected, and from a predetermined integer having a distance from the start position to the valid matching character string just detected within L characters, L being 1 or more;
(f) continuing the step (e) as long as the valid matching character string is detected;
(g) and calculating a similarity between the search string and the character string from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document based on information present in the valid matching character string at least in the character string from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document.
(h) If a response is obtained by the calculation that the similarity is greater than the similarity input in the step (b), the contents of the valid matching character string contained in the document are displayed.
6. The information retrieval method according to claim 5, characterized in that: m is 2, and L is 3 or more.
7. The information retrieval method according to claim 5, characterized in that: the calculation of the similarity takes a small value among the proportion of the valid matching character strings in the search character string and the proportion of the valid matching character strings in the character strings between the start position of the first valid matching character string of the document and the end position of the last valid matching character string of the document.
8. The information retrieval method according to claim 5, characterized in that: for each character in the character string from the start position of the first valid matching character string of the above-mentioned document to the end position of the last valid matching character string of the above-mentioned document, a score is added when it belongs to the valid matching character string, and a score is subtracted when it does not belong to the valid matching character string, and the calculation of the above-mentioned similarity uses a value obtained by dividing the result score by the value of the perfect matching score.
9. The information retrieval method according to claim 8, characterized in that: the above-mentioned addition is divided into a character 1, and the above-mentioned subtraction is divided into a character 1/L.
10. An information retrieval method for detecting, by computer processing, a similarity of a document character string similar to a retrieval character string in a database of a plurality of documents stored in a retrievable manner, the method comprising the steps of:
(a) inputting a retrieval character string;
(b) a step of extracting a partial character string having a length of M characters or more and M being a predetermined integer of 2 or more from the beginning of the search character string and detecting a start position and an end position matching the extracted partial character string in the same document in the database, wherein the partial character string having a length of M characters or more determined by the start position and the end position is referred to as an effective matching character string-;
(c) if no valid matching character string is detected in the step (b), searching the valid matching character string by shifting a character from the starting position of the partial character string of the search character string, and then taking a partial character string of which the length is more than M characters and M is a predetermined integer more than 2;
(d) searching for a valid matching character string having a distance between a start position and a valid matching character string detected immediately before, within L characters, L being a predetermined integer of 1 or more, by shifting only a length corresponding to the valid matching character string detected immediately before from a partial character string start position of the search character string and a search start position in the same document, if a response for detecting a valid matching character string occurs;
(e) continuing the step of step (d) as long as the valid consistent character string is found;
(f) and calculating a similarity between the search string and the character string from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document based on information present in the valid matching character string at least in the character string from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document.
11. The information retrieval method according to claim 10, characterized in that: m is 2 and L is 3.
12. The information retrieval method according to claim 10, characterized in that: the calculation of the similarity takes a small value between the proportion of the valid matching character strings in the search character string and the proportion of the valid matching character strings in the character strings from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document.
13. The information retrieval method according to claim 10, characterized in that: for each character in the character string from the start position of the first valid matching character string of the above-mentioned document to the end position of the last valid matching character string of the above-mentioned document, a score is added when it belongs to the valid matching character string, and a score is subtracted when it does not belong to the valid matching character string, and the calculation of the above-mentioned similarity uses a value obtained by dividing the result score by the value of the perfect matching score.
14. An information retrieval method for checking out, by computer processing, where a retrieval string appears in a database of a plurality of documents stored in a retrievable manner, the method comprising the steps of:
(a) inputting a search string;
(b) inputting similarity;
(c) a step of extracting a partial character string having a length of M characters or more and M being a predetermined integer of 2 or more from the beginning of the search character string and detecting a start position and an end position matching the extracted partial character string in the same document in the database, wherein the partial character string having a length of M characters or more determined by the start position and the end position is referred to as an effective matching character string-;
(d) searching for the valid matching character string if a response indicating that no valid matching character string is detected occurs in step (c), by shifting a character from a start position of a partial character string of the search character string, and by selecting a partial character string of a predetermined integer having a length of at least M characters and having a length of at least M2;
(e) searching for a valid matching character string having a distance of L characters or less from the start position of the partial character string of the search character string and the search start position in the same document, L being a predetermined integer of 1 or more, if a response indicating that a valid matching character string is detected occurs, by shifting only the length of the valid matching character string just detected from the start position of the partial character string of the search character string and the search start position in the same document;
(f) continuing the step of step (e) as long as the valid consistent character string is found;
(g) and calculating a similarity between the search string and the character string from the start position of the first valid matching character string of the same document to the end position of the last valid matching character string of the same document based on information present in the valid matching character string at least in the character string from the start position of the first valid matching character string of the same document to the end position of the last valid matching character string of the same document.
(h) If a response is obtained through the calculation that the degree of similarity is greater than the degree of similarity input in the step (b), the contents of the valid matching character string contained in the document are displayed.
15. The information retrieval method according to claim 14, characterized in that: the method includes a step of labeling the plurality of documents with a unique number or symbol in advance.
16. The information retrieval method as recited in claim 15, wherein: the above-mentioned inherent numbers or symbols are numbers in order.
17. The information retrieval method according to claim 14, characterized in that: m is 2 and L is 3.
18. The information retrieval method according to claim 14, characterized in that: the calculation of the similarity takes a small value among the proportion of the valid matching character strings in the search character string and the proportion of the valid matching character strings in the character strings between the start position of the first valid matching character string of the document and the end position of the last valid matching character string of the document.
19. The information retrieval method according to claim 14, characterized in that: for each character in the character string from the start position of the first valid matching character string of the above-mentioned document to the end position of the last valid matching character string of the above-mentioned document, a score is added when it belongs to the valid matching character string, and a score is subtracted when it does not belong to the valid matching character string, and the calculation of the above-mentioned similarity uses a value obtained by dividing the result score by the value of the perfect matching score.
20. The information retrieval method as recited in claim 19, wherein: the above-mentioned addition is divided into a character 1, and the above-mentioned subtraction is divided into a character 1/L.
21. An information retrieval system for detecting, by computer processing, a position where a retrieval character string appears in a document stored in a retrievable manner, the system having:
(a) means for inputting a search string;
(b) a device for searching for a matching start position and end position in the above document, wherein a partial character string having a length of M characters or more to a predetermined integer having M characters or more to 2 or more is taken as a search character string, and a partial character string having a length of M characters or more determined by the start position and the end position is referred to as an effective matching character string;
(c) a means for searching for a valid matching character string by using the means (b) from the beginning of the search character string, if a response for detecting a valid matching character string occurs, by shifting only the length of the valid matching character string just detected from the start position of the partial character string of the search character string and the search start position in the document, and by shifting the start position of the valid matching character string within L characters, L being a predetermined integer of 1 or more, from the start position of the valid matching character string just detected, and if a response for detecting a valid matching character string does not occur, by shifting one character from the start position of the partial character string of the search character string and then taking out a partial character string of M characters or more (M being a predetermined integer of 2 or more) and searching for the valid matching character string;
(d) means for continuing said step (d) as long as said valid matching string is detected;
(e) and a means for calculating the similarity between the search string and the character string from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document based on information present in the valid matching character string at least in the character string from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document.
22. An information retrieval system for detecting, by computer processing, a position where a retrieval character string appears in a document stored in a retrievable manner, the system comprising:
(a) means for inputting a search string;
(b) means for inputting the similarity;
(c) a device for searching for a matching start position and end position in the above document, wherein a partial character string having a length of M characters or more to a predetermined integer of 2 or more is taken as a search character string, and a partial character string having a length of M characters or more determined by the start position and the end position is referred to as an effective matching character string;
(d) a means for searching for a valid matching character string by applying the means (c) from the beginning of the search character string and by shifting only the length of the valid matching character string just detected from the start position of the partial character string of the search character string and the search start position in the document based on the occurrence of a response to detect a valid matching character string, the valid matching character string having a distance of L characters or less from the start position of the valid matching character string just detected and L being a predetermined integer of 1 or more, and if no response to detect a valid matching character string has occurred, the valid matching character string being searched by shifting one character from the start position of the partial character string of the search character string and then taking out a partial character string of M characters or more (M being a predetermined integer of 2 or more);
(e) means for continuing said step (d) as long as said valid matching string is detected;
(f) means for calculating a similarity between a character string at least between a start position of a first valid matching character string of the document and an end position of a last valid matching character string of the document and the search character string, based on information present in the valid matching character string;
(g) and (c) means for displaying the contents of the valid matching character strings included in the document if the calculated similarity is greater than the similarity inputted in the step (b).
23. The information retrieval system as recited in claim 22, wherein: m is 2 and L is 3.
24. The information retrieval system as recited in claim 22, wherein: the calculation of the similarity takes a small value among the proportion of valid matching character strings in the search character string and the proportion of valid matching character strings in the character strings from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document.
25. The information retrieval system as recited in claim 22, wherein: for each character in the character string from the start position of the first valid matching character string of the above-mentioned document to the end position of the last valid matching character string of the above-mentioned document, a score is added when it belongs to the valid matching character string, and a score is subtracted when it does not belong to the valid matching character string, and the calculation of the above-mentioned similarity uses a value obtained by dividing the result score by the value of the perfect matching score.
26. The information retrieval system as recited in claim 25, wherein: the above-mentioned addition is divided into a character 1, and the above-mentioned subtraction is divided into a character 1/L.
27. An information retrieval system for detecting, by computer processing, a position where a retrieval character string appears in a database of a plurality of documents stored in a retrievable manner, the system comprising:
(a) means for inputting a search string;
(b) means for inputting similarities;
(c) a device for searching for a matching start position and end position in the above document, wherein a partial character string having a length of M characters or more to a predetermined integer having M characters or more to 2 or more is taken from the beginning of the search character string, and a partial character string having a length of M characters or more determined from the start position and the end position is referred to as an effective matching character string;
(d) a means for searching for a valid matching character string by using the means (c) from the beginning of the search character string if a response for detecting a valid matching character string occurs, and by shifting only the length of the valid matching character string just detected from the start position of the partial character string of the search character string and the search start position in the same document, and searching for a valid matching character string whose start position is within L characters of the distance from the valid matching character string just detected-L is a predetermined integer of 1 or more, and if a response for detecting a valid matching character string does not occur, by shifting one character from the start position of the partial character string of the search character string, and then retrieving a partial character string of M characters or more and having a length of M being a predetermined integer of 2 or more;
(e) means for continuing said step (d) as long as said valid matching string is detected;
(f) means for calculating a similarity between a character string at least between a start position of a first valid matching character string of the same document and an end position of a last valid matching character string of the same document and the search character string, based on information present in the valid matching character string, from the start position of the first valid matching character string of the same document to the end position of the last valid matching character string of the same document;
(g) and (c) means for displaying the contents of the valid matching character strings included in the document if the similarity calculated as described above is greater than the similarity inputted in the step (b).
28. The information retrieval system as recited in claim 27, wherein: m is 2 and L is 3.
29. The information retrieval system as recited in claim 27, wherein: the calculation of the similarity takes a small value among the proportion of the valid matching character strings in the search character string and the proportion of the valid matching character strings in the character strings between the start position of the first valid matching character string of the document and the end position of the last valid matching character string of the document.
30. The information retrieval system as recited in claim 27, wherein: for each character in the character string from the start position of the first valid matching character string of the above-mentioned document to the end position of the last valid matching character string of the above-mentioned document, a score is added when it belongs to the valid matching character string, and not a score is subtracted when it belongs to the valid matching character string, and the calculation of the above-mentioned similarity uses a value obtained by dividing the result score by the value of the full matching score.
31. The information retrieval system as recited in claim 30, wherein: the above-mentioned addition is divided into a character 1, and the above-mentioned subtraction is divided into a character 1/L.
32. A method of making an index file with which glyphs of length N characters can be retrieved at high speed in documents stored in a retrievable manner by computer processing, the method comprising the steps of:
(a) sequentially scanning the document and writing the font of any N characters appearing in the document and the information of the appearance position of the font of the N characters in the document into a storage area;
(b) a step of, if a response that all the documents have been scanned has occurred in the step (a), sorting information written in the storage area by the fonts, and adding the position information corresponding to the sorted information to the different fonts;
(c) and a step of creating and outputting a file using the font as a key so as to search the added position information corresponding to the font.
33. The index documentation method of claim 32 wherein: the search in the step (c) is a binary search.
34. The information retrieval method according to claim 1 or claim 5, characterized in that: in the above document, search for M characters in a glyph is performed using an index file created by the method of claim 33 and using position information detected from the index file.
35. A method of making an index file with which a font of N characters in length can be retrieved at high speed in a database of a plurality of documents stored in a retrievable manner by computer processing, the method comprising the steps of:
(a) a step of adding a symbol or a number for identifying each of the plurality of documents individually;
(b) sequentially scanning the database and writing information on the fonts of arbitrary N characters appearing in each document of the database and the positions where the fonts of the N characters appear in the document into a storage area corresponding to the attached document identification symbol or number;
(c) if a response that all the documents in the database have been scanned is generated in step (b), sorting the information written in the storage area by the font, and adding the document identification number or serial number corresponding to the sorted information and the position information in the document to the different fonts;
(d) and a step of creating and outputting a file using the font as a key so as to search the attached document identification symbol or number and the position information corresponding to the font.
36. The index documentation method of claim 35 wherein: the search in the step (d) is a binary search.
37. The information retrieval method according to claim 10 or claim 14, characterized in that: in the above document, search for M characters in a glyph is performed using an index file created by the method according to claim 36, and using position information detected from the index file.
38. The index documentation method of claim 37 wherein: both M and N are 2, and L is 3 or more.
39. A method of indexing a document, using which a glyph of length N characters can be retrieved at high speed in a document stored in a retrievable manner by computer processing, the method comprising the steps of:
(a) sequentially scanning the documents, writing a pre-specified separator appearing in the documents and information on the appearance position of the separator in the documents into a storage area, and simultaneously writing the font of any N characters appearing in the documents continuously and information on the appearance position of the font of the N characters in the documents into the storage area;
(b) a step of, if a response that all the documents have been scanned has occurred in step (a), sorting information written in the storage area in accordance with the font, and adding the position information corresponding to the sorted information to the different fonts;
(c) and a step of creating and outputting a file using the font as a key so as to search the added position information corresponding to the font.
40. The information retrieval method according to claim 1 or claim 5, characterized in that: in the above document, search for M characters in a zigzag form is performed using an index file created by the method according to claim 39, and using position information detected from the index file.
41. The information retrieval method as recited in claim 40, wherein: and a step of searching for the separator in the document and adding position information detected when searching for M characters in a font in the document to a corresponding position.
42. A system for creating an index file, with which a font having a length of N characters can be retrieved at high speed in a document stored in a retrievable manner by computer processing, comprising:
(a) means for sequentially scanning the document and writing information on the position where the N character patterns appear in the document and the font pattern of any N characters appearing in the document consecutively in the document into a storage area;
(b) means for sorting information written in the storage area in accordance with the font when a response that all the documents have been scanned appears in the means (a), and adding the position information corresponding to the sorted information to the different fonts;
(c) and a device for generating and outputting a file by using the font as a key so as to search the added position information corresponding to the font.
43. A system for creating an index file, with which a font having a length of N characters can be retrieved at high speed in a database of a plurality of documents stored in a retrievable manner by computer processing, comprising:
(a) attaching a symbol or number to each of the plurality of documents for individually identifying the document;
(b) means for sequentially scanning the database and writing information on the fonts of arbitrary N characters appearing consecutively in each document of the database and the positions where the fonts of the N characters appear in the document into a storage area corresponding to the attached document identification symbol or number;
(c) means for sorting information written in the storage area in accordance with the font if a response that all documents in the database have been scanned is generated in the means (b), and adding the document identification number or serial number corresponding to the sorted information and position information in the document to the different font;
(d) and a device for generating and outputting a file by using the font as a key so as to search the attached document identification symbol or number and the position information corresponding to the font.
44. An index documentation system according to claim 43 wherein: all of the above N are 2.
45. A system for creating an index file, with which a font having a length of N characters can be retrieved at high speed in a document stored in a retrievable manner by computer processing, comprising:
(a) means for sequentially scanning the document, writing a previously specified delimiter appearing in the document and information on the position where the delimiter appears in the document into a storage area, and writing a font of arbitrary N characters appearing in the document and information on the position where the font of the N characters appears in the document into the storage area;
(b) means for sorting the information written in the storage area in accordance with the font when a response that all the documents have been scanned appears in the means (a), and adding the position information corresponding to the sorted information to the different fonts;
(c) and a device for generating and outputting a file by using the font as a key so as to search the added position information corresponding to the font.
46. A system for creating an index file, with which a font having a length of N characters can be retrieved at high speed in a database of a plurality of documents stored in a retrievable manner by computer processing, comprising:
(a) means for attaching a symbol or number to each of the plurality of documents for individual identification thereof;
(b) means for sequentially scanning the database, writing a previously designated delimiter appearing in each document of the database and information on the position of the delimiter appearing in the document into a corresponding storage area to which the document identification symbol or number is appended, and writing information on the position of the character pattern of any N characters and the position of the character pattern of the N characters appearing in each document of the database into the corresponding storage area to which the document identification symbol or number is appended;
(c) means for sorting information written in the storage area in accordance with the font when a response that all documents in the database have been scanned is generated in the means (b), and adding the document identification symbol or number and position information in the document corresponding to the sorted information in each of the different fonts;
(d) and a device for generating and outputting a file by using the font as a key so as to search the attached document identification symbol or number and the position information corresponding to the font.
47. An information retrieval method for enabling, by computer processing, the detection of a position where a retrieval character string appears in a database of a plurality of documents stored in a retrievable manner, the method comprising the steps of:
(a) inputting a search string;
(b) inputting similarity;
(c) a step of searching for a start position and an end position that match a partial character string having a length of at least M characters and a predetermined integer having M of 2 or more from the beginning of the search character string in the same document in the database, wherein the partial character string having a length of at least M characters and determined by the start position and the end position is referred to as an effective matching character string;
(d) get
The starting position of the ith valid coincident string in the literature is s (D, i)
The end position of the ith valid consistent character string in the literature is e (D, i)
The starting position of the ith valid consistent character string in the character string to be searched is s (C, i)
The end position of the ith effective consistent character string in the character string to be searched is e (C, i)
A step of retrieving the i +1 th valid identical character string satisfying the following two conditions:
e(D,i)+1≤s(D,i+1)≤e(D,i)+L+1
and is
s(C,i+1)>e(C,i)-(M-1)
In the above formula, L is a predetermined integer of 1 or more,
(e) continuing the step (d) as long as the valid matching character string is detected;
(f) and calculating a similarity between the search string and the character string from the start position of the first valid matching character string of the same document to the end position of the last valid matching character string of the same document based on information present in the valid matching character string at least in the character string from the start position of the first valid matching character string of the same document to the end position of the last valid matching character string of the same document.
(g) If the similarity obtained by the above calculation is greater than the similarity input in the above step (b), the contents of the above valid matching character strings contained therein are displayed in the above document.
48. The information retrieval method as recited in claim 47, wherein: m is 2 and L is 3 or more.
49. The information retrieval method as recited in claim 47, wherein: the calculation of the similarity takes a small value among the proportion of valid matching character strings in the search character string and the proportion of valid matching character strings in the character strings from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document.
50. The information retrieval method as recited in claim 47, wherein: the similarity calculation takes a small value between the proportion of the search string in the valid matching strings and the proportion of the valid matching strings between the start position of the first valid matching string of the document and the end position of the last valid matching string of the document.
51. An information retrieval method for detecting, by computer processing, where a retrieval string appears in a database of a plurality of documents stored in a retrievable manner, the method comprising the steps of:
(a) inputting a search string;
(b) inputting similarity;
(c) a step of extracting a partial character string having a length of M characters or more and M being a predetermined integer of 2 or more from the beginning of the search character string and detecting a start position and an end position matching the partial character string in the same document in the database, wherein the partial character string having a length of M characters or more determined by the start position and the end position is referred to as an effective matching character string …;
(d) get
The starting position of the ith valid coincident string in the literature is s (D, i)
The end position of the ith valid consistent character string in the literature is e (D, i)
The starting position of the ith valid matching character string in the character string to be searched is s (C, i)
The end position of the ith valid matching character string in the character string to be searched is e (C, i)
A step of retrieving the i +1 th valid coincident character string satisfying the following conditions:
s(C,i+1)>e(C,i)-(M-1)
s(D,i+1)>e(D,i)
and is
s(D,i+1)-e(D,i)-1+max(e(C,i)-s(C,i+1)
+1.0)≤L
(in the above formula, L is a predetermined integer of 1 or more)
(e) Continuing the step (d) as long as the valid matching character string is detected;
(f) and calculating a similarity between the search string and the character string from the start position of the first valid matching character string of the same document to the end position of the last valid matching character string of the same document based on information present in the valid matching character string at least in the character string from the start position of the first valid matching character string of the same document to the end position of the last valid matching character string of the same document.
(g) Displaying the contents of the valid matching character strings contained in the documents if the calculated similarity is greater than the similarity inputted in the step (b).
52. The information retrieval method as recited in claim 51, wherein: m is 2, and L is 3 or more.
53. The information retrieval method as recited in claim 51, wherein: the calculation of the similarity takes a small value among the proportion of valid matching character strings in the search character strings and the proportion of valid matching character strings in the character strings between the start position of the first valid matching character string of the document and the end position of the last valid matching character string of the document.
54. The information retrieval method as recited in claim 51, wherein: for each character in the character string from the start position of the first valid matching character string of the above-mentioned document to the end position of the last valid matching character string of the above-mentioned document, a score is added when it belongs to the valid matching character string, and not a score is subtracted when it belongs to the valid matching character string, and the calculation of the above-mentioned similarity uses a value obtained by dividing the result score by the value of the full matching score.
55. The information retrieval method as recited in claim 51, wherein: the above-mentioned addition is divided into a character 1, and the above-mentioned subtraction is divided into a character 1/L.
56. And subtracting 1/(2L) for the characters which belong to the ith valid consistent character string and the corresponding search character string characters belong to the (i-1) th valid consistent character string.
CN 95118142 1994-11-22 1995-11-01 Information searching method and system Pending CN1151558A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP287642/94 1994-11-22
JP6287642A JP2669601B2 (en) 1994-11-22 1994-11-22 Information retrieval method and system

Publications (1)

Publication Number Publication Date
CN1151558A true CN1151558A (en) 1997-06-11

Family

ID=17719873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 95118142 Pending CN1151558A (en) 1994-11-22 1995-11-01 Information searching method and system

Country Status (3)

Country Link
JP (1) JP2669601B2 (en)
KR (1) KR960018993A (en)
CN (1) CN1151558A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100357946C (en) * 2004-06-09 2007-12-26 金宝电子(上海)有限公司 Electronic device and method for quickly comparing and searching character strings
WO2008098495A1 (en) * 2007-02-14 2008-08-21 Jie Bai Method and device for determing object file
CN100424676C (en) * 1999-07-01 2008-10-08 株式会社日立制作所 Geographical name presentation method, method and apparatus for geographical name string identification
US7831241B2 (en) 2004-12-22 2010-11-09 Research In Motion Limited Entering contacts in a communication message on a mobile device
CN101517363B (en) * 2006-08-18 2012-09-26 谷歌公司 Providing routing information based on ambiguous locations
CN101939743B (en) * 2007-12-24 2013-10-16 高通股份有限公司 Apparatus and methods for retrieving/downloading content on a communication device
CN103425629A (en) * 2012-05-24 2013-12-04 富士通株式会社 Generation apparatus, generation method, searching apparatus, and searching method
CN104090875A (en) * 2013-04-01 2014-10-08 鸿富锦精密工业(深圳)有限公司 Information retrieval system and information retrieval method
CN113971702A (en) * 2021-10-29 2022-01-25 深圳市道通科技股份有限公司 Picture compression method, decompression method and electronic equipment

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3275816B2 (en) * 1998-01-14 2002-04-22 日本電気株式会社 Symbol string search method, symbol string search device, and recording medium recording symbol string search program
JP4042580B2 (en) * 2003-01-28 2008-02-06 ヤマハ株式会社 Terminal device for speech synthesis using pronunciation description language
CN1645374A (en) * 2005-01-17 2005-07-27 徐文新 Digit marking character string searching technology
JP4893624B2 (en) 2005-09-02 2012-03-07 日本電気株式会社 Data clustering apparatus, clustering method, and clustering program
JP5900367B2 (en) * 2013-01-30 2016-04-06 カシオ計算機株式会社 SEARCH DEVICE, SEARCH METHOD, AND PROGRAM
CN105550175B (en) * 2014-10-28 2019-03-01 阿里巴巴集团控股有限公司 The recognition methods of malice account and device
CN108133016A (en) * 2017-12-22 2018-06-08 大连景竣科技有限公司 System and method for locating office documents
CN112733524A (en) * 2020-12-31 2021-04-30 浙江省方大标准信息有限公司 Method, system and device for automatically correcting standard serial numbers and batch checking standard states

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3099298B2 (en) * 1991-03-20 2000-10-16 株式会社日立製作所 Document search method and apparatus
JPH07109603B2 (en) * 1990-12-12 1995-11-22 株式会社テレマティーク国際研究所 Information retrieval processing method and retrieval file creation device

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100424676C (en) * 1999-07-01 2008-10-08 株式会社日立制作所 Geographical name presentation method, method and apparatus for geographical name string identification
CN100357946C (en) * 2004-06-09 2007-12-26 金宝电子(上海)有限公司 Electronic device and method for quickly comparing and searching character strings
US7831241B2 (en) 2004-12-22 2010-11-09 Research In Motion Limited Entering contacts in a communication message on a mobile device
CN101116357B (en) * 2004-12-22 2012-12-12 捷讯研究有限公司 Entering contacts in a communication message on a mobile device
US8675845B2 (en) 2004-12-22 2014-03-18 Blackberry Limited Entering contacts in a communication message on a mobile device
CN101517363B (en) * 2006-08-18 2012-09-26 谷歌公司 Providing routing information based on ambiguous locations
WO2008098495A1 (en) * 2007-02-14 2008-08-21 Jie Bai Method and device for determing object file
CN101939743B (en) * 2007-12-24 2013-10-16 高通股份有限公司 Apparatus and methods for retrieving/downloading content on a communication device
CN103425629A (en) * 2012-05-24 2013-12-04 富士通株式会社 Generation apparatus, generation method, searching apparatus, and searching method
CN103425629B (en) * 2012-05-24 2017-05-03 富士通株式会社 Generation apparatus, generation method, searching apparatus, and searching method
CN104090875A (en) * 2013-04-01 2014-10-08 鸿富锦精密工业(深圳)有限公司 Information retrieval system and information retrieval method
CN113971702A (en) * 2021-10-29 2022-01-25 深圳市道通科技股份有限公司 Picture compression method, decompression method and electronic equipment

Also Published As

Publication number Publication date
JPH08147320A (en) 1996-06-07
JP2669601B2 (en) 1997-10-29
KR960018993A (en) 1996-06-17

Similar Documents

Publication Publication Date Title
CN1109994C (en) Document processor and recording medium
CN1171162C (en) Apparatus and method for retrieving character strings based on character classification
CN1155906C (en) data processing method, system, processing program and recording medium
CN1215433C (en) Online character identifying device, method and program and computer readable recording media
CN1151558A (en) Information searching method and system
CN1170240C (en) Structured document retrieval display method and device
CN1194319C (en) Method for retrieving, listing and sorting table-formatted data, and recording medium recorded retrieving, listing or sorting program
CN1151456C (en) Feature character sequence extraction and similar document retrieval method and device
CN1101032C (en) Related term extraction apparatus, related term extraction method, and computer-readable recording medium having related term extration program recorded thereon
CN1158627C (en) Method and device for character recognition
CN1855103A (en) System and methods for dedicated element and character string vector generation
CN1281191A (en) Information retrieval method and information retrieval device
CN1552032A (en) Database
CN1315020A (en) Method and apparatus for free-form data processing
CN1053852A (en) Name Determination in Directory Database
CN1578954A (en) Machine translation
CN1132564A (en) Method and apparatus for data storage and retrieval
CN1331449A (en) Method and relative system for dividing or separating text or decument into sectional word by process of adherence
CN1608259A (en) Machine translation
CN1046625A (en) The technology of in the structural formula file, making, expanding and shrinking the constituent element mark
CN1310173C (en) Table format data presenting method, inserting method, deleting method, and updating method
CN1869989A (en) System and method for generating structured representation from structured description
CN1752963A (en) Document information processing apparatus, document information processing method, and document information processing program
CN1942877A (en) Information extraction system
CN1647069A (en) Conversation control system and conversation control method

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication