CN102479187B

CN102479187B - GBK character inquiry system based on even-odd check and its implementation

Info

Publication number: CN102479187B
Application number: CN201010555486.7A
Authority: CN
Inventors: 陈运文
Original assignee: Shengle Information Technology Shanghai Co Ltd
Current assignee: Shanghai Zongzhang Technology Group Co.,Ltd.
Priority date: 2010-11-23
Filing date: 2010-11-23
Publication date: 2016-09-14
Anticipated expiration: 2030-11-23
Also published as: CN102479187A

Abstract

The invention discloses a GBK character query system based on parity check, comprising a character basic query module, a GBK parity check module, and a Chinese character boundary verification module. The character basic query module reads a GBK-encoded string and a query string, performs a basic query to find the positions in the GBK-encoded string where the query string occurs, and records the start and end position of each occurrence. The GBK parity check module performs an encoding check on the basic query results obtained by the character basic query module, using a bidirectional (forward and backward) parity check. The Chinese character boundary verification module uses the parity results of the GBK parity check module to judge whether the start position and the end position lie on the boundaries of GBK double-byte Chinese characters. The invention also discloses a method for implementing the system. The invention can locate and determine Chinese character boundaries, and thereby solves the query mismatches that easily arise when the high and low bytes of adjacent Chinese characters overlap.

Description

GBK Character Query System Based on Parity Check and Its Realization Method

Technical Field

The present invention relates to a Chinese character query system, in particular to a GBK character query system based on parity check; the invention also relates to a method for implementing that system.

Background

1. The Chinese character boundary problem of GBK encoding

GBK encoding is one of the most common ways of representing Chinese characters in computers. Its full name is the "Chinese Character Internal Code Extension Specification" (GBK); its English name is Chinese Internal Code Specification. It was formulated by the National Information Technology Standardization Technical Committee of the People's Republic of China on December 1, 1995. On December 15, 1995, the Standardization Department of the State Bureau of Technical Supervision and the Science, Technology and Quality Supervision Department of the Ministry of Electronics Industry jointly designated it a technical guidance document, in the form of Technical Supervision Standard Letter (1995) No. 229, and issued and implemented it. GBK is an internal code extension of the GB2312-80 standard. It uses a double-byte encoding scheme whose code range runs from 8140 to FEFE (excluding xx7F), for a total of 23,940 code points, and it contains 21,003 Chinese characters. It is fully compatible with GB2312-80, supports all the Chinese, Japanese and Korean ideographs in the international standard ISO/IEC 10646-1 and the national standard GB 13000-1, and includes all the Chinese characters in the BIG5 encoding. The high byte of a GBK code lies in the range 0x81-0xFE, and the low byte lies in the ranges 0x40-0x7E and 0x80-0xFE.
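The byte ranges quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `is_gbk_lead` and `is_gbk_trail` are hypothetical and do not appear in the patent. It confirms that the stated ranges yield the stated 23,940 code points, and that part of the low-byte range has the top bit clear and so coincides with ASCII values.

```python
def is_gbk_lead(b: int) -> bool:
    """True if b lies in the high-byte range 0x81-0xFE (hypothetical helper)."""
    return 0x81 <= b <= 0xFE


def is_gbk_trail(b: int) -> bool:
    """True if b lies in the low-byte ranges 0x40-0x7E or 0x80-0xFE."""
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFE


lead_count = sum(is_gbk_lead(b) for b in range(256))    # 126 high-byte values
trail_count = sum(is_gbk_trail(b) for b in range(256))  # 190 low-byte values
code_points = lead_count * trail_count                  # 126 * 190 = 23940

# Low bytes in 0x40-0x7E have the top bit clear, so they share values
# with printable ASCII characters such as '@' (0x40).
ascii_like_trails = [b for b in range(256) if is_gbk_trail(b) and b < 0x80]

print(code_points)             # 23940
print(len(ascii_like_trails))  # 63
```

The 63 low-byte values below 0x80 are the ones that make a plain byte search for an ASCII character unreliable in GBK text.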

GBK encoding can run into problems when determining Chinese character boundaries, mainly in two respects:

1. In GBK, some byte values overlap with ASCII codes, which causes false hits when searching for ASCII characters;

2. The double-byte encoding has no flag bits, so it is difficult to locate the start byte and end byte of an individual Chinese character; as a result, the character boundary cannot be determined when querying for a single Chinese character.

Therefore, a self-matching system for GBK encoding is needed that can locate and determine Chinese character boundaries and so solve the query mismatch problems described above.

2. Problems caused by unclear GBK Chinese character boundaries

In the GBK encoding system, Chinese characters use a double-byte encoding, that is, each Chinese character is represented by 2 bytes. For example, to a computer the eight-character phrase "靳怀堾水文化文集" is a byte string, written in hexadecimal as "BD F9 BB B3 89 40 CB AE CE C4 BB AF CE C4 BC AF".

Because GBK lacks flag bits, it is difficult for a computer to tell where the Chinese character boundaries fall in such a string. As a consequence, character searches can produce misaligned matches.

For example, in the string above the GBK code of the Chinese character "堾" is [89,40], and the code 40 also represents the ASCII symbol "@". Therefore, a search for the symbol '@' reports that it occurs in the string "靳怀堾水文化文集", which is not actually the case.
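The false hit can be reproduced with a plain byte search. The sketch below builds the text from the hexadecimal bytes quoted above rather than relying on any codec, and uses 0-based byte offsets (an assumption of this example, not something the patent specifies).

```python
# Bytes of "靳怀堾水文化文集" exactly as listed in the text above.
text = bytes.fromhex("BD F9 BB B3 89 40 CB AE CE C4 BB AF CE C4 BC AF")

# A naive byte-level search for ASCII '@' (0x40) succeeds ...
pos = text.find(b"@")
print(pos)                  # 5

# ... but the hit is the low byte of the two-byte code [89,40]:
# the byte just before it has its top bit set, i.e. it is a high byte.
print(hex(text[pos - 1]))   # 0x89
```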

Besides the problem with ASCII character searches, searches for Chinese characters are affected as well, because the unclear boundaries of GBK Chinese character codes also lead to misaligned Chinese characters.

For example, the Chinese character string "大学生健康教育" ("Health Education for College Students") has the GBK encoding "B4 F3 D1 A7 C9 FA BD A1 BF B5 BD CC D3 FD". A search for the Chinese character "笱" in this string reports a match. The reason is that the code of "笱" is [F3,D1], which is exactly the low byte F3 of the character "大" followed by the high byte D1 of the character "学".
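The same effect can be shown for this second example. The sketch below again works on the raw bytes quoted in the text (0-based offsets are an assumption of the example). Because this particular string consists only of double-byte characters, splitting it into 2-byte pairs gives the true character sequence for comparison.

```python
# Bytes of "大学生健康教育" and of the query character "笱" as given above.
text = bytes.fromhex("B4 F3 D1 A7 C9 FA BD A1 BF B5 BD CC D3 FD")
query = bytes.fromhex("F3 D1")

# Byte-level search: reports a match at offset 1, straddling two characters.
pos = text.find(query)
print(pos)                # 1

# Character-level view: this text holds only 2-byte characters, so the
# real characters are the aligned pairs, and the query is not among them.
pairs = [text[i:i + 2] for i in range(0, len(text), 2)]
print(query in pairs)     # False
```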

Summary of the Invention

The technical problem to be solved by the present invention is to provide a GBK character query system based on parity check that can locate and determine Chinese character boundaries, and thereby solve the query mismatches that easily arise when the high and low bytes of adjacent Chinese characters overlap. To this end, the present invention also provides a method for implementing the parity-check-based GBK character query system.

To solve the above technical problem, the present invention provides a GBK character query system based on parity check, comprising three modules: a character basic query module, a GBK parity check module, and a Chinese character boundary verification module.

The character basic query module performs a basic query of the query string against the GBK-encoded string that has been read, finds the positions in the GBK-encoded string where the query string occurs, and records their start and end positions.

The GBK parity check module performs an encoding check on the basic query results obtained by the character basic query module: first it examines the bytes one by one forward from the start position (toward the beginning of the string) to judge the parity of the step counter count, and then it examines the bytes one by one backward from the end position (toward the end of the string) to judge the parity of the step counter count.

The Chinese character boundary verification module uses the parity results of the GBK parity check module to judge whether the start position and the end position lie on the boundaries of GBK double-byte Chinese characters.

The character basic query module scans the GBK-encoded string from left to right, finds each position where the query string occurs, and records its start position as headpos_k and its end position as tailpos_k, where k denotes the k-th match found by the basic query in the current GBK-encoded string.

In the GBK parity check module, examining the bytes forward from the start position to judge the parity of the step counter count proceeds as follows: first, count is initialized to 0; then, moving forward from headpos_k, the bytes T_i are examined in turn, where i = headpos_k-1, headpos_k-2, ..., 0; for each byte, it is computed whether T_i&0x80 is 0, and processing continues according to that value; finally, the parity of count is judged: if count is odd, head_valid = 0 is recorded, and if count is even, head_valid = 1 is recorded. Here the symbol & denotes the bitwise Boolean AND operation, and head_valid is the Boolean value for the start position of the query string.

In the GBK parity check module, examining the bytes backward from the end position to judge the parity of the step counter count proceeds as follows: first, count is initialized to 0; then, moving backward from tailpos_k, the bytes T_i are examined in turn, where i = tailpos_k+1, tailpos_k+2, ...; for each byte, it is computed whether T_i&0x80 is 0, and processing continues according to that value; finally, the parity of count is judged: if count is odd, tail_valid = 0 is recorded, and if count is even, tail_valid = 1 is recorded. Here the symbol & denotes the bitwise Boolean AND operation, and tail_valid is the Boolean value for the end position of the query string.

The Chinese character boundary verification module works as follows: if head_valid = 1 and tail_valid = 1, the current GBK character match is valid, and the match information obtained from this query, headpos_k and tailpos_k, is recorded; if head_valid = 0 or tail_valid = 0, the current GBK character match is invalid, so the match is ignored, k is set to k+1, and control returns to the GBK parity check module to verify the next query position. The query ends, and the query results are output, once k is the last match produced by the character basic query module.

In addition, the present invention provides a method for implementing a GBK character query system based on parity check, comprising the following steps:

(1) Read the GBK-encoded string and load it into a given string, denoted text_str;

(2) Read the query string, denoted query_str;

(3) Perform a basic query of query_str against text_str by scanning text_str from left to right, finding each position where query_str occurs, and recording its start position as headpos_k and its end position as tailpos_k, where k denotes the k-th match found by the basic query in the current string text_str;

(4) After obtaining the basic query results headpos_k and tailpos_k from step (3), perform an encoding check, specifically: Step A: moving forward from headpos_k, examine the bytes T_i in turn to judge the parity of the step counter count, where i = headpos_k-1, headpos_k-2, ..., 0; Step B: moving backward from tailpos_k, examine the bytes T_i in turn to judge the parity of the step counter count, where i = tailpos_k+1, tailpos_k+2, ...;

(5) Based on the parity results of step (4), judge whether headpos_k and tailpos_k lie on the boundaries of GBK double-byte Chinese characters;

(6) Output the GBK character query results.

In step (4), Step A is specifically: 1) initialize the step counter count to 0; 2) moving forward from headpos_k, examine the bytes T_i in turn, where i = headpos_k-1, headpos_k-2, ..., 0; 3) compute whether T_i&0x80 is 0; 4) depending on the value of T_i&0x80, apply the following logic: if T_i&0x80 equals 0, record head_valid = 1, terminate the remainder of this step, and go directly to Step B; if T_i&0x80 does not equal 0, set count = count+1 and then, depending on i, apply one of two sub-cases: if i = 0, terminate this step and output the value of count; if i > 0, set T_i to T_(i-1) and repeat the operations of Step A, that is, continue examining the string in the forward direction; 5) judge the parity of the step counter count: if count is odd, record head_valid = 0; if count is even, record head_valid = 1. Here the symbol & denotes the bitwise Boolean AND operation, and head_valid is the Boolean value for the start position of the query string.

In step (4), Step B is specifically: 1) initialize the step counter count to 0; 2) moving backward from tailpos_k, examine the bytes T_i in turn, where i = tailpos_k+1, tailpos_k+2, ...; 3) compute whether T_i&0x80 is 0; 4) depending on the value of T_i&0x80, apply the following logic: if T_i&0x80 equals 0, record tail_valid = 1 and terminate the remainder of this step; if T_i&0x80 does not equal 0, set count = count+1 and then, depending on i, apply one of two sub-cases: if i equals the length of the current string text_str, terminate this step and output the value of count; if i is less than the length of the current string text_str, set T_i to T_(i+1) and repeat the operations of Step B, that is, continue examining the string in the backward direction; 5) judge the parity of the step counter count: if count is odd, record tail_valid = 0; if count is even, record tail_valid = 1. Here the symbol & denotes the bitwise Boolean AND operation, and tail_valid is the Boolean value for the end position of the query string.

Step (5) is specifically: if head_valid = 1 and tail_valid = 1, the current GBK character match is valid, and the match information obtained from this query, headpos_k and tailpos_k, is recorded; if head_valid = 0 or tail_valid = 0, the current GBK character match is invalid, so the match is ignored, k is set to k+1, and control returns to step (4) to verify the next query position.

Step (6) is specifically: if k is already the last match from step (3), all verification work for this query ends, and all successfully matched headpos_k, tailpos_k character query results are output; if k is not the last match from step (3), k is set to k+1 and control returns to step (4) to verify the next query position.

The beneficial effect of the present invention is as follows. For GBK-encoded strings, character queries easily produce mismatches because the high and low bytes of adjacent Chinese characters overlap. To address this, the invention proposes a self-matching GBK character query system based on parity check. The system consists of three parts: a character basic query module, a GBK parity check module, and a Chinese character boundary verification module. It exploits the double-byte encoding of Chinese characters and uses a bidirectional (forward and backward) parity check to verify the character matches one by one. Practice has shown that the system handles GBK character queries well: it rejects matches that straddle high and low bytes while retaining legitimate matches.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of the module structure and processing flow of the system of the present invention.

Detailed Description

As shown in Fig. 1, the present invention proposes a self-matching GBK character query system based on parity check, comprising three modules: a character basic query module, a GBK parity check module, and a Chinese character boundary verification module. Working together, these three modules handle the misaligned character matches encountered when searching strings in GBK encoding.

An embodiment is given below to describe in detail the method for implementing the parity-check-based GBK character query system, which comprises the following steps (see Fig. 1):

1. The system first reads the GBK-encoded string (that is, the GBK text to be matched) and loads it into a given string, denoted here text_str. Taking the byte as the unit, the bytes of this string are named in order: T_0, T_1, T_2, T_3, T_4, ...

2. Read the query string, denoted here query_str, and likewise label it byte by byte: Q_0, Q_1, Q_2, Q_3, Q_4, ...

3. Character basic query module:

Perform a basic query of query_str against text_str by scanning text_str from left to right, finding each position where query_str occurs, and recording its start position as headpos_k and its end position as tailpos_k. Here k denotes the k-th match found by the basic query in the current string text_str, so the first match gives headpos_1 and tailpos_1, the second gives headpos_2 and tailpos_2, and so on. In other words, the substring of text_str from T_headpos_k to T_tailpos_k is identical to query_str.
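The basic query can be sketched in Python as a plain left-to-right byte search. The function name `basic_query`, the use of `bytes` objects, 0-based offsets, and the choice to report overlapping occurrences are all assumptions of this sketch; the patent only specifies that each match is recorded as an inclusive pair (headpos_k, tailpos_k).

```python
from typing import List, Tuple


def basic_query(text_str: bytes, query_str: bytes) -> List[Tuple[int, int]]:
    """Return (headpos, tailpos) for every byte-level occurrence of query_str.

    Positions are 0-based and tailpos is inclusive, so that
    text_str[headpos:tailpos + 1] == query_str for each result.
    """
    matches: List[Tuple[int, int]] = []
    if not query_str:                  # an empty query has no meaningful match
        return matches
    start = text_str.find(query_str)
    while start != -1:
        matches.append((start, start + len(query_str) - 1))
        start = text_str.find(query_str, start + 1)   # keep scanning rightward
    return matches


# Bytes of the "大学生健康教育" example from the Background section.
text = bytes.fromhex("B4 F3 D1 A7 C9 FA BD A1 BF B5 BD CC D3 FD")
print(basic_query(text, bytes.fromhex("F3 D1")))   # [(1, 2)]
print(basic_query(text, bytes.fromhex("BD")))      # [(6, 6), (10, 10)]
```

At this stage the results still include misaligned hits such as (1, 2); filtering them is the job of the two later modules.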

4. GBK parity check module:

After obtaining the basic query results headpos_k and tailpos_k from the character basic query module, this module performs the encoding check. The processing flow is as follows:

Step A:

1) Initialize the step counter count to 0;

2) Moving forward from headpos_k (toward the beginning of the string), examine the bytes T_i in turn (i = headpos_k-1, headpos_k-2, ..., 0);

3) Compute whether T_i&0x80 is 0. (The operand 0x80 is used because, under the GBK encoding rules, a byte value greater than or equal to 0x80 is the mark of the high byte of a legal Chinese character.) The symbol & denotes the bitwise Boolean AND operation, whose rules are 0&0 = 0, 0&1 = 0, 1&0 = 0, 1&1 = 1. A bitwise AND expands each byte into binary and applies AND to each pair of bits. For example, if T_i = 0x9E, its binary form is 10011110, and the binary form of 0x80 is 10000000, so T_i&0x80 = 10011110&10000000 = 10000000;

4) Depending on the value of T_i&0x80, apply the following logic:

- If it equals 0, record head_valid = 1, terminate the remainder of this step, and go directly to Step B below;

- If it does not equal 0, set count = count+1 and then, depending on i, apply one of the following two sub-cases:

  - If i = 0 (the head of the string has been reached), terminate this step and output the value of count;

  - If i > 0, set T_i to T_(i-1) and repeat the operations of Step A, that is, continue examining the string in the forward direction;

5) When the above processing in Step A has completed normally, judge the parity of the step counter count:

- If count is odd, record head_valid = 0;

- If count is even, record head_valid = 1.
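Step A can be transcribed into Python fairly directly. In the sketch below, the function name `check_head`, the use of a `bytes` object, and 0-based offsets are assumptions; the numbered comments refer to items 1) to 5) of Step A. The early return follows the text as written: head_valid = 1 is recorded as soon as a byte with its top bit clear is met.

```python
def check_head(text_str: bytes, headpos: int) -> int:
    """Step A: return head_valid (1 or 0) for a match starting at headpos."""
    count = 0                                   # 1) step counter starts at 0
    for i in range(headpos - 1, -1, -1):        # 2) i = headpos-1, ..., 0
        if (text_str[i] & 0x80) == 0:           # 3) test the top bit of T_i
            return 1                            # 4) top bit clear: head_valid = 1
        count += 1                              # 4) top bit set: count it, go on
    return 1 if count % 2 == 0 else 0           # 5) even count means valid


# The bitwise AND example from item 3): 0x9E & 0x80 leaves only the top bit.
print((0x9E & 0x80) == 0x80)     # True

# Bytes of the "大学生健康教育" example from the Background section.
text = bytes.fromhex("B4 F3 D1 A7 C9 FA BD A1 BF B5 BD CC D3 FD")
print(check_head(text, 1))       # 0: one high-bit byte precedes offset 1
print(check_head(text, 2))       # 1: two high-bit bytes precede offset 2
```

A match that starts at offset 0 has no preceding bytes, so count stays 0 and the head is treated as valid.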

Step B:

1) Initialize the step counter count to 0;

2) Moving backward from tailpos_k (toward the end of the string), examine the bytes T_i in turn (i = tailpos_k+1, tailpos_k+2, ...);

3) Compute whether T_i&0x80 is 0 (in the same way as in Step A);

4) Depending on the value of T_i&0x80, apply the following logic:

- If it equals 0, record tail_valid = 1 and terminate the remainder of this step;

- If it does not equal 0, set count = count+1 and then, depending on i, apply one of the following two sub-cases:

  - If i = text_str.length (i equals the length of the current string text_str, meaning the tail of the string has been reached), terminate this step and output the value of count;

  - If i < text_str.length, set T_i to T_(i+1) and repeat the operations of Step B, that is, continue examining the string in the backward direction;

5) When the above processing in Step B has completed normally, judge the parity of the step counter count:

- If count is odd, record tail_valid = 0;

- If count is even, record tail_valid = 1.
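Step B mirrors Step A in the other direction. The sketch below makes the same assumptions as before (hypothetical function name `check_tail`, `bytes` input, 0-based offsets, inclusive tailpos) and again follows the text literally, recording tail_valid = 1 as soon as a byte with its top bit clear is met. With 0-based offsets the scan simply stops after the last byte, which is how this sketch interprets the end-of-string condition.

```python
def check_tail(text_str: bytes, tailpos: int) -> int:
    """Step B: return tail_valid (1 or 0) for a match ending at tailpos."""
    count = 0                                     # 1) step counter starts at 0
    for i in range(tailpos + 1, len(text_str)):   # 2) i = tailpos+1, tailpos+2, ...
        if (text_str[i] & 0x80) == 0:             # 3) test the top bit of T_i
            return 1                              # 4) top bit clear: tail_valid = 1
        count += 1                                # 4) top bit set: count it, go on
    return 1 if count % 2 == 0 else 0             # 5) even count means valid


# Bytes of the "大学生健康教育" example from the Background section.
text = bytes.fromhex("B4 F3 D1 A7 C9 FA BD A1 BF B5 BD CC D3 FD")
print(check_tail(text, 2))   # 0: eleven high-bit bytes follow offset 2
print(check_tail(text, 3))   # 1: ten high-bit bytes follow offset 3
```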

5. Chinese character boundary verification module:

The results of the GBK parity check module are passed to the Chinese character boundary verification module for processing. Based on the values of head_valid and tail_valid, it judges whether headpos_k and tailpos_k lie on the boundaries of GBK double-byte Chinese characters. The processing logic is:

- If head_valid = 1 and tail_valid = 1, the current GBK character match is considered valid, and the match information obtained from this query, headpos_k and tailpos_k, is recorded;

- If head_valid = 0 or tail_valid = 0, the current GBK character match is considered invalid; the match is ignored, k is set to k+1, and control returns to the GBK parity check module to verify the next query position;

6. Output the GBK character query results

Once the above judgment is complete, verification loops over the remaining query positions in the string. The processing logic is:

- If k is already the last match from the character basic query module, all verification work for this query ends, and all successfully matched headpos_k, tailpos_k information (that is, the GBK character query results) is output.

- If k is not the last match from the character basic query module, k is set to k+1 and control returns to the GBK parity check module to verify the next query position.
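Putting the three modules together, the whole flow of steps 1 to 6 might look like the following sketch. All function names are hypothetical, positions are 0-based with an inclusive tailpos, and the loop over k is expressed as ordinary iteration over the basic query results. The parity helpers are a literal transcription of Steps A and B as worded above, including the early exit that records a valid flag when a byte with its top bit clear is met.

```python
from typing import List, Tuple


def basic_query(text_str: bytes, query_str: bytes) -> List[Tuple[int, int]]:
    """Character basic query module: all byte-level (headpos, tailpos) hits."""
    matches: List[Tuple[int, int]] = []
    start = text_str.find(query_str) if query_str else -1
    while start != -1:
        matches.append((start, start + len(query_str) - 1))
        start = text_str.find(query_str, start + 1)
    return matches


def check_head(text_str: bytes, headpos: int) -> int:
    """Step A of the GBK parity check module."""
    count = 0
    for i in range(headpos - 1, -1, -1):
        if (text_str[i] & 0x80) == 0:
            return 1
        count += 1
    return 1 if count % 2 == 0 else 0


def check_tail(text_str: bytes, tailpos: int) -> int:
    """Step B of the GBK parity check module."""
    count = 0
    for i in range(tailpos + 1, len(text_str)):
        if (text_str[i] & 0x80) == 0:
            return 1
        count += 1
    return 1 if count % 2 == 0 else 0


def gbk_query(text_str: bytes, query_str: bytes) -> List[Tuple[int, int]]:
    """Boundary verification and output: keep only matches with both flags set."""
    results: List[Tuple[int, int]] = []
    for headpos, tailpos in basic_query(text_str, query_str):   # k = 1, 2, ...
        head_valid = check_head(text_str, headpos)
        tail_valid = check_tail(text_str, tailpos)
        if head_valid == 1 and tail_valid == 1:
            results.append((headpos, tailpos))    # valid: record the match
        # otherwise the match is ignored and the loop moves to k+1
    return results


# The two example strings from the Background section, as raw GBK bytes.
students = bytes.fromhex("B4 F3 D1 A7 C9 FA BD A1 BF B5 BD CC D3 FD")
anthology = bytes.fromhex("BD F9 BB B3 89 40 CB AE CE C4 BB AF CE C4 BC AF")

print(gbk_query(students, bytes.fromhex("F3 D1")))    # []  ("笱" is rejected)
print(gbk_query(students, bytes.fromhex("D1 A7")))    # [(2, 3)]
print(gbk_query(anthology, b"@"))                     # []  ('@' is rejected)
print(gbk_query(anthology, bytes.fromhex("89 40")))   # [(4, 5)]
```

On both Background examples the misaligned hits are dropped while the aligned ones are kept, which is the behavior the description claims for the system.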

The present invention proposes a GBK character query system based on parity check. For GBK-encoded strings, character queries easily produce mismatches because the high and low bytes of adjacent Chinese characters overlap, and the proposed query system addresses this problem. The system consists of three parts: a character basic query module, a GBK parity check module, and a Chinese character boundary verification module. It exploits the double-byte encoding of Chinese characters and uses a bidirectional (forward and backward) parity check to verify the character matches one by one. Practice has shown that the system handles GBK character queries well: it rejects matches that straddle high and low bytes while retaining legitimate matches.

Claims

1. A GBK character query system based on parity check, characterized in that the system comprises three modules: a character basic query module, a GBK parity check module, and a Chinese character boundary verification module;

The character basic query module is used to perform basic query on the read GBK coded string and query string, find the position in the GBK coded string that satisfies the occurrence of the query string, and record its start position and end position;

The GBK parity check module is used to perform an encoding check on the basic query results obtained by the character basic query module: first the bytes are examined one by one forward from the start position to judge the parity of the step counter count, and then the bytes are examined one by one backward from the end position to judge the parity of the step counter count;

The Chinese character boundary verification module is used to judge whether the start position and the end position are on the boundary of GBK double-byte Chinese characters according to the parity value check result of the GBK parity check module.

2. the GBK character query system based on parity as claimed in claim 1, is characterized in that, the query mode of described character basic query module is to scan GBK coded character string successively from left to right, finds and meets query character string to occur The starting position is headpos_k and the ending position is tailpos_k, where the number k represents the result matched by the kth basic query in the current GBK encoded string.

3. the GBK character query system based on parity as claimed in claim 2, is characterized in that, in described GBK parity module, detect byte forward successively along starting position to judge the parity value of step counter count Specifically, the following method is adopted: first, the step counter count is initially set to 0; then, along the headpos_k forward, sequentially detect the byte T_i, where i=headpos_k-1, headpos_k-2,...,0; then, calculate whether T_i&0x80 It is 0, and the processing logic is performed according to the numerical value of T_i&0x80 operation; finally judge the parity value of the step counter count, if the count is odd, record head_valid=0; if the count is even, record head_valid=1, where the symbol & means press The bit performs Boolean AND operation, and head_valid represents the Boolean value of the starting position of the query string.

4. the GBK character query system based on parity as claimed in claim 2, is characterized in that, in the described GBK parity module, detect byte successively along end position to judge the specific parity value of step length counter count The following method is adopted: first, the step counter count is initially set to 0; then, along the tailpos_k backward, sequentially detect the byte T_i, wherein i=tailpos_k+1, tailpos_k+2,...; then, calculate whether T_i&0x80 is 0, And perform processing logic according to the numerical value of T_i&0x80 operation; finally judge the parity value of the step counter count, if count is an odd number, then record tail_valid=0; AND operation, tail_valid Boolean value indicating where the query string terminates.

5. The GBK character query system based on parity check as claimed in claim 3 or 4, characterized in that the verification method of the Chinese character boundary verification module is specifically: if head_valid=1 and tail_valid=1, the matching result of the current GBK character is valid, and the matching information obtained from this query, headpos_k and tailpos_k, is recorded; if head_valid=0 or tail_valid=0, the matching result of the current GBK character is invalid, this matching result is ignored, k is set to k+1, and control returns to the GBK parity check module to verify the next query position, until k is the last matching result of the character basic query module, whereupon the query ends and the query result is output.

6. A method for implementing a GBK character query system based on parity check, characterized by comprising the following steps:

(1) Read the GBK-encoded string into a given string, denoted text_str;

(2) Read the query string, denoted query_str;

(3) Perform a basic query on query_str and text_str by scanning text_str from left to right to find each position where query_str appears, recording the starting position as headpos_k and the ending position as tailpos_k, where k denotes the k-th basic-query match in the current string text_str;

(4) After obtaining the basic query results headpos_k and tailpos_k in step (3), perform encoding verification, specifically: Step A: detect the bytes T_i in sequence forward from headpos_k to judge the parity of the step counter count, where i = headpos_k-1, headpos_k-2, ..., 0; Step B: detect the bytes T_i in sequence backward from tailpos_k to judge the parity of the step counter count, where i = tailpos_k+1, tailpos_k+2, ...;

(5) Judge, according to the parity check result of step (4), whether headpos_k and tailpos_k lie on the boundaries of GBK double-byte Chinese characters;

(6) Output the GBK character query result.

7. The method for implementing the GBK character query system based on parity check as claimed in claim 6, characterized in that, in step (4), step A is specifically: 1) the step counter count is initialized to 0; 2) moving forward from headpos_k, the bytes T_i are detected in sequence, where i = headpos_k-1, headpos_k-2, ..., 0; 3) whether T_i&0x80 is 0 is calculated; 4) the following processing logic is performed according to the value of T_i&0x80: if T_i&0x80 equals 0, head_valid=1 is recorded, the remainder of this step is terminated, and step B is entered directly; if T_i&0x80 is not equal to 0, the counter is set to count=count+1, and one of the following two sub-logics is performed according to the value of i: if i=0, this step is terminated and the value of count is output; if i>0, T_i is set to T_(i-1) and the operation of step A above is re-executed, that is, detection continues forward along the string; 5) the parity of the step counter count is judged: if count is odd, head_valid=0 is recorded; if count is even, head_valid=1 is recorded; where the symbol & denotes a bitwise Boolean AND operation, and head_valid is the Boolean value recorded for the starting position of the query string.
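Step A can be sketched roughly as follows. The function name `head_valid_check` is hypothetical. One interpretive assumption is made and should be read as such: when a byte with T_i&0x80 equal to 0 is encountered, the sketch stops scanning and then judges the parity of count, rather than unconditionally recording head_valid=1. The two readings coincide when the first inspected byte already has its high bit clear (count is 0). The sketch also follows the claim's high-bit test as stated, so it presumes every byte of a double-byte character has its high bit set.

```python
def head_valid_check(text_str: bytes, headpos: int) -> int:
    """Sketch of step A: count high-bit bytes immediately before headpos.

    Returns 1 if the count is even (headpos is taken to lie on a
    character boundary) and 0 if it is odd.
    """
    count = 0                      # 1) step counter starts at 0
    i = headpos - 1                # 2) first byte in front of the match
    # 3)/4) walk toward index 0 while the high bit is set
    while i >= 0 and (text_str[i] & 0x80) != 0:
        count += 1
        i -= 1
    # 5) an even run means only whole double-byte characters precede headpos
    return 1 if count % 2 == 0 else 0
```

For the bytes B0 A1 B0 A1 and a raw match starting at offset 1, a single high-bit byte precedes the match, so the count is odd and the start position is rejected.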

8. The method for implementing the GBK character query system based on parity check as claimed in claim 6, characterized in that, in step (4), step B is specifically: 1) the step counter count is initialized to 0; 2) moving backward from tailpos_k, the bytes T_i are detected in sequence, where i = tailpos_k+1, tailpos_k+2, ...; 3) whether T_i&0x80 is 0 is calculated; 4) the following processing logic is performed according to the value of T_i&0x80: if T_i&0x80 equals 0, tail_valid=1 is recorded and the remainder of this step is terminated; if T_i&0x80 is not equal to 0, the counter is set to count=count+1, and one of the following two sub-logics is performed according to the value of i: if i equals the length of the current string text_str, this step is terminated and the value of count is output; if i is less than the length of the current string text_str, T_i is set to T_(i+1) and the operation of step B above is re-executed, that is, detection continues backward along the string; 5) the parity of the step counter count is judged: if count is odd, tail_valid=0 is recorded; if count is even, tail_valid=1 is recorded; where the symbol & denotes a bitwise Boolean AND operation, and tail_valid is the Boolean value recorded for the ending position of the query string.
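A corresponding sketch of step B is given below, under the same caveats as any illustration of this claim. The function name `tail_valid_check` is hypothetical; tailpos is assumed to be the zero-based index of the last byte of the match; and, as an interpretive assumption, a byte with its high bit clear simply ends the scan, after which the parity of count is judged.

```python
def tail_valid_check(text_str: bytes, tailpos: int) -> int:
    """Sketch of step B: count high-bit bytes immediately after tailpos.

    Returns 1 if the count is even (the match is taken to end on a
    character boundary) and 0 if it is odd.
    """
    count = 0                      # 1) step counter starts at 0
    i = tailpos + 1                # 2) first byte after the match
    # 3)/4) walk toward the end of the string while the high bit is set
    while i < len(text_str) and (text_str[i] & 0x80) != 0:
        count += 1
        i += 1
    # 5) an even run means only whole double-byte characters follow tailpos
    return 1 if count % 2 == 0 else 0
```

For the bytes B0 A1 B0 A1 and a raw match ending at offset 2, exactly one high-bit byte follows the match, so the count is odd and the end position is rejected.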

9. The method for implementing the GBK character query system based on parity check as claimed in claim 6, characterized in that step (5) is specifically: if head_valid=1 and tail_valid=1, the matching result of the current GBK character is valid, and the matching information obtained from this query, headpos_k and tailpos_k, is recorded; if head_valid=0 or tail_valid=0, the matching result of the current GBK character is invalid, this matching result is ignored, k is set to k+1, and the method returns to step (4) to verify the next query position.

10. The method for implementing the GBK character query system based on parity check as claimed in claim 6, characterized in that step (6) is specifically: if k is already the last matching result of step (3), all verification work for this query ends, and all successfully matched headpos_k and tailpos_k character query results are output; if k is not the last matching result of step (3), k is set to k+1, and the method returns to step (4) to verify the next query position.
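Steps (1) through (6) can be combined into one compact end-to-end sketch. The names `high_bit_run` and `gbk_query` are hypothetical, tailpos is assumed to be inclusive, and a byte with its high bit clear is assumed to end each scan before parity is judged. The example bytes are chosen so that both bytes of every double-byte character have the high bit set, which is what the T_i&0x80 test in the claims presumes.

```python
def high_bit_run(text: bytes, start: int, step: int) -> int:
    """Count consecutive bytes with the high bit set, from start in direction step."""
    count, i = 0, start
    while 0 <= i < len(text) and (text[i] & 0x80) != 0:
        count += 1
        i += step
    return count


def gbk_query(text_str: bytes, query_str: bytes) -> list:
    """Return the (headpos, tailpos) pairs that pass both parity checks."""
    results = []
    if not query_str:
        return results
    pos = text_str.find(query_str)                     # step (3): basic query
    while pos != -1:
        headpos, tailpos = pos, pos + len(query_str) - 1
        # step (4): step A scans toward index 0, step B toward the end
        head_valid = high_bit_run(text_str, headpos - 1, -1) % 2 == 0
        tail_valid = high_bit_run(text_str, tailpos + 1, +1) % 2 == 0
        if head_valid and tail_valid:                  # step (5): boundary check
            results.append((headpos, tailpos))
        pos = text_str.find(query_str, pos + 1)        # next k
    return results                                     # step (6): output


X = b"\xb0\xa1"   # one double-byte character
Y = b"\xa1\xb0"   # another; its bytes equal the low byte of X plus the high byte of X
print(gbk_query(X + X, Y))                 # [] : the raw hit at (1, 2) straddles two characters
print(gbk_query(b"ab" + X + Y + b"c", Y))  # [(4, 5)]
```

Working on raw bytes rather than decoded text keeps the sketch close to the byte offsets headpos_k and tailpos_k used throughout the claims.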