[go: up one dir, main page]

CN101488127B - Bit mark character string fuzzy retrieval method for grouping character and labellng with bit - Google Patents

Bit mark character string fuzzy retrieval method for grouping character and labellng with bit Download PDF

Info

Publication number
CN101488127B
CN101488127B CN200510057491.4A CN200510057491A CN101488127B CN 101488127 B CN101488127 B CN 101488127B CN 200510057491 A CN200510057491 A CN 200510057491A CN 101488127 B CN101488127 B CN 101488127B
Authority
CN
China
Prior art keywords
character
character string
mark
unit
place value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN200510057491.4A
Other languages
Chinese (zh)
Other versions
CN101488127A (en
Inventor
徐文新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN200510057491.4A priority Critical patent/CN101488127B/en
Publication of CN101488127A publication Critical patent/CN101488127A/en
Application granted granted Critical
Publication of CN101488127B publication Critical patent/CN101488127B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a character string retrieval fuzzy method for grouping the characters and labeling by bits. The method includes: dividing total character elements into n groups, and recording character P element information which forms a character string S by data W having n bits; if the character element P1 of the S belongs to j group, then setting jth bit of the W as 1; P2 belonging to k group, then setting kth bit of the W as 1 -, taking the W after all the character elements are labeled as a '' bit value'' of the S, comparing Wa of Sa with Wb of Sb, judging that the Sb does not contain or possibly contains all the character elements of Sa; and aiming at the possible Sb, judging whether Sa is contained by using a character comparison method. Based on the above method, according to logic algebra, multiple labeling and comparison bit value methods can be extrapolated. The method can improve retrieval speed by many times, but the character string is needed to be labeled firstly, so that it is suitable for the character string search with a fixed retrieval range.

Description

To character grouping and with the character string fuzzy retrieval method of position mark
Technical field
The present invention relates to a kind of character string fuzzy retrieval method, fast and effeciently determine the scope that object character string may exist, re-use existing character match method location object character string, be applicable to the string search field that range of search is relatively fixed.
Background technology
Common character string fuzzy search adopts successive appraximation mode to carry out, as judged whether comprise character f in character string S=" bdopfqew ", computing machine compares with first character b and the f of main string S, compare with second of S character d and f again, by that analogy, until the 5th of S the character is identical with f, the match is successful, and this is the simplest situation.If substring T length is more than 2 characters, simple pattern matching algorithm, i.e. BF algorithm are with S 1with T 1relatively, if difference, with S 2with T 1relatively, the like, until some character S of S iwith T 1identical, then by the S after them i+1with T 1+1compare, if also identical, then continue down to compare, as some character S of S i+nwith T 1+ntime different, then return, then with S i+1with T 1compare as a new round, repeat above process, until the character in T is all than complete, then the match is successful, otherwise it fails to match.Along with the length of search key T increases, the corresponding increase of complexity of character match.Pattern matching algorithm after improvement, i.e. KMP algorithm, concerning the alphabetic writing of small size character set, avoid backtracking, but the Chinese character string large for character set, monocase frequency is low, essential meaning is little.In brief, BF algorithm and KMP algorithm are all carry out successive appraximation to the character of main string and substring.
Within 2004, I proposes " prime number replacing character string search technology ", and the method can improve the speed 2-3 of character string fuzzy search doubly, but implements the method for long character string, needs more space storage prime number product value.In order to improve the speed of character string fuzzy search, and the demand reduced memory space, the present invention proposes the composition information carrying out tab character string with n position (bit) of data, data after after mark claim " place value " of this character string, the place value of two character strings is compared, and in conjunction with common character successive appraximation method, realize the fuzzy search of character string.Test shows, speed is the several times of general character successive appraximation fuzzy search and even more than tens times.
Summary of the invention
The present invention is a kind of character string retrieval technique, and object improves the speed of character string fuzzy search.With position (bit) several characters units corresponding, with the corresponding alphabet unit in n position, namely divide alphabet unit to be n group, and with the n of data individual be 0 position, be designated as W f, mark the character metamessage of composition character string.If several character string S character element P 1belong to n-th group, be then correspondingly labeled as 1 by n-th of W, similarly, according to other character element P of S 2, P 3, P 4affiliated group marks W, and complete the W after alphabet meta-tag, record the information of S, be called " place value " of S, this mode is called 1 mark.According to the principle of logic algebra, also can be the position of 1 with the n of data, be designated as mark the character metamessage of composition character string.If S character element P belongs to n-th group, then correspondingly by data n-th be labeled as 0, this mode is called 0 mark.By to S a" place value " W a, with S b" place value " W b, compare, can S be judged b" do not comprise ", " comprising " or " may comprise " S ball characters unit.Such as, to W awith W bcarry out an implication operation, if there is implication relation all positions, then S bcomprise and maybe may comprise S a" all character unit ".If need, then judge S by common character successive appraximation method bwhether comprise " S a".Be " position mark retrieval ", " bit mark character string retrieval " " bit mark character string retrieval technique " hereinafter referred to as the inventive method.
Test shows, position mark retrieval can significantly improve the fuzzy search speed of character string.Outside speed advantage, another feature of position mark retrieval is that multiple keyword query is equally convenient with single keyword query.Position mark both can be used for the retrieval of ordinary meaning, namely judge whether database character string comprises keyword, also can do " inverse retrieval ", judge whether keyword comprises database character string, can be used in phonetic entry, Pinyin Input and Chinese word segmenting, coupling basic sentence patterns or word.
As the expansion of basic skills, if can be used for the figure place n marked, be more than 2 times of the average length m of character string, can mark, to improve screening effeciency with the corresponding one group of character unit of the combination of several position (bit).
Bit mark character string retrieval technique, as a kind of character string algorithm, needs first to carry out position mark to the character string of range of search, so be applicable to the string searching field that range of search relatively fixes.
1. basic skills
During bit mark character string retrieval is implemented, can with the character of " position " or " combination of position " corresponding ordinary meaning, as: a, b, A, B, ∑ #, ∞, ま, :, in, state; Also the corresponding Chinese radical in position can be used as required, as: Rolling, Si, so that stroke, as Pie, Dian etc.; For alphabetic writing, as english character string " day and night ", the corresponding word in position can be used, as day, and, night, or with the corresponding monogram in position, as ay, ai, the initial consonant of the Chinese phonetic alphabet, simple or compound vowel of a Chinese syllable or syllable can be represented in Chinese phonetic alphabet input or phonetic entry, can phonetic symbol be represented for other Languages, as θ, si η.For convenience of explanation, claim the character unit corresponding to combination of position or position to be " character unit ", be designated as P.
The method of bit mark character string retrieval, easily realizes in alphabetic writing string search.If with one 32, each is the data 0000,0000,0000,0000,0000,0000,0000,0000 of 0, is designated as W f, with W fcorresponding 26 English alphabets in 1st to the 26th position of (also can be from right-to-left), may not consider for other 6 from left to right, also can corresponding punctuation mark.Word big comprises b, g, i, so 3 positions such as corresponding 2,7,9 are labeled as " 1 ", then data become 0100,0010,1000,0000,0000,0000,0000,0000, are called the place value of big, are designated as W a, this mark mode is one token.
Equally, bigger comprises b, e, g, i, r, therefore by W f5 positions such as 2,5,7,9,18 be labeled as " 1 ", then data become 0100,1010,1000,0000,0100,0000,0000,0000, are called the place value of bigger, are designated as W b.
Can find out, all big are the position of 1, and bigger also must be 1; But bigger is the position of 1, big may be 1, also may be 0, as the 5th, 18 position; Bigger is the position of 0, and big must be 0, as 21 positions such as the 1st, 3,4,6,8,10.By W awith W bcarry out " position is contained " computing, result is that all positions are " 1 ": 1111,1111,1111,1111,1111,1111,1111,1111, is designated as W t, if unsigned long integer, be exactly 4294967295.Can be formulated as: W a→ W b=W t, character string bigger comprises big.
If the place value of biggest is designated as W c, have equally: W a→ W c=W t, character string biggest comprises big.
If the place value of BIG is designated as W a, by the place value W of big acarry out " position is contained " computing with it, same has: W a→ W a=W t, character string BIG equals big.Here do not consider that alphabet size is write, only the place value of the character string that explanation two is identical carries out " position is contained " computing, and all positions are 1.
If the place value of digger is designated as W dif, by the place value W of big acarry out " position is contained " computing with it, result is 1011,1111,1111,1111,1111,1111,1111,1111, is not equal to W t, that is: W a→ W d≠ W t, character string digger does not comprise big.
That is, by carrying out " position is contained " computing, if result is not equal to W to the place value W of two character strings t, then consequent corresponding character string " does not comprise " (" no more than " and " being not equal to ", lower same) character string corresponding to preceding paragraph.
Scheme does not above consider the capital and small letter of letter, if case sensitive, then need 52 positions, this also can realize.But the corresponding position of each character unit, always unfeasible.As, GBK scope has 21000 Chinese characters, with the corresponding Chinese character in a position, the data of storage place value can take a lot of space, wherein having considerable is insignificant blank, and read data and bit comparison operation use time can increase meaninglessly, and in enforcement, as required can with one " position " corresponding multiple Chinese character.An easy scheme is by 21000 of GBK scope Chinese characters and other symbol, is divided into 32 groups, stores the place value of character string with long data according to coding, is provided with character string " Gu Yanzhi long river, desert setting sun circle ", then:
Chinese character Greatly Unconcerned Lonely Cigarette Directly Long River Fall Day Circle
Coding 46323 50350 47554 53708 54961 45988 47827 49892 51413 54450
Group 20 15 3 13 18 5 20 5 22 19
Chinese character Frost Cold Long River ? Fall Day My god Margin ?
Coding 52138 49380 45988 47827 ? 49892 51413 52460 53700 ?
Group 11 5 5 20 ? 5 22 13 5 ?
Coding in table obtains with function code in excel, and wherein " greatly " " river " coexists 20 groups, and " length " " falls " to coexisting 5 groups, so be only 1 by 8 corresponding positions, then and the place value W of " Gu Yanzhi long river, the desert setting sun is justified " hfor:
0010,1000,0000,1010,0111,0100,0000,0000
The W of " the long river setting sun " can be obtained equally i" place value " is: 0000,1000,0000,0000,0001,0100,0000,0000, with W iwith W hmake position " position is contained " computing: W i→ W h=W t, there is relation of inclusion in two character strings.
In " white cold long river ", " frost " is the 11st group, and " cold " and " length " are the 5th group, then " place value " W jfor: 0000,1000,0010,0000,0001,0000,0000,0000, with W jwith W hmake position " position is contained " computing: W j→ W h≠ W t, and also there is not relation of inclusion in two character strings.That is, with one " position " corresponding multiple Chinese character, obtain place value, " position is contained " computing is carried out, if result is not equal to W to the place value of two character strings t, then consequent corresponding character string " does not comprise " " the alphabet unit " of the character string corresponding to preceding paragraph, also just must not comprise this " character string ".For alphabetic writing, in like manner with the corresponding multiple word in a position, can judge that a character string " does not comprise ", the character of " may comprise " another character string unit.
But, " position is contained " computing is carried out, if result equals W to the place value of two character strings t, consequent corresponding character string is also " may comprise " character string corresponding to preceding paragraph.
If the place value of gibber is designated as W e, by the place value W of big acarry out " position is contained " computing with it, have equally:
W a→W e=W T
But character string gibber does not comprise big.
" place value " W of " setting sun ends of the earth " kfor: 0000,1000,0000,1000,0000,0100,0000,0000, with and W hdo position " position is contained " computing:
W k→W h=W T
But " Gu Yanzhi long river, desert setting sun circle " does not comprise " setting sun ends of the earth ".
That is, bit mark character string retrieval is distinguishing with general character string successive appraximation method.Even if the place value presence bit implication relation of two character strings, also not necessarily there is relation of inclusion in two character strings.If with " position " correspondence " a character unit ", the place value presence bit implication relation of two character strings, then the character string of consequent correspondence " comprises " the alphabet unit of character string corresponding to preceding paragraph.But owing to not considering the arrangement of character unit, can not affirm whether two character strings exist relation of inclusion.If mark is with " position " correspondence " one group of character unit ", multiple character unit is had among one group of character unit, then the character string " likely " of consequent correspondence comprises the alphabet unit of character string corresponding to preceding paragraph, also " likely " do not comprise character string corresponding to preceding paragraph alphabet unit, more can not affirm whether two character strings exist relation of inclusion.But as required, can judge whether two character strings exist relation of inclusion by character successive appraximation method again.
Bit mark character string is retrieved, and carry out the preceding paragraph of bit arithmetic and consequent place value, can be all the label information of the character unit of a character string, also can be the label information of all character units of multiple character string, is referred to as the place value of " several character strings " S." several character strings " is the appositive concept of S, and in other words, S refers to one or more character string.
If carry out bit mark character string retrieval to database, character string record S 1, S 2, S 3... place value be respectively W 1, W 2, W 3...Search key S tplace value be designated as W t, with W twith W ncarry out an implication operation, result is W t, then character string S ncomprise and maybe may comprise S tall characters unit.By qualified S nthe record set formed, is designated as R i.Retrieving the time used is referred to as T i.Usual needs are at R imiddle character string successive appraximation method obtains net result collection R z, retrieving the time used is referred to as T z.
Therefore the used time that bit mark character string search method is total is T i+ T zif wish to make T ilow, concerning the processors of 32, the data of preferably storing place value are exactly 32.For certain character string fuzzy search, R zbe certain, reduce T zmethod be make PRELIMINARY RESULTS collection R as far as possible iminimum.The figure place n of storage place value is fewer, and the character unit of a position correspondence is more, and screening effect is poorer, R ialso can be larger; Character string average length is longer, R ialso can be larger.Affect R isize also has another factor: search key is longer, R ialso less.
Bidding remembers that figure place used is n, and " 1 " of character string place value is m, and " 1 " of search key place value is k, then the screening probability of bit comparison can calculate with following formula, and its value is the smaller the better:
P Ii = C m k C n k = m ! k ! ( m - k ) ! n ! k ! ( n - k ) ! = m ! ( n - k ) ! n ! ( m - k ) !
N is that the screening probability calculation of 32, m and k part value is as follows:
m k=1 k=2 k=3 k=4
24 0.75 0.556451613 0.408065 0.295495
22 0.6875 0.465725806 0.310484 0.20342
20 0.625 0.383064516 0.229839 0.134733
18 0.5625 0.308467742 0.164516 0.085095
16 0.5 0.241935484 0.112903 0.050612
14 0.4375 0.183467742 0.073387 0.027836
12 0.375 0.133064516 0.044355 0.013765
10 0.3125 0.090725806 0.024194 0.00584
8 0.25 0.056451613 0.01129 0.001947
In three factors of impact screening probability, during user search, length keywords used determines the size of k, record string length determines the size of m, but length keywords used is uncontrollable when record string length, retrieval, controllable factor marks figure place n used, and the number m of " 1 ", the size of k after the certain string token of length, also affect by n, so n and m keeps enough ratio very important.
Some suggestion:
The cpu of 1.32, marks with lint-long integer, is beneficial to place value most and compares, if signless integer, then W is 32 positions, if there is sign bit integer, then only has 31 positions to be convenient to mark.If character string average length is greater than 16 character units, the database that record number is many, in SQL SEVER 2000,63 positions of available data types bigint mark, and correspondingly, character unit are divided into 63 groups.Certainly, for 32 bit processors, store place value with bigint, read and carry out bit comparison and must use the more time.In fact, any data type of being convenient to carry out bit arithmetic in any database all can be used for tab character string, if naturally better by the data type without sign bit.For 64 cpu, 64 positions certainly should be adopted to carry out tab character string, to make full use of the performance of cpu, improve the dispersion of " place value ".
If 2. cpu is 32, the database that between character string, length differs greatly, can be divided into two tables by record, short character string list 32 positions mark, long string table 64 marks, is joined together by two table Query Results in inquiry with union order.
3. the grouping of character unit take frequency equilibrium as the best, also will take into account the speed of mark place value.Carry out modular arithmetic by Hanzi internal code, remainder number divides into groups, and easily realize, but be not optimal group, and modular arithmetic division arithmetic speed is slow.Can consider that Chinese character is divided into 32 groups by certain 5 position of getting Hanzi internal code, speed can be faster, and effect may be better.
Outside the retrieval of ordinary meaning, can carry out database character string " inverse retrieval ", namely with database character string S by position labeling method nw nvalue is preceding paragraph, with search key S tw tvalue is a consequent implication operation that carries out, and the position whether all according to result is " 1 ", i.e. W n→ W t=W t, can filter out those composition character units in database may be search key S tthe S comprised n.
Nature, the calculating of the screening probability of inverse retrieval will be conversely.If database and keyword all use one token, marking figure place used is n, S after the mark of position nplace value W n" 1 " be m, " 1 " after search key mark is k, then search key S tcomprise the probability that maybe may comprise its all character unit can calculate with following formula:
P Ii = C k m C n m = k ! m ! ( k - m ) ! n ! m ! ( n - m ) ! = k ! ( n - m ) ! n ! ( k - m ) !
2. test analysis
With said method to one based on Chinese character, have the English and digital database of part to test: to record several 267, more than 000 bar, character amount 3,473, more than 000, character string average length 12.989, with lint-long integer mark place value, programming language is VB, and operating system is Window xp, CPU Celeron 800Hz, internal memory 256M, Jijia 810 mainboard, hard disk 40G.The retrieval overall used time, be designated as T.Obviously, the mark retrieval of any once position, all must read the place value of all records, compare with it with the place value of keyword, the time used can think a constant, is called T 0, but not easily directly record.The relatively rear qualified record set of place value, is designated as R 1, to R 1carry out successive appraximation retrieval used time T 1also not easily directly record, but and R 1size is relevant, R 1can obtain.Another factor of impact retrieval used time is net result collection R 2size.According to T, R of 120 test gained 1, R 2, regretional analysis obtains following equation:
T ^ = 0.268 + 0.000,008,625 R 1 + 0.000,0270,2 R 2
Coefficient of determination after adjustment
The significance test F=5265.814 of regression model
Be greater than F 0.01(2,117)=4.791
The significance test t of constant 0=86.610
R 1significance test t 1=74.263
R 2significance test t 2=25.489
All be greater than t 0.01(117)=2.6185
Visible, regression equation confidence level is very high.Constant wherein, namely reads the place value of all records, compares with it the time used, i.e. T with the place value of keyword 0, be 0.268 second.120 2.1739 seconds average used times of character string successive appraximation method of simultaneously carrying out.The time T that visible place value is more used 0be only 1/8th of the usual successive appraximation method used time.But the overall used time of bit mark character string retrieval goes back and R 1, R 2relevant, n=31 in this test, character string average length 12.989, have overlapping phenomenon after mark, the mean number m of " 1 ", about 11 to 12, calculates by 12 here, if the character unit distribution of database character string is normal above, " 1 " distributing equilibrium after mark, can calculate R by probability 1, net result collection R 2have nothing to do with screening effect, be appointed as 240 here, then ideally k is the retrieval used time of 1 to 12 after keyword tag, can be as follows by regression equation calculation:
k R 1 R 2 T
1 103354.8 240 1.16592
2 37896.77 240 0.601344
3 13067.85 240 0.387195
4 4200.381 240 0.310713
5 1244.557 240 0.285219
6 335.0732 240 0.277375
7 80.41756 240 0.275178
8 16.75366 240 0.274629
9 2.91368 240 0.27451
10 0.39732 240 0.274488
11 0.03784 240 0.274485
12 0.001892 240 0.274485
Visible, compared with usual character string retrieving method average 2.1739 seconds used times, bit mark character string retrieval has very large performance advantage.But the packet conditions of character unit during distribution situation, the mark of the data type used especially size of n, database character unit, all affect the used time that bit mark character string is retrieved, so this regression equation only has reference significance when different hardware and mark.
Only have 26 letters in English, if character string is very long, every bar record is containing each letter, then mark retrieval in position can lose screening effect.But record several 242,000 to one, character amount 7,493, the Database in English data of 000 are tested, show that bit mark character string retrieval is also feasible to alphabetic writing usually.This database character string data type is varchar, and field long 56, character string on average grows 30.846, marks 26 letters, case insensitive, ignore space and other character with 26 bit of 1 lint-long integer.Because the frequency of utilization of letter English letter is unbalanced, after mark, statistics only has 3213052 " 1 ", the average of each record " 1 " is 13.2266, illustrates that being marked with a large amount of overlaps occurs on the one hand, illustrates that not every bar record comprises each letter on the other hand.
Carry out 120 character successive appraximation retrievals to this database, the average used time is 4.573 seconds, and carry out position mark retrieval, according to T, R1, R2 of 120 times, regretional analysis obtains following equation simultaneously:
T ^ = 0.265 + 0.000,019,36 R 1 + 0.000,0362 , 3 R 2
Coefficient of determination after adjustment
The significance test F=55405.88 of regression model
Be greater than F 0.01(2,117)=4.791
The significance test t of constant 0=36.400
R 1significance test t 1=121.733
R 2significance test t 2=77.409
All be greater than t 0.01(117)=2.6185
Be the alphabetical distribution statistics table of this database below:
Letter Containing the record number of this letter Number percent Letter Containing the record number of this letter Number percent
t 227167 0.935136 h 114263 0.470365
r 222338 0.915257 u 109376 0.450248
e 221713 0.912685 g 97339 0.400697
a 213249 0.877842 f 83452 0.343531
c 210116 0.864945 W 71228 0.293211
n 204225 0.840695 b 67065 0.276074
o 201082 0.827757 y 60698 0.249864
i 195734 0.805742 v 59351 0.244319
s 195639 0.805351 k 47379 0.195036
l 169753 0.698791 x 15150 0.062365
d 159202 0.655357 q 7258 0.029878
m 125449 0.516413 j 6351 0.026144
p 122168 0.502906 z 6307 0.025963
From upper table, in this database, t, r, e tri-letter frequency are the highest, take tree as search key, and actual retrieval obtains the record 7 containing tree, and R 1be 188491,3.655 seconds used times, the successive appraximation simultaneously carried out 4.465 seconds used times of retrieval.As shown the alphabetical frequency of t, r, e tri-by upper, can be calculated:
R1=242,924*0.935136*0.915257*0.912685=242,924*0.781158=189762
By the regression equation calculation retrieval used time be again:
T ^ = 0.265 + 0.000,019,36 * 189762 + 0.000,0362,3 * 7 = 3.939
Visible, even the complete like this word be made up of high frequency letter of tree, position mark retrieval still has performance advantage.In fact, in English, most of word is more than 4 letters, if wherein there is a low frequency letter, the screening effect of position mark just very well, is equivalent to " short-board effect ".Enumerate 30 test datas for reference:
3. the overlapping possibility of a mark
In alphabetic writing, number of letters is little, so the frequency that letter repeats is high, as the words such as bigger, biggest have two g, but only the 7th " position " is labeled as 1.For Chinese character string, the probability that in a character string, same Chinese character repeats is low, but Chinese character is divided into groups, several Chinese characters in a character string may belong to same group, as in grouping above, " greatly " " river " coexists 20 groups, and " length " " falls " to coexisting 5 groups, is overlapped in one " position " after mark.Even if grouping realizes character unit frequency equilibrium, also there is the problem of string token overlap.If the figure place being used for marking is n, the number of character unit is m.The probability wherein do not overlapped completely is:
P = A n m n m = n ( n - 1 ) · · · ( n - m + 1 ) n m
Obviously, time n is certain, m often increases by 1, and the new new-added item of molecule just successively decreases 1, and the new new-added item of denominator is constant, and the probability do not overlapped completely is more and more lower.If mark with 32 positions, the long m of character string is 7, the probability do not overlapped completely:
P ( 7 ) = A 32 7 32 7 = 32 ( 32 - 1 ) · · · ( 32 - 7 + 1 ) 32 7 = 32 * 31 * 30 * 29 * 28 * 27 * 26 32 * 32 * 32 * 32 * 32 * 32 * 32 = 16,963,914,240 34,359,738,368 = 0.4937
Following table marks with 32 positions, when the long m of character string is 1 to 24, and the probability do not overlapped completely:
The overlapping loss just meaning information, overlap is to a certain degree inevitable, is also acceptable, but the ratio of overlap is too high, can affect the performance of position mark retrieval, so the probability overlapped also merits attention.With n position mark, character string is long is m character unit, and the overlapping probability calculation formula for k " 1 " is:
P = C n k · A n m
Wherein A = Σ m 1 + m 2 + · · · + m k = m m ! m 1 ! m 2 ! · · · m k !
Here summation gets allly to meet m 1+ m 2+ ... + m kthe positive integer solutions of=m, its group number is
If with 32 position marks, character string length is 7 character units, and overlap is 3 positions, i.e. m 1+ m 2+ m 3=7, have 15 groups, probability calculation is as follows:
7! m 1 m 2 m 3 m 1 m 2 m 3 7!/(m 1!*m 2!*m 3!)
5040 1 1 5 1 1 120 42
5040 1 2 4 1 2 24 105
5040 1 3 3 1 6 6 140
5040 1 4 2 1 24 2 105
5040 1 5 1 1 120 1 42
5040 2 1 4 2 1 24 105
5040 2 2 3 2 2 6 210
5040 2 3 2 2 6 2 210
5040 2 4 1 2 24 1 105
5040 3 1 3 6 1 6 140
5040 3 2 2 6 2 2 210
5040 3 3 1 6 6 1 140
5040 4 1 2 24 1 2 105
5040 4 2 1 24 2 1 105
5040 5 1 1 120 1 1 42
? ? ? ? ? ? ? 1806
P = C 32 3 · A 32 7 = 4960 * 1806 34,359,738,368 = 0.000261
With 32 positions mark, character string length is 7 character units, mark afterwards k be the probability of 1-7 as following table:
k The number of permutations Arrangement number percent Number percent adds up The k* number of permutations
7 16963914240 0.493714884 0.493714884 1.18747E+11
6 13701623040 0.398769714 0.892484598 82209738240
5 3383116800 0.098461658 0.990946256 16915584000
4 302064000 0.008791219 0.999737475 1208256000
3 8957760 0.000260705 0.99999818 26873280
2 62496 1.81887E-06 0.999999999 124992
1 32 9.31323E-10 1 32
? ? ? ? 2.19108E+11
After mark, the weighted mean number of " 1 " is 6.376881392.
4. a logic algebra principle that mark is relevant and 0 mark
Some positions of the place value W of character string S are " 1 ", mean that proposition " character string S has in certain character unit or certain group character unit " is true exactly.For two character string S aand S b, W acertain k position be 1, mean proposition " character string S exactly ahave in certain k character unit or certain k group character unit " be true; If W ball corresponding certain k positions are 1, mean proposition " character string S exactly bhave in certain k character unit or certain k group character unit " be also true.Nature, W bother certain m-k position may be had to be 1, and W athese certain m-k positions are 0.This relation logic algebra formula is just expressed as: W a→ W b=W t, legibility directly perceived.But and the programming language of not all and Database Systems have " position is contained " operational symbol, so need to convert by logic algebra principle in application.
From Boolean calculation theorem known, if a → b=1, then same, the place value for n position also has: if W a→ W b=W t, then
W ‾ a | W b = W T .
Also can obtain:
W a|W b=W b
W a&W b=W a
From logical operation theorem W a → W b ⇔ W ‾ b → W ‾ a It is known,
W a→ W b=W t, then
So available one each " position " is the data W of 1 tmark place value, if character string comprises a certain character unit, then corresponding position is labeled as 0, this mode is called 0 mark.As:
The place value of big be 10111101011111111111111111111111
The place value of bigger be 10110101011111111011111111111111
Obviously W ‾ b → W ‾ a = W T .
Also can obtain:
W ‾ b | W ‾ a = W ‾ a
More directly perceived with truth table:
W a W b W a→W b=W T W a&W b=W a W a|W b=W b ? ?
0 0 1 0 0 ? ?
0 1 1 0 1 ? ?
1 0 0 0 1 ? ?
1 1 1 1 1 ? ?
Described above is, how to use modal bit arithmetic symbol to carry out place value and compare, the principle of screening probability communicates.Certainly, during 0 mark, m, k all refer to the number of 0.But the bit arithmetic that each programming language and database provide symbol quantity differs, the concrete operations of other bit arithmetic symbol, can release by logic algebra principle.Use logic algebra principle, carry out equivalence transformation, show that formula should for the purpose of brief being suitable for, instead of become more complicated and more complicated certainly.
More in addition, if programming language or Database Systems provide the function of putting 1 or setting to 0 of " position ", then position " mark " can directly use.If Database Systems programming language does not provide position 1, then "or" (or) computing of available position, carries out 1 mark.
First character unit is divided into groups, if with 32 position marks, then to n-th group of assignment 2 32-n.From scale-of-two, then this value from left to right n-th be 1, and all the other positions are 0, are referred to as " basic place value "." the basic place value " of character string all characters unit is carried out "or" (or) computing, the place value of this character string can be obtained.Certainly, also can to n-th group of " basic place value " assignment 2 n-1, from scale-of-two, be then 1 from n-th of right-to-left, and all the other positions are 0.In fact, in certain database, mark record character string and search key, character unit and position have fixing corresponding, no matter and be from right-to-left, still from left to right, or other order.
If carry out 0 mark, then first give n-th group of character unit, composing with n-th is 0 and all the other are " the basic place value " of 1, then " the basic place value " of all for specific character string character units is carried out " with " (and) computing, the place value of this character string can be obtained
As for mark overlapping possibility computing method, 0 mark and 1 marks and communicates.
5. multidigit mark and packet marking
1. multidigit mark
On the basis that list " position " (bit) marks, the following describes many " position " (bit) mark.If the figure place n of storage place value is more, and the length of character string is few, the corresponding one group of character unit of combination of an available j position marks, then have the combination of individual position, correspondingly divides character unit to be group, if S character element P belongs to r group, is then labeled as 1 by j the position forming the combination of r position in W.When j is 1 and above-mentioned list " position " (bit) mark.
If n is 32, j is 2, then character unit is divided into group, is 496 groups, by one token, if character unit belongs to the 1st group, then 1,2 two of W " position " is labeled as " 1 "; If character unit belongs to the 2nd group, then 1,3 two of W " position " is labeled as 1, if character unit belongs to the 3rd group, then 1,4 two of W " position " is labeled as 1 ...
Suppose the place value the 2nd of Chinese character " front " word, 5 two positions are 1, and the place value the 23rd of " scape " word, 29 two positions are 1, then word " prospect " has 2, 5, 23, 29 4 positions are 1, establish again the place value the 2nd of " far ", 23 two positions are 1, the place value the 5th of " greatly ", 29 two positions are 1, then word " long-range " is also 2, 5, 23, 29 4 positions are 1, retrieve by the place value of " prospect ", there will be in result " long-range ", vice versa, namely there will be the character string containing " uncorrelated group of character unit " in result for retrieval, but the Chinese character contained by each group comparatively single " position " is labeled as few, so screening effect can be improved.Adopt the object of multidigit mark to be improve screening effect, its screening probability calculation is identical with unit markings principle.Calculating character string length is 4 character units below, and search key length is 2 character units, screening probability when j is 1,2,3,4, and its value is more low better:
J is 2, does not consider overlap, then after string token, m is 4*2=8 individual 1, and after search key mark, k is 2*2=4 individual 1.
J is 3, and namely character unit is divided into group, does not consider overlap by namely 4960 groups, then after string token, m is 4*3=12 individual 1, and after search key mark, k is 2*3=6 individual 1.
J is 4, and namely character unit is divided into group, does not consider overlap by namely 35960 groups, then after string token, m is 4*4=16 individual 1, and after search key mark, k is 2*4=8 individual 1.
P 1 = C m k C n k = m ! k ! ( m - k ) ! n ! k ! ( n - k ) ! = m ! ( n - k ) ! n ! ( m - k ) ! = 4 ! ( 32 - 2 ) ! 32 ! ( 4 - 2 ) ! = 4 * 3 32 * 31 = 0.012097
P 2 = 8 ! 4 ! ( 8 - 4 ) ! 32 ! 4 ! ( 32 - 4 ) ! = 8 ! 4 ! 32 ! 28 ! = 8 * 7 * 6 * 5 * 32 * 31 * 30 * 29 = 0.00194661
P 3 = 12 ! 6 ! ( 12 - 6 ) ! 32 ! 6 ! ( 32 - 6 ) ! = 12 ! 6 ! 32 ! 26 ! = 12 * 11 * 10 * 9 * 8 * 7 * 32 * 31 * 30 * 29 * 28 * 27 = 0.00101965
P 4 = 16 ! 8 ! ( 16 - 8 ) ! 32 ! 8 ! ( 32 - 8 ) ! = 16 ! 8 ! 32 ! 24 ! = 16 * 15 * 14 * 13 * 12 * 11 * 10 * 9 32 * 31 * 30 * 29 * 28 * 27 * 26 * 25 = 0.00122358
If when search key length is 1,2 character unit, multidigit mark is especially meaningful.But be a timing at storage place value n, j value is not the bigger the better, and as calculated display above, when n be 32, m is 4, j is 3 best results, continues to increase, can not improve screening effect, and the ratio regular meeting increase of overlap during mark, marking operation is also more complicated.If carry out multidigit mark to long character string, n is sufficiently large.
2. packet marking:
As 32 points of 17 and 15 liang of groups by a unsigned long integer, mark respectively, the long m of character string is 4, does not consider to mark overlapping problem, and after mark being 41, if keyword length is 2, is 21 after mark, then screen probability to be:
P = 4 ! ( 17 - 2 ) ! 17 ! ( 4 - 2 ) ! × 4 ! ( 15 - 2 ) ! 15 ! ( 4 - 2 ) ! = 4 * 3 17 * 16 × 4 * 3 15 * 14 = 12 272 × 12 210 = 0.002521
Obviously, also screening effect can be improved.Certainly, carrying out this optimization, is that the position n of mark will be several times as much as m equally.This optimal way, for doing accommodation during long character string, namely uses the character unit packet mode that two kinds different, marks respectively obtain two groups of W, W by two data '.During retrieval, first obtain the W of keyword with the wherein labeling method of a group t, with the W of database nvalue compares, and filters out R 1after.At R 1in, the W ' of keyword is obtained with the labeling method of second group t, with the W ' of database nvalue compares and obtains R 2, then at R 2middlely obtain net result collection R by common successive appraximation method z.Compared with upper a kind of optimization method, be the memory space increasing place value equally, this optimal way need not read whole W ' n, with W ' tcompare.
0 mark, 1 mark can carry out multidigit mark, packet marking, and packet marking also can combine with multidigit mark, and different groups also can be 0 mark, 1 mark respectively.Certainly, be suitable for as good.0 mark, multidigit mark all can be used for inverse retrieval with packet marking.
The speed of bit mark character string retrieval faster than the speed of prime number replacing character retrieval, but when character unit kind is many, with a position correspondence character unit, is not easily implemented; As with multidigit mark, in result for retrieval, do not comprise " object group character unit " and the character string that only comprises " uncorrelated group of character unit " can mix to come in.With prime number replacing character unit, can avoid occurring this specific admixture phenomenon in result for retrieval, in application, first can make preliminary screening with bit mark character string retrieval, make quadratic search with prime number replacing character retrieval again, as needs, then obtain net result by common character string successive appraximation method.
6. the application of bit mark character string retrieval technique in phonetic entry and Chinese phonetic alphabet input:
Phonetically similar word, the homonym of Chinese are many, therefore need to screen suitable words by phonetic in Chinese speech input and Pinyin Input, and position labeling method " inverse retrieval " can be used for this aspect.
Do not consider tone, Chinese has more than 400 syllable, is not easy to the corresponding syllable in a position, the initial consonant of the corresponding Chinese phonetic alphabet in 64 positions of available 8 bytes and simple or compound vowel of a Chinese syllable.Be provided with Chinese character by words collocation and basic sentence patterns database, wherein have basic sentence patterns " he has graduated ", phonetic is " tabiyele ", place value W after mark 1be:
1000,0101,0000,0000,0000,0001,1011,0000,0000,0000,0000,0000,0000,0000,0000,0000
If speech conversion or Pinyin Input " tazaojiubiyele ", place value W after mark tbe:
1000,0101,0001,0010,0000,0001,1011,0000,1000,0000,1000,0000,0000,0000,0000,0000
W 1→W t=W T
Then " he has graduated " is that " tazaojiubiyele " can the basic sentence patterns of reference, can obtain after process " he zaojiu has graduated ", wherein " graduation " is predicate, it is verb, and phonetic is that the word of " zaojiu " has " for a long time " " achievements " " jujube wine " in dictionary, if the other sides such as grammer, semanteme, word frequency can play booster action, then can convert to further " he has graduated for a long time ".Due to the diversity that sound coordinates, compare with place value in the PRELIMINARY RESULTS obtained, the sentence pattern of phonetic as " tebayile " and so on also there will be.Can on the basis of bit mark character string, do postsearch screening by the prime number replacing method of the corresponding syllable of a prime number, obtain closer to result.
If have " building " and the collocation " building " of measure word " " in words collocation and basic sentence patterns, there are again " building " and the collocation " high building " of often modifying its adjective " height ", can mark by their the initial and the final.If speech conversion or Pinyin Input " zhedongxiezilouhengao ", mark by the initial and the final equally, with the place value W of the collocation of database whole words and basic sentence patterns ncompare with it, collocation such as " building " " high buildings " can be filtered out for referencial use, can obtain after process " zhe xiezi building hen is high ", xiezi or xiezilou wherein can obtain from dictionary " writing " or " office building ", if the other sides such as grammer, semanteme, word frequency can play booster action, " zhedongxiezilouhengao " can be converted to " this office building is very high ".Other collocation of Chinese and corresponding place value can be obtained similarly.Administrative division is as, " provinces and cities " " county of province "; Numeral is as, " seven or eight " " multifarious "; Measure word " weight " " first angle "; And even the combination of surname " Lee " " Liu ".If input " hubeishengxianningshi ", according to the method described above, can obtain in " xianning city of hubei province ".If input " zhanglonglihu ", can obtain " Lee long hu ".If input " qiwansanqian ", can obtain " 70,000 san thousand ".Equally, can on the basis of bit mark character string, do postsearch screening by the prime number replacing method of the corresponding syllable of a prime number, obtain closer to result.
In addition, the frequency of utilization due to mandarin initial, simple or compound vowel of a Chinese syllable is unbalanced, during unit markings, with initial consonant, the simple or compound vowel of a Chinese syllable of 64 of the data of 8 bytes " position " corresponding Chinese phonetic alphabet, with some syllable as li retrieve time, due to 1 with i be all high frequency initial consonant, simple or compound vowel of a Chinese syllable, screening effect may be bad, can consider with dibit mark, be namely that 32, j is for 2 with n, i.e. 496 groups of positions, 400 syllables of corresponding Chinese, and do to calculate to database, make it balanced as far as possible.Certainly, mark by 400 syllable dibits, in PRELIMINARY RESULTS, certainly there will be " non-object " sentence pattern and collocation, can retrieve on the basis of screening by mark in place, do postsearch screening by the prime number replacing method of the corresponding syllable of prime number, obtain closer to result.
Also can use technique in Chinese word segmenting, by the word in dictionary, with reference to collocation and sentence pattern, by phonetic mark place value, keyword (actual is sentence) also by phonetic mark place value, carries out an implication operation, obtains R i, effectively can reduce seek scope; Can certainly by the word in dictionary, with reference to collocation and sentence pattern, by the packet marking place value of ISN, keyword is also by same packet marking place value, carry out an implication operation, effectively can reduce seek scope equally, then be aided with other method and complete decomposition to keyword (namely sentence).
Because liaison, voice rheological phenomena are many in English, clearly not clear like Chinese syllable, also can with bit mark character string retrieval technique screening with reference to vocabulary, phrase.If database has phrase in the morning, by wherein vowel, consonant i, n, m, n, i, η carry out mark and obtain place value.If after phonetic entry conversion, corresponding to the wherein fuzzyyer, can reject, with other louder i, n, m, n, i, η carry out mark and obtain place value, with compare with the place value of Database Reference vocabulary, phrase, phrase in the morning can be obtained, and other pronunciation containing i, n, m, the phrase of n, i, η or word.On this basis, reduce the scope further by the method for prime number replacing syllable, and then based on context, assist with grammer, semanteme, word frequency, select in the morning.
Foregoing describe method and relevant probability calculation formula, the logic algebra principle of bit mark character string retrieval, and lay particular emphasis on the screening of the character string record in database.From programming angle, the data structure of database is diversified, and various data structure is also not only for database.But bit mark character string retrieval technique, as a kind of character string algorithm, can be used for the string searching of various data structure, as the character string fuzzy search in large array.As for the character string fuzzy search of little array, then needs may not be had, because the mark of character string also takes time and space.
Embodiment
The present invention has obtained good realization in Chinese character string database and the storehouse fuzzy search of English character string data, here is to the Chinese character string database in SQL SERVER2000, carry out the vb6.0 code marked with " 1 ", the bit mark character string fuzzy search of other programming language, database can refer to enforcement.
1. building database
If database shuku has table biao, wherein have field shuming, data type is nvarchar, and length is 40.Separately set up field wei, data type is " long ", namely 4 bytes, has 32 positions, and wherein one " position " is sign bit, and all the other 31 positions may be used for mark.
Directly " position " is not set to the order of " 1 " or " 0 " in 2 vb6.0, therefore utilizes the inclusive-OR operation of position to make " mark " to database character string
dim?shuzu(30)As?Long
' the long array of definition 31 elements.
shuzu(0)=1
For?x=1?To?30
shuzu(x)=2*shuzu(x-1)
Next
' to 31 element assignment of long array, from 1,2,4,8,16 to 1073741824, from scale-of-two, have one to be 1, and all the other positions are 0, namely " basic place value ".
Dim?biaostr?As?String
' storage is when the character string of pre-treatment
Dim?weizhi?As?Long
' the place value of storage character string
Dim?weizhilin?As?Long
' the basic place value of a storage character
Dim?x?As?Integer
biaors.MoveFirst
' move to first record of database record set biaors
Do
weizhilin=0
weizhi=0
With?biaors
biaostr=.Fields(″shuming″)
End?With
' read in the character string of a record, invest string variable biaostr
For?x=1?To?Len(biaostr)
index=Abs(Asc?W(Mid(biaostr,x,1))Mod?31)
' from string variable biaostr, get a character, and by this character ISN, with 31 for mould does computing, then take absolute value, and invest index, namely character unit is divided into groups.
weizhilin=shuzu(index)
' array shuzu (index) value is invested weizhilin, is one of 1,2,4,8,16 to 1073741824.
weizhi=weizhi?Or?weizhilin
' " basic place value " the weizhilin value of a character and weizhi are done the inclusive-OR operation of position.
Next
' circulation terminates, and obtains " place value " weizhi of current string
With?biaors
.Fields(″wei″)=weizhi
End?With
biaors.Update
' " place value " weizhi is stored in the field wei of current record
biaors.MoveNext
' process next record
Loop?While?Not?biaors.EOF
3. utilize position " with " computing carries out the fuzzy search of database character string
Dim?shuzu(30)As?Long
shuzu(0)=1.
For?x=1?To?30
shuzu(x)=2*shuzu(x-1)
Next
' the long array of definition 31 elements, assignment, from 1,2,4,8,16 to 1073741824, completes consistent with the array of " mark ".
Dim?weizhi?As?Long
' the place value of a storage character string
Dim?weizhilin?As?Long
' the place value of a storage character
Dim?textstr?As?String
' storing retrievable character string
Dim?x?As?Integer
weizhilin=0
weizhi=0
textstr=Text1.Text
' Text1.Text is the search key in text box
For?x=1?To?Len(biaostr)
index=Abs(Asc?W(Mid(textstr,x,1))Mod?31)
weizhilin=shuzu(index)
weizhi=weizhi?Or?weizhilin
Next
' carry out mark to search key and obtain " place value " weizhi, method is consistent with database character string " mark " method.
strQuery=″select*from(SELECT*FROM?biao?WHERE(wei&″&weizhi&″)=″&weizhi&″)DERIVEDTBL?WHERE(shuming?like′%″&textstr&″%′)″
' in SQL SERVER 2000, there is no an implication operation symbol, here " place value " wei that " place value " weizhi of search key and database respectively record is done " with " (and) computing, filter out " with " (and) operation result equals the record (also can use (wei| " & weizhi & ")=wei) of " place value " weizhi of keyword, obtains PRELIMINARY RESULTS collection R i, then make quadratic search in usual character string fuzzy search mode, obtain net result R z.This is the query statement of SQL SERVER 2000, and other database may be slightly different.
Adodc?1.RecordSource=strQuery
Adodc?1.Refresh
' perform retrieval
DataList?1.ListField=″shuming″
DataList?l.ReFill
' in list box, show current result for retrieval.

Claims (6)

1. a character string fuzzy retrieval method, is characterized in that:
Mark with the data W there is n being the position (bit) of 0 the character metamessage forming character string, be a combination with wherein j position, then have the combination of individual position, correspondingly divides alphabet unit to be group, with the corresponding one group of character unit of the combination of a position, marks; If character string S character element P 1belong to r group, then j position of the combination of formation r the position of W is labeled as 1, similarly, according to other character element P of S 2, P 3, P 4affiliated group marks W, completes the W after alphabet meta-tag, records the character metamessage of S, is called the 1 mark place value of character string S;
With having the data that n is the position (bit) of 1 mark and form the character metamessage of character string, be a combination with wherein j position, then have the combination of individual position, correspondingly divides alphabet unit to be group, with the corresponding one group of character unit of the combination of a position, marks; If character string S character element P 1belong to r group, then will j the position of combination of formation r position be labeled as 0, similarly, according to other character element P of S 2, P 3, P 4affiliated group is right mark, complete after alphabet meta-tag record the character metamessage of S, be called the 0 mark place value of character string S;
Note character string S a1 mark place value be W a, note S a0 mark place value be note character string S b1 mark place value be W b, note S b0 mark place value adopt below method for character string comparison:
I. with W awith W bcompare, if all W abe the position of 1, W bcorresponding position is also 1, then S bcomprise and maybe may comprise S aall characters unit;
II. with with compare, if all be the position of 1, corresponding position is also 1, then S bcomprise and maybe may comprise S aall characters unit;
III. with W awith compare, if with W aall corresponding positions are 1 time different, then S bcomprise and maybe may comprise S aall characters unit;
IV. with with W bcompare, if with W ball corresponding positions are 0 time different, then S bcomprise and maybe may comprise S aall characters unit.
2. in accordance with the method for claim 1, it is characterized in that: to W a, with W b, compare and realize with bit arithmetic symbol: note W tfor all positions are the place value of 1, then method I can use W a→ W b=W t, W a| W b=W b, W aaMP.AMp.Amp W b=W arealize.
3. in accordance with the method for claim 1, it is characterized in that: to W a, with W b, compare, if do not meet criterion, then S bdo not comprise S aall characters unit; If meet criterion, when adopting single position to mark to each group character unit and only having a character unit in each group character unit, then S bcomprise S aall characters unit, can not S be affirmed bwhether comprise S a; If meet criterion, when having multiple character unit in each group character unit when adopting single position to mark to each group character unit, or when adopting multiple position to mark to each group character unit, then S bs may be comprised aall characters unit, can not S be affirmed bwhether comprise S a; As required, then by character successive appraximation method S is judged bwhether comprise S a.
4. in accordance with the method for claim 1, it is characterized in that: character unit comprises Chinese radical, stroke; The letter of the Chinese phonetic alphabet, initial consonant, simple or compound vowel of a Chinese syllable, syllable; The letter of other Languages, syllable, word; The phonetic symbol of other Languages; Or their combination.
5. in accordance with the method for claim 1, it is characterized in that: character string S comprises one or more character string, when comprising a character string, its place value W and mark a character string character metamessage, when comprising multiple character string, its place value W and mark all character metamessages of multiple character string.
6. in accordance with the method for claim 1, it is characterized in that: search key S tthere is 1 mark place value W twith 0 mark place value and character string S nthere is 1 mark place value W nwith 0 mark place value by to W t, with W n, compare: can filter out to comprise and maybe may comprise search key S tthe S of all character units nif need, then judge S by character successive appraximation method nwhether comprise S t, also can filter out search key S tcomprise the S that maybe may comprise its all character unit nif needed, S can be judged by character successive appraximation method again twhether comprise S n.
CN200510057491.4A 2005-01-17 2005-09-13 Bit mark character string fuzzy retrieval method for grouping character and labellng with bit Active CN101488127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200510057491.4A CN101488127B (en) 2005-01-17 2005-09-13 Bit mark character string fuzzy retrieval method for grouping character and labellng with bit

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CNA2005100233835A CN1645374A (en) 2005-01-17 2005-01-17 Digit marking character string searching technology
CN200510023383.5 2005-01-17
CN200510057491.4A CN101488127B (en) 2005-01-17 2005-09-13 Bit mark character string fuzzy retrieval method for grouping character and labellng with bit

Publications (2)

Publication Number Publication Date
CN101488127A CN101488127A (en) 2009-07-22
CN101488127B true CN101488127B (en) 2015-01-07

Family

ID=34875846

Family Applications (2)

Application Number Title Priority Date Filing Date
CNA2005100233835A Pending CN1645374A (en) 2005-01-17 2005-01-17 Digit marking character string searching technology
CN200510057491.4A Active CN101488127B (en) 2005-01-17 2005-09-13 Bit mark character string fuzzy retrieval method for grouping character and labellng with bit

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CNA2005100233835A Pending CN1645374A (en) 2005-01-17 2005-01-17 Digit marking character string searching technology

Country Status (2)

Country Link
CN (2) CN1645374A (en)
WO (1) WO2006074586A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4271227B2 (en) * 2006-10-30 2009-06-03 株式会社エスグランツ Bit string search device, search method and program
CN101794283A (en) * 2009-02-03 2010-08-04 华为技术有限公司 Method and system for processing character strings and matcher
CN102682033A (en) * 2011-03-17 2012-09-19 环达电脑(上海)有限公司 Method for querying words by matching binary characteristic values
CN103870537B (en) * 2013-12-03 2017-02-01 山东金质信息技术有限公司 Intelligent word segmentation method for standard retrieval
CN106933938A (en) * 2015-12-30 2017-07-07 唯溥思株式会社 The document retrieval method and literature index method encoded using multibyte
CN115310424B (en) * 2022-08-31 2024-12-06 快意电梯股份有限公司 A method of expression processing based on VB6

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1281191A (en) * 1999-07-19 2001-01-24 松下电器产业株式会社 Information retrieval method and information retrieval device
JP2003152548A (en) * 2001-11-14 2003-05-23 Canon Inc String search method in data compression
US6785677B1 (en) * 2001-05-02 2004-08-31 Unisys Corporation Method for execution of query to search strings of characters that match pattern with a target string utilizing bit vector

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2669601B2 (en) * 1994-11-22 1997-10-29 インターナショナル・ビジネス・マシーンズ・コーポレイション Information retrieval method and system
JP4298138B2 (en) * 2000-06-21 2009-07-15 株式会社日立製作所 Information retrieval method, apparatus for implementing the same, and recording medium recording the processing program

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1281191A (en) * 1999-07-19 2001-01-24 松下电器产业株式会社 Information retrieval method and information retrieval device
US6785677B1 (en) * 2001-05-02 2004-08-31 Unisys Corporation Method for execution of query to search strings of characters that match pattern with a target string utilizing bit vector
JP2003152548A (en) * 2001-11-14 2003-05-23 Canon Inc String search method in data compression

Also Published As

Publication number Publication date
WO2006074586A1 (en) 2006-07-20
CN1645374A (en) 2005-07-27
CN101488127A (en) 2009-07-22

Similar Documents

Publication Publication Date Title
Chang et al. Automatic information extraction from semi-structured web pages by pattern discovery
CN102184169B (en) Method, device and equipment used for determining similarity information among character string information
US5649023A (en) Method and apparatus for indexing a plurality of handwritten objects
CN108829801A (en) A Method for Extracting Event-triggered Words Based on Document-Level Attention Mechanism
CN100483417C (en) Method for catching limit word information, optimizing output and input method system
CN102063508B (en) Generalized suffix tree based fuzzy auto-completion method for Chinese search engine
CN113326267B (en) Address matching method based on inverted index and neural network algorithm
CN105488077A (en) Content tag generation method and apparatus
WO2007016133A2 (en) Processor for fast contextual matching
CN102419778A (en) Information searching method for mining and clustering sub-topics of query sentences
CN103823814A (en) Information processing method and information processing device
CN111831820B (en) Correlation analysis method between news and cases based on case element guidance and deep clustering
CN102929864B (en) A kind of tone-character conversion method and device
CN101488127B (en) Bit mark character string fuzzy retrieval method for grouping character and labellng with bit
CN102135956A (en) Word position tagging-based Tibetan word segmentation method
CN103853792A (en) Automatic image semantic annotation method and system
CN101944086A (en) Whole word index dictionary
CN101470701A (en) Text analyzer supporting semantic rule based on finite state machine and method thereof
CN116226323A (en) Scoring function construction method and related device for semantic retrieval
CN102117285A (en) Search method based on semantic indexing
Dang et al. WordNet-based suffix tree clustering algorithm
CN102122296A (en) Search result clustering method and device
CN101499056A (en) Backward reference sentence pattern language analysis method
CN105426490A (en) Tree structure based indexing method
Phan et al. Automated data extraction from the web with conditional models

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant