CN101488127B - Bit mark character string fuzzy retrieval method for grouping character and labellng with bit - Google Patents
Bit mark character string fuzzy retrieval method for grouping character and labellng with bit Download PDFInfo
- Publication number
- CN101488127B CN101488127B CN200510057491.4A CN200510057491A CN101488127B CN 101488127 B CN101488127 B CN 101488127B CN 200510057491 A CN200510057491 A CN 200510057491A CN 101488127 B CN101488127 B CN 101488127B
- Authority
- CN
- China
- Prior art keywords
- character
- character string
- mark
- unit
- place value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 59
- 150000001875 compounds Chemical class 0.000 claims description 6
- 230000015572 biosynthetic process Effects 0.000 claims 2
- 238000002372 labelling Methods 0.000 abstract description 6
- 238000012216 screening Methods 0.000 description 24
- 230000000694 effects Effects 0.000 description 11
- 238000003860 storage Methods 0.000 description 10
- 238000004364 calculation method Methods 0.000 description 9
- 239000000203 mixture Substances 0.000 description 5
- 230000008569 process Effects 0.000 description 4
- 238000004458 analytical method Methods 0.000 description 3
- 230000008901 benefit Effects 0.000 description 3
- 238000006243 chemical reaction Methods 0.000 description 3
- 238000009826 distribution Methods 0.000 description 3
- 230000009471 action Effects 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 238000005457 optimization Methods 0.000 description 2
- 241001672694 Citrus reticulata Species 0.000 description 1
- 241001269238 Data Species 0.000 description 1
- 240000008866 Ziziphus nummularia Species 0.000 description 1
- 230000004308 accommodation Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 235000019504 cigarettes Nutrition 0.000 description 1
- 230000008878 coupling Effects 0.000 description 1
- 238000010168 coupling process Methods 0.000 description 1
- 238000005859 coupling reaction Methods 0.000 description 1
- 238000000354 decomposition reaction Methods 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 239000006185 dispersion Substances 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000002203 pretreatment Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a character string retrieval fuzzy method for grouping the characters and labeling by bits. The method includes: dividing total character elements into n groups, and recording character P element information which forms a character string S by data W having n bits; if the character element P1 of the S belongs to j group, then setting jth bit of the W as 1; P2 belonging to k group, then setting kth bit of the W as 1 -, taking the W after all the character elements are labeled as a '' bit value'' of the S, comparing Wa of Sa with Wb of Sb, judging that the Sb does not contain or possibly contains all the character elements of Sa; and aiming at the possible Sb, judging whether Sa is contained by using a character comparison method. Based on the above method, according to logic algebra, multiple labeling and comparison bit value methods can be extrapolated. The method can improve retrieval speed by many times, but the character string is needed to be labeled firstly, so that it is suitable for the character string search with a fixed retrieval range.
Description
Technical field
The present invention relates to a kind of character string fuzzy retrieval method, fast and effeciently determine the scope that object character string may exist, re-use existing character match method location object character string, be applicable to the string search field that range of search is relatively fixed.
Background technology
Common character string fuzzy search adopts successive appraximation mode to carry out, as judged whether comprise character f in character string S=" bdopfqew ", computing machine compares with first character b and the f of main string S, compare with second of S character d and f again, by that analogy, until the 5th of S the character is identical with f, the match is successful, and this is the simplest situation.If substring T length is more than 2 characters, simple pattern matching algorithm, i.e. BF algorithm are with S
1with T
1relatively, if difference, with S
2with T
1relatively, the like, until some character S of S
iwith T
1identical, then by the S after them
i+1with T
1+1compare, if also identical, then continue down to compare, as some character S of S
i+nwith T
1+ntime different, then return, then with S
i+1with T
1compare as a new round, repeat above process, until the character in T is all than complete, then the match is successful, otherwise it fails to match.Along with the length of search key T increases, the corresponding increase of complexity of character match.Pattern matching algorithm after improvement, i.e. KMP algorithm, concerning the alphabetic writing of small size character set, avoid backtracking, but the Chinese character string large for character set, monocase frequency is low, essential meaning is little.In brief, BF algorithm and KMP algorithm are all carry out successive appraximation to the character of main string and substring.
Within 2004, I proposes " prime number replacing character string search technology ", and the method can improve the speed 2-3 of character string fuzzy search doubly, but implements the method for long character string, needs more space storage prime number product value.In order to improve the speed of character string fuzzy search, and the demand reduced memory space, the present invention proposes the composition information carrying out tab character string with n position (bit) of data, data after after mark claim " place value " of this character string, the place value of two character strings is compared, and in conjunction with common character successive appraximation method, realize the fuzzy search of character string.Test shows, speed is the several times of general character successive appraximation fuzzy search and even more than tens times.
Summary of the invention
The present invention is a kind of character string retrieval technique, and object improves the speed of character string fuzzy search.With position (bit) several characters units corresponding, with the corresponding alphabet unit in n position, namely divide alphabet unit to be n group, and with the n of data individual be 0 position, be designated as W
f, mark the character metamessage of composition character string.If several character string S character element P
1belong to n-th group, be then correspondingly labeled as 1 by n-th of W, similarly, according to other character element P of S
2, P
3, P
4affiliated group marks W, and complete the W after alphabet meta-tag, record the information of S, be called " place value " of S, this mode is called 1 mark.According to the principle of logic algebra, also can be the position of 1 with the n of data, be designated as
mark the character metamessage of composition character string.If S character element P belongs to n-th group, then correspondingly by data
n-th be labeled as 0, this mode is called 0 mark.By to S
a" place value " W
a,
with S
b" place value " W
b,
compare, can S be judged
b" do not comprise ", " comprising " or " may comprise " S
ball characters unit.Such as, to W
awith W
bcarry out an implication operation, if there is implication relation all positions, then S
bcomprise and maybe may comprise S
a" all character unit ".If need, then judge S by common character successive appraximation method
bwhether comprise " S
a".Be " position mark retrieval ", " bit mark character string retrieval " " bit mark character string retrieval technique " hereinafter referred to as the inventive method.
Test shows, position mark retrieval can significantly improve the fuzzy search speed of character string.Outside speed advantage, another feature of position mark retrieval is that multiple keyword query is equally convenient with single keyword query.Position mark both can be used for the retrieval of ordinary meaning, namely judge whether database character string comprises keyword, also can do " inverse retrieval ", judge whether keyword comprises database character string, can be used in phonetic entry, Pinyin Input and Chinese word segmenting, coupling basic sentence patterns or word.
As the expansion of basic skills, if can be used for the figure place n marked, be more than 2 times of the average length m of character string, can mark, to improve screening effeciency with the corresponding one group of character unit of the combination of several position (bit).
Bit mark character string retrieval technique, as a kind of character string algorithm, needs first to carry out position mark to the character string of range of search, so be applicable to the string searching field that range of search relatively fixes.
1. basic skills
During bit mark character string retrieval is implemented, can with the character of " position " or " combination of position " corresponding ordinary meaning, as: a, b, A, B, ∑ #, ∞, ま,
:, in, state; Also the corresponding Chinese radical in position can be used as required, as: Rolling, Si, so that stroke, as Pie, Dian etc.; For alphabetic writing, as english character string " day and night ", the corresponding word in position can be used, as day, and, night, or with the corresponding monogram in position, as ay, ai, the initial consonant of the Chinese phonetic alphabet, simple or compound vowel of a Chinese syllable or syllable can be represented in Chinese phonetic alphabet input or phonetic entry, can phonetic symbol be represented for other Languages, as
θ, si η.For convenience of explanation, claim the character unit corresponding to combination of position or position to be " character unit ", be designated as P.
The method of bit mark character string retrieval, easily realizes in alphabetic writing string search.If with one 32, each is the data 0000,0000,0000,0000,0000,0000,0000,0000 of 0, is designated as W
f, with W
fcorresponding 26 English alphabets in 1st to the 26th position of (also can be from right-to-left), may not consider for other 6 from left to right, also can corresponding punctuation mark.Word big comprises b, g, i, so 3 positions such as corresponding 2,7,9 are labeled as " 1 ", then data become 0100,0010,1000,0000,0000,0000,0000,0000, are called the place value of big, are designated as W
a, this mark mode is one token.
Equally, bigger comprises b, e, g, i, r, therefore by W
f5 positions such as 2,5,7,9,18 be labeled as " 1 ", then data become 0100,1010,1000,0000,0100,0000,0000,0000, are called the place value of bigger, are designated as W
b.
Can find out, all big are the position of 1, and bigger also must be 1; But bigger is the position of 1, big may be 1, also may be 0, as the 5th, 18 position; Bigger is the position of 0, and big must be 0, as 21 positions such as the 1st, 3,4,6,8,10.By W
awith W
bcarry out " position is contained " computing, result is that all positions are " 1 ": 1111,1111,1111,1111,1111,1111,1111,1111, is designated as W
t, if unsigned long integer, be exactly 4294967295.Can be formulated as: W
a→ W
b=W
t, character string bigger comprises big.
If the place value of biggest is designated as W
c, have equally: W
a→ W
c=W
t, character string biggest comprises big.
If the place value of BIG is designated as W
a, by the place value W of big
acarry out " position is contained " computing with it, same has: W
a→ W
a=W
t, character string BIG equals big.Here do not consider that alphabet size is write, only the place value of the character string that explanation two is identical carries out " position is contained " computing, and all positions are 1.
If the place value of digger is designated as W
dif, by the place value W of big
acarry out " position is contained " computing with it, result is 1011,1111,1111,1111,1111,1111,1111,1111, is not equal to W
t, that is: W
a→ W
d≠ W
t, character string digger does not comprise big.
That is, by carrying out " position is contained " computing, if result is not equal to W to the place value W of two character strings
t, then consequent corresponding character string " does not comprise " (" no more than " and " being not equal to ", lower same) character string corresponding to preceding paragraph.
Scheme does not above consider the capital and small letter of letter, if case sensitive, then need 52 positions, this also can realize.But the corresponding position of each character unit, always unfeasible.As, GBK scope has 21000 Chinese characters, with the corresponding Chinese character in a position, the data of storage place value can take a lot of space, wherein having considerable is insignificant blank, and read data and bit comparison operation use time can increase meaninglessly, and in enforcement, as required can with one " position " corresponding multiple Chinese character.An easy scheme is by 21000 of GBK scope Chinese characters and other symbol, is divided into 32 groups, stores the place value of character string with long data according to coding, is provided with character string " Gu Yanzhi long river, desert setting sun circle ", then:
Chinese character | Greatly | Unconcerned | Lonely | Cigarette | Directly | Long | River | Fall | Day | Circle |
Coding | 46323 | 50350 | 47554 | 53708 | 54961 | 45988 | 47827 | 49892 | 51413 | 54450 |
Group | 20 | 15 | 3 | 13 | 18 | 5 | 20 | 5 | 22 | 19 |
Chinese character | Frost | Cold | Long | River | ? | Fall | Day | My god | Margin | ? |
Coding | 52138 | 49380 | 45988 | 47827 | ? | 49892 | 51413 | 52460 | 53700 | ? |
Group | 11 | 5 | 5 | 20 | ? | 5 | 22 | 13 | 5 | ? |
Coding in table obtains with function code in excel, and wherein " greatly " " river " coexists 20 groups, and " length " " falls " to coexisting 5 groups, so be only 1 by 8 corresponding positions, then and the place value W of " Gu Yanzhi long river, the desert setting sun is justified "
hfor:
0010,1000,0000,1010,0111,0100,0000,0000
The W of " the long river setting sun " can be obtained equally
i" place value " is: 0000,1000,0000,0000,0001,0100,0000,0000, with W
iwith W
hmake position " position is contained " computing: W
i→ W
h=W
t, there is relation of inclusion in two character strings.
In " white cold long river ", " frost " is the 11st group, and " cold " and " length " are the 5th group, then " place value " W
jfor: 0000,1000,0010,0000,0001,0000,0000,0000, with W
jwith W
hmake position " position is contained " computing: W
j→ W
h≠ W
t, and also there is not relation of inclusion in two character strings.That is, with one " position " corresponding multiple Chinese character, obtain place value, " position is contained " computing is carried out, if result is not equal to W to the place value of two character strings
t, then consequent corresponding character string " does not comprise " " the alphabet unit " of the character string corresponding to preceding paragraph, also just must not comprise this " character string ".For alphabetic writing, in like manner with the corresponding multiple word in a position, can judge that a character string " does not comprise ", the character of " may comprise " another character string unit.
But, " position is contained " computing is carried out, if result equals W to the place value of two character strings
t, consequent corresponding character string is also " may comprise " character string corresponding to preceding paragraph.
If the place value of gibber is designated as W
e, by the place value W of big
acarry out " position is contained " computing with it, have equally:
W
a→W
e=W
T
But character string gibber does not comprise big.
" place value " W of " setting sun ends of the earth "
kfor: 0000,1000,0000,1000,0000,0100,0000,0000, with and W
hdo position " position is contained " computing:
W
k→W
h=W
T
But " Gu Yanzhi long river, desert setting sun circle " does not comprise " setting sun ends of the earth ".
That is, bit mark character string retrieval is distinguishing with general character string successive appraximation method.Even if the place value presence bit implication relation of two character strings, also not necessarily there is relation of inclusion in two character strings.If with " position " correspondence " a character unit ", the place value presence bit implication relation of two character strings, then the character string of consequent correspondence " comprises " the alphabet unit of character string corresponding to preceding paragraph.But owing to not considering the arrangement of character unit, can not affirm whether two character strings exist relation of inclusion.If mark is with " position " correspondence " one group of character unit ", multiple character unit is had among one group of character unit, then the character string " likely " of consequent correspondence comprises the alphabet unit of character string corresponding to preceding paragraph, also " likely " do not comprise character string corresponding to preceding paragraph alphabet unit, more can not affirm whether two character strings exist relation of inclusion.But as required, can judge whether two character strings exist relation of inclusion by character successive appraximation method again.
Bit mark character string is retrieved, and carry out the preceding paragraph of bit arithmetic and consequent place value, can be all the label information of the character unit of a character string, also can be the label information of all character units of multiple character string, is referred to as the place value of " several character strings " S." several character strings " is the appositive concept of S, and in other words, S refers to one or more character string.
If carry out bit mark character string retrieval to database, character string record S
1, S
2, S
3... place value be respectively W
1, W
2, W
3...Search key S
tplace value be designated as W
t, with W
twith W
ncarry out an implication operation, result is W
t, then character string S
ncomprise and maybe may comprise S
tall characters unit.By qualified S
nthe record set formed, is designated as R
i.Retrieving the time used is referred to as T
i.Usual needs are at R
imiddle character string successive appraximation method obtains net result collection R
z, retrieving the time used is referred to as T
z.
Therefore the used time that bit mark character string search method is total is T
i+ T
zif wish to make T
ilow, concerning the processors of 32, the data of preferably storing place value are exactly 32.For certain character string fuzzy search, R
zbe certain, reduce T
zmethod be make PRELIMINARY RESULTS collection R as far as possible
iminimum.The figure place n of storage place value is fewer, and the character unit of a position correspondence is more, and screening effect is poorer, R
ialso can be larger; Character string average length is longer, R
ialso can be larger.Affect R
isize also has another factor: search key is longer, R
ialso less.
Bidding remembers that figure place used is n, and " 1 " of character string place value is m, and " 1 " of search key place value is k, then the screening probability of bit comparison can calculate with following formula, and its value is the smaller the better:
N is that the screening probability calculation of 32, m and k part value is as follows:
m | k=1 | k=2 | k=3 | k=4 |
24 | 0.75 | 0.556451613 | 0.408065 | 0.295495 |
22 | 0.6875 | 0.465725806 | 0.310484 | 0.20342 |
20 | 0.625 | 0.383064516 | 0.229839 | 0.134733 |
18 | 0.5625 | 0.308467742 | 0.164516 | 0.085095 |
16 | 0.5 | 0.241935484 | 0.112903 | 0.050612 |
14 | 0.4375 | 0.183467742 | 0.073387 | 0.027836 |
12 | 0.375 | 0.133064516 | 0.044355 | 0.013765 |
10 | 0.3125 | 0.090725806 | 0.024194 | 0.00584 |
8 | 0.25 | 0.056451613 | 0.01129 | 0.001947 |
In three factors of impact screening probability, during user search, length keywords used determines the size of k, record string length determines the size of m, but length keywords used is uncontrollable when record string length, retrieval, controllable factor marks figure place n used, and the number m of " 1 ", the size of k after the certain string token of length, also affect by n, so n and m keeps enough ratio very important.
Some suggestion:
The cpu of 1.32, marks with lint-long integer, is beneficial to place value most and compares, if signless integer, then W is 32 positions, if there is sign bit integer, then only has 31 positions to be convenient to mark.If character string average length is greater than 16 character units, the database that record number is many, in SQL SEVER 2000,63 positions of available data types bigint mark, and correspondingly, character unit are divided into 63 groups.Certainly, for 32 bit processors, store place value with bigint, read and carry out bit comparison and must use the more time.In fact, any data type of being convenient to carry out bit arithmetic in any database all can be used for tab character string, if naturally better by the data type without sign bit.For 64 cpu, 64 positions certainly should be adopted to carry out tab character string, to make full use of the performance of cpu, improve the dispersion of " place value ".
If 2. cpu is 32, the database that between character string, length differs greatly, can be divided into two tables by record, short character string list 32 positions mark, long string table 64 marks, is joined together by two table Query Results in inquiry with union order.
3. the grouping of character unit take frequency equilibrium as the best, also will take into account the speed of mark place value.Carry out modular arithmetic by Hanzi internal code, remainder number divides into groups, and easily realize, but be not optimal group, and modular arithmetic division arithmetic speed is slow.Can consider that Chinese character is divided into 32 groups by certain 5 position of getting Hanzi internal code, speed can be faster, and effect may be better.
Outside the retrieval of ordinary meaning, can carry out database character string " inverse retrieval ", namely with database character string S by position labeling method
nw
nvalue is preceding paragraph, with search key S
tw
tvalue is a consequent implication operation that carries out, and the position whether all according to result is " 1 ", i.e. W
n→ W
t=W
t, can filter out those composition character units in database may be search key S
tthe S comprised
n.
Nature, the calculating of the screening probability of inverse retrieval will be conversely.If database and keyword all use one token, marking figure place used is n, S after the mark of position
nplace value W
n" 1 " be m, " 1 " after search key mark is k, then search key S
tcomprise the probability that maybe may comprise its all character unit can calculate with following formula:
2. test analysis
With said method to one based on Chinese character, have the English and digital database of part to test: to record several 267, more than 000 bar, character amount 3,473, more than 000, character string average length 12.989, with lint-long integer mark place value, programming language is VB, and operating system is Window xp, CPU Celeron 800Hz, internal memory 256M, Jijia 810 mainboard, hard disk 40G.The retrieval overall used time, be designated as T.Obviously, the mark retrieval of any once position, all must read the place value of all records, compare with it with the place value of keyword, the time used can think a constant, is called T
0, but not easily directly record.The relatively rear qualified record set of place value, is designated as R
1, to R
1carry out successive appraximation retrieval used time T
1also not easily directly record, but and R
1size is relevant, R
1can obtain.Another factor of impact retrieval used time is net result collection R
2size.According to T, R of 120 test gained
1, R
2, regretional analysis obtains following equation:
Coefficient of determination after adjustment
The significance test F=5265.814 of regression model
Be greater than F
0.01(2,117)=4.791
The significance test t of constant
0=86.610
R
1significance test t
1=74.263
R
2significance test t
2=25.489
All be greater than t
0.01(117)=2.6185
Visible, regression equation confidence level is very high.Constant wherein, namely reads the place value of all records, compares with it the time used, i.e. T with the place value of keyword
0, be 0.268 second.120 2.1739 seconds average used times of character string successive appraximation method of simultaneously carrying out.The time T that visible place value is more used
0be only 1/8th of the usual successive appraximation method used time.But the overall used time of bit mark character string retrieval goes back and R
1, R
2relevant, n=31 in this test, character string average length 12.989, have overlapping phenomenon after mark, the mean number m of " 1 ", about 11 to 12, calculates by 12 here, if the character unit distribution of database character string is normal above, " 1 " distributing equilibrium after mark, can calculate R by probability
1, net result collection R
2have nothing to do with screening effect, be appointed as 240 here, then ideally k is the retrieval used time of 1 to 12 after keyword tag, can be as follows by regression equation calculation:
k | R 1 | R 2 | T |
1 | 103354.8 | 240 | 1.16592 |
2 | 37896.77 | 240 | 0.601344 |
3 | 13067.85 | 240 | 0.387195 |
4 | 4200.381 | 240 | 0.310713 |
5 | 1244.557 | 240 | 0.285219 |
6 | 335.0732 | 240 | 0.277375 |
7 | 80.41756 | 240 | 0.275178 |
8 | 16.75366 | 240 | 0.274629 |
9 | 2.91368 | 240 | 0.27451 |
10 | 0.39732 | 240 | 0.274488 |
11 | 0.03784 | 240 | 0.274485 |
12 | 0.001892 | 240 | 0.274485 |
Visible, compared with usual character string retrieving method average 2.1739 seconds used times, bit mark character string retrieval has very large performance advantage.But the packet conditions of character unit during distribution situation, the mark of the data type used especially size of n, database character unit, all affect the used time that bit mark character string is retrieved, so this regression equation only has reference significance when different hardware and mark.
Only have 26 letters in English, if character string is very long, every bar record is containing each letter, then mark retrieval in position can lose screening effect.But record several 242,000 to one, character amount 7,493, the Database in English data of 000 are tested, show that bit mark character string retrieval is also feasible to alphabetic writing usually.This database character string data type is varchar, and field long 56, character string on average grows 30.846, marks 26 letters, case insensitive, ignore space and other character with 26 bit of 1 lint-long integer.Because the frequency of utilization of letter English letter is unbalanced, after mark, statistics only has 3213052 " 1 ", the average of each record " 1 " is 13.2266, illustrates that being marked with a large amount of overlaps occurs on the one hand, illustrates that not every bar record comprises each letter on the other hand.
Carry out 120 character successive appraximation retrievals to this database, the average used time is 4.573 seconds, and carry out position mark retrieval, according to T, R1, R2 of 120 times, regretional analysis obtains following equation simultaneously:
Coefficient of determination after adjustment
The significance test F=55405.88 of regression model
Be greater than F
0.01(2,117)=4.791
The significance test t of constant
0=36.400
R
1significance test t
1=121.733
R
2significance test t
2=77.409
All be greater than t
0.01(117)=2.6185
Be the alphabetical distribution statistics table of this database below:
Letter | Containing the record number of this letter | Number percent | Letter | Containing the record number of this letter | Number percent |
t | 227167 | 0.935136 | h | 114263 | 0.470365 |
r | 222338 | 0.915257 | u | 109376 | 0.450248 |
e | 221713 | 0.912685 | g | 97339 | 0.400697 |
a | 213249 | 0.877842 | f | 83452 | 0.343531 |
c | 210116 | 0.864945 | W | 71228 | 0.293211 |
n | 204225 | 0.840695 | b | 67065 | 0.276074 |
o | 201082 | 0.827757 | y | 60698 | 0.249864 |
i | 195734 | 0.805742 | v | 59351 | 0.244319 |
s | 195639 | 0.805351 | k | 47379 | 0.195036 |
l | 169753 | 0.698791 | x | 15150 | 0.062365 |
d | 159202 | 0.655357 | q | 7258 | 0.029878 |
m | 125449 | 0.516413 | j | 6351 | 0.026144 |
p | 122168 | 0.502906 | z | 6307 | 0.025963 |
From upper table, in this database, t, r, e tri-letter frequency are the highest, take tree as search key, and actual retrieval obtains the record 7 containing tree, and R
1be 188491,3.655 seconds used times, the successive appraximation simultaneously carried out 4.465 seconds used times of retrieval.As shown the alphabetical frequency of t, r, e tri-by upper, can be calculated:
R1=242,924*0.935136*0.915257*0.912685=242,924*0.781158=189762
By the regression equation calculation retrieval used time be again:
Visible, even the complete like this word be made up of high frequency letter of tree, position mark retrieval still has performance advantage.In fact, in English, most of word is more than 4 letters, if wherein there is a low frequency letter, the screening effect of position mark just very well, is equivalent to " short-board effect ".Enumerate 30 test datas for reference:
3. the overlapping possibility of a mark
In alphabetic writing, number of letters is little, so the frequency that letter repeats is high, as the words such as bigger, biggest have two g, but only the 7th " position " is labeled as 1.For Chinese character string, the probability that in a character string, same Chinese character repeats is low, but Chinese character is divided into groups, several Chinese characters in a character string may belong to same group, as in grouping above, " greatly " " river " coexists 20 groups, and " length " " falls " to coexisting 5 groups, is overlapped in one " position " after mark.Even if grouping realizes character unit frequency equilibrium, also there is the problem of string token overlap.If the figure place being used for marking is n, the number of character unit is m.The probability wherein do not overlapped completely is:
Obviously, time n is certain, m often increases by 1, and the new new-added item of molecule just successively decreases 1, and the new new-added item of denominator is constant, and the probability do not overlapped completely is more and more lower.If mark with 32 positions, the long m of character string is 7, the probability do not overlapped completely:
Following table marks with 32 positions, when the long m of character string is 1 to 24, and the probability do not overlapped completely:
The overlapping loss just meaning information, overlap is to a certain degree inevitable, is also acceptable, but the ratio of overlap is too high, can affect the performance of position mark retrieval, so the probability overlapped also merits attention.With n position mark, character string is long is m character unit, and the overlapping probability calculation formula for k " 1 " is:
Wherein
Here summation gets allly to meet m
1+ m
2+ ... + m
kthe positive integer solutions of=m, its group number is
If with 32 position marks, character string length is 7 character units, and overlap is 3 positions, i.e. m
1+ m
2+ m
3=7, have 15 groups, probability calculation is as follows:
7! | m 1 | m 2 | m 3 | m 1! | m 2! | m 3! | 7!/(m 1!*m 2!*m 3!) |
5040 | 1 | 1 | 5 | 1 | 1 | 120 | 42 |
5040 | 1 | 2 | 4 | 1 | 2 | 24 | 105 |
5040 | 1 | 3 | 3 | 1 | 6 | 6 | 140 |
5040 | 1 | 4 | 2 | 1 | 24 | 2 | 105 |
5040 | 1 | 5 | 1 | 1 | 120 | 1 | 42 |
5040 | 2 | 1 | 4 | 2 | 1 | 24 | 105 |
5040 | 2 | 2 | 3 | 2 | 2 | 6 | 210 |
5040 | 2 | 3 | 2 | 2 | 6 | 2 | 210 |
5040 | 2 | 4 | 1 | 2 | 24 | 1 | 105 |
5040 | 3 | 1 | 3 | 6 | 1 | 6 | 140 |
5040 | 3 | 2 | 2 | 6 | 2 | 2 | 210 |
5040 | 3 | 3 | 1 | 6 | 6 | 1 | 140 |
5040 | 4 | 1 | 2 | 24 | 1 | 2 | 105 |
5040 | 4 | 2 | 1 | 24 | 2 | 1 | 105 |
5040 | 5 | 1 | 1 | 120 | 1 | 1 | 42 |
? | ? | ? | ? | ? | ? | ? | 1806 |
With 32 positions mark, character string length is 7 character units, mark afterwards k be the probability of 1-7 as following table:
k | The number of permutations | Arrangement number percent | Number percent adds up | The k* number of permutations |
7 | 16963914240 | 0.493714884 | 0.493714884 | 1.18747E+11 |
6 | 13701623040 | 0.398769714 | 0.892484598 | 82209738240 |
5 | 3383116800 | 0.098461658 | 0.990946256 | 16915584000 |
4 | 302064000 | 0.008791219 | 0.999737475 | 1208256000 |
3 | 8957760 | 0.000260705 | 0.99999818 | 26873280 |
2 | 62496 | 1.81887E-06 | 0.999999999 | 124992 |
1 | 32 | 9.31323E-10 | 1 | 32 |
? | ? | ? | ? | 2.19108E+11 |
After mark, the weighted mean number of " 1 " is 6.376881392.
4. a logic algebra principle that mark is relevant and 0 mark
Some positions of the place value W of character string S are " 1 ", mean that proposition " character string S has in certain character unit or certain group character unit " is true exactly.For two character string S
aand S
b, W
acertain k position be 1, mean proposition " character string S exactly
ahave in certain k character unit or certain k group character unit " be true; If W
ball corresponding certain k positions are 1, mean proposition " character string S exactly
bhave in certain k character unit or certain k group character unit " be also true.Nature, W
bother certain m-k position may be had to be 1, and W
athese certain m-k positions are 0.This relation logic algebra formula is just expressed as: W
a→ W
b=W
t, legibility directly perceived.But and the programming language of not all and Database Systems have " position is contained " operational symbol, so need to convert by logic algebra principle in application.
From Boolean calculation theorem
known, if a → b=1, then
same, the place value for n position also has: if W
a→ W
b=W
t, then
Also can obtain:
W
a|W
b=W
b
W
a&W
b=W
a
From logical operation theorem
It is known,
W
a→ W
b=W
t, then
So available one each " position " is the data W of 1
tmark place value, if character string comprises a certain character unit, then corresponding position is labeled as 0, this mode is called 0 mark.As:
The place value of big
be 10111101011111111111111111111111
The place value of bigger
be 10110101011111111011111111111111
Obviously
Also can obtain:
More directly perceived with truth table:
W a | W b | W a→W b=W T | W a&W b=W a | W a|W b=W b | ? | ? |
0 | 0 | 1 | 0 | 0 | ? | ? |
0 | 1 | 1 | 0 | 1 | ? | ? |
1 | 0 | 0 | 0 | 1 | ? | ? |
1 | 1 | 1 | 1 | 1 | ? | ? |
Described above is, how to use modal bit arithmetic symbol to carry out place value and compare, the principle of screening probability communicates.Certainly, during 0 mark, m, k all refer to the number of 0.But the bit arithmetic that each programming language and database provide symbol quantity differs, the concrete operations of other bit arithmetic symbol, can release by logic algebra principle.Use logic algebra principle, carry out equivalence transformation, show that formula should for the purpose of brief being suitable for, instead of become more complicated and more complicated certainly.
More in addition, if programming language or Database Systems provide the function of putting 1 or setting to 0 of " position ", then position " mark " can directly use.If Database Systems programming language does not provide position 1, then "or" (or) computing of available position, carries out 1 mark.
First character unit is divided into groups, if with 32 position marks, then to n-th group of assignment 2
32-n.From scale-of-two, then this value from left to right n-th be 1, and all the other positions are 0, are referred to as " basic place value "." the basic place value " of character string all characters unit is carried out "or" (or) computing, the place value of this character string can be obtained.Certainly, also can to n-th group of " basic place value " assignment 2
n-1, from scale-of-two, be then 1 from n-th of right-to-left, and all the other positions are 0.In fact, in certain database, mark record character string and search key, character unit and position have fixing corresponding, no matter and be from right-to-left, still from left to right, or other order.
If carry out 0 mark, then first give n-th group of character unit, composing with n-th is 0 and all the other are " the basic place value " of 1, then " the basic place value " of all for specific character string character units is carried out " with " (and) computing, the place value of this character string can be obtained
As for mark overlapping possibility computing method, 0 mark and 1 marks and communicates.
5. multidigit mark and packet marking
1. multidigit mark
On the basis that list " position " (bit) marks, the following describes many " position " (bit) mark.If the figure place n of storage place value is more, and the length of character string is few, the corresponding one group of character unit of combination of an available j position marks, then have
the combination of individual position, correspondingly divides character unit to be
group, if S character element P belongs to r group, is then labeled as 1 by j the position forming the combination of r position in W.When j is 1 and above-mentioned list " position " (bit) mark.
If n is 32, j is 2, then character unit is divided into
group, is 496 groups, by one token, if character unit belongs to the 1st group, then 1,2 two of W " position " is labeled as " 1 "; If character unit belongs to the 2nd group, then 1,3 two of W " position " is labeled as 1, if character unit belongs to the 3rd group, then 1,4 two of W " position " is labeled as 1 ...
Suppose the place value the 2nd of Chinese character " front " word, 5 two positions are 1, and the place value the 23rd of " scape " word, 29 two positions are 1, then word " prospect " has 2, 5, 23, 29 4 positions are 1, establish again the place value the 2nd of " far ", 23 two positions are 1, the place value the 5th of " greatly ", 29 two positions are 1, then word " long-range " is also 2, 5, 23, 29 4 positions are 1, retrieve by the place value of " prospect ", there will be in result " long-range ", vice versa, namely there will be the character string containing " uncorrelated group of character unit " in result for retrieval, but the Chinese character contained by each group comparatively single " position " is labeled as few, so screening effect can be improved.Adopt the object of multidigit mark to be improve screening effect, its screening probability calculation is identical with unit markings principle.Calculating character string length is 4 character units below, and search key length is 2 character units, screening probability when j is 1,2,3,4, and its value is more low better:
J is 2, does not consider overlap, then after string token, m is 4*2=8 individual 1, and after search key mark, k is 2*2=4 individual 1.
J is 3, and namely character unit is divided into
group, does not consider overlap by namely 4960 groups, then after string token, m is 4*3=12 individual 1, and after search key mark, k is 2*3=6 individual 1.
J is 4, and namely character unit is divided into
group, does not consider overlap by namely 35960 groups, then after string token, m is 4*4=16 individual 1, and after search key mark, k is 2*4=8 individual 1.
If when search key length is 1,2 character unit, multidigit mark is especially meaningful.But be a timing at storage place value n, j value is not the bigger the better, and as calculated display above, when n be 32, m is 4, j is 3 best results, continues to increase, can not improve screening effect, and the ratio regular meeting increase of overlap during mark, marking operation is also more complicated.If carry out multidigit mark to long character string, n is sufficiently large.
2. packet marking:
As 32 points of 17 and 15 liang of groups by a unsigned long integer, mark respectively, the long m of character string is 4, does not consider to mark overlapping problem, and after mark being 41, if keyword length is 2, is 21 after mark, then screen probability to be:
Obviously, also screening effect can be improved.Certainly, carrying out this optimization, is that the position n of mark will be several times as much as m equally.This optimal way, for doing accommodation during long character string, namely uses the character unit packet mode that two kinds different, marks respectively obtain two groups of W, W by two data '.During retrieval, first obtain the W of keyword with the wherein labeling method of a group
t, with the W of database
nvalue compares, and filters out R
1after.At R
1in, the W ' of keyword is obtained with the labeling method of second group
t, with the W ' of database
nvalue compares and obtains R
2, then at R
2middlely obtain net result collection R by common successive appraximation method
z.Compared with upper a kind of optimization method, be the memory space increasing place value equally, this optimal way need not read whole W '
n, with W '
tcompare.
0 mark, 1 mark can carry out multidigit mark, packet marking, and packet marking also can combine with multidigit mark, and different groups also can be 0 mark, 1 mark respectively.Certainly, be suitable for as good.0 mark, multidigit mark all can be used for inverse retrieval with packet marking.
The speed of bit mark character string retrieval faster than the speed of prime number replacing character retrieval, but when character unit kind is many, with a position correspondence character unit, is not easily implemented; As with multidigit mark, in result for retrieval, do not comprise " object group character unit " and the character string that only comprises " uncorrelated group of character unit " can mix to come in.With prime number replacing character unit, can avoid occurring this specific admixture phenomenon in result for retrieval, in application, first can make preliminary screening with bit mark character string retrieval, make quadratic search with prime number replacing character retrieval again, as needs, then obtain net result by common character string successive appraximation method.
6. the application of bit mark character string retrieval technique in phonetic entry and Chinese phonetic alphabet input:
Phonetically similar word, the homonym of Chinese are many, therefore need to screen suitable words by phonetic in Chinese speech input and Pinyin Input, and position labeling method " inverse retrieval " can be used for this aspect.
Do not consider tone, Chinese has more than 400 syllable, is not easy to the corresponding syllable in a position, the initial consonant of the corresponding Chinese phonetic alphabet in 64 positions of available 8 bytes and simple or compound vowel of a Chinese syllable.Be provided with Chinese character by words collocation and basic sentence patterns database, wherein have basic sentence patterns " he has graduated ", phonetic is " tabiyele ", place value W after mark
1be:
1000,0101,0000,0000,0000,0001,1011,0000,0000,0000,0000,0000,0000,0000,0000,0000
If speech conversion or Pinyin Input " tazaojiubiyele ", place value W after mark
tbe:
1000,0101,0001,0010,0000,0001,1011,0000,1000,0000,1000,0000,0000,0000,0000,0000
W
1→W
t=W
T
Then " he has graduated " is that " tazaojiubiyele " can the basic sentence patterns of reference, can obtain after process " he zaojiu has graduated ", wherein " graduation " is predicate, it is verb, and phonetic is that the word of " zaojiu " has " for a long time " " achievements " " jujube wine " in dictionary, if the other sides such as grammer, semanteme, word frequency can play booster action, then can convert to further " he has graduated for a long time ".Due to the diversity that sound coordinates, compare with place value in the PRELIMINARY RESULTS obtained, the sentence pattern of phonetic as " tebayile " and so on also there will be.Can on the basis of bit mark character string, do postsearch screening by the prime number replacing method of the corresponding syllable of a prime number, obtain closer to result.
If have " building " and the collocation " building " of measure word " " in words collocation and basic sentence patterns, there are again " building " and the collocation " high building " of often modifying its adjective " height ", can mark by their the initial and the final.If speech conversion or Pinyin Input " zhedongxiezilouhengao ", mark by the initial and the final equally, with the place value W of the collocation of database whole words and basic sentence patterns
ncompare with it, collocation such as " building " " high buildings " can be filtered out for referencial use, can obtain after process " zhe xiezi building hen is high ", xiezi or xiezilou wherein can obtain from dictionary " writing " or " office building ", if the other sides such as grammer, semanteme, word frequency can play booster action, " zhedongxiezilouhengao " can be converted to " this office building is very high ".Other collocation of Chinese and corresponding place value can be obtained similarly.Administrative division is as, " provinces and cities " " county of province "; Numeral is as, " seven or eight " " multifarious "; Measure word " weight " " first angle "; And even the combination of surname " Lee " " Liu ".If input " hubeishengxianningshi ", according to the method described above, can obtain in " xianning city of hubei province ".If input " zhanglonglihu ", can obtain " Lee long hu ".If input " qiwansanqian ", can obtain " 70,000 san thousand ".Equally, can on the basis of bit mark character string, do postsearch screening by the prime number replacing method of the corresponding syllable of a prime number, obtain closer to result.
In addition, the frequency of utilization due to mandarin initial, simple or compound vowel of a Chinese syllable is unbalanced, during unit markings, with initial consonant, the simple or compound vowel of a Chinese syllable of 64 of the data of 8 bytes " position " corresponding Chinese phonetic alphabet, with some syllable as li retrieve time, due to 1 with i be all high frequency initial consonant, simple or compound vowel of a Chinese syllable, screening effect may be bad, can consider with dibit mark, be namely that 32, j is for 2 with n, i.e. 496 groups of positions, 400 syllables of corresponding Chinese, and do to calculate to database, make it balanced as far as possible.Certainly, mark by 400 syllable dibits, in PRELIMINARY RESULTS, certainly there will be " non-object " sentence pattern and collocation, can retrieve on the basis of screening by mark in place, do postsearch screening by the prime number replacing method of the corresponding syllable of prime number, obtain closer to result.
Also can use technique in Chinese word segmenting, by the word in dictionary, with reference to collocation and sentence pattern, by phonetic mark place value, keyword (actual is sentence) also by phonetic mark place value, carries out an implication operation, obtains R
i, effectively can reduce seek scope; Can certainly by the word in dictionary, with reference to collocation and sentence pattern, by the packet marking place value of ISN, keyword is also by same packet marking place value, carry out an implication operation, effectively can reduce seek scope equally, then be aided with other method and complete decomposition to keyword (namely sentence).
Because liaison, voice rheological phenomena are many in English, clearly not clear like Chinese syllable, also can with bit mark character string retrieval technique screening with reference to vocabulary, phrase.If database has phrase in the morning, by wherein vowel, consonant i, n,
m,
n, i, η carry out mark and obtain place value.If after phonetic entry conversion, corresponding to the wherein
fuzzyyer, can reject, with other louder i, n, m,
n, i, η carry out mark and obtain place value, with compare with the place value of Database Reference vocabulary, phrase, phrase in the morning can be obtained, and other pronunciation containing i, n, m,
the phrase of n, i, η or word.On this basis, reduce the scope further by the method for prime number replacing syllable, and then based on context, assist with grammer, semanteme, word frequency, select in the morning.
Foregoing describe method and relevant probability calculation formula, the logic algebra principle of bit mark character string retrieval, and lay particular emphasis on the screening of the character string record in database.From programming angle, the data structure of database is diversified, and various data structure is also not only for database.But bit mark character string retrieval technique, as a kind of character string algorithm, can be used for the string searching of various data structure, as the character string fuzzy search in large array.As for the character string fuzzy search of little array, then needs may not be had, because the mark of character string also takes time and space.
Embodiment
The present invention has obtained good realization in Chinese character string database and the storehouse fuzzy search of English character string data, here is to the Chinese character string database in SQL SERVER2000, carry out the vb6.0 code marked with " 1 ", the bit mark character string fuzzy search of other programming language, database can refer to enforcement.
1. building database
If database shuku has table biao, wherein have field shuming, data type is nvarchar, and length is 40.Separately set up field wei, data type is " long ", namely 4 bytes, has 32 positions, and wherein one " position " is sign bit, and all the other 31 positions may be used for mark.
Directly " position " is not set to the order of " 1 " or " 0 " in 2 vb6.0, therefore utilizes the inclusive-OR operation of position to make " mark " to database character string
dim?shuzu(30)As?Long
' the long array of definition 31 elements.
shuzu(0)=1
For?x=1?To?30
shuzu(x)=2*shuzu(x-1)
Next
' to 31 element assignment of long array, from 1,2,4,8,16 to 1073741824, from scale-of-two, have one to be 1, and all the other positions are 0, namely " basic place value ".
Dim?biaostr?As?String
' storage is when the character string of pre-treatment
Dim?weizhi?As?Long
' the place value of storage character string
Dim?weizhilin?As?Long
' the basic place value of a storage character
Dim?x?As?Integer
biaors.MoveFirst
' move to first record of database record set biaors
Do
weizhilin=0
weizhi=0
With?biaors
biaostr=.Fields(″shuming″)
End?With
' read in the character string of a record, invest string variable biaostr
For?x=1?To?Len(biaostr)
index=Abs(Asc?W(Mid(biaostr,x,1))Mod?31)
' from string variable biaostr, get a character, and by this character ISN, with 31 for mould does computing, then take absolute value, and invest index, namely character unit is divided into groups.
weizhilin=shuzu(index)
' array shuzu (index) value is invested weizhilin, is one of 1,2,4,8,16 to 1073741824.
weizhi=weizhi?Or?weizhilin
' " basic place value " the weizhilin value of a character and weizhi are done the inclusive-OR operation of position.
Next
' circulation terminates, and obtains " place value " weizhi of current string
With?biaors
.Fields(″wei″)=weizhi
End?With
biaors.Update
' " place value " weizhi is stored in the field wei of current record
biaors.MoveNext
' process next record
Loop?While?Not?biaors.EOF
3. utilize position " with " computing carries out the fuzzy search of database character string
Dim?shuzu(30)As?Long
shuzu(0)=1.
For?x=1?To?30
shuzu(x)=2*shuzu(x-1)
Next
' the long array of definition 31 elements, assignment, from 1,2,4,8,16 to 1073741824, completes consistent with the array of " mark ".
Dim?weizhi?As?Long
' the place value of a storage character string
Dim?weizhilin?As?Long
' the place value of a storage character
Dim?textstr?As?String
' storing retrievable character string
Dim?x?As?Integer
weizhilin=0
weizhi=0
textstr=Text1.Text
' Text1.Text is the search key in text box
For?x=1?To?Len(biaostr)
index=Abs(Asc?W(Mid(textstr,x,1))Mod?31)
weizhilin=shuzu(index)
weizhi=weizhi?Or?weizhilin
Next
' carry out mark to search key and obtain " place value " weizhi, method is consistent with database character string " mark " method.
strQuery=″select*from(SELECT*FROM?biao?WHERE(wei&″&weizhi&″)=″&weizhi&″)DERIVEDTBL?WHERE(shuming?like′%″&textstr&″%′)″
' in SQL SERVER 2000, there is no an implication operation symbol, here " place value " wei that " place value " weizhi of search key and database respectively record is done " with " (and) computing, filter out " with " (and) operation result equals the record (also can use (wei| " & weizhi & ")=wei) of " place value " weizhi of keyword, obtains PRELIMINARY RESULTS collection R
i, then make quadratic search in usual character string fuzzy search mode, obtain net result R
z.This is the query statement of SQL SERVER 2000, and other database may be slightly different.
Adodc?1.RecordSource=strQuery
Adodc?1.Refresh
' perform retrieval
DataList?1.ListField=″shuming″
DataList?l.ReFill
' in list box, show current result for retrieval.
Claims (6)
1. a character string fuzzy retrieval method, is characterized in that:
Mark with the data W there is n being the position (bit) of 0 the character metamessage forming character string, be a combination with wherein j position, then have
the combination of individual position, correspondingly divides alphabet unit to be
group, with the corresponding one group of character unit of the combination of a position, marks; If character string S character element P
1belong to r group, then j position of the combination of formation r the position of W is labeled as 1, similarly, according to other character element P of S
2, P
3, P
4affiliated group marks W, completes the W after alphabet meta-tag, records the character metamessage of S, is called the 1 mark place value of character string S;
With having the data that n is the position (bit) of 1
mark and form the character metamessage of character string, be a combination with wherein j position, then have
the combination of individual position, correspondingly divides alphabet unit to be
group, with the corresponding one group of character unit of the combination of a position, marks; If character string S character element P
1belong to r group, then will
j the position of combination of formation r position be labeled as 0, similarly, according to other character element P of S
2, P
3, P
4affiliated group is right
mark, complete after alphabet meta-tag
record the character metamessage of S, be called the 0 mark place value of character string S;
Note character string S
a1 mark place value be W
a, note S
a0 mark place value be
note character string S
b1 mark place value be W
b, note S
b0 mark place value
adopt below method for character string comparison:
I. with W
awith W
bcompare, if all W
abe the position of 1, W
bcorresponding position is also 1, then S
bcomprise and maybe may comprise S
aall characters unit;
II. with
with
compare, if all
be the position of 1,
corresponding position is also 1, then S
bcomprise and maybe may comprise S
aall characters unit;
III. with W
awith
compare, if
with W
aall corresponding positions are 1 time different, then S
bcomprise and maybe may comprise S
aall characters unit;
IV. with
with W
bcompare, if
with W
ball corresponding positions are 0 time different, then S
bcomprise and maybe may comprise S
aall characters unit.
2. in accordance with the method for claim 1, it is characterized in that: to W
a,
with W
b,
compare and realize with bit arithmetic symbol: note W
tfor all positions are the place value of 1, then method I can use W
a→ W
b=W
t, W
a| W
b=W
b, W
aaMP.AMp.Amp W
b=W
arealize.
3. in accordance with the method for claim 1, it is characterized in that: to W
a,
with W
b,
compare, if do not meet criterion, then S
bdo not comprise S
aall characters unit; If meet criterion, when adopting single position to mark to each group character unit and only having a character unit in each group character unit, then S
bcomprise S
aall characters unit, can not S be affirmed
bwhether comprise S
a; If meet criterion, when having multiple character unit in each group character unit when adopting single position to mark to each group character unit, or when adopting multiple position to mark to each group character unit, then S
bs may be comprised
aall characters unit, can not S be affirmed
bwhether comprise S
a; As required, then by character successive appraximation method S is judged
bwhether comprise S
a.
4. in accordance with the method for claim 1, it is characterized in that: character unit comprises Chinese radical, stroke; The letter of the Chinese phonetic alphabet, initial consonant, simple or compound vowel of a Chinese syllable, syllable; The letter of other Languages, syllable, word; The phonetic symbol of other Languages; Or their combination.
5. in accordance with the method for claim 1, it is characterized in that: character string S comprises one or more character string, when comprising a character string, its place value W and
mark a character string character metamessage, when comprising multiple character string, its place value W and
mark all character metamessages of multiple character string.
6. in accordance with the method for claim 1, it is characterized in that: search key S
tthere is 1 mark place value W
twith 0 mark place value
and character string S
nthere is 1 mark place value W
nwith 0 mark place value
by to W
t,
with W
n,
compare: can filter out to comprise and maybe may comprise search key S
tthe S of all character units
nif need, then judge S by character successive appraximation method
nwhether comprise S
t, also can filter out search key S
tcomprise the S that maybe may comprise its all character unit
nif needed, S can be judged by character successive appraximation method again
twhether comprise S
n.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200510057491.4A CN101488127B (en) | 2005-01-17 | 2005-09-13 | Bit mark character string fuzzy retrieval method for grouping character and labellng with bit |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2005100233835A CN1645374A (en) | 2005-01-17 | 2005-01-17 | Digit marking character string searching technology |
CN200510023383.5 | 2005-01-17 | ||
CN200510057491.4A CN101488127B (en) | 2005-01-17 | 2005-09-13 | Bit mark character string fuzzy retrieval method for grouping character and labellng with bit |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101488127A CN101488127A (en) | 2009-07-22 |
CN101488127B true CN101488127B (en) | 2015-01-07 |
Family
ID=34875846
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2005100233835A Pending CN1645374A (en) | 2005-01-17 | 2005-01-17 | Digit marking character string searching technology |
CN200510057491.4A Active CN101488127B (en) | 2005-01-17 | 2005-09-13 | Bit mark character string fuzzy retrieval method for grouping character and labellng with bit |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2005100233835A Pending CN1645374A (en) | 2005-01-17 | 2005-01-17 | Digit marking character string searching technology |
Country Status (2)
Country | Link |
---|---|
CN (2) | CN1645374A (en) |
WO (1) | WO2006074586A1 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4271227B2 (en) * | 2006-10-30 | 2009-06-03 | 株式会社エスグランツ | Bit string search device, search method and program |
CN101794283A (en) * | 2009-02-03 | 2010-08-04 | 华为技术有限公司 | Method and system for processing character strings and matcher |
CN102682033A (en) * | 2011-03-17 | 2012-09-19 | 环达电脑(上海)有限公司 | Method for querying words by matching binary characteristic values |
CN103870537B (en) * | 2013-12-03 | 2017-02-01 | 山东金质信息技术有限公司 | Intelligent word segmentation method for standard retrieval |
CN106933938A (en) * | 2015-12-30 | 2017-07-07 | 唯溥思株式会社 | The document retrieval method and literature index method encoded using multibyte |
CN115310424B (en) * | 2022-08-31 | 2024-12-06 | 快意电梯股份有限公司 | A method of expression processing based on VB6 |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1281191A (en) * | 1999-07-19 | 2001-01-24 | 松下电器产业株式会社 | Information retrieval method and information retrieval device |
JP2003152548A (en) * | 2001-11-14 | 2003-05-23 | Canon Inc | String search method in data compression |
US6785677B1 (en) * | 2001-05-02 | 2004-08-31 | Unisys Corporation | Method for execution of query to search strings of characters that match pattern with a target string utilizing bit vector |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2669601B2 (en) * | 1994-11-22 | 1997-10-29 | インターナショナル・ビジネス・マシーンズ・コーポレイション | Information retrieval method and system |
JP4298138B2 (en) * | 2000-06-21 | 2009-07-15 | 株式会社日立製作所 | Information retrieval method, apparatus for implementing the same, and recording medium recording the processing program |
-
2005
- 2005-01-17 CN CNA2005100233835A patent/CN1645374A/en active Pending
- 2005-09-13 CN CN200510057491.4A patent/CN101488127B/en active Active
- 2005-10-08 WO PCT/CN2005/001642 patent/WO2006074586A1/en not_active Application Discontinuation
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1281191A (en) * | 1999-07-19 | 2001-01-24 | 松下电器产业株式会社 | Information retrieval method and information retrieval device |
US6785677B1 (en) * | 2001-05-02 | 2004-08-31 | Unisys Corporation | Method for execution of query to search strings of characters that match pattern with a target string utilizing bit vector |
JP2003152548A (en) * | 2001-11-14 | 2003-05-23 | Canon Inc | String search method in data compression |
Also Published As
Publication number | Publication date |
---|---|
WO2006074586A1 (en) | 2006-07-20 |
CN1645374A (en) | 2005-07-27 |
CN101488127A (en) | 2009-07-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chang et al. | Automatic information extraction from semi-structured web pages by pattern discovery | |
CN102184169B (en) | Method, device and equipment used for determining similarity information among character string information | |
US5649023A (en) | Method and apparatus for indexing a plurality of handwritten objects | |
CN108829801A (en) | A Method for Extracting Event-triggered Words Based on Document-Level Attention Mechanism | |
CN100483417C (en) | Method for catching limit word information, optimizing output and input method system | |
CN102063508B (en) | Generalized suffix tree based fuzzy auto-completion method for Chinese search engine | |
CN113326267B (en) | Address matching method based on inverted index and neural network algorithm | |
CN105488077A (en) | Content tag generation method and apparatus | |
WO2007016133A2 (en) | Processor for fast contextual matching | |
CN102419778A (en) | Information searching method for mining and clustering sub-topics of query sentences | |
CN103823814A (en) | Information processing method and information processing device | |
CN111831820B (en) | Correlation analysis method between news and cases based on case element guidance and deep clustering | |
CN102929864B (en) | A kind of tone-character conversion method and device | |
CN101488127B (en) | Bit mark character string fuzzy retrieval method for grouping character and labellng with bit | |
CN102135956A (en) | Word position tagging-based Tibetan word segmentation method | |
CN103853792A (en) | Automatic image semantic annotation method and system | |
CN101944086A (en) | Whole word index dictionary | |
CN101470701A (en) | Text analyzer supporting semantic rule based on finite state machine and method thereof | |
CN116226323A (en) | Scoring function construction method and related device for semantic retrieval | |
CN102117285A (en) | Search method based on semantic indexing | |
Dang et al. | WordNet-based suffix tree clustering algorithm | |
CN102122296A (en) | Search result clustering method and device | |
CN101499056A (en) | Backward reference sentence pattern language analysis method | |
CN105426490A (en) | Tree structure based indexing method | |
Phan et al. | Automated data extraction from the web with conditional models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |