CN101488127B

CN101488127B - Bit mark character string fuzzy retrieval method for grouping character and labellng with bit

Info

Publication number: CN101488127B
Application number: CN200510057491.4A
Authority: CN
Inventors: 徐文新
Original assignee: Individual
Current assignee: Individual
Priority date: 2005-01-17
Filing date: 2005-09-13
Publication date: 2015-01-07
Anticipated expiration: 2025-09-13
Also published as: WO2006074586A1; CN1645374A; CN101488127A

Abstract

The invention discloses a character string retrieval fuzzy method for grouping the characters and labeling by bits. The method includes: dividing total character elements into n groups, and recording character P element information which forms a character string S by data W having n bits; if the character element P1 of the S belongs to j group, then setting jth bit of the W as 1; P2 belonging to k group, then setting kth bit of the W as 1 -, taking the W after all the character elements are labeled as a '' bit value'' of the S, comparing Wa of Sa with Wb of Sb, judging that the Sb does not contain or possibly contains all the character elements of Sa; and aiming at the possible Sb, judging whether Sa is contained by using a character comparison method. Based on the above method, according to logic algebra, multiple labeling and comparison bit value methods can be extrapolated. The method can improve retrieval speed by many times, but the character string is needed to be labeled firstly, so that it is suitable for the character string search with a fixed retrieval range.

Description

To character grouping and with the character string fuzzy retrieval method of position mark

Technical field

The present invention relates to a kind of character string fuzzy retrieval method, fast and effeciently determine the scope that object character string may exist, re-use existing character match method location object character string, be applicable to the string search field that range of search is relatively fixed.

Background technology

Common character string fuzzy search adopts successive appraximation mode to carry out, as judged whether comprise character f in character string S=" bdopfqew ", computing machine compares with first character b and the f of main string S, compare with second of S character d and f again, by that analogy, until the 5th of S the character is identical with f, the match is successful, and this is the simplest situation.If substring T length is more than 2 characters, simple pattern matching algorithm, i.e. BF algorithm are with S ₁with T ₁relatively, if difference, with S ₂with T ₁relatively, the like, until some character S of S _iwith T ₁identical, then by the S after them _i+1with T ₁₊₁compare, if also identical, then continue down to compare, as some character S of S _i+nwith T _1+ntime different, then return, then with S _i+1with T ₁compare as a new round, repeat above process, until the character in T is all than complete, then the match is successful, otherwise it fails to match.Along with the length of search key T increases, the corresponding increase of complexity of character match.Pattern matching algorithm after improvement, i.e. KMP algorithm, concerning the alphabetic writing of small size character set, avoid backtracking, but the Chinese character string large for character set, monocase frequency is low, essential meaning is little.In brief, BF algorithm and KMP algorithm are all carry out successive appraximation to the character of main string and substring.

Within 2004, I proposes " prime number replacing character string search technology ", and the method can improve the speed 2-3 of character string fuzzy search doubly, but implements the method for long character string, needs more space storage prime number product value.In order to improve the speed of character string fuzzy search, and the demand reduced memory space, the present invention proposes the composition information carrying out tab character string with n position (bit) of data, data after after mark claim " place value " of this character string, the place value of two character strings is compared, and in conjunction with common character successive appraximation method, realize the fuzzy search of character string.Test shows, speed is the several times of general character successive appraximation fuzzy search and even more than tens times.

Summary of the invention

The present invention is a kind of character string retrieval technique, and object improves the speed of character string fuzzy search.With position (bit) several characters units corresponding, with the corresponding alphabet unit in n position, namely divide alphabet unit to be n group, and with the n of data individual be 0 position, be designated as W _f, mark the character metamessage of composition character string.If several character string S character element P ₁belong to n-th group, be then correspondingly labeled as 1 by n-th of W, similarly, according to other character element P of S ₂, P ₃, P ₄affiliated group marks W, and complete the W after alphabet meta-tag, record the information of S, be called " place value " of S, this mode is called 1 mark.According to the principle of logic algebra, also can be the position of 1 with the n of data, be designated as mark the character metamessage of composition character string.If S character element P belongs to n-th group, then correspondingly by data n-th be labeled as 0, this mode is called 0 mark.By to S _a" place value " W _a, with S _b" place value " W _b, compare, can S be judged _b" do not comprise ", " comprising " or " may comprise " S _ball characters unit.Such as, to W _awith W _bcarry out an implication operation, if there is implication relation all positions, then S _bcomprise and maybe may comprise S _a" all character unit ".If need, then judge S by common character successive appraximation method _bwhether comprise " S _a".Be " position mark retrieval ", " bit mark character string retrieval " " bit mark character string retrieval technique " hereinafter referred to as the inventive method.

Test shows, position mark retrieval can significantly improve the fuzzy search speed of character string.Outside speed advantage, another feature of position mark retrieval is that multiple keyword query is equally convenient with single keyword query.Position mark both can be used for the retrieval of ordinary meaning, namely judge whether database character string comprises keyword, also can do " inverse retrieval ", judge whether keyword comprises database character string, can be used in phonetic entry, Pinyin Input and Chinese word segmenting, coupling basic sentence patterns or word.

As the expansion of basic skills, if can be used for the figure place n marked, be more than 2 times of the average length m of character string, can mark, to improve screening effeciency with the corresponding one group of character unit of the combination of several position (bit).

Bit mark character string retrieval technique, as a kind of character string algorithm, needs first to carry out position mark to the character string of range of search, so be applicable to the string searching field that range of search relatively fixes.

1. basic skills

During bit mark character string retrieval is implemented, can with the character of " position " or " combination of position " corresponding ordinary meaning, as: a, b, A, B, ∑ #, ∞, ま, :, in, state; Also the corresponding Chinese radical in position can be used as required, as: Rolling, Si, so that stroke, as Pie, Dian etc.; For alphabetic writing, as english character string " day and night ", the corresponding word in position can be used, as day, and, night, or with the corresponding monogram in position, as ay, ai, the initial consonant of the Chinese phonetic alphabet, simple or compound vowel of a Chinese syllable or syllable can be represented in Chinese phonetic alphabet input or phonetic entry, can phonetic symbol be represented for other Languages, as θ, si η.For convenience of explanation, claim the character unit corresponding to combination of position or position to be " character unit ", be designated as P.

The method of bit mark character string retrieval, easily realizes in alphabetic writing string search.If with one 32, each is the data 0000,0000,0000,0000,0000,0000,0000,0000 of 0, is designated as W _f, with W _fcorresponding 26 English alphabets in 1st to the 26th position of (also can be from right-to-left), may not consider for other 6 from left to right, also can corresponding punctuation mark.Word big comprises b, g, i, so 3 positions such as corresponding 2,7,9 are labeled as " 1 ", then data become 0100,0010,1000,0000,0000,0000,0000,0000, are called the place value of big, are designated as W _a, this mark mode is one token.

Equally, bigger comprises b, e, g, i, r, therefore by W _f5 positions such as 2,5,7,9,18 be labeled as " 1 ", then data become 0100,1010,1000,0000,0100,0000,0000,0000, are called the place value of bigger, are designated as W _b.

Can find out, all big are the position of 1, and bigger also must be 1; But bigger is the position of 1, big may be 1, also may be 0, as the 5th, 18 position; Bigger is the position of 0, and big must be 0, as 21 positions such as the 1st, 3,4,6,8,10.By W _awith W _bcarry out " position is contained " computing, result is that all positions are " 1 ": 1111,1111,1111,1111,1111,1111,1111,1111, is designated as W _t, if unsigned long integer, be exactly 4294967295.Can be formulated as: W _a→ W _b=W _t, character string bigger comprises big.

If the place value of biggest is designated as W _c, have equally: W _a→ W _c=W _t, character string biggest comprises big.

If the place value of BIG is designated as W _a, by the place value W of big _acarry out " position is contained " computing with it, same has: W _a→ W _a=W _t, character string BIG equals big.Here do not consider that alphabet size is write, only the place value of the character string that explanation two is identical carries out " position is contained " computing, and all positions are 1.

If the place value of digger is designated as W _dif, by the place value W of big _acarry out " position is contained " computing with it, result is 1011,1111,1111,1111,1111,1111,1111,1111, is not equal to W _t, that is: W _a→ W _d≠ W _t, character string digger does not comprise big.

That is, by carrying out " position is contained " computing, if result is not equal to W to the place value W of two character strings _t, then consequent corresponding character string " does not comprise " (" no more than " and " being not equal to ", lower same) character string corresponding to preceding paragraph.

Scheme does not above consider the capital and small letter of letter, if case sensitive, then need 52 positions, this also can realize.But the corresponding position of each character unit, always unfeasible.As, GBK scope has 21000 Chinese characters, with the corresponding Chinese character in a position, the data of storage place value can take a lot of space, wherein having considerable is insignificant blank, and read data and bit comparison operation use time can increase meaninglessly, and in enforcement, as required can with one " position " corresponding multiple Chinese character.An easy scheme is by 21000 of GBK scope Chinese characters and other symbol, is divided into 32 groups, stores the place value of character string with long data according to coding, is provided with character string " Gu Yanzhi long river, desert setting sun circle ", then:

Chinese character	Greatly	Unconcerned	Lonely	Cigarette	Directly	Long	River	Fall	Day	Circle
											Coding	46323	50350	47554	53708	54961	45988	47827	49892	51413	54450
Group	20	15	3	13	18	5	20	5	22	19
											Chinese character	Frost	Cold	Long	River	?	Fall	Day	My god	Margin	?
Coding	52138	49380	45988	47827	?	49892	51413	52460	53700	?
											Group	11	5	5	20	?	5	22	13	5	?

Coding in table obtains with function code in excel, and wherein " greatly " " river " coexists 20 groups, and " length " " falls " to coexisting 5 groups, so be only 1 by 8 corresponding positions, then and the place value W of " Gu Yanzhi long river, the desert setting sun is justified " _hfor:

0010，1000，0000，1010，0111，0100，0000，0000

The W of " the long river setting sun " can be obtained equally _i" place value " is: 0000,1000,0000,0000,0001,0100,0000,0000, with W _iwith W _hmake position " position is contained " computing: W _i→ W _h=W _t, there is relation of inclusion in two character strings.

In " white cold long river ", " frost " is the 11st group, and " cold " and " length " are the 5th group, then " place value " W _jfor: 0000,1000,0010,0000,0001,0000,0000,0000, with W _jwith W _hmake position " position is contained " computing: W _j→ W _h≠ W _t, and also there is not relation of inclusion in two character strings.That is, with one " position " corresponding multiple Chinese character, obtain place value, " position is contained " computing is carried out, if result is not equal to W to the place value of two character strings _t, then consequent corresponding character string " does not comprise " " the alphabet unit " of the character string corresponding to preceding paragraph, also just must not comprise this " character string ".For alphabetic writing, in like manner with the corresponding multiple word in a position, can judge that a character string " does not comprise ", the character of " may comprise " another character string unit.

But, " position is contained " computing is carried out, if result equals W to the place value of two character strings _t, consequent corresponding character string is also " may comprise " character string corresponding to preceding paragraph.

If the place value of gibber is designated as W _e, by the place value W of big _acarry out " position is contained " computing with it, have equally:

W _a→W _e＝W _T

But character string gibber does not comprise big.

" place value " W of " setting sun ends of the earth " _kfor: 0000,1000,0000,1000,0000,0100,0000,0000, with and W _hdo position " position is contained " computing:

W _k→W _h＝W _T

But " Gu Yanzhi long river, desert setting sun circle " does not comprise " setting sun ends of the earth ".

That is, bit mark character string retrieval is distinguishing with general character string successive appraximation method.Even if the place value presence bit implication relation of two character strings, also not necessarily there is relation of inclusion in two character strings.If with " position " correspondence " a character unit ", the place value presence bit implication relation of two character strings, then the character string of consequent correspondence " comprises " the alphabet unit of character string corresponding to preceding paragraph.But owing to not considering the arrangement of character unit, can not affirm whether two character strings exist relation of inclusion.If mark is with " position " correspondence " one group of character unit ", multiple character unit is had among one group of character unit, then the character string " likely " of consequent correspondence comprises the alphabet unit of character string corresponding to preceding paragraph, also " likely " do not comprise character string corresponding to preceding paragraph alphabet unit, more can not affirm whether two character strings exist relation of inclusion.But as required, can judge whether two character strings exist relation of inclusion by character successive appraximation method again.

Bit mark character string is retrieved, and carry out the preceding paragraph of bit arithmetic and consequent place value, can be all the label information of the character unit of a character string, also can be the label information of all character units of multiple character string, is referred to as the place value of " several character strings " S." several character strings " is the appositive concept of S, and in other words, S refers to one or more character string.

If carry out bit mark character string retrieval to database, character string record S ₁, S ₂, S ₃... place value be respectively W ₁, W ₂, W ₃...Search key S _tplace value be designated as W _t, with W _twith W _ncarry out an implication operation, result is W _t, then character string S _ncomprise and maybe may comprise S _tall characters unit.By qualified S _nthe record set formed, is designated as R _i.Retrieving the time used is referred to as T _i.Usual needs are at R _imiddle character string successive appraximation method obtains net result collection R _z, retrieving the time used is referred to as T _z.

Therefore the used time that bit mark character string search method is total is T _i+ T _zif wish to make T _ilow, concerning the processors of 32, the data of preferably storing place value are exactly 32.For certain character string fuzzy search, R _zbe certain, reduce T _zmethod be make PRELIMINARY RESULTS collection R as far as possible _iminimum.The figure place n of storage place value is fewer, and the character unit of a position correspondence is more, and screening effect is poorer, R _ialso can be larger; Character string average length is longer, R _ialso can be larger.Affect R _isize also has another factor: search key is longer, R _ialso less.

Bidding remembers that figure place used is n, and " 1 " of character string place value is m, and " 1 " of search key place value is k, then the screening probability of bit comparison can calculate with following formula, and its value is the smaller the better:

P_{Ii} = \frac{C_{m}^{k}}{C_{n}^{k}} = \frac{\frac{m!}{k! (m - k)!}}{\frac{n!}{k! (n - k)!}} = \frac{m! (n - k)!}{n! (m - k)!}

N is that the screening probability calculation of 32, m and k part value is as follows:

m	k＝1	k＝2	k＝3	k＝4
					24	0.75	0.556451613	0.408065	0.295495
22	0.6875	0.465725806	0.310484	0.20342
					20	0.625	0.383064516	0.229839	0.134733
18	0.5625	0.308467742	0.164516	0.085095
					16	0.5	0.241935484	0.112903	0.050612
14	0.4375	0.183467742	0.073387	0.027836
					12	0.375	0.133064516	0.044355	0.013765
10	0.3125	0.090725806	0.024194	0.00584
					8	0.25	0.056451613	0.01129	0.001947

In three factors of impact screening probability, during user search, length keywords used determines the size of k, record string length determines the size of m, but length keywords used is uncontrollable when record string length, retrieval, controllable factor marks figure place n used, and the number m of " 1 ", the size of k after the certain string token of length, also affect by n, so n and m keeps enough ratio very important.

Some suggestion:

The cpu of 1.32, marks with lint-long integer, is beneficial to place value most and compares, if signless integer, then W is 32 positions, if there is sign bit integer, then only has 31 positions to be convenient to mark.If character string average length is greater than 16 character units, the database that record number is many, in SQL SEVER 2000,63 positions of available data types bigint mark, and correspondingly, character unit are divided into 63 groups.Certainly, for 32 bit processors, store place value with bigint, read and carry out bit comparison and must use the more time.In fact, any data type of being convenient to carry out bit arithmetic in any database all can be used for tab character string, if naturally better by the data type without sign bit.For 64 cpu, 64 positions certainly should be adopted to carry out tab character string, to make full use of the performance of cpu, improve the dispersion of " place value ".

If 2. cpu is 32, the database that between character string, length differs greatly, can be divided into two tables by record, short character string list 32 positions mark, long string table 64 marks, is joined together by two table Query Results in inquiry with union order.

3. the grouping of character unit take frequency equilibrium as the best, also will take into account the speed of mark place value.Carry out modular arithmetic by Hanzi internal code, remainder number divides into groups, and easily realize, but be not optimal group, and modular arithmetic division arithmetic speed is slow.Can consider that Chinese character is divided into 32 groups by certain 5 position of getting Hanzi internal code, speed can be faster, and effect may be better.

Outside the retrieval of ordinary meaning, can carry out database character string " inverse retrieval ", namely with database character string S by position labeling method _nw _nvalue is preceding paragraph, with search key S _tw _tvalue is a consequent implication operation that carries out, and the position whether all according to result is " 1 ", i.e. W _n→ W _t=W _t, can filter out those composition character units in database may be search key S _tthe S comprised _n.

Nature, the calculating of the screening probability of inverse retrieval will be conversely.If database and keyword all use one token, marking figure place used is n, S after the mark of position _nplace value W _n" 1 " be m, " 1 " after search key mark is k, then search key S _tcomprise the probability that maybe may comprise its all character unit can calculate with following formula:

P_{Ii} = \frac{C_{k}^{m}}{C_{n}^{m}} = \frac{\frac{k!}{m! (k - m)!}}{\frac{n!}{m! (n - m)!}} = \frac{k! (n - m)!}{n! (k - m)!}

2. test analysis

With said method to one based on Chinese character, have the English and digital database of part to test: to record several 267, more than 000 bar, character amount 3,473, more than 000, character string average length 12.989, with lint-long integer mark place value, programming language is VB, and operating system is Window xp, CPU Celeron 800Hz, internal memory 256M, Jijia 810 mainboard, hard disk 40G.The retrieval overall used time, be designated as T.Obviously, the mark retrieval of any once position, all must read the place value of all records, compare with it with the place value of keyword, the time used can think a constant, is called T ₀, but not easily directly record.The relatively rear qualified record set of place value, is designated as R ₁, to R ₁carry out successive appraximation retrieval used time T ₁also not easily directly record, but and R ₁size is relevant, R ₁can obtain.Another factor of impact retrieval used time is net result collection R ₂size.According to T, R of 120 test gained ₁, R ₂, regretional analysis obtains following equation:

\hat{T} = 0.268 + 0.000,008,625 R_{1} + 0.000,0270,2 R_{2}

Coefficient of determination after adjustment

The significance test F=5265.814 of regression model

Be greater than F _0.01(2,117)=4.791

The significance test t of constant ₀=86.610

R ₁significance test t ₁=74.263

R ₂significance test t ₂=25.489

All be greater than t _0.01(117)=2.6185

Visible, regression equation confidence level is very high.Constant wherein, namely reads the place value of all records, compares with it the time used, i.e. T with the place value of keyword ₀, be 0.268 second.120 2.1739 seconds average used times of character string successive appraximation method of simultaneously carrying out.The time T that visible place value is more used ₀be only 1/8th of the usual successive appraximation method used time.But the overall used time of bit mark character string retrieval goes back and R ₁, R ₂relevant, n=31 in this test, character string average length 12.989, have overlapping phenomenon after mark, the mean number m of " 1 ", about 11 to 12, calculates by 12 here, if the character unit distribution of database character string is normal above, " 1 " distributing equilibrium after mark, can calculate R by probability ₁, net result collection R ₂have nothing to do with screening effect, be appointed as 240 here, then ideally k is the retrieval used time of 1 to 12 after keyword tag, can be as follows by regression equation calculation:

k	R ₁	R ₂	T
				1	103354.8	240	1.16592
2	37896.77	240	0.601344
				3	13067.85	240	0.387195
4	4200.381	240	0.310713

5	1244.557	240	0.285219
				6	335.0732	240	0.277375
7	80.41756	240	0.275178
				8	16.75366	240	0.274629
9	2.91368	240	0.27451
				10	0.39732	240	0.274488
11	0.03784	240	0.274485
				12	0.001892	240	0.274485

Visible, compared with usual character string retrieving method average 2.1739 seconds used times, bit mark character string retrieval has very large performance advantage.But the packet conditions of character unit during distribution situation, the mark of the data type used especially size of n, database character unit, all affect the used time that bit mark character string is retrieved, so this regression equation only has reference significance when different hardware and mark.

Only have 26 letters in English, if character string is very long, every bar record is containing each letter, then mark retrieval in position can lose screening effect.But record several 242,000 to one, character amount 7,493, the Database in English data of 000 are tested, show that bit mark character string retrieval is also feasible to alphabetic writing usually.This database character string data type is varchar, and field long 56, character string on average grows 30.846, marks 26 letters, case insensitive, ignore space and other character with 26 bit of 1 lint-long integer.Because the frequency of utilization of letter English letter is unbalanced, after mark, statistics only has 3213052 " 1 ", the average of each record " 1 " is 13.2266, illustrates that being marked with a large amount of overlaps occurs on the one hand, illustrates that not every bar record comprises each letter on the other hand.

Carry out 120 character successive appraximation retrievals to this database, the average used time is 4.573 seconds, and carry out position mark retrieval, according to T, R1, R2 of 120 times, regretional analysis obtains following equation simultaneously:

\hat{T} = 0.265 + 0.000,019,36 R_{1} + 0.000,0362, 3 R_{2}

Coefficient of determination after adjustment

The significance test F=55405.88 of regression model

Be greater than F _0.01(2,117)=4.791

The significance test t of constant ₀=36.400

R ₁significance test t ₁=121.733

R ₂significance test t ₂=77.409

All be greater than t _0.01(117)=2.6185

Be the alphabetical distribution statistics table of this database below:

Letter	Containing the record number of this letter	Number percent	Letter	Containing the record number of this letter	Number percent
						t	227167	0.935136	h	114263	0.470365
r	222338	0.915257	u	109376	0.450248
						e	221713	0.912685	g	97339	0.400697
a	213249	0.877842	f	83452	0.343531
						c	210116	0.864945	W	71228	0.293211
n	204225	0.840695	b	67065	0.276074
						o	201082	0.827757	y	60698	0.249864
i	195734	0.805742	v	59351	0.244319
						s	195639	0.805351	k	47379	0.195036
l	169753	0.698791	x	15150	0.062365
						d	159202	0.655357	q	7258	0.029878
m	125449	0.516413	j	6351	0.026144
						p	122168	0.502906	z	6307	0.025963

From upper table, in this database, t, r, e tri-letter frequency are the highest, take tree as search key, and actual retrieval obtains the record 7 containing tree, and R ₁be 188491,3.655 seconds used times, the successive appraximation simultaneously carried out 4.465 seconds used times of retrieval.As shown the alphabetical frequency of t, r, e tri-by upper, can be calculated:

R1＝242，924*0.935136*0.915257*0.912685＝242，924*0.781158＝189762

By the regression equation calculation retrieval used time be again:

\begin{matrix} \hat{T} = 0.265 + 0.000,019,36 * 189762 + 0.000,0362,3 * 7 \\ = 3.939 \end{matrix}

Visible, even the complete like this word be made up of high frequency letter of tree, position mark retrieval still has performance advantage.In fact, in English, most of word is more than 4 letters, if wherein there is a low frequency letter, the screening effect of position mark just very well, is equivalent to " short-board effect ".Enumerate 30 test datas for reference:

3. the overlapping possibility of a mark

In alphabetic writing, number of letters is little, so the frequency that letter repeats is high, as the words such as bigger, biggest have two g, but only the 7th " position " is labeled as 1.For Chinese character string, the probability that in a character string, same Chinese character repeats is low, but Chinese character is divided into groups, several Chinese characters in a character string may belong to same group, as in grouping above, " greatly " " river " coexists 20 groups, and " length " " falls " to coexisting 5 groups, is overlapped in one " position " after mark.Even if grouping realizes character unit frequency equilibrium, also there is the problem of string token overlap.If the figure place being used for marking is n, the number of character unit is m.The probability wherein do not overlapped completely is:

P = \frac{A_{n}^{m}}{n^{m}} = \frac{n (n - 1) \cdot \cdot \cdot (n - m + 1)}{n^{m}}

Obviously, time n is certain, m often increases by 1, and the new new-added item of molecule just successively decreases 1, and the new new-added item of denominator is constant, and the probability do not overlapped completely is more and more lower.If mark with 32 positions, the long m of character string is 7, the probability do not overlapped completely:

P_{(7)} = \frac{A_{32}^{7}}{32^{7}} = \frac{32 (32 - 1) \cdot \cdot \cdot (32 - 7 + 1)}{32^{7}} = \frac{32 * 31 * 30 * 29 * 28 * 27 * 26}{32 * 32 * 32 * 32 * 32 * 32 * 32} = \frac{16,963,914,240}{34,359,738,368} = 0.4937

Following table marks with 32 positions, when the long m of character string is 1 to 24, and the probability do not overlapped completely:

The overlapping loss just meaning information, overlap is to a certain degree inevitable, is also acceptable, but the ratio of overlap is too high, can affect the performance of position mark retrieval, so the probability overlapped also merits attention.With n position mark, character string is long is m character unit, and the overlapping probability calculation formula for k " 1 " is:

P = \frac{C_{n}^{k} \cdot A}{n^{m}}

Wherein

A = \underset{m_{1} + m_{2} + \cdot \cdot \cdot + m_{k} = m}{Σ} \frac{m!}{m_{1}! m_{2}! \cdot \cdot \cdot m_{k}!}

Here summation gets allly to meet m ₁+ m ₂+ ... + m _kthe positive integer solutions of=m, its group number is

If with 32 position marks, character string length is 7 character units, and overlap is 3 positions, i.e. m ₁+ m ₂+ m ₃=7, have 15 groups, probability calculation is as follows:

7！	m ₁	m ₂	m ₃	m ₁！	m ₂！	m ₃！	7！/(m ₁！m ₂！m ₃！)
								5040	1	1	5	1	1	120	42
5040	1	2	4	1	2	24	105
								5040	1	3	3	1	6	6	140
5040	1	4	2	1	24	2	105
								5040	1	5	1	1	120	1	42
5040	2	1	4	2	1	24	105
								5040	2	2	3	2	2	6	210
5040	2	3	2	2	6	2	210
								5040	2	4	1	2	24	1	105
5040	3	1	3	6	1	6	140
								5040	3	2	2	6	2	2	210
5040	3	3	1	6	6	1	140
								5040	4	1	2	24	1	2	105
5040	4	2	1	24	2	1	105
								5040	5	1	1	120	1	1	42
?	?	?	?	?	?	?	1806

P = \frac{C_{32}^{3} \cdot A}{32^{7}} = \frac{4960 * 1806}{34,359,738,368} = 0.000261

With 32 positions mark, character string length is 7 character units, mark afterwards k be the probability of 1-7 as following table:

k	The number of permutations	Arrangement number percent	Number percent adds up	The k* number of permutations
					7	16963914240	0.493714884	0.493714884	1.18747E+11
6	13701623040	0.398769714	0.892484598	82209738240
					5	3383116800	0.098461658	0.990946256	16915584000
4	302064000	0.008791219	0.999737475	1208256000
					3	8957760	0.000260705	0.99999818	26873280
2	62496	1.81887E-06	0.999999999	124992
					1	32	9.31323E-10	1	32
?	?	?	?	2.19108E+11

After mark, the weighted mean number of " 1 " is 6.376881392.

4. a logic algebra principle that mark is relevant and 0 mark

Some positions of the place value W of character string S are " 1 ", mean that proposition " character string S has in certain character unit or certain group character unit " is true exactly.For two character string S _aand S _b, W _acertain k position be 1, mean proposition " character string S exactly _ahave in certain k character unit or certain k group character unit " be true; If W _ball corresponding certain k positions are 1, mean proposition " character string S exactly _bhave in certain k character unit or certain k group character unit " be also true.Nature, W _bother certain m-k position may be had to be 1, and W _athese certain m-k positions are 0.This relation logic algebra formula is just expressed as: W _a→ W _b=W _t, legibility directly perceived.But and the programming language of not all and Database Systems have " position is contained " operational symbol, so need to convert by logic algebra principle in application.

From Boolean calculation theorem known, if a → b=1, then same, the place value for n position also has: if W _a→ W _b=W _t, then

{\overset{&OverBar;}{W}}_{a} | W_{b} = W_{T} .

Also can obtain:

W _a|W _b＝W _b

W _a&W _b＝W _a

From logical operation theorem

W_{a} &RightArrow; W_{b} &DoubleLeftRightArrow; {\overset{&OverBar;}{W}}_{b} &RightArrow; {\overset{&OverBar;}{W}}_{a}

It is known,

W _a→ W _b=W _t, then

So available one each " position " is the data W of 1 _tmark place value, if character string comprises a certain character unit, then corresponding position is labeled as 0, this mode is called 0 mark.As:

The place value of big be 10111101011111111111111111111111

The place value of bigger be 10110101011111111011111111111111

Obviously

{\overset{&OverBar;}{W}}_{b} &RightArrow; {\overset{&OverBar;}{W}}_{a} = W_{T} .

Also can obtain:

{\overset{&OverBar;}{W}}_{b} | {\overset{&OverBar;}{W}}_{a} = {\overset{&OverBar;}{W}}_{a}

More directly perceived with truth table:

W _a	W _b	W _a→W _b＝W _T	W _a&W _b＝W _a	W _a\|W _b＝W _b	?	?
							0	0	1	0	0	?	?
0	1	1	0	1	?	?
							1	0	0	0	1	?	?
1	1	1	1	1	?	?

Described above is, how to use modal bit arithmetic symbol to carry out place value and compare, the principle of screening probability communicates.Certainly, during 0 mark, m, k all refer to the number of 0.But the bit arithmetic that each programming language and database provide symbol quantity differs, the concrete operations of other bit arithmetic symbol, can release by logic algebra principle.Use logic algebra principle, carry out equivalence transformation, show that formula should for the purpose of brief being suitable for, instead of become more complicated and more complicated certainly.

More in addition, if programming language or Database Systems provide the function of putting 1 or setting to 0 of " position ", then position " mark " can directly use.If Database Systems programming language does not provide position 1, then "or" (or) computing of available position, carries out 1 mark.

First character unit is divided into groups, if with 32 position marks, then to n-th group of assignment 2 ^32-n.From scale-of-two, then this value from left to right n-th be 1, and all the other positions are 0, are referred to as " basic place value "." the basic place value " of character string all characters unit is carried out "or" (or) computing, the place value of this character string can be obtained.Certainly, also can to n-th group of " basic place value " assignment 2 ^n-1, from scale-of-two, be then 1 from n-th of right-to-left, and all the other positions are 0.In fact, in certain database, mark record character string and search key, character unit and position have fixing corresponding, no matter and be from right-to-left, still from left to right, or other order.

If carry out 0 mark, then first give n-th group of character unit, composing with n-th is 0 and all the other are " the basic place value " of 1, then " the basic place value " of all for specific character string character units is carried out " with " (and) computing, the place value of this character string can be obtained

As for mark overlapping possibility computing method, 0 mark and 1 marks and communicates.

5. multidigit mark and packet marking

1. multidigit mark

On the basis that list " position " (bit) marks, the following describes many " position " (bit) mark.If the figure place n of storage place value is more, and the length of character string is few, the corresponding one group of character unit of combination of an available j position marks, then have the combination of individual position, correspondingly divides character unit to be group, if S character element P belongs to r group, is then labeled as 1 by j the position forming the combination of r position in W.When j is 1 and above-mentioned list " position " (bit) mark.

If n is 32, j is 2, then character unit is divided into group, is 496 groups, by one token, if character unit belongs to the 1st group, then 1,2 two of W " position " is labeled as " 1 "; If character unit belongs to the 2nd group, then 1,3 two of W " position " is labeled as 1, if character unit belongs to the 3rd group, then 1,4 two of W " position " is labeled as 1 ...

Suppose the place value the 2nd of Chinese character " front " word, 5 two positions are 1, and the place value the 23rd of " scape " word, 29 two positions are 1, then word " prospect " has 2, 5, 23, 29 4 positions are 1, establish again the place value the 2nd of " far ", 23 two positions are 1, the place value the 5th of " greatly ", 29 two positions are 1, then word " long-range " is also 2, 5, 23, 29 4 positions are 1, retrieve by the place value of " prospect ", there will be in result " long-range ", vice versa, namely there will be the character string containing " uncorrelated group of character unit " in result for retrieval, but the Chinese character contained by each group comparatively single " position " is labeled as few, so screening effect can be improved.Adopt the object of multidigit mark to be improve screening effect, its screening probability calculation is identical with unit markings principle.Calculating character string length is 4 character units below, and search key length is 2 character units, screening probability when j is 1,2,3,4, and its value is more low better:

J is 2, does not consider overlap, then after string token, m is 4*2=8 individual 1, and after search key mark, k is 2*2=4 individual 1.

J is 3, and namely character unit is divided into group, does not consider overlap by namely 4960 groups, then after string token, m is 4*3=12 individual 1, and after search key mark, k is 2*3=6 individual 1.

J is 4, and namely character unit is divided into group, does not consider overlap by namely 35960 groups, then after string token, m is 4*4=16 individual 1, and after search key mark, k is 2*4=8 individual 1.

P_{1} = \frac{C_{m}^{k}}{C_{n}^{k}} = \frac{\frac{m!}{k! (m - k)!}}{\frac{n!}{k! (n - k)!}} = \frac{m! (n - k)!}{n! (m - k)!} = \frac{4! (32 - 2)!}{32! (4 - 2)!} = \frac{4 * 3}{32 * 31} = 0.012097

P_{2} = \frac{\frac{8!}{4! (8 - 4)!}}{\frac{32!}{4! (32 - 4)!}} = \frac{\frac{8!}{4!}}{\frac{32!}{28!}} = \frac{8 * 7 * 6 * 5 *}{32 * 31 * 30 * 29} = 0.00194661

P_{3} = \frac{\frac{12!}{6! (12 - 6)!}}{\frac{32!}{6! (32 - 6)!}} = \frac{\frac{12!}{6!}}{\frac{32!}{26!}} = \frac{12 * 11 * 10 * 9 * 8 * 7 *}{32 * 31 * 30 * 29 * 28 * 27} = 0.00101965

P_{4} = \frac{\frac{16!}{8! (16 - 8)!}}{\frac{32!}{8! (32 - 8)!}} = \frac{\frac{16!}{8!}}{\frac{32!}{24!}} = \frac{16 * 15 * 14 * 13 * 12 * 11 * 10 * 9}{32 * 31 * 30 * 29 * 28 * 27 * 26 * 25} = 0.00122358

If when search key length is 1,2 character unit, multidigit mark is especially meaningful.But be a timing at storage place value n, j value is not the bigger the better, and as calculated display above, when n be 32, m is 4, j is 3 best results, continues to increase, can not improve screening effect, and the ratio regular meeting increase of overlap during mark, marking operation is also more complicated.If carry out multidigit mark to long character string, n is sufficiently large.

2. packet marking:

As 32 points of 17 and 15 liang of groups by a unsigned long integer, mark respectively, the long m of character string is 4, does not consider to mark overlapping problem, and after mark being 41, if keyword length is 2, is 21 after mark, then screen probability to be:

P = \frac{4! (17 - 2)!}{17! (4 - 2)!} \times \frac{4! (15 - 2)!}{15! (4 - 2)!} = \frac{4 * 3}{17 * 16} \times \frac{4 * 3}{15 * 14} = \frac{12}{272} \times \frac{12}{210} = 0.002521

Obviously, also screening effect can be improved.Certainly, carrying out this optimization, is that the position n of mark will be several times as much as m equally.This optimal way, for doing accommodation during long character string, namely uses the character unit packet mode that two kinds different, marks respectively obtain two groups of W, W by two data '.During retrieval, first obtain the W of keyword with the wherein labeling method of a group _t, with the W of database _nvalue compares, and filters out R ₁after.At R ₁in, the W ' of keyword is obtained with the labeling method of second group _t, with the W ' of database _nvalue compares and obtains R ₂, then at R ₂middlely obtain net result collection R by common successive appraximation method _z.Compared with upper a kind of optimization method, be the memory space increasing place value equally, this optimal way need not read whole W ' _n, with W ' _tcompare.

0 mark, 1 mark can carry out multidigit mark, packet marking, and packet marking also can combine with multidigit mark, and different groups also can be 0 mark, 1 mark respectively.Certainly, be suitable for as good.0 mark, multidigit mark all can be used for inverse retrieval with packet marking.

The speed of bit mark character string retrieval faster than the speed of prime number replacing character retrieval, but when character unit kind is many, with a position correspondence character unit, is not easily implemented; As with multidigit mark, in result for retrieval, do not comprise " object group character unit " and the character string that only comprises " uncorrelated group of character unit " can mix to come in.With prime number replacing character unit, can avoid occurring this specific admixture phenomenon in result for retrieval, in application, first can make preliminary screening with bit mark character string retrieval, make quadratic search with prime number replacing character retrieval again, as needs, then obtain net result by common character string successive appraximation method.

6. the application of bit mark character string retrieval technique in phonetic entry and Chinese phonetic alphabet input:

Phonetically similar word, the homonym of Chinese are many, therefore need to screen suitable words by phonetic in Chinese speech input and Pinyin Input, and position labeling method " inverse retrieval " can be used for this aspect.

Do not consider tone, Chinese has more than 400 syllable, is not easy to the corresponding syllable in a position, the initial consonant of the corresponding Chinese phonetic alphabet in 64 positions of available 8 bytes and simple or compound vowel of a Chinese syllable.Be provided with Chinese character by words collocation and basic sentence patterns database, wherein have basic sentence patterns " he has graduated ", phonetic is " tabiyele ", place value W after mark ₁be:

1000，0101，0000，0000，0000，0001，1011，0000，0000，0000，0000，0000，0000，0000，0000，0000

If speech conversion or Pinyin Input " tazaojiubiyele ", place value W after mark _tbe:

1000，0101，0001，0010，0000，0001，1011，0000，1000，0000，1000，0000，0000，0000，0000，0000

W ₁→W _t＝W _T

Then " he has graduated " is that " tazaojiubiyele " can the basic sentence patterns of reference, can obtain after process " he zaojiu has graduated ", wherein " graduation " is predicate, it is verb, and phonetic is that the word of " zaojiu " has " for a long time " " achievements " " jujube wine " in dictionary, if the other sides such as grammer, semanteme, word frequency can play booster action, then can convert to further " he has graduated for a long time ".Due to the diversity that sound coordinates, compare with place value in the PRELIMINARY RESULTS obtained, the sentence pattern of phonetic as " tebayile " and so on also there will be.Can on the basis of bit mark character string, do postsearch screening by the prime number replacing method of the corresponding syllable of a prime number, obtain closer to result.

If have " building " and the collocation " building " of measure word " " in words collocation and basic sentence patterns, there are again " building " and the collocation " high building " of often modifying its adjective " height ", can mark by their the initial and the final.If speech conversion or Pinyin Input " zhedongxiezilouhengao ", mark by the initial and the final equally, with the place value W of the collocation of database whole words and basic sentence patterns _ncompare with it, collocation such as " building " " high buildings " can be filtered out for referencial use, can obtain after process " zhe xiezi building hen is high ", xiezi or xiezilou wherein can obtain from dictionary " writing " or " office building ", if the other sides such as grammer, semanteme, word frequency can play booster action, " zhedongxiezilouhengao " can be converted to " this office building is very high ".Other collocation of Chinese and corresponding place value can be obtained similarly.Administrative division is as, " provinces and cities " " county of province "; Numeral is as, " seven or eight " " multifarious "; Measure word " weight " " first angle "; And even the combination of surname " Lee " " Liu ".If input " hubeishengxianningshi ", according to the method described above, can obtain in " xianning city of hubei province ".If input " zhanglonglihu ", can obtain " Lee long hu ".If input " qiwansanqian ", can obtain " 70,000 san thousand ".Equally, can on the basis of bit mark character string, do postsearch screening by the prime number replacing method of the corresponding syllable of a prime number, obtain closer to result.

In addition, the frequency of utilization due to mandarin initial, simple or compound vowel of a Chinese syllable is unbalanced, during unit markings, with initial consonant, the simple or compound vowel of a Chinese syllable of 64 of the data of 8 bytes " position " corresponding Chinese phonetic alphabet, with some syllable as li retrieve time, due to 1 with i be all high frequency initial consonant, simple or compound vowel of a Chinese syllable, screening effect may be bad, can consider with dibit mark, be namely that 32, j is for 2 with n, i.e. 496 groups of positions, 400 syllables of corresponding Chinese, and do to calculate to database, make it balanced as far as possible.Certainly, mark by 400 syllable dibits, in PRELIMINARY RESULTS, certainly there will be " non-object " sentence pattern and collocation, can retrieve on the basis of screening by mark in place, do postsearch screening by the prime number replacing method of the corresponding syllable of prime number, obtain closer to result.

Also can use technique in Chinese word segmenting, by the word in dictionary, with reference to collocation and sentence pattern, by phonetic mark place value, keyword (actual is sentence) also by phonetic mark place value, carries out an implication operation, obtains R _i, effectively can reduce seek scope; Can certainly by the word in dictionary, with reference to collocation and sentence pattern, by the packet marking place value of ISN, keyword is also by same packet marking place value, carry out an implication operation, effectively can reduce seek scope equally, then be aided with other method and complete decomposition to keyword (namely sentence).

Because liaison, voice rheological phenomena are many in English, clearly not clear like Chinese syllable, also can with bit mark character string retrieval technique screening with reference to vocabulary, phrase.If database has phrase in the morning, by wherein vowel, consonant i, n, m, n, i, η carry out mark and obtain place value.If after phonetic entry conversion, corresponding to the wherein fuzzyyer, can reject, with other louder i, n, m, n, i, η carry out mark and obtain place value, with compare with the place value of Database Reference vocabulary, phrase, phrase in the morning can be obtained, and other pronunciation containing i, n, m, the phrase of n, i, η or word.On this basis, reduce the scope further by the method for prime number replacing syllable, and then based on context, assist with grammer, semanteme, word frequency, select in the morning.

Foregoing describe method and relevant probability calculation formula, the logic algebra principle of bit mark character string retrieval, and lay particular emphasis on the screening of the character string record in database.From programming angle, the data structure of database is diversified, and various data structure is also not only for database.But bit mark character string retrieval technique, as a kind of character string algorithm, can be used for the string searching of various data structure, as the character string fuzzy search in large array.As for the character string fuzzy search of little array, then needs may not be had, because the mark of character string also takes time and space.

Embodiment

The present invention has obtained good realization in Chinese character string database and the storehouse fuzzy search of English character string data, here is to the Chinese character string database in SQL SERVER2000, carry out the vb6.0 code marked with " 1 ", the bit mark character string fuzzy search of other programming language, database can refer to enforcement.

1. building database

If database shuku has table biao, wherein have field shuming, data type is nvarchar, and length is 40.Separately set up field wei, data type is " long ", namely 4 bytes, has 32 positions, and wherein one " position " is sign bit, and all the other 31 positions may be used for mark.

Directly " position " is not set to the order of " 1 " or " 0 " in 2 vb6.0, therefore utilizes the inclusive-OR operation of position to make " mark " to database character string

dim?shuzu(30)As?Long

' the long array of definition 31 elements.

shuzu(0)＝1

For?x＝1?To?30

shuzu(x)＝2*shuzu(x-1)

Dim?biaostr?As?String

' storage is when the character string of pre-treatment

Dim?weizhi?As?Long

' the place value of storage character string

Dim?weizhilin?As?Long

' the basic place value of a storage character

Dim?x?As?Integer

biaors.MoveFirst

' move to first record of database record set biaors

Do

weizhilin＝0

weizhi＝0

With?biaors

biaostr＝.Fields(″shuming″)

End?With

' read in the character string of a record, invest string variable biaostr

For?x＝1?To?Len(biaostr)

index＝Abs(Asc?W(Mid(biaostr，x，1))Mod?31)

' from string variable biaostr, get a character, and by this character ISN, with 31 for mould does computing, then take absolute value, and invest index, namely character unit is divided into groups.

weizhilin＝shuzu(index)

' array shuzu (index) value is invested weizhilin, is one of 1,2,4,8,16 to 1073741824.

weizhi＝weizhi?Or?weizhilin

' " basic place value " the weizhilin value of a character and weizhi are done the inclusive-OR operation of position.

With?biaors

.Fields(″wei″)＝weizhi

End?With

biaors.Update

' " place value " weizhi is stored in the field wei of current record

biaors.MoveNext

' process next record

Loop?While?Not?biaors.EOF

3. utilize position " with " computing carries out the fuzzy search of database character string

Dim?shuzu(30)As?Long

shuzu(0)＝1.

For?x＝1?To?30

shuzu(x)＝2*shuzu(x-1)

Dim?weizhi?As?Long

' the place value of a storage character string

Dim?weizhilin?As?Long

' the place value of a storage character

Dim?textstr?As?String

' storing retrievable character string

Dim?x?As?Integer

weizhilin＝0

weizhi＝0

textstr＝Text1.Text

' Text1.Text is the search key in text box

For?x＝1?To?Len(biaostr)

index＝Abs(Asc?W(Mid(textstr，x，1))Mod?31)

weizhilin＝shuzu(index)

weizhi＝weizhi?Or?weizhilin

strQuery＝″select*from(SELECT*FROM?biao?WHERE(wei&″&weizhi&″)＝″&weizhi&″)DERIVEDTBL?WHERE(shuming?like′％″&textstr&″％′)″

' in SQL SERVER 2000, there is no an implication operation symbol, here " place value " wei that " place value " weizhi of search key and database respectively record is done " with " (and) computing, filter out " with " (and) operation result equals the record (also can use (wei| " & weizhi & ")=wei) of " place value " weizhi of keyword, obtains PRELIMINARY RESULTS collection R _i, then make quadratic search in usual character string fuzzy search mode, obtain net result R _z.This is the query statement of SQL SERVER 2000, and other database may be slightly different.

Adodc?1.RecordSource＝strQuery

Adodc?1.Refresh

' perform retrieval

DataList?1.ListField＝″shuming″

DataList?l.ReFill

' in list box, show current result for retrieval.

Claims

1. a character string fuzzy retrieval method, is characterized in that:

Mark with the data W there is n being the position (bit) of 0 the character metamessage forming character string, be a combination with wherein j position, then have the combination of individual position, correspondingly divides alphabet unit to be group, with the corresponding one group of character unit of the combination of a position, marks; If character string S character element P ₁belong to r group, then j position of the combination of formation r the position of W is labeled as 1, similarly, according to other character element P of S ₂, P ₃, P ₄affiliated group marks W, completes the W after alphabet meta-tag, records the character metamessage of S, is called the 1 mark place value of character string S;

With having the data that n is the position (bit) of 1 mark and form the character metamessage of character string, be a combination with wherein j position, then have the combination of individual position, correspondingly divides alphabet unit to be group, with the corresponding one group of character unit of the combination of a position, marks; If character string S character element P ₁belong to r group, then will j the position of combination of formation r position be labeled as 0, similarly, according to other character element P of S ₂, P ₃, P ₄affiliated group is right mark, complete after alphabet meta-tag record the character metamessage of S, be called the 0 mark place value of character string S;

Note character string S _a1 mark place value be W _a, note S _a0 mark place value be note character string S _b1 mark place value be W _b, note S _b0 mark place value adopt below method for character string comparison:

I. with W _awith W _bcompare, if all W _abe the position of 1, W _bcorresponding position is also 1, then S _bcomprise and maybe may comprise S _aall characters unit;

II. with with compare, if all be the position of 1, corresponding position is also 1, then S _bcomprise and maybe may comprise S _aall characters unit;

III. with W _awith compare, if with W _aall corresponding positions are 1 time different, then S _bcomprise and maybe may comprise S _aall characters unit;

IV. with with W _bcompare, if with W _ball corresponding positions are 0 time different, then S _bcomprise and maybe may comprise S _aall characters unit.

2. in accordance with the method for claim 1, it is characterized in that: to W _a, with W _b, compare and realize with bit arithmetic symbol: note W _tfor all positions are the place value of 1, then method I can use W _a→ W _b=W _t, W _a| W _b=W _b, W _aaMP.AMp.Amp W _b=W _arealize.

3. in accordance with the method for claim 1, it is characterized in that: to W _a, with W _b, compare, if do not meet criterion, then S _bdo not comprise S _aall characters unit; If meet criterion, when adopting single position to mark to each group character unit and only having a character unit in each group character unit, then S _bcomprise S _aall characters unit, can not S be affirmed _bwhether comprise S _a; If meet criterion, when having multiple character unit in each group character unit when adopting single position to mark to each group character unit, or when adopting multiple position to mark to each group character unit, then S _bs may be comprised _aall characters unit, can not S be affirmed _bwhether comprise S _a; As required, then by character successive appraximation method S is judged _bwhether comprise S _a.

4. in accordance with the method for claim 1, it is characterized in that: character unit comprises Chinese radical, stroke; The letter of the Chinese phonetic alphabet, initial consonant, simple or compound vowel of a Chinese syllable, syllable; The letter of other Languages, syllable, word; The phonetic symbol of other Languages; Or their combination.

5. in accordance with the method for claim 1, it is characterized in that: character string S comprises one or more character string, when comprising a character string, its place value W and mark a character string character metamessage, when comprising multiple character string, its place value W and mark all character metamessages of multiple character string.

6. in accordance with the method for claim 1, it is characterized in that: search key S _tthere is 1 mark place value W _twith 0 mark place value and character string S _nthere is 1 mark place value W _nwith 0 mark place value by to W _t, with W _n, compare: can filter out to comprise and maybe may comprise search key S _tthe S of all character units _nif need, then judge S by character successive appraximation method _nwhether comprise S _t, also can filter out search key S _tcomprise the S that maybe may comprise its all character unit _nif needed, S can be judged by character successive appraximation method again _twhether comprise S _n.