
CN101140583A - A method and device for text retrieval - Google Patents

A method and device for text retrieval

Info

Publication number
CN101140583A
CN101140583A (application CNA2007101238322A / CN200710123832A)
Authority
CN
China
Prior art keywords
dimension
data
module
mapping
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007101238322A
Other languages
Chinese (zh)
Other versions
CN100530192C (en)
Inventor
胡辛遥
韩定一
俞勇
金洪波
吕晓雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Shanghai Jiao Tong University
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CNB2007101238322A priority Critical patent/CN100530192C/en
Publication of CN101140583A publication Critical patent/CN101140583A/en
Application granted granted Critical
Publication of CN100530192C publication Critical patent/CN100530192C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text retrieval method and a related device. The method comprises: inputting original text data; performing adaptive-mapping dimensionality reduction on the original text data; retrieving text data similar to the data after the adaptive-mapping dimensionality reduction; and outputting the retrieved text data. Embodiments of the invention can compress the original data effectively after reducing the dimensionality of high-dimensional text data, have a relatively low time complexity, are suited to massive data sets, and effectively preserve the similarity between the original text-data vectors. Using the method or device of the embodiments, requests such as network text query, search and retrieval can be answered quickly at a relatively low computational cost, solving the problems of existing network text retrieval: high time complexity, large space consumption and relatively low accuracy.

Description

A method and apparatus for text retrieval
Technical field
The present invention relates to text retrieval technology, and in particular to a method and apparatus for performing text retrieval over a network.
Background technology
More than 10^18 bytes of new data are added to the World Wide Web every year, and the volume keeps growing exponentially. Existing search engines cannot keep up with such a growth scale, which calls for a new architecture able to rapidly index and query content such as HTML, plain text, music and images. Peer-to-peer networks, meanwhile, have gained wide acceptance in recent years; their inherent scalability, fault tolerance and adaptivity have raised interest in building low-cost search engines on top of them.
Although search techniques based on peer-to-peer networks have been proposed recently, most of them rely on simple keyword matching and do not use the more advanced ranking algorithms of the information-retrieval field. Without ranking, a query for a popular keyword returns far more documents than a user can process, which makes such systems unusable.
Another fundamental problem of these existing peer-to-peer systems is that documents are randomly distributed: when a user issues a query, the system must search a large number of nodes. If heuristic rules are used to narrow the search scope, some documents relevant to the query are lost. Semantic overlays were proposed to address this problem: on a semantic overlay, content is organised according to its semantics, and the distance between two pieces of content is proportional to their semantic similarity.
A CAN (Content-Aware Network) is a structured peer-to-peer network. Conceptually a CAN is an n-dimensional Cartesian space: every point in the space can be written as (x1, ..., xn). Each peer (which can be understood as a real PC) manages a zone, i.e. the region (x1_low <= x1 <= x1_high, ..., xn_low <= xn <= xn_high). In a 2-dimensional space each machine manages a rectangle, in a 3-dimensional space a cube, and in an n-dimensional space a hypercube.
CAN routing approaches the target one dimension at a time. For example, to route (2,3,4) to (3,4,5): the first hop (2,3,4) -> (3,3,4) satisfies the first dimension, the second hop (3,3,4) -> (3,4,4) satisfies the second, and the final hop reaches (3,4,5).
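The per-dimension routing just described can be sketched as follows; this is an illustrative model rather than the patent's implementation, assuming integer coordinates and one unit step per hop:

```python
# Hedged sketch of CAN-style greedy routing (names are illustrative):
# fix one coordinate at a time until it matches the target.
def can_route(src, dst):
    """Return the sequence of hops taken when routing from src to dst."""
    cur = list(src)
    path = [tuple(cur)]
    for i in range(len(cur)):
        while cur[i] != dst[i]:
            cur[i] += 1 if dst[i] > cur[i] else -1
            path.append(tuple(cur))
    return path

print(can_route((2, 3, 4), (3, 4, 5)))
# -> [(2, 3, 4), (3, 3, 4), (3, 4, 4), (3, 4, 5)]
```

The printed path matches the (2,3,4) -> (3,4,5) example in the text.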
Indexing into the CAN space: suppose peer1 needs to index d = (d1, ..., dn) into the CAN space. It first locates the zone containing d, then stores the relation (peer1, d) on peer2, the node owning that zone. Only the relation is stored; no actual document is copied. Objects stored in a CAN are n-dimensional vectors, which closely resembles how queries and documents are represented in information retrieval, so semantic overlays generally use a CAN to store objects. Storing content directly as vectors in the CAN space is a very natural idea, but it introduces a series of difficult problems:
(1) In information retrieval, documents and queries are represented by high-dimensional vectors, typically tens of thousands of dimensions, while the CAN space usually has only tens of dimensions; this dimensional mismatch means documents cannot be indexed into the CAN directly;
(2) Because of the curse of dimensionality, similarity computed in a high-dimensional space is inaccurate, which makes querying in such a space difficult.
Similarity search is widely used in systems such as image retrieval, web search and data compression. In these applications, real objects are abstracted as vectors in a geometric space, and one or more similarity functions are defined to model human judgments of how similar two objects are. For example, the angle between two vectors, or their Jaccard coefficient, may be taken as the similarity of the underlying objects. To compute similarity as accurately as possible, the vectors usually need to live in a high-dimensional space. If the dimension of that space is n, a naive method needs n units of storage per vector and roughly O(n) time per pairwise similarity computation. For example, in a 600 MB collection of plain English text, the dimension of the text-vector space can reach several hundred thousand; storing the vectors then requires a very large amount of space, and computing similarities between them is equally expensive. To reduce storage and simplify similarity computation, a simple and effective dimensionality-reduction method is therefore needed to map vectors from the high-dimensional space to a low-dimensional one.
However, the existing data structures used for similarity search, such as the B+ tree, the kd-tree, the vp-tree and CAN, all require objects to be represented as low-dimensional vectors. Applying high-dimensional vectors to these structures directly greatly reduces the effectiveness of the algorithms. Reducing high-dimensional vectors to a low-dimensional space while preserving the original similarities to some extent is therefore valuable for many applications.
The existing Latent Semantic Indexing (LSI) technique reduces the dimensionality of the matrix by merging several terms into one concept, converting relations between documents and terms into relations between documents and concepts. For example, the three original terms {car, truck, flower} might be converted by LSI into the two concepts {(1.3452*car + 0.2828*truck), flower}. LSI uses Singular Value Decomposition (SVD) to reduce the matrix. The time complexity of this algorithm is high, so the adaptability and scalability of LSI are poor.
The existing Random Mapping technique works as follows: given an original matrix M[t*x] (t objects, each of dimension x), generate a random matrix R[x*y] and multiply the original matrix by it, obtaining M[t*y] as the reduced result. Random mapping has the following drawbacks:
(1) the random matrix R[x*y] must be maintained, and this matrix is usually large because the original dimension x is high;
(2) storing this matrix costs considerable extra space.
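For concreteness, the random-mapping baseline can be sketched as below; the sizes are toy values chosen for illustration (real x is tens of thousands), and the point is only that the random matrix R must be kept alongside the data:

```python
import random

# Toy sketch of random mapping: reduce M[t*x] to M[t*y] via a random R[x*y].
random.seed(0)
t, x, y = 4, 100, 5
M = [[random.random() for _ in range(x)] for _ in range(t)]     # original data M[t*x]
R = [[random.gauss(0, 1) for _ in range(y)] for _ in range(x)]  # random matrix R[x*y]

# Reduced result M[t*y]: a plain matrix product of M and R.
reduced = [[sum(M[i][k] * R[k][j] for k in range(x)) for j in range(y)]
           for i in range(t)]

print(len(reduced), len(reduced[0]))   # 4 5
# The drawback noted above: R itself (x*y extra values) must be stored.
print(x * y)                           # 500
```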
Summary of the invention
In view of this, embodiments of the invention provide a network text retrieval method and device that overcome the defects of existing network text retrieval: high time complexity, large space consumption and relatively poor accuracy.
Embodiments of the invention are realised through the following technical solutions:
An embodiment of the invention provides a text retrieval method, comprising:
inputting original text data;
performing adaptive-mapping dimensionality reduction on the original text data;
retrieving, according to the data after the adaptive-mapping dimensionality reduction, text data similar to it.
An embodiment of the invention also provides a text retrieval device comprising an input module, a dimensionality-reduction module and a retrieval module, wherein:
the input module is used to input the original text data;
the dimensionality-reduction module is used to perform adaptive-mapping dimensionality reduction on the original text data;
the retrieval module is used to retrieve, according to the data processed by the dimensionality-reduction module, text data similar to it.
As can be seen from the technical schemes provided above, embodiments of the invention can compress the original data effectively after reducing the dimensionality of high-dimensional text data; the time complexity is low, massive data sets are supported, and the similarity between the original text-data vectors is preserved effectively. Using the method or device of the embodiments, requests such as network text query, search and retrieval can be answered quickly at a relatively low computational cost, solving the problems of existing network text retrieval: high time complexity, large space consumption and relatively poor precision.
Embodiments of the invention can also, according to user demand, adjust parameters to control the degree of similarity preservation and the computation speed, letting the user strike a balance between retrieval precision and computational cost. Moreover, implementing the invention requires no domain-specific knowledge to be introduced.
Description of drawings
Fig. 1 is a flowchart of the adaptive mapping algorithm of an embodiment of the invention;
Fig. 2 is a more detailed flowchart of the adaptive mapping algorithm of an embodiment of the invention;
Fig. 3-a is the program flowchart of the adaptive mapping algorithm of the invention;
Fig. 3-b is a schematic diagram of an iteration of the adaptive mapping algorithm of the invention;
Fig. 4 is a schematic diagram of the indexing process of an embodiment of the invention;
Fig. 5 is a functional module diagram of the text retrieval device of an embodiment of the invention;
Fig. 6 is a schematic structural diagram of the dimensionality-reduction module in the text retrieval device of an embodiment of the invention.
Embodiment
To make the purpose, technical solution and advantages of the invention clearer, the technical scheme proposed by the embodiments is elaborated below in conjunction with the accompanying drawings.
The text retrieval method of the embodiments is based on the "adaptive mapping" algorithm. With reference to Fig. 1, the "adaptive mapping" algorithm comprises the following steps:
S102: input the raw data; the raw data is in vector form, and its dimension may be denoted M;
S104: determine the dimension of the target mapping space, i.e. the intended dimension of the reduced data, which may be denoted N;
S106: determine the dimensionality-reduction mapping relation according to the dimension of the raw data and the determined dimension of the target mapping space;
S108: determine, according to the mapping relation determined in S106, the correspondence between each dimension of the raw data and each dimension of the target mapping space.
Further, the method may also comprise:
S110: calculate, from the raw data, the value of each dimension of the target mapping space. Many calculation methods can be used, such as iterative accumulation.
With reference to Fig. 2, in more detail the "adaptive mapping" algorithm comprises the following steps:
S202: input the raw data; the raw data is in vector form, and its dimension may be denoted M;
S204: determine the dimension of the target mapping space, i.e. the intended dimension of the reduced data, which may be denoted N;
S206: select a hash function according to the dimension of the raw data and the determined dimension of the target mapping space; the choice of hash function can depend on the dimension of the raw data input in S202 and the intended dimension of the reduced data;
S208: determine, according to the hash function determined in S206, the correspondence between each dimension of the raw data and each dimension of the target mapping space.
Further, the method may also comprise:
S210: calculate, from the raw data, the value of each dimension of the target mapping space. Many calculation methods can be used, such as iterative accumulation.
" self-adaptation mapping " algorithm implementing procedure more specifically is described below for example:
The input of algorithm is divided into two parts:
(1) input vector input, i.e. raw data;
(2) input hash function; The hash function can be mod, MD5, SHA1 etc.;
Wherein the dimension of input vector is m, and algorithm is output as the output vector, its dimension be n (n<<m).
Shown in Fig. 3-a and Fig. 3-b, program circuit is as follows:
(1) initialization output vector, the value of its each dimension is initialized as 0;
(2) initialization index i;
(3) the input vector is circulated,, calculate its cryptographic hash hash (i) its every dimension i;
(4) calculate output[hash (i)] value, can adopt multiple computing method, for example: iteration adds up, that is: output[hash (i)]=output[hash (i)]+input[i], perhaps can be: output[hash (i)]=(output[hash (i)] 2+ input[i] 2) 1/2.
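The four-step program flow above can be sketched as a short function; the function and variable names are my own, the hash function is supplied by the caller as in the text, and plain accumulation is used as the calculation method:

```python
# Minimal sketch of the adaptive mapping program flow.
def adaptive_map(inp, n, hash_fn):
    out = [0.0] * n                      # (1) initialise the output vector to 0
    for i, value in enumerate(inp):      # (2)-(3) loop over dimensions i, hashing each
        out[hash_fn(i)] += value         # (4) output[hash(i)] = output[hash(i)] + input[i]
    return out

# With f = {0.1, 0.3, 0.2, 0.4, 0.5}, k = 2 and h(i) = i mod 2 this
# reproduces the worked mod example in the text, v = {0.8, 0.7}.
v = adaptive_map([0.1, 0.3, 0.2, 0.4, 0.5], 2, lambda i: i % 2)
print([round(x, 6) for x in v])  # [0.8, 0.7]
```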
An example in which the hash function is mod is given below:
The raw data f is a 5-dimensional vector, written f = {0.1, 0.3, 0.2, 0.4, 0.5}. The dimension of the target mapping space is k = 2, and f corresponds to the vector v = {v(0), v(1)} of the target mapping space. The consistent hash function is taken as h(i) = i mod k = i mod 2. The flow of the adaptive mapping algorithm is as follows:
(1) initialise v(0) = v(1) = 0;
(2) compute the values by "accumulation":
h(0)=0, v(0)=v(0)+0.1=0.1;
h(1)=1, v(1)=v(1)+0.3=0.3;
h(2)=0, v(0)=v(0)+0.2=0.3;
h(3)=1, v(1)=v(1)+0.4=0.7;
h(4)=0, v(0)=v(0)+0.5=0.8.
(3) return the final vector v = {0.8, 0.7}.
Another example follows, using MD5 as the hash function. As shown in Table 1, the original space dimension in column 1 is mapped by MD5 to the hash value (hexadecimal) in column 2; column 3 is the decimal representation of column 2, and column 4 is column 3 modulo 5.
Original space dimension  MD5 hash value (hexadecimal)  Hash value as integer (decimal)  Hash value mod 5
1    a0b923820dcc509a    11581326958244155546    1
2    9d4c2f636f067f89    11334486466295660425    0
3    4b5ce2fe28308fd9    5430464831925817305     0
4    a2f3e71d9181a67b    11741982767666275963    3
5    bbce2345d7772b06    13532792713169545990    0
6    5a880faf6fb5e608    6523481306414048776     1
7    ceea167a5a36dedd    14909754231118814941    1
8    fb98ab9159f51fd0    18129428940747775952    2
9    2e2d7fbdea1afc51    3327456153349848145     0
10   02a44259755d38e6    190350036244969702      2
11   d9caa6e02c990b0a    15693539333277027082    2
12   6fe97759aa27a0c9    8064107834774102217     2
13   c124a10e0db5e4b9    8064107834774102217     2
14   22bcc25a6f606eb5    2503089186582589109     4
15   f062936a96d3c8bd    17321569202826627261    1
16   1eae257e44aa9d5b    2210745691333631323     3
17   c9b086079795c442    14533263364690658370    0
18   5568161a8cdf4ad2    6154193194090187474     4
19   99908345f7439f8f    11065488620973694863    3
20   210194c475687be6    2378345649732615142     2
Table 1
As the table above shows, the hash mapping result is:
hash(2, 3, 5, 9, 17) = 0,
hash(1, 6, 7, 15) = 1,
hash(8, 10, 11, 12, 13, 20) = 2,
hash(4, 16, 19) = 3,
hash(14, 18) = 4.
Although the mapping is not perfectly even, "even" here is meant in the statistical sense: as long as the dimension of the original space is high enough, the mapping becomes even.
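A sketch of this MD5 bucketing is below. Which 64 bits of the digest Table 1 keeps is not stated, so the per-dimension values here are not guaranteed to match the table; the mapping is still deterministic and, for a high input dimension, statistically even:

```python
import hashlib

# Hedged sketch of the MD5 variant: hash the dimension index, read the
# digest as an integer, and reduce modulo k. How the table truncates the
# digest is an assumption left open here.
def md5_bucket(i, k):
    digest = hashlib.md5(str(i).encode()).hexdigest()
    return int(digest, 16) % k

buckets = [md5_bucket(i, 5) for i in range(1, 21)]
print(buckets)   # twenty values, each in 0..4
```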
Adaptive mapping works better on high-dimensional sparse vectors; the following paragraph supplements why such vectors arise.
High-dimensional sparse vectors mainly arise from the mapping of text into a high-dimensional space. The detailed process is as follows: first count the terms in the document collection; then map each document to a vector in which each component corresponds to a certain term of the collection. The value of each component is computed as follows: if the document contains the term corresponding to that component, the component is non-zero (it can be computed by methods such as tf-idf); otherwise the component is 0. Take the trec-6 collection as an example: trec-6 has 556k documents and 742k terms in total, with an average document containing 200-300 terms. Under the mapping above, each document becomes a 742k-dimensional vector in which only 200-300 components are non-zero, the rest being 0. This mapping process therefore introduces high-dimensional sparse vectors; the main cause is that the vocabulary of the collection is very large while each individual document contains few terms.
High-dimensional vectors bring difficulties beyond storage and computation. In a peer-to-peer retrieval system based on adaptive mapping, the CAN space typically has tens of dimensions while text vectors have hundreds of thousands; the dimensional mismatch prevents texts from being placed into the CAN space directly. Moreover, because of the curse of dimensionality, similarity computed in the high-dimensional space is not very accurate, degrading retrieval quality. Through adaptive mapping we can map the high-dimensional vectors to a low-dimensional space, which not only reduces storage and computation costs but also lets us build a semantics-based overlay by indexing texts into the CAN space for the subsequent retrieval process.
Another example follows; all vectors in it satisfy the high-dimensional sparse property:
The original vectors f1 and f2 are 50-dimensional; because the dimension is high, they are written in sparse index:value form, as follows:
f1 = [10:0.468, 15:0.058, 42:0.336, 43:0.852];
f2 = [1:0.16134499, 6:0.086649, 10:0.11352496, 29:0.7749904, 47:0.95804197];
Using the vector-space angle, the similarity between f1 and f2 is computed as 0.0532.
After the two vectors are mapped by the hash function into a 10-dimensional space, we obtain hash_f1 and hash_f2, respectively:
hash_f1 = [0:0.468, 2:0.336, 3:0.852, 5:0.058];
hash_f2 = [0:0.113, 1:0.161, 6:0.086, 7:0.958, 9:0.774].
Using the vector-space angle again, the similarity between hash_f1 and hash_f2 is also 0.0532. The "adaptive mapping" algorithm thus has the advantage of largely preserving vector similarity across hashing.
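The preservation claim can be checked with a short sketch (helper names are my own): with h(i) = i mod 10 and no two nonzero components colliding, the cosine similarity of the pair is unchanged by the mapping:

```python
import math

# Sparse vectors written as {index: value}.
def cosine(a, b):
    dot = sum(v * b.get(i, 0.0) for i, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def hash_map(vec, k):
    # Adaptive mapping with h(i) = i mod k and plain accumulation.
    out = {}
    for i, v in vec.items():
        out[i % k] = out.get(i % k, 0.0) + v
    return out

f1 = {10: 0.468, 15: 0.058, 42: 0.336, 43: 0.852}
f2 = {1: 0.16134499, 6: 0.086649, 10: 0.11352496, 29: 0.7749904, 47: 0.95804197}

before = cosine(f1, f2)
after = cosine(hash_map(f1, 10), hash_map(f2, 10))
print(abs(before - after) < 1e-12)   # True: similarity survives the mapping
```

Note that a collision between two nonzero components would perturb the similarity; the text's statistical-evenness argument is what keeps such collisions rare for high-dimensional sparse input.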
An embodiment of the invention provides a text retrieval method based on the adaptive mapping algorithm, applied to a CAN (Content-Aware Network), a structured peer-to-peer network. Before introducing the retrieval method itself, we first explain how the index of a text is built in the CAN space:
suppose the dimension of a document is n and the dimension of the CAN space is k;
building the text index mainly comprises two steps:
(1) for a document d = (t1, ..., tn), reduce its dimensionality with the adaptive mapping algorithm, with target dimension equal to the CAN space dimension k, obtaining the result d' = (t1', ..., tk');
(2) store d on the zone of the CAN space that contains (t1', ..., tk').
With reference to Fig. 4, which shows how an n-dimensional document is indexed onto a 2-dimensional CAN space: the input document d = (t1, ..., tn) is mapped by the adaptive mapping algorithm (SAM) to the 2-dimensional vector (5,4), and d is then indexed into region B, which contains (5,4).
This way of building the index places similar texts close to each other in the CAN space.
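The two indexing steps can be sketched as follows, with a plain dictionary standing in for the CAN overlay and the mod hash as the mapping relation; in a real CAN the reduced coordinate would be routed to the peer owning the zone that contains it:

```python
# Dict-backed stand-in for the CAN overlay; the reduced coordinate itself
# serves as the region key. Names and the mod hash are illustrative assumptions.
def adaptive_map(vec, k):
    out = [0.0] * k
    for i, v in enumerate(vec):
        out[i % k] += v                      # mod hash as the mapping relation
    return tuple(out)

can_space = {}                               # region -> document ids stored there

def index_document(doc_id, vec, k):
    coord = adaptive_map(vec, k)             # step (1): reduce d to k dimensions
    can_space.setdefault(coord, []).append(doc_id)   # step (2): store at the region
    return coord

coord = index_document("d1", [0.1, 0.3, 0.2, 0.4, 0.5], 2)
print(can_space[coord])   # ['d1']
```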
The text retrieval method of this embodiment is as follows:
First, the query text q is reduced with the adaptive mapping algorithm to the dimension k of the CAN space, giving the reduced query q';
Then q' is routed to the zone of the CAN space that contains q'. Because the texts in that zone and its surrounding zones are the most similar to the query, retrieval is performed over that zone and its surroundings, and the most similar text set is returned.
The retrieval result can be written as Q_r(q) = {d : S(d, q) > r}, where q is the query, d a document, r the similarity threshold, and S(x, y) a similarity function, defined for example as the cosine of the angle between the two vectors. Q_r(q) is the set of documents d whose similarity to q exceeds r.
The retrieval steps are as follows:
(1) for a query q = (t1, ..., tn), reduce it with the adaptive mapping algorithm to the CAN space dimension k, obtaining the result q' = (t1', ..., tk');
(2) route q to the zone z of the CAN space that contains the coordinate q' = (t1', ..., tk'); compare all documents d in zone z and return those satisfying S(d, q) > r;
(3) if every d in zone z satisfies S(d, q) > r, route the query q on to the neighbouring zones of z and repeat steps (2)-(3);
(4) if some document d' in the zone does not satisfy S(d', q) > r, the query stops.
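The retrieval loop of steps (1)-(4) can be sketched as below; the overlay, the similarity function S and the documents are stand-ins of my own, and the expansion rule follows the text: keep visiting neighbouring zones while every document in the current zone clears the threshold r:

```python
from collections import deque

# regions: zone coordinate -> list of (doc_id, doc_vector);
# neighbours: zone coordinate -> adjacent zone coordinates.
def retrieve(q_reduced, regions, neighbours, S, r):
    seen, queue, results = {q_reduced}, deque([q_reduced]), []
    while queue:
        z = queue.popleft()
        docs = regions.get(z, [])
        hits = [d for d, vec in docs if S(vec, q_reduced) > r]
        results.extend(hits)                 # step (2): return documents over threshold
        if docs and len(hits) == len(docs):  # step (3): all matched -> try neighbours
            for nb in neighbours.get(z, []):
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        # step (4): a below-threshold document was seen -> stop expanding from z
    return results

regions = {(1, 0): [("a", (0.9, 0.1))], (0, 1): [("b", (0.2, 0.9)), ("c", (0.8, 0.0))]}
neighbours = {(1, 0): [(0, 1)]}
dot = lambda u, v: sum(x * y for x, y in zip(u, v))
print(retrieve((1, 0), regions, neighbours, dot, 0.5))   # ['a', 'c']
```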
The series of parameters involved in running this method can be determined and chosen by the user according to the required retrieval accuracy and acceptable computation, making the embodiments meet user needs more flexibly.
Settings where the method of this embodiment can be applied include at least: network text retrieval, databases, search engines and retrieval servers.
With reference to Fig. 5, this embodiment provides a text retrieval device comprising an input module 502, a dimensionality-reduction module 504, a retrieval module 506 and an output module 508, wherein:
the input module 502 is used to input the original text data;
the dimensionality-reduction module 504 may use the adaptive mapping algorithm to reduce the dimensionality of the input text data: it determines the reduction mapping relation according to the dimension of the raw data and the dimension of the target mapping space, determines from that relation the correspondence between each dimension of the raw data and each dimension of the target mapping space, and calculates the value of each dimension of the target mapping space;
the retrieval module 506 is used to retrieve over the data reduced by the dimensionality-reduction module 504.
Further, the device also comprises:
the output module 508, used to output the text data retrieved by the retrieval module 506;
the parameter adjustment module 510, used to determine the dimension of the target mapping space, the way the value of each target dimension is computed, and the choice of reduction mapping relation, such as the choice of hash function and the intended dimension of the reduced data. These parameters can be determined and chosen by the user according to the required retrieval accuracy and acceptable computation, making the embodiment meet user needs more flexibly. The parameters set by this module are passed to the dimensionality-reduction module 504 to control its processing flow; adjusting them controls the degree of similarity preservation and the computation speed, letting the user balance precision against computational cost.
With reference to Fig. 6, the dimensionality-reduction module 504 may specifically comprise:
a dimension mapping module 504-1 and a calculation module 504-2, wherein:
the dimension mapping module 504-1 is used to determine the reduction mapping relation according to the dimension of the raw data and the dimension of the target mapping space and, from that relation, the correspondence between each dimension of the raw data and each dimension of the target mapping space; determining the reduction mapping relation may concretely mean selecting a specific hash function, such as mod n, MD5 or SHA1;
the calculation module 504-2 is used to calculate the value of each dimension of the target mapping space by some calculation method, such as accumulation, or the square root of accumulated squares.
The dimensionality-reduction module 504 can run its processing flow according to the parameters from the parameter adjustment module 510.
The modules 502-510 above can be hardware units, software units, or combinations of hardware and software units.
The retrieval device of this embodiment can be integrated into a network text search device, a database, a search engine, a retrieval server, etc.
The modules of the retrieval device can also be distributed and integrated over multiple pieces of equipment, which together constitute a retrieval system functionally equivalent to the device of this embodiment.
In summary, embodiments of the invention provide a text retrieval method and device based on the adaptive mapping algorithm that compress the original text data effectively, with low time complexity, suitability for massive data sets and effective preservation of the similarity between the original text vectors. Using the method or device of the embodiments, requests such as network text query, search and retrieval can be answered quickly at a relatively low computational cost, solving the problems of existing network text retrieval: high time complexity, large space consumption and relatively poor precision.
Embodiments of the invention can also, according to user demand, adjust parameters to control the degree of similarity preservation and the computation speed, letting the user strike a balance between retrieval precision and computational cost. Moreover, implementing the invention requires no domain-specific knowledge to be introduced.
The above are only preferred embodiments of the invention; the protection scope of the invention is not limited to them. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the invention shall be covered by the protection scope of the invention, which shall therefore be determined by the protection scope of the claims.

Claims (12)

1. the method for a text retrieval is characterized in that, comprising:
Input urtext data;
Described urtext data are carried out self-adaptation mapping dimension-reduction treatment;
According to the data after the described self-adaptation mapping dimension-reduction treatment, retrieve the text data similar to it.
2. the method for claim 1 is characterized in that, retrieves after the step of the text data similar to it according to the data after the described self-adaptation mapping dimension-reduction treatment described, further comprises:
Export the described text data that retrieves.
3. the method for claim 1 is characterized in that, the described step that the urtext data are carried out self-adaptation mapping dimension-reduction treatment comprises:
According to the dimension of described urtext data and the dimension of target mapping space, determine the dimensionality reduction mapping relations;
According to described definite dimensionality reduction mapping relations, determine the corresponding relation of described each dimension of urtext data with each dimension of target mapping space.
4. method as claimed in claim 3 is characterized in that, according to the dimension of urtext data and the dimension of target mapping space, determines that before the step of dimensionality reduction mapping relations, this method further comprises described:
Determine the dimension of target mapping space.
5. The method according to claim 3, characterized in that, after the step of determining the correspondence between each dimension of the original data and each dimension of the target mapping space, the method further comprises:
calculating the value of each dimension of the target mapping space according to the original text data.
6. The method according to claim 3, characterized in that determining the dimension-reduction mapping relation comprises: selecting and determining a hash function.
7. the method for claim 1 is characterized in that, the occasion that described method is used comprises at least: network text retrieval, database, search engine or retrieval server.
8. A text retrieval apparatus, characterized by comprising an input module, a dimension-reduction module, and a retrieval module, wherein:
the input module is configured to input original text data;
the dimension-reduction module is configured to perform adaptive-mapping dimension-reduction processing on the original text data;
the retrieval module is configured to retrieve, according to the data processed by the dimension-reduction module, text data similar thereto.
9. The apparatus according to claim 8, characterized in that the apparatus further comprises:
an output module, configured to output the text data retrieved by the retrieval module.
10. The apparatus according to claim 8, characterized in that the dimension-reduction module comprises:
a dimension-mapping module, configured to determine a dimension-reduction mapping relation according to the dimension of the original data and the dimension of the target mapping space, and to determine, according to the determined dimension-reduction mapping relation, the correspondence between each dimension of the original text data and each dimension of the target mapping space;
a calculation module, configured to calculate the value of each dimension of the target mapping space.
11. The apparatus according to claim 8, characterized by further comprising:
a parameter adjustment module, configured to determine the dimension of the target mapping space and the manner of calculating the value of each dimension of the target mapping space, and to choose the dimension-reduction mapping relation; the parameters determined and/or chosen by this module are passed to the dimension-reduction module to control its operation.
12. The apparatus according to claim 8, characterized in that the apparatus is integrated in a network text retrieval apparatus, a database, a search engine, or a retrieval server.
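Claims 1 and 3–6 describe a pipeline of hash-based dimension reduction followed by similarity retrieval: each dimension (term) of the original high-dimensional text vector is assigned by a hash function to a dimension of a lower-dimensional target mapping space, the target dimension values are accumulated from the original data, and retrieval then compares vectors in the reduced space. The claims do not fix the hash function, the target dimension, or the similarity measure; the minimal Python sketch below fills those in with illustrative assumptions (an MD5-derived bucket index, a 64-dimension target space, and cosine similarity) and is a plausible reading, not the patented implementation.

```python
import hashlib
import math

def reduce_dim(term_freqs, target_dim=64):
    """Hash-based dimension reduction (cf. claims 3-6): each original
    dimension (a term) is mapped to one of target_dim buckets by a hash
    function, and the bucket accumulates that term's frequency."""
    vec = [0.0] * target_dim
    for term, freq in term_freqs.items():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % target_dim
        vec[bucket] += freq
    return vec

def cosine(a, b):
    """Cosine similarity between two dense vectors (illustrative choice)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_tf, corpus_tfs, target_dim=64, top_k=3):
    """Retrieve the indices of the corpus documents most similar to the
    query, comparing in the reduced space (cf. claim 1)."""
    q = reduce_dim(query_tf, target_dim)
    scored = [(cosine(q, reduce_dim(d, target_dim)), i)
              for i, d in enumerate(corpus_tfs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]
```

Because every document is compressed to a fixed `target_dim`-length vector, similarity comparisons cost O(target_dim) per pair regardless of vocabulary size, which matches the stated aim of low time and space cost on massive data; the trade-off is occasional hash collisions between unrelated terms.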
CNB2007101238322A 2007-10-09 2007-10-09 Text searching method and device Expired - Fee Related CN100530192C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007101238322A CN100530192C (en) 2007-10-09 2007-10-09 Text searching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2007101238322A CN100530192C (en) 2007-10-09 2007-10-09 Text searching method and device

Publications (2)

Publication Number Publication Date
CN101140583A true CN101140583A (en) 2008-03-12
CN100530192C CN100530192C (en) 2009-08-19

Family

ID=39192536

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007101238322A Expired - Fee Related CN100530192C (en) 2007-10-09 2007-10-09 Text searching method and device

Country Status (1)

Country Link
CN (1) CN100530192C (en)


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214180A (en) * 2010-04-12 2011-10-12 无锡科利德斯科技有限公司 Retrieval method and method using same for establishing text semantic extraction module
CN103399927A (en) * 2013-08-05 2013-11-20 百度在线网络技术(北京)有限公司 Distributed computing method and device
CN103399927B (en) * 2013-08-05 2016-11-02 百度在线网络技术(北京)有限公司 Distributed computing method and device
CN106487824A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 Rule gray-release method and device
CN107229939B (en) * 2016-03-24 2020-12-04 北大方正集团有限公司 Method and device for determining similar documents
CN107229939A (en) * 2016-03-24 2017-10-03 北大方正集团有限公司 Method and device for determining similar documents
CN106649668A (en) * 2016-12-14 2017-05-10 华南师范大学 Vector model-based massive spatiotemporal data retrieval method and system
CN110891010A (en) * 2018-09-05 2020-03-17 百度在线网络技术(北京)有限公司 Method and apparatus for transmitting information
CN110275991A (en) * 2019-06-03 2019-09-24 腾讯科技(深圳)有限公司 Hash value determination method and device, storage medium, and electronic device
CN110275991B (en) * 2019-06-03 2021-05-14 腾讯科技(深圳)有限公司 Hash value determination method and device, storage medium and electronic device
CN110795432A (en) * 2019-10-29 2020-02-14 腾讯云计算(北京)有限责任公司 Characteristic data retrieval method and device and storage medium
CN112364009A (en) * 2020-12-03 2021-02-12 四川长虹电器股份有限公司 Method for retrieving similar data of target object
CN117390013A (en) * 2023-09-12 2024-01-12 博瀚智能(深圳)有限公司 Data storage method, retrieval method, system, device and storage medium

Also Published As

Publication number Publication date
CN100530192C (en) 2009-08-19

Similar Documents

Publication Publication Date Title
CN100530192C (en) Text searching method and device
Tomasic et al. Performance of inverted indices in shared-nothing distributed text document information retrieval systems
US7761407B1 (en) Use of primary and secondary indexes to facilitate aggregation of records of an OLAP data cube
KR100385528B1 (en) Multidimensional data clustering and dimension reduction for indexing and searching
US9424351B2 (en) Hybrid-distribution model for search engine indexes
Boussahoua et al. Logical schema for data warehouse on column-oriented NoSQL databases
CN106997386A OLAP precomputation model, automatic modeling method and automatic modeling system
CN105279264B Semantic relevance computation method for documents
CN107103032B (en) A massive data paging query method that avoids global sorting in a distributed environment
US9195745B2 (en) Dynamic query master agent for query execution
CA2470899A1 (en) Method and system for similarity search and clustering
Amine et al. Evaluation of text clustering methods using wordnet.
CN104391908B Locality-sensitive-hashing-based multi-keyword indexing method on graphs
CN116662521B (en) Electronic document screening and inquiring method and system
US20090327339A1 (en) Partition templates for multidimensional databases
CN105159971A (en) Cloud platform data retrieval method
Bai et al. Adaptive query relaxation and top‐k result sorting of fuzzy spatiotemporal data based on XML
CN102915312B (en) Information issuing method in website and system
CN101840438A (en) Retrieval system oriented to meta keywords of source document
Ahmed et al. Hybrid Data Fragmentation Using Genetic Killer Whale Optimization-Based Clustering Model.
Kong et al. A multi-source heterogeneous data storage and retrieval system for intelligent manufacturing
Rao et al. An efficient keyword based search of big data using map reduce
Dhulavvagol et al. Efficient data partitioning and retrieval using modified redde technique
Ni et al. An Efficient Method for Improving Query Efficiency in Data Warehouse.
CN112364032B (en) Data center data query method based on Internet technology

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
C41 Transfer of patent application or patent right or utility model
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20080328

Address after: headquarters of Bantian HUAWEI, Longgang District, Guangdong, Shenzhen

Applicant after: HUAWEI TECHNOLOGIES Co.,Ltd.

Applicant after: SHANGHAI JIAO TONG University

Address before: Bantian HUAWEI headquarters, Longgang District, Shenzhen, Guangdong

Applicant before: HUAWEI TECHNOLOGIES Co.,Ltd.

C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090819

Termination date: 20201009

CF01 Termination of patent right due to non-payment of annual fee