CN102043866B

CN102043866B - Relaxation search and optimization sequencing method based on form characteristic

Info

Publication number: CN102043866B
Application number: CN 201110025996
Authority: CN
Inventors: 张书奎; 崔志明; 陈明; 赵朋朋; 孙涌
Original assignee: SUZHOU PRODUCTION INFORMATION TECHNOLOGY Co Ltd
Current assignee: SUZHOU SOUKE INFORMATION TECHNOLOGY CO., LTD.
Priority date: 2011-01-25
Filing date: 2011-01-25
Publication date: 2013-03-13
Anticipated expiration: 2031-01-25
Also published as: CN102043866A

Abstract

The invention relates to a relaxation search and optimization sequencing method based on form features. A form collector first gathers a large number of related query forms; for each attribute, a triple recording its rank in each form is constructed; schema matching maps attributes that have different names but the same semantics to a single attribute; attributes that appear only in a particular data source are filtered out; the comprehensive order of each attribute is computed with the proposed formula; the attributes are reordered by comprehensive order; and the relaxed results are ranked and filtered. The ranking and filtering method is improved in that distance values are computed only for the attributes that affect similarity, which makes ranking the relaxed results more efficient.

Description

Relaxation search and optimization sequencing method based on form feature

Technical field

The present invention relates to an optimization method for information retrieval, and in particular to a relaxation search and optimization sequencing method based on form features.

Background technology

With the development of the Internet, it has become increasingly common for users to obtain information of interest over the network. Of the information on the Internet, users favor the information in the Deep Web over static information. Because Deep Web information is generally stored in databases that organizations continually maintain and update, it is more current and more highly structured. According to a study from 2000, the Deep Web holds about 500 times as much information as the surface web, and in 2007 Google estimated that the Deep Web comprised 25,000,000 data sources; the Deep Web is therefore also richer in information.

However, a Deep Web source usually exposes only a query interface, and users can obtain this high-quality structured information only by submitting query terms themselves. Users do not know how the data in the database are related, and queries often fail, that is, return no results, because the input conditions constrain one another or the query conditions are too strict. The usual remedy is to relax the query conditions, that is, to broaden the search and offer the user the results closest to the submitted conditions to choose from. There is currently a good deal of research on relaxation methods for databases, but comparatively little on relaxation methods for Deep Web integration.

When attribute-based relaxation methods are applied to Deep Web integration, the existing methods for deciding the relaxation order do not carry over well because of the heterogeneity of the many data sources.

At present, research on query relaxation concentrates mainly on databases and falls roughly into three classes: 1) generalizing the query on the basis of attribute matching, of which the work of Gaasterland is representative; 2) sample-based query relaxation: Muslea proposed the LOQR algorithm, which first samples the target database, then finds in the sample the records closest to the failed query and intersects them with the query conditions to obtain the relaxed query expression; 3) relaxation by weakening attributes: Nambiar et al. obtain a random sample by probing the database, use machine learning to mine the functional dependencies between attributes, judge the importance of each attribute from them to determine the order in which attributes are relaxed, and then relax the attributes of the query conditions in that order.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing a relaxation search and optimization sequencing method based on form features.

The object of the invention is achieved through the following technical solution:

A relaxation search and optimization sequencing method based on form features, characterized by comprising the following steps:

(1) use a form collector to collect a large number of query forms, and for each form record all the triples OI = {DB_ID, Attribute, Order} that describe the ranking of its attributes;

Triple OI = {DB_ID, Attribute, Order};

where DB_ID is the identifier that the system assigns to the data source of a form and uniquely identifies a query interface; Attribute is the name of an attribute and identifies that attribute; Order is the rank of the attribute in the form, that is, its position in the form;

(2) use a schema-matching method to map attributes that have different names but express the same semantics to a single attribute;

(3) locate the attributes that the query interface is to contain;

(4) calculate the comprehensive order of each attribute;

4-1) take an attribute located in step (3) and, from its occurrence count AC and the corresponding rank information, compute its comprehensive order CO with the following formula and put it into the table COS,

CO(Attri) = \frac{\sum_{i=1}^{count} W_{imp}(DB_i) \cdot Order_{DB_i}(Attri)}{count}

where the attribute count (Attribute Count, AC) is the number of times a particular attribute occurs in all the collected forms, described as a pair AC = {Attribute, Count}, in which Attribute is the name of an attribute and identifies that attribute, and Count is the total number of times the attribute occurs in all the collected forms; the comprehensive order of an attribute (Comprehensive Order, CO) expresses the overall rank of the attribute once its rank information in every form has been taken into account;

4-2) repeat step 4-1) until the comprehensive order of every attribute has been calculated;

(5) reorder the attributes according to their comprehensive order;

the relative order of the attributes so obtained is the order of attribute importance in the domain, and the relaxation order is the reverse of that order; once the relaxation order is known, it is used to relax failed queries;

(6) rank and filter the result information, comprising the following steps:

6-1) obtain the relevant attributes;

for the result instances returned by a particular data source DB_i, the set of attributes that have lost their constraint is computed as:

CAS(DB_i) = RAS \cup (UAS(DB_i) \cap QAS)

where the relaxed attribute set (Relax Attributes Set, RAS) is the set of attributes relaxed during the query; the unsupported attribute set (Unsupported Attributes Set, UAS) is the set of attributes that the query interface of a data source does not support for querying; the query attribute set (Query Attributes Set, QAS) is the set of attributes contained in the query request; and the calculation attribute set (Calculate Attributes Set, CAS) is the set of attributes for which a distance must be calculated when the distance between a result and the query conditions is finally evaluated;

6-2) calculate the distance value;

for the attributes in the CAS(DB_i) of the data source from 6-1), compute the distance values in turn and sum them:

Dist(Q, t) = \sum_{i=1}^{n} W_{imp}(a_i) \cdot Dist_{a_i}(Q, t)

where W_{imp}(a_i) is the weight of attribute a_i, that is, its importance, and Dist_{a_i}(Q, t) is the distance between the data record and the query conditions on attribute a_i; the weight W_{imp}(a_i) of an attribute a_i is calculated as:

W_{imp}(a_i) = \frac{RelaxOrder(a_i)}{|Attributes(R)|}

where RelaxOrder(a_i) is the relaxation order of attribute a_i and |Attributes(R)| is the total number of attributes;

the distance value for each attribute type is calculated with an existing, publicly known method;

if a_i is a Boolean or categorical attribute, the distance is 0 when the value of the attribute in the result equals the value in the query, and 1 otherwise;

if a_i is a numerical attribute, the distance is:

Dist(q, t) = \frac{|q_{a_i} - t_{a_i}|}{\max_{a_i} - \min_{a_i}}

where \max_{a_i} and \min_{a_i} are respectively the maximum and minimum values that the attribute can take, q_{a_i} is the value of the query condition on the attribute, and t_{a_i} is the value of the result record on the attribute;

if a_i is a text attribute, the distance on the attribute is expressed as:

Dist(q_{a_i}, t_{a_i}) = \frac{d(q_{a_i}, t_{a_i})}{\max(|q_{a_i}|, |t_{a_i}|)}

where d(q_{a_i}, t_{a_i}) is the edit distance on attribute a_i, and |q_{a_i}| and |t_{a_i}| are the string lengths of the query value and the result value on attribute a_i;

6-3) rank and filter;

all the result instances returned in 6-2) are arranged in descending order according to Dist(Q, t), and the first K records are returned to the user, K being the number of records the user requires.

Further, in the above relaxation search and optimization sequencing method based on form features, the step of locating the attributes that the query interface is to contain is:

3-1) set an attribute frequency threshold;

3-2) take one of the attributes obtained in step (2) and count the number of times it occurs;

3-3) if the occurrence count of the attribute is greater than the attribute frequency threshold, mark the attribute as belonging to the query interface; otherwise examine the next attribute;

3-4) repeat steps 3-2) and 3-3) until the attributes that the query interface is to contain have all been located.

The outstanding substantive features and notable progress of the technical solution of the invention are mainly as follows:

The invention uses the features of Deep Web query forms to decide the relaxation order of attributes in Deep Web integration and performs relaxation search accordingly. In ranking the query results it also takes into account the root factors that affect the similarity between relaxed results and the query conditions, and proposes a more efficient ranking method that reduces the time spent on filtering; the gain in efficiency is especially marked when the query conditions contain many attributes and few of them are relaxed.

Description of drawings

The technical solution of the invention is further described below with reference to the drawings:

Fig. 1: query form of Dangdang.com;

Fig. 2: query form of Joyo.com;

Fig. 3: schematic flow chart of the relaxation order decision of the invention.

Embodiment

The invention decides the relaxation order of attributes in Deep Web integration and performs relaxation search accordingly; in ranking the query results it also takes into account the root factors that affect the similarity between relaxed results and the query conditions, and proposes a more efficient ranking method. The method is mainly a relaxation method based on weakening attribute constraints. The key problem of such a method is how to decide the order in which attributes are relaxed: to obtain results as close as possible to the user's original query, the original query conditions should be changed as little as possible, which means finding, among the attributes of the query, those that affect the query result least. In the traditional database field, Nambiar et al., addressing the conversion of a precise query into an imprecise one, proposed obtaining a random sample by probing the database and then using machine learning to mine the functional dependencies between attributes, from which the importance of each attribute is judged. In Deep Web integration, however, this method is hard to apply: on the one hand, with so many heterogeneous data sources it is difficult to obtain a suitable sample at global scope; on the other hand, this way of judging attribute importance starts from the dependencies between database attributes and does not reflect the user's point of view well.

The relaxation search and optimization sequencing method of the invention based on form features is further described below with reference to the drawings and an embodiment. As shown in Fig. 3, it comprises the following steps:

(1) use a form collector to collect a large number of related query forms, and for each form record all the triples OI = {DB_ID, Attribute, Order} that describe the ranking of its attributes;

for a given domain, a form collector gathers a large number of related query forms; the form collector can be implemented with a Deep Web focused crawler.

The triples are obtained as follows:

1-1) arrange the collected forms in a list, and likewise arrange the attributes obtained in a table;

1-2) take the next attribute, record the triple OI = {DB_ID, Name, Order} that describes its rank, and put the triple into the table OIS;

where OI (Order Information) is the rank information: the information about an attribute in a form can be described as the triple OI = {DB_ID, Attribute, Order}. DB_ID is the identifier that the system assigns to the data source of a form and uniquely identifies a query interface; Attribute is the name of an attribute and identifies that attribute; Order is the rank of the attribute in the form, that is, its position in the form.

1-3) repeat step 1-2) until the rank triples of all attributes have been collected;
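
As an illustration only, steps 1-1) to 1-3) can be sketched in Python. The class name `OrderInfo`, the function `collect_order_info`, the dictionary layout of `forms`, and the 1-based positions are assumptions made for this sketch and are not details given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderInfo:
    """One OI triple: data source, attribute name, position in the form."""
    db_id: str      # identifier the system assigns to the form's data source
    attribute: str  # attribute name as it appears on the form
    order: int      # position of the attribute within the form (assumed 1-based)


def collect_order_info(forms):
    """Build the OIS table from {db_id: [attribute names in form order]}."""
    ois = []
    for db_id, attributes in forms.items():
        for position, name in enumerate(attributes, start=1):
            ois.append(OrderInfo(db_id, name, position))
    return ois


# Hypothetical query forms from two book-search data sources.
forms = {
    "db1": ["title", "author", "publisher"],
    "db2": ["author", "title", "isbn", "price"],
}
ois = collect_order_info(forms)
```

Each entry of `ois` corresponds to one triple of the table OIS; a real form collector would fill `forms` from crawled query interfaces.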

(2) use a schema-matching technique to map attributes that have different names but express the same semantics to a single attribute;

different attribute names may express the same semantics; that is, the attribute names held in the Name field of OIS may differ while meaning the same thing. The trade name and the title in Fig. 1, for example, are named differently but express the same attribute, so both need to be mapped to one attribute name.
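
The text does not prescribe a particular schema-matching algorithm. The following Python sketch substitutes a hand-written synonym table for a real schema matcher, purely to show the effect of the mapping on the triples; `normalize_attributes`, `synonyms`, and the sample names are hypothetical.

```python
def normalize_attributes(ois, synonyms):
    """Map each attribute name in the triples to its canonical name.

    ois:      iterable of (db_id, attribute, order) triples
    synonyms: {canonical name: [variant names]}, a stand-in for schema matching
    """
    canonical = {}
    for name, variants in synonyms.items():
        for variant in variants:
            canonical[variant.lower()] = name
    # Names without a known synonym are kept unchanged.
    return [(db_id, canonical.get(attr.lower(), attr), order)
            for db_id, attr, order in ois]


ois = [("db1", "Trade name", 1), ("db2", "Title", 1), ("db2", "Author", 2)]
synonyms = {"title": ["trade name", "title", "book name"]}
mapped = normalize_attributes(ois, synonyms)
```

After the mapping, "Trade name" and "Title" are counted as the same attribute in the later steps.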

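(3) locate the attributes that the query interface is to contain;

3-1) set an attribute frequency threshold;

3-2) take one of the attributes obtained in step (2) and count the number of times it occurs;

3-3) if the occurrence count of the attribute is greater than the attribute frequency threshold, mark the attribute as belonging to the query interface; otherwise examine the next attribute;

3-4) repeat steps 3-2) and 3-3) until the attributes that the query interface is to contain have all been located;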

(4) calculate the comprehensive order of each attribute,

CO(Attri) = \frac{\sum_{i=1}^{count} W_{imp}(DB_i) \cdot Order_{DB_i}(Attri)}{count}

where the attribute count (Attribute Count, AC) is the number of times a particular attribute occurs in all the collected forms, described as a pair AC = {Attribute, Count}, in which Attribute is the name of an attribute and identifies that attribute, and Count is the total number of times the attribute occurs in all the collected forms. The comprehensive order of an attribute (Comprehensive Order, CO) expresses the overall rank of the attribute once its rank information in every form has been taken into account;
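
A minimal Python sketch of the CO formula follows. The text does not say how the data-source weight W_{imp}(DB_i) is obtained, so the sketch assumes a weight of 1.0 for every source unless one is supplied; `comprehensive_order` and `source_weight` are names chosen for illustration.

```python
from collections import defaultdict


def comprehensive_order(ois, source_weight=None):
    """Compute CO(attr) = sum_i W_imp(DB_i) * Order_DB_i(attr) / count.

    ois:           iterable of (db_id, attribute, order) triples
    source_weight: optional {db_id: weight}; missing sources default to 1.0
                   (an assumption, the text leaves W_imp(DB_i) unspecified)
    """
    source_weight = source_weight or {}
    weighted_sum = defaultdict(float)
    count = defaultdict(int)
    for db_id, attribute, order in ois:
        weighted_sum[attribute] += source_weight.get(db_id, 1.0) * order
        count[attribute] += 1
    return {attr: weighted_sum[attr] / count[attr] for attr in weighted_sum}


ois = [
    ("db1", "title", 1), ("db1", "author", 2),
    ("db2", "author", 1), ("db2", "title", 2),
    ("db3", "title", 1), ("db3", "author", 3),
]
cos = comprehensive_order(ois)
# title:  (1 + 2 + 1) / 3 = 1.33...
# author: (2 + 1 + 3) / 3 = 2.0
```

With equal source weights, CO reduces to the average position of the attribute over the forms in which it occurs.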

(5) reorder the attributes according to their comprehensive order;

the relative order of the attributes so obtained is taken to be the order of attribute importance in the domain, and the relaxation order is the reverse of that order. Once the relaxation order is known, it can be used to relax failed queries. Relaxation, however, tends to produce a large amount of data that is only weakly related to the original query, whereas users usually care about the records close to the original query, so the relaxed results need to be filtered.
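
The following Python sketch shows one way the relaxation order could be derived and applied. It assumes that a smaller CO value (an attribute that appears earlier in the forms) means a more important attribute, and that relaxing an attribute means dropping its constraint; neither is stated explicitly in the text. `run_query` is a hypothetical stand-in for submitting a query to a data source.

```python
def relaxation_order(cos):
    """Return attributes from first-to-relax to last-to-relax."""
    importance = sorted(cos, key=lambda attr: cos[attr])  # most important first
    return list(reversed(importance))                     # reverse = relaxation order


def relax_until_results(query, order, run_query):
    """Drop constraints in relaxation order until run_query returns records."""
    relaxed = dict(query)
    dropped = []
    results = run_query(relaxed)
    for attr in order:
        if results:
            break
        if attr in relaxed:
            del relaxed[attr]
            dropped.append(attr)
            results = run_query(relaxed)
    return results, dropped


# Toy in-memory data source used only for this example.
records = [
    {"title": "Deep Web", "author": "Li", "price": 30},
    {"title": "Deep Web", "author": "Wang", "price": 45},
]


def run_query(constraints):
    return [r for r in records
            if all(r.get(k) == v for k, v in constraints.items())]


cos = {"title": 1.3, "author": 2.0, "price": 3.5}
order = relaxation_order(cos)
results, dropped = relax_until_results(
    {"title": "Deep Web", "author": "Li", "price": 99}, order, run_query
)
```

In this example the original query fails, the constraint on `price` is relaxed first, and the relaxed query then returns one record; `dropped` plays the role of the relaxed attribute set.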

(6) rank and filter the result information, comprising the following steps:

6-1) obtain the relevant attributes;

CAS(DB_i) = RAS \cup (UAS(DB_i) \cap QAS)

where the relaxed attribute set (Relax Attributes Set, RAS) is the set of attributes relaxed during the query. The unsupported attribute set (Unsupported Attributes Set, UAS) is the set of attributes that the query interface of a data source does not support for querying. The query attribute set (Query Attributes Set, QAS) is the set of attributes contained in the query request. The calculation attribute set (Calculate Attributes Set, CAS) is the set of attributes for which a distance must be calculated when the distance between a result and the query conditions is finally evaluated;
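
The set expression for CAS maps directly onto Python set operations. The sketch below uses hypothetical attribute names; only the formula itself comes from the text.

```python
def calculate_attribute_set(ras, uas_of_source, qas):
    """CAS(DB_i) = RAS ∪ (UAS(DB_i) ∩ QAS)."""
    return set(ras) | (set(uas_of_source) & set(qas))


qas = {"title", "author", "publisher", "price"}  # attributes in the user's query
ras = {"price"}                                  # attributes relaxed for this query
uas_db1 = {"publisher", "isbn"}                  # attributes db1's interface cannot query
cas_db1 = calculate_attribute_set(ras, uas_db1, qas)
```

Here `cas_db1` contains `price` (relaxed) and `publisher` (queried but unsupported by db1); `title` and `author` were enforced by the data source, so no distance needs to be calculated for them.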

6-2) calculate the distance value;

for the attributes in the CAS(DB_i) of the data source from 6-1), compute the distance values in turn and sum them:

Dist(Q, t) = \sum_{i=1}^{n} W_{imp}(a_i) \cdot Dist_{a_i}(Q, t)

where W_{imp}(a_i) is the weight of attribute a_i, that is, its importance, and Dist_{a_i}(Q, t) is the distance between the data record and the query conditions on attribute a_i. The attribute weight W_{imp}(a_i) is calculated in the existing way; for an attribute a_i the weight is:

W_{imp}(a_i) = \frac{RelaxOrder(a_i)}{|Attributes(R)|}

where RelaxOrder(a_i) is the relaxation order of attribute a_i and |Attributes(R)| is the total number of attributes.
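
As a hedged illustration, the weight and the weighted sum can be written as below. The sketch assumes that RelaxOrder(a_i) is the 1-based position of a_i in the relaxation order, so the attribute relaxed first receives the smallest weight; the function names and sample values are hypothetical.

```python
def attribute_weights(relax_order):
    """W_imp(a_i) = RelaxOrder(a_i) / |Attributes(R)|, positions assumed 1-based."""
    total = len(relax_order)
    return {attr: position / total
            for position, attr in enumerate(relax_order, start=1)}


def weighted_distance(per_attribute_distance, weights):
    """Dist(Q, t) = sum of W_imp(a_i) * Dist_a_i(Q, t) over the given attributes.

    per_attribute_distance holds distances only for the attributes in CAS.
    """
    return sum(weights[attr] * dist
               for attr, dist in per_attribute_distance.items())


relax_order = ["price", "publisher", "author", "title"]  # first-to-relax first
weights = attribute_weights(relax_order)
# Distances on the two attributes of a hypothetical CAS = {price, publisher}.
dist = weighted_distance({"price": 0.5, "publisher": 1.0}, weights)
# 0.25 * 0.5 + 0.5 * 1.0 = 0.625
```

Only the attributes in CAS contribute terms to the sum, which is the source of the saving described in the text.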

The distance value for each attribute type is calculated with existing, publicly known methods.

If a_i is a Boolean or categorical attribute, the distance is 0 when the value of the attribute in the result equals the value in the query, and 1 otherwise.

If a_i is a numerical attribute, the distance is:

Dist(q, t) = \frac{|q_{a_i} - t_{a_i}|}{\max_{a_i} - \min_{a_i}}

where \max_{a_i} and \min_{a_i} are respectively the maximum and minimum values that the attribute can take, q_{a_i} is the value of the query condition on the attribute, and t_{a_i} is the value of the result record on the attribute.

If a_i is a text attribute, the distance on the attribute is expressed as:

Dist(q_{a_i}, t_{a_i}) = \frac{d(q_{a_i}, t_{a_i})}{\max(|q_{a_i}|, |t_{a_i}|)}

where d(q_{a_i}, t_{a_i}) is the edit distance on attribute a_i, and |q_{a_i}| and |t_{a_i}| are the string lengths of the query value and the result value on attribute a_i.
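
The three per-attribute distances can be sketched in Python as follows. The text says only "edit distance"; the sketch assumes the Levenshtein distance is meant. The function names and the guards for a zero range or two empty strings are additions for this example.

```python
def categorical_distance(q, t):
    """Boolean or categorical attribute: 0 if the values match, else 1."""
    return 0.0 if q == t else 1.0


def numeric_distance(q, t, min_value, max_value):
    """|q - t| / (max - min); a zero range is treated as distance 0 (assumption)."""
    value_range = max_value - min_value
    return abs(q - t) / value_range if value_range else 0.0


def edit_distance(a, b):
    """Levenshtein distance computed with the standard two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def text_distance(q, t):
    """d(q, t) / max(|q|, |t|); two empty strings give distance 0 (assumption)."""
    longest = max(len(q), len(t))
    return edit_distance(q, t) / longest if longest else 0.0
```

For example, `text_distance("kitten", "sitting")` divides an edit distance of 3 by the longer length 7, and `numeric_distance(30, 45, 0, 100)` divides a difference of 15 by a range of 100. All three functions return values in [0, 1] when the inputs lie within the attribute's range, so they can be combined in the weighted sum.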

6-3) rank and filter;

all the result instances returned in 6-2) are arranged in descending order according to Dist(Q, t), and the first K records are returned to the user as required;

in this way distances are calculated only on the attributes that need them, which saves computing resources and reduces the time spent on filtering; the gain in efficiency is especially marked when the query conditions contain many attributes and few of them are relaxed.
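
The rank filtering of step 6-3) can be sketched as a sort followed by a cut at K. The text specifies a descending arrangement; the sketch exposes the direction as a parameter and defaults to smallest distance first, on the assumption that the records closest to the query are the ones to be returned. `top_k` and the sample data are hypothetical.

```python
def top_k(records, distance_of, k, closest_first=True):
    """Sort records by their distance to the query and keep the first k.

    closest_first=True puts the smallest Dist(Q, t) first (an assumption
    about the intended direction); pass False for the opposite order.
    """
    ranked = sorted(records, key=distance_of, reverse=not closest_first)
    return ranked[:k]


# (record id, Dist(Q, t)) pairs for three relaxed results.
scored = [("r1", 0.625), ("r2", 0.1), ("r3", 0.4)]
best = top_k(scored, distance_of=lambda r: r[1], k=2)
```

With the default direction, `best` holds the two records nearest to the original query.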

In the above technical solution, using a schema-matching technique to map attributes that have different names but express the same semantics to a single attribute is prior art.

Because the above technical solution is concerned only with the attributes that the integrated query interface is to contain, attributes that occur only in a particular data source are uniformly filtered out. After the attribute names have been mapped, the number of occurrences of each attribute Name is counted, and a threshold w is set to filter out the attributes that occur only in particular data sources.
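
The frequency filter described above can be sketched with a counter over the mapped triples. The function name `locate_interface_attributes` and the sample data are hypothetical; the comparison "greater than the threshold" follows step 3-3).

```python
from collections import Counter


def locate_interface_attributes(ois, threshold):
    """Keep the attributes whose occurrence count exceeds the threshold w.

    ois: iterable of (db_id, attribute, order) triples after schema matching.
    """
    counts = Counter(attribute for _db_id, attribute, _order in ois)
    return {attr for attr, count in counts.items() if count > threshold}


ois = [
    ("db1", "title", 1), ("db1", "author", 2), ("db1", "publisher", 3),
    ("db2", "author", 1), ("db2", "title", 2), ("db2", "isbn", 3),
    ("db3", "title", 1), ("db3", "author", 2),
]
selected = locate_interface_attributes(ois, threshold=1)
```

With a threshold of 1, `publisher` and `isbn`, which each occur in a single data source, are filtered out, and `title` and `author` remain.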

In summary, the invention uses the features of Deep Web query forms to decide the relaxation order of attributes in Deep Web integration and performs relaxation search accordingly. In ranking the query results it also takes into account the root factors that affect the similarity between relaxed results and the query conditions, and proposes a more efficient ranking method that reduces the time spent on filtering; the gain in efficiency is especially marked when the query conditions contain many attributes and few of them are relaxed.

It should be understood that the above is only a preferred embodiment of the invention; those skilled in the art can make improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A relaxation search and optimization sequencing method based on form features, characterized by comprising the following steps:

Triple OI = {DB_ID, Attribute, Order};

where DB_ID is the identifier that the system assigns to the data source of a form and uniquely identifies a query interface; Attribute is the name of an attribute and identifies that attribute; Order is the rank of the attribute in the form, that is, its position in the form;

(3) locate the attributes that the query interface is to contain;

(4) calculate the comprehensive order of each attribute;

4-1) take an attribute located in step (3) and, from its occurrence count (Attribute Count, AC) and the corresponding rank information, compute its comprehensive order (Comprehensive Order, CO) with the following formula and put it into the table COS,

CO(Attri) = \frac{\sum_{i=1}^{count} W_{imp}(DB_i) \cdot Order_{DB_i}(Attri)}{count}

where DB_i denotes a particular data source, and W_{imp}(a_i) is the weight of attribute a_i, that is, its importance; the attribute count (Attribute Count, AC) is the number of times a particular attribute occurs in all the collected forms, described as a pair AC = {Attribute, Count}, in which Attribute is the name of an attribute and identifies that attribute, and Count is the total number of times the attribute occurs in all the collected forms; the comprehensive order of an attribute (Comprehensive Order, CO) expresses the overall rank of the attribute once its rank information in every form has been taken into account;

(5) reorder the attributes according to their comprehensive order, the relative order of the attributes so obtained being the order of attribute importance in the domain;

the relaxation order is the reverse of the order of attribute importance; once the relaxation order is known, it is used to relax failed queries;

(6) rank and filter the result information, comprising the following steps:

6-1) obtain the relevant attributes;

for the result instances returned by a particular data source DB_i, the set of attributes that have lost their constraint is computed as:

CAS(DB_i) = RAS \cup (UAS(DB_i) \cap QAS)

where the relaxed attribute set RAS is the set of attributes relaxed during the query; the unsupported attribute set UAS is the set of attributes that the query interface of a data source does not support for querying; the query attribute set QAS is the set of attributes contained in the query request; and the calculation attribute set CAS is the set of attributes for which a distance must be calculated when the distance between a result and the query conditions is finally evaluated;

6-2) calculate the distance value;

Dist(Q, t) = \sum_{i=1}^{n} W_{imp}(a_i) \cdot Dist_{a_i}(Q, t)

W_{imp}(a_i) = \frac{RelaxOrder(a_i)}{|Attributes(R)|}

if a_i is a numerical attribute, the distance is:

Dist(q, t) = \frac{|q_{a_i} - t_{a_i}|}{\max_{a_i} - \min_{a_i}}

where \max_{a_i} and \min_{a_i} are respectively the maximum and minimum values that the attribute can take, q_{a_i} is the value of the query condition on the attribute, and t_{a_i} is the value of the result record on the attribute;

if a_i is a text attribute, the distance on the attribute is expressed as:

Dist(q_{a_i}, t_{a_i}) = \frac{d(q_{a_i}, t_{a_i})}{\max(|q_{a_i}|, |t_{a_i}|)}

where d(q_{a_i}, t_{a_i}) is the edit distance on attribute a_i, and |q_{a_i}| and |t_{a_i}| are the string lengths of the query value and the result value on attribute a_i;

6-3) rank and filter;

all the result instances returned in 6-2) are arranged in descending order according to Dist(Q, t), and the first K records are returned to the user as required.

2. The relaxation search and optimization sequencing method based on form features according to claim 1, characterized in that the step of locating the attributes that the query interface is to contain is:

3-1) set an attribute frequency threshold;