Summary of the invention
The objective of the invention is to overcome the deficiencies of the prior art by providing a relaxed search and optimized ranking method based on form features.
The purpose of the present invention is achieved through the following technical solution:
A relaxed search and optimized ranking method based on form features, characterized in that it comprises the following steps:
(1) Use a form collector to collect a large number of query forms, and for each form record all order-information triples OI = {DB_ID, Attribute, Order};
where DB_ID is the identifier that the system assigns to the data source containing a form and uniquely identifies a query interface; Attribute is the name of an attribute and identifies that attribute; and Order is the rank of the attribute in the form, i.e., its position in the form;
(2) Using a schema-matching method, map attributes whose names differ but which express the same semantics to a single attribute;
(3) Locate the attributes that the query interface contains;
(4) Calculate the overall ranking of each attribute;
4-1) Take an attribute located in step (3) and, according to its occurrence count AC and its corresponding order information, calculate the overall ranking CO of the attribute and put it into the table COS,
where the attribute occurrence count (Attribute Count, AC) is the number of times a particular attribute occurs across all the collected forms, described as the pair AC = {Attribute, Count}, in which Attribute is the name of an attribute and identifies that attribute, and Count is the total number of times the attribute occurs in all collected forms; and the overall ranking of an attribute (Comprehensive Order, CO) expresses the attribute's overall rank after its differing order information in each form has been taken into account;
4-2) Repeat step 4-1) until the overall ranking of all attributes has been calculated;
(5) Re-sort the attributes according to their overall rankings;
The resulting relative order of the attributes is the order of attribute importance in the given domain, and the relaxation order is the reverse of this order; once the relaxation order has been obtained, it is used to relax failed queries;
(6) Rank and filter the result information, comprising the following steps:
6-1) Obtain the relevant attributes;
For the result instances returned by a specific data source $DB_i$, the set of attributes whose constraints are lost is calculated as:
$CAS(DB_i) = RAS \cup (UAS(DB_i) \cap QAS)$
where the relaxed attribute set (Relax Attributes Set, RAS) is the set of attributes relaxed during the query; the unsupported query attribute set (Unsupported Attributes Set, UAS) is the set of attributes that the query interface provided by a data source does not support querying; the query attribute set (Query Attributes Set, QAS) is the set of attributes contained in the query request; and the calculation attribute set (Calculate Attributes Set, CAS) is the set of attributes that must be evaluated when the distance between a result and the query condition is finally considered;
6-2) Calculate the distance values;
For the attributes in the $CAS(DB_i)$ of the data source from 6-1), the distance values are calculated and summed in turn, where $w_{imp}(a_i)$ is the weight of attribute $a_i$, i.e., the importance of attribute $a_i$, and the distance term is the distance between the data record and the query condition on attribute $a_i$;
The weight $w_{imp}(a_i)$ of an attribute $a_i$ is a function of $RelaxOrder(a_i)$, the relaxation order of attribute $a_i$, and $|Attributes(R)|$, the total number of attributes;
When calculating distance values for the various attribute types, existing well-known calculation methods are used;
If $a_i$ is a Boolean or categorical attribute, the distance is 0 if the value of the attribute in the result equals the value of the query attribute, and 1 otherwise;
If $a_i$ is a numerical attribute, its distance is calculated from the maximum and minimum values that the attribute can take, the value of the query condition on the attribute, and the value of the result record on the attribute;
If $a_i$ is text, the distance on the attribute is expressed in terms of the edit distance on attribute $a_i$ and the string lengths of the query value and the result value on attribute $a_i$;
6-3) Rank and filter;
All result instances returned are sorted in descending order of the value calculated in 6-2), and the top K records (K being the number of records the user requires) are returned to the user.
Further, in the above relaxed search and optimized ranking method based on form features, the step of locating the attributes that the query interface contains is:
3-1) Set an attribute frequency threshold;
3-2) Take one attribute from the attributes obtained in step (2) and count the number of times it occurs;
3-3) If the attribute's occurrence count is greater than the set attribute frequency threshold, mark the attribute as belonging to the query interface; otherwise, examine the next attribute;
3-4) Repeat steps 3-2) and 3-3) until the attributes that the query interface contains have all been located.
The outstanding substantive features and significant progress of the technical solution of the present invention are mainly reflected in the following:
The present invention combines the features of Deep Web query forms and performs relaxed search using a method for determining the relaxation order of attributes in the Deep Web integration field. In ranking the query results, it takes into account the root causes that affect the similarity between the relaxed results and the query condition, and proposes a more efficient ranking method that reduces filtering time; the efficiency gain is especially significant when the query condition contains many attributes and few attributes are relaxed.
Embodiment
The present invention performs relaxed search using a method for determining the relaxation order of attributes in the Deep Web integration field; in ranking the query results, it takes into account the root causes that affect the similarity between the relaxed results and the query condition, and proposes a more efficient ranking method. The approach is mainly a relaxation method based on weakening attribute constraints. The key problem of such a method is how to determine the order in which attributes are relaxed: to obtain results as close as possible to the user's original query, the original query condition should be changed as little as possible, which requires finding, among the many attributes of the query, the attribute with the weakest influence on the query result. In the traditional database field, Nambiar et al., when addressing the conversion of precise queries into imprecise queries, proposed obtaining a random sample by probing the database and then using machine learning to derive the functional dependencies between attributes, from which the importance of each attribute is judged. In the Deep Web integration field, however, this method is difficult to apply. On the one hand, with numerous heterogeneous data sources it is difficult to obtain a suitable sample on a global scale; on the other hand, this way of judging attribute importance starts from the dependencies between database attributes and does not reflect the user's point of view well.
The relaxed search and optimized ranking method based on form features of the present invention is further described below with reference to the drawings and embodiments. As shown in Figure 3, it comprises the following steps:
(1) Use a form collector to collect a large number of relevant query forms, and for each form record all order-information triples OI = {DB_ID, Attribute, Order};
For a specific domain, a form collector is used to collect a large number of relevant query forms; the form collector can be implemented with a Deep Web focused crawler.
The steps for obtaining the triples are as follows:
1-1) Arrange the collected forms into a list, and arrange the attributes obtained from them into a table;
1-2) Take the next attribute, record its order-information triple OI = {DB_ID, Name, Order}, and put it into the table OIS;
where OI (Order Information) is the order information: the information about a given attribute in a given form can be described as the triple OI = {DB_ID, Attribute, Order}. DB_ID is the identifier that the system assigns to the data source containing the form and uniquely identifies a query interface; Attribute is the name of an attribute and identifies that attribute; and Order is the rank of the attribute in the form, i.e., its position in the form.
1-3) Repeat step 1-2) until the order-information triples of all attributes have been collected;
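Steps 1-1) to 1-3) can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the `OrderInfo` class, the `collect_order_info` function, the input layout (a mapping from DB_ID to attribute names in display order), the 1-based positions, and the sample forms are all assumptions made for the example.

```python
from typing import Dict, List, NamedTuple


class OrderInfo(NamedTuple):
    """One OI triple: data source, attribute name, position in the form."""
    db_id: str
    attribute: str
    order: int


def collect_order_info(forms: Dict[str, List[str]]) -> List[OrderInfo]:
    """Build the OIS table from forms given as {DB_ID: [names in display order]}."""
    ois = []
    for db_id, attributes in forms.items():
        # Assumption: positions are 1-based; the text does not fix the base.
        for position, name in enumerate(attributes, start=1):
            ois.append(OrderInfo(db_id, name, position))
    return ois


# Hypothetical forms collected from two data sources.
forms = {
    "db1": ["title", "author", "price"],
    "db2": ["author", "title"],
}
ois = collect_order_info(forms)
```

Each entry of `ois` corresponds to one row of the table OIS, so the same attribute can appear several times with different positions.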
(2) Using schema-matching techniques, map attributes whose names differ but which express the same semantics to a single attribute;
Different attribute names may express the same semantics; that is, the attribute names represented by Name in OIS may differ while denoting the same meaning. For example, "trade name" and "title" in Figure 1 have different attribute names but express the same attribute, so both need to be mapped to one attribute name for further consideration.
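The effect of step (2) on the OIS table can be illustrated with a lookup table of synonyms. The text treats schema matching as prior art and does not prescribe a technique, so the hand-written `CANONICAL_NAME` table below is only a stand-in for the output of a real matcher; the table contents, the function names, and the lower-casing of names are assumptions.

```python
# Hypothetical synonym table; in practice this would come from a schema matcher.
CANONICAL_NAME = {
    "trade name": "title",
    "title": "title",
    "writer": "author",
    "author": "author",
}


def normalize_attribute(name):
    """Map an attribute name to its canonical name; unknown names map to themselves."""
    key = name.strip().lower()
    return CANONICAL_NAME.get(key, key)


def normalize_triples(triples):
    """Rewrite (db_id, name, order) triples so that synonyms share one name."""
    return [(db_id, normalize_attribute(name), order)
            for db_id, name, order in triples]


triples = [("db1", "Trade Name", 1), ("db2", "title", 2), ("db2", "ISBN", 3)]
normalized = normalize_triples(triples)
```

After this step, "Trade Name" and "title" are counted as the same attribute in the later frequency and ranking steps.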
(3) Locate the attributes that the query interface is to contain;
3-1) Set an attribute frequency threshold;
3-2) Take one attribute from the attributes obtained in step (2) and count the number of times it occurs;
3-3) If the attribute's occurrence count is greater than the set attribute frequency threshold, mark the attribute as belonging to the query interface; otherwise, examine the next attribute;
3-4) Repeat steps 3-2) and 3-3) until the attributes that the query interface contains have all been located;
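Steps 3-1) to 3-4) amount to counting occurrences and keeping the attributes whose count exceeds the threshold. A minimal sketch, assuming the triples are already normalized and are given as `(db_id, name, order)` tuples; the function name and the sample data are hypothetical:

```python
from collections import Counter


def locate_interface_attributes(triples, threshold):
    """Return the attributes whose occurrence count is greater than the threshold."""
    # Count how many collected forms mention each attribute name.
    counts = Counter(name for _db_id, name, _order in triples)
    # Strictly greater than the threshold, as stated in step 3-3).
    return {name for name, count in counts.items() if count > threshold}


# Hypothetical normalized triples from three data sources.
triples = [
    ("db1", "title", 1), ("db1", "author", 2), ("db1", "isbn", 3),
    ("db2", "author", 1), ("db2", "title", 2),
    ("db3", "title", 1),
]
interface_attributes = locate_interface_attributes(triples, threshold=1)
```

Here "isbn" occurs only once and is therefore excluded from the integrated query interface.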
(4) Calculate the overall ranking of each attribute;
4-1) Take an attribute located in step (3) and, according to its occurrence count AC and its corresponding order information, calculate the overall ranking CO of the attribute and put it into the table COS,
where the attribute occurrence count (Attribute Count, AC) is the number of times a particular attribute occurs across all the collected forms, described as the pair AC = {Attribute, Count}, in which Attribute is the name of an attribute and identifies that attribute, and Count is the total number of times the attribute occurs in all collected forms. The overall ranking of an attribute (Comprehensive Order, CO) expresses the attribute's overall rank after its differing order information in each form has been taken into account;
4-2) Repeat step 4-1) until the overall ranking of all attributes has been calculated;
(5) Re-sort the attributes according to their overall rankings;
The resulting relative order of the attributes is taken to be the order of attribute importance in the given domain, and the relaxation order is the reverse of this order. Once the relaxation order has been obtained, it can be used to relax failed queries. However, relaxation tends to produce a large amount of data with low relevance to the original query, whereas users usually care about the records closest to the original query, so the relaxed results need to be filtered.
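Step (5) and the derivation of the relaxation order can be sketched as a sort followed by a reversal. The sketch takes the CO values as given and does not compute them. It also assumes, purely for illustration, that a smaller CO value means a more important attribute; the text does not state the direction, so the flag `smaller_is_more_important` is a hypothetical parameter, as are the function name and the sample values.

```python
def relaxation_order(cos, smaller_is_more_important=True):
    """Return attribute names in the order in which they should be relaxed.

    cos maps attribute name -> overall ranking CO (values taken as given).
    """
    # Importance order: most important attribute first.
    importance = sorted(cos, key=cos.get, reverse=not smaller_is_more_important)
    # The relaxation order is the reverse of the importance order.
    return list(reversed(importance))


# Hypothetical CO values for three attributes.
cos = {"title": 1.2, "author": 1.9, "price": 3.4}
order = relaxation_order(cos)
```

Under the stated assumption, "price" is the least important attribute and is relaxed first, while "title" is relaxed last.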
(6) Rank and filter the result information, comprising the following steps:
6-1) Obtain the relevant attributes;
For the result instances returned by a specific data source $DB_i$, the set of attributes whose constraints are lost is calculated as:
$CAS(DB_i) = RAS \cup (UAS(DB_i) \cap QAS)$
where the relaxed attribute set (Relax Attributes Set, RAS) is the set of attributes relaxed during the query. The unsupported query attribute set (Unsupported Attributes Set, UAS) is the set of attributes that the query interface provided by a data source does not support querying. The query attribute set (Query Attributes Set, QAS) is the set of attributes contained in the query request. The calculation attribute set (Calculate Attributes Set, CAS) is the set of attributes that must be evaluated when the distance between a result and the query condition is finally considered;
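The set expression for $CAS(DB_i)$ maps directly onto set operations. A minimal sketch; the function name and the sample attribute sets are hypothetical:

```python
def calculate_attribute_set(ras, uas, qas):
    """CAS(DB_i) = RAS ∪ (UAS(DB_i) ∩ QAS)."""
    # Attributes relaxed on purpose, plus queried attributes that this
    # data source's interface could not constrain.
    return set(ras) | (set(uas) & set(qas))


ras = {"price"}                                  # relaxed during the query
uas_db1 = {"publisher", "isbn"}                  # not supported by data source db1
qas = {"title", "author", "price", "publisher"}  # attributes in the query request
cas_db1 = calculate_attribute_set(ras, uas_db1, qas)
```

Only "price" and "publisher" need distance calculations for results from this data source; "isbn" is unsupported but was not part of the query, and "title" and "author" were enforced by the source itself.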
6-2) Calculate the distance values;
For the attributes in the $CAS(DB_i)$ of the data source from 6-1), the distance values are calculated and summed in turn, where $w_{imp}(a_i)$ is the weight of attribute $a_i$, i.e., the importance of attribute $a_i$, and the distance term is the distance between the data record and the query condition on attribute $a_i$.
The attribute weight $w_{imp}(a_i)$ is calculated using an existing method: for an attribute $a_i$, the weight is a function of $RelaxOrder(a_i)$, the relaxation order of attribute $a_i$, and $|Attributes(R)|$, the total number of attributes.
When calculating distance values for the various attribute types, existing well-known calculation methods are used.
If $a_i$ is a Boolean or categorical attribute, the distance is 0 if the value of the attribute in the result equals the value of the query attribute, and 1 otherwise.
If $a_i$ is a numerical attribute, its distance is calculated from the maximum and minimum values that the attribute can take, the value of the query condition on the attribute, and the value of the result record on the attribute.
If $a_i$ is text, the distance on the attribute is expressed in terms of the edit distance on attribute $a_i$ and the string lengths of the query value and the result value on attribute $a_i$.
6-3) Rank and filter;
All result instances returned are sorted in descending order of the value calculated in 6-2), and the top K records required by the user are returned to the user;
In this way, calculation is performed only on the required attributes, which saves computing resources and reduces filtering time; the efficiency gain is especially evident when the query condition contains many attributes and few attributes are relaxed.
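The final selection in 6-3) can be sketched as a top-K selection over scored records. The sketch assumes each record already carries the value computed in 6-2), represented here by a hypothetical `score` field, and follows the text in ranking in descending order; the function name and the sample records are also hypothetical.

```python
import heapq


def top_k(scored_records, k):
    """Return the k (record, score) pairs with the highest score, highest first."""
    # heapq.nlargest avoids sorting the full result list when k is small.
    return heapq.nlargest(k, scored_records, key=lambda pair: pair[1])


# Hypothetical (record id, score) pairs produced by step 6-2).
scored = [("r1", 0.2), ("r2", 0.9), ("r3", 0.5), ("r4", 0.7)]
best = top_k(scored, 2)
```

Using a heap-based selection instead of a full sort is a design choice of this sketch; it matters only when the number of returned records is much larger than K.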
In the above technical solution, using schema-matching techniques to map attributes whose names differ but which express the same semantics to a single attribute is prior art.
Because the above technical solution is concerned only with the attributes that the integrated query interface is to contain, attributes that occur only in certain particular data sources are uniformly filtered out. After the attribute names have been uniformly mapped, the number of occurrences of each attribute Name is counted, and a threshold w is set to filter out the attributes that occur only in particular data sources.
In summary, the present invention combines the features of Deep Web query forms and performs relaxed search using a method for determining the relaxation order of attributes in the Deep Web integration field. In ranking the query results, it takes into account the root causes that affect the similarity between the relaxed results and the query condition, and proposes a more efficient ranking method that reduces filtering time; the efficiency gain is especially evident when the query condition contains many attributes and few attributes are relaxed.
It should be understood that the above is only a preferred embodiment of the present invention. Those skilled in the art may make improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.