
CN111177327A - Document boltzmann machine construction optimization method and device for document query - Google Patents


Info

Publication number
CN111177327A
CN111177327A (application CN201811339382.5A)
Authority
CN
China
Prior art keywords
document
model
boltzmann machine
machine model
boltzmann
Prior art date
Legal status
Pending
Application number
CN201811339382.5A
Other languages
Chinese (zh)
Inventor
黄历铭
李昌盛
杨传书
何江
Current Assignee
China Petroleum and Chemical Corp
Sinopec Research Institute of Petroleum Engineering
Original Assignee
China Petroleum and Chemical Corp
Sinopec Research Institute of Petroleum Engineering
Priority date
Filing date
Publication date
Application filed by China Petroleum and Chemical Corp, Sinopec Research Institute of Petroleum Engineering filed Critical China Petroleum and Chemical Corp
Priority to CN201811339382.5A
Publication of CN111177327A


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for constructing and optimizing a document Boltzmann machine for document query, which comprises the following steps: sampling the selected document to obtain a plurality of groups of text fragments, and gathering the plurality of groups of text fragments to obtain a sample set; performing model learning processing according to the sample set to obtain a document Boltzmann machine model corresponding to the selected document; and performing optimization processing on the generated document Boltzmann machine model through a Bayesian information criterion to obtain the optimized document Boltzmann machine model. The document boltzmann machine construction and optimization method for document query, provided by the invention, applies the boltzmann machine to the field of document query, can naturally capture the dependency relationship among terms, and generalizes the distribution hypothesis used by a traditional language model. More effective query likelihood can be obtained, and the retrieval accuracy is improved.

Description

Document boltzmann machine construction optimization method and device for document query
Technical Field
The invention relates to the field of information retrieval, in particular to a method and a device for constructing and optimizing a document Boltzmann machine for document query.
Background
In recent years, with the rapid development of internet technology, information on the internet grows at an exponential rate, and information resources on the internet are greatly enriched. However, it is becoming more and more difficult to screen the information needed by the user from the massive information, which not only relates to the speed of the search, but also must consider the accuracy and validity of the search result and whether the user's needs can be met.
In the field of information retrieval, probabilistic language models have been widely used. The language model estimates the document model under the multinomial distribution assumption, and then ranks documents by relevance using the query likelihood, wherein the language model assumes that the terms in a document are independent of each other. As a result, no existing probabilistic language model for document query captures the dependency relationships among terms.
Therefore, the invention provides a method and a device for constructing and optimizing a document Boltzmann machine for document query.
Disclosure of Invention
In order to solve the above problems, the present invention provides a document Boltzmann machine construction and optimization method for document query, the method comprising the following steps:
sampling the selected document to obtain a plurality of groups of text fragments, and gathering the plurality of groups of text fragments to obtain a sample set;
performing model learning processing according to the sample set to obtain a document Boltzmann machine model corresponding to the selected document;
and performing optimization processing on the generated document Boltzmann machine model through a Bayesian information criterion to obtain the optimized document Boltzmann machine model.
According to an embodiment of the present invention, the step of sampling the selected document to obtain a plurality of groups of text segments further includes the following steps:
sampling is carried out by using a sliding window to obtain text segments that may overlap, wherein the size of the sliding window is a first preset value, and the step length of the sliding window is a second preset value.
According to an embodiment of the present invention, the step of performing model learning processing according to the sample set further includes the following steps:
model learning processing is performed by a maximum likelihood function to obtain a learning objective function as shown below:
$$L(W;X)=\sum_{x\in X}\log p(x;W)$$
$$p(x;W)=\frac{1}{Z(W)}\exp[-E(x;W)]$$
$$E(x;W)=-\sum_i w_i^{(1)}x_i-\sum_{i<j}w_{ij}^{(2)}x_ix_j-\sum_{i<j<k}w_{ijk}^{(3)}x_ix_jx_k$$
wherein L(W; X) represents the learning objective function, Z(W) represents the partition function, X represents the sample set, W represents the parameter set of the document Boltzmann machine model, and $w_1$, $w_2$, $w_3$ represent subsets of W; the first-order parameter $w_i^{(1)}$ represents the state of node i, the second-order parameter $w_{ij}^{(2)}$ represents the degree of association between node i and node j, and the third-order parameter $w_{ijk}^{(3)}$ represents the degree of association among node i, node j and node k.
According to an embodiment of the invention, the step of performing optimization processing through a Bayesian information criterion to obtain the optimized document Boltzmann machine model further comprises the following steps:
obtaining a plurality of small models corresponding to the document Boltzmann machine model and an edge probability value corresponding to a single small model according to the relevance of nodes in the document Boltzmann machine model;
and selecting the maximum value P of the edge probability values according to the edge probability values of the small models, and recording the small model corresponding to the maximum value P as the optimized document Boltzmann machine model.
According to an embodiment of the present invention, when the document Boltzmann machine model includes b nodes, the step of obtaining a plurality of small models corresponding to the document Boltzmann machine model and an edge probability value corresponding to a single small model further includes the following steps:
when the association value between any one pair of nodes in the document Boltzmann machine model is set to zero, obtaining the corresponding $\binom{b}{2}$ different small models and the edge probability value of each small model, and selecting the maximum value P1 of the edge probability values;
when two pairs, three pairs, four pairs, and so on up to $\binom{b}{2}$ pairs of nodes in the document Boltzmann machine model have their association values set to zero, obtaining the plurality of different small models corresponding to the document Boltzmann machine model and the edge probability value of each small model, and selecting the maximum values P2, P3, P4, … of the edge probability values at each level.
According to another aspect of the present invention, there is also provided a document Boltzmann machine construction and optimization apparatus for document query, the apparatus including:
the system comprises a sample set module, a document selection module and a document analysis module, wherein the sample set module is used for sampling a selected document to obtain a plurality of groups of text fragments, and collecting the plurality of groups of text fragments to obtain a sample set;
the document Boltzmann machine model module is used for performing model learning processing according to the sample set to obtain a document Boltzmann machine model corresponding to the selected document;
and the optimization module is used for carrying out optimization processing on the generated document Boltzmann machine model through a Bayesian information criterion to obtain the optimized document Boltzmann machine model.
According to one embodiment of the invention, the sample set module comprises:
the sampling processing unit is used for carrying out sampling processing by using a sliding window to obtain text segments that may overlap, wherein the size of the sliding window is a first preset value, and the step length of the sliding window is a second preset value.
According to one embodiment of the invention, the document boltzmann model module comprises:
a target learning function unit for performing model learning processing by a maximum likelihood function to obtain a learning objective function as shown below:
$$L(W;X)=\sum_{x\in X}\log p(x;W)$$
$$p(x;W)=\frac{1}{Z(W)}\exp[-E(x;W)]$$
$$E(x;W)=-\sum_i w_i^{(1)}x_i-\sum_{i<j}w_{ij}^{(2)}x_ix_j-\sum_{i<j<k}w_{ijk}^{(3)}x_ix_jx_k$$
wherein L(W; X) represents the learning objective function, Z(W) represents the partition function, X represents the sample set, W represents the parameter set of the document Boltzmann machine model, and $w_1$, $w_2$, $w_3$ represent subsets of W; the first-order parameter $w_i^{(1)}$ represents the state of node i, the second-order parameter $w_{ij}^{(2)}$ represents the degree of association between node i and node j, and the third-order parameter $w_{ijk}^{(3)}$ represents the degree of association among node i, node j and node k.
According to one embodiment of the invention, the optimization module comprises:
the edge probability value unit is used for obtaining a plurality of small models corresponding to the document Boltzmann machine model and an edge probability value corresponding to a single small model according to the relevance of nodes in the document Boltzmann machine model;
and the selecting unit is used for selecting the maximum value P of the edge probability values according to the edge probability values of the small models, and recording the small model corresponding to the maximum value P as the optimized document Boltzmann machine model.
According to an embodiment of the present invention, when the document boltzmann model includes b nodes, the edge probability value unit includes:
a single-group subunit, used for obtaining, when the association value between any one pair of nodes in the document Boltzmann machine model is set to zero, the corresponding $\binom{b}{2}$ different small models and the edge probability value of each small model, and selecting the maximum value P1 of the edge probability values;
a multi-group subunit, used for obtaining, when two pairs, three pairs, four pairs, and so on up to $\binom{b}{2}$ pairs of nodes in the document Boltzmann machine model have their association values set to zero, the plurality of different small models corresponding to the document Boltzmann machine model and the edge probability value of each small model, and selecting the maximum values P2, P3, P4, … of the edge probability values at each level.
The document Boltzmann machine construction and optimization method for document query provided by the invention applies a Boltzmann machine to the field of document query, can naturally capture the dependency relationships among terms, and generalizes the distribution hypothesis used by the traditional language model. A more effective query likelihood can be obtained, and retrieval accuracy is improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 shows a flowchart of a document Boltzmann machine construction and optimization method for document query according to one embodiment of the invention;
FIG. 2 shows a schematic diagram of obtaining a document Boltzmann machine model in the document Boltzmann machine construction and optimization method for document query according to one embodiment of the invention;
FIG. 3 shows a schematic diagram of a document Boltzmann machine model optimized for document queries according to another embodiment of the invention; and
FIG. 4 shows a block diagram of an apparatus for optimizing the construction of a document Boltzmann machine for document query according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
FIG. 1 shows a flowchart of a document Boltzmann machine construction and optimization method for document query, according to one embodiment of the invention.
As shown in fig. 1, in step S101, a selected document is sampled to obtain a plurality of groups of text segments, and the obtained plurality of groups of text segments are collected to obtain a sample set. In one embodiment, a sliding window is used for sampling to obtain the text segments that can be overlapped, wherein the size of the sliding window is a first preset value, and the step length of the sliding window is a second preset value.
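The sampling in step S101 can be sketched as follows. This is a minimal illustration, not code from the patent; the function name, the toy document, and the small window size (the embodiment uses a window of 16 and a step of 1) are all assumptions made for demonstration.

```python
def sample_fragments(tokens, vocab, window=16, step=1):
    """Slide a fixed-size window over a token list and encode each
    fragment as a binary term-presence vector over the vocabulary."""
    index = {term: i for i, term in enumerate(vocab)}
    samples = []
    for start in range(0, max(len(tokens) - window, 0) + 1, step):
        fragment = tokens[start:start + window]
        vec = [0] * len(vocab)  # one dimension per vocabulary term
        for tok in fragment:
            if tok in index:
                vec[index[tok]] = 1  # term occurs in this fragment
        samples.append(vec)
    return samples

doc = "drilling fluid density controls wellbore pressure during drilling".split()
vocab = sorted(set(doc))  # 7 distinct terms
X = sample_fragments(doc, vocab, window=4, step=1)
print(len(X), len(X[0]))
```

Overlapping fragments (step smaller than the window) give many samples from a single document, which is what makes the subsequent maximum-likelihood learning feasible.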
Then, in step S102, model learning processing is performed according to the sample set, and a document boltzmann model corresponding to the selected document is obtained. In one embodiment, the model learning process is performed by a maximum likelihood function to obtain a learning objective function as shown below:
$$L(W;X)=\sum_{x\in X}\log p(x;W)$$
$$p(x;W)=\frac{1}{Z(W)}\exp[-E(x;W)]$$
$$E(x;W)=-\sum_i w_i^{(1)}x_i-\sum_{i<j}w_{ij}^{(2)}x_ix_j-\sum_{i<j<k}w_{ijk}^{(3)}x_ix_jx_k$$
wherein L(W; X) represents the learning objective function, Z(W) represents the partition function, X represents the sample set, W represents the parameter set of the document Boltzmann machine model, and $w_1$, $w_2$, $w_3$ represent subsets of W; the first-order parameter $w_i^{(1)}$ represents the state of node i, the second-order parameter $w_{ij}^{(2)}$ represents the degree of association between node i and node j, and the third-order parameter $w_{ijk}^{(3)}$ represents the degree of association among node i, node j and node k.
Finally, in step S103, the generated document boltzmann model is optimized according to the bayesian information criterion, so as to obtain an optimized document boltzmann model. In one embodiment, according to the relevance of nodes in the document boltzmann model, a plurality of small models corresponding to the document boltzmann model and edge probability values corresponding to a single small model are obtained. And selecting the maximum value P of the edge probability values according to the edge probability values of the small models, and recording the small model corresponding to the maximum value P as the optimized document Boltzmann machine model.
When the document Boltzmann machine model contains b nodes, in one embodiment, when the association value between any one pair of nodes in the model is set to zero, the corresponding $\binom{b}{2}$ different small models and the edge probability value of each small model are obtained, and the maximum value P1 of the edge probability values is selected; when two pairs, three pairs, four pairs, and so on up to $\binom{b}{2}$ pairs of nodes in the model have their association values set to zero, the plurality of different small models corresponding to the document Boltzmann machine model and the edge probability value of each small model are obtained, and the maximum values P2, P3, P4, … of the edge probability values are selected at each level.
FIG. 2 shows a schematic diagram of obtaining a document Boltzmann machine model in the document Boltzmann machine construction and optimization method for document query according to one embodiment of the invention.
Documents are modeled using the fully visible Boltzmann machine (BM), with the goal of naturally capturing the dependencies between terms and generalizing the distribution hypothesis used by traditional language models. The model represents a class of Boltzmann distributions that is more general than the multinomial distribution, thereby obtaining a more effective query likelihood and improving retrieval accuracy.
As shown in FIG. 2, for any document $d_i$, overlappable document fragments are obtained using a sliding window, with the window size set to a fixed value σ and the sliding step size set to 1. Each dimension of a document fragment indicates whether the corresponding term occurs in the current fragment: if a term appears in the fragment, the corresponding dimension of the vector representation takes the value 1, and otherwise 0. The obtained sample set X is used to learn the model representing document $d_i$. The learning method selects the most basic maximum likelihood function, from which the learning objective function, namely the log likelihood of the samples, can be obtained:
$$L(W;X)=\sum_{x\in X}\log p(x;W)$$
wherein $Z(W)=\sum_{x}\exp[-E(x;W)]$ is the partition function;
$$p(x;W)=\frac{1}{Z(W)}\exp[-E(x;W)]$$
$$E(x;W)=-\sum_i w_i^{(1)}x_i-\sum_{i<j}w_{ij}^{(2)}x_ix_j-\sum_{i<j<k}w_{ijk}^{(3)}x_ix_jx_k$$
X is the sample set, and $w_1$, $w_2$, $w_3$, etc. are different subsets of the parameter set W. The first-order parameter $w_i^{(1)}$ models the state of node i, the second-order parameter $w_{ij}^{(2)}$ models the degree of association between node i and node j, and higher-order parameters can model the degree of connection among multiple nodes. Thus, when each node represents the state of a term, the Boltzmann machine can naturally model the dependencies between terms.
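The energy function and the resulting Boltzmann distribution can be illustrated with a small, fully enumerable example. This sketch is not from the patent: it keeps only the first- and second-order parameters (third-order terms are omitted), all parameter values are invented, and exact computation of the partition function by enumerating all $2^n$ states is feasible only for very small vocabularies.

```python
import itertools
import math

def energy(x, w1, w2):
    # E(x;W) = -sum_i w1[i]*x_i - sum_{i<j} w2[(i,j)]*x_i*x_j
    e = -sum(w1[i] * x[i] for i in range(len(x)))
    e -= sum(w2[(i, j)] * x[i] * x[j] for (i, j) in w2)
    return e

def prob(x, w1, w2):
    # p(x;W) = exp(-E(x;W)) / Z(W), with Z computed by brute-force
    # enumeration of all 2^n binary states (small n only)
    n = len(x)
    Z = sum(math.exp(-energy(s, w1, w2))
            for s in itertools.product((0, 1), repeat=n))
    return math.exp(-energy(x, w1, w2)) / Z

w1 = [0.2, -0.1, 0.05]                         # first-order parameters
w2 = {(0, 1): 0.8, (0, 2): 0.0, (1, 2): -0.3}  # second-order parameters
p = prob((1, 1, 0), w1, w2)
```

The positive parameter for the pair (0, 1) raises the probability of fragments in which both terms co-occur, which is exactly the term-dependency effect the patent describes.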
Updating parameters according to the gradient of the objective function in the learning process:
$$\frac{\partial L(W;X)}{\partial w}=\langle\cdot\rangle_0-\langle\cdot\rangle_M$$
wherein $\langle\cdot\rangle_0$ and $\langle\cdot\rangle_M$ respectively represent mean values under the sample distribution and the current model distribution (for example, $\partial L/\partial w_{ij}^{(2)}=\langle x_ix_j\rangle_0-\langle x_ix_j\rangle_M$). For any document $d_i$, the corresponding document Boltzmann machine model is $BM_i$, and the probability of any document fragment x of this document is:
$$\log p(x|d_i)=\log p(x|BM_i)=\log p(x;W_i)$$
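The moment-matching gradient update can be sketched for a tiny model, computing both averages exactly by enumeration instead of sampling. The function names, learning rate, and toy sample set are illustrative assumptions; realistic vocabularies require approximate expectations.

```python
import itertools
import math

def energy(x, w1, w2):
    e = -sum(w1[i] * x[i] for i in range(len(x)))
    e -= sum(w2[(i, j)] * x[i] * x[j] for (i, j) in w2)
    return e

def model_distribution(w1, w2, n):
    # exact p(s;W) over all 2^n states (small n only)
    states = list(itertools.product((0, 1), repeat=n))
    weights = [math.exp(-energy(s, w1, w2)) for s in states]
    Z = sum(weights)
    return [(s, w / Z) for s, w in zip(states, weights)]

def gradient_step(X, w1, w2, lr=0.5):
    # dL/dw_i  = <x_i>_0     - <x_i>_M
    # dL/dw_ij = <x_i x_j>_0 - <x_i x_j>_M
    dist = model_distribution(w1, w2, len(w1))
    for i in range(len(w1)):
        data_m = sum(x[i] for x in X) / len(X)
        model_m = sum(p * s[i] for s, p in dist)
        w1[i] += lr * (data_m - model_m)
    for (i, j) in w2:
        data_m = sum(x[i] * x[j] for x in X) / len(X)
        model_m = sum(p * s[i] * s[j] for s, p in dist)
        w2[(i, j)] += lr * (data_m - model_m)

X = [(1, 1, 0), (1, 1, 0), (0, 1, 1), (1, 0, 0)]  # toy sample set
w1 = [0.0, 0.0, 0.0]
w2 = {(0, 1): 0.0, (0, 2): 0.0, (1, 2): 0.0}
for _ in range(200):
    gradient_step(X, w1, w2)
```

After training, the model's moments approach the sample moments; for example, terms 0 and 2 never co-occur in the toy data, so their pairwise parameter is pushed negative.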
the document boltzmann model DBM can be regarded as a probability distribution of dynamic system states, a query also represents a state of the system, and the system presents the state with a certain probability. In other words, since a query can be regarded as a value input by the boltzmann machine, its probability, i.e., query likelihood, can be directly calculated.
In some cases, all the terms appearing in the vocabulary are included in the structure of the Boltzmann machine, but this places a large burden on model learning and probability calculation. In the model implementation, given a query $Q=(q_1,q_2,q_3,\dots,q_i,\dots)$, a node is set up for each query word $q_i$. Once the Boltzmann machine model BM (with parameter values W) is obtained for document $d_i$, the query likelihood can be obtained:
$$p(Q|d_i)=p(x_Q;W_i)$$
wherein $x_Q$ is a vector in which the dimensions corresponding to all the query terms appearing in query Q take the value 1. This form of query generation is obviously different from the traditional language model: independence among terms is not assumed; instead, the query is regarded as a text fragment, and its generation probability under each document Boltzmann machine is calculated by taking the fragment as a whole.
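The query-likelihood computation can be sketched as follows. The vocabulary, parameter values, and queries are invented for illustration; the point is that the query is scored as a single fragment $x_Q$ rather than term by term.

```python
import itertools
import math

def energy(x, w1, w2):
    e = -sum(w1[i] * x[i] for i in range(len(x)))
    e -= sum(w2[(i, j)] * x[i] * x[j] for (i, j) in w2)
    return e

def query_likelihood(query_terms, vocab, w1, w2):
    # Build x_Q: dimensions of the query terms are set to 1,
    # then score the whole fragment with p(Q|d) = p(x_Q; W).
    x_q = tuple(1 if t in query_terms else 0 for t in vocab)
    Z = sum(math.exp(-energy(s, w1, w2))
            for s in itertools.product((0, 1), repeat=len(vocab)))
    return math.exp(-energy(x_q, w1, w2)) / Z

# made-up document model over a 3-term vocabulary
vocab = ["drill", "fluid", "pressure"]
w1 = [0.3, 0.1, -0.2]
w2 = {(0, 1): 0.9, (0, 2): 0.0, (1, 2): 0.2}

s1 = query_likelihood({"drill", "fluid"}, vocab, w1, w2)
s2 = query_likelihood({"drill", "pressure"}, vocab, w1, w2)
```

Because the pair ("drill", "fluid") has a strong positive association parameter, the first query receives a higher likelihood than the second even though both contain two terms.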
The invention not only makes full use of the strong multidimensional data-distribution modeling capability of the Boltzmann machine, but also better conforms to the user's search intention from a cognitive perspective: the non-independent terms input by the user express a single concept in the form of a text fragment.
In one embodiment, the method provided by the invention is verified on four standard public TREC data sets: AP8889 (queries 151-200), WSJ8792 (queries 151-200), ROBUST (queries 601-700), and WT10G (queries 501-550). The four data sets are indexed using Indri 5.1, and the text is stemmed with the Porter stemmer before indexing, with stopword removal disabled. The re-ranking experiment is carried out on the top 50 documents returned by a first-round query, where the first-round results are ranked by language model query likelihood. The window size used by the document Boltzmann machine to sample text segments is set to 16.
In this embodiment, the retrieval effects of different document models are compared without using any smoothing method. Two language models are chosen as comparison baselines, with the parameter μ in the Dirichlet smoothing method set to 0 (denoted LM(μ=0)) and to a small value ε (denoted LM(μ=ε)). The models under test include a document Boltzmann machine with only first-order parameters, denoted 1-DBM, and a document Boltzmann machine with first- and second-order parameters, denoted 2-DBM. Both tested models set the smoothing parameter to 0 (to eliminate the smoothing effect). The results of the model comparison without smoothing are shown in Table 1. From the results, it can be seen that, without the influence of smoothing, the document Boltzmann machines (1-DBM, 2-DBM) outperform the language models (LM(μ=0), LM(μ=ε)) on most data sets. In addition, the performance of 2-DBM is always better than that of 1-DBM, which shows that using the higher-order parameters of the document Boltzmann machine as a supplement to the lower-order parameters is effective in improving the overall retrieval effect.
TABLE 1 comparison of search results for different document models
The retrieval method based on the Boltzmann machine (BM) provided by the invention uses the fully visible Boltzmann machine for document modeling, with the aim of generalizing the distribution hypothesis used by the traditional language model. The model can represent a class of Boltzmann distributions that is more general than the multinomial distribution. The proposed document Boltzmann machine (DBM) can naturally capture the relations between terms, and thereby obtain a more effective query likelihood.
FIG. 3 shows a schematic diagram of a document Boltzmann machine model optimized by the document Boltzmann machine construction and optimization method for document query according to another embodiment of the present invention.
The Bayesian information criterion (BIC) is a basic method in statistical model selection. Its basic idea is as follows: 1. the parametric expression of the class-conditional probability density and the prior probability are known; 2. they are converted into a posterior probability by the Bayes formula; 3. decision classification is carried out according to the posterior probability.
In selecting a model, the Bayesian approach is to maximize the posterior probability of model $M_i$ given the data $(y_1,\dots,y_n)$. According to Bayes' theorem:
$$P(M_i|y_1,\dots,y_n)=\frac{P(y_1,\dots,y_n|M_i)\,P(M_i)}{P(y_1,\dots,y_n)}$$
wherein $P(y_1,\dots,y_n|M_i)$ is the marginal (edge) probability of the data under model $M_i$. Given the data, $P(y_1,\dots,y_n)$ is a constant, and when no model is preferred a priori, $P(M_i)$ is also a constant. Maximizing the posterior probability of model $M_i$ is therefore equivalent to maximizing the marginal probability $P(y_1,\dots,y_n|M_i)$.
When the document Boltzmann machine DBM is optimized using the Bayesian information criterion BIC,
$$P(y_1,\dots,y_n|M_i)=\int_{\Theta_i}L(\Theta_i|y_1,\dots,y_n)\,g_i(\Theta_i)\,d\Theta_i$$
wherein $\Theta_i$ is the parameter vector of the DBM model, namely the set of association values between any two nodes in the DBM, $g_i(\Theta_i)$ is the prior density of those parameters, and the data $(y_1,\dots,y_n)$ is the set of nodes in the DBM model.
It should be noted that the nodes in the DBM model refer to keywords, of which there are two types: 1. the keywords input by the user; 2. among the top n documents retrieved by the language model LM, the n words (other than the keywords input by the user) with the highest occurrence probability.
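A common closed-form approximation of the marginal probability $P(y_1,\dots,y_n|M_i)$ is the BIC score itself; the sketch below compares two hypothetical fitted models this way. The fitted log-likelihood values and parameter counts are invented for illustration and are not from the patent.

```python
import math

def bic(max_log_likelihood, num_params, num_samples):
    # BIC = k*ln(n) - 2*ln(L-hat); it approximates -2*ln P(data|M),
    # so a smaller BIC means a larger approximate marginal probability.
    return num_params * math.log(num_samples) - 2 * max_log_likelihood

# hypothetical fits: a full DBM vs. one with a pairwise association
# pruned (one fewer parameter, slightly worse maximized likelihood)
full   = bic(max_log_likelihood=-120.0, num_params=6, num_samples=50)
pruned = bic(max_log_likelihood=-121.5, num_params=5, num_samples=50)
```

Here the penalty for the extra parameter outweighs its small gain in fit, so the pruned model — the "small model" in the patent's terminology — wins the comparison.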
The optimization shown in FIG. 3 comprises the following specific steps (assume there are k nodes in the DBM model):
The association value between one pair of nodes of the DBM is set to 0. There are $\binom{k}{2}$ such cases, corresponding to $\binom{k}{2}$ different models, and each model corresponds to an edge probability value $P(y_1,\dots,y_n|M_i)$. The maximum edge probability value is selected and recorded as $P_1$, and the corresponding model is saved and recorded as $M_1$.
The association values of two pairs of nodes in the DBM are set to 0. There are $\binom{\binom{k}{2}}{2}$ such cases, corresponding to $\binom{\binom{k}{2}}{2}$ different models, and each model corresponds to an edge probability value $P(y_1,\dots,y_n|M_i)$. The maximum edge probability value is selected and recorded as $P_2$, and the corresponding model is saved and recorded as $M_2$.
And so on for three pairs, four pairs, …, up to $\binom{k}{2}$ pairs of nodes in the DBM model whose association values are set to 0.
Finally, the association values of all $\binom{k}{2}$ pairs of nodes in the DBM model are set to 0; the maximum edge probability value is selected and recorded as $P_j$, and the corresponding model is saved and recorded as $M_j$.
From the above edge probability values $\{P_1,P_2,\dots,P_j\}$, the maximum edge probability value P is selected, and the corresponding model M is the optimal model.
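The enumeration of candidate "small models" described above can be sketched by counting, for each level m, the ways of zeroing m pairwise association values. Scoring each candidate with its edge probability is omitted here, and k = 4 is an arbitrary illustrative choice.

```python
from itertools import combinations
from math import comb

k = 4                                    # nodes in the DBM
pairs = list(combinations(range(k), 2))  # the C(k,2) node pairs

# Level m: set the association values of m pairs to zero;
# there are C(C(k,2), m) candidate small models at that level.
level_counts = [comb(len(pairs), m) for m in range(1, len(pairs) + 1)]

# Enumerate every candidate, identified by its set of zeroed pairs.
candidates = [frozenset(zeroed)
              for m in range(1, len(pairs) + 1)
              for zeroed in combinations(pairs, m)]
```

The counts grow as $2^{\binom{k}{2}}-1$ in total, which is why the patent's level-by-level maximum ($P_1, P_2, \dots$) is a practical way to organize the search rather than scoring all subsets at once.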
FIG. 4 shows a block diagram of an apparatus for optimizing the construction of a document Boltzmann machine for document query according to an embodiment of the present invention.
As shown in fig. 4, the apparatus comprises: a sample set module 401, a document boltzmann machine model module 402, and an optimization module 403. The sample set module 401 comprises a sampling processing unit 4011. The document boltzmann model module 402 includes a target learning function unit 4021. The optimization module 403 includes an edge probability value unit 4031 and a selection unit 4032. The edge probability value unit 4031 includes a single set of sub-units 40311 and multiple sets of sub-units 40312.
The sample set module 401 is configured to sample a selected document to obtain a plurality of groups of text fragments, and to gather the plurality of groups of text fragments to obtain a sample set. The sampling processing unit 4011 is configured to perform sampling processing by using a sliding window to obtain text segments that may overlap, where the size of the sliding window is a first preset value and the step length of the sliding window is a second preset value.
The document boltzmann model module 402 is configured to perform model learning processing according to the sample set, and obtain a document boltzmann model corresponding to the selected document. The target learning function unit 4021 is configured to perform model learning processing by a maximum likelihood function to obtain a learning target function as shown below:
$$L(W;X)=\sum_{x\in X}\log p(x;W)$$
$$p(x;W)=\frac{1}{Z(W)}\exp[-E(x;W)]$$
$$E(x;W)=-\sum_i w_i^{(1)}x_i-\sum_{i<j}w_{ij}^{(2)}x_ix_j-\sum_{i<j<k}w_{ijk}^{(3)}x_ix_jx_k$$
wherein L(W; X) represents the learning objective function, Z(W) represents the partition function, X represents the sample set, W represents the parameter set of the document Boltzmann machine model, and $w_1$, $w_2$, $w_3$ represent subsets of W; the first-order parameter $w_i^{(1)}$ represents the state of node i, the second-order parameter $w_{ij}^{(2)}$ represents the degree of association between node i and node j, and the third-order parameter $w_{ijk}^{(3)}$ represents the degree of association among node i, node j and node k.
The optimization module 403 is configured to perform optimization processing on the generated document boltzmann model according to a bayesian information criterion, so as to obtain an optimized document boltzmann model.
The edge probability value unit 4031 is used for obtaining a plurality of small models corresponding to the document Boltzmann machine model and an edge probability value corresponding to a single small model according to the relevance of the nodes in the document Boltzmann machine model. The single-group subunit 40311 is used for obtaining, when the association value between any one pair of nodes in the document Boltzmann machine model is set to zero, the corresponding $\binom{b}{2}$ different small models and the edge probability value of each small model, and selecting the maximum value P1 of the edge probability values. The multi-group subunit 40312 is used for obtaining, when two pairs, three pairs, four pairs, and so on up to $\binom{b}{2}$ pairs of nodes in the document Boltzmann machine model have their association values set to zero, the plurality of different small models corresponding to the document Boltzmann machine model and the edge probability value of each small model, and selecting the maximum values P2, P3, P4, … of the edge probability values at each level.
The selecting unit 4032 is configured to select a maximum value P of the edge probability values according to the edge probability values of the multiple small models, and record the small model corresponding to the maximum value P as the optimized document boltzmann model.
The document Boltzmann machine construction and optimization method for document query provided by the invention applies a Boltzmann machine to the field of document query, can naturally capture the dependency relationships among terms, and generalizes the distribution hypothesis used by the traditional language model. A more effective query likelihood can be obtained, and retrieval accuracy is improved.
It is to be understood that the disclosed embodiments of the invention are not limited to the particular structures, process steps, or materials disclosed herein but are extended to equivalents thereof as would be understood by those ordinarily skilled in the relevant arts. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, the appearances of the phrase "one embodiment" or "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
Although the embodiments of the present invention have been described above, the above description is only for the convenience of understanding the present invention, and is not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A document Boltzmann machine construction and optimization method for document query, characterized by comprising the following steps:
sampling the selected document to obtain a plurality of groups of text fragments, and gathering the plurality of groups of text fragments to obtain a sample set;
performing model learning processing according to the sample set to obtain a document Boltzmann machine model corresponding to the selected document;
and performing optimization processing on the generated document Boltzmann machine model through a Bayesian information criterion to obtain the optimized document Boltzmann machine model.
2. The method of claim 1, wherein the step of sampling the selected document to obtain the plurality of groups of text segments further comprises the steps of:
sampling is carried out by utilizing a sliding window to obtain the overlapped text segments, wherein the size of the sliding window is a first preset value, and the step length of the sliding window is a second preset value.
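The sliding-window sampling of claim 2 can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization; the function and variable names are illustrative, and the first and second preset values appear as `window_size` and `step`.

```python
def sample_fragments(tokens, window_size=5, step=2):
    """Slide a fixed-size window over the token list; fragments overlap
    whenever step < window_size, as the claim allows."""
    fragments = []
    for start in range(0, max(len(tokens) - window_size + 1, 1), step):
        fragments.append(tokens[start:start + window_size])
    return fragments

doc = "the quick brown fox jumps over the lazy dog".split()
sample_set = sample_fragments(doc, window_size=4, step=2)
# Three overlapping 4-token fragments starting at positions 0, 2, 4.
```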
3. The method of claim 1, wherein the step of performing model learning processing based on the sample set further comprises the steps of:
model learning processing is performed by a maximum likelihood function to obtain a learning objective function as shown below:

L(W;X) = Σ_{x∈X} log{ exp[-E(x;W)] / Z(W) }

Z(W) = Σ_x exp[-E(x;W)]

E(x;W) = -Σ_i w1_i·x_i - Σ_{i<j} w2_{i,j}·x_i·x_j - Σ_{i<j<k} w3_{i,j,k}·x_i·x_j·x_k

wherein L(W;X) represents the learning objective function, Z(W) represents the distribution function, X represents the sample set, W represents the parameter values of the document Boltzmann machine model, w1, w2, w3 represent subsets of W, the first-order parameter w1_i represents the state of node i, the second-order parameter w2_{i,j} represents the degree of association between node i and node j, and the third-order parameter w3_{i,j,k} represents the degree of association among node i, node j, and node k.
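A brute-force sketch of this learning objective for a small, fully observed model follows. It is illustrative only: the parameter dictionaries `w1`, `w2`, `w3` and the binary node states are assumptions, and the partition function is computed by exhaustive enumeration, which is only feasible for a handful of nodes.

```python
import itertools
import math

def energy(x, w1, w2, w3):
    """E(x;W) with first-, second-, and third-order terms over binary states x."""
    b = len(x)
    e = -sum(w1[i] * x[i] for i in range(b))
    e -= sum(w2[(i, j)] * x[i] * x[j]
             for i, j in itertools.combinations(range(b), 2))
    e -= sum(w3[(i, j, k)] * x[i] * x[j] * x[k]
             for i, j, k in itertools.combinations(range(b), 3))
    return e

def log_likelihood(samples, w1, w2, w3):
    """L(W;X) = sum over samples of log( exp[-E(x;W)] / Z(W) )."""
    b = len(samples[0])
    log_z = math.log(sum(math.exp(-energy(list(s), w1, w2, w3))
                         for s in itertools.product([0, 1], repeat=b)))
    return sum(-energy(x, w1, w2, w3) - log_z for x in samples)
```

With all parameters zero every state is equally likely, so the log-likelihood of N samples over b nodes reduces to -N·b·log 2, a handy sanity check.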
4. The method of claim 1, wherein the step of performing optimization processing through Bayesian information criterion to obtain the optimized document Boltzmann machine model further comprises the steps of:
obtaining a plurality of small models corresponding to the document Boltzmann machine model and an edge probability value corresponding to a single small model according to the relevance of nodes in the document Boltzmann machine model;
and selecting the maximum value P of the edge probability values according to the edge probability values of the small models, and recording the small model corresponding to the maximum value P as the optimized document Boltzmann machine model.
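The Bayesian information criterion referred to above trades model fit against the number of free parameters. The patent does not spell out the scoring formula; the sketch below uses the standard BIC = k·ln(n) − 2·ln(L), with illustrative helper names and made-up numbers.

```python
import math

def bic_score(log_likelihood, num_free_params, num_samples):
    """Standard BIC: lower is better; extra parameters cost k * ln(n)."""
    return num_free_params * math.log(num_samples) - 2.0 * log_likelihood

# Comparing two candidate small models fit on the same sample set:
# the richer model must improve the likelihood enough to pay its penalty.
full_model = bic_score(log_likelihood=-120.0, num_free_params=10, num_samples=200)
pruned_model = bic_score(log_likelihood=-123.0, num_free_params=6, num_samples=200)
best = min(("full", full_model), ("pruned", pruned_model), key=lambda t: t[1])
```

Here the pruned model wins despite a slightly worse likelihood, which is exactly the behavior that favors smaller document Boltzmann machine models.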
5. The method of claim 4, wherein when the document Boltzmann machine model includes b nodes, the step of obtaining a plurality of small models corresponding to the document Boltzmann machine model and an edge probability value corresponding to a single small model further comprises the steps of:
when the correlation value between any one pair of nodes in the document Boltzmann machine model is zero, obtaining the C(b,2) = b(b-1)/2 different small models corresponding to the document Boltzmann machine model and the edge probability value corresponding to each single small model, and selecting the maximum value P1 of the edge probability values;
when the correlation values of two, three, four, … up to C(b,2) pairs of nodes in the document Boltzmann machine model are zero, obtaining the plurality of different small models corresponding to the document Boltzmann machine model and the edge probability values corresponding to each single small model, and selecting the maximum values P2, P3, P4, …, P_C(b,2) of the edge probability values.
6. A document Boltzmann machine construction and optimization apparatus for document query, characterized in that the apparatus comprises:
the system comprises a sample set module, a document selection module and a document analysis module, wherein the sample set module is used for sampling a selected document to obtain a plurality of groups of text fragments, and collecting the plurality of groups of text fragments to obtain a sample set;
the document Boltzmann machine model module is used for performing model learning processing according to the sample set to obtain a document Boltzmann machine model corresponding to the selected document;
and the optimization module is used for carrying out optimization processing on the generated document Boltzmann machine model through a Bayesian information criterion to obtain the optimized document Boltzmann machine model.
7. The apparatus of claim 6, wherein the sample set module comprises:
the sampling processing unit is used for performing sampling processing by using a sliding window to obtain overlapping text segments, wherein the size of the sliding window is a first preset value, and the step length of the sliding window is a second preset value.
8. The apparatus of claim 6, wherein the document Boltzmann machine model module comprises:
an objective function unit, used for performing model learning processing by a maximum likelihood function to obtain a learning objective function as shown below:

L(W;X) = Σ_{x∈X} log{ exp[-E(x;W)] / Z(W) }

Z(W) = Σ_x exp[-E(x;W)]

E(x;W) = -Σ_i w1_i·x_i - Σ_{i<j} w2_{i,j}·x_i·x_j - Σ_{i<j<k} w3_{i,j,k}·x_i·x_j·x_k

wherein L(W;X) represents the learning objective function, Z(W) represents the distribution function, X represents the sample set, W represents the parameter values of the document Boltzmann machine model, w1, w2, w3 represent subsets of W, the first-order parameter w1_i represents the state of node i, the second-order parameter w2_{i,j} represents the degree of association between node i and node j, and the third-order parameter w3_{i,j,k} represents the degree of association among node i, node j, and node k.
9. The apparatus of claim 6, wherein the optimization module comprises:
the edge probability value unit is used for obtaining a plurality of small models corresponding to the document Boltzmann machine model and an edge probability value corresponding to a single small model according to the relevance of nodes in the document Boltzmann machine model;
and the selecting unit is used for selecting the maximum value P of the edge probability values according to the edge probability values of the small models, and recording the small model corresponding to the maximum value P as the optimized document Boltzmann machine model.
10. The apparatus of claim 9, wherein when the document Boltzmann machine model includes b nodes, the edge probability value unit comprises:
a single-group subunit, used for, when the correlation value between any one pair of nodes in the document Boltzmann machine model is zero, obtaining the C(b,2) = b(b-1)/2 different small models corresponding to the document Boltzmann machine model and the edge probability value corresponding to each single small model, and selecting the maximum value P1 of the edge probability values;
a multi-group subunit, used for, when the correlation values of two, three, four, … up to C(b,2) pairs of nodes in the document Boltzmann machine model are zero, obtaining the plurality of different small models corresponding to the document Boltzmann machine model and the edge probability values corresponding to each single small model, and selecting the maximum values P2, P3, P4, …, P_C(b,2) of the edge probability values.
CN201811339382.5A 2018-11-12 2018-11-12 Document boltzmann machine construction optimization method and device for document query Pending CN111177327A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811339382.5A CN111177327A (en) 2018-11-12 2018-11-12 Document boltzmann machine construction optimization method and device for document query


Publications (1)

Publication Number Publication Date
CN111177327A true CN111177327A (en) 2020-05-19

Family

ID=70655502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811339382.5A Pending CN111177327A (en) 2018-11-12 2018-11-12 Document boltzmann machine construction optimization method and device for document query

Country Status (1)

Country Link
CN (1) CN111177327A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023007270A1 (en) * 2021-07-26 2023-02-02 Carl Wimmer Foci analysis tool

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5479576A (en) * 1992-01-30 1995-12-26 Ricoh Company, Ltd. Neural network learning system inferring an input-output relationship from a set of given input and output samples
CN108763418A (en) * 2018-05-24 2018-11-06 辽宁石油化工大学 A kind of sorting technique and device of text


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Anders Skrondal: "Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models", 31 January 2011, Chongqing University Press *
Huang Liming: "Applying the Document Boltzmann Machine to Query Expansion", China Master's Theses Full-text Database, Information Science and Technology *


Similar Documents

Publication Publication Date Title
US9589208B2 (en) Retrieval of similar images to a query image
WO2021139262A1 (en) Document mesh term aggregation method and apparatus, computer device, and readable storage medium
CA2781105A1 (en) Automatically mining person models of celebrities for visual search applications
Huang et al. Topic detection from large scale of microblog stream with high utility pattern clustering
WO2016180308A1 (en) Video retrieval methods and apparatuses
CN111460153A (en) Hot topic extraction method and device, terminal device and storage medium
JP2009282980A (en) Method and apparatus for image learning, automatic notation, and retrieving
US11010411B2 (en) System and method automatically sorting ranked items and generating a visual representation of ranked results
JP2011198364A (en) Method of adding label to medium document and system using the same
CN104166684A (en) Cross-media retrieval method based on uniform sparse representation
CN112214335B (en) Web service discovery method based on knowledge graph and similarity network
CN114077705A (en) A method and system for profiling media accounts on social platforms
CN111538903B (en) Method and device for determining search recommended word, electronic equipment and computer readable medium
CN103778206A (en) Method for providing network service resources
CN112860685A (en) Automatic recommendation of analysis of data sets
Bhutada et al. Semantic latent dirichlet allocation for automatic topic extraction
CN111598712B (en) Training and searching method for data feature generator in social media cross-modal search
CN109948154A (en) A system and method for character acquisition and relationship recommendation based on mailbox name
WO2022116324A1 (en) Search model training method, apparatus, terminal device, and storage medium
CN103870489B (en) Chinese personal name based on search daily record is from extending recognition methods
CN105701227A (en) Cross-media similarity measure method and search method based on local association graph
CN110019763B (en) Text filtering method, system, equipment and computer readable storage medium
CN111177327A (en) Document boltzmann machine construction optimization method and device for document query
CN116032741A (en) Equipment identification method and device, electronic equipment and computer storage medium
CN118445406A (en) Integration system based on massive polymorphic circuit heritage information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20200519)