
CN111177327A - Document boltzmann machine construction optimization method and device for document query - Google Patents


Info

Publication number
CN111177327A
CN111177327A (application CN201811339382.5A)
Authority
CN
China
Prior art keywords
document
model
boltzmann machine
machine model
boltzmann
Prior art date
Legal status
Pending
Application number
CN201811339382.5A
Other languages
Chinese (zh)
Inventor
黄历铭
李昌盛
杨传书
何江
Current Assignee
China Petroleum and Chemical Corp
Sinopec Research Institute of Petroleum Engineering
Original Assignee
China Petroleum and Chemical Corp
Sinopec Research Institute of Petroleum Engineering
Priority date
Filing date
Publication date
Application filed by China Petroleum and Chemical Corp, Sinopec Research Institute of Petroleum Engineering filed Critical China Petroleum and Chemical Corp
Priority to CN201811339382.5A
Publication of CN111177327A


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for constructing and optimizing a document Boltzmann machine for document query, which comprises the following steps: sampling the selected document to obtain a plurality of groups of text fragments, and gathering the plurality of groups of text fragments to obtain a sample set; performing model learning processing according to the sample set to obtain a document Boltzmann machine model corresponding to the selected document; and performing optimization processing on the generated document Boltzmann machine model through a Bayesian information criterion to obtain the optimized document Boltzmann machine model. The document boltzmann machine construction and optimization method for document query, provided by the invention, applies the boltzmann machine to the field of document query, can naturally capture the dependency relationship among terms, and generalizes the distribution hypothesis used by a traditional language model. More effective query likelihood can be obtained, and the retrieval accuracy is improved.

Description

Document boltzmann machine construction optimization method and device for document query
Technical Field
The invention relates to the field of information retrieval, in particular to a method and a device for constructing and optimizing a document Boltzmann machine for document query.
Background
In recent years, with the rapid development of internet technology, information on the internet grows at an exponential rate, and information resources on the internet are greatly enriched. However, it is becoming more and more difficult to screen the information needed by the user from the massive information, which not only relates to the speed of the search, but also must consider the accuracy and validity of the search result and whether the user's needs can be met.
In the field of information retrieval, probabilistic language models have been widely used. The language model estimates the document model under the multinomial distribution assumption, and then ranks documents by relevance using the query likelihood, wherein the language model assumes that the terms in a document are independent of each other. As a result, no existing probabilistic language model for document query captures the dependency relationships among terms.
Therefore, the invention provides a method and a device for constructing and optimizing a document Boltzmann machine for document query.
Disclosure of Invention
In order to solve the above problems, the present invention provides a document Boltzmann machine construction and optimization method for document query, the method comprising the following steps:
sampling the selected document to obtain a plurality of groups of text fragments, and gathering the plurality of groups of text fragments to obtain a sample set;
performing model learning processing according to the sample set to obtain a document Boltzmann machine model corresponding to the selected document;
and performing optimization processing on the generated document Boltzmann machine model through a Bayesian information criterion to obtain the optimized document Boltzmann machine model.
According to an embodiment of the present invention, the step of sampling the selected document to obtain a plurality of groups of text segments further includes the following steps:
sampling is carried out by using a sliding window to obtain text segments that may overlap, wherein the size of the sliding window is a first preset value, and the step length of the sliding window is a second preset value.
According to an embodiment of the present invention, the step of performing model learning processing according to the sample set further includes the following steps:
model learning processing is performed by a maximum likelihood function to obtain a learning objective function as shown below:
$$L(W;X)=\sum_{x\in X}\log p(x;W)$$
$$p(x;W)=\frac{1}{Z(W)}\exp[-E(x;W)]$$
$$E(x;W)=-\sum_i w_i^{(1)}x_i-\sum_{i<j}w_{ij}^{(2)}x_ix_j-\sum_{i<j<k}w_{ijk}^{(3)}x_ix_jx_k$$
wherein L(W; X) represents the learning objective function, Z(W) represents the partition function, X represents the sample set, W represents the parameter set of the document Boltzmann machine model, and $w_1$, $w_2$, $w_3$ represent subsets of W; the first-order parameter $w_i^{(1)}$ represents the state of node i, the second-order parameter $w_{ij}^{(2)}$ represents the degree of association between node i and node j, and the third-order parameter $w_{ijk}^{(3)}$ represents the degree of association among node i, node j and node k.
According to an embodiment of the invention, the step of performing optimization processing through a Bayesian information criterion to obtain the optimized document Boltzmann machine model further comprises the following steps:
obtaining a plurality of small models corresponding to the document Boltzmann machine model and an edge probability value corresponding to a single small model according to the relevance of nodes in the document Boltzmann machine model;
and selecting the maximum value P of the edge probability values according to the edge probability values of the small models, and recording the small model corresponding to the maximum value P as the optimized document Boltzmann machine model.
According to an embodiment of the present invention, when the document Boltzmann machine model includes b nodes, the step of obtaining a plurality of small models corresponding to the document Boltzmann machine model and an edge probability value corresponding to a single small model further includes the following steps:
when the association value between any one pair of nodes in the document Boltzmann machine model is set to zero, obtaining the corresponding $\binom{b}{2}$ different small models and the edge probability value of each small model, and selecting the maximum value P1 of the edge probability values;
when two pairs, three pairs, four pairs, and so on up to $\binom{b}{2}$ pairs of nodes in the document Boltzmann machine model have their association values set to zero, obtaining the plurality of different small models corresponding to the document Boltzmann machine model and the edge probability value of each small model, and selecting the maximum values P2, P3, P4, … of the edge probability values at each level.
According to another aspect of the present invention, there is also provided a document Boltzmann machine construction and optimization apparatus for document query, the apparatus including:
the system comprises a sample set module, a document selection module and a document analysis module, wherein the sample set module is used for sampling a selected document to obtain a plurality of groups of text fragments, and collecting the plurality of groups of text fragments to obtain a sample set;
the document Boltzmann machine model module is used for performing model learning processing according to the sample set to obtain a document Boltzmann machine model corresponding to the selected document;
and the optimization module is used for carrying out optimization processing on the generated document Boltzmann machine model through a Bayesian information criterion to obtain the optimized document Boltzmann machine model.
According to one embodiment of the invention, the sample set module comprises:
the sampling processing unit is used for carrying out sampling processing by using a sliding window to obtain text segments that may overlap, wherein the size of the sliding window is a first preset value, and the step length of the sliding window is a second preset value.
According to one embodiment of the invention, the document boltzmann model module comprises:
a target learning function unit for performing model learning processing by a maximum likelihood function to obtain a learning objective function as shown below:
$$L(W;X)=\sum_{x\in X}\log p(x;W)$$
$$p(x;W)=\frac{1}{Z(W)}\exp[-E(x;W)]$$
$$E(x;W)=-\sum_i w_i^{(1)}x_i-\sum_{i<j}w_{ij}^{(2)}x_ix_j-\sum_{i<j<k}w_{ijk}^{(3)}x_ix_jx_k$$
wherein L(W; X) represents the learning objective function, Z(W) represents the partition function, X represents the sample set, W represents the parameter set of the document Boltzmann machine model, and $w_1$, $w_2$, $w_3$ represent subsets of W; the first-order parameter $w_i^{(1)}$ represents the state of node i, the second-order parameter $w_{ij}^{(2)}$ represents the degree of association between node i and node j, and the third-order parameter $w_{ijk}^{(3)}$ represents the degree of association among node i, node j and node k.
According to one embodiment of the invention, the optimization module comprises:
the edge probability value unit is used for obtaining a plurality of small models corresponding to the document Boltzmann machine model and an edge probability value corresponding to a single small model according to the relevance of nodes in the document Boltzmann machine model;
and the selecting unit is used for selecting the maximum value P of the edge probability values according to the edge probability values of the small models, and recording the small model corresponding to the maximum value P as the optimized document Boltzmann machine model.
According to an embodiment of the present invention, when the document boltzmann model includes b nodes, the edge probability value unit includes:
a single-group subunit, used for obtaining, when the association value between any one pair of nodes in the document Boltzmann machine model is set to zero, the corresponding $\binom{b}{2}$ different small models and the edge probability value of each small model, and selecting the maximum value P1 of the edge probability values;
a multi-group subunit, used for obtaining, when two pairs, three pairs, four pairs, and so on up to $\binom{b}{2}$ pairs of nodes in the document Boltzmann machine model have their association values set to zero, the plurality of different small models corresponding to the document Boltzmann machine model and the edge probability value of each small model, and selecting the maximum values P2, P3, P4, … of the edge probability values at each level.
The document Boltzmann machine construction and optimization method for document query provided by the invention applies a Boltzmann machine to the field of document query, can naturally capture the dependency relationships among terms, and generalizes the distribution hypothesis used by the traditional language model. A more effective query likelihood can be obtained, and retrieval accuracy is improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 shows a flowchart of a document Boltzmann machine construction and optimization method for document query according to one embodiment of the invention;
FIG. 2 shows a schematic diagram of obtaining a document Boltzmann machine model in the document Boltzmann machine construction and optimization method for document query according to one embodiment of the invention;
FIG. 3 shows a schematic diagram of a document Boltzmann machine model optimized for document queries according to another embodiment of the invention; and
FIG. 4 shows a block diagram of an apparatus for optimizing the construction of a document Boltzmann machine for document query according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
FIG. 1 shows a flowchart of a document Boltzmann machine construction and optimization method for document query, according to one embodiment of the invention.
As shown in fig. 1, in step S101, a selected document is sampled to obtain a plurality of groups of text segments, and the obtained plurality of groups of text segments are collected to obtain a sample set. In one embodiment, a sliding window is used for sampling to obtain the text segments that can be overlapped, wherein the size of the sliding window is a first preset value, and the step length of the sliding window is a second preset value.
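The sampling in step S101 can be sketched as follows. This is a minimal illustration, not code from the patent; the function name, the toy document, and the small window size (the embodiment uses a window of 16 and a step of 1) are all assumptions made for demonstration.

```python
def sample_fragments(tokens, vocab, window=16, step=1):
    """Slide a fixed-size window over a token list and encode each
    fragment as a binary term-presence vector over the vocabulary."""
    index = {term: i for i, term in enumerate(vocab)}
    samples = []
    for start in range(0, max(len(tokens) - window, 0) + 1, step):
        fragment = tokens[start:start + window]
        vec = [0] * len(vocab)  # one dimension per vocabulary term
        for tok in fragment:
            if tok in index:
                vec[index[tok]] = 1  # term occurs in this fragment
        samples.append(vec)
    return samples

doc = "drilling fluid density controls wellbore pressure during drilling".split()
vocab = sorted(set(doc))  # 7 distinct terms
X = sample_fragments(doc, vocab, window=4, step=1)
print(len(X), len(X[0]))
```

Overlapping fragments (step smaller than the window) give many samples from a single document, which is what makes the subsequent maximum-likelihood learning feasible.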
Then, in step S102, model learning processing is performed according to the sample set, and a document boltzmann model corresponding to the selected document is obtained. In one embodiment, the model learning process is performed by a maximum likelihood function to obtain a learning objective function as shown below:
$$L(W;X)=\sum_{x\in X}\log p(x;W)$$
$$p(x;W)=\frac{1}{Z(W)}\exp[-E(x;W)]$$
$$E(x;W)=-\sum_i w_i^{(1)}x_i-\sum_{i<j}w_{ij}^{(2)}x_ix_j-\sum_{i<j<k}w_{ijk}^{(3)}x_ix_jx_k$$
wherein L(W; X) represents the learning objective function, Z(W) represents the partition function, X represents the sample set, W represents the parameter set of the document Boltzmann machine model, and $w_1$, $w_2$, $w_3$ represent subsets of W; the first-order parameter $w_i^{(1)}$ represents the state of node i, the second-order parameter $w_{ij}^{(2)}$ represents the degree of association between node i and node j, and the third-order parameter $w_{ijk}^{(3)}$ represents the degree of association among node i, node j and node k.
Finally, in step S103, the generated document boltzmann model is optimized according to the bayesian information criterion, so as to obtain an optimized document boltzmann model. In one embodiment, according to the relevance of nodes in the document boltzmann model, a plurality of small models corresponding to the document boltzmann model and edge probability values corresponding to a single small model are obtained. And selecting the maximum value P of the edge probability values according to the edge probability values of the small models, and recording the small model corresponding to the maximum value P as the optimized document Boltzmann machine model.
When the document Boltzmann machine model contains b nodes, in one embodiment, when the association value between any one pair of nodes in the model is set to zero, the corresponding $\binom{b}{2}$ different small models and the edge probability value of each small model are obtained, and the maximum value P1 of the edge probability values is selected; when two pairs, three pairs, four pairs, and so on up to $\binom{b}{2}$ pairs of nodes in the model have their association values set to zero, the plurality of different small models corresponding to the document Boltzmann machine model and the edge probability value of each small model are obtained, and the maximum values P2, P3, P4, … of the edge probability values are selected at each level.
FIG. 2 shows a schematic diagram of obtaining a document Boltzmann machine model in the document Boltzmann machine construction and optimization method for document query according to one embodiment of the invention.
Documents are modeled using the fully visible Boltzmann machine (BM), with the goal of naturally capturing the dependencies between terms and generalizing the distribution hypothesis used by traditional language models. The model represents a class of Boltzmann distributions that is more general than the multinomial distribution, thereby obtaining a more effective query likelihood and improving retrieval accuracy.
As shown in FIG. 2, for any document $d_i$, overlappable document fragments are obtained using a sliding window, with the window size set to a fixed value σ and the sliding step size set to 1. Each dimension of a document fragment indicates whether the corresponding term occurs in the current fragment: if a term appears in the fragment, the corresponding dimension of the vector representation takes the value 1, and otherwise 0. The obtained sample set X is used to learn the model representing document $d_i$. The learning method selects the most basic maximum likelihood function, from which the learning objective function, namely the log likelihood of the samples, can be obtained:
$$L(W;X)=\sum_{x\in X}\log p(x;W)$$
wherein $Z(W)=\sum_{x}\exp[-E(x;W)]$ is the partition function;
$$p(x;W)=\frac{1}{Z(W)}\exp[-E(x;W)]$$
$$E(x;W)=-\sum_i w_i^{(1)}x_i-\sum_{i<j}w_{ij}^{(2)}x_ix_j-\sum_{i<j<k}w_{ijk}^{(3)}x_ix_jx_k$$
X is the sample set, and $w_1$, $w_2$, $w_3$, etc. are different subsets of the parameter set W. The first-order parameter $w_i^{(1)}$ models the state of node i, the second-order parameter $w_{ij}^{(2)}$ models the degree of association between node i and node j, and higher-order parameters can model the degree of connection among multiple nodes. Thus, when each node represents the state of a term, the Boltzmann machine can naturally model the dependencies between terms.
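The energy function and the resulting Boltzmann distribution can be illustrated with a small, fully enumerable example. This sketch is not from the patent: it keeps only the first- and second-order parameters (third-order terms are omitted), all parameter values are invented, and exact computation of the partition function by enumerating all $2^n$ states is feasible only for very small vocabularies.

```python
import itertools
import math

def energy(x, w1, w2):
    # E(x;W) = -sum_i w1[i]*x_i - sum_{i<j} w2[(i,j)]*x_i*x_j
    e = -sum(w1[i] * x[i] for i in range(len(x)))
    e -= sum(w2[(i, j)] * x[i] * x[j] for (i, j) in w2)
    return e

def prob(x, w1, w2):
    # p(x;W) = exp(-E(x;W)) / Z(W), with Z computed by brute-force
    # enumeration of all 2^n binary states (small n only)
    n = len(x)
    Z = sum(math.exp(-energy(s, w1, w2))
            for s in itertools.product((0, 1), repeat=n))
    return math.exp(-energy(x, w1, w2)) / Z

w1 = [0.2, -0.1, 0.05]                         # first-order parameters
w2 = {(0, 1): 0.8, (0, 2): 0.0, (1, 2): -0.3}  # second-order parameters
p = prob((1, 1, 0), w1, w2)
```

The positive parameter for the pair (0, 1) raises the probability of fragments in which both terms co-occur, which is exactly the term-dependency effect the patent describes.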
Updating parameters according to the gradient of the objective function in the learning process:
$$\frac{\partial L(W;X)}{\partial w}=\langle\cdot\rangle_0-\langle\cdot\rangle_M$$
wherein $\langle\cdot\rangle_0$ and $\langle\cdot\rangle_M$ respectively represent mean values under the sample distribution and the current model distribution (for example, $\partial L/\partial w_{ij}^{(2)}=\langle x_ix_j\rangle_0-\langle x_ix_j\rangle_M$). For any document $d_i$, the corresponding document Boltzmann machine model is $BM_i$, and the probability of any document fragment x of this document is:
$$\log p(x|d_i)=\log p(x|BM_i)=\log p(x;W_i)$$
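The moment-matching gradient update can be sketched for a tiny model, computing both averages exactly by enumeration instead of sampling. The function names, learning rate, and toy sample set are illustrative assumptions; realistic vocabularies require approximate expectations.

```python
import itertools
import math

def energy(x, w1, w2):
    e = -sum(w1[i] * x[i] for i in range(len(x)))
    e -= sum(w2[(i, j)] * x[i] * x[j] for (i, j) in w2)
    return e

def model_distribution(w1, w2, n):
    # exact p(s;W) over all 2^n states (small n only)
    states = list(itertools.product((0, 1), repeat=n))
    weights = [math.exp(-energy(s, w1, w2)) for s in states]
    Z = sum(weights)
    return [(s, w / Z) for s, w in zip(states, weights)]

def gradient_step(X, w1, w2, lr=0.5):
    # dL/dw_i  = <x_i>_0     - <x_i>_M
    # dL/dw_ij = <x_i x_j>_0 - <x_i x_j>_M
    dist = model_distribution(w1, w2, len(w1))
    for i in range(len(w1)):
        data_m = sum(x[i] for x in X) / len(X)
        model_m = sum(p * s[i] for s, p in dist)
        w1[i] += lr * (data_m - model_m)
    for (i, j) in w2:
        data_m = sum(x[i] * x[j] for x in X) / len(X)
        model_m = sum(p * s[i] * s[j] for s, p in dist)
        w2[(i, j)] += lr * (data_m - model_m)

X = [(1, 1, 0), (1, 1, 0), (0, 1, 1), (1, 0, 0)]  # toy sample set
w1 = [0.0, 0.0, 0.0]
w2 = {(0, 1): 0.0, (0, 2): 0.0, (1, 2): 0.0}
for _ in range(200):
    gradient_step(X, w1, w2)
```

After training, the model's moments approach the sample moments; for example, terms 0 and 2 never co-occur in the toy data, so their pairwise parameter is pushed negative.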
the document boltzmann model DBM can be regarded as a probability distribution of dynamic system states, a query also represents a state of the system, and the system presents the state with a certain probability. In other words, since a query can be regarded as a value input by the boltzmann machine, its probability, i.e., query likelihood, can be directly calculated.
In some cases, all the terms appearing in the vocabulary are included in the structure of the Boltzmann machine, but this places a large burden on model learning and probability calculation. In the model implementation, given a query $Q=(q_1,q_2,q_3,\dots,q_i,\dots)$, a node is set up for each query word $q_i$. Once the Boltzmann machine model BM (with parameter values W) is obtained for document $d_i$, the query likelihood can be obtained:
$$p(Q|d_i)=p(x_Q;W_i)$$
wherein $x_Q$ is a vector in which the dimensions corresponding to all the query terms appearing in query Q take the value 1. This form of query generation is obviously different from the traditional language model: independence among terms is not assumed; instead, the query is regarded as a text fragment, and its generation probability under each document Boltzmann machine is calculated by taking the fragment as a whole.
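The query-likelihood computation can be sketched as follows. The vocabulary, parameter values, and queries are invented for illustration; the point is that the query is scored as a single fragment $x_Q$ rather than term by term.

```python
import itertools
import math

def energy(x, w1, w2):
    e = -sum(w1[i] * x[i] for i in range(len(x)))
    e -= sum(w2[(i, j)] * x[i] * x[j] for (i, j) in w2)
    return e

def query_likelihood(query_terms, vocab, w1, w2):
    # Build x_Q: dimensions of the query terms are set to 1,
    # then score the whole fragment with p(Q|d) = p(x_Q; W).
    x_q = tuple(1 if t in query_terms else 0 for t in vocab)
    Z = sum(math.exp(-energy(s, w1, w2))
            for s in itertools.product((0, 1), repeat=len(vocab)))
    return math.exp(-energy(x_q, w1, w2)) / Z

# made-up document model over a 3-term vocabulary
vocab = ["drill", "fluid", "pressure"]
w1 = [0.3, 0.1, -0.2]
w2 = {(0, 1): 0.9, (0, 2): 0.0, (1, 2): 0.2}

s1 = query_likelihood({"drill", "fluid"}, vocab, w1, w2)
s2 = query_likelihood({"drill", "pressure"}, vocab, w1, w2)
```

Because the pair ("drill", "fluid") has a strong positive association parameter, the first query receives a higher likelihood than the second even though both contain two terms.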
The invention not only makes full use of the strong multidimensional data-distribution modeling capability of the Boltzmann machine, but also better conforms to the user's search intention from a cognitive perspective: the non-independent terms input by the user express a single concept in the form of a text fragment.
In one embodiment, the method provided by the invention is verified on four standard public TREC data sets: AP8889 (queries 151-200), WSJ8792 (queries 151-200), ROBUST (queries 601-700), and WT10G (queries 501-550). The four data sets are indexed using Indri 5.1, and the text is stemmed with the Porter stemmer before indexing, with stopword removal disabled. The re-ranking experiment is carried out on the top 50 documents returned by a first-round query, where the first-round results are ranked by language model query likelihood. The window size used by the document Boltzmann machine to sample text segments is set to 16.
In this embodiment, the retrieval effects of different document models are compared without using any smoothing method. Two language models are chosen as comparison baselines, with the parameter μ in the Dirichlet smoothing method set to 0 (denoted LM(μ=0)) and to a small value ε (denoted LM(μ=ε)). The models under test include a document Boltzmann machine with only first-order parameters, denoted 1-DBM, and a document Boltzmann machine with first- and second-order parameters, denoted 2-DBM. Both tested models set the smoothing parameter to 0 (to eliminate the smoothing effect). The results of the model comparison without smoothing are shown in Table 1. From the results, it can be seen that, without the influence of smoothing, the document Boltzmann machines (1-DBM, 2-DBM) outperform the language models (LM(μ=0), LM(μ=ε)) on most data sets. In addition, the performance of 2-DBM is always better than that of 1-DBM, which shows that using the higher-order parameters of the document Boltzmann machine as a supplement to the lower-order parameters is effective in improving the overall retrieval effect.
TABLE 1 comparison of search results for different document models
The retrieval method based on the Boltzmann machine (BM) provided by the invention uses the fully visible Boltzmann machine for document modeling, with the aim of generalizing the distribution hypothesis used by the traditional language model. The model can represent a class of Boltzmann distributions that is more general than the multinomial distribution. The proposed document Boltzmann machine (DBM) can naturally capture the relations between terms, and thereby obtain a more effective query likelihood.
FIG. 3 shows a schematic diagram of a document Boltzmann machine model optimized by the document Boltzmann machine construction and optimization method for document query according to another embodiment of the present invention.
The Bayesian information criterion (BIC) is a basic method in statistical model selection. Its basic idea is as follows: 1. the parametric expression of the class-conditional probability density and the prior probability are known; 2. they are converted into a posterior probability by the Bayes formula; 3. decision classification is carried out according to the posterior probability.
In selecting a model, the Bayesian approach is to maximize the posterior probability of model $M_i$ given the data $(y_1,\dots,y_n)$. According to Bayes' theorem:
$$P(M_i|y_1,\dots,y_n)=\frac{P(y_1,\dots,y_n|M_i)\,P(M_i)}{P(y_1,\dots,y_n)}$$
wherein $P(y_1,\dots,y_n|M_i)$ is the marginal (edge) probability of the data under model $M_i$. Given the data, $P(y_1,\dots,y_n)$ is a constant, and when no model is preferred a priori, $P(M_i)$ is also a constant. Maximizing the posterior probability of model $M_i$ is therefore equivalent to maximizing the marginal probability $P(y_1,\dots,y_n|M_i)$.
When the document Boltzmann machine DBM is optimized using the Bayesian information criterion BIC,
$$P(y_1,\dots,y_n|M_i)=\int_{\Theta_i}L(\Theta_i|y_1,\dots,y_n)\,g_i(\Theta_i)\,d\Theta_i$$
wherein $\Theta_i$ is the parameter vector of the DBM model, namely the set of association values between any two nodes in the DBM, $g_i(\Theta_i)$ is the prior density of those parameters, and the data $(y_1,\dots,y_n)$ is the set of nodes in the DBM model.
It should be noted that the nodes in the DBM model refer to keywords, of which there are two types: 1. the keywords input by the user; 2. among the top n documents retrieved by the language model LM, the n words (other than the keywords input by the user) with the highest occurrence probability.
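A common closed-form approximation of the marginal probability $P(y_1,\dots,y_n|M_i)$ is the BIC score itself; the sketch below compares two hypothetical fitted models this way. The fitted log-likelihood values and parameter counts are invented for illustration and are not from the patent.

```python
import math

def bic(max_log_likelihood, num_params, num_samples):
    # BIC = k*ln(n) - 2*ln(L-hat); it approximates -2*ln P(data|M),
    # so a smaller BIC means a larger approximate marginal probability.
    return num_params * math.log(num_samples) - 2 * max_log_likelihood

# hypothetical fits: a full DBM vs. one with a pairwise association
# pruned (one fewer parameter, slightly worse maximized likelihood)
full   = bic(max_log_likelihood=-120.0, num_params=6, num_samples=50)
pruned = bic(max_log_likelihood=-121.5, num_params=5, num_samples=50)
```

Here the penalty for the extra parameter outweighs its small gain in fit, so the pruned model — the "small model" in the patent's terminology — wins the comparison.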
The optimization shown in FIG. 3 comprises the following specific steps (assume there are k nodes in the DBM model):
The association value between one pair of nodes of the DBM is set to 0. There are $\binom{k}{2}$ such cases, corresponding to $\binom{k}{2}$ different models, and each model corresponds to an edge probability value $P(y_1,\dots,y_n|M_i)$. The maximum edge probability value is selected and recorded as $P_1$, and the corresponding model is saved and recorded as $M_1$.
The association values of two pairs of nodes in the DBM are set to 0. There are $\binom{\binom{k}{2}}{2}$ such cases, corresponding to $\binom{\binom{k}{2}}{2}$ different models, and each model corresponds to an edge probability value $P(y_1,\dots,y_n|M_i)$. The maximum edge probability value is selected and recorded as $P_2$, and the corresponding model is saved and recorded as $M_2$.
And so on for three pairs, four pairs, …, up to $\binom{k}{2}$ pairs of nodes in the DBM model whose association values are set to 0.
Finally, the association values of all $\binom{k}{2}$ pairs of nodes in the DBM model are set to 0; the maximum edge probability value is selected and recorded as $P_j$, and the corresponding model is saved and recorded as $M_j$.
From the above edge probability values $\{P_1,P_2,\dots,P_j\}$, the maximum edge probability value P is selected, and the corresponding model M is the optimal model.
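The enumeration of candidate "small models" described above can be sketched by counting, for each level m, the ways of zeroing m pairwise association values. Scoring each candidate with its edge probability is omitted here, and k = 4 is an arbitrary illustrative choice.

```python
from itertools import combinations
from math import comb

k = 4                                    # nodes in the DBM
pairs = list(combinations(range(k), 2))  # the C(k,2) node pairs

# Level m: set the association values of m pairs to zero;
# there are C(C(k,2), m) candidate small models at that level.
level_counts = [comb(len(pairs), m) for m in range(1, len(pairs) + 1)]

# Enumerate every candidate, identified by its set of zeroed pairs.
candidates = [frozenset(zeroed)
              for m in range(1, len(pairs) + 1)
              for zeroed in combinations(pairs, m)]
```

The counts grow as $2^{\binom{k}{2}}-1$ in total, which is why the patent's level-by-level maximum ($P_1, P_2, \dots$) is a practical way to organize the search rather than scoring all subsets at once.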
FIG. 4 shows a block diagram of an apparatus for optimizing the construction of a document Boltzmann machine for document query according to an embodiment of the present invention.
As shown in fig. 4, the apparatus comprises: a sample set module 401, a document boltzmann machine model module 402, and an optimization module 403. The sample set module 401 comprises a sampling processing unit 4011. The document boltzmann model module 402 includes a target learning function unit 4021. The optimization module 403 includes an edge probability value unit 4031 and a selection unit 4032. The edge probability value unit 4031 includes a single set of sub-units 40311 and multiple sets of sub-units 40312.
The sample set module 401 is configured to sample a selected document to obtain a plurality of groups of text fragments, and to gather the plurality of groups of text fragments to obtain a sample set. The sampling processing unit 4011 is configured to perform sampling processing by using a sliding window to obtain text segments that may overlap, where the size of the sliding window is a first preset value and the step length of the sliding window is a second preset value.
The document boltzmann model module 402 is configured to perform model learning processing according to the sample set, and obtain a document boltzmann model corresponding to the selected document. The target learning function unit 4021 is configured to perform model learning processing by a maximum likelihood function to obtain a learning target function as shown below:
$$L(W;X)=\sum_{x\in X}\log p(x;W)$$
$$p(x;W)=\frac{1}{Z(W)}\exp[-E(x;W)]$$
$$E(x;W)=-\sum_i w_i^{(1)}x_i-\sum_{i<j}w_{ij}^{(2)}x_ix_j-\sum_{i<j<k}w_{ijk}^{(3)}x_ix_jx_k$$
wherein L(W; X) represents the learning objective function, Z(W) represents the partition function, X represents the sample set, W represents the parameter set of the document Boltzmann machine model, and $w_1$, $w_2$, $w_3$ represent subsets of W; the first-order parameter $w_i^{(1)}$ represents the state of node i, the second-order parameter $w_{ij}^{(2)}$ represents the degree of association between node i and node j, and the third-order parameter $w_{ijk}^{(3)}$ represents the degree of association among node i, node j and node k.
The optimization module 403 is configured to perform optimization processing on the generated document boltzmann model according to a bayesian information criterion, so as to obtain an optimized document boltzmann model.
The edge probability value unit 4031 is used for obtaining a plurality of small models corresponding to the document Boltzmann machine model and an edge probability value corresponding to a single small model according to the relevance of the nodes in the document Boltzmann machine model. The single-group subunit 40311 is used for obtaining, when the association value between any one pair of nodes in the document Boltzmann machine model is set to zero, the corresponding $\binom{b}{2}$ different small models and the edge probability value of each small model, and selecting the maximum value P1 of the edge probability values. The multi-group subunit 40312 is used for obtaining, when two pairs, three pairs, four pairs, and so on up to $\binom{b}{2}$ pairs of nodes in the document Boltzmann machine model have their association values set to zero, the plurality of different small models corresponding to the document Boltzmann machine model and the edge probability value of each small model, and selecting the maximum values P2, P3, P4, … of the edge probability values at each level.
The selecting unit 4032 is configured to select a maximum value P of the edge probability values according to the edge probability values of the multiple small models, and record the small model corresponding to the maximum value P as the optimized document boltzmann model.
The document Boltzmann machine construction and optimization method for document query provided by the invention applies a Boltzmann machine to the field of document query, can naturally capture the dependency relationships among terms, and generalizes the distribution hypothesis used by the traditional language model. A more effective query likelihood can be obtained, and retrieval accuracy is improved.
It is to be understood that the disclosed embodiments of the invention are not limited to the particular structures, process steps, or materials disclosed herein but are extended to equivalents thereof as would be understood by those ordinarily skilled in the relevant arts. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, the appearances of the phrase "one embodiment" or "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
Although the embodiments of the present invention have been described above, the above description is only for the convenience of understanding the present invention, and is not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A document Boltzmann machine construction and optimization method for document query, characterized by comprising the following steps:
sampling the selected document to obtain a plurality of groups of text fragments, and gathering the plurality of groups of text fragments to obtain a sample set;
performing model learning processing according to the sample set to obtain a document Boltzmann machine model corresponding to the selected document;
and performing optimization processing on the generated document Boltzmann machine model through a Bayesian information criterion to obtain the optimized document Boltzmann machine model.
2. The method of claim 1, wherein the step of sampling the selected document to obtain the plurality of groups of text segments further comprises the steps of:
sampling is carried out by utilizing a sliding window to obtain the overlapped text segments, wherein the size of the sliding window is a first preset value, and the step length of the sliding window is a second preset value.
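The sliding-window sampling of claim 2 can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization; the function and variable names are illustrative, and the first and second preset values appear as `window_size` and `step`.

```python
def sample_fragments(tokens, window_size=5, step=2):
    """Slide a fixed-size window over the token list; fragments overlap
    whenever step < window_size, as the claim allows."""
    fragments = []
    for start in range(0, max(len(tokens) - window_size + 1, 1), step):
        fragments.append(tokens[start:start + window_size])
    return fragments

doc = "the quick brown fox jumps over the lazy dog".split()
sample_set = sample_fragments(doc, window_size=4, step=2)
# Three overlapping 4-token fragments starting at positions 0, 2, 4.
```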
3. The method of claim 1, wherein the step of performing model learning processing based on the sample set further comprises the steps of:
model learning processing is performed by a maximum likelihood function to obtain a learning objective function as shown below:

L(W;X) = Σ_{x∈X} log{ exp[-E(x;W)] / Z(W) }

Z(W) = Σ_x exp[-E(x;W)]

E(x;W) = -Σ_i w1_i·x_i - Σ_{i<j} w2_{i,j}·x_i·x_j - Σ_{i<j<k} w3_{i,j,k}·x_i·x_j·x_k

wherein L(W;X) represents the learning objective function, Z(W) represents the distribution function, X represents the sample set, W represents the parameter values of the document Boltzmann machine model, w1, w2, w3 represent subsets of W, the first-order parameter w1_i represents the state of node i, the second-order parameter w2_{i,j} represents the degree of association between node i and node j, and the third-order parameter w3_{i,j,k} represents the degree of association among node i, node j, and node k.
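A brute-force sketch of this learning objective for a small, fully observed model follows. It is illustrative only: the parameter dictionaries `w1`, `w2`, `w3` and the binary node states are assumptions, and the partition function is computed by exhaustive enumeration, which is only feasible for a handful of nodes.

```python
import itertools
import math

def energy(x, w1, w2, w3):
    """E(x;W) with first-, second-, and third-order terms over binary states x."""
    b = len(x)
    e = -sum(w1[i] * x[i] for i in range(b))
    e -= sum(w2[(i, j)] * x[i] * x[j]
             for i, j in itertools.combinations(range(b), 2))
    e -= sum(w3[(i, j, k)] * x[i] * x[j] * x[k]
             for i, j, k in itertools.combinations(range(b), 3))
    return e

def log_likelihood(samples, w1, w2, w3):
    """L(W;X) = sum over samples of log( exp[-E(x;W)] / Z(W) )."""
    b = len(samples[0])
    log_z = math.log(sum(math.exp(-energy(list(s), w1, w2, w3))
                         for s in itertools.product([0, 1], repeat=b)))
    return sum(-energy(x, w1, w2, w3) - log_z for x in samples)
```

With all parameters zero every state is equally likely, so the log-likelihood of N samples over b nodes reduces to -N·b·log 2, a handy sanity check.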
4. The method of claim 1, wherein the step of performing optimization processing through Bayesian information criterion to obtain the optimized document Boltzmann machine model further comprises the steps of:
obtaining a plurality of small models corresponding to the document Boltzmann machine model and an edge probability value corresponding to a single small model according to the relevance of nodes in the document Boltzmann machine model;
and selecting the maximum value P of the edge probability values according to the edge probability values of the small models, and recording the small model corresponding to the maximum value P as the optimized document Boltzmann machine model.
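The Bayesian information criterion referred to above trades model fit against the number of free parameters. The patent does not spell out the scoring formula; the sketch below uses the standard BIC = k·ln(n) − 2·ln(L), with illustrative helper names and made-up numbers.

```python
import math

def bic_score(log_likelihood, num_free_params, num_samples):
    """Standard BIC: lower is better; extra parameters cost k * ln(n)."""
    return num_free_params * math.log(num_samples) - 2.0 * log_likelihood

# Comparing two candidate small models fit on the same sample set:
# the richer model must improve the likelihood enough to pay its penalty.
full_model = bic_score(log_likelihood=-120.0, num_free_params=10, num_samples=200)
pruned_model = bic_score(log_likelihood=-123.0, num_free_params=6, num_samples=200)
best = min(("full", full_model), ("pruned", pruned_model), key=lambda t: t[1])
```

Here the pruned model wins despite a slightly worse likelihood, which is exactly the behavior that favors smaller document Boltzmann machine models.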
5. The method of claim 4, wherein when the document Boltzmann machine model includes b nodes, the step of obtaining a plurality of small models corresponding to the document Boltzmann machine model and an edge probability value corresponding to a single small model further comprises the steps of:
when the correlation value between any one pair of nodes in the document Boltzmann machine model is zero, obtaining the C(b,2) = b(b-1)/2 different small models corresponding to the document Boltzmann machine model and the edge probability value corresponding to each single small model, and selecting the maximum value P1 of the edge probability values;
when the correlation values of two, three, four, … up to C(b,2) pairs of nodes in the document Boltzmann machine model are zero, obtaining the plurality of different small models corresponding to the document Boltzmann machine model and the edge probability values corresponding to each single small model, and selecting the maximum values P2, P3, P4, …, P_C(b,2) of the edge probability values.
6. A document Boltzmann machine construction and optimization apparatus for document query, characterized in that the apparatus comprises:
the system comprises a sample set module, a document selection module and a document analysis module, wherein the sample set module is used for sampling a selected document to obtain a plurality of groups of text fragments, and collecting the plurality of groups of text fragments to obtain a sample set;
the document Boltzmann machine model module is used for performing model learning processing according to the sample set to obtain a document Boltzmann machine model corresponding to the selected document;
and the optimization module is used for carrying out optimization processing on the generated document Boltzmann machine model through a Bayesian information criterion to obtain the optimized document Boltzmann machine model.
7. The apparatus of claim 6, wherein the sample set module comprises:
the sampling processing unit is used for performing sampling processing by using a sliding window to obtain overlapping text segments, wherein the size of the sliding window is a first preset value, and the step length of the sliding window is a second preset value.
8. The apparatus of claim 6, wherein the document Boltzmann machine model module comprises:
an objective function unit, used for performing model learning processing by a maximum likelihood function to obtain a learning objective function as shown below:

L(W;X) = Σ_{x∈X} log{ exp[-E(x;W)] / Z(W) }

Z(W) = Σ_x exp[-E(x;W)]

E(x;W) = -Σ_i w1_i·x_i - Σ_{i<j} w2_{i,j}·x_i·x_j - Σ_{i<j<k} w3_{i,j,k}·x_i·x_j·x_k

wherein L(W;X) represents the learning objective function, Z(W) represents the distribution function, X represents the sample set, W represents the parameter values of the document Boltzmann machine model, w1, w2, w3 represent subsets of W, the first-order parameter w1_i represents the state of node i, the second-order parameter w2_{i,j} represents the degree of association between node i and node j, and the third-order parameter w3_{i,j,k} represents the degree of association among node i, node j, and node k.
9. The apparatus of claim 6, wherein the optimization module comprises:
the edge probability value unit is used for obtaining a plurality of small models corresponding to the document Boltzmann machine model and an edge probability value corresponding to a single small model according to the relevance of nodes in the document Boltzmann machine model;
and the selecting unit is used for selecting the maximum value P of the edge probability values according to the edge probability values of the small models, and recording the small model corresponding to the maximum value P as the optimized document Boltzmann machine model.
10. The apparatus of claim 9, wherein when the document Boltzmann machine model includes b nodes, the edge probability value unit comprises:
a single-group subunit, used for, when the correlation value between any one pair of nodes in the document Boltzmann machine model is zero, obtaining the C(b,2) = b(b-1)/2 different small models corresponding to the document Boltzmann machine model and the edge probability value corresponding to each single small model, and selecting the maximum value P1 of the edge probability values;
a multi-group subunit, used for, when the correlation values of two, three, four, … up to C(b,2) pairs of nodes in the document Boltzmann machine model are zero, obtaining the plurality of different small models corresponding to the document Boltzmann machine model and the edge probability values corresponding to each single small model, and selecting the maximum values P2, P3, P4, …, P_C(b,2) of the edge probability values.
CN201811339382.5A 2018-11-12 2018-11-12 Document boltzmann machine construction optimization method and device for document query Pending CN111177327A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811339382.5A CN111177327A (en) 2018-11-12 2018-11-12 Document boltzmann machine construction optimization method and device for document query


Publications (1)

Publication Number Publication Date
CN111177327A true CN111177327A (en) 2020-05-19

Family

ID=70655502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811339382.5A Pending CN111177327A (en) 2018-11-12 2018-11-12 Document boltzmann machine construction optimization method and device for document query

Country Status (1)

Country Link
CN (1) CN111177327A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023007270A1 (en) * 2021-07-26 2023-02-02 Carl Wimmer Foci analysis tool

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5479576A (en) * 1992-01-30 1995-12-26 Ricoh Company, Ltd. Neural network learning system inferring an input-output relationship from a set of given input and output samples
CN108763418A (en) * 2018-05-24 2018-11-06 辽宁石油化工大学 A kind of sorting technique and device of text


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Anders Skrondal: "Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models", 31 January 2011, Chongqing University Press *
Huang Liming: "Applying the Document Boltzmann Machine to Query Expansion", China Master's Theses Full-text Database, Information Science and Technology *


Similar Documents

Publication Publication Date Title
US9589208B2 (en) Retrieval of similar images to a query image
WO2021139262A1 (en) Document mesh term aggregation method and apparatus, computer device, and readable storage medium
CA2781105A1 (en) Automatically mining person models of celebrities for visual search applications
Huang et al. Topic detection from large scale of microblog stream with high utility pattern clustering
WO2016180308A1 (en) Video retrieval methods and apparatuses
CN111460153A (en) Hot topic extraction method and device, terminal device and storage medium
JP2009282980A (en) Method and apparatus for image learning, automatic notation, and retrieving
US11010411B2 (en) System and method automatically sorting ranked items and generating a visual representation of ranked results
JP2011198364A (en) Method of adding label to medium document and system using the same
CN104166684A (en) Cross-media retrieval method based on uniform sparse representation
CN112214335B (en) Web service discovery method based on knowledge graph and similarity network
CN114077705A (en) A method and system for profiling media accounts on social platforms
CN111538903B (en) Method and device for determining search recommended word, electronic equipment and computer readable medium
CN103778206A (en) Method for providing network service resources
CN112860685A (en) Automatic recommendation of analysis of data sets
Bhutada et al. Semantic latent dirichlet allocation for automatic topic extraction
CN111598712B (en) Training and searching method for data feature generator in social media cross-modal search
CN109948154A (en) A system and method for character acquisition and relationship recommendation based on mailbox name
WO2022116324A1 (en) Search model training method, apparatus, terminal device, and storage medium
CN103870489B (en) Chinese personal name based on search daily record is from extending recognition methods
CN105701227A (en) Cross-media similarity measure method and search method based on local association graph
CN110019763B (en) Text filtering method, system, equipment and computer readable storage medium
CN111177327A (en) Document boltzmann machine construction optimization method and device for document query
CN116032741A (en) Equipment identification method and device, electronic equipment and computer storage medium
CN118445406A (en) Integration system based on massive polymorphic circuit heritage information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20200519)