Article

Probabilistic Modeling with Matrix Product States

James Stokes and John Terilla
1 Flatiron Institute, New York, NY 10010, USA
2 Tunnel, New York, NY 10001, USA
* Author to whom correspondence should be addressed.
Entropy 2019, 21(12), 1236; https://doi.org/10.3390/e21121236
Submission received: 13 November 2019 / Revised: 10 December 2019 / Accepted: 12 December 2019 / Published: 17 December 2019
(This article belongs to the Section Quantum Information)

Abstract

Inspired by the possibility that generative models based on quantum circuits can provide a useful inductive bias for sequence modeling tasks, we propose an efficient training algorithm for a subset of classically simulable quantum circuit models. The gradient-free algorithm, presented as a sequence of exactly solvable effective models, is a modification of the density matrix renormalization group procedure adapted for learning a probability distribution. The conclusion that circuit-based models offer a useful inductive bias for classical datasets is supported by experimental results on the parity learning problem.

1. Introduction

The possibility of exponential speedups for certain linear algebra operations has inspired a wave of research into quantum algorithms for machine learning purposes [1]. Many of these exponential speedups hinge on assumptions of fault tolerant quantum devices and efficient data preparation, which are unlikely to be realized in the near future. Focus has thus shifted to hybrid quantum-classical algorithms which involve optimizing the parameters of a variational quantum circuit to prepare a desired quantum state and have the potential to be implemented on near-term intermediate scale quantum devices [2].
Hybrid quantum-classical algorithms have been found to solve difficult eigenvalue problems [3] and to perform hard combinatorial optimization [4]. A number of recent works consider unsupervised learning within the hybrid quantum-classical framework [5,6,7,8,9].
In the context of machine learning, as emphasized in [2], it is less clear whether variational hybrid quantum-classical algorithms offer advantages over existing purely classical algorithms. Density estimation, which attempts to learn a probability distribution from training data, has been suggested as an area in which to look for advantages [7] because a quantum advantage has been identified in the ability of quantum circuits to sample from certain probability distributions that are hard to sample from classically [10]. In high-dimensional density estimation relevant to machine learning, expressive power is only part of the story; indeed, algorithms in the high-dimensional regime rely crucially on their inductive bias. Do the highly expressive probability distributions implied by quantum circuits offer a useful inductive bias for modeling high-dimensional classical data? We address this question in this paper.
We work within the confines of a classically tractable subset of quantum states modeled by tensor networks, which may be thought of as those states that can be prepared by shallow quantum circuits. Even more narrowly, we restrict to matrix product states, which are akin to one-dimensional shallow circuits. Mathematically, tensor networks are a graphical calculus for describing interrelated matrix factorizations for which there exist polylogarithmic algorithms for a restricted set of linear algebra computations. We propose an unsupervised training algorithm for a generative model inspired by the density matrix renormalization group (DMRG) procedure. The training dynamics take place on the unit sphere of a Hilbert space, where, in contrast to many variational methods, a state is modified in a sequence of deterministic steps that do not involve gradients. The efficient access to certain vector operations afforded by the tensor network ansatz allows us to implement our algorithm in a purely classical fashion.
We experimentally probe the inductive bias of the model by training on the dataset $P_{20}$ consisting of bitstrings of length 20 having an even number of 1 bits. The algorithm rapidly learns the uniform distribution on $P_{20}$ to high precision, indicating that the tensor network quantum circuit model provides a useful inductive bias for this classical dataset, and the resulting trained model is small, containing only 336 parameters. The $P_{20}$ dataset can be frustrating to learn for other models, such as restricted Boltzmann machines (RBMs) trained with gradient-based methods. The difficulty of training RBMs to learn parity with contrastive divergence and related training algorithms is noted in [11]. The difficulty for other gradient-based deep learning methods on parity problems has been studied in [12]. To put the work in this paper in context, we note that generative modeling using tensor networks has been considered for several datasets for which classical neural models trained with gradient-based methods are successful [13,14]. We also note that shallow quantum circuits have already been successful for a related supervised parity classification problem [15].
In an effort to improve accessibility, we avoid the language of quantum-many body physics and quantum information and explain the algorithm and results in terms of elementary linear algebra and statistics. While this means some motivational material is omitted, we believe it sharpens the exposition. One exception is the visual language of tensor networks where the benefits of simplifying tensor contractions outweigh the costs of using elementary, but cumbersome, notation. We refer readers unfamiliar with tensor network notation to [16,17,18,19] or to the many other surveys.
The organization of the paper is as follows. In Section 2 we state the optimization problem at the population level and propose a finite-sample estimator. In Section 3 and Section 4 we describe an abstract discrete-time dynamical system evolving on the unit sphere of Hilbert space which optimizes our empirical objective by exactly solving an effective problem in a sequence of isometrically embedded Hilbert subspaces. In Section 5 we provide a concrete realization of this dynamical system for a class of tensor networks called matrix product states. Section 6 outlines experiments demonstrating that the proposed iterative solver successfully learns the parity language using limited data.

2. The Problem Formulation

Recall that a unit vector $\psi$ in a finite-dimensional Hilbert space $H$ defines a probability distribution $P_\psi$ on any orthonormal basis by setting the probability of each basis vector $e$ to be
$$P_\psi(e) := |\langle \psi, e \rangle|^2. \qquad (1)$$
We refer to the probability distribution $P_\psi$ in Equation (1) as the Born distribution induced by $\psi$.
Let $\pi$ be a probability distribution on a finite set $X$ and fix a field of scalars, either $\mathbb{R}$ or $\mathbb{C}$. Let $H$ be the free vector space on the set $X$. Use $|x\rangle$ to denote the vector in $H$ corresponding to the element $x \in X$. The space $H$ has a natural inner product defined by declaring the vectors $\{|x\rangle : x \in X\}$ to be an orthonormal basis.
Define a unit vector $\psi_\pi \in H$ by
$$\psi_\pi := \sum_{x \in X} \sqrt{\pi(x)}\,|x\rangle. \qquad (2)$$
Notice that $\psi_\pi$ realizes $\pi$ as a Born distribution:
$$\pi(x) = P_{\psi_\pi}(|x\rangle) \quad \text{for all } x \in X. \qquad (3)$$
The formula for $\psi_\pi$ as written in Equation (2) involves perfect knowledge of $\pi$ and unrestricted access to the Hilbert space $H$. This paper is concerned with situations when knowledge about $\pi$ is limited to a finite number of training examples, and $\psi$ is restricted to some tractable subset $M$ of the unit sphere.
At the population level, the problem to be solved is to find the closest approximation $\psi^*$ to $\psi_\pi$ within $M$,
$$\psi^* := \arg\min_{\psi \in M} \|\psi - \psi_\pi\|. \qquad (4)$$
We assume access to a sequence $(X_i)_{i=1}^n$ of samples drawn independently from $\pi$, giving rise to the associated empirical distribution
$$\hat\pi(x) := \frac{1}{n} \sum_{i=1}^n \delta_{X_i}(x). \qquad (5)$$
It is natural to define the following estimator, whose Born distribution coincides with the empirical distribution:
$$\psi_{\hat\pi} = \sum_{x \in X} \sqrt{\hat\pi(x)}\,|x\rangle. \qquad (6)$$
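As a concrete aside, the following minimal sketch (assuming Python with NumPy; the variable names are illustrative) traces Equations (5) and (6): form $\hat\pi$ from samples over a small set $X$, take square roots to get the coordinates of $\psi_{\hat\pi}$, and check that its Born distribution reproduces $\hat\pi$.

import numpy as np
from collections import Counter

# A small set X with a fixed ordering, and a stand-in list of i.i.d. samples.
X = ['a', 'b', 'c']
samples = ['a', 'b', 'a', 'c', 'a', 'b']

counts = Counter(samples)
pi_hat = np.array([counts[x] / len(samples) for x in X])   # empirical distribution, Equation (5)
psi_hat = np.sqrt(pi_hat)                                  # coordinates of psi_pi_hat, Equation (6)

born = np.abs(psi_hat) ** 2                                # Born probabilities |<psi, x>|^2
assert np.allclose(born, pi_hat)                           # Born distribution equals pi_hat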
We are thus led to consider the following optimization problem.
Problem 1.
Given a sequence $\{X_i\}_{i=1}^n$ of i.i.d. samples drawn from $\pi$ and a subset $M \subseteq \{\psi \in H : \|\psi\| = 1\}$ of the unit sphere in $H$, find
$$\hat\psi := \arg\min_{\psi \in M} \|\psi - \psi_{\hat\pi}\|. \qquad (7)$$
Our proposal differs from the existing literature on Born machines, which has employed log-likelihood objective functions minimized by gradient descent (see [20] for a review). As we will see, choosing the $\ell_2$ norm as the loss function allows analytical updates with guaranteed improvement. This should be contrasted with the log-likelihood objective, for which no such guarantee exists and gradient descent may diverge if the learning rate is not chosen appropriately.
Although the problem formulation contains no explicit regularization term, regularization is achieved implicitly by controlling the complexity of the model class $M$. In the experiments section, the model hypothesis class is defined by a small integer hyperparameter called the bond dimension. We solve the problem for several choices of bond dimension, using a held-out test set to measure overfitting and generalization. In the case where $X$ consists of strings, the associated Hilbert space $H$ has a dimension that is exponential in the string length. The model hypothesis class $M$ should be chosen so that the induced Born distribution $P_{\hat\psi}$ offers a useful inductive bias for modeling high-dimensional probability distributions over the space of sequences. We note, as an aside, that the plug-in estimator $\|\psi - \psi_{\hat\pi}\|$ is a biased estimator of the population objective $\|\psi - \psi_\pi\|$.

3. Outline of Our Approach to Solving the Problem

We present an algorithm that, given a fixed realization of data $(x_1, \ldots, x_n) \in X^n$ and an initial state $\psi_0 \in M$, produces a deterministic sequence $\{\psi_t\}_{t \geq 0}$ of unit vectors in $M$. The algorithm, which we call exact single-site DMRG, is a variation of the density matrix renormalization group (DMRG) procedure in which each step produces a vector closer to $\psi_{\hat\pi}$. The sequence is defined inductively as follows: given $\psi_t$, the inductive step defines a subspace $H_{t+1}$ of $H$ which contains $\psi_t$. Then $\psi_{t+1}$ is defined to be the unit vector in $H_{t+1}$ closest to $\psi_{\hat\pi}$. Inspired by ideas from the renormalization group, we provide an analytic formula for $\psi_{t+1}$. The fact that the distance to the target vector $\psi_{\hat\pi}$ decreases after each iteration follows as a simple consequence of the following facts:
$$\psi_t \in H_{t+1} \quad\text{and}\quad \psi_{t+1} = \arg\min_{\{\psi \in H_{t+1} : \|\psi\| = 1\}} \|\psi_{\hat\pi} - \psi\|. \qquad (8)$$
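The improvement guarantee can be illustrated numerically; the following sketch, with a random subspace standing in for $H_{t+1}$ (assuming NumPy; names are illustrative), checks that projecting the target onto a subspace containing $\psi_t$ and normalizing never increases the distance to the target.

import numpy as np

rng = np.random.default_rng(0)
normalize = lambda v: v / np.linalg.norm(v)

dim = 8
target = normalize(rng.normal(size=dim))     # stand-in for psi_pi_hat
psi_t = normalize(rng.normal(size=dim))      # current iterate, assumed to lie in H_{t+1}

# Orthonormal basis B for a 3-dimensional subspace H_{t+1} containing psi_t.
B, _ = np.linalg.qr(np.column_stack([psi_t, rng.normal(size=(dim, 2))]))

psi_next = normalize(B @ (B.T @ target))     # project onto H_{t+1}, then normalize

# The new iterate is never farther from the target than psi_t was.
assert np.linalg.norm(psi_next - target) <= np.linalg.norm(psi_t - target) + 1e-12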

4. Effective Versions of the Problem

Each proposal subspace H t mentioned in the previous section will be defined as the image of an “effective” space. We begin with a general description of an effective space.
Let $\alpha: H_{\mathrm{eff}} \to H$ be an isometric embedding of a Hilbert space $H_{\mathrm{eff}}$ into $H$. We refer to $H_{\mathrm{eff}}$ as the effective Hilbert space. The isometry $\alpha$ and its adjoint map $\alpha^*$ are summarized by the following diagram,
Entropy 21 01236 i001
The composition $\alpha^* \alpha = \mathrm{id}_{H_{\mathrm{eff}}}$ is the identity on $H_{\mathrm{eff}}$. The composition in the other order, $\alpha\alpha^*$, is an orthogonal projection onto $\alpha(H_{\mathrm{eff}})$, which is a subspace of $H$ isometrically isomorphic to $H_{\mathrm{eff}}$. Call this orthogonal projection $P$:
$$P := \alpha\alpha^*. \qquad (9)$$
The effective version of the problem formulated in Section 2 is to find the unit vector $\psi \in \alpha(H_{\mathrm{eff}})$ in the image of the effective Hilbert space that is closest to $\psi_{\hat\pi}$. This effective problem is solved exactly in two simple steps. The first step is orthogonal projection: $P(\psi_{\hat\pi})$ is the vector in $\alpha(H_{\mathrm{eff}})$ closest to $\psi_{\hat\pi}$. The second step is to normalize $P(\psi_{\hat\pi})$, which may not be a unit vector, to obtain the unit vector in $\alpha(H_{\mathrm{eff}})$ closest to $\psi_{\hat\pi}$.
Therefore, the analytic solution to the effective problem is $P(\psi_{\hat\pi}) / \|P(\psi_{\hat\pi})\|$ where
$$P(\psi_{\hat\pi}) = \alpha\alpha^* \psi_{\hat\pi} = \alpha\alpha^* \sum_{x \in X} \sqrt{\hat\pi(x)}\,|x\rangle = \alpha\Big( \sum_{x \in X} \sqrt{\hat\pi(x)}\,\alpha^*(|x\rangle) \Big). \qquad (10)$$
In the exact single-site DMRG algorithm, the space $\alpha(H_{\mathrm{eff}})$ is contained within our model hypothesis class $M$. We also offer a multi-site DMRG algorithm in Appendix A. In the multi-site algorithm, the analytic solution to the effective problem in $\alpha(H_{\mathrm{eff}})$ does not lie in $M$, so the solution to the effective problem needs to undergo an additional "model repair" step.
Before going on to the details of the algorithm, it might be helpful to look more closely at the solution to the effective problem. For each training example $x_i$, call the vector $\alpha^*(|x_i\rangle) \in H_{\mathrm{eff}}$ an effective data point. Then the argument of $\alpha$ in (10) becomes the weighted sum of effective data
$$\sum_{x \in X} \sqrt{\hat\pi(x)}\,\alpha^*(|x\rangle). \qquad (11)$$
The effective data are not necessarily mutually orthogonal, and so the vector in (11) will not in general be a unit vector. One may normalize to obtain a unit vector in $H_{\mathrm{eff}}$ and then apply $\alpha$ to obtain the analytic solution to the effective problem. Normalizing in $H_{\mathrm{eff}}$ and then applying $\alpha$ is the same as applying $\alpha$ and then normalizing in $H$, since $\alpha$ is an isometry.
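The following sketch traces this solution with a generic isometry standing in for $\alpha$ (assuming NumPy; the names are illustrative): the weighted sum of effective data points, normalized and embedded by $\alpha$, agrees with projecting $\psi_{\hat\pi}$ onto $\alpha(H_{\mathrm{eff}})$ and normalizing.

import numpy as np

rng = np.random.default_rng(1)
normalize = lambda v: v / np.linalg.norm(v)

dim, dim_eff = 16, 4
A, _ = np.linalg.qr(rng.normal(size=(dim, dim_eff)))   # isometry alpha as a matrix with orthonormal columns
alpha = lambda phi: A @ phi                            # H_eff -> H
alpha_star = lambda eta: A.T @ eta                     # adjoint H -> H_eff

pi_hat = rng.dirichlet(np.ones(dim))                   # stand-in empirical distribution on X
psi_hat = np.sqrt(pi_hat)                              # coordinates of psi_pi_hat

basis = np.eye(dim)
phi = sum(np.sqrt(pi_hat[i]) * alpha_star(basis[i]) for i in range(dim))   # weighted effective data, Equation (11)
solution = alpha(normalize(phi))                                           # normalize in H_eff, then embed

# Same vector as projecting psi_pi_hat onto alpha(H_eff) and normalizing, Equation (10).
assert np.allclose(solution, normalize(A @ (A.T @ psi_hat)))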

5. The Exact Single-Site DMRG Algorithm

Now specialize to the case that $\pi$ is a probability distribution on a set $X$ of sequences. Suppose that $X = A^N$ consists of sequences of length $N$ in a fixed alphabet $A = \{e_1, \ldots, e_d\}$. The Hilbert space $H$, defined as the free Hilbert space on $X$, has a natural tensor product structure $V^{\otimes N}$, where $V$ is the free Hilbert space on the alphabet $A$. We refer to $V$ as the site space. So in this situation, the vectors $\{|e_1\rangle, \ldots, |e_d\rangle\}$ are an orthonormal basis for the $d$-dimensional site space $V$ and the vectors
$$|e_{i_1} e_{i_2} \cdots e_{i_N}\rangle := |e_{i_1}\rangle \otimes |e_{i_2}\rangle \otimes \cdots \otimes |e_{i_N}\rangle \qquad (12)$$
are an orthonormal basis for the $d^N$-dimensional space $H = V^{\otimes N}$. We choose as model hypothesis class the subset $M \subseteq H$ consisting of normalized elements of $H$ that have a low-rank matrix product state (MPS) factorization. Vectors in this model hypothesis class have efficient representations, even in cases where the Hilbert space $H$ is of exponentially high dimension. For simplicity of presentation, we consider matrix product states with a single fixed bond space $W$, although everything that follows could be adapted to work with tensor networks without loops having arbitrary bond spaces.
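As a brief illustration of this model class, if each MPS tensor is stored as a real array of shape (D_left, d, D_right) with boundary bond dimensions equal to 1, the amplitude $\langle x | \psi \rangle$ of a basis string is a chain of matrix products (a sketch; the function name is illustrative):

import numpy as np

def mps_amplitude(mps, x):
    # <x|psi> for a basis string x = (x_1, ..., x_N); each mps[j] has shape
    # (D_left, d, D_right), with boundary bond dimensions equal to 1.
    v = np.ones(1)
    for tensor, xj in zip(mps, x):
        v = v @ tensor[:, xj, :]
    return v.item()

# For a normalized MPS, the Born probability of x is abs(mps_amplitude(mps, x)) ** 2.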
The exact single-site DMRG algorithm begins with an initial vector $\psi_0 \in M$ and produces $\psi_1, \psi_2, \ldots$ inductively by solving an effective problem in the subspace
$$H_{t+1} := \alpha_{t+1}(H_{\mathrm{eff},t+1}), \qquad (13)$$
which we now describe. Let us drop the subscript $t+1$ from the isometry $\alpha_{t+1}$ and the effective Hilbert space $H_{\mathrm{eff},t+1}$ in the relevant effective problem; just be aware that the embedding
$$\alpha: H_{\mathrm{eff}} \to H \qquad (14)$$
will change from step to step. The map $\alpha$ is defined using an MPS factorization of $\psi_t$ in mixed canonical form relative to a fixed site, which varies at each step according to a predetermined schedule. For the purposes of illustration, the third site is the fixed site in the pictures below.
Entropy 21 01236 i002
The effective space is $H_{\mathrm{eff}} = W \otimes V \otimes W$ and the isometric embedding $\alpha: W \otimes V \otimes W \to V^{\otimes N}$ is defined for any $\phi \in W \otimes V \otimes W$ by replacing the tensor at the fixed site of $\psi_t$ with $\phi$:
Entropy 21 01236 i003
To see that $\alpha$ is an isometry, use the gauge condition that the MPS factorization of $\psi_t$ is in mixed canonical form relative to the fixed site, as illustrated below:
Entropy 21 01236 i004
The adjoint map $\alpha^*: V^{\otimes N} \to W \otimes V \otimes W$ has a clean pictorial depiction as well.
Entropy 21 01236 i005
To see that $\alpha^*$ as pictured above is, in fact, the adjoint of $\alpha$, note that for any $\eta \in H$ and any $\phi \in H_{\mathrm{eff}}$, both $\langle \eta, \alpha(\phi) \rangle$ and $\langle \alpha^*(\eta), \phi \rangle$ result in the same tensor contraction:
Entropy 21 01236 i006
In the picture above, begin with the blue tensors. Contracting with the yellow tensor gives $\alpha(\phi)$, and then contracting with the red tensor gives $\langle \eta, \alpha(\phi) \rangle$. On the other hand, first contracting with the red tensor yields $\alpha^*(\eta)$, resulting in $\langle \alpha^*(\eta), \phi \rangle$ after contracting with the yellow tensor.
Now, Equation (10) describes an analytic solution for the vector in $H_{t+1} := \alpha(W \otimes V \otimes W)$ closest to $\psi_{\hat\pi}$, namely $\alpha(\phi / \|\phi\|)$ where
$$\phi = \sum_{x \in X} \sqrt{\hat\pi(x)}\,\alpha^*(|x\rangle). \qquad (20)$$
For each sample $|x_i\rangle = |e_{i_1} e_{i_2} \cdots e_{i_N}\rangle$, the effective data point $\alpha^*(|x_i\rangle) \in W \otimes V \otimes W$ is given by the contraction
Entropy 21 01236 i007
Once the effective form $\alpha^*(|x\rangle)$ of each distinct training example $|x\rangle$ has been computed, weighted by $\sqrt{\hat\pi(x)}$, summed, and normalized, one obtains an expression for the unit vector $\phi / \|\phi\| \in W \otimes V \otimes W$, depicted as follows,
Entropy 21 01236 i008
Finally, apply the map $\alpha$ to get $\psi_{t+1}$:
Entropy 21 01236 i009
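A sketch of how the contraction in Equation (21) and the update in Equations (22) and (23) might be computed numerically, assuming the array convention introduced above and working over the real field (the function names are illustrative and not taken from the accompanying code):

import numpy as np

def effective_data_point(mps, x, k):
    # alpha*(|x>) in W (x) V (x) W, assuming mps is in mixed canonical form
    # about the fixed site k and each tensor has shape (D_left, d, D_right).
    left = np.ones(1)
    for j in range(k):                        # contract the sites to the left of k
        left = left @ mps[j][:, x[j], :]
    right = np.ones(1)
    for j in range(len(mps) - 1, k, -1):      # contract the sites to the right of k
        right = mps[j][:, x[j], :] @ right
    site = np.zeros(mps[k].shape[1])
    site[x[k]] = 1.0                          # physical basis vector at the fixed site
    return np.einsum('a,s,b->asb', left, site, right)

def single_site_update(mps, samples, probs, k):
    # phi = sum over distinct samples x of sqrt(pi_hat(x)) * alpha*(|x>),
    # normalized and re-inserted at the fixed site, giving psi_{t+1}.
    phi = sum(np.sqrt(p) * effective_data_point(mps, x, k)
              for x, p in zip(samples, probs))
    mps[k] = phi / np.linalg.norm(phi)
    return mps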
To complete the description of the exact single-site DMRG algorithm, we need to choose a schedule in which to update the tensors. We use the following schedule for the fixed site at each step, organized into back-and-forth sweeps:
$$\underbrace{1, 2, 3, \ldots, N-1, N, N-1, \ldots, 3, 2,}_{\text{Sweep 1}}\ \underbrace{1, 2, \ldots, N-1, N, N-1, \ldots, 2,}_{\text{Sweep 2}}\ 1, 2, \ldots$$
A schedule that proceeds by moving the fixed site one position at a time allows us to take advantage of two efficiencies, resulting in an algorithm that is linear in both the number of training examples $n$ and the number of sites $N$. One efficiency is that most of the calculations of the effective data in Equation (21) used to compute $\psi_{t+1}$ can be reused when computing $\psi_{t+2}$. The second efficiency is that when inserting the updated tensor in Equation (22), it can be done so that the resulting MPS factorization of $\psi_{t+1}$, as pictured in Equation (23), will be in mixed canonical form relative to a site adjacent to the updated tensor, which avoids a costly gauge fixing step.
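The sweep schedule and the local gauge move that keeps the factorization in mixed canonical form might be sketched as follows (same array convention as above; a QR step is one standard way to shift the canonical centre, though the released implementation may differ):

import numpy as np

def shift_center_right(mps, k):
    # Move the canonical centre from site k to site k+1 with a local QR step,
    # so cached left contractions remain valid and no global gauge fixing is needed.
    D_left, d, D_right = mps[k].shape
    Q, R = np.linalg.qr(mps[k].reshape(D_left * d, D_right))
    mps[k] = Q.reshape(D_left, d, -1)                     # now left-orthonormal
    mps[k + 1] = np.einsum('ab,bsc->asc', R, mps[k + 1])  # absorb R into the next site
    return mps

def sweep_schedule(N):
    # Fixed-site schedule 1, ..., N-1, N, N-1, ..., 2, then repeat (0-indexed here).
    while True:
        yield from range(N - 1)
        yield from range(N - 1, 0, -1)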

6. Experiments

This section considers the problem of unsupervised learning of probability distributions on bitstrings of fixed length (code available online: https://github.com/TunnelTechnologies/dmrg-exact). The first problem we consider is the parity language $P_N$, which consists of bitstrings of length $N$ containing an even number of 1 bits. The goal of this task is to learn the probability distribution $p$ which assigns uniform mass to each bitstring in $P_N$ and zero mass elsewhere. More explicitly,
$$p(x) = \frac{1}{|P_N|}\,\mathbb{I}_{P_N}(x) = \begin{cases} 1/|P_N|, & x \in P_N, \\ 0, & x \notin P_N, \end{cases}$$
where $\mathbb{I}_{P_N}: \{0,1\}^N \to \{0,1\}$ denotes the indicator function of the subset $P_N \subseteq \{0,1\}^N$. The above unsupervised learning problem is harder than the parity classification problem considered in [12] because the training signal does not exploit data labels. Of the total $|P_N| = 2^{N-1}$ such bitstrings, we reserved random disjoint subsets of size 2% for training, cross-validation, and testing purposes. A negative log-likelihood (NLL) of $N-1$ corresponds to the entropy of the uniform distribution on $P_N$. If the model memorizes the training set, it will assign to it an NLL of $N-1+\log_2(0.02)$, corresponding to the entropy of the uniform distribution on the training data. An NLL of $N$ corresponds to the entropy of the uniform distribution on all bitstrings of length $N$. The measure of generalization performance is the gap $\epsilon$ between the NLL of the training and testing data. We performed exact single-site DMRG over the real number field using the $P_{20}$ dataset for different choices of bond dimension, which refers to the dimensionality of the bond space $W$ in the effective Hilbert space $H_{\mathrm{eff}} = W \otimes V \otimes W$. Training was terminated according to an early stopping criterion determined by the distance between the MPS state and the state of the cross-validation sample. Since the bond dimension controls the complexity of the model class, and since matrix product states are universal approximators of functions on $\{0,1\}^N$, we expect overfitting to occur for sufficiently large bond dimension. Indeed, the NLL as a function of bond dimension reported in Figure 2 displays the expected bias-variance tradeoff, with optimal model complexity occurring at bond dimension 3 with corresponding generalization gap $\epsilon = 0.0237$.
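The reference NLL values quoted above follow from simple counting; a sketch for $N = 20$ and a 2% training fraction (plain Python; the variable names are illustrative):

import math

N = 20
P_N = [s for s in (format(i, f'0{N}b') for i in range(2 ** N)) if s.count('1') % 2 == 0]
assert len(P_N) == 2 ** (N - 1)

train_fraction = 0.02
nll_uniform_on_parity = N - 1                              # entropy of the uniform distribution on P_N
nll_memorize_training = N - 1 + math.log2(train_fraction)  # uniform on the 2% training subset
nll_uniform_all_strings = N                                # uniform on all length-N bitstrings

print(nll_uniform_on_parity, round(nll_memorize_training, 3), nll_uniform_all_strings)
# prints: 19 13.356 20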
The second problem we consider is unsupervised learning of the divisible-by-7 language, which consists of the binary representations of integers divisible by 7. The dataset was constructed using the first 149,797 such integers, which lie in the range $[1, 2^{20}]$. We trained a length-20 MPS to learn the uniform distribution on the divisible-by-7 language as we did for $P_{20}$, except utilizing subsets of size 10% for training, testing, and cross-validation. Figure 3 illustrates that the model trained with exact single-site DMRG at bond dimension 8 learns the div7 dataset with nearly perfect accuracy, producing a model with a generalization gap of $\epsilon = 0.032$.

7. Discussion

A number of recent works have explored the parity dataset using restricted Boltzmann machines (RBMs) and found it difficult to learn, even in experiments that train on the entire dataset [11,21]. Recall that an RBM is a universal approximator of distributions on $\{0,1\}^N$, given sufficiently many hidden units. Ref. [21] proved that any probability distribution on $\{0,1\}^N$ can be approximated within $\epsilon$ in KL-divergence by an RBM with $m \geq 2^{(N-1)(1-\epsilon)} + 0.1$ hidden units. For $P_{20}$ this bound works out to be about $4 \times 10^5$ hidden nodes. It would be interesting to know whether it could be learned with significantly fewer.
It is not difficult to train a feedforward neural network to classify bitstrings by parity using labelled data, but we do not know if there are unsupervised generative neural models that do well learning $P_N$. Additionally, quantum circuits can be trained to classify labelled data [15]. It is reasonable to expect that recurrent models whose training involves conditional probabilities $\pi(x_1, \ldots, x_k \mid x_{k+1}, \ldots, x_N)$ might be frustrated by $P_N$, since the conditional distributions contain no information: any bitstring of length less than $N$ has the same number of completions in $P_N$ as not in $P_N$.
The reader may be interested in [22,23], where quantum models are used to learn classical data. Those works considered quantum Boltzmann machines (QBMs), which were shown to learn the distribution more effectively than their classical counterparts using the same dataset. The complexity of classically simulating a QBM scales exponentially with the number of sites, in contrast to the tensor network algorithms presented here, which scale linearly in the number of sites (for fixed bond dimension).
The main goal of this paper is to demonstrate the existence of classical datasets for which tensor network models trained via DMRG learn more effectively than generative neural models. It will be interesting to understand better how and why [24].

8. Conclusions and Outlook

The essence of DMRG in the quantum physics literature is to solve an eigenvalue problem in a high-dimensional Hilbert space $H$ by iteratively solving an effective eigenvalue problem in an isometrically embedded Hilbert subspace $H_{\mathrm{eff}} \subseteq H$. In this paper we have shown how similar reasoning allows one to solve a high-dimensional distribution estimation problem by iteratively solving a related linear algebra problem in an effective Hilbert space. The proposed algorithm offers a number of advantages over existing gradient-based techniques, including a guaranteed improvement at each iteration, and it empirically performs well on tasks for which gradient-based methods are known to fail.

Author Contributions

Formal analysis, J.S. and J.T.; Software, J.S. and J.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Tunnel.

Acknowledgments

The authors thank Tai-Danae Bradley, Giuseppe Carleo, Joseph Hirsh, Maxim Kontsevich, Jackie Shadlen, Miles Stoudenmire, and Yiannis Vlassopoulos for many helpful conversations.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Multi-Site DMRG

For completeness we now describe a related multi-site DMRG algorithm. The model class $M$ now consists of normalized vectors with matrix product factorizations, with possibly different bond spaces having dimension less than a fixed upper bound. The algorithm begins with an initial vector $\psi_0 \in M$ and produces $\psi_1, \psi_2, \ldots$ inductively. The inductive step is similar in that we solve an effective problem in the image of an effective Hilbert space
$$H_{t+1} := \alpha_{t+1}(H_{\mathrm{eff},t+1}) \qquad (A1)$$
to find the unit vector in $H_{t+1}$ that is closest to the target state $\psi_{\hat\pi}$, which we now denote with a tilde:
$$\tilde\psi_{t+1} := \arg\min_{\{\psi \in H_{t+1} : \|\psi\| = 1\}} \|\psi_{\hat\pi} - \psi\|. \qquad (A2)$$
In multi-site DMRG, as opposed to single-site DMRG, the image of the effective space $H_{t+1}$ is not contained in the MPS model hypothesis class $M$. So the solution $\tilde\psi_{t+1}$ to the effective problem must undergo a "model repair" step
$$\tilde\psi_{t+1} \mapsto \psi_{t+1} \qquad (A3)$$
to produce a vector $\psi_{t+1} \in M$. In summary:
  • Use $\psi_t$ to define an isometric embedding $\alpha_{t+1}: H_{\mathrm{eff}} \to H$ with $\psi_t \in H_{t+1} := \alpha_{t+1}(H_{\mathrm{eff}})$.
  • Let $\tilde\psi_{t+1}$ be the unit vector in $H_{t+1}$ closest to $\psi_{\hat\pi}$.
  • Perform a model repair of $\tilde\psi_{t+1}$ to obtain a vector $\psi_{t+1} \in M$. There are multiple ways to do the model repair.
In order to define the effective problem in the inductive step of multi-site DMRG, one uses an MPS factorization of $\psi_t$ in mixed canonical gauge relative to an interval of $r$ sites. In the picture below, the interval consists of the two sites 3 and 4.
Entropy 21 01236 i010
The effective Hilbert space is $H_{\mathrm{eff}} = W_L \otimes V^{\otimes r} \otimes W_R$, where $W_L$ and $W_R$ are the bond spaces to the left and right of the fixed interval of sites and $r$ is the length of the chosen interval. The map $\alpha: W_L \otimes V^{\otimes r} \otimes W_R \to V^{\otimes N}$ is given by replacing the interval of sites and contracting:
Entropy 21 01236 i011
The map $\alpha$ and its adjoint $\alpha^*$ are described by, and have properties proved by, pictures completely analogous to those detailed for single-site DMRG in Section 5. The effective problem is also solved the same way. What is not the same is that the vector in $H_{t+1} = \alpha(W_L \otimes V^{\otimes r} \otimes W_R)$ which solves the effective problem is outside of the model class $M$, and so one performs a model repair step $\tilde\psi_{t+1} \mapsto \psi_{t+1}$, pictured graphically in $H_{\mathrm{eff}}$ by:
Entropy 21 01236 i012
One way to perform the model repair is to choose
$$\psi_{t+1} := \arg\min_{\psi \in M \cap H_{t+1}} \|\psi - \tilde\psi_{t+1}\|, \qquad (A7)$$
but the flexibility of the model repair step allows for other possibilities. One can use the model repair to implement a dynamic tradeoff between proximity to $\tilde\psi_{t+1}$ and other constraints of interest, such as bond dimension. Many of these implementations have good algorithms arising from singular value decompositions, which are manageable in the effective Hilbert space. Let us denote such a model repair choice by $\psi^{\mathrm{SVD}}_{t+1}$. Be aware that even if $\psi^{\mathrm{SVD}}_{t+1}$ is the vector in $M \cap H_{t+1}$ nearest to $\tilde\psi_{t+1}$, as in Equation (A7), there is no guarantee that $\psi^{\mathrm{SVD}}_{t+1}$ will be nearer to $\psi_{\hat\pi}$ than the previous iterate. In fact, we have experimentally observed the sequence obtained by this kind of model repair to move away from $\psi_{\hat\pi}$. See Figure A1 for an illustration of this possibility.
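One common realization of such an SVD-based repair, sketched here under the assumption that the $r = 2$ effective tensor is stored as an array of shape (D_L, d, d, D_R) (the function name is illustrative), truncates to a chosen maximal bond dimension and renormalizes:

import numpy as np

def svd_model_repair(theta, chi):
    # Truncate a two-site effective tensor theta of shape (D_L, d, d, D_R)
    # back to an MPS pair with bond dimension at most chi, renormalizing so
    # the repaired state is again a unit vector.
    D_L, d1, d2, D_R = theta.shape
    U, S, Vh = np.linalg.svd(theta.reshape(D_L * d1, d2 * D_R), full_matrices=False)
    U, S, Vh = U[:, :chi], S[:chi], Vh[:chi, :]
    S = S / np.linalg.norm(S)
    left = U.reshape(D_L, d1, -1)                     # left-orthonormal site tensor
    right = (np.diag(S) @ Vh).reshape(-1, d2, D_R)    # carries the truncated weights
    return left, right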
Figure A1. The shaded region represents the model class $M$. The red points all lie in $H_{t+1}$. The vector $\tilde\psi_{t+1}$ is defined to be the unit vector in $H_{t+1}$ closest to the target $\psi_{\hat\pi}$. Note that $\tilde\psi_{t+1}$ does not lie in $M$. The vector $\psi^{\mathrm{SVD}}_{t+1}$ is defined to be the vector in $M \cap H_{t+1}$ closest to $\tilde\psi_{t+1}$. In this picture, $\|\psi^{\mathrm{SVD}}_{t+1} - \psi_{\hat\pi}\| > \|\psi_t - \psi_{\hat\pi}\|$. There may be a point, such as the one labelled $\psi^{\mathrm{better}}_{t+1}$, which lies in $M \cap H_{t+1}$ and is closer to $\psi_{\hat\pi}$ than $\psi^{\mathrm{SVD}}_{t+1}$, notwithstanding the fact that it is further from $\tilde\psi_{t+1}$. This figure, to scale, depicts a scenario in which $\|\psi_t - \psi_{\hat\pi}\| = 0.09$, $\|\psi^{\mathrm{SVD}}_{t+1} - \psi_{\hat\pi}\| = 0.10$, $\|\psi^{\mathrm{better}}_{t+1} - \psi_{\hat\pi}\| = 0.07$, $\|\tilde\psi_{t+1} - \psi_{\hat\pi}\| = 0.06$, $\|\psi^{\mathrm{SVD}}_{t+1} - \tilde\psi_{t+1}\| = 0.07$, and $\|\psi^{\mathrm{better}}_{t+1} - \tilde\psi_{t+1}\| = 0.08$.
Entropy 21 01236 g0a1
One might hope to improve the model repair step, say by preconditioning the singular value decomposition in a way that is knowledgeable about the target $\psi_{\hat\pi}$. For the experiments reported in this paper, single-site DMRG consistently outperformed multi-site DMRG for several choices of model repair step, and we include multi-site DMRG only for pedagogical reasons. The adaptability of the bond dimension afforded by the multi-site DMRG algorithm could provide benefits that outweigh the challenges of good model repair in some situations.

References

  1. Biamonte, J.; Wittek, P.; Pancotti, N.; Rebentrost, P.; Wiebe, N.; Lloyd, S. Quantum machine learning. Nature 2017, 549, 195.
  2. Preskill, J. Quantum Computing in the NISQ era and beyond. Quantum 2018, 2, 79.
  3. Peruzzo, A.; McClean, J.; Shadbolt, P.; Yung, M.H.; Zhou, X.Q.; Love, P.J.; Aspuru-Guzik, A.; O’Brien, J.L. A variational eigenvalue solver on a photonic quantum processor. Nat. Commun. 2014, 5, 4213.
  4. Farhi, E.; Goldstone, J.; Gutmann, S. A quantum approximate optimization algorithm. arXiv 2014, arXiv:1411.4028.
  5. Huggins, W.; Patil, P.; Mitchell, B.; Whaley, K.B.; Stoudenmire, E.M. Towards quantum machine learning with tensor networks. Quantum Sci. Technol. 2019, 4, 024001.
  6. Liu, J.G.; Wang, L. Differentiable learning of quantum circuit Born machines. Phys. Rev. A 2018, 98, 062324.
  7. Benedetti, M.; Garcia-Pintos, D.; Perdomo, O.; Leyton-Ortega, V.; Nam, Y.; Perdomo-Ortiz, A. A generative modeling approach for benchmarking and training shallow quantum circuits. npj Quantum Inf. 2019, 5, 45.
  8. Du, Y.; Hsieh, M.H.; Liu, T.; Tao, D. The expressive power of parameterized quantum circuits. arXiv 2018, arXiv:1810.11922.
  9. Killoran, N.; Bromley, T.R.; Arrazola, J.M.; Schuld, M.; Quesada, N.; Lloyd, S. Continuous-variable quantum neural networks. Phys. Rev. Res. 2019, 1, 033063.
  10. Shepherd, D.; Bremner, M.J. Temporally unstructured quantum computation. Proc. R. Soc. A Math. Phys. Eng. Sci. 2015, 465.
  11. Romero, E.; Mazzanti Castrillejo, F.; Delgado, J.; Buchaca, D. Weighted Contrastive Divergence. Neural Netw. 2018.
  12. Shalev-Shwartz, S.; Shamir, O.; Shammah, S. Failures of Gradient-Based Deep Learning. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, 6–11 August 2017; pp. 3067–3075.
  13. Han, Z.Y.; Wang, J.; Fan, H.; Wang, L.; Zhang, P. Unsupervised generative modeling using matrix product states. Phys. Rev. X 2018, 8, 031012.
  14. Cheng, S.; Wang, L.; Xiang, T.; Zhang, P. Tree tensor networks for generative modeling. Phys. Rev. B 2019, 99, 155131.
  15. Farhi, E.; Neven, H. Classification with quantum neural networks on near term processors. arXiv 2018, arXiv:1802.06002.
  16. Stoudenmire, E.M. The Tensor Network. 2019. Available online: https://tensornetwork.org (accessed on 13 February 2019).
  17. Schollwöck, U. The density-matrix renormalization group in the age of matrix product states. Ann. Phys. 2011, 326, 96–192.
  18. Bridgeman, J.C.; Chubb, C.T. Hand-waving and interpretive dance: An introductory course on tensor networks. J. Phys. A Math. Theor. 2017, 50, 223001.
  19. Orús, R. A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Ann. Phys. 2014, 349, 117–158.
  20. Glasser, I.; Sweke, R.; Pancotti, N.; Eisert, J.; Cirac, J.I. Expressive power of tensor-network factorizations for probabilistic modeling, with applications from hidden Markov models to quantum machine learning. arXiv 2019, arXiv:1907.03741.
  21. Montúfar, G.F.; Rauh, J.; Ay, N. Expressive Power and Approximation Errors of Restricted Boltzmann Machines. In Proceedings of the 24th International Conference on Neural Information Processing Systems (NIPS’11), Granada, Spain, 12–15 December 2011; Curran Associates Inc.: Red Hook, NY, USA, 2011; pp. 415–423.
  22. Amin, M.H.; Andriyash, E.; Rolfe, J.; Kulchytskyy, B.; Melko, R. Quantum Boltzmann machine. Phys. Rev. X 2018, 8, 021050.
  23. Kappen, H.J. Learning quantum models from quantum or classical data. arXiv 2018, arXiv:1803.11278.
  24. Bradley, T.D.; Stoudenmire, E.M.; Terilla, J. Modeling Sequences with Quantum States: A Look Under the Hood. arXiv 2019, arXiv:1910.07425.
Figure 1. A bird’s eye view of the training dynamics of exact single-site DMRG on the unit sphere. (a) The initial vector $\psi_0$ and the vector $\psi_{\hat\pi}$ lie in the unit sphere of $H$. (b) The vector $\psi_0$ is used to define the subspace $H_1$. The unit vectors in $H_1$ define a lower dimensional sphere in $H$ (in blue). The vector $\psi_1$ is the vector in that sphere that is closest to $\psi_{\hat\pi}$. (c) The vector $\psi_1$ is used to define the subspace $H_2$. The unit sphere in $H_2$ (in blue) contains $\psi_1$ but does not contain $\psi_0$. The vector $\psi_2$ is the unit vector in $H_2$ closest to $\psi_{\hat\pi}$. (d) The vector $\psi_2$ is used to define the subspace $H_3$. The vector $\psi_3$ is the unit vector in $H_3$ closest to $\psi_{\hat\pi}$. And so on.
Entropy 21 01236 g001
Figure 2. A representative bias-variance tradeoff curve showing negative log-likelihood (base 2) as a function of bond dimension for exact single-site DMRG on the $P_{20}$ dataset. For bond dimension 3, the generalization gap is approximately $\epsilon = 0.0237$. For reference, the uniform distribution on bitstrings has an NLL of 20. Memorizing the training data would yield an NLL of approximately 13.356.
Entropy 21 01236 g002
Figure 3. A representative bias-variance tradeoff curve showing negative log-likelihood (base 2) as a function of bond dimension for exact single-site DMRG on the div7 dataset. For bond dimension 8, the generalization gap is approximately $\epsilon = 0.032$. For reference, the uniform distribution on bitstrings has an NLL of 20, the target distribution has an NLL of 17.192, and memorizing the training data would yield an NLL of approximately 13.87.
Entropy 21 01236 g003
