Search Results (5)

Search Parameters:
Keywords = skew Bhattacharyya divergences

16 pages, 656 KiB  
Article
Divergences Induced by the Cumulant and Partition Functions of Exponential Families and Their Deformations Induced by Comparative Convexity
by Frank Nielsen
Entropy 2024, 26(3), 193; https://doi.org/10.3390/e26030193 - 23 Feb 2024
Viewed by 1309
Abstract
Exponential families are statistical models that serve as workhorses in statistics, information theory, and machine learning, among other fields. An exponential family can be normalized either subtractively by its cumulant or free-energy function, or equivalently divisively by its partition function. Both the cumulant and partition functions are strictly convex and smooth functions inducing corresponding pairs of Bregman and Jensen divergences. It is well known that skewed Bhattacharyya distances between the probability densities of an exponential family amount to skewed Jensen divergences induced by the cumulant function between their corresponding natural parameters, and that in limit cases the sided Kullback–Leibler divergences amount to reverse-sided Bregman divergences. In this work, we first show that the α-divergences between non-normalized densities of an exponential family amount to scaled α-skewed Jensen divergences induced by the partition function. We then show how comparative convexity with respect to a pair of quasi-arithmetical means allows both convex functions and their arguments to be deformed, thereby defining dually flat spaces with corresponding divergences when ordinary convexity is preserved. Full article
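As a quick, hedged illustration of the identity recalled in this abstract, the sketch below numerically checks that the α-skewed Bhattacharyya distance between two univariate Gaussians equals the α-skewed Jensen divergence induced by the Gaussian cumulant function on their natural parameters; the function names and the test parameters are ours, not the paper's.

```python
# A minimal numerical check (our own): the alpha-skewed Bhattacharyya distance
# between two members of an exponential family equals the alpha-skewed Jensen
# divergence induced by the cumulant F on their natural parameters:
#   D_{B,alpha}(p_{t1}:p_{t2}) = alpha F(t1) + (1-alpha) F(t2) - F(alpha t1 + (1-alpha) t2).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def cumulant(t1, t2):
    # Cumulant (log-normalizer) of the univariate Gaussian family in natural
    # coordinates t = (mu / sigma^2, -1 / (2 sigma^2)).
    return -t1**2 / (4.0 * t2) + 0.5 * np.log(-np.pi / t2)

def natural(mu, sigma):
    return mu / sigma**2, -1.0 / (2.0 * sigma**2)

def skew_jensen(F, ta, tb, alpha):
    # alpha-skewed Jensen divergence J_{F,alpha}(ta : tb).
    mix = tuple(alpha * a + (1.0 - alpha) * b for a, b in zip(ta, tb))
    return alpha * F(*ta) + (1.0 - alpha) * F(*tb) - F(*mix)

def skew_bhattacharyya(p, q, alpha):
    # alpha-skewed Bhattacharyya distance -log int p^alpha q^(1-alpha) dx.
    val, _ = quad(lambda x: p.pdf(x)**alpha * q.pdf(x)**(1.0 - alpha),
                  -np.inf, np.inf)
    return -np.log(val)

mu1, s1, mu2, s2, alpha = 0.0, 1.0, 1.0, 2.0, 0.3
lhs = skew_bhattacharyya(norm(mu1, s1), norm(mu2, s2), alpha)
rhs = skew_jensen(cumulant, natural(mu1, s1), natural(mu2, s2), alpha)
print(lhs, rhs)  # the two values agree up to numerical quadrature error
```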
Figures:
Figure 1: Strictly log-convex functions form a proper subset of strictly convex functions.
Figure 2: The canonical divergence D and dual canonical divergence D* on a dually flat space M equipped with potential functions F and F* can be viewed as single-parameter contrast functions on the product manifold M×M: the divergence D can be expressed using either the θ×θ-coordinate system as a Bregman divergence or the mixed θ×η-coordinate system as a Fenchel–Young divergence. Similarly, the dual divergence D* can be expressed using either the η×η-coordinate system as a dual Bregman divergence or the mixed η×θ-coordinate system as a dual Fenchel–Young divergence.
Figure 3: Statistical divergences between normalized densities p_θ and unnormalized densities p̃_θ of an exponential family E, together with the corresponding divergences between their natural parameters. Without loss of generality, a natural exponential family is considered (i.e., t(x) = x and k(x) = 0) with cumulant function F and partition function Z, where J_F and B_F denote the Jensen and Bregman divergences induced by the generator F. The statistical divergences D_{R,α} and D_{B,α} denote the Rényi α-divergences and the skewed α-Bhattacharyya distances, respectively. The superscript "s" indicates rescaling by the multiplicative factor 1/(α(1−α)), while the superscript "*" denotes the reverse divergence obtained by swapping the parameter order.
35 pages, 988 KiB  
Article
Revisiting Chernoff Information with Likelihood Ratio Exponential Families
by Frank Nielsen
Entropy 2022, 24(10), 1400; https://doi.org/10.3390/e24101400 - 1 Oct 2022
Cited by 7 | Viewed by 4862
Abstract
The Chernoff information between two probability measures is a statistical divergence measuring their deviation, defined as their maximally skewed Bhattacharyya distance. Although the Chernoff information was originally introduced for bounding the Bayes error in statistical hypothesis testing, the divergence has since found many other applications, owing to its empirical robustness, in areas ranging from information fusion to quantum information. From the viewpoint of information theory, the Chernoff information can also be interpreted as a minmax symmetrization of the Kullback–Leibler divergence. In this paper, we first revisit the Chernoff information between two densities of a measurable Lebesgue space by considering the exponential families induced by their geometric mixtures: the so-called likelihood ratio exponential families. Second, we show how to (i) solve exactly the Chernoff information between any two univariate Gaussian distributions or obtain a closed-form formula using symbolic computing, (ii) report a closed-form formula for the Chernoff information of centered Gaussians with scaled covariance matrices, and (iii) use a fast numerical scheme to approximate the Chernoff information between any two multivariate Gaussian distributions. Full article
(This article belongs to the Section Information Theory, Probability and Statistics)
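As a rough, hedged companion to the numerical scheme mentioned in the abstract (this is not the paper's own code, which relies on a dichotomic search for the optimal skewing parameter), the sketch below approximates the Chernoff information between two univariate Gaussians by minimizing the strictly convex log-normalizer F_pq(α) of the induced likelihood ratio exponential family over α ∈ (0, 1); the helper names and the test densities are illustrative.

```python
# A hedged sketch: approximate the Chernoff information
#   C(p, q) = max_{alpha in (0,1)} -log int p(x)^alpha q(x)^(1-alpha) dx
# between two univariate Gaussians by minimizing the strictly convex
# log-normalizer F_pq(alpha) of the induced likelihood ratio exponential family.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def log_normalizer(alpha, p, q):
    # F_pq(alpha) = log int p(x)^alpha q(x)^(1-alpha) dx, by numerical quadrature.
    val, _ = quad(lambda x: p.pdf(x)**alpha * q.pdf(x)**(1.0 - alpha),
                  -np.inf, np.inf)
    return np.log(val)

def chernoff_information(p, q, tol=1e-9):
    # Return (alpha_star, C(p, q)): a 1D convex minimization over (0, 1).
    res = minimize_scalar(log_normalizer, bounds=(tol, 1.0 - tol), args=(p, q),
                          method="bounded", options={"xatol": tol})
    return res.x, -res.fun

p, q = norm(0.0, 1.0), norm(1.0, 2.0)  # illustrative densities (mean, std dev)
alpha_star, chernoff = chernoff_information(p, q)
print(alpha_star, chernoff)  # optimal skewing parameter and Chernoff information
```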
Figures:
Graphical abstract
Figure 1: Plot of the Bhattacharyya distance D_{B,α}(p:q) (strictly concave, shown in blue) and the log-normalizer F_pq(α) of the induced LREF E_pq (strictly convex, shown in red) for two univariate normal densities p = p_{0,1} (the standard normal) and q = p_{1,2}: the curves D_{B,α}(p:q) = −F_pq(α) are horizontally mirror symmetric to each other. The Chernoff information optimal skewing value α* between these two univariate normal distributions can be calculated exactly in closed form (see Section 5.2); it is approximated numerically here as α* ≈ 0.4215580558605244 for plotting the vertical grey line.
Figure 2: The unique parameter α* defining the Chernoff information optimal skewing is found by setting the derivative of the strictly convex function F_pq(α) to zero. At the optimal value α*, we have D_C[p:q] = D_KL[(pq)^G_{α*}:p] = D_KL[(pq)^G_{α*}:q] = −F(α*) > 0.
Figure 3: The Chernoff information distribution (PQ)^G_{α*} with density (pq)^G_{α*} is obtained as the unique intersection of the exponential arc γ^G(p,q) linking density p to density q in L¹(μ) with the left-sided Kullback–Leibler divergence bisector Bi^left_KL(p,q) of p and q: (pq)^G_{α*} = γ^G(p,q) ∩ Bi^left_KL(p,q).
Figure 4: Illustration of the dichotomic search for approximating the optimal skewing parameter α* to within some prescribed numerical precision ϵ > 0.
Figure 5: Taxonomy of exponential families: regular (and always steep) or steep (but not necessarily regular). The Kullback–Leibler divergence between two densities of a regular exponential family amounts to dual Bregman divergences.
Figure 6: The Chernoff information optimal skewing parameter α* for two densities p_{θ1} and p_{θ2} of a regular exponential family E, inducing an exponential family dually flat manifold M = ({p_θ}, g_F = ∇²F(θ), ∇^m, ∇^e), is characterized by the intersection of their ∇^e-flat exponential geodesic with their mixture bisector, a ∇^m-flat right-sided Bregman bisector.
Figure 7: Interpolation along the e-geodesic and the m-geodesic passing through two given multivariate normal distributions. No closed form is known for the Riemannian geodesic with respect to the Levi–Civita metric connection (shown in dashed style).
Figure 8: The natural parameter space of the non-regular full exponential family of singly truncated normal distributions is not open (i.e., the family is not regular): the negative real axis corresponds to the exponential family of exponential distributions.
21 pages, 1068 KiB  
Article
Statistical Divergences between Densities of Truncated Exponential Families with Nested Supports: Duo Bregman and Duo Jensen Divergences
by Frank Nielsen
Entropy 2022, 24(3), 421; https://doi.org/10.3390/e24030421 - 17 Mar 2022
Cited by 11 | Viewed by 5414
Abstract
By calculating the Kullback–Leibler divergence between two probability measures belonging to different exponential families dominated by the same measure, we obtain a formula that generalizes the ordinary Fenchel–Young divergence. Inspired by this formula, we define the duo Fenchel–Young divergence and report a majorization condition on its pair of strictly convex generators, which guarantees that this divergence is always non-negative. The duo Fenchel–Young divergence is also equivalent to a duo Bregman divergence. We show how to use these duo divergences by calculating the Kullback–Leibler divergence between densities of truncated exponential families with nested supports, and report a formula for the Kullback–Leibler divergence between truncated normal distributions. Finally, we prove that the skewed Bhattacharyya distances between truncated exponential families amount to equivalent skewed duo Jensen divergences. Full article
(This article belongs to the Special Issue Information and Divergence Measures)
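To make the duo Bregman divergence concrete, here is a small numerical sketch, entirely ours, assuming the form B_{F1,F2}(θ:θ′) = F1(θ) − F2(θ′) − (θ − θ′)F2′(θ′); this form reproduces the duo half squared Euclidean distance shown in Figure 5 below, and the check illustrates the majorization condition F1 ≥ F2 that guarantees non-negativity.

```python
# Our own numerical sketch of the duo Bregman divergence, assuming the form
#   B_{F1,F2}(t : t') = F1(t) - F2(t') - (t - t') F2'(t'),
# which reproduces the duo half squared Euclidean distance of Figure 5 when
# F1(t) = (a/2) t^2 and F2(t) = (1/2) t^2. The majorization F1 >= F2 (a >= 1)
# guarantees non-negativity; a < 1 breaks it.
import numpy as np

def duo_bregman(F1, F2, dF2, t, tp):
    return F1(t) - F2(tp) - (t - tp) * dF2(tp)

rng = np.random.default_rng(0)
samples = rng.uniform(-3.0, 3.0, size=(1000, 2))

for a in (1.0, 2.0, 0.5):
    F1 = lambda t, a=a: 0.5 * a * t**2   # generator that majorizes F2 iff a >= 1
    F2 = lambda t: 0.5 * t**2
    dF2 = lambda t: t
    worst = min(duo_bregman(F1, F2, dF2, t, tp) for t, tp in samples)
    print(f"a = {a}: smallest sampled duo Bregman value = {worst:.4f}")
# a = 1 and a = 2 give non-negative values; a = 0.5 yields negative ones.
```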
Figures:
Graphical abstract
Figure 1: Visualizing the Fenchel–Young divergence.
Figure 2: Visualizing the sided and symmetrized Bregman divergences.
Figure 3: (a) Visual illustration of the Legendre–Fenchel transformation: F*(η) is measured as the vertical gap (left long black line with both arrows) between the origin and the hyperplane of "slope" η tangent at F(θ), evaluated at θ = 0. (b) The Legendre transforms F_1*(η) and F_2*(η) of two functions F_1(θ) and F_2(θ) such that F_1(θ) ≥ F_2(θ) reverse the dominance order: F_2*(η) ≥ F_1*(η).
Figure 4: The duo Bregman divergence induced by two strictly convex and differentiable functions F_1 and F_2 such that F_1(θ) ≥ F_2(θ). We check graphically that B_{F_1,F_2}(θ:θ′) ≥ B_{F_2}(θ:θ′) (vertical gaps).
Figure 5: The duo half squared Euclidean distance D_a²(θ:θ′) := (a/2)θ² + (1/2)θ′² − θθ′ is non-negative when a ≥ 1: (a) the half squared Euclidean distance (a = 1); (b) a = 2; (c) a = 1/2, showing that the divergence can then be negative since a < 1.
Figure 6: The Legendre transform reverses the dominance ordering: F_1(θ) = θ² ≥ F_2(θ) = θ⁴ ⇔ F_1*(η) ≤ F_2*(η) for θ ∈ [0, 1].
Figure 7: The duo Jensen divergence J_{F_1,F_2,α}(θ_1:θ_2) is greater than the Jensen divergence J_{F_1,α}(θ_1:θ_2) for F_2(θ) ≥ F_1(θ).
14 pages, 5219 KiB  
Article
Convex Optimization via Symmetrical Hölder Divergence for a WLAN Indoor Positioning System
by Osamah Abdullah
Entropy 2018, 20(9), 639; https://doi.org/10.3390/e20090639 - 25 Aug 2018
Cited by 7 | Viewed by 3258
Abstract
Modern indoor positioning services are important technologies that play vital roles in modern life, providing services such as recruiting emergency healthcare providers and supporting security applications. Several large companies, such as Microsoft, Apple, Nokia, and Google, have researched location-based services. Wireless indoor localization is key for pervasive computing applications and network optimization. Different approaches have been developed for this technique using WiFi signals. WiFi fingerprinting-based indoor localization has been widely used due to its simplicity, and algorithms that fingerprint WiFi signals at separate locations can achieve accuracy within a few meters. However, a major drawback of WiFi fingerprinting is the variance in received signal strength (RSS), which fluctuates over time and with the changing environment. As the signal changes, so does the fingerprint database, which can change the distribution of the RSS (a multimodal distribution). Thus, in this paper we propose using the symmetrized Hölder divergence, a statistical dissimilarity that encapsulates both the skew Bhattacharyya divergence and the Cauchy–Schwarz divergence and admits closed-form formulas for distributions belonging to the same exponential family, to measure the statistical dissimilarity between the multivariate RSS distributions. Because the Hölder divergence is asymmetric, we used both the left-sided and right-sided divergences so that the centroid can be symmetrized to obtain the minimizer of the proposed algorithm. The experimental results showed that the symmetrized Hölder divergence consistently outperformed the traditional k-nearest-neighbor and probabilistic neural network approaches. In addition, with the proposed algorithm, the positioning error was about 1 m inside buildings. Full article
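To make the fingerprint-matching pipeline concrete, here is a toy sketch that is not the paper's implementation: it uses the closed-form Cauchy–Schwarz divergence between multivariate Gaussian RSS models (the symmetric special case encapsulated by the Hölder divergence) to select the reference location whose fingerprint model is closest to the query model; the location names and RSS statistics are made up for illustration.

```python
# A toy sketch (not the paper's implementation): closed-form Cauchy-Schwarz
# divergence between multivariate Gaussian RSS models, used for
# nearest-fingerprint matching. Location names and RSS statistics are made up.
import numpy as np
from scipy.stats import multivariate_normal as mvn

def cauchy_schwarz_divergence(mu1, S1, mu2, S2):
    # D_CS(p,q) = -log int p q + (1/2) log int p^2 + (1/2) log int q^2, using
    # the Gaussian identities int p q dx = N(mu1 - mu2; 0, S1 + S2) and
    # int p^2 dx = N(0; 0, 2 S1).
    d = np.zeros_like(mu1)
    cross = mvn.pdf(mu1 - mu2, mean=d, cov=S1 + S2)
    self_p = mvn.pdf(d, mean=d, cov=2.0 * S1)
    self_q = mvn.pdf(d, mean=d, cov=2.0 * S2)
    return -np.log(cross) + 0.5 * np.log(self_p) + 0.5 * np.log(self_q)

fingerprints = {  # per-location Gaussian models of RSS (dBm) from two APs
    "room_A": (np.array([-45.0, -60.0]), np.array([[4.0, 1.0], [1.0, 3.0]])),
    "room_B": (np.array([-55.0, -52.0]), np.array([[5.0, 0.5], [0.5, 4.0]])),
    "room_C": (np.array([-70.0, -48.0]), np.array([[3.0, 0.0], [0.0, 6.0]])),
}
query_mu = np.array([-53.0, -51.0])
query_S = np.array([[4.5, 0.7], [0.7, 3.5]])
best = min(fingerprints, key=lambda k: cauchy_schwarz_divergence(
    query_mu, query_S, *fingerprints[k]))
print("estimated location:", best)  # room_B: its RSS model is closest to the query
```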
Figures:
Figure 1: Signal-to-noise ratio of the received signal strength indicator, varying over time.
Figure 2: The Bregman divergence represents the vertical distance between the potential function and the hyperplane tangent at q.
Figure 3: Interpreting the Jensen–Bregman divergence.
Figure 4: The Hölder divergence encompasses the skew Bhattacharyya divergence and the Cauchy–Schwarz divergence.
Figure 5: The offline and online stages of the WiFi fingerprinting-based localization architecture.
Figure 6: The layout used in the experimental work in the College of Engineering and Applied.
Figure 7: Implementation results for different numbers of clusters with respect to the average localization distance.
Figure 8: Implementation results for the average localization error under different AP selection schemes.
Figure 9: Experimental results: the cumulative distribution function (CDF) of the localization error when using 50 nearest neighbors.
6948 KiB  
Article
On Hölder Projective Divergences
by Frank Nielsen, Ke Sun and Stéphane Marchand-Maillet
Entropy 2017, 19(3), 122; https://doi.org/10.3390/e19030122 - 16 Mar 2017
Cited by 15 | Viewed by 6155
Abstract
We describe a framework to build distances by measuring the tightness of inequalities and introduce the notion of proper statistical divergences and improper pseudo-divergences. We then consider the Hölder ordinary and reverse inequalities and present two novel classes of Hölder divergences and pseudo-divergences that both encapsulate the special case of the Cauchy–Schwarz divergence. We report closed-form formulas for those statistical dissimilarities when considering distributions belonging to the same exponential family provided that the natural parameter space is a cone (e.g., multivariate Gaussians) or affine (e.g., categorical distributions). Those new classes of Hölder distances are invariant to rescaling and thus do not require distributions to be normalized. Finally, we show how to compute statistical Hölder centroids with respect to those divergences and carry out center-based clustering toy experiments on a set of Gaussian distributions which demonstrate empirically that symmetrized Hölder divergences outperform the symmetric Cauchy–Schwarz divergence. Full article
(This article belongs to the Special Issue Information Geometry II)
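As a hedged sketch of our reading of the bi-parametric Hölder divergence (written with the conjugate exponent β = α/(α−1), so that it reduces to the Cauchy–Schwarz divergence at α = γ = 2, consistent with the figure captions below), the snippet below evaluates it on finite positive vectors and checks the projective property claimed in the abstract: the divergence is invariant to rescaling, so the distributions need not be normalized.

```python
# A hedged sketch of our reading of the bi-parametric Hölder divergence for
# finite positive vectors, with beta the conjugate exponent of alpha:
#   D^H_{alpha,gamma}(p:q) = -log[ sum_i p_i^(gamma/alpha) q_i^(gamma/beta)
#       / ( (sum_i p_i^gamma)^(1/alpha) * (sum_i q_i^gamma)^(1/beta) ) ].
# Non-negative by Hölder's inequality; equals the Cauchy-Schwarz divergence
# at alpha = gamma = 2.
import numpy as np

def holder_divergence(p, q, alpha=2.0, gamma=2.0):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    beta = alpha / (alpha - 1.0)  # conjugate exponent: 1/alpha + 1/beta = 1
    num = np.sum(p**(gamma / alpha) * q**(gamma / beta))
    den = np.sum(p**gamma)**(1.0 / alpha) * np.sum(q**gamma)**(1.0 / beta)
    return -np.log(num / den)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(holder_divergence(p, q, alpha=4/3, gamma=2.0))  # > 0 for p not proportional to q
print(holder_divergence(p, p))                        # 0 when both arguments coincide
print(np.isclose(holder_divergence(3.7 * p, q),       # projective: rescaling p (or q)
                 holder_divergence(p, q)))            # leaves the divergence unchanged
```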
Figures:
Graphical abstract
Figure 1: The Hölder proper divergence (bi-parametric) and the Hölder improper pseudo-divergence (tri-parametric) encompass the Cauchy–Schwarz divergence and the skew Bhattacharyya divergence.
Figure 2: First row: the Hölder pseudo-divergence (HPD) D^H_{α,1,1}(p_r:p) for α ∈ {4/3, 2, 4}, together with the KL divergence and the reverse KL divergence. Remaining rows: the Hölder divergence (HD) D^H_{α,γ}(p_r:p) for α ∈ {4/3, 1.5, 2, 4, 10} (from top to bottom) and γ ∈ {0.5, 1, 2, 5, 10} (from left to right). The reference distribution p_r is marked "★"; the minimizer of D^H_{α,1,1}(p_r:p), when different from p_r, is marked "•". Notice that D^H_{2,2} = D^H_{2,1,1}. (a) Reference categorical distribution p_r = (1/3, 1/3, 1/3); (b) reference categorical distribution p_r = (1/2, 1/3, 1/6).
Figure 3: First row: D^H_{α,1,1}(p_r:p), where p_r is the standard Gaussian distribution and α ∈ {4/3, 2, 4}, compared to the KL divergence. Remaining rows: D^H_{α,γ}(p_r:p) for α ∈ {4/3, 1.5, 2, 4, 10} (from top to bottom) and γ ∈ {0.5, 1, 2, 5, 10} (from left to right). Notice that D^H_{2,2} = D^H_{2,1,1}. The coordinate system is formed by μ (mean) and σ (standard deviation).
Figure 4: Variational k-means clustering results on a toy dataset consisting of a set of 2D Gaussians organized into two or three clusters. The cluster centroids are represented by contour plots using the same density levels. (a) α = γ = 1.1 (Hölder clustering); (b) α = γ = 2 (Cauchy–Schwarz clustering); (c) α = γ = 1.1 (Hölder clustering); (d) α = γ = 2 (Cauchy–Schwarz clustering).