Weighted entropy: basic inequalities

This paper represents an extended version of an earlier note [10]. The concept of weighted entropy takes into account values of different outcomes, i.e., makes entropy context-dependent, through a weight function. We analyse analogs of the Fisher information inequality and the entropy power inequality for the weighted entropy and discuss connections with the weighted Lieb splitting inequality. The concepts of rates of the weighted entropy and information are also discussed.


Introduction
This paper represents an extended version of an earlier note [10]. We also follow earlier publications discussing related topics: [20,21,19,18]. The Shannon entropy (SE) of a probability distribution p or the Shannon differential entropy (SDE) of a probability density function (PDF) f is context-free, i.e., does not depend on the nature of outcomes x_i or x, but only upon the probabilities p(x_i) or values f(x). This gives the notion of entropy a great flexibility, which explains its successful applications. However, in many situations it seems insufficient, and the context-free property appears as a drawback. Viz., suppose you learn news about severe weather conditions in an area far away from your place. Such conditions usually do not happen; an event like this has a small probability p ≪ 1 and conveys a high information −log p. At the same time you hear that a tree near your parking lot in town has fallen and damaged a number of cars. The probability of this event is also low, so the amount of information is again high. However, the value of this information for you is higher than in the first event. Considerations of this character motivate a study of weighted information and entropy, making them context-dependent.
Definition 1.1. Let us define the weighted entropy (WE) as

h^w_ϕ(p) = −∑_i ϕ(x_i) p(x_i) log p(x_i).   (1.2)

Here a non-negative weight function (WF) x_i ↦ ϕ(x_i) is introduced, representing a value/utility of an outcome x_i. A similar approach can be used for the differential entropy of a probability density function (PDF) f. Define the weighted differential entropy (WDE) as

h^w_ϕ(f) = −∫ ϕ(x) f(x) log f(x) dx.   (1.3)

An initial example of a WF ϕ may be ϕ(x) = 1(x ∈ A), where A is a particular subset of outcomes (an event). A heuristic use of the WE with such a WF was demonstrated in [4,5]. Another example repeatedly used below is f(x) = f^No_C(x), a d-dimensional Gaussian PDF with mean 0 and covariance matrix C. Here

f^No_C(x) = (2π)^{−d/2} (det C)^{−1/2} exp(−x^T C^{−1} x / 2),  x ∈ R^d.

For ϕ(x) ≡ 1 we get the normal SDE h(f^No_C) = (1/2) log[(2πe)^d det C]. In this note we give a brief introduction into the concept of the weighted entropy. We do not always give proofs, referring the reader to the quoted original papers. Some basic properties of the WE and WDE have been presented in [20]; see also references therein to early works on the subject. Applications of the WE and WDE to the security quantification of information systems are discussed in [15]. Other domains range from the stock market to image processing; see, e.g., [6,9,12,14,23,26].
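As a quick illustration of Definition 1.1, the discrete WE and the Gaussian WDE of (1.2)-(1.3) can be computed numerically; a minimal sketch (the distribution, the event A and the integration grid are illustrative choices, not taken from the text):

```python
import numpy as np

# Discrete weighted entropy: h^w_phi(p) = -sum_i phi(x_i) p(x_i) log p(x_i).
def weighted_entropy(p, phi):
    p = np.asarray(p, dtype=float)
    phi = np.asarray(phi, dtype=float)
    mask = p > 0
    return -np.sum(phi[mask] * p[mask] * np.log(p[mask]))

# Example: phi = indicator of a distinguished event A (here A = {0, 1}),
# so only the "valuable" outcomes contribute to the entropy.
p = np.array([0.5, 0.25, 0.125, 0.125])
phi_A = np.array([1.0, 1.0, 0.0, 0.0])
h_w = weighted_entropy(p, phi_A)          # only outcomes in A contribute
h_free = weighted_entropy(p, np.ones(4))  # phi == 1: ordinary Shannon entropy

# Weighted differential entropy of a 1-d Gaussian with phi == 1 on a grid,
# checked against the closed form (1/2) log(2*pi*e*sigma^2).
sigma = 2.0
x = np.linspace(-30, 30, 300001)
dx = x[1] - x[0]
f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
h_num = -np.sum(f * np.log(f)) * dx
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
```

With a non-trivial indicator WF the entropy of the "irrelevant" outcomes is simply discarded, so h_w never exceeds the context-free entropy h_free.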
Throughout this note we assume that the series and integrals in (1.2)-(1.3) and the subsequent equations converge absolutely, without stressing this every time. To unify the presentation, we will often use integrals ∫_X dµ relative to a reference σ-finite measure µ on a Polish space X with a Borel σ-algebra X. In this regard, the acronym PM/DF (probability mass/density function) will be employed. The usual measurability assumptions will also be in place for the rest of the presentation. We also assume that the WF ϕ > 0 on an open set in X.
In some parts of the presentation, the sums and integrals comprising a PM/DF will be written as expectations: this will make it easier to explain/use the assumptions and properties involved. Viz., Eqns (1.2)-(1.3) can be given as

h^w_ϕ(p) = −E[ϕ(X) log p(X)],  h^w_ϕ(f) = −E[ϕ(X) log f(X)].   (1.4)

The weighted Gibbs inequality
Given two non-negative functions f, g (typically, PM/DFs), define the weighted Kullback-Leibler divergence (or the relative WE, briefly RWE) as

D^w_ϕ(f‖g) = ∫_X ϕ(x) f(x) log [f(x)/g(x)] dµ(x).

The weighted Gibbs inequality from [20] states:

D^w_ϕ(f‖g) ≥ ∫_X ϕ(x) [f(x) − g(x)] dµ(x).

For an exponential family in the canonical form with sufficient statistics T(x), the RWE admits a closed-form expression involving ∇, the gradient w.r.t. the parameter vector θ; see [20].

Concavity/convexity of the weighted entropy

Theorems 2.1 and 2.2 from [20] offer the following assertion: the RWE is jointly convex: given two pairs of PDFs (f_1, f_2) and (g_1, g_2), and λ_1 + λ_2 = 1 with λ_{1,2} ≥ 0,

D^w_ϕ(λ_1 f_1 + λ_2 f_2 ‖ λ_1 g_1 + λ_2 g_2) ≤ λ_1 D^w_ϕ(f_1‖g_1) + λ_2 D^w_ϕ(f_2‖g_2).
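The weighted Gibbs inequality can be probed numerically. The sketch below assumes the discrete form D^w_ϕ(f‖g) ≥ ∑_x ϕ(x)(f(x) − g(x)), which follows pointwise from log t ≤ t − 1; it also checks joint convexity of the RWE. The test distributions and weights are random illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_kl(f, g, phi):
    """Relative weighted entropy D^w_phi(f||g) = sum_x phi(x) f(x) log(f(x)/g(x))."""
    return np.sum(phi * f * np.log(f / g))

def smooth_dist(k=5):
    # strictly positive random distribution, to keep all logs finite
    d = rng.dirichlet(np.ones(k)) + 1e-9
    return d / d.sum()

# Weighted Gibbs bound: D^w_phi(f||g) >= sum_x phi(x) (f(x) - g(x)),
# since f log(f/g) >= f - g pointwise and phi >= 0.
for _ in range(1000):
    f, g = smooth_dist(), smooth_dist()
    phi = rng.uniform(0.0, 3.0, size=5)
    assert weighted_kl(f, g, phi) >= np.sum(phi * (f - g)) - 1e-12

# Joint convexity of the RWE (the integrand (f, g) -> phi * f log(f/g)
# is jointly convex pointwise for fixed phi >= 0).
phi = rng.uniform(0.0, 3.0, size=5)
f1, f2, g1, g2 = smooth_dist(), smooth_dist(), smooth_dist(), smooth_dist()
lam = 0.4
lhs = weighted_kl(lam * f1 + (1 - lam) * f2, lam * g1 + (1 - lam) * g2, phi)
rhs = lam * weighted_kl(f1, g1, phi) + (1 - lam) * weighted_kl(f2, g2, phi)
assert lhs <= rhs + 1e-12
```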

Weighted Ky-Fan and Hadamard inequalities
The map C ↦ δ(C) := log det(C) gives a concave function of a (strictly) positive-definite d × d matrix C:

log det(λ_1 C_1 + λ_2 C_2) ≥ λ_1 log det C_1 + λ_2 log det C_2,   (4.1)

where λ_1 + λ_2 = 1 and λ_{1,2} ≥ 0. This is the well-known Ky-Fan inequality. In terms of differential entropies it is equivalent to the bound

h(f^No_{λ_1 C_1 + λ_2 C_2}) ≥ λ_1 h(f^No_{C_1}) + λ_2 h(f^No_{C_2}),

and is closely related to a maximising property of the Gaussian differential entropy h(f^No_C). Theorem 4.1 below presents one of the new bounds of Ky-Fan type, in its most explicit form, for the WF ϕ(x) = exp(x^T t), t ∈ R^d. Cf. Theorem 3.5 from [20]. In this case an identity for h^w_ϕ(f^No_C) holds in which functions F^(1) and F^(2) incorporate the parameters C_i and λ_i. For t = 0 we obtain ϕ ≡ 1, and the resulting bound (4.4) coincides with (4.1). Theorem 4.1 is related to the maximisation property of the weighted Gaussian entropy, which takes the form of Theorem 4.2. Cf. Example 3.2 in [20].
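The classical Ky-Fan inequality (4.1), and its equivalent entropy form, are easy to verify numerically for random positive-definite matrices; a small sketch (the dimension, mixing weight and matrix sampler are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def logdet(C):
    # log det of a symmetric positive-definite matrix via Cholesky
    return 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(C))))

def random_spd(d):
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)  # well-conditioned SPD matrix

# Ky-Fan inequality (4.1): C -> log det C is concave on SPD matrices.
d = 4
C1, C2 = random_spd(d), random_spd(d)
lam = 0.3
lhs = logdet(lam * C1 + (1 - lam) * C2)
rhs = lam * logdet(C1) + (1 - lam) * logdet(C2)
assert lhs >= rhs - 1e-10

# Equivalent entropy form: h(f^No_C) = 0.5*log((2*pi*e)^d det C) inherits concavity.
h = lambda C: 0.5 * (d * np.log(2 * np.pi * np.e) + logdet(C))
assert h(lam * C1 + (1 - lam) * C2) >= lam * h(C1) + (1 - lam) * h(C2) - 1e-10
```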
Cf. (1.4). The required assumptions are listed in [20,22]. Here we will focus on a weighted version of the Hadamard inequality, asserting that for a positive-definite d × d matrix C,

det C ≤ ∏_{j=1}^d C_jj;

cf. [20], Theorem 3.7. Let f^No_{C_jj} stand for the Gaussian PDF on R with the zero mean and the variance C_jj.

Theorem 5.1 (Connection between WFIM and weighted KL-divergence measures).
For smooth families {f_θ, θ ∈ Θ ⊆ R^1} and a given WF ϕ, we get (5.2), where D_θ stands for ∂/∂θ. Proof. By virtue of a Taylor expansion of log f_{θ_2} around θ_1, we obtain

log f_{θ_2}(x) = log f_{θ_1}(x) + (θ_2 − θ_1) D_θ log f_θ(x)|_{θ=θ_1} + ((θ_2 − θ_1)^2/2) D^2_θ log f_θ(x)|_{θ=θ_1} + R(x),   (5.3)

where R(x) denotes the remainder term, which has a hidden dependence on x. Multiply both sides of (5.3) by ϕ and take expectations, assuming that we can interchange differentiation and expectation appropriately. The expectations of the first- and second-order terms are then expressed through the WFIM. Therefore the claimed result, i.e., (5.2), is achieved.
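In the classical case ϕ ≡ 1 this Taylor-expansion argument reduces to the familiar fact that the KL divergence between nearby parameters is governed by the Fisher information. A numerical sanity check for the Gaussian location family f_θ = N(θ, 1), where J(θ) = 1 (the family and integration grid are our illustrative choices):

```python
import numpy as np

# For the location family f_theta = N(theta, 1) the Fisher information is
# J(theta) = 1, and D(f_theta1 || f_theta2) = (theta2 - theta1)^2 / 2 exactly,
# matching the second-order Taylor expansion of log f_theta2 around theta1.
def kl_gauss_location(theta1, theta2):
    grid = np.linspace(-20, 20, 200001)
    dx = grid[1] - grid[0]
    f1 = np.exp(-(grid - theta1) ** 2 / 2) / np.sqrt(2 * np.pi)
    f2 = np.exp(-(grid - theta2) ** 2 / 2) / np.sqrt(2 * np.pi)
    return np.sum(f1 * np.log(f1 / f2)) * dx

eps = 0.05
kl = kl_gauss_location(0.0, eps)
fisher_quadratic = eps**2 / 2  # (1/2) * J(0) * eps^2 with J(0) = 1
```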

Weighted entropy power inequality
Let X_1, X_2 be independent RVs with PDFs f_1, f_2 and X = X_1 + X_2. The famous Shannon entropy power inequality (EPI) states that

exp[2h(X_1 + X_2)/d] ≥ exp[2h(X_1)/d] + exp[2h(X_2)/d];

see, e.g., [1,7]. The EPI is widely used in signal processing: e.g., consider a RV Y_n = ∑_i a_i X_{n−i}, where a_i ∈ R^1 and {X_i} are IID RVs. Then the EPI yields a lower bound on the entropy of Y_n, with equality if and only if either X is Gaussian or Y_n = X_{n−k} for some k, that is, the filtering operation is a pure delay. Clearly, a possible extension of the EPI gives more flexibility in signal processing. We are interested in the weighted entropy power inequality (WEPI).
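The EPI itself is easy to check numerically in d = 1 via grid convolution; a sketch with X_1 Gaussian and X_2 uniform (both choices illustrative):

```python
import numpy as np

# Numerical check of the Shannon EPI, exp(2h(X1+X2)) >= exp(2h(X1)) + exp(2h(X2))
# in d = 1, for X1 ~ N(0,1) and X2 ~ Uniform(-1,1), via grid convolution.
dx = 0.005
x = np.arange(-15, 15, dx)
f1 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # N(0,1)
f2 = np.where(np.abs(x) <= 1, 0.5, 0.0)       # Uniform(-1,1)
f1 /= f1.sum() * dx                           # renormalize on the grid
f2 /= f2.sum() * dx

def diff_entropy(f):
    mask = f > 0
    return -np.sum(f[mask] * np.log(f[mask])) * dx

f_sum = np.convolve(f1, f2, mode="same") * dx  # density of X1 + X2
h1, h2, h_sum = diff_entropy(f1), diff_entropy(f2), diff_entropy(f_sum)

ep = lambda h: np.exp(2 * h)  # entropy power up to the constant factor 2*pi*e
assert ep(h_sum) >= ep(h1) + ep(h2)
```

The inequality is strict here since X_2 is non-Gaussian; equality would require both summands to be Gaussian with proportional covariances.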

Proof. Note that
Using (6.8), we obtain the inequality (6.12). Furthermore, recalling the definition of κ in (6.5) and by virtue of assumption (6.7), the result follows directly.

Similar assertions can be established for other examples of PDFs f_1(x) and f_2(x), e.g., uniform, exponential, Gamma, Cauchy, etc.

A weighted Fisher information inequality
Let Z = (X, Y) be a pair of independent RVs X and Y ∈ R^d, with sample values z = (x, y) ∈ R^d × R^d and marginal PDFs f_1(x, θ), f_2(y, θ), respectively. Let f_{Z|X+Y}(x, y|u) stand for the conditional PDF of Z given X + Y = u; accordingly, we employ the corresponding reduced WFs. Next, let us introduce the matrices M_ϕ and G_ϕ. Note that for ϕ ≡ 1 we have M_ϕ = G_ϕ = 0, and the classical Fisher information inequality emerges (cf. [27]). Finally, we define the weighted Fisher information matrix (WFIM). Theorem 8.1 (A weighted Fisher information inequality (WFII)). Let X and Y be independent RVs, with suitable regularity assumptions on f_1 and f_2. Proof. We use the same methodology as in Theorem 1 from [27]. Recalling Corollary 4.8 (iii) in [20], substitute P := [1, 1].
Next, we need the following well-known expression for the inverse of a block matrix. By using the Schwarz inequality, we derive (8.9) (with the case of equality characterized in terms of f_1). Thus, owing to (8.7), particularly for P = [1, 1], we can write the resulting estimate; substituting (8.9) into it and simplifying (8.11), one obtains the desired inequality. By using Corollary 3.4 (iii) from [20], we obtain the property claimed in (8.5). This concludes the proof.
The WFIM of the RV Z can be written in a block-matrix form.

The weighted entropy power is a concave function

In the literature, several elegant proofs, employing the Fisher information inequality or basic properties of mutual information, have been proposed in order to prove that the entropy power (EP) is a concave function of γ [2,25]. We are interested in the weighted entropy power (WEP) defined as follows.
Compute the second derivative of the WEP with respect to γ; in view of (9.2), the concavity of the WEP is equivalent to the inequality (9.4). In the spirit of the WEP, we shall present a new proof of concavity of the EP. Regarding this, let us apply the WFII (8.5) to ϕ ≡ 1. Then a straightforward computation gives (d/dγ)[d / tr J(Z)] ≥ 1. Let X ∼ f_X be a RV in R^d, with a PDF f_X ∈ C^2. For a standard Gaussian RV N ∼ N(0, I_d) independent of X, and given γ > 0, define the RV Z = X + √γ N with PDF f_Z. Let V_r be the d-sphere of radius r centered at the origin, with surface denoted by S_r. Assume that for a given WF ϕ and all γ ∈ (0, 1) the relations (9.6)-(9.7), involving limits of surface integrals over S_r as r → ∞, are fulfilled. Then the identity (9.8) holds. If we assume that ϕ ≡ 1, then the equality (9.8) directly implies (9.4). Hence, the standard entropy power is a concave function of γ.
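For Gaussian X the (unweighted) entropy power of Z = X + √γ N is exactly linear in γ, the boundary case of the concavity just discussed; a closed-form check (the dimension and variance are illustrative choices):

```python
import numpy as np

# For Gaussian X ~ N(0, sigma^2 I_d) and Z = X + sqrt(gamma) N, the entropy power
# N(Z) = exp(2 h(Z)/d) / (2*pi*e) equals sigma^2 + gamma: linear, hence concave,
# in gamma. This is the boundary case of the concavity theorem for the EP.
d, sigma2 = 3, 2.0

def entropy_power(gamma):
    h = 0.5 * d * np.log(2 * np.pi * np.e * (sigma2 + gamma))  # h(Z), Z Gaussian
    return np.exp(2 * h / d) / (2 * np.pi * np.e)

n0, n1, n2 = entropy_power(0.5), entropy_power(1.0), entropy_power(1.5)
second_diff = n0 - 2 * n1 + n2  # vanishes for the Gaussian (linear EP)
```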
Next, we establish the concavity of the WEP when the WF is close to a constant. Proof. It is sufficient to check the inequality (9.4). By a straightforward calculation, the formulas above imply that the boundary term, treated by means of the Stokes formula, can be bounded by δ. Finally, |R(γ)| ≤ δ in view of (9.7), which leads to the claimed result.

Rates of weighted entropy and information
This section follows [18]. The concept of a rate of the WE or WDE emerges when we work with outcomes in the context of a discrete-time random process (RP). Here the WF ϕ_n is made dependent on n; two immediate cases are (a) ϕ_n(x_0^{n−1}) = ∑_{j=0}^{n−1} ψ(x_j) and (b) ϕ_n(x_0^{n−1}) = ∏_{j=0}^{n−1} ψ(x_j) (an additive and a multiplicative WF, respectively). Next, X_0^{n−1} = (X_0, . . . , X_{n−1}) is a random string generated by an RP. For simplicity, let us focus on RPs taking values in a finite set X. The symbol P stands for the probability measure of X, and E denotes the expectation under P. For an RP with IID values, the joint probability of a sample x_0^{n−1} is p_n(x_0^{n−1}) = ∏_{j=0}^{n−1} p(x_j), p(x) being the probability of an individual outcome x ∈ X. In the case of a Markov chain, p_n(x_0^{n−1}) = λ(x_0) ∏_{j=1}^{n−1} p(x_{j−1}, x_j). Here λ(x) gives an initial distribution and p(x, y) is the transition probability on X; to reflect this fact, we will sometimes use the notation h^w_{ϕ_n}(p_n, λ). The quantity of interest is the rate at which h^w_{ϕ_n} grows with n. In the IID case, the WI and WE admit the following representations. Define S(p) = −E[log p(X)] and H^w_ψ(p) = −E[ψ(X) log p(X)] to be the SE and the WE of the one-digit distribution (the capital letter is used to make it distinct from h^w_{ϕ_n}, the multi-time WE).
(A) For an additive WF:

h^w_{ϕ_n}(p_n) = n(n − 1) S(p) E ψ(X) + n H^w_ψ(p) =: n(n − 1) A_0 + n A_1.   (10.3)

(B) For a multiplicative WF, an analogous closed-form representation holds, identifying constants B_0 and B_1. The values A_0, B_0 and their analogs in a general situation are referred to as primary rates, and A_1, B_1 as secondary rates.
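Representation (10.3) can be verified by brute force in a small IID model. The closed form used below for the multiplicative WF, h^w = n (Eψ)^{n−1} H^w_ψ(p), is our own derivation along the same lines as (10.3); the distribution and ψ are illustrative:

```python
import numpy as np
from itertools import product

# Brute-force check of the IID representations of the weighted entropy:
#   additive WF:       h^w(p_n) = n(n-1) S(p) E[psi] + n H^w_psi(p)   (10.3)
#   multiplicative WF: h^w(p_n) = n (E[psi])^(n-1) H^w_psi(p)
# (the multiplicative closed form is derived here, not quoted from the text).
p = np.array([0.5, 0.3, 0.2])
psi = np.array([2.0, 0.5, 1.0])
n = 4

S = -np.sum(p * np.log(p))          # one-digit Shannon entropy S(p)
H_w = -np.sum(psi * p * np.log(p))  # one-digit weighted entropy H^w_psi(p)
E_psi = np.sum(psi * p)

h_add = h_mult = 0.0
for seq in product(range(len(p)), repeat=n):  # exact sum over all strings
    idx = list(seq)
    pn = np.prod(p[idx])
    log_pn = np.log(pn)
    h_add += -np.sum(psi[idx]) * pn * log_pn
    h_mult += -np.prod(psi[idx]) * pn * log_pn

assert np.isclose(h_add, n * (n - 1) * S * E_psi + n * H_w)
assert np.isclose(h_mult, n * E_psi ** (n - 1) * H_w)
```

The double-counting behind (10.3) is visible in the check: the n(n − 1) cross terms with j ≠ k each contribute S(p) E ψ(X), while the n diagonal terms give H^w_ψ(p).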

10.A WI and WE rates for asymptotically additive WFs
Here we will deal with a stationary RP X = (X_j, j ∈ Z) and use the above notation p_n(x_0^{n−1}) = P(X_0^{n−1} = x_0^{n−1}) for the joint probability. We will refer to the limit present in the Shannon-McMillan-Breiman (SMB) theorem (see, e.g., [1,7]), which holds for an ergodic RP:

lim_{n→∞} −(1/n) log p_n(X_0^{n−1}) = −E log P(X_0 | X_{−∞}^{−1}), P-a.s.

Here P(y|x_{−∞}^{−1}) is the conditional PM/DF for X_0 = y given x_{−∞}^{−1}, an infinite past realization of X. An assumption upon the WFs ϕ_n, called asymptotic additivity (AA), is that

lim_{n→∞} (1/n) ϕ_n(X_0^{n−1}) = α, P-a.s. and/or in L^2(P).

Example 10.2. Clearly, the condition of stationarity cannot be dropped. Indeed, let ϕ_n(x_0^{n−1}) = αn be an additive WF and X a (non-stationary) Gaussian process with covariances C = {C_ij, i, j ∈ Z_+^1}. Let f_n be the n-dimensional PDF of the vector (X_1, . . . , X_n). Suppose that the eigenvalues λ_1 ≤ · · · ≤ λ_j ≤ · · · ≤ λ_n of C_n have the order λ_j ≈ cj. Then by Stirling's formula the second term in (10.10) dominates, and the correct normalization of h^w_{ϕ_n}(f_n) is (n² log n)^{−1} instead of n^{−2} as n → ∞. Theorem 10.1 can be considered as an analog of the SMB theorem for the primary WE rate in the case of an AA WF. A specification of the secondary rate A_1 is given in Theorem 10.3 for an additive WF. The WE rates for multiplicative WFs are studied in Theorem 10.4 for the case where X is a stationary ergodic Markov chain on X.
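For an additive WF the AA property is just the ergodic theorem: (1/n) ϕ_n(X_0^{n−1}) → α = ∑_x π(x)ψ(x) for a stationary ergodic Markov chain with invariant distribution π. A simulation sketch (the chain, ψ and the seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)

# For an additive WF phi_n(x) = sum_j psi(x_j) on a stationary ergodic Markov
# chain, the ergodic theorem gives (1/n) phi_n(X) -> alpha = sum_x pi(x) psi(x).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])  # transition probabilities p(x, y)
psi = np.array([1.0, 3.0])

# Stationary distribution pi: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()
alpha = np.dot(pi, psi)

# Simulate the chain started from pi and average psi along the path.
n = 50_000
x = rng.choice(2, p=pi)
total = 0.0
for _ in range(n):
    total += psi[x]
    x = rng.choice(2, p=P[x])
empirical = total / n
```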

10.B WI and WE rates for asymptotically multiplicative WFs
The WI rate is given in Theorem 10.3. Here we use the condition of asymptotic multiplicativity. In this case

B_0 = log µ,   (10.14)

where µ > 0 is the Perron-Frobenius eigenvalue of the matrix M = (ψ(x)p(x, y)), coinciding with the norm of M.
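The role of the Perron-Frobenius eigenvalue can be illustrated numerically: the exact expectations E[∏_{j<n} ψ(X_j)] admit a matrix-power formula whose one-step log-growth approaches log µ, i.e., the primary rate B_0 of (10.14). A sketch (the chain, ψ and λ are illustrative choices):

```python
import numpy as np

# The growth of E[prod_{j=0}^{n-1} psi(X_j)] along a Markov chain is governed by
# the Perron-Frobenius eigenvalue mu of M = (psi(x) p(x, y)), giving B_0 = log mu.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])  # transition probabilities p(x, y)
psi = np.array([1.0, 3.0])  # one-digit WF values psi(x)
lam = np.array([0.8, 0.2])  # initial distribution lambda(x)

M = np.diag(psi) @ P                         # M[x, y] = psi(x) p(x, y)
mu = np.max(np.real(np.linalg.eigvals(M)))   # Perron-Frobenius eigenvalue

def E_prod(n):
    # Exact E[prod_{j=0}^{n-1} psi(X_j)] = lambda^T M^{n-1} psi, obtained by
    # expanding the path probabilities of the chain.
    return lam @ np.linalg.matrix_power(M, n - 1) @ psi

rate = np.log(E_prod(41) / E_prod(40))  # one-step log-growth, close to log mu
B0 = np.log(mu)
```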
The secondary rate B_1 in this case is identified through the invariant probabilities π(x) of the Markov chain and the Perron-Frobenius eigenvectors of the matrices M and M^T.

The conditions of Theorem 10.5 may be checked under some restrictions on the WF ψ; see [18].