Convexity and robustness of the R\'{e}nyi entropy

We study convexity properties of the Rényi entropy as a function of $\alpha>0$ on finite alphabets. We also describe the robustness of the Rényi entropy on finite alphabets, and it turns out that the rate of the respective convergence depends on the initial distribution. We establish convergence of the disturbed entropy when the initial distribution is uniform but the number of events increases to $\infty$, and prove that the limit of the Rényi entropy of the binomial distribution equals the Rényi entropy of the Poisson distribution.

In the present paper we investigate some properties of the Rényi entropy, which was proposed by Rényi in [1]. For a probability distribution $p=(p_k,\,1\le k\le N)$ and $\alpha>0$, $\alpha\neq1$, it is defined as $H_\alpha(p)=\frac{1}{1-\alpha}\log\sum_{k=1}^{N}p_k^{\alpha}$, and its limit value as $\alpha\to1$ is the Shannon entropy $H(p)=-\sum_{k=1}^{N}p_k\log p_k$.
Due to this continuity, it is possible to put $H_1(p)=H(p)$. We consider the Rényi entropy as a functional of various parameters. The first approach is to fix the distribution and consider $H_\alpha(p)$ as a function of $\alpha>0$. Some of the properties of $H_\alpha(p)$ as a function of $\alpha>0$ are well known. In particular, it is known that $H_\alpha(p)$ is continuous and non-increasing in $\alpha\in(0,\infty)$, $\lim_{\alpha\to0+}H_\alpha(p)=\log m$, where $m$ is the number of non-zero probabilities, and $\lim_{\alpha\to+\infty}H_\alpha(p)=-\log\max_k p_k$. However, for the reader's convenience, we provide short proofs of these and some other simple statements in the Appendix. One can see that these properties of the entropy itself and of its first derivative are common for all finite distributions. Also, it is known that the Rényi entropy is Schur-concave as a function of the distribution vector, that is, $H_\alpha(p)\le H_\alpha(q)$ whenever $p$ majorizes $q$. Some additional results, such as lower bounds on the difference in Rényi entropy for distributions defined on countable alphabets, can be found in [2]. Those results usually use the Rényi divergence of order $\alpha$ of a distribution $P$ from a distribution $Q$, which is closely related to the Kullback-Leibler divergence. Some of the most important properties of the Rényi divergence were reviewed and extended in [3]. Rényi divergences for the most commonly used univariate continuous distributions can be found in [4]. The Rényi entropy and divergence are widely used in majorization theory [5,6], statistics [7,8], information theory [2,3,9] and many other fields. Bounds on the Rényi entropy of discrete log-concave distributions in terms of the variance were obtained in [10]. Other operational definitions of the Rényi entropy, which are used in practice, are given in [11].
However, in the present paper we restrict ourselves to the standard Rényi entropy and go a step beyond the standard properties: namely, we investigate convexity of the Rényi entropy with the help of the second derivative. It turns out that from this point of view the situation is much more interesting and uncertain than the behavior of the first derivative, and it crucially depends on the distribution. One might say that all the standard guesses are wrong. Of course, the second derivative is continuous (this simply means that it is continuous at 1, because at all other points the continuity is obvious), but then the surprises begin. If the second derivative starts with a positive value at zero, it can either remain positive or have inflection points, depending on the distribution. If it starts with a negative value, the first inflection point can occur both before 1 and after 1, depending on the distribution as well (the point 1 is a crucial point for the entropy, so we compare the location of the inflection points with it). The value of the second derivative at zero is bounded from below but unbounded from above. A certain superposition of the entropy is convex, and this fact simultaneously explains why other similar properties depend on the distribution. Due to the over-complexity of some expressions, which defied analytical treatment, we provide several illustrations obtained by numerical methods. We also investigate robustness of the Rényi entropy w.r.t. the distribution, and it turns out that the rate of the respective convergence depends on the initial distribution, too.
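These limit properties are easy to check numerically. The following minimal sketch (in Python; the distribution is an arbitrary example chosen for illustration, not one taken from the paper) evaluates $H_\alpha(p)$ for several values of $\alpha$ and compares the extreme values with $\log m$ and $-\log\max_k p_k$:

import math

def renyi_entropy(p, alpha):
    # Renyi entropy of order alpha (natural logarithm); alpha = 1 is treated as the Shannon limit
    p = [x for x in p if x > 0]
    if abs(alpha - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p)
    return math.log(sum(x ** alpha for x in p)) / (1.0 - alpha)

p = [0.5, 0.25, 0.125, 0.125]                      # arbitrary example distribution
for a in [0.01, 0.5, 1.0, 2.0, 10.0, 100.0]:
    print(f"alpha = {a:7.2f}   H_alpha = {renyi_entropy(p, a):.6f}")   # non-increasing in alpha
print("log m           =", math.log(len(p)))       # limit as alpha -> 0+
print("-log max_k p_k  =", -math.log(max(p)))      # limit as alpha -> +infinity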
Further, we establish convergence of the disturbed entropy when the initial distribution is uniform but the number of events increases to $\infty$, and prove that the limit of the Rényi entropy of the binomial distribution equals the Rényi entropy of the Poisson distribution. It was previously proved in [12] that the Shannon entropy of the binomial distribution increases to the entropy of the Poisson distribution. Our proof of this particular fact is simpler because it uses only Lebesgue's dominated convergence theorem. The paper is organized as follows. Section 2 is devoted to the convexity properties of the Rényi entropy, Section 3 describes robustness of the Rényi entropy, and Section 4 contains some auxiliary results.

Convexity of the Rényi entropy
To start, we consider the general properties of the 2nd derivative of the Rényi entropy.

The form and the continuity of the 2nd derivative
Write $H_\alpha(p)=\frac{\log f(\alpha)}{1-\alpha}$, where $f(\alpha)=\sum_{k=1}^{N}p_k^{\alpha}$. Obviously, the function $f\in C^{\infty}(\mathbb{R}_+)$, and its derivatives equal $f^{(n)}(\alpha)=\sum_{k=1}^{N}p_k^{\alpha}(\log p_k)^n$, $n\ge1$. Then the 2nd derivative of the Rényi entropy equals
$$H''_\alpha(p)=\frac{1}{1-\alpha}\left(\frac{f''(\alpha)}{f(\alpha)}-\Big(\frac{f'(\alpha)}{f(\alpha)}\Big)^2\right)+\frac{2}{(1-\alpha)^2}\,\frac{f'(\alpha)}{f(\alpha)}+\frac{2}{(1-\alpha)^3}\,\log f(\alpha),\qquad \alpha\neq1.\eqno(2.2)$$
In particular, if we consider the random variable $\xi$ taking values $\log p_k$ with probability $p_k$, then $f'(1)=\mathrm{E}\,\xi$ and $f''(1)=\mathrm{E}\,\xi^2$, while the sign of $H''_1(p)$ is not clear (as we can see below, it can be both $+$ and $-$).
(b) The 2nd derivative $H''_\alpha(p)$ also admits a representation involving an intermediate point $0<\theta<\alpha$.
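The continuity of the second derivative at the point $\alpha=1$ can be illustrated numerically. A minimal sketch (it evaluates the expression for $H''_\alpha(p)$ through $f(\alpha)=\sum_kp_k^\alpha$ written above on both sides of $1$; the test distribution is an arbitrary example):

import math

def renyi_second_derivative(p, alpha):
    # second derivative of alpha -> H_alpha(p) for alpha != 1, obtained by differentiating
    # H_alpha = log f(alpha) / (1 - alpha) twice, where f(alpha) = sum_k p_k^alpha
    S = sum(x ** alpha for x in p)
    S1 = sum(x ** alpha * math.log(x) for x in p)          # f'(alpha)
    S2 = sum(x ** alpha * math.log(x) ** 2 for x in p)     # f''(alpha)
    L, L1, L2 = math.log(S), S1 / S, S2 / S - (S1 / S) ** 2
    u = 1.0 - alpha
    return L2 / u + 2.0 * L1 / u ** 2 + 2.0 * L / u ** 3

p = [0.6, 0.3, 0.1]                                        # arbitrary example distribution
for d in [0.1, 0.01, 0.001]:
    left = renyi_second_derivative(p, 1.0 - d)
    right = renyi_second_derivative(p, 1.0 + d)
    print(f"alpha = 1 -/+ {d:g}:   {left:.6f}   {right:.6f}")
# both columns approach one common value, illustrating continuity of the 2nd derivative at alpha = 1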

Behavior of the 2nd derivative at the origin
Let us consider the starting point of the 2nd derivative, i.e., the behavior of $H''_\alpha(p)$ at zero as a function of the distribution vector $p$. Analyzing (2.2), we see that $H''_\alpha(p)$, as a function of $\alpha$, is continuous at $0$, so the value $H''_0(p)$ is well defined. Now we are interested in the sign of $H''_0(p)$. Giving an example of a distribution for which $H''_0(p)>0$ is very simple; one such example is given in Figure 1. Negative values of $H''_0(p)$ are possible as well; however, at this point we prefer to start with a more general result.
Lemma 2.2. If a probability vector $p$ is a point of local extremum of $H''_0(p)$, then either $p=p(\mathrm{uniform})=\left(\frac1N,\ldots,\frac1N\right)$ or $p$ contains exactly two different probabilities.
Proof. Let us formulate the necessary conditions for $H''_0(p)$ to have a local extremum at some point. Taking into account the constraint $\sum_{k=1}^{N}p_k=1$, we create the Lagrangian function $L(p,\lambda_0,\lambda)=\lambda_0H''_0(p)+\lambda\left(\sum_{k=1}^{N}p_k-1\right)$. If some $p$ is an extreme point, then there exist $\lambda_0$ and $\lambda$ such that $\lambda_0^2+\lambda^2\neq0$ and $\frac{\partial L}{\partial p_i}(p)=0$ for all $1\le i\le N$. If $\lambda_0=0$ then $\lambda=0$; however, $\lambda_0^2+\lambda^2\neq0$, therefore we can put $\lambda_0=1$. Then, taking the sum of these equalities, we get that $\lambda=-2$, whence every $p_i$ satisfies one and the same one-variable equation. Thus, if a vector of probabilities is a vector of local extremum of $H''_0(p)$, then it contains no more than two different probabilities. Obviously, it can be the uniform vector.
Note that $H''_0(p(\mathrm{uniform}))=0$. Therefore, in order to find a distribution for which $H''_0(p)<0$, let us consider distribution vectors that contain only two different probabilities $p_0,q_0$.
Lemma 2.4. Let the distribution vector $p$ consist of $k$ coordinates equal to $p_0$ and $N-k$ coordinates equal to $q_0$ such that
$$kp_0+(N-k)q_0=1,\qquad p_0-q_0=\frac1N\left(\log p_0-\log q_0\right),\qquad p_0\neq q_0,\eqno(2.5)$$
where $N,k\in\mathbb{N}$, $N>k$ and $p_0,q_0\in(0,1)$. Then $H''_0(p)<0$.
Proof. First, we will show that $H''_0(p)$ is non-positive. For that, we rewrite $H''_0(p)$ in terms of $p_0$ and $q_0$ and observe that it is non-positive, with equality possible only if $N^2p_0q_0=1$. We want to show that under conditions (2.5) $N^2p_0q_0$ cannot be equal to $1$. Suppose that $N^2p_0q_0=1$. Then it follows from (2.5) that $p_0$ and $q_0$ satisfy two polynomial equations with rational coefficients, and hence $p_0$ and $q_0$ are algebraic numbers. Thus, their difference $p_0-q_0$ is also algebraic. On the other hand, by the Lindemann-Weierstrass theorem $\frac1N(\log p_0-\log q_0)$ is a transcendental number, which contradicts (2.5). So $N^2p_0q_0\neq1$ and $H''_0(p)<0$.
Theorem 2.5. For any $n>2$ there exist $N\ge n$ and a probability vector $p$ of length $N$ satisfying (2.5); in particular, $H''_0(p)<0$.
Proof. Consider a distribution vector $p$ that satisfies conditions (2.5). From Lemma 2.4 we know that $H''_0(p)<0$. Now we want to show that there exist arbitrarily large $N\in\mathbb{N}$ and a distribution vector $p$ of length $N$ that satisfy those conditions. For that, we denote $x=Np_0$, $y=Nq_0$ and $r=\frac kN$. Then $0<x<1<y$ (without loss of generality), $r<1$, and $x-y=\log x-\log y$. The function $x-\log x$ is decreasing on $(0,1)$, increasing on $(1,+\infty)$ and equal to $1$ at the point $1$.
Let $y=y(x)$ be the implicit function defined by $x-y=\log x-\log y$. In this way we get a one-to-one correspondence between $x\in(0,1)$ and $y\in(1,+\infty)$. We also have the function $r(x)=\frac{y(x)-1}{y(x)-x}$: if $r(x)$ is rational, then we can pick $N,k\in\mathbb{N}$ such that $r(x)=\frac kN$ and get a distribution vector $p$ satisfying (2.5) with $p_0=\frac xN$, $q_0=\frac yN$. However, we will not find such $x$ explicitly; we will just show that it exists. To do that, observe that $y(x)$ is a continuous function of $x$, and so is $r(x)$; moreover, $r(x)\to1$ as $x\to0+$. Fix any $x_0\in(0,1)$. Then for any $r'\in(r(x_0),1)$ there exists $x'\in(0,x_0)$ such that $r(x')=r'$. By taking $r'\in\mathbb{Q}$ we get that there exists $x'$ such that $r(x')=\frac kN<1$ is rational. Finally, we want to show that $N$ can be arbitrarily large. For that, simply observe that $\frac kN=r'$, so as $r'\to1-$ we get that $N\to+\infty$.
Lemma 2.6. Let $N$ be fixed. Then $H''_0(p)$, as a function of the vector $p$, is bounded from below and unbounded from above.
Proof. Recall that $H''_0(p)=0$ on the uniform distribution; we exclude this case from further consideration. In order to simplify the notation, we denote $x_k=\log p_k$, and let $S_N$ denote the corresponding sum of the values $x_k$ through which $H''_0(p)$ is expressed. Let us establish that $H''_0(p)$ is bounded from below. To this end, we rewrite $S_N$ in a convenient form and apply the Cauchy-Schwarz inequality; then there exists $M>0$ bounding the remaining terms for every admissible $p$. Summing up, we get that $S_N$ is bounded from below, and consequently $H''_0(p)$ is bounded from below for fixed $N$.
Now we want to establish that $H''_0(p)$ is not bounded from above. To this end, let $\varepsilon>0$, and let us consider the distribution of the form $p_1=\varepsilon$, $p_2=\cdots=p_N=\frac{1-\varepsilon}{N-1}$. A direct computation then shows that $H''_0(p)\to+\infty$ as $\varepsilon\to0+$, which completes the proof.
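The unboundedness from above is easy to observe numerically. A rough sketch (a crude finite-difference estimate of $H''_0(p)$; the value of $N$ and the values of $\varepsilon$ are arbitrary choices for illustration):

import math

def renyi_entropy(p, alpha):
    # H_alpha(p) = log(sum_k p_k^alpha) / (1 - alpha); all p_k are assumed strictly positive here
    return math.log(sum(x ** alpha for x in p)) / (1.0 - alpha)

def second_derivative_at_zero(p, h=1e-4):
    # crude forward-difference estimate of the second derivative of alpha -> H_alpha(p) at alpha = 0
    return (renyi_entropy(p, 0.0) - 2.0 * renyi_entropy(p, h) + renyi_entropy(p, 2.0 * h)) / h ** 2

N = 4
for eps in [1e-2, 1e-4, 1e-8]:
    p = [eps] + [(1.0 - eps) / (N - 1)] * (N - 1)
    print(f"eps = {eps:g}:   estimated H''_0(p) = {second_derivative_at_zero(p):.2f}")
# the estimates grow as eps decreases, in line with the unboundedness from above in Lemma 2.6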

Superposition of entropy that is convex
Now we establish that the superposition of the entropy with a certain decreasing function is convex; namely, we shall consider such a function and prove its convexity. Since the tools used here do not involve differentiation, we may now allow some of the probabilities to be zero. In order to obtain convexity, we start with the following simple and known result, whose proof is added for the reader's convenience.
Lemma 2.7. For any measure space $(\mathcal{X},\Sigma,\mu)$ and any measurable $f$ with $f\in L^p(\mathcal{X},\Sigma,\mu)$ for all $p$ in some interval $[a,b]$, the norm $\|f\|_p=\|f\|_{L^p(\mathcal{X},\Sigma,\mu)}$ is log-convex as a function of $1/p$ on this interval.
Proof. Write $\frac1p$ as a convex combination of $\frac1a$ and $\frac1b$. Therefore, by the Hölder inequality, $\|f\|_p\le\|f\|_a^{\lambda}\|f\|_b^{1-\lambda}$ whenever $\frac1p=\frac{\lambda}a+\frac{1-\lambda}b$, $\lambda\in[0,1]$, as required.
Proof. This follows from Lemma 2.7 by an appropriate choice of the measure space and of the function $f$.
Remark 2.9. It follows immediately from (2.6) that if for some $p$ the function $G_\cdot(p)$ were non-decreasing on an interval, then $G_{\frac1{\alpha-1}}(p)$ would be convex on that interval, and $H_\alpha(p)$ would be convex there too. However, this need not be the case. In some sense, this is the reason why we cannot say anything definite about the 2nd derivative of the entropy either on the whole semiaxis or even on the interval $[1,+\infty)$.
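Lemma 2.7 itself is easy to test numerically on a discrete measure. The following sketch (the weights and the function values are arbitrary; a midpoint-convexity check on a grid is of course not a proof) verifies that $t\mapsto\log\|f\|_{1/t}$ behaves as a convex function:

import math

w = [0.2, 0.5, 1.0, 0.3, 2.0]     # weights of a discrete measure mu (arbitrary)
f = [0.7, 3.0, 0.1, 5.0, 1.2]     # values of a function f on the same points (arbitrary)

def lp_norm(r):
    # L^r norm of f with respect to the discrete measure with weights w
    return sum(wi * abs(fi) ** r for wi, fi in zip(w, f)) ** (1.0 / r)

def phi(t):
    # phi(t) = log ||f||_{1/t}; Lemma 2.7 states that phi is convex in t
    return math.log(lp_norm(1.0 / t))

ts = [0.1 + 0.05 * i for i in range(30)]
worst = max(phi(0.5 * (s + t)) - 0.5 * (phi(s) + phi(t)) for s in ts for t in ts)
print("largest midpoint-convexity violation:", worst)   # non-positive up to rounding errors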

Robustness of the Rényi entropy
Now we study the asymptotic behavior of the Rényi entropy depending on the behavior of the involved probabilities. The first problem is the stability of the entropy w.r.t. the involved probabilities and the rate of its convergence to the limit value when the probabilities tend to their limit values at a fixed rate.

Rate of convergence of the disturbed entropy when the initial distribution is arbitrary but fixed
Let us look at distributions that are "near" some fixed distribution $p=(p_k,\,1\le k\le N)$ and construct the approximate distribution $p(\varepsilon)=(p_k(\varepsilon),\,1\le k\le N)$ as follows. Now we can assume that some probabilities are zero, and we shall see that this assumption influences the rate of convergence of the Rényi entropy to the limit value. So, let $0\le N_1<N$ be the number of zero probabilities, and for them we consider approximate values of the form $p_k(\varepsilon)=c_k\varepsilon$, $0\le c_k\le1$, $1\le k\le N_1$. Further, let $N_2=N-N_1\ge1$ be the number of nonzero probabilities, and for them we consider approximate values of the form $p_k(\varepsilon)=p_k+c_k\varepsilon$, $|c_k|\le1$, $N_1+1\le k\le N$. Assume also that there exists $k\le N$ such that $c_k\neq0$, otherwise $H_\alpha(p)-H_\alpha(p(\varepsilon))=0$. So, we disturb the initial probabilities linearly in $\varepsilon$ with different weights whose sum should necessarily be zero, $\sum_{k=1}^Nc_k=0$. These assumptions ensure that $0\le p_k(\varepsilon)\le1$ and $p_1(\varepsilon)+\cdots+p_N(\varepsilon)=1$ for all sufficiently small $\varepsilon$. Now we want to find out how the entropy of the disturbed distribution differs from the initial entropy, depending on the parameters $\varepsilon$, $N$ and $\alpha$. We start with $\alpha=1$.
(ii) Let $c_k=0$ for all $k\le N_1$ and $\sum_{k=N_1+1}^Nc_k\log p_k\neq0$. Then $H_1(p)-H_1(p(\varepsilon))=\varepsilon\sum_{k=N_1+1}^Nc_k\log p_k+o(\varepsilon)$. (iii) Let $c_k=0$ for all $k\le N_1$ and $\sum_{k=N_1+1}^Nc_k\log p_k=0$. Then $H_1(p)-H_1(p(\varepsilon))=\varepsilon^2\sum_{k=N_1+1}^N\frac{c_k^2}{2p_k}+o(\varepsilon^2)$.
Proof. First of all, we find the asymptotic behavior of two auxiliary functions as $\varepsilon\to0$. First, let $0\le c_k\le1$. Then $c_k\varepsilon\log(c_k\varepsilon)=c_k\varepsilon\log\varepsilon+c_k\varepsilon\log c_k$ (with the convention $0\log0=0$). Second, let $p_k>0$, $|c_k|\le1$. Taking into account the Taylor expansion of the logarithm, we can write
$$(p_k+c_k\varepsilon)\log(p_k+c_k\varepsilon)=p_k\log p_k+c_k\varepsilon(\log p_k+1)+O(\varepsilon^2),\qquad\varepsilon\to0.\eqno(3.1)$$
In particular, (3.1) immediately yields the asymptotic behavior of the sum $\sum_{k=N_1+1}^Np_k(\varepsilon)\log p_k(\varepsilon)$. Now simply observe the following.
(ii) Since $c_k=0$ for all $k\le N_1$ and the total sum $c_1+\cdots+c_N=0$, we have $c_{N_1+1}+\cdots+c_N=0$. Furthermore, in this case the required asymptotics follows directly from (3.1). (iii) In this case the leading term in (3.1) vanishes after summation, and the statement follows from the next-order term of the expansion. The theorem is proved. Now we proceed with $\alpha<1$.
Proof. Similarly to the proof of Theorem 3.1, we start with several asymptotic relations as $\varepsilon\to0$. Namely, let $p_k>0$, $|c_k|\le1$. Taking into account the Taylor expansion of $(1+x)^\alpha$, which has the form $(1+x)^\alpha=1+\alpha x+O(x^2)$, $x\to0$, we can write
$$(p_k+c_k\varepsilon)^\alpha=p_k^\alpha+\alpha c_k\varepsilon p_k^{\alpha-1}+O(\varepsilon^2),\qquad\varepsilon\to0.\eqno(3.2)$$
As a consequence, we get the corresponding asymptotic relations (3.3) for the sums $\sum_kp_k(\varepsilon)^\alpha$. (i) Applying L'Hospital's rule, we get the first statement. (ii) In this case we can transform the value under the limit accordingly. (iii) Finally, the third case is treated analogously. The theorem is proved.
Now we conclude with α > 1. In this case, five different asymptotics are possible.
Then, whatever $N_1\ge0$ and the values $c_k$ for $k\le N_1$ are, the corresponding asymptotic relations hold. Proof. As in the proof of Theorem 3.2, we shall use expansions (3.2) and (3.3).
The main tool will be L'Hospital's rule.
(i) Then, whatever $N_1\ge0$ and $\alpha>1$ are, we have the corresponding relations. (ii) Let $\sum_{k=N_1+1}^Nc_kp_k^{\alpha-1}=0$, $N_1\ge1$, and there exists $k\le N_1$ such that $c_k\neq0$. Then for $\alpha<2$ the stated asymptotics holds. (iii) Let $\sum_{k=N_1+1}^Nc_kp_k^{\alpha-1}=0$, $N_1\ge0$, and $c_k=0$ for all $k\le N_1$. Then for $\alpha<2$ the corresponding relation holds. (iv) Obviously, in the case $\alpha=2$ the entropy has the simple form $H_2(q)=-\log\sum_kq_k^2$ for any distribution $q$. Therefore, if $\sum_{k=N_1+1}^Nc_kp_k^{\alpha-1}=0$ and $\alpha=2$, then, whatever $N_1\ge0$ and the values $c_k$ for $k\le N_1$ are, the stated relation holds. (v) Finally, in the remaining case, whatever $N_1\ge0$ and the values $c_k$ for $k\le N_1$ are, the last relation holds. The theorem is proved.
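The different rates appearing in the theorems above can be observed numerically already in the Shannon case $\alpha=1$. The following sketch (the distribution and the weight vectors $c$ are arbitrary illustrations, not the ones from the theorems) compares the normalizations by $\varepsilon\log(1/\varepsilon)$ and by $\varepsilon$: the first ratio approaches a constant when the zero probability is disturbed, and the second one does when only nonzero probabilities are disturbed with $\sum_kc_k\log p_k\neq0$.

import math

def shannon(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def disturbed(p, c, eps):
    # p_k(eps) = p_k + c_k * eps; for a zero probability this is just c_k * eps; requires sum(c) = 0
    return [pk + ck * eps for pk, ck in zip(p, c)]

p = [0.0, 0.5, 0.3, 0.2]                    # one zero probability (N_1 = 1)
c_hit  = [1.0, -0.5, -0.3, -0.2]            # puts mass c_1 * eps on the zero probability
c_keep = [0.0, 1.0, -0.4, -0.6]             # leaves the zero probability untouched

for label, c in [("zero probability disturbed", c_hit), ("zero probability kept", c_keep)]:
    print(label)
    for eps in [1e-3, 1e-4, 1e-5]:
        diff = abs(shannon(p) - shannon(disturbed(p, c, eps)))
        print(f"  eps = {eps:g}   diff/(eps*log(1/eps)) = {diff / (eps * math.log(1 / eps)):.3f}"
              f"   diff/eps = {diff / eps:.3f}")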
Convergence of the disturbed entropy when the initial distribution is uniform but the number of events increases to ∞
The second problem is to establish conditions for the stability of the entropy of the uniform distribution when the number of events tends to $\infty$. Let $N>1$, let $p_N(\mathrm{uni})=\left(\frac1N,\ldots,\frac1N\right)$ be the vector of the uniform $N$-dimensional distribution, let $\varepsilon=\varepsilon(N)\le\frac1N$, and let $\{c_{kN};\,N\ge1,\,1\le k\le N\}$ be a family of fixed numbers (not all zero) such that $|c_{kN}|\le1$ and $\sum_{k=1}^Nc_{kN}=0$. Note that for any $N\ge1$ some of the numbers $c_{kN}$ are strictly positive. Consider the disturbed distribution vector $p_N(\varepsilon)=\left(\frac1N(1+Nc_{kN}\varepsilon),\,1\le k\le N\right)$. We claim that if $N\varepsilon(N)\to0$ as $N\to\infty$, then $H_1(p_N(\mathrm{uni}))-H_1(p_N(\varepsilon))\to0$, $N\to\infty$.
Proof. We know that $N\varepsilon\to0$, $N\to\infty$, and the family of numbers $\{c_{kN};\,N\ge1,\,1\le k\le N\}$ is bounded. Therefore $\sup_{1\le k\le N}(1+Nc_{kN}\varepsilon)\to1$, $N\to\infty$, and for every $N\ge1$ we have $\sup_{1\le k\le N}(1+Nc_{kN}\varepsilon)\ge1$. Recall that the function $x\log x$ is increasing in $x\ge1$ and $x\log x\le0$ for $0<x<1$. Moreover, the Rényi entropy is maximal on the uniform distribution. As a consequence of all these observations and assumptions, we get that
$$0\le H_1(p_N(\mathrm{uni}))-H_1(p_N(\varepsilon))=\frac1N\sum_{k=1}^N(1+Nc_{kN}\varepsilon)\log(1+Nc_{kN}\varepsilon)\le\sup_{1\le k\le N}(1+Nc_{kN}\varepsilon)\log\sup_{1\le k\le N}(1+Nc_{kN}\varepsilon)\to0,\qquad N\to\infty.$$
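For the Shannon case treated in the proof above, the convergence can also be seen numerically. A small sketch (the choice $\varepsilon=1/N^2$, which guarantees $N\varepsilon\to0$, and the specific weights $c_{kN}$ are arbitrary illustrations):

import math

def shannon(p):
    return -sum(x * math.log(x) for x in p if x > 0)

for N in [10, 100, 1000, 10000]:            # even values of N, so that the weights below sum to zero
    eps = 1.0 / N ** 2                      # guarantees N * eps -> 0 as N grows
    c = [1.0] * (N // 2) + [-1.0] * (N // 2)
    p = [1.0 / N + ck * eps for ck in c]
    print(f"N = {N:6d}   log N - H_1(p_N(eps)) = {math.log(N) - shannon(p):.3e}")
# the gap to the maximal value log N vanishes as N -> infinity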

Binomial and Poisson distribution
In this section we look at the convergence of the Rényi entropy of the binomial distribution to the Rényi entropy of the Poisson distribution. Namely, let $\lambda>0$ be fixed, let $\xi_n$ have the binomial distribution with parameters $n$ and $\frac\lambda n$, and let $\eta$ have the Poisson distribution with parameter $\lambda$; then, for a fixed $\alpha>0$, $H_\alpha(\xi_n)\to H_\alpha(\eta)$ as $n\to\infty$. Proof. Indeed, the binomial probabilities converge pointwise to the corresponding Poisson probabilities, and the convergence of the entropies follows from Lebesgue's dominated convergence theorem. The fact that $H_\alpha(p)\le H_1(p)\le H_\beta(p)$, where $0<\beta<1<\alpha$, follows from Lemma 4.1.
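This convergence is easy to observe numerically. A minimal sketch (this is not the proof above; the values of $\lambda$, $\alpha$, the truncation level of the Poisson support and the values of $n$ are arbitrary choices for illustration):

import math

def renyi(probs, alpha):
    if abs(alpha - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1.0 - alpha)

def binomial_pmf(n, q):
    # pmf of Binomial(n, q), computed iteratively to avoid huge binomial coefficients
    pmf = [(1.0 - q) ** n]
    for k in range(n):
        pmf.append(pmf[-1] * (n - k) / (k + 1) * q / (1.0 - q))
    return pmf

def poisson_pmf(lam, cutoff=200):
    # Poisson(lam) pmf truncated at 'cutoff'; the neglected tail is negligible for lam = 3
    pmf = [math.exp(-lam)]
    for k in range(cutoff):
        pmf.append(pmf[-1] * lam / (k + 1))
    return pmf

lam, alpha = 3.0, 0.5
for n in [10, 100, 1000, 10000]:
    print(f"n = {n:6d}   H_alpha(Bin(n, lam/n)) = {renyi(binomial_pmf(n, lam / n), alpha):.6f}")
print(f"H_alpha(Poisson(lam))        = {renyi(poisson_pmf(lam), alpha):.6f}")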