Large deviations for i.i.d. replications of the total progeny of a Galton--Watson process

The Galton--Watson process is the simplest example of a branching process. The relationship between the offspring distribution and, when extinction occurs almost surely, the distribution of the total progeny is well known. In this paper, we illustrate the relationship between these two distributions in terms of the large deviation rate function (provided by Cramér's theorem) for empirical means of i.i.d. random variables. We also consider the case with a random initial population. In the final part, we present large deviation results for sequences of estimators of the offspring mean based on i.i.d. replications of the total progeny.


Introduction
There is a vast literature on branching processes. Here we cite the monographs [1,3,12]; we also cite the monograph [18] for the multitype case, [10], which focuses on statistical inference, and [13] and [15] for applications in biology.
The simplest example of a branching process is the Galton-Watson process. We consider a population that starts with a single individual and in which every individual (of every generation) lives for one unit of time; at the end of its lifetime, each individual produces a random number of new individuals, independently of all the others, according to a fixed distribution. So, if we consider a sequence of random variables $\{V_n : n \ge 0\}$ such that $V_n$ is the population size at time $n$ (for all $n \ge 0$), we have $V_0 = 1$ and $V_n := \sum_{k=1}^{V_{n-1}} X_{n,k}$ (for $n \ge 1$), where $\{X_{n,i} : n, i \ge 1\}$ is a family of nonnegative integer-valued i.i.d. random variables. In other words, $X_{n,1}, \ldots, X_{n,V_{n-1}}$ represent the offspring generated at time $n$ by each of the $V_{n-1}$ individuals that live at time $n-1$. We recall some other preliminaries on the Galton-Watson process in Section 2, where, in particular, we adopt a slightly different notation that allows a random initial population (instead of the unit initial population considered above).
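The recursion for the population sizes can be sketched in a few lines of Python. This is only an illustrative simulation, not part of the paper's argument: the function name `simulate_generations` and the convention that `offspring_pmf[h]` stores $P(X_{n,i} = h)$ are assumptions made here for the example.

```python
import random


def simulate_generations(offspring_pmf, n_generations, rng, v0=1):
    """Return [V_0, ..., V_n] for one path of a Galton-Watson process.

    offspring_pmf[h] is taken to be P(X = h); this input format is an
    assumption of the sketch, not notation from the paper.
    """
    support = range(len(offspring_pmf))
    sizes = [v0]
    for _ in range(n_generations):
        current = sizes[-1]
        if current == 0:
            # Extinction is absorbing: no individuals, no offspring.
            sizes.append(0)
            continue
        # V_n is the sum of V_{n-1} i.i.d. offspring counts.
        children = rng.choices(support, weights=offspring_pmf, k=current)
        sizes.append(sum(children))
    return sizes


rng = random.Random(0)
print(simulate_generations([0.5, 0.3, 0.2], 10, rng))
```

With a degenerate offspring law the path is deterministic, which gives a quick sanity check: two children per individual doubles the population at every step.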
In this paper, we present large deviation results. The theory of large deviations is a collection of techniques that give asymptotic estimates of small probabilities on an exponential scale (see, e.g., [6] as a reference). We recall some preliminaries in Section 2. The literature on large deviations for branching processes is large; here we essentially recall some references with results concerning the Galton-Watson process.
In several references, the large-time behavior for the supercritical case is studied, namely the case where the offspring mean $\mu$ is strictly larger than one (in such a case, the extinction probability is strictly less than one). Here we recall [2] (see also [4] for the multitype case); [5], where the main object is the study of the tails of $W := \lim_{n\to\infty} V_n/\mu^n$; [19], with a careful analysis based on harmonic moments of $\{V_n : n \ge 0\}$; [20] (and [21]), with some conditional large deviation results based on some local limit theorems; and [8], where the central role of some "lower deviation probabilities" is highlighted for the study of the asymptotic behavior of the Lotka-Nagaev estimator $V_{n+1}/V_n$ of $\mu$.
Other references study the most likely paths to extinction at some time $n_0$ when the initial population $k$ is large. The idea is to represent a branching process with initial population equal to $k$ as a sum of $k$ i.i.d. replications of the process with a unit initial population; in this case, Cramér's theorem for empirical means of i.i.d. random variables (on $\mathbb{R}^{n_0}$) plays a crucial role. A most likely path to extinction in [16] (see also [17]) is a trajectory that minimizes the rate function among the paths that reach the level 0 at time $n_0$. A generalization of this concept for the most likely paths to reach a level $b \ge 0$ can be found in [11].
In this paper, we are interested in a different direction. Namely, we are interested in the empirical means of i.i.d. replications of the total progeny of a Galton-Watson process. The total progenies of branching processes are studied in several references: here we cite the old references [7,14,22] for a Galton-Watson process, and [9] (see Section 2.2) among the references concerning different branching processes. The total progeny of a Galton-Watson process is an almost surely finite random variable when the extinction occurs almost surely, and therefore the supercritical case will not be considered. Some relationships between the offspring distribution and the total progeny distribution of a Galton-Watson process are well known (see (3) for the probability mass functions and (4) for the probability generating functions).
A new relationship is provided by Proposition 1, where we illustrate how the rate function for the empirical means of total progenies can be expressed in terms of the analogous rate function for the empirical means of a single progeny. This is a quite natural large deviation problem, and, as we can expect, (4) has an important role in the proof; in fact, the large deviation rate function for empirical means of i.i.d. random variables (provided by Cramér's theorem recalled below; see Theorem 1) is given by the Legendre transform of the logarithm of the (common) moment generating function of the random variables. Moreover, the relationship provided by Proposition 1 can be of interest in information theory because the rate functions involved can be expressed in terms of suitable relative entropies (or Kullback-Leibler divergences); see, for example, [23] for a discussion of the rate function expressions in terms of the relative entropy.
Another result presented in this paper is Proposition 2, which is a version of Proposition 1 where the initial population $V_0$ is a random variable with a suitable distribution. Finally, in Propositions 3 and 4, we prove large deviation results for some estimators of the offspring mean $\mu$ in terms of i.i.d. replications of the total progeny and of the initial population (we are considering the case where the initial population $V_0$ is a random variable as in Proposition 2).
We conclude with the outline of the paper. We start with some preliminaries in Section 2. In Section 3, we prove the results concerning the large deviation rate functions related to Cramér's theorem. Finally, in Section 4, we prove the large deviation results for the estimators of the offspring mean µ.

Preliminaries
We start with some preliminaries on the Galton-Watson process. In the second part, we recall some preliminaries on large deviations.

Preliminaries on Galton-Watson process
Here we introduce a slightly different notation and recall some preliminaries in order to define the total progeny of a Galton-Watson process. We start with some notation concerning the offspring distribution (note that $\mu_f$ defined below coincides with $\mu$ in the Introduction):
• the probability mass function $p_h := P(X_{n,i} = h)$ (for all integers $h \ge 0$);
• the probability generating function $f(s) := \sum_{h \ge 0} s^h p_h$;
• the mean value $\mu_f := \sum_{h \ge 0} h p_h$ (and we have $\mu_f = f'(1)$).
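So, from now on, we consider the following slightly different notation: $\{V_n^{f,g} : n \ge 0\}$ (in place of $\{V_n : n \ge 0\}$ presented before). More precisely:
• the probability generating function of $V_0^{f,g}$ is
$$g(s) := \sum_{r \ge 0} s^r q_r, \tag{1}$$
where $q_r := P(V_0^{f,g} = r)$ (so the distribution of $V_0^{f,g}$ does not depend on $f$), and therefore its mean value is $\mu_g := \sum_{r \ge 0} r q_r = g'(1)$;
• for a family of i.i.d. random variables $\{X_{n,i} : n, i \ge 1\}$ with probability generating function $f$, we have $V_n^{f,g} := \sum_{i=1}^{V_{n-1}^{f,g}} X_{n,i}$ (for $n \ge 1$).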

Remark 1.
Note that $\{V_n^{f,g} : n \ge 0\}$ here corresponds to $\{V_n : n \ge 0\}$ presented in the Introduction if $q_1 = 1$ or, equivalently, if $g = \mathrm{id}$ (i.e. $g(s) = s$ for all $s$).
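If we consider the extinction probability
$$p_{\mathrm{ext}}^{f,g} := P(\{V_n^{f,g} = 0 \text{ for some } n \ge 0\}),$$
then it is known that we have
$$p_{\mathrm{ext}}^{f,g} = g(p_{\mathrm{ext}}^{f,\mathrm{id}})$$
and, if $q_0 < 1$ (we obviously have $p_{\mathrm{ext}}^{f,g} = 1$ if $q_0 = 1$), then we have the following cases:
• $p_{\mathrm{ext}}^{f,g} = q_0$ if $p_0 = 0$;
• $p_{\mathrm{ext}}^{f,g} = 1$ if $p_0 > 0$ and $\mu_f \le 1$;
• $p_{\mathrm{ext}}^{f,g} \in (0,1)$ if $p_0 > 0$ and $\mu_f > 1$.
Then, if $p_0 > 0$ and $\mu_f \le 1$, the random variable $Y^{f,g}$ defined by
$$Y^{f,g} := \sum_{n \ge 0} V_n^{f,g}$$
is almost surely finite and provides the total progeny of $\{V_n^{f,g} : n \ge 0\}$. In view of what follows, we consider the probability generating function
$$G_{f,g}(s) := \sum_{k \ge 0} s^k \pi_k^{f,g},$$
where $\{\pi_k^{f,g} : k \ge 0\}$ is the probability mass function of the random variable $Y^{f,g}$. Moreover, we have the mean value $\nu_{f,g} := \sum_{k \ge 0} k \pi_k^{f,g}$, and we have
$$\nu_{f,g} = \frac{\mu_g}{1 - \mu_f} \tag{2}$$
(with $\nu_{f,g} = \infty$ if $\mu_f = 1$ and $\mu_g > 0$); in particular, $\nu_{f,\mathrm{id}} = \frac{1}{1 - \mu_f}$. Finally, we recall some well-known connections between total progeny and offspring distributions (see e.g. [7]): for the probability mass functions, we have $\pi_0^{f,g} = q_0$ and
$$\pi_n^{f,g} = \sum_{r=1}^{n} \frac{r}{n}\, q_r\, p^{*n}_{n-r} \quad (\text{for } n \ge 1), \tag{3}$$
where $\{p^{*n}_h : h \ge 0\}$ is the $n$th convolution power of $\{p_h : h \ge 0\}$; for the probability generating functions, we have
$$G_{f,g}(s) = g(G_{f,\mathrm{id}}(s)), \qquad G_{f,\mathrm{id}}(s) = s\, f(G_{f,\mathrm{id}}(s)). \tag{4}$$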
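The link between the two probability mass functions can be checked numerically. The sketch below uses the classical identity $\pi_k = \frac{1}{k}\, p^{*k}_{k-1}$ for a unit initial population (the case $g = \mathrm{id}$), which we assume here to be the relation recalled from [7]; the function names and the list-based representation of the probability mass function are hypothetical choices made for the example.

```python
def convolve(a, b, size):
    """Convolution of two pmfs on {0, 1, ...}, truncated to `size` terms."""
    out = [0.0] * size
    for i, ai in enumerate(a[:size]):
        if ai == 0.0:
            continue
        for j, bj in enumerate(b[: size - i]):
            out[i + j] += ai * bj
    return out


def total_progeny_pmf(offspring_pmf, k_max):
    """Return [pi_0, ..., pi_{k_max}] with pi_k = p^{*k}_{k-1} / k (V_0 = 1)."""
    size = k_max  # entries 0..k_max-1 suffice: only p^{*k}_{k-1} is needed
    power = [1.0] + [0.0] * (size - 1)  # p^{*0} is the point mass at 0
    pi = [0.0]  # pi_0 = 0 when the initial population is 1
    for k in range(1, k_max + 1):
        power = convolve(power, offspring_pmf, size)  # now holds p^{*k}
        pi.append(power[k - 1] / k)
    return pi


# Offspring equal to 0 or 1 (with P(X = 1) = 0.3): the total progeny is then
# geometric, P(Y = k) = 0.3**(k - 1) * 0.7, and its mean is 1 / (1 - mu_f).
pi = total_progeny_pmf([0.7, 0.3], 60)
print(pi[1:4], sum(k * p for k, p in enumerate(pi)))
```

Truncating the convolution does not affect the entries that are kept, since each entry of a convolution on $\{0, 1, \ldots\}$ depends only on lower-indexed entries of the factors.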

Preliminaries on large deviations
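We start with the concept of large deviation principle (LDP). A sequence of random variables $\{W_n : n \ge 1\}$ taking values in a topological space $\mathcal{W}$ satisfies the LDP with rate function $I : \mathcal{W} \to [0, \infty]$ if $I$ is a lower semicontinuous function,
$$\liminf_{n \to \infty} \frac{1}{n} \log P(W_n \in O) \ge -\inf_{w \in O} I(w) \quad \text{for all open sets } O,$$
and
$$\limsup_{n \to \infty} \frac{1}{n} \log P(W_n \in C) \le -\inf_{w \in C} I(w) \quad \text{for all closed sets } C.$$
We also recall that a rate function $I$ is said to be good if all its level sets $\{\{w \in \mathcal{W} : I(w) \le \eta\} : \eta \ge 0\}$ are compact.
Remark 2.
If $P(W_n \in S) = 1$ for some closed set $S$ (at least eventually with respect to $n$), then $I(w) = \infty$ for $w \notin S$; this can be checked by taking the lower bound for the open set $O = S^c$.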
In particular, we refer to Cramér's theorem on $\mathbb{R}^d$ (see e.g. Theorems 2.2.3 and 2.2.30 in [6] for the cases $d = 1$ and $d \ge 2$), and we recall its statement. We remark that, in this paper, we consider the cases $d = 1$ (in such a case, the rate function need not be a good rate function) and $d = 2$. Moreover, we use the symbol $\langle \cdot, \cdot \rangle$ for the inner product in $\mathbb{R}^d$.
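For $d = 1$, the rate function in Cramér's theorem is the Legendre transform $x \mapsto \sup_{\theta} \{\theta x - \log E[e^{\theta X}]\}$, and it can be approximated numerically. The following is a minimal sketch under our own assumptions: the ternary search, the search interval $[-30, 30]$ and the Poisson law used as a test case are illustrative choices, not taken from the paper.

```python
import math


def legendre_transform(log_mgf, x, lo=-30.0, hi=30.0, iters=200):
    """Approximate sup over theta in [lo, hi] of theta*x - log_mgf(theta).

    Ternary search is valid because the objective is concave in theta.
    If the supremum is not attained inside [lo, hi], the returned value
    is only a lower bound for the rate function.
    """
    def objective(theta):
        return theta * x - log_mgf(theta)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective((lo + hi) / 2.0)


def poisson_log_mgf(lam):
    """Logarithm of the moment generating function of a Poisson(lam) law."""
    return lambda theta: lam * (math.exp(theta) - 1.0)


def poisson_rate(lam, x):
    """Closed-form Cramer rate function of Poisson(lam), for x > 0."""
    return x * math.log(x / lam) - x + lam


print(legendre_transform(poisson_log_mgf(1.0), 2.0), poisson_rate(1.0, 2.0))
```

The numerical value agrees with the closed form $x \log(x/\lambda) - x + \lambda$, and it vanishes at the mean $x = \lambda$, as any Cramér rate function does.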
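Thus, we can set $\alpha = \log G_{f,\mathrm{id}}(e^\beta)$ (for $\beta \in \mathcal{D}(G_{f,\mathrm{id}})$) in the expression of $I_f(x)$, and we get
$$I_f(x) = \sup_{\beta \in \mathcal{D}(G_{f,\mathrm{id}})} \left\{ x \log G_{f,\mathrm{id}}(e^\beta) - \log f(G_{f,\mathrm{id}}(e^\beta)) \right\}.$$
Then (we take into account (4) in the second equality below)
$$I_f(x) = \sup_{\beta \in \mathcal{D}(G_{f,\mathrm{id}})} \left\{ x \log G_{f,\mathrm{id}}(e^\beta) - \log f(G_{f,\mathrm{id}}(e^\beta)) \right\} = \sup_{\beta \in \mathcal{D}(G_{f,\mathrm{id}})} \left\{ \beta - (1 - x) \log G_{f,\mathrm{id}}(e^\beta) \right\}$$
and, for $x \in [0, 1)$, we get
$$I_f(x) = (1 - x)\, I_{G_{f,\mathrm{id}}}\!\left( \frac{1}{1 - x} \right).$$
We conclude by taking $x = \frac{y-1}{y}$ for $y \ge 1$ (thus, $x \in [0, 1)$), and we obtain the desired equality with some easy computations.
Now we present Proposition 2, which concerns the LDP for the empirical means of i.i.d. bivariate random variables $\{(Y_n, Z_n) : n \ge 1\}$ distributed as $(Y^{f,g}, V_0^{f,g})$. In particular, we obtain an expression for the rate function $I_{G_{f,g},g}$ in terms of $I_f$ in Lemma 1 and $I_g$ defined by
$$I_g(z) := \sup_{\gamma \in \mathbb{R}} \left\{ \gamma z - \log g(e^\gamma) \right\}. \tag{5}$$
Proposition 2. Assume that $E[e^{\beta Y^{f,g} + \gamma V_0^{f,g}}]$ is finite in a neighborhood of $(\beta, \gamma) = (0, 0)$. Let $\{(\bar{Y}_n, \bar{Z}_n) : n \ge 1\}$ be the sequence of empirical means defined by $(\bar{Y}_n, \bar{Z}_n) := \left( \frac{1}{n} \sum_{k=1}^{n} Y_k, \frac{1}{n} \sum_{k=1}^{n} Z_k \right)$ (for all $n \ge 1$). Then $\{(\bar{Y}_n, \bar{Z}_n) : n \ge 1\}$ satisfies the LDP with good rate function $I_{G_{f,g},g}$ defined by
$$I_{G_{f,g},g}(y, z) := \begin{cases} y\, I_f\!\left( \frac{y - z}{y} \right) + I_g(z) & \text{if } y \ge z \ge 0 \text{ and } y > 0, \\ I_g(0) & \text{if } y = z = 0, \\ \infty & \text{otherwise.} \end{cases}$$
Remark 3. We are assuming (implicitly) that $p_0 > 0$ and $\mu_f \le 1$; in fact, since we require that $E[e^{\beta Y^{f,g} + \gamma V_0^{f,g}}]$ is finite in a neighborhood of $(\beta, \gamma) = (0, 0)$, we are assuming that $\mu_f < 1$ and $\mu_g < \infty$.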
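Proof. The LDP is a consequence of Cramér's theorem (Theorem 1) with $d = 2$, and the rate function $I_{G_{f,g},g}$ is defined by
$$I_{G_{f,g},g}(y, z) := \sup_{(\beta, \gamma) \in \mathbb{R}^2} \left\{ \beta y + \gamma z - \log E\!\left[ e^{\beta Y^{f,g} + \gamma V_0^{f,g}} \right] \right\}.$$
Throughout the proof, we restrict our attention to the pairs $(y, z)$ such that $y \ge z \ge 0$. In fact, almost surely, we have $Y^{f,g} \ge V_0^{f,g} \ge 0$, and therefore $\bar{Y}_n \ge \bar{Z}_n \ge 0$; thus, by Remark 2, we have $I_{G_{f,g},g}(y, z) = \infty$ if the condition $y \ge z \ge 0$ fails.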
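We remark that $E[s^{Y^{f,g}} \mid V_0^{f,g}] = (G_{f,\mathrm{id}}(s))^{V_0^{f,g}}$, and therefore
$$E\!\left[ e^{\beta Y^{f,g} + \gamma V_0^{f,g}} \right] = g\!\left( e^\gamma G_{f,\mathrm{id}}(e^\beta) \right);$$
thus,
$$I_{G_{f,g},g}(y, z) = \sup_{(\beta, \gamma) \in \mathbb{R}^2} \left\{ \beta y + \gamma z - \log g\!\left( e^\gamma G_{f,\mathrm{id}}(e^\beta) \right) \right\}.$$
Furthermore, the function as in the proof of Proposition 1; then, for $\delta := \gamma + \log G_{f,\mathrm{id}}(e^\beta)$, we obtain
$$I_{G_{f,g},g}(y, z) = \sup_{\beta, \delta} \left\{ \beta y + \left( \delta - \log G_{f,\mathrm{id}}(e^\beta) \right) z - \log g(e^\delta) \right\}.$$
Thus, for $z > 0$, we have (note that the last equality holds by Proposition 1)
$$I_{G_{f,g},g}(y, z) \le \sup_{\beta} \left\{ \beta y - z \log G_{f,\mathrm{id}}(e^\beta) \right\} + \sup_{\delta} \left\{ \delta z - \log g(e^\delta) \right\} = z\, I_{G_{f,\mathrm{id}}}\!\left( \frac{y}{z} \right) + I_g(z) = y\, I_f\!\left( \frac{y - z}{y} \right) + I_g(z).$$
We conclude by showing the inverse inequality
$$I_{G_{f,g},g}(y, z) \ge y\, I_f\!\left( \frac{y - z}{y} \right) + I_g(z). \tag{6}$$
To this end, we take two sequences $\{\beta_n : n \ge 1\}$ and $\{\delta_n : n \ge 1\}$ such that
$$\lim_{n \to \infty} \left\{ \beta_n y - z \log G_{f,\mathrm{id}}(e^{\beta_n}) \right\} = z\, I_{G_{f,\mathrm{id}}}\!\left( \frac{y}{z} \right) \quad \text{and} \quad \lim_{n \to \infty} \left\{ \delta_n z - \log g(e^{\delta_n}) \right\} = I_g(z).$$
Then we have $I_{G_{f,g},g}(y, z) \ge \beta_n y + \left( \delta_n - \log G_{f,\mathrm{id}}(e^{\beta_n}) \right) z - \log g(e^{\delta_n})$, and we get (6) letting $n$ go to infinity.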
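The identity from Proposition 1 that is used in the proof above can be tested on a case where everything is explicit. We assume here that Proposition 1 reads $I_{G_{f,\mathrm{id}}}(y) = y\, I_f\!\left(\frac{y-1}{y}\right)$ for $y \ge 1$, as the change of variable $x = \frac{y-1}{y}$ in its proof suggests. The example takes offspring equal to 0 or 1 (so that the total progeny is geometric); this choice, the helper names and the search interval are assumptions of the sketch.

```python
import math


def sup_concave(objective, lo=-30.0, hi=30.0, iters=200):
    """Ternary search for the supremum of a concave function on [lo, hi].

    The objective may return -inf to the right of its domain; ties between
    two -inf values move the search to the left, which is the right
    direction for the examples below.
    """
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective((lo + hi) / 2.0)


def rate_offspring(p, x):
    """I_f(x) when the offspring is 1 with probability p and 0 otherwise."""
    return sup_concave(lambda a: a * x - math.log(1.0 - p + p * math.exp(a)))


def rate_total_progeny(p, y):
    """Rate function of the total progeny, P(Y = k) = p**(k-1) * (1-p)."""
    def objective(b):
        if p * math.exp(b) >= 1.0:  # the moment generating function is infinite
            return -math.inf
        log_mgf = b + math.log(1.0 - p) - math.log(1.0 - p * math.exp(b))
        return b * y - log_mgf
    return sup_concave(objective)


p, y = 0.3, 3.0
print(rate_total_progeny(p, y), y * rate_offspring(p, (y - 1.0) / y))
```

Both numbers agree with the closed form $(y-1) \log\frac{y-1}{yp} - \log(y(1-p))$ obtained by solving the two optimization problems by hand.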

Large deviations for estimators of $\mu_f$
In this section, we prove two LDPs for two sequences of estimators of the offspring mean $\mu_f$. Namely, if $\{(\bar{Y}_n, \bar{Z}_n) : n \ge 1\}$ is the sequence in Proposition 2 (see also the precise assumptions in Remark 3; in particular, we have $\mu_f < 1$), then we consider
$$\left\{ \frac{\bar{Y}_n - \bar{Z}_n}{\bar{Y}_n} : n \ge 1 \right\} \quad \text{and} \quad \left\{ \frac{\bar{Y}_n - \mu_g}{\bar{Y}_n} : n \ge 1 \right\}.$$
Obviously, these estimators are well defined if the denominators $\bar{Y}_n$ are different from zero; then, in order to have well-defined estimators, we always assume that $q_0 = 0$ (where $q_0$ is as in (1)), and, noting that, in general, $I_g(0) = -\log q_0$, we have $I_g(0) = \infty$. Moreover, both sequences converge to $\mu_f$ (see (2)), and they coincide when the initial population is deterministic (equal to $\mu_g$ almost surely).
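A small simulation illustrates why both ratios estimate $\mu_f$: since the mean total progeny is $\mu_g/(1 - \mu_f)$, the limits of $(\bar{Y}_n - \bar{Z}_n)/\bar{Y}_n$ and $(\bar{Y}_n - \mu_g)/\bar{Y}_n$ are both $\mu_f$. The code below is a sketch under our own assumptions: the function names, the list-based probability mass functions and the safety cap on the population are hypothetical and not part of the paper.

```python
import random


def total_progeny(offspring_pmf, v0, rng, cap=10**6):
    """Total progeny Y = V_0 + V_1 + ... started from v0 individuals."""
    support = range(len(offspring_pmf))
    total, current = v0, v0
    while current > 0 and total < cap:  # cap is a safety guard, not part of the model
        current = sum(rng.choices(support, weights=offspring_pmf, k=current))
        total += current
    return total


def estimators(offspring_pmf, initial_pmf, n, rng):
    """Return ((Ybar - Zbar)/Ybar, (Ybar - mu_g)/Ybar) from n replications.

    initial_pmf[0] must be 0 (q_0 = 0), so that Ybar > 0.
    """
    mu_g = sum(r * q for r, q in enumerate(initial_pmf))
    z = rng.choices(range(len(initial_pmf)), weights=initial_pmf, k=n)
    y = [total_progeny(offspring_pmf, v0, rng) for v0 in z]
    y_bar, z_bar = sum(y) / n, sum(z) / n
    return (y_bar - z_bar) / y_bar, (y_bar - mu_g) / y_bar


# Offspring mean mu_f = 0.5; initial population uniform on {1, 2}.
print(estimators([0.5, 0.5], [0.0, 0.5, 0.5], 5000, random.Random(1)))
```

When the offspring is identically zero, every replication satisfies $Y_k = Z_k$, so the first estimator is exactly $0 = \mu_f$ while the second one still fluctuates with $\bar{Z}_n$.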
The LDPs of these two sequences are proved in Propositions 3 and 4. Moreover, Corollary 1 and Remark 4 concern the comparison between the convergence of the first sequence $\left\{ \frac{\bar{Y}_n - \bar{Z}_n}{\bar{Y}_n} : n \ge 1 \right\}$ and its analogue when the initial population is deterministic (equal to the mean). Propositions 3 and 4 are proved by combining the contraction principle (see e.g. Theorem 4.2.1 in [6]) and Proposition 2 (note that the rate function $I_{G_{f,g},g}$ in Proposition 2 is good, as is required to apply the contraction principle). We remark that, in the proofs of Propositions 3 and 4, we take into account that $I_{G_{f,g},g}(0, 0) = \infty$ by Proposition 2 and $I_g(0) = \infty$. At the end of this section, we present some remarks on the comparison between the rate functions in Propositions 3 and 4 (Remarks 5 and 6).
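We start with the LDP of the first sequence of estimators.
Proposition 3. Assume that the hypotheses of Proposition 2 hold and that $q_0 = 0$. Then $\left\{ \frac{\bar{Y}_n - \bar{Z}_n}{\bar{Y}_n} : n \ge 1 \right\}$ satisfies the LDP with good rate function $J_{G_{f,g},g}$ defined by
$$J_{G_{f,g},g}(x) := \begin{cases} -\log g\!\left( e^{-\frac{I_f(x)}{1 - x}} \right) & \text{if } x \in [0, 1), \\ \infty & \text{otherwise.} \end{cases}$$
Proof. By Proposition 2 and the contraction principle, we have the LDP of $\left\{ \frac{\bar{Y}_n - \bar{Z}_n}{\bar{Y}_n} : n \ge 1 \right\}$ with good rate function $J_{G_{f,g},g}$ defined by
$$J_{G_{f,g},g}(x) := \inf\left\{ I_{G_{f,g},g}(y, z) : y \ge z > 0,\ \frac{y - z}{y} = x \right\}.$$
The case $x \notin [0, 1)$ is trivial because we have the infimum over the empty set. For $x \in [0, 1)$, we rewrite this expression as follows (where we take into account the expression of the rate function $I_{G_{f,g},g}$ in Proposition 2):
$$J_{G_{f,g},g}(x) = \inf_{z > 0} \left\{ \frac{z}{1 - x} I_f(x) + I_g(z) \right\} = -\sup_{z > 0} \left\{ -\frac{I_f(x)}{1 - x}\, z - I_g(z) \right\} = -\log g\!\left( e^{-\frac{I_f(x)}{1 - x}} \right)$$
by taking into account the definition of $I_g$ in (5) and the well-known properties of Legendre transforms (see e.g. Lemma 4.5.8 in [6]; see also Lemma 2.2.5(a) and Exercise 2.2.22 in [6] for the convexity and the lower semicontinuity of $\gamma \mapsto \log g(e^\gamma)$).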
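We have an immediate consequence of this proposition that concerns the case with a deterministic initial population equal to $\mu_g$ (almost surely). Namely, if we consider the probability generating function $g_\diamond$ defined by $g_\diamond(s) := s^{\mu_g}$ (for all $s$), then we mean the case $g = g_\diamond$, and therefore:
• $V_0^{f,g_\diamond} = \mu_g$ almost surely; thus, $Z_n = \mu_g$ and $\bar{Z}_n = \mu_g$ almost surely (for all $n \ge 1$);
• the rate function $J_{G_{f,g_\diamond},g_\diamond}$ is
$$J_{G_{f,g_\diamond},g_\diamond}(x) = \begin{cases} \mu_g \frac{I_f(x)}{1 - x} & \text{if } x \in [0, 1), \\ \infty & \text{otherwise} \end{cases} \tag{7}$$
by Proposition 3.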
Corollary 1 (Comparison between $J_{G_{f,g},g}$ in Proposition 3 and $J_{G_{f,g_\diamond},g_\diamond}$). We have
$$J_{G_{f,g},g}(x) \le J_{G_{f,g_\diamond},g_\diamond}(x)$$
for all $x \in \mathbb{R}$. Moreover, the inequality turns into an equality if and only if we have one of the following cases:
• $x \notin [0, 1)$ and $J_{G_{f,g},g}(x) = J_{G_{f,g_\diamond},g_\diamond}(x) = \infty$;
• $x = \mu_f$ and $J_{G_{f,g},g}(x) = J_{G_{f,g_\diamond},g_\diamond}(x) = 0$;
• $V_0^{f,g}$ is deterministic, equal to $\mu_g$, and $J_{G_{f,g},g}(x) = J_{G_{f,g_\diamond},g_\diamond}(x)$ for all $x \in \mathbb{R}$.
Proof. The case $x \notin [0, 1)$ is trivial. Otherwise, if $x \in [0, 1)$, then by Jensen's inequality we have
$$-\log g\!\left( e^{-\frac{I_f(x)}{1 - x}} \right) \le \mu_g \frac{I_f(x)}{1 - x};$$
moreover, the cases where the inequality turns into an equality follow from the well-known properties of Jensen's inequality.
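The Jensen step can be illustrated numerically. We assume here that the two rate functions being compared have the form $-\log g(e^{-a})$ and $\mu_g a$ with $a = I_f(x)/(1 - x) \ge 0$, which is how we read the fragments of the proof above; the function names and the list-based representation of the initial distribution are hypothetical.

```python
import math


def rate_random_initial(initial_pmf, a):
    """-log g(exp(-a)), where g is the pgf of the initial population."""
    g_value = sum(q * math.exp(-a * r) for r, q in enumerate(initial_pmf))
    return -math.log(g_value)


def rate_deterministic_initial(initial_pmf, a):
    """mu_g * a, the value obtained when the initial population equals mu_g."""
    mu_g = sum(r * q for r, q in enumerate(initial_pmf))
    return mu_g * a


# Initial population uniform on {1, 2}: the random case never exceeds the
# deterministic one, and the two agree at a = 0.
for a in (0.0, 0.1, 1.0, 5.0):
    print(a, rate_random_initial([0.0, 0.5, 0.5], a),
          rate_deterministic_initial([0.0, 0.5, 0.5], a))
```

For a degenerate initial distribution the two functions coincide for every $a$, which matches the equality case for a deterministic initial population.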
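Remark 4 (Comparison between convergence of estimators of $\mu_f$). Assume that $\mu_f > 0$ and the initial population is not deterministic. Then there exists $\eta > 0$ such that
$$0 < J_{G_{f,g},g}(x) < J_{G_{f,g_\diamond},g_\diamond}(x) \quad \text{for } 0 < |x - \mu_f| < \eta. \tag{8}$$
Thus, we can say that $\left\{ \frac{\bar{Y}_n^{f,g_\diamond} - \mu_g}{\bar{Y}_n^{f,g_\diamond}} : n \ge 1 \right\}$ converges to $\mu_f$ (as $n \to \infty$) faster than $\left\{ \frac{\bar{Y}_n^{f,g} - \bar{Z}_n}{\bar{Y}_n^{f,g}} : n \ge 1 \right\}$; in fact, we can find $\varepsilon > 0$ such that We can repeat the same argument to say that $\left\{ \frac{\bar{Y}_n^{f,g_\diamond} - \mu_g}{\bar{Y}_n^{f,g_\diamond}} : n \ge 1 \right\}$ converges to $\mu_f$ (as $n \to \infty$) faster than $\{\bar{X}_n : n \ge 1\}$ in Lemma 1. In fact, we have $V_0^{f,g_\diamond} = \mu_g$ almost surely, $\mu_g$ is an integer, and, since $\mu_g > 0$ because $q_0 = 0$, we have $\mu_g \ge 1$; then we have .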
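Now we present the LDP for the second sequence of estimators.
Proposition 4. Assume that the hypotheses of Proposition 2 hold and that $q_0 = 0$. Then $\left\{ \frac{\bar{Y}_n - \mu_g}{\bar{Y}_n} : n \ge 1 \right\}$ satisfies the LDP with good rate function $J_{\mu_g}$ defined by
$$J_{\mu_g}(x) := \begin{cases} \inf\left\{ \frac{\mu_g}{1 - x}\, I_f\!\left( 1 - \frac{(1 - x) z}{\mu_g} \right) + I_g(z) : z > 0 \right\} & \text{if } x < 1, \\ \infty & \text{if } x \ge 1. \end{cases}$$
Proof. By Proposition 2 and the contraction principle, we have the LDP of $\left\{ \frac{\bar{Y}_n - \mu_g}{\bar{Y}_n} : n \ge 1 \right\}$ with good rate function $J_{\mu_g}$ defined by
$$J_{\mu_g}(x) := \inf\left\{ I_{G_{f,g},g}(y, z) : y \ge z > 0,\ \frac{y - \mu_g}{y} = x \right\}.$$
The case $x \ge 1$ is trivial because we have the infimum over the empty set (we recall that $\mu_g > 0$ because $q_0 = 0$). For $x < 1$, we have $J_{\mu_g}(x) = \inf\left\{ I_{G_{f,g},g}\!\left( \frac{\mu_g}{1 - x}, z \right) : z > 0 \right\}$, and we obtain the desired formula by taking into account the expression of the rate function $I_{G_{f,g},g}$ in Proposition 2.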
Remark 5 (We can have $J_{\mu_g}(x) < \infty$ for some $x < 0$). We know that, for $J_{G_{f,g},g}$ in Proposition 3, we have $J_{G_{f,g},g}(x) = \infty$ for $x \notin [0, 1)$. On the contrary, as we see, we could have $J_{\mu_g}(x) < \infty$ for some $x < 0$. In order to explain this fact, we denote the minimum value $r$ such that $q_r > 0$ by $r_{\min}$; then we have $\mu_g \ge r_{\min}$; moreover, we have $\mu_g > r_{\min}$ if $q_{r_{\min}} < 1$. In conclusion, we can say that if $\mu_g > r_{\min}$, then the range of negative values $x$ such that $J_{\mu_g}(x) < \infty$ is
$$1 - \frac{\mu_g}{r_{\min}} \le x < 0; \tag{9}$$
in fact, for $x < 1$, both $I_f\!\left( 1 - \frac{(1 - x) z}{\mu_g} \right)$ and $I_g(z)$ are finite for $z \in \left[ r_{\min}, \frac{\mu_g}{1 - x} \right]$, and therefore we can say that $J_{\mu_g}(x) < \infty$ if $r_{\min} \le \frac{\mu_g}{1 - x}$ or, equivalently, if (9) holds.
Remark 6 (Estimators of $\mu_f$ when $\mu_f = 0$). If $\mu_f = 0$, that is, $f(s) = 1$ for all $s$ or, equivalently, $p_0 = 1$, then the rate function in Proposition 3 is
$$J_{G_{f,g},g}(x) = \begin{cases} 0 & \text{if } x = 0, \\ \infty & \text{otherwise.} \end{cases}$$
Then it is easy to check that $J_{G_{f,g},g}$ coincides with $I_f$, and therefore $J_{G_{f,g},g}$ coincides with $J_{G_{f,g_\diamond},g_\diamond}$ in (7) (note that, in particular, we cannot have the strict inequalities in (8) in Remark 4 stated for the case $\mu_f > 0$). Finally, if $\mu_f = 0$ (and as usual $q_0 = 0$ or, equivalently, $\mu_g > 0$), then we have $z = \frac{\mu_g}{1 - x}$ in the variational formula of the rate function in Proposition 4, and therefore
$$J_{\mu_g}(x) = \begin{cases} I_g\!\left( \frac{\mu_g}{1 - x} \right) & \text{if } 1 - \frac{\mu_g}{r_{\min}} \le x < 1, \\ \infty & \text{otherwise.} \end{cases} \tag{10}$$
Note that the rate function in (10) can also be derived by combining the contraction principle and the rate function $I_g$ for the empirical means $\{\bar{Z}_n : n \ge 1\}$; in fact, we have $\left\{ \frac{\bar{Y}_n - \mu_g}{\bar{Y}_n} : n \ge 1 \right\} = \left\{ \frac{\bar{Z}_n - \mu_g}{\bar{Z}_n} : n \ge 1 \right\}$, and the rate function $I_g$ is good by the hypotheses of Proposition 4 (see Proposition 2 and Remark 3). Finally, we also note that inequality (9) appears in the rate function expression (10).