Nonparametric Bayesian inference for multidimensional compound Poisson processes

Given a sample from a discretely observed multidimensional compound Poisson process, we study the problem of nonparametric estimation of its jump size density $r_0$ and intensity $\lambda_0$. We take a nonparametric Bayesian approach to the problem and determine posterior contraction rates in this context, which, under some assumptions, we argue to be optimal. In particular, our results imply the existence of Bayesian point estimates that converge to the true parameter pair $(r_0,\lambda_0)$ at these rates. To the best of our knowledge, the construction of nonparametric density estimators for inference in the class of discretely observed multidimensional L\'{e}vy processes, and the study of their rates of convergence, is a new contribution to the literature.


Introduction
Let $N = (N_t)_{t \ge 0}$ be a Poisson process of constant intensity $\lambda > 0$ and let $\{Y_j\}$ be independent and identically distributed (henceforth abbreviated to i.i.d.) $\mathbb{R}^d$-valued random vectors defined on the same probability space and having a common distribution function $R$, which is assumed to be absolutely continuous with respect to the Lebesgue measure with density $r$. Assume $N$ and $\{Y_j\}$ are independent and define an $\mathbb{R}^d$-valued process $X = (X_t)_{t \ge 0}$ by
$$ X_t = \sum_{j=1}^{N_t} Y_j, $$
with the convention that an empty sum equals zero. The process $X$ is called a compound Poisson process (henceforth abbreviated to CPP) and forms a basic stochastic model in a variety of applied fields, such as e.g. risk theory and queueing; see Embrechts et al. (1997) and Prabhu (1998).
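A compound Poisson increment can be simulated directly from the definition above: draw a Poisson number of jumps and sum that many i.i.d. jump vectors. The following sketch is our own illustration (the function names are not from the paper); it generates increments over intervals of length $\Delta$, here with bivariate standard normal jumps:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_cpp_increments(n, delta, lam, jump_sampler):
    """Simulate n increments Z_i = X_{i*delta} - X_{(i-1)*delta} of a
    d-dimensional compound Poisson process with intensity lam and jumps
    drawn by jump_sampler(size) -> array of shape (size, d)."""
    # Number of jumps in an interval of length delta is Poisson(lam * delta).
    counts = rng.poisson(lam * delta, size=n)
    d = jump_sampler(1).shape[1]
    Z = np.zeros((n, d))
    for i, k in enumerate(counts):
        if k > 0:
            Z[i] = jump_sampler(k).sum(axis=0)  # sum of k i.i.d. jump vectors
    return Z

# Bivariate standard normal jumps, lam = 2, delta = 1.
jumps = lambda size: rng.standard_normal((size, 2))
Z = simulate_cpp_increments(n=5000, delta=1.0, lam=2.0, jump_sampler=jumps)
```

For centred jumps with unit coordinate variance, each coordinate of $Z_1$ has mean zero and variance $\lambda\Delta$, which provides a quick sanity check on the simulation.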
Suppose that corresponding to the true parameter pair $(\lambda_0, r_0)$, a sample $X_{\Delta}, X_{2\Delta}, \ldots, X_{n\Delta}$ from $X$ is available, where the sampling mesh $\Delta > 0$ is assumed to be fixed and thus independent of $n$. The problem we study in this note is nonparametric estimation of $r_0$ (and of $\lambda_0$). This is referred to as decompounding and is well studied for one-dimensional CPPs; see Buchmann and Grübel (2003), Buchmann and Grübel (2004), Comte et al. (2014), Duval (2013) and van Es et al. (2007). Some practical situations in which this problem may arise are listed in Duval (2013), p. 3964. However, the methods used in the above papers (with the exception of van Es et al. (2007)) do not seem to admit a generalisation to the multidimensional setup. This is also true for papers studying nonparametric inference for more general classes of Lévy processes (of which CPPs form a particular class), such as e.g. Comte and Genon-Catalot (2010), Comte and Genon-Catalot (2011) and Neumann and Reiß (2009). In fact, there is a dearth of publications dealing with nonparametric inference for multidimensional Lévy processes. An exception is Bücher and Vetter (2013), where the setup is, however, specific, in that it is geared to inference in Lévy copula models, and, unlike the present work, a high-frequency sampling scheme is assumed ($\Delta = \Delta_n \to 0$ and $n\Delta_n \to \infty$).
In this work we will establish the posterior contraction rate in a suitable metric around the true parameter pair $(\lambda_0, r_0)$. This concerns the study of asymptotic frequentist properties of Bayesian procedures, which has lately received considerable attention in the literature, see e.g. Ghosal et al. (2000) and Ghosal and van der Vaart (2001), and is useful in that it provides their justification from the frequentist point of view. Our main result says that for a $\beta$-Hölder regular density $r_0$, under some suitable additional assumptions on the model and the prior, the posterior contracts at the rate $n^{-\beta/(2\beta+d)} (\log n)^{\ell}$, which, perhaps up to a logarithmic factor, is arguably the optimal posterior contraction rate in our problem. Finally, our Bayesian procedure is adaptive: the construction of our prior does not require knowledge of the smoothness level $\beta$ in order to achieve the posterior contraction rate given above.
The proof of our main theorem employs certain results from Ghosal and van der Vaart (2001) and Shen et al. (2013), but involves a substantial number of technicalities specifically characteristic of decompounding.
We remark that a practical implementation of the Bayesian approach to decompounding lies outside the scope of the present paper. Preliminary investigations and a small-scale simulation study we performed show that it is feasible and under certain conditions leads to good results. However, the technical complications one has to deal with are quite formidable, and therefore the results of our study of the implementational aspects of decompounding will be reported elsewhere.
The rest of the paper is organised as follows: in the next section we introduce some notation and recall a number of notions useful for our purposes. Section 3 contains our main result, Theorem 2, and a brief discussion of it. The proof of Theorem 2 is given in Section 4. Finally, Section 5 contains the proof of the key technical lemma used in our proofs.

Preliminaries
Assume without loss of generality that $\Delta = 1$, and let
$$ Z_i = X_i - X_{i-1}, \qquad i = 1, \ldots, n. $$
The increments $Z_i$ are i.i.d., and each is distributed as $\sum_{j=1}^{T} Y_j$, where $\{Y_j\}$ are i.i.d. with distribution function $R_0$, while $T$, which is independent of $\{Y_j\}$, has the Poisson distribution with parameter $\lambda_0$. The problem of decompounding the jump size density $r_0$ introduced in Section 1 is equivalent to estimation of $r_0$ from the observations $\mathcal{Z}_n = \{Z_1, Z_2, \ldots, Z_n\}$, and we will henceforth concentrate on this alternative formulation. We will use the following notation: $Q_{\lambda,r}$ denotes the law of $Z_1 = X_1$ under the parameter pair $(\lambda, r)$, and $R_{\lambda,r}$ denotes the law of the path $(X_t,\, t \in [0,1])$.

2.1. Likelihood. We will first specify the dominating measure for $Q_{\lambda,r}$, which allows one to write down the likelihood in our model. Define the random measure $\mu$ by
$$ \mu(B) = \#\{ t : (t, X_t - X_{t-}) \in B \}, \qquad B \in \mathcal{B}\big( [0,1] \times (\mathbb{R}^d \setminus \{0\}) \big), $$
i.e. $\mu$ is the jump measure of $X$. Under $R_{\lambda,r}$, the random measure $\mu$ is a Poisson point process on $[0,1] \times (\mathbb{R}^d \setminus \{0\})$ with intensity measure $\Lambda(dt, dx) = \lambda\, dt\, r(x)\, dx$. Provided $\lambda, \widetilde{\lambda} > 0$ and $r, \widetilde{r} > 0$, by formula (46.1) on p. 262 in Skorohod (1964) we have
$$ \frac{dR_{\lambda,r}}{dR_{\widetilde{\lambda},\widetilde{r}}}(X) = \exp\left( \int_0^1 \int_{\mathbb{R}^d} \log \frac{\lambda\, r(x)}{\widetilde{\lambda}\, \widetilde{r}(x)}\, \mu(dt, dx) - (\lambda - \widetilde{\lambda}) \right). \tag{1} $$
The density $k_{\lambda,r}$ of $Q_{\lambda,r}$ with respect to $Q_{\widetilde{\lambda},\widetilde{r}}$ is then given by the conditional expectation
$$ k_{\lambda,r}(x) = \mathbb{E}_{\widetilde{\lambda},\widetilde{r}} \left[ \frac{dR_{\lambda,r}}{dR_{\widetilde{\lambda},\widetilde{r}}}(X) \,\Big|\, X_1 = x \right], \tag{2} $$
where the subscript in the conditional expectation operator signifies the fact that it is evaluated under $R_{\widetilde{\lambda},\widetilde{r}}$; see Theorem 2 on p. 245 in Skorohod (1964) and Corollary 2 on p. 246 there. Hence the likelihood (in the parameter pair $(\lambda, r)$) associated with the sample $\mathcal{Z}_n$ is given by
$$ L_n(\lambda, r) = \prod_{i=1}^n k_{\lambda,r}(Z_i). \tag{3} $$

2.2. Prior. We will use the product prior $\Pi = \Pi_1 \times \Pi_2$ for $(\lambda_0, r_0)$. The prior $\Pi_1$ for $\lambda_0$ will be assumed to be supported on the interval $[\underline{\lambda}, \overline{\lambda}]$ and to possess a density $\pi_1$ with respect to the Lebesgue measure.
The prior for $r_0$ will be specified as a Dirichlet process mixture of normal densities. Namely, introduce a convolution density
$$ r_{F,\Sigma}(x) = \int_{\mathbb{R}^d} \phi_{\Sigma}(x - z)\, dF(z), \tag{4} $$
where $F$ is a distribution function on $\mathbb{R}^d$, $\Sigma$ is a $d \times d$ positive definite real matrix and $\phi_{\Sigma}$ denotes the density of the centred $d$-dimensional normal distribution with covariance matrix $\Sigma$. Let $\alpha$ be a finite measure on $\mathbb{R}^d$ and let $D_{\alpha}$ denote the Dirichlet process distribution with base measure $\alpha$ (see Ferguson (1973), or alternatively Ghosal (2010) for a modern overview). Recall that if $F \sim D_{\alpha}$, then for any Borel-measurable partition $B_1, \ldots, B_k$ of $\mathbb{R}^d$ the distribution of the vector $(F(B_1), \ldots, F(B_k))$ is the $k$-dimensional Dirichlet distribution with parameters $\alpha(B_1), \ldots, \alpha(B_k)$. The Dirichlet process location mixture of normals prior $\Pi_2$ is obtained as the law of the random function $r_{F,\Sigma}$, where $F \sim D_{\alpha}$ and $\Sigma \sim G$ for some prior distribution $G$ on the set of $d \times d$ positive definite matrices. For additional information on Dirichlet process mixtures of normal densities see e.g. the original papers Ferguson (1983) and Lo (1984), or the recent paper Shen et al. (2013) and the references therein.
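For intuition, a draw of $r_{F,\Sigma}$ with $F \sim D_{\alpha}$ can be generated via Sethuraman's stick-breaking representation of the Dirichlet process, truncated at a large number of atoms. The sketch below is our own illustration, with $d = 1$ and $\Sigma$ held fixed rather than drawn from $G$; all names are ours:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_dp_mixture(alpha_mass, base_sampler, sigma2, n_atoms=500):
    """Draw a (truncated) realisation of the Dirichlet process location
    mixture r_{F,Sigma}: stick-breaking weights w_k, atom locations z_k
    from the normalised base measure, and a fixed normal kernel."""
    v = rng.beta(1.0, alpha_mass, size=n_atoms)              # stick proportions
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # weights, sum ~ 1
    z = base_sampler(n_atoms)                                # atoms ~ alpha / |alpha|
    def density(x):
        # r_{F,Sigma}(x) = sum_k w_k * phi_sigma2(x - z_k), d = 1
        return float(np.sum(w * np.exp(-(x - z) ** 2 / (2.0 * sigma2))
                            / np.sqrt(2.0 * np.pi * sigma2)))
    return density

# Standard normal base measure, fixed Sigma = 0.25, total mass alpha(R) = 1.
r = sample_dp_mixture(alpha_mass=1.0, base_sampler=rng.standard_normal, sigma2=0.25)
grid = np.linspace(-10.0, 10.0, 4001)
vals = np.array([r(x) for x in grid])
mass = float(np.sum(vals) * (grid[1] - grid[0]))  # ~ 1 up to truncation error
```

The leftover stick mass after 500 atoms is negligible, so the resulting random density integrates to one up to discretisation error.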

2.3. Posterior. Let $\mathcal{R}$ denote the class of probability densities of the form (4). By Bayes' theorem, the posterior measure of any measurable set $A \subset (0, \infty) \times \mathcal{R}$ is given by
$$ \Pi(A \mid \mathcal{Z}_n) = \frac{ \iint_{A} \prod_{i=1}^n k_{\lambda,r}(Z_i)\, d\Pi_1(\lambda)\, d\Pi_2(r) }{ \iint \prod_{i=1}^n k_{\lambda,r}(Z_i)\, d\Pi_1(\lambda)\, d\Pi_2(r) }. $$
Through the map $(\lambda, r) \mapsto k_{\lambda,r}$, the priors $\Pi_1$ and $\Pi_2$ indirectly induce a prior $\Pi = \Pi_1 \times \Pi_2$ on the collection of densities $k_{\lambda,r}$. We take the liberty of using the symbol $\Pi$ to signify both the prior on $(\lambda_0, r_0)$ and the prior on the density $k_{\lambda_0,r_0}$. The posterior in the first case will be understood as the posterior for the pair $(\lambda_0, r_0)$, while in the second case as the posterior for the density $k_{\lambda_0,r_0}$. Thus, setting $\overline{A} = \{ k_{\lambda,r} : (\lambda, r) \in A \}$, we have $\Pi(\overline{A} \mid \mathcal{Z}_n) = \Pi(A \mid \mathcal{Z}_n)$. In the Bayesian paradigm the posterior encapsulates all the inferential conclusions for the problem at hand. Once the posterior is available, one can next proceed with the computation of other quantities of interest in Bayesian statistics, such as Bayes point estimates or credible sets.
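As a toy numerical illustration of Bayes' theorem in this setting (and not of the nonparametric procedure studied in the paper), consider the degenerate CPP whose jumps all equal $1$: then $Z_i \sim \mathrm{Poisson}(\lambda)$, the density $k_{\lambda,r}$ reduces to the Poisson probability mass function, and the posterior for $\lambda$ can be evaluated on a grid. All variable names below are ours:

```python
import numpy as np

rng = np.random.default_rng(3)

# Degenerate CPP with unit jumps: Z_i ~ Poisson(lam0), so the likelihood
# k_lam(z) is the Poisson pmf and Bayes' theorem is a ratio of sums.
lam0 = 2.5
Z = rng.poisson(lam0, size=200)

lam_grid = np.linspace(0.1, 10.0, 1000)  # grid over the support of the prior
log_prior = np.zeros_like(lam_grid)      # uniform prior density on the grid
# log-likelihood sum_i log k_lam(Z_i), dropping the lam-free term -log(z_i!)
log_lik = Z.sum() * np.log(lam_grid) - len(Z) * lam_grid
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()                       # normalised posterior on the grid

post_mean = float(np.sum(lam_grid * post))  # a Bayes point estimate of lam0
```

With 200 observations the posterior concentrates near the true intensity, a one-dimensional shadow of the contraction phenomenon studied below.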
2.4. Distances. The Hellinger distance $h(Q_0, Q_1)$ between two probability laws $Q_0$ and $Q_1$ on a measurable space $(\Omega, \mathcal{F})$ is given by
$$ h(Q_0, Q_1) = \left( \int \left( \sqrt{ dQ_0/d\mu } - \sqrt{ dQ_1/d\mu } \right)^2 d\mu \right)^{1/2}, $$
where $\mu$ is any $\sigma$-finite measure dominating both $Q_0$ and $Q_1$. With
$$ K(Q_0, Q_1) = \int \log \frac{dQ_0}{dQ_1}\, dQ_0 - Q_0(\Omega) + Q_1(\Omega) $$
denoting the (generalised) Kullback–Leibler divergence, in which the last two terms cancel for probability laws, we also define the V-discrepancy by
$$ V(Q_0, Q_1) = \int \log^2 \left( \frac{dQ_0}{dQ_1} \right) dQ_0. $$
In addition to this, for positive real numbers $x$ and $y$ we put
$$ h(x, y) = |\sqrt{x} - \sqrt{y}|, \qquad K(x, y) = x \log \frac{x}{y} - x + y, \qquad V(x, y) = x \log^2 \frac{x}{y}. $$
Using the same symbols $K$, $V$ and $h$ is justified as follows. Suppose $\Omega$ is a singleton $\{\omega\}$ and consider the Dirac measures $\delta_x$ and $\delta_y$ that put mass $x$ and $y$, respectively, on $\Omega$. Then $K(\delta_x, \delta_y) = K(x, y)$, and similar equalities are valid for the V-discrepancy and the Hellinger distance.
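The Dirac-measure identities above can be checked numerically. The sketch below is our own; it assumes the generalised finite-measure forms $h(x,y) = |\sqrt{x} - \sqrt{y}|$, $K(x,y) = x\log(x/y) - x + y$ and $V(x,y) = x\log^2(x/y)$, and treats finite measures on a finite space as mass vectors:

```python
import numpy as np

def hellinger(p, q):
    # h(P, Q) for finite measures given by mass vectors p, q on a finite space
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def kl(p, q):
    # generalised Kullback-Leibler divergence for finite (not necessarily
    # probability) measures: sum p log(p/q) - sum p + sum q
    return float(np.sum(p * np.log(p / q)) - p.sum() + q.sum())

def vdisc(p, q):
    # V-discrepancy: sum p log^2(p/q)
    return float(np.sum(p * np.log(p / q) ** 2))

# Dirac measures delta_x, delta_y on a singleton: mass vectors of length 1.
x, y = 2.0, 3.5
dx, dy = np.array([x]), np.array([y])
assert np.isclose(hellinger(dx, dy), abs(np.sqrt(x) - np.sqrt(y)))
assert np.isclose(kl(dx, dy), x * np.log(x / y) - x + y)
assert np.isclose(vdisc(dx, dy), x * np.log(x / y) ** 2)
```

The scalar quantities are thus literally the measure-level discrepancies evaluated on single-atom measures, which is what licenses the shared notation.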
2.5. Class of locally β-Hölder functions. For any $\beta \in \mathbb{R}$, $\lfloor \beta \rfloor$ will denote the largest integer strictly smaller than $\beta$, while $\mathbb{N}_0$ will stand for the union $\mathbb{N} \cup \{0\}$, for $\mathbb{N}$ the set of natural numbers. For a multi-index $k = (k_1, \ldots, k_d) \in \mathbb{N}_0^d$, set $k_{\cdot} = k_1 + \cdots + k_d$ and let $D^k$ denote the partial derivative operator $\partial^{k_{\cdot}} / \partial x_1^{k_1} \cdots \partial x_d^{k_d}$. The usual Euclidean norm of a vector $y \in \mathbb{R}^d$ will be denoted by $\|y\|$. Let $\beta > 0$, $\tau_0 \ge 0$ be constants and let $L : \mathbb{R}^d \to \mathbb{R}_+$ be a measurable function. We define the class $\mathcal{C}^{\beta, L, \tau_0}(\mathbb{R}^d)$ of locally β-Hölder regular functions as the set of all those functions $r : \mathbb{R}^d \to \mathbb{R}$, such that all mixed partial derivatives $D^k r$ of $r$ up to order $k_{\cdot} \le \lfloor \beta \rfloor$ exist and for every $k$ with $k_{\cdot} = \lfloor \beta \rfloor$ verify
$$ | D^k r(x + y) - D^k r(x) | \le L(x) \exp( \tau_0 \|y\|^2 )\, \|y\|^{\beta - \lfloor \beta \rfloor}, \qquad x, y \in \mathbb{R}^d. $$
See p. 625 in Shen et al. (2013) for this class of functions.

Main result
Define the complements of the Hellinger-type neighbourhoods of $(\lambda_0, r_0)$ by
$$ A(\varepsilon_n, M) = \{ (\lambda, r) : h( Q_{\lambda_0, r_0}, Q_{\lambda, r} ) > M \varepsilon_n \}, $$
where $\{\varepsilon_n\}$ is a sequence of positive numbers. We say that $\varepsilon_n$ is a posterior contraction rate, if there exists a constant $M > 0$, such that $\Pi( A(\varepsilon_n, M) \mid \mathcal{Z}_n ) \to 0$ as $n \to \infty$, in $Q^n_{\lambda_0, r_0}$-probability. The ε-covering number $N(\varepsilon, B, \rho)$ of a subset $B$ of a metric space equipped with the metric $\rho$ is the minimum number of balls of radius ε needed to cover it. Let $\mathcal{Q}$ be a set of CPP laws $Q_{\lambda, r}$. Furthermore, we set
$$ B( \varepsilon, Q_{\lambda_0, r_0} ) = \{ (\lambda, r) : K( Q_{\lambda_0, r_0}, Q_{\lambda, r} ) \le \varepsilon^2,\ V( Q_{\lambda_0, r_0}, Q_{\lambda, r} ) \le \varepsilon^2 \}. \tag{5} $$
We recall the following general result on posterior contraction rates.
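For a concrete instance of the ε-covering number, an ε-ball in $([a,b], |\cdot|)$ is an interval of length $2\varepsilon$, so $N(\varepsilon, [a,b], |\cdot|) = \lceil (b-a)/(2\varepsilon) \rceil$; this is the computation behind the merely logarithmic entropy contribution of the interval $[\underline{\lambda}, \overline{\lambda}]$ in Section 4. A sketch of our own:

```python
import math

def covering_number_interval(a, b, eps):
    """Minimum number of eps-balls (intervals of length 2*eps) needed to
    cover [a, b] in the Euclidean metric."""
    return max(1, math.ceil((b - a) / (2.0 * eps)))

# Covering [0, 1] with eps = 0.1 takes ceil(1 / 0.2) = 5 balls.
assert covering_number_interval(0.0, 1.0, 0.1) == 5
# N grows like (b - a) / (2 * eps), so log N = O(log(1 / eps)): the
# intensity interval contributes only a log term to the entropy bound.
assert covering_number_interval(0.0, 1.0, 0.001) == 500
```

By contrast, the covering numbers of the density sieves grow exponentially in $n\varepsilon_n^2$, and it is those that drive condition (6) of the theorem below.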
Theorem 1 (Ghosal and van der Vaart (2001)). Suppose that for positive sequences $\overline{\varepsilon}_n, \widetilde{\varepsilon}_n \to 0$ such that $n \min( \overline{\varepsilon}_n^2, \widetilde{\varepsilon}_n^2 ) \to \infty$, constants $c_1, c_2, c_3, c_4 > 0$ and sets $\mathcal{Q}_n \subset \mathcal{Q}$, we have
$$ \log N( \overline{\varepsilon}_n, \mathcal{Q}_n, h ) \le c_1 n \overline{\varepsilon}_n^2, \tag{6} $$
$$ \Pi( \mathcal{Q} \setminus \mathcal{Q}_n ) \le c_3 \exp( -(c_2 + 4)\, n \widetilde{\varepsilon}_n^2 ), \tag{7} $$
$$ \Pi( B( \widetilde{\varepsilon}_n, Q_{\lambda_0, r_0} ) ) \ge c_4 \exp( -c_2 n \widetilde{\varepsilon}_n^2 ). \tag{8} $$
Then, for $\varepsilon_n = \max( \overline{\varepsilon}_n, \widetilde{\varepsilon}_n )$ and a large enough constant $M > 0$, we have that
$$ \Pi( A( \varepsilon_n, M ) \mid \mathcal{Z}_n ) \to 0 $$
as $n \to \infty$, in $Q^n_{\lambda_0, r_0}$-probability, assuming the i.i.d. observations $\{Z_j\}$ have been generated according to $Q_{\lambda_0, r_0}$.
In order to derive the posterior contraction rate in our problem, we impose the following conditions on the true parameter pair $(\lambda_0, r_0)$.

Assumption 1. Denote the true parameter values for the compound Poisson process by $(\lambda_0, r_0)$.
(i) The true intensity $\lambda_0$ belongs to a compact interval $[\underline{\lambda}, \overline{\lambda}] \subset (0, \infty)$.
(ii) The true density $r_0$ is bounded, belongs to the set $\mathcal{C}^{\beta, L, \tau_0}(\mathbb{R}^d)$ and additionally verifies
$$ \int \left( \frac{L(x)}{r_0(x)} \right)^{(2\beta + \epsilon)/\beta} r_0(x)\, dx < \infty, \qquad \int \left( \frac{|D^k r_0(x)|}{r_0(x)} \right)^{(2\beta + \epsilon)/\beta} r_0(x)\, dx < \infty, $$
for some $\epsilon > 0$ and all $k$ with $k_{\cdot} \le \lfloor \beta \rfloor$. Furthermore, we assume that there exist strictly positive constants $a$, $b$, $c$ and $\tau$, such that
$$ r_0(x) \le c \exp( -b \|x\|^{\tau} ), \qquad \|x\| \ge a. $$
The conditions on $r_0$ come from Theorem 1 in Shen et al. (2013) and are quite reasonable. They simplify greatly when $r_0$ has compact support.
We also need to make some assumptions on the prior Π defined in Section 2.2.
Assumption 2. The prior $\Pi = \Pi_1 \times \Pi_2$ on $(\lambda_0, r_0)$ satisfies the following conditions:
(i) The prior on λ, $\Pi_1$, has a density $\pi_1$ (with respect to the Lebesgue measure) that is supported on the finite interval $[\underline{\lambda}, \overline{\lambda}] \subset (0, \infty)$ and is such that
$$ 0 < \underline{\pi}_1 \le \pi_1(\lambda) \le \overline{\pi}_1 < \infty, \qquad \lambda \in [\underline{\lambda}, \overline{\lambda}], \tag{10} $$
for some constants $\underline{\pi}_1$ and $\overline{\pi}_1$;
(ii) The base measure $\alpha$ of the Dirichlet process prior $D_{\alpha}$ is finite and possesses a strictly positive density on $\mathbb{R}^d$, such that its rescaled version $\widetilde{\alpha} = \alpha / \alpha(\mathbb{R}^d)$ satisfies
$$ 1 - \widetilde{\alpha}\big( [-x, x]^d \big) \le b_1 \exp( -C_1 x^{a_1} ) $$
for all sufficiently large $x > 0$ and some strictly positive constants $a_1$, $b_1$ and $C_1$;
(iii) The prior $G$ on the set of $d \times d$ positive definite matrices is such that, for a constant $\kappa > 0$ and further strictly positive constants,
$$ G\big( \Sigma : \operatorname{eig}_d(\Sigma^{-1}) \ge x \big) \le b_2 \exp( -C_2 x^{a_2} ) $$
for all sufficiently large $x > 0$,
$$ G\big( \Sigma : \operatorname{eig}_1(\Sigma^{-1}) < x \big) \le b_3 x^{a_3} $$
for all small enough $x > 0$, and for any $0 < s_1 \le \cdots \le s_d$ and $t \in (0, 1)$,
$$ G\big( \Sigma : s_j < \operatorname{eig}_j(\Sigma^{-1}) < s_j (1 + t),\ j = 1, \ldots, d \big) \ge b_4\, s_d^{a_4}\, t^{a_5} \exp( -C_3 s_d^{\kappa/2} ). $$
Here $\operatorname{eig}_j(\Sigma^{-1})$ denotes the $j$th smallest eigenvalue of the matrix $\Sigma^{-1}$.
This assumption comes from Shen et al. (2013), see p. 626 there, to which we refer for additional discussion. In particular, it is shown there that an inverse Wishart distribution (a popular prior distribution for covariance matrices) satisfies the assumptions on $G$ with $\kappa = 2$. As far as $\alpha$ is concerned, we can take it such that its rescaled version $\widetilde{\alpha}$ is a non-degenerate Gaussian distribution on $\mathbb{R}^d$.
Remark 1. The assumption (10), requiring that the prior density $\pi_1$ be bounded away from zero on the interval $[\underline{\lambda}, \overline{\lambda}]$, can be relaxed to allow it to take the value zero at the end points of this interval, provided $\lambda_0$ is an interior point of $[\underline{\lambda}, \overline{\lambda}]$.
We now state our main result.
Theorem 2. Let Assumptions 1 and 2 hold. Then there exist constants $\gamma \in (0, \frac{1}{2})$, determined by $\beta$, $d$ and $\kappa$, $\ell_0 > 0$ and $M > 0$, such that, as $n \to \infty$,
$$ \Pi( A( \overline{\varepsilon}_n, M ) \mid \mathcal{Z}_n ) \to 0 $$
in $Q^n_{\lambda_0, r_0}$-probability, where $\overline{\varepsilon}_n = n^{-\gamma} (\log n)^{\ell}$ for any $\ell > \ell_0$.

We conclude this section with a brief discussion of the obtained result. The logarithmic factor $(\log n)^{\ell}$ is negligible for practical purposes. If $\kappa = 1$, the posterior contraction rate obtained in Theorem 2 is essentially $n^{-\beta/(2\beta+d)}$, which is the minimax estimation rate in a number of nonparametric settings. This is arguably the minimax estimation rate in our problem as well (cf. Theorem 2.1 in Gugushvili (2008) for a related result in the one-dimensional setting), although here we do not give a formal argument. Equally important is the fact that our result is adaptive: the posterior contraction rate in Theorem 2 is attained without the knowledge of the smoothness level β being incorporated in the construction of our prior Π. Finally, Theorem 2 in combination with Theorem 2.5 and the arguments on pp. 506–507 in Ghosal et al. (2000) implies the existence of Bayesian point estimates achieving (in the frequentist sense) this convergence rate.
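The dependence of the rate on the dimension can be made concrete. The helper below is our own illustration; it evaluates $n^{-\beta/(2\beta+d)} (\log n)^{\ell}$ and shows how the exponent deteriorates as $d$ grows:

```python
import math

def contraction_rate(n, beta, d, ell=0.0):
    """The rate n^(-beta/(2*beta + d)) * (log n)^ell discussed above."""
    return n ** (-beta / (2.0 * beta + d)) * math.log(n) ** ell

# For beta = 2 the exponent is 2/5 in dimension 1 but only 2/9 in
# dimension 5: higher-dimensional jump densities are estimated at a
# slower rate unless additional smoothness is assumed.
rate_d1 = contraction_rate(1e6, beta=2.0, d=1)
rate_d5 = contraction_rate(1e6, beta=2.0, d=5)
```

This is the usual curse of dimensionality for nonparametric density estimation, inherited here by the decompounding problem.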
Remark 2. After completion of this work we learned of the paper Donnet et al. (2014), which deals with nonparametric Bayesian estimation of intensity functions for Aalen counting processes. Although CPPs are in some respects similar to the latter class of processes, they are not counting processes. An essential difference between our work and Donnet et al. (2014) lies in the fact that, unlike Donnet et al. (2014), ours deals with discretely observed multidimensional processes. Also, Donnet et al. (2014) use the log-spline prior, or the Dirichlet mixture of uniform densities, and not the Dirichlet mixture of normal densities, as the prior.

Proof of Theorem 2
The proof of Theorem 2 consists in the verification of the conditions of Theorem 1. The following lemma plays the key role to that end.
Lemma 1. The following estimates are valid: the discrepancies $K(Q_{\lambda_0, r_0}, Q_{\lambda, r})$ and $V(Q_{\lambda_0, r_0}, Q_{\lambda, r})$ admit upper bounds, (11) and (12), in terms of the corresponding discrepancies between the intensities $\lambda_0$ and $\lambda$ and between the jump size laws $P_{r_0}$ and $P_r$. Moreover, there exists a constant $C \in (0, \infty)$ depending on $\underline{\lambda}$ and $\overline{\lambda}$ only, such that for all $\lambda_0, \lambda \in [\underline{\lambda}, \overline{\lambda}]$ the analogous bounds (14) and (15) hold with the constant $C$; in particular, the Hellinger distance $h(Q_{\lambda_0, r_0}, Q_{\lambda, r})$ is bounded by a multiple of the sum of the scalar Hellinger distance between the intensities and the Hellinger distance between $P_{r_0}$ and $P_r$.

The proof of the lemma is given in Section 5. We proceed with the proof of Theorem 2.
Let $\varepsilon_n = n^{-\gamma} (\log n)^{\ell}$ for γ and $\ell > \ell_0$ as in the statement of Theorem 2. Set $\overline{\varepsilon}_n = 2 C \varepsilon_n$, where $C$ is the constant from Lemma 1. We define the sieves of densities $\mathcal{F}_n$ as in Theorem 5 in Shen et al. (2013), with the sequence there corresponding to $\varepsilon_n$ and with $a_1$ and $a_2$ as in Assumption 2. We also put
$$ \mathcal{Q}_n = \{ Q_{\lambda, r} : \lambda \in [\underline{\lambda}, \overline{\lambda}],\ r \in \mathcal{F}_n \}. \tag{17} $$
In Shen et al. (2013) sieves of the type $\mathcal{F}_n$ are used to verify the conditions of Theorem 1 and to determine posterior contraction rates in the standard density estimation context. Below we will show that these sieves also work in the case of decompounding, in that we will verify the conditions of Theorem 1 for the sieves $\mathcal{Q}_n$ defined in (17).

4.1. Verification of (6). Introduce the notation $h_1(\lambda, \lambda') = |\sqrt{\lambda} - \sqrt{\lambda'}|$, the scalar Hellinger distance from Section 2.4. Let $\{\lambda_i\}$ be the centres of the balls from a minimal covering of $[\underline{\lambda}, \overline{\lambda}]$ with $h_1$-intervals of size $C \varepsilon_n$, and let $\{r_j\}$ be the centres of the balls from a minimal covering of $\mathcal{F}_n$ with Hellinger balls of radius $\varepsilon_n$. By Lemma 1, for every pair $(\lambda, r)$ with $\lambda \in [\underline{\lambda}, \overline{\lambda}]$ and $r \in \mathcal{F}_n$ we have $h( Q_{\lambda, r}, Q_{\lambda_i, r_j} ) \le 2 C \varepsilon_n = \overline{\varepsilon}_n$ by appropriate choices of $i$ and $j$. Hence
$$ N( \overline{\varepsilon}_n, \mathcal{Q}_n, h ) \le N( C \varepsilon_n, [\underline{\lambda}, \overline{\lambda}], h_1 ) \times N( \varepsilon_n, \mathcal{F}_n, h ). $$
By Proposition 2 and Theorem 5 in Shen et al. (2013), there exists a constant $c_1 > 0$, such that for all $n$ large enough
$$ \log N( \varepsilon_n, \mathcal{F}_n, h ) \le c_1 n \varepsilon_n^2. $$
On the other hand, $\log N( C \varepsilon_n, [\underline{\lambda}, \overline{\lambda}], h_1 ) = O( \log (1/\varepsilon_n) )$. With our choice of $\varepsilon_n$, for all $n$ large enough it holds that
$$ \log N( \overline{\varepsilon}_n, \mathcal{Q}_n, h ) \le 2 c_1 n \varepsilon_n^2 = \frac{c_1}{2 C^2}\, n \overline{\varepsilon}_n^2. $$
We can simply rename the constant $c_1 / (2 C^2)$ in the above formula into $c_1$, and thus (6) is satisfied with that constant.
4.2. Verification of (7) and (8). We first focus on (8). From (14) we obtain an upper bound on $K( Q_{\lambda_0, r_0}, Q_{\lambda, r} )$, and, using (15), an analogous upper bound on $V( Q_{\lambda_0, r_0}, Q_{\lambda, r} )$. Combination of the above inequalities with the definition of the set $B( \varepsilon, Q_{\lambda_0, r_0} )$ in (5), together with the prior mass estimates of Shen et al. (2013), yields that for some $A, C > 0$ and all sufficiently large $n$ a lower bound of exponential type on $\Pi( B( \cdot, Q_{\lambda_0, r_0} ) )$ holds. We substitute ε with $\sqrt{A}\, n^{-\gamma} (\log n)^{\ell_0}$ and write $\widetilde{\varepsilon}_n = \sqrt{3AC}\, n^{-\gamma} (\log n)^{\ell_0}$, so that $n \widetilde{\varepsilon}_n^2 = 3AC\, n^{1-2\gamma} (\log n)^{2\ell_0}$. Now, for all $n$ large enough, as $\gamma < \frac{1}{2}$, the lower-order factors can be absorbed, and consequently
$$ \Pi( B( \widetilde{\varepsilon}_n, Q_{\lambda_0, r_0} ) ) \ge \exp\big( -(C + 1)\, n^{1 - 2\gamma} (\log n)^{2 \ell_0} \big) = \exp( -c_2\, n \widetilde{\varepsilon}_n^2 ), $$
where $c_2 = \frac{C+1}{3AC}$. We have thus verified (8) (with $c_4 = 1$).

For the verification of (7) we use the constants $c_2$ and $\widetilde{\varepsilon}_n$ as above. Note first that $\Pi( \mathcal{Q} \setminus \mathcal{Q}_n ) = \Pi_2( \mathcal{F}_n^c )$. By Theorem 5 in Shen et al. (2013), cf. also p. 627 there, for some $c_3 > 0$ and any constant $c > 0$, it holds that
$$ \Pi_2( \mathcal{F}_n^c ) \le c_3 \exp\big( -(c + 4)\, n^{1 - 2\gamma} (\log n)^{2 \ell_0} \big). $$
Without loss of generality, we can take the positive constant $c$ in the above display bigger than $3AC (c_2 + 4) - 4$. This gives us
$$ \Pi( \mathcal{Q} \setminus \mathcal{Q}_n ) \le c_3 \exp( -(c_2 + 4)\, n \widetilde{\varepsilon}_n^2 ), $$
i.e. (7). We have thus verified conditions (6)–(8), and the statement of Theorem 2 follows by Theorem 1, since $\overline{\varepsilon}_n \ge \widetilde{\varepsilon}_n$ (eventually).

Proof of Lemma 1
We start with a lemma from Csiszár (1963), which will be used three times in the proof of Lemma 1. Consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Let $\mathbb{P}_0$ be a probability measure on $(\Omega, \mathcal{F})$ and assume $\mathbb{P}_0 \ll \mathbb{P}$ with Radon–Nikodym derivative $\zeta = \frac{d\mathbb{P}_0}{d\mathbb{P}}$. Furthermore, let $\mathcal{G}$ be a sub-σ-algebra of $\mathcal{F}$. The restrictions of $\mathbb{P}$ and $\mathbb{P}_0$ to $\mathcal{G}$ are denoted by $\mathbb{P}'$ and $\mathbb{P}'_0$, respectively.

Lemma 2 (Csiszár (1963)). In the setting above, $\mathbb{P}'_0 \ll \mathbb{P}'$ and, for every convex function $g$,
$$ \int_{\Omega} g\left( \frac{d\mathbb{P}'_0}{d\mathbb{P}'} \right) d\mathbb{P}' \le \int_{\Omega} g( \zeta )\, d\mathbb{P}. $$
The proof consists in an application of Jensen's inequality for conditional expectations. This lemma is typically used as follows. The measures $\mathbb{P}$ and $\mathbb{P}_0$ are possible distributions of some random element $X$. If $X' = T(X)$ is some measurable transformation of $X$, then we consider $\mathbb{P}'$ and $\mathbb{P}'_0$ as the corresponding distributions of $X'$; here $T$ could be a projection. In the present context we take $X = (X_t,\, t \in [0, 1])$ and $X' = X_1$, and so $\mathbb{P}$ in the lemma should be taken to be $R = R_{\lambda, r}$ and $\mathbb{P}'$ to be $Q = Q_{\lambda, r}$.
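With $g(x) = x \log x$, the lemma is the statement that the Kullback–Leibler divergence can only decrease under restriction to a sub-σ-algebra (a data-processing inequality). The sketch below is our own check of this on a finite product space, with the sub-σ-algebra generated by the first coordinate, i.e. with $T$ the projection onto it:

```python
import numpy as np

rng = np.random.default_rng(4)

def kl(p, q):
    # Kullback-Leibler divergence between strictly positive mass vectors
    return float(np.sum(p * np.log(p / q)))

# Two strictly positive joint distributions P0, P on a 4 x 3 product space;
# the roles match P_0 << P in the lemma.
P0 = rng.random((4, 3)) + 0.1
P = rng.random((4, 3)) + 0.1
P0 /= P0.sum()
P /= P.sum()

# Restricting to the sub-sigma-algebra generated by the first coordinate
# amounts to replacing the joint laws by their first marginals.
P0_marg = P0.sum(axis=1)
P_marg = P.sum(axis=1)

# Lemma 2 with g(x) = x log x: K(P0', P') <= K(P0, P).
assert kl(P0_marg, P_marg) <= kl(P0.ravel(), P.ravel())
```

This is exactly the mechanism used below to bound $K(Q_0, Q)$ by $K(R_0, R)$: the one-dimensional marginal $X_1$ carries less discriminating information than the whole path $(X_t,\, t \in [0,1])$.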
In the proof of Lemma 1, for economy of notation, a constant $c(\underline{\lambda}, \overline{\lambda})$ depending on $\underline{\lambda}$ and $\overline{\lambda}$ only may differ from line to line. We also abbreviate $Q_{\lambda_0, r_0}$ and $Q_{\lambda, r}$ to $Q_0$ and $Q$, respectively. The same convention will be used for $R_{\lambda_0, r_0}$, $R_{\lambda, r}$, $P_{r_0}$ and $P_r$.
Proof of inequalities (11) and (14). Application of Lemma 2 with $g(x) = (x \log x) 1_{\{x \ge 0\}}$ gives $K(Q_0, Q) \le K(R_0, R)$. Using (1) and the expression for the mean of a stochastic integral with respect to a Poisson point process, see e.g. property 6 on p. 68 in Skorohod (1964), we can evaluate $K(R_0, R)$ explicitly and bound it as required, with $c(\underline{\lambda}, \overline{\lambda})$ some constant depending on $\underline{\lambda}$ and $\overline{\lambda}$ only. The result follows.
Proof of inequalities (12) and (15). We have
$$ V(Q_0, Q) = \int_{ \{ dQ_0/dQ \ge 1 \} } \log^2 \left( \frac{dQ_0}{dQ} \right) dQ_0 + \int_{ \{ dQ_0/dQ < 1 \} } \log^2 \left( \frac{dQ_0}{dQ} \right) dQ_0 = \mathrm{I} + \mathrm{II}, $$
with an obvious definition of $\mathrm{I}$ and $\mathrm{II}$. Application of Lemma 2 with $g(x) = (x \log^2 x) 1_{\{x \ge 1\}}$ (which is a convex function) gives an upper bound on $\mathrm{I}$. As far as $\mathrm{II}$ is concerned, for $x \ge 0$ we use two elementary inequalities: the first is trivial, while the second is a special case of inequality (8.5) in Ghosal et al. (2000), and is equally elementary. The two inequalities together yield an upper bound on $x^2$. Applying this bound with $x = -\log \frac{dQ_0}{dQ}$ (which is positive on the event $\{ \frac{dQ_0}{dQ} < 1 \}$) and taking the expectation with respect to $Q$ bounds $\mathrm{II}$ by a multiple of the squared Hellinger distance $h^2(Q_0, Q)$. For the final inequality, see Pollard (2002), p. 62, formula (12).
Combining the estimates on $\mathrm{I}$ and $\mathrm{II}$, we obtain an upper bound on $V(Q_0, Q)$. After some long and tedious calculations employing (1) and the expressions for the mean and the variance of a stochastic integral with respect to a Poisson point process, see e.g. property 6 on p. 68 in Skorohod (1964), we arrive at explicit bounds, with some constant $c(\underline{\lambda}, \overline{\lambda})$ depending on $\underline{\lambda}$ and $\overline{\lambda}$ only. Combining the estimates (22) and (24) on $\mathrm{III}$ and $\mathrm{IV}$ with the inequalities (21) and (11) yields (12). Similarly, the upper bounds (23) and (25) combined with (21) and (11) yield (15).