A Quantitative Functional Central Limit Theorem for Shallow Neural Networks

We prove a Quantitative Functional Central Limit Theorem for one-hidden-layer neural networks with generic activation function. The rates of convergence that we establish depend heavily on the smoothness of the activation function, and they range from logarithmic, in non-differentiable cases such as the ReLU, to the order of $n^{-1/2}$ for very regular activations. Our main tools are functional versions of the Stein-Malliavin approach; in particular, we exploit heavily a quantitative functional central limit theorem which has been recently established by Bourguin and Campese (2020).


Introduction and background
In this paper, we shall be concerned with one-hidden-layer neural networks with Gaussian initialization, that is, random fields $F : S^{d-1} \to \mathbb{R}$ of the form
$$F(x) = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} V_j\, \sigma(\langle W_j, x\rangle),$$
where $V_j \in \mathbb{R}$, $W_j \in \mathbb{R}^{1\times d}$ are, respectively, random variables and random vectors whose entries are independent Gaussian with zero mean and unit variance, $\mathbb{E}[V_j^2] = \mathbb{E}[W_{j\ell}^2] = 1$, $j = 1, \dots, n$, $\ell = 1, \dots, d$. Here, $\sigma : \mathbb{R} \to \mathbb{R}$ is an activation function whose properties and form we will discuss below.
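To fix ideas, here is a minimal numerical sketch (ours, not from the paper) of one draw of the field, assuming the usual $1/\sqrt{n}$ scaling; the function names and the ReLU default are our own choices:

```python
import numpy as np

# Minimal sketch (ours) of one draw of the random field F on S^(d-1),
# assuming the 1/sqrt(n) CLT scaling; names and the ReLU default are ours.
rng = np.random.default_rng(0)

def sample_field(xs, n=10_000, sigma=lambda t: np.maximum(t, 0.0)):
    """One draw of F(x) = n^(-1/2) * sum_j V_j sigma(<W_j, x>) at the rows of xs."""
    V = rng.standard_normal(n)                 # outer-layer weights, N(0,1)
    W = rng.standard_normal((n, xs.shape[1]))  # inner-layer weights, N(0,1) entries
    return (sigma(xs @ W.T) @ V) / np.sqrt(n)

# Evaluate at five random points of S^2 (d = 3): draw Gaussians and normalize.
xs = rng.standard_normal((5, 3))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
print(sample_field(xs))
```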
The random field $F$ is defined on the unit sphere $S^{d-1}$, has zero mean, and its covariance function is given, for any pair $x_1, x_2 \in S^{d-1}$, by
$$S(x_1, x_2) := \mathbb{E}[F(x_1)F(x_2)] = \mathbb{E}[\sigma(\langle W, x_1\rangle)\,\sigma(\langle W, x_2\rangle)].$$
The function $S(\cdot, \cdot)$ is actually isotropic, and we can write $S(x_1, x_2) = \Gamma(\langle x_1, x_2\rangle)$ for some function $\Gamma : [-1, 1] \to \mathbb{R}$, where $\langle x_1, x_2\rangle$ denotes the standard Euclidean inner product. For the sake of brevity and simplicity, in this paper we restrict our attention to univariate neural networks; the extension to the multivariate case can be obtained along similar lines, up to a normalizing constant depending on the output dimension. Our aim in this paper is to establish a quantitative functional central limit theorem, that is, to study the distance, in a suitable probability metric, between the random field $F$ and a Gaussian random field on the sphere $S^{d-1}$ with mean zero and covariance function $S$.

The distribution of neural networks under random initialization is a classical topic in learning theory, the first result going back to the groundbreaking work [19], where it was proved that a central limit theorem holds as the width of the network diverges to infinity. Much more recently, a few authors have investigated the speed of convergence to the Gaussian limit distribution. In this respect, some influential papers [12, 23, 24] have studied the behaviour of higher-order cumulants, covering also deep neural networks. Quantitative central limit theorems in suitable probability metrics have been considered very recently in [3] and [4]. The former authors proved a finite-dimensional quantitative central limit theorem for neural networks of finite depth whose activation functions satisfy a Lipschitz condition; in [4], the authors proved second-order Poincaré inequalities (which imply one-dimensional quantitative central limit theorems) for neural networks with $C^2$ activation functions.
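As a sanity check of isotropy, one can verify by Monte Carlo that the empirical covariance depends on the points only through their inner product; the sketch below (our illustration, with the ReLU) compares two pairs of points sharing the same value of $\langle x_1, x_2\rangle$:

```python
import numpy as np

# Monte Carlo check (ours) that S(x1, x2) = E[sigma(<W,x1>) sigma(<W,x2>)]
# depends on (x1, x2) only through <x1, x2>, i.e. S(x1, x2) = Gamma(<x1, x2>).
rng = np.random.default_rng(1)
relu = lambda t: np.maximum(t, 0.0)

def S_hat(x1, x2, m=1_000_000):
    W = rng.standard_normal((m, x1.size))
    return np.mean(relu(W @ x1) * relu(W @ x2))

d, u = 4, 0.3
e = np.eye(d)
pair_a = (e[0], u * e[0] + np.sqrt(1 - u**2) * e[1])   # <x1, x2> = 0.3
pair_b = (e[2], u * e[2] + np.sqrt(1 - u**2) * e[3])   # different pair, same 0.3
print(S_hat(*pair_a), S_hat(*pair_b))                  # approximately equal
```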
Understanding the Gaussian behaviour of a neural network allows one, for instance, to investigate the geometry of its landscape, e.g. the cardinality of its minima, the number of nodal components and many other quantities of interest. However, convergence of the finite-dimensional distributions is in general not sufficient to constrain such landscapes. For this reason, functional results, that is, bounds on the speed of convergence in functional spaces, are also of great interest. So far, the literature on quantitative functional central limit theorems is still limited: [10] and [15] have focused on one-hidden-layer networks, where the random coefficients in the inner layer are Gaussian for [10] and uniform on the sphere for [15], whereas the coefficients in the outer layer follow a Rademacher distribution for both. In particular, the authors in [10] manage to establish rates of convergence in Wasserstein distance which are (powers of) logarithmic for the ReLU and other activation functions, and algebraic for polynomial or very smooth activations; see below for more details. On the other hand, the rates in [15] for ReLU networks are of the form $O(n^{-\frac{1}{2d-1}})$; this is algebraic for fixed values of $d$, but it can actually converge to zero more slowly than the inverse of a logarithm if $d$ is of the same order as $n$, as is the case in many applications.

Purpose and plan of the paper
We consider in this work functional quantitative central limit theorems under general activations and with Gaussian coefficients in both layers, which seems the most relevant case for applications; our approach is largely based upon very recent results by [6] on Stein-Malliavin techniques for random elements taking values in Hilbert spaces (we refer to [20, 21] for the general foundations of this approach, together with [16, 5, 1, 9] for some more recent references). Our main results are collected in Section 2, whereas their proofs, together with a few technical lemmas, are given in Section 4. A short comparison with the existing literature is provided in Section 3. Appendix A is mainly devoted to background results which we exploit heavily throughout the paper.
For $\sigma \in L^2(\mathbb{R}, \gamma)$, with $\gamma$ the standard Gaussian measure, the Hermite expansion holds in the $L^2$ sense (see e.g. [21]):
$$\sigma(x) = \sum_{q=0}^{\infty} J_q(\sigma)\, \frac{H_q(x)}{\sqrt{q!}}, \quad\text{where}\quad H_q(x) := (-1)^q e^{\frac{x^2}{2}} \frac{d^q}{dx^q} e^{-\frac{x^2}{2}}$$
and $\{H_q\}_{q=0,1,2,\dots}$ is the well-known sequence of Hermite polynomials. The coefficients $J_q(\sigma)$, which will play a crucial role in our arguments below, are defined according to the following (normalized) projection:
$$J_q(\sigma) := \frac{1}{\sqrt{q!}}\, \mathbb{E}[\sigma(Z) H_q(Z)], \qquad Z \sim N(0,1).$$
In the following, when no confusion is possible, we may drop the dependence of $J$ on $\sigma$ for ease of notation. We remark that our notation is to some extent non-standard, insofar as we have introduced the factor $\frac{1}{\sqrt{q!}}$ inside the projection coefficient $\mathbb{E}[\sigma(Z)H_q(Z)]$; equivalently, we are defining the projection coefficients in terms of Hermite polynomials which have been normalized to have unit variance. Indeed, as is well known, $\mathbb{E}[H_q(Z)H_{q'}(Z)] = q!\,\delta_q^{q'}$. In short, our main results state that a quantitative functional central limit theorem holds for neural networks built on $\sigma$, and that the rate of convergence depends on the rate of decay of $\{J_q(\sigma)\}$ as $q \to \infty$: roughly put, the rate is logarithmic when this decay is polynomial (e.g., the ReLU case), whereas convergence occurs at algebraic rates for some smoother activation functions whose coefficients decay exponentially. A more detailed discussion of these results and comparisons with the existing literature are given below in Section 3.
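With the normalization above, the coefficients $J_q(\sigma)$ can be computed numerically by Gauss-Hermite quadrature; the sketch below (ours) does this for the ReLU and compares with the polynomial decay $q^{-5/4}$ discussed later (only the order of decay is meant to match, not the constants):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval
from math import factorial

# Sketch (ours): J_q(sigma) = E[sigma(Z) He_q(Z)] / sqrt(q!) for Z ~ N(0,1),
# computed by Gauss-Hermite quadrature with the probabilists' weight exp(-x^2/2).
nodes, weights = hermegauss(200)
weights = weights / np.sqrt(2 * np.pi)       # normalize to the Gaussian density

def J(q, sigma):
    He_q = hermeval(nodes, [0.0] * q + [1.0])  # evaluates He_q at the nodes
    return float(np.sum(weights * sigma(nodes) * He_q)) / np.sqrt(factorial(q))

relu = lambda t: np.maximum(t, 0.0)
for q in [2, 4, 8, 16, 32]:                   # odd-q ReLU coefficients vanish for q >= 3
    print(q, J(q, relu), q ** -1.25)          # decay roughly like q^(-5/4)
```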
Let us discuss an important point about normalization. In this paper, the measure on the sphere $S^{d-1}$ is normalized to have unit volume. The bounds we obtain are not invariant to this normalization, and indeed they would be much tighter if the measure on the sphere were taken, as usual, to have total mass $s_d := \frac{2\pi^{d/2}}{\Gamma(d/2)}$, the surface volume of $S^{d-1}$. Indeed, by Stirling's formula, $s_d$ achieves its maximum at $d = 7$ ($s_7 = 33.073$) and decays faster than exponentially as $d \to \infty$. This means that, without the normalization that we chose, our bound on the $d_2$ metric would actually be smaller by a factor of roughly $d^{-d/2}$ as the dimension grows. On the other hand, if we were to take the standard Lebesgue measure $\lambda$, then we would obtain, by a standard application of Hermite expansions and the Diagram Formula, $\mathbb{E}\|F\|^2_{L^2(S^{d-1},\lambda)} = s_d\, \|\sigma\|^2$, so that the $L^2$ norm would decay very quickly as $d$ increases, making the interpretation of the results less transparent. Following [6], the convergence in our central limit theorem is measured in the $d_2$ metric. This is given by
$$d_2(F, Z) := \sup_{\|h\|_{C_b^2} \leq 1} \left|\mathbb{E}[h(F)] - \mathbb{E}[h(Z)]\right|,$$
where $C_b^2(L^2(S^{d-1}))$ is the space of real-valued functionals on $L^2(S^{d-1})$ (with respect to the uniform measure) with two bounded Fréchet derivatives. It is to be noted that the $d_2$ metric is bounded by the Wasserstein distance of order 2, i.e.
$$d_2(F, Z) \leq W_2(F, Z) := \inf \left(\mathbb{E}\|F - Z\|^2_{L^2(S^{d-1})}\right)^{1/2},$$
where the infimum is taken over all possible couplings of $(F, Z)$.
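For concreteness, the surface volume $s_d = 2\pi^{d/2}/\Gamma(d/2)$ quoted above is easy to tabulate; a quick check (ours) of the maximum at $d = 7$:

```python
from math import pi, gamma

# Quick check (ours): the surface volume s_d = 2 pi^(d/2) / Gamma(d/2) of S^(d-1)
# peaks at d = 7 and then decays super-exponentially in d.
def s(d):
    return 2.0 * pi ** (d / 2.0) / gamma(d / 2.0)

for d in [2, 3, 7, 8, 20, 50]:
    print(d, s(d))
# s(7) = 33.073..., the maximum over integer dimensions.
```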
Our first main statement is as follows.
Theorem 1. Under the previous assumptions and notation, we have that, for all $Q \in \mathbb{N}$,

where $C$ is an absolute constant (in particular, independent of the input dimension $d$), and $\|\sigma\|$ is the $L^2$ norm of $\sigma$ taken with respect to the Gaussian density on $\mathbb{R}$.
The proof is postponed to Section 4.1. From Theorem 1, optimizing over the choice of $Q$, it is immediate to obtain much more explicit bounds. In the case of polynomial decay of the Hermite coefficients, the choice $Q = \log n/(3 \log 3)$ yields the following result.
Corollary 2. In the same setting as in Theorem 1, for $J_q(\sigma) \lesssim q^{-\alpha}$, $\alpha > \frac{1}{2}$, we have

Example 3 (ReLU). As shown in Lemma 18, for the ReLU activation the Hermite coefficients decay at the rate $J_q(\sigma) = O(q^{-5/4})$, so that Corollary 2 applies with $\alpha = 5/4$. Once again, we stress that the constant is independent of the input dimension $d$.
The statement of Theorem 1 is given in order to cover the most general activation functions, allowing for possibly non-differentiable choices such as the ReLU. Under stronger conditions, the result can be improved; in particular, assuming that the activation function has a Malliavin derivative with bounded fourth moment (i.e., it belongs to the class $\mathbb{D}^{1,4}$, see [21, 6]), we obtain the following extension.
Theorem 4. Under the previous assumptions and notation, and assuming furthermore that $\sigma(\langle W, x\rangle) \in \mathbb{D}^{1,4}$, we have that, for all $Q \in \mathbb{N}$,

where $C$ is an absolute constant (in particular, independent of the input dimension $d$), and $\|\sigma\|$ is the $L^2$ norm of $\sigma$ taken with respect to the Gaussian density on $\mathbb{R}$.
We prove Theorem 4 in Section 4.4. Again, imposing specific decay profiles on the Hermite coefficients, we can obtain explicit bounds. In particular, when $J_q \lesssim e^{-\beta q}$ with $\beta > \log\sqrt{3}$, the second sum appearing in (2) stays finite for all $Q$, hence the bound takes a form more in line with (1). In such a case, letting $Q$ go to infinity leads to the next result.
Example 7 (tanh/logistic). Of course, other forms of decay could be considered. For instance, for the hyperbolic tangent $\sigma(t) = (e^t - e^{-t})/(e^t + e^{-t})$ the rate of decay of the Hermite coefficients is of order $\exp(-C\sqrt{q})$ (see e.g. [10]); hence the result of Corollary 5 does not apply. The bounds in Corollary 2 obviously hold but, applying Theorem 1 directly and some algebra, we obtain a finer bound. The same bound holds for the sigmoid/logistic activation function $\sigma(t) = (1 + e^{-t})^{-1}$.
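One can inspect this intermediate decay numerically; the sketch below (our illustration, mirroring the earlier quadrature example) estimates $J_q(\tanh)$ and shows that $\log|J_q|$ scales roughly like $-C\sqrt{q}$:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval
from math import factorial

# Sketch (ours): for sigma = tanh, |J_q| should decay roughly like exp(-C sqrt(q)),
# so the ratio log|J_q| / sqrt(q) should settle around a negative constant.
nodes, weights = hermegauss(250)
weights = weights / np.sqrt(2 * np.pi)

def J(q, sigma):
    He_q = hermeval(nodes, [0.0] * q + [1.0])
    return float(np.sum(weights * sigma(nodes) * He_q)) / np.sqrt(factorial(q))

for q in [1, 5, 9, 17, 25, 33]:              # tanh is odd, so even-q coefficients vanish
    jq = J(q, np.tanh)
    print(q, jq, np.log(abs(jq)) / np.sqrt(q))
```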

Discussion
The ideas in our proof are quite standard in the literature on quantitative central limit theorems, and they can be extended directly to the functional setting that we consider here. As a first step, we split our neural network into two processes: one corresponding to its projection onto the first $Q$ Wiener chaoses, for $Q$ an integer to be chosen below, and the other corresponding to the remainder. This remainder can easily be bounded in Wasserstein distance by standard $L^2$ arguments; for the leading term, following recent results by [6], we need a careful analysis of fourth moments and covariances for the $L^2$ norms of the Wiener projections. In particular, these bounds can be expressed in terms of multiple integrals of fourth-order cumulants, as is by now standard in the literature on the so-called Stein-Malliavin method (see [21, 6] and the references therein). The computation of these terms is technical, but the results are rather explicit; they are collected in dedicated propositions and lemmas.
One technical point that we shall address is the following. The convergence results in [6] require the limiting process to be nondegenerate; this condition is not always satisfied for arbitrary activation functions if one takes the corresponding Hilbert space to be $L^2(S^{d-1})$ (counterexamples being finite-order polynomials). However, we note that for activations for which the corresponding networks are dense in the space of continuous functions (such as the ReLU or the sigmoid, and basically all non-polynomials; see for instance the classical universal approximation theorems in [7, 13, 14, 17, 22]), the nondegeneracy condition is automatically satisfied. On the other hand, when the condition fails our results continue to hold, but the underlying functional space must be taken to be the reproducing kernel Hilbert space generated by the covariance operator, which is strictly included in $L^2(S^{d-1})$ when universal approximation fails (e.g., in the polynomial case).
Compared with [10], our bounds gain a polynomial factor in the dimension and a $\log\log$ factor in the number of neurons for ReLU and tanh networks; for smooth activations, the rate goes from $n^{-1/6}$ to $n^{-1/2}$, and the constants lose the polynomial dependence on the dimension. The rate in [15] in the polynomial case is $n^{-1/2}$, as ours, but with a factor growing in the input dimension $d$ as $d^{d/2}$. In the ReLU setting, [15] displays the algebraic rate $n^{-\frac{3}{4d-2}}$, which for fixed values of $d$ decays faster than our logarithmic bound. However, interpreting these bounds from a "fixed $d$, growing $n$" perspective can be incomplete: when considering distances in probability metrics it is of interest to allow both $d$ and $n$ to vary. In particular, in neural network applications it is often the case that the input dimension and the number of neurons are of comparable order; taking for instance $d = d_n \sim n^\alpha$, it is immediate to verify that, for all $\alpha > 0$ (no matter how small), $\lim_{n\to\infty} n^{-\frac{3}{4d_n-2}} = 1$, so that our bound in the $d_2$ metric decays faster than the one by [15] in $W_2$ under these circumstances.
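A toy computation (ours) illustrates the point: with $d = d_n \sim n^{1/4}$, the rate $n^{-3/(4d-2)}$ of [15] tends to one, while any fixed power of $1/\log n$ (the exponent $3/4$ below is purely illustrative) still vanishes:

```python
import numpy as np

# Toy illustration (ours): with d = d_n ~ n^(1/4), the algebraic ReLU rate of [15]
# flattens out toward 1, while a fixed power of 1/log(n) still tends to zero.
for n in [1e3, 1e6, 1e9, 1e12, 1e15]:
    d = n ** 0.25
    rate_15 = n ** (-3.0 / (4.0 * d - 2.0))
    rate_log = np.log(n) ** -0.75
    print(f"n = {n:.0e}:  n^(-3/(4d-2)) = {rate_15:.4f},  (log n)^(-3/4) = {rate_log:.4f}")
```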

Proof of the main results
Our main results, Theorems 1 and 4, are proved in Sections 4.1 and 4.4, respectively. The proofs use auxiliary propositions and lemmas, which are established in Sections 4.2 and 4.3.

Proof of Theorem 1
The main idea behind our proof is as follows. For some integer $Q$ to be fixed later, write $F = F_{\leq Q} + F_{>Q}$, where
$$F_{\leq Q}(x) := \frac{1}{\sqrt{n}} \sum_{j=1}^{n} V_j \sum_{q=0}^{Q} \frac{J_q(\sigma)}{\sqrt{q!}}\, H_q(\langle W_j, x\rangle) \qquad\text{and}\qquad F_{>Q} := F - F_{\leq Q}.$$
In words, as anticipated in Section 2.1, we are splitting our network into a component projected onto the $Q$ lowest Wiener chaoses and a remainder projected onto the higher chaoses. Now let us denote by $Z$ a zero-mean Gaussian process with covariance function
$$\mathbb{E}[Z(x_1)Z(x_2)] := \sum_{q=0}^{\infty} J_q^2(\sigma)\, \langle x_1, x_2\rangle^q.$$
Likewise, in the sequel we shall write $\{Z_q\}_{q\in\mathbb{N}}$ for a sequence of independent zero-mean Gaussian processes with covariance functions $\mathbb{E}[Z_q(x_1)Z_q(x_2)] := J_q^2 \langle x_1, x_2\rangle^q$. Our idea is to use Theorem 3.10 in [6] (reported as Theorem 15 in Appendix A): the distance between $F$ and $Z$ is controlled by the distance between $F_{\leq Q}$ and a Gaussian process with matching covariance, plus tail terms which can be handled by standard $L^2$ arguments. Moreover, the covariance operator can be written explicitly in coordinates: taking the standard basis of spherical harmonics $\{Y_{\ell m}\}$, which are eigenfunctions of the covariance operator (see [18]), the relevant quantities can be expressed through $n_{\ell;d}$, the dimension of the $\ell$-th eigenspace in dimension $d$, and $\{C_\ell(q)\}$, the angular power spectrum of $F_q$; see again [18] for more discussion and details (the discussion in this reference is restricted to $d = 2$, but the results can be extended to any dimension). We are left to bound $M(F_{\leq Q})$ and $C(F_{\leq Q})$. In Section 4.2 we provide a bound for $M(F_{\leq Q})$, which under the condition $Q \leq \log_3 \sqrt{n}$ takes a simpler form; in Section 4.3 we show that $C(F_{\leq Q})$ obeys an analogous bound, up to constants. This completes the proof.
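For the reader's convenience, we record the standard computation behind the covariance of the chaotic components (with $F_q(x) := \frac{J_q(\sigma)}{\sqrt{q!\,n}} \sum_j V_j H_q(\langle W_j, x\rangle)$, as implicit in the decomposition above), which justifies matching $F_q$ with $Z_q$; it rests on the classical identity $\mathbb{E}[H_q(Z_1)H_q(Z_2)] = q!\,(\mathbb{E}[Z_1 Z_2])^q$ for jointly standard Gaussian pairs:
$$\mathbb{E}[F_q(x_1)F_q(x_2)] = \frac{J_q^2(\sigma)}{q!\,n} \sum_{j,k=1}^{n} \mathbb{E}[V_j V_k]\, \mathbb{E}\big[H_q(\langle W_j, x_1\rangle) H_q(\langle W_k, x_2\rangle)\big] = \frac{J_q^2(\sigma)}{q!}\, q!\, \langle x_1, x_2\rangle^q = J_q^2(\sigma)\, \langle x_1, x_2\rangle^q.$$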

Bounding $M(F_{\leq Q})$
The following proposition provides a bound on $M(F_{\leq Q})$. The proof relies on several technical lemmas, which are given below.
Proposition 8. We have

Proof. In Lemma 9 we compute the fourth cumulant of the chaotic components. By Lemma 10 we get the bound $\max_{0\leq q_1\leq q-1} \Upsilon_{q_1,q} \lesssim (q!)^2\, \frac{3^{2q}}{q}$, whereas Lemma 11 shows that the corresponding spherical integrals are bounded by one. Moreover, in view of Lemma 13, the remaining terms can be controlled by the same quantities. Collecting all the terms, we finally obtain the claim.

In the following, we collect the technical lemmas used in the proof of Proposition 8.
Lemma 9. We have

Proof. We write $\mathrm{Cum}(\cdot,\cdot,\cdot,\cdot)$ for the joint cumulant of four (centred) random variables, that is,
$$\mathrm{Cum}(X_1,X_2,X_3,X_4) := \mathbb{E}[X_1X_2X_3X_4] - \mathbb{E}[X_1X_2]\,\mathbb{E}[X_3X_4] - \mathbb{E}[X_1X_3]\,\mathbb{E}[X_2X_4] - \mathbb{E}[X_1X_4]\,\mathbb{E}[X_2X_3].$$
Now note that, in view of the normalization we adopted for the volume of $S^{d-1}$, the moments involved can be computed explicitly. Using the diagram formula for Hermite polynomials [18, Proposition 4.15] and then isotropy, for $q_1+q_2+q_3+q_4 = 2q$ the cumulant can be expressed as a weighted sum over diagrams, where $\Upsilon_{q_1 q_2 q_3 q_4}$ and $\Upsilon_{q_1,q}$ count the possible configurations of the diagrams. Precisely, $\Upsilon_{q_1,q}$ is the number of connected diagrams with no flat edges between four rows of $q$ nodes each and $q_1 < q$ connections between the first and second rows.
To compute this number explicitly, let us label the rows of the diagram as $x_1, x_1', x_2, x_2'$. Because there cannot be flat edges, the number of edges between $x_1$ and $x_1'$ is the same as the number of edges between $x_2$ and $x_2'$. Indeed, assume the former were larger than the latter; then there would be fewer edges starting from the pair $(x_1, x_1')$ and reaching the pair $(x_2, x_2')$ than the other way round, which is obviously absurd. There are $\binom{q}{q_1}$ ways to choose the nodes of the first row connected with the second, $\binom{q}{q_1}$ ways to choose the nodes of the second connected with the first, $\binom{q}{q_1}$ ways to choose the nodes of the third connected with the fourth, and $\binom{q}{q_1}$ ways to choose the nodes of the fourth connected with the third, which gives a term of cardinality $\binom{q}{q_1}^4$; the number of ways to match the nodes between the first and second rows, or the third and fourth, is $(q_1!)^2$. There are now $(2q - 2q_1)$ nodes left in the first two rows, which can be matched in an arbitrary way with the $(2q - 2q_1)$ remaining nodes of the third and fourth rows, yielding a further factor $(2q - 2q_1)!$. Altogether, $\Upsilon_{q_1,q} \leq \binom{q}{q_1}^4 (q_1!)^2\, (2q - 2q_1)!$, and the result follows immediately.
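This counting identity is easy to verify by brute force for small $q$; the following sketch (ours) enumerates all no-flat-edge diagrams, ignoring the connectedness constraint (so that the product above is an upper bound for $\Upsilon_{q_1,q}$), and compares the counts with $\binom{q}{q_1}^4 (q_1!)^2 (2q-2q_1)!$:

```python
from math import comb, factorial

# Brute-force check (ours) of the diagram count used above, for small q: enumerate
# perfect matchings of four rows of q nodes with no flat (intra-row) edges and
# exactly q1 edges between the first two rows; connectedness is not imposed here.
def matchings(nodes):
    """Yield all perfect matchings of a list with an even number of elements."""
    if not nodes:
        yield []
        return
    first, rest = nodes[0], nodes[1:]
    for i, partner in enumerate(rest):
        for m in matchings(rest[:i] + rest[i + 1:]):
            yield [(first, partner)] + m

q = 3
row = {node: node // q for node in range(4 * q)}  # four rows of q nodes each

tally = {}
for m in matchings(list(range(4 * q))):
    pairs = [tuple(sorted((row[a], row[b]))) for a, b in m]
    if any(r == s for r, s in pairs):             # flat edge: discard
        continue
    q1 = pairs.count((0, 1))
    assert pairs.count((2, 3)) == q1              # edge counts rows 1-2 and 3-4 agree
    tally[q1] = tally.get(q1, 0) + 1

for q1 in sorted(tally):
    formula = comb(q, q1) ** 4 * factorial(q1) ** 2 * factorial(2 * q - 2 * q1)
    print(q1, tally[q1], formula)                 # the two columns coincide
```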
The result is proved by choosing $\varepsilon$ appropriately.

We recall the standard definition of the Beta function:
$$B(\alpha, \beta) := \int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\,dt = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}.$$

Lemma 11. We have

Proof. Fixing a pole and switching to spherical coordinates, we obtain an expression in terms of the Beta function, which is smaller than 1 for all $d, q$.
Remark 12. The bound we obtain is actually uniform over $d$. It could likely be further improved for growing $d$, because the Beta function decreases quickly as $d$ diverges.
Lemma 13. We have

Proof. It suffices to observe that the claim follows from the calculations in the proof of Lemma 9.

Bounding $C(F_{\leq Q})$
The following result reduces the problem of bounding $C(F_{\leq Q})$ to that of bounding $M(F_{\leq Q})$.
Proposition 14. We have that $C(F_{\leq Q})$ admits the same bound as $M(F_{\leq Q})$, up to an absolute constant factor.

Proof. We expand the covariance $\mathrm{Cov}(\|F_p\|^2, \|F_q\|^2)$ into three terms. By the orthogonality of the Hermite polynomials, the third term vanishes, and we are left with the first two; these are controlled by the fourth-cumulant computations of Lemma 9, and hence our previous bound on the fourth cumulant is sufficient, up to a constant factor.

Proof of Theorem 4
It can be readily checked that, for $K = L^2(S^{d-1})$ and $r < (p_1+1) \wedge (p_2+1)$, the assumptions of [6, Theorem 4.3] are satisfied. To complete the proof, it is then sufficient to exploit [6, Theorem 4.3] and to follow similar steps as in the proof of Theorem 1.

A Appendix
A.1 The quantitative functional central limit theorem by Bourguin and Campese (2020)
In this paper, the probability metric we use for the distance between the random fields of interest is the so-called $d_2$ metric, given by
$$d_2(F, Z) := \sup_{\|h\|_{C_b^2(K)} \leq 1} \left|\mathbb{E}[h(F)] - \mathbb{E}[h(Z)]\right|;$$
here, $C_b^2(K)$ denotes the space of continuous and bounded functionals from the Hilbert space $K$ into $\mathbb{R}$ endowed with two bounded Fréchet derivatives $h', h''$; that is, for each $h \in C_b^2(K)$ and each $x \in K$ there exists a bounded linear operator $h'(x) : K \to \mathbb{R}$, and similarly for the second derivative.
We will use a simplified version of the results by Bourguin and Campese in [6], which we report below.

Theorem 15 (A special case of Theorem 3.10 in [6]). Let $F_{\leq Q} \in L^2(\Omega; K)$ be a Hilbert-valued random element $F_{\leq Q} : \Omega \to K$ with zero mean and covariance operator $S_{\leq Q}$, which can be decomposed into a finite number of Wiener chaoses: $F_{\leq Q} = \sum_{q=1}^{Q} F_q$, with $F_q$ belonging to the $q$-th Wiener chaos. Then, for $Z$ a Gaussian process on the same structure with covariance operator $S$, we have that
$$d_2(F_{\leq Q}, Z) \leq M(F_{\leq Q}) + C(F_{\leq Q}),$$
where
$$M(F_{\leq Q}) = \frac{1}{\sqrt{3}} \sum_{p,q} c_{p,q}\, \sqrt{\mathbb{E}\|F_p\|^4}\, \sqrt{\mathbb{E}\|F_q\|^4 - \mathbb{E}\|Z_q\|^4}, \qquad C(F_{\leq Q}) = \sum_{p \neq q} c_{p,q}\, \mathrm{Cov}\big(\|F_p\|^2, \|F_q\|^2\big),$$
$Z_q$ is a centred Gaussian process with the same covariance operator as $F_q$ ($\mathbb{E}[Z_q(x_1)Z_q(x_2)] = J_q^2(\sigma)\langle x_1, x_2\rangle^q$), and
$$c_{p,q} = \begin{cases} 1 + \sqrt{3}, & p = q, \\ \binom{p+q}{2p}, & p \neq q. \end{cases}$$
Remark 16.The general version of Theorem 3.10 in [6] covers a broader class of processes which can be expanded into the eigenfunctions of Markov operators.
We do not need this extra generality, and we refer to [6] for more discussion and details.
We will now review another result by [6], which holds under stronger smoothness conditions. We shall omit a number of details, for which we refer to classical references such as [21].
Given a Hilbert space $H$, we recall that the associated isonormal process $X = \{X(h) : h \in H\}$ is a collection of zero-mean Gaussian random variables with covariance function $\mathbb{E}[X(h_1)X(h_2)] = \langle h_1, h_2\rangle_H$.
In our case, the random elements of interest take values in the separable Hilbert space $L^2(S^{d-1})$, and the Malliavin derivative is defined first for smooth functionals $F : \Omega \to L^2(S^{d-1})$ of cylindrical form, and then extended by closure. In this setting, the Wiener chaos decomposition takes the form $F = \sum_{p \geq 0} I_p(f_p)$, $f_p \in H^{\odot p}$, where $H^{\odot p}$ denotes the $p$-fold symmetrized tensor product of $H$; see [6], Subsection 4.1.2. The main result we are going to exploit is their Theorem 4.3.
Remark 19. The corresponding covariance kernel is given, for any $x_1, x_2 \in S^{d-1}$, by
$$\Gamma(u) = \frac{1}{2\pi}\left(\sqrt{1-u^2} + u\,(\pi - \arccos u)\right), \qquad u = \langle x_1, x_2\rangle;$$
see also [2].
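Assuming the arc-cosine form recalled above, a quick Monte Carlo check (ours):

```python
import numpy as np

# Monte Carlo check (ours) of the closed-form ReLU kernel
# Gamma(u) = (sqrt(1-u^2) + u*(pi - arccos u)) / (2*pi).
rng = np.random.default_rng(0)
relu = lambda t: np.maximum(t, 0.0)

def gamma_closed(u):
    return (np.sqrt(1.0 - u**2) + u * (np.pi - np.arccos(u))) / (2.0 * np.pi)

d, u = 6, -0.4
x1 = np.eye(d)[0]
x2 = u * x1 + np.sqrt(1.0 - u**2) * np.eye(d)[1]  # unit vector with <x1, x2> = u
W = rng.standard_normal((1_000_000, d))
print(np.mean(relu(W @ x1) * relu(W @ x2)), gamma_closed(u))  # agree to ~1e-3
```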
Remark 20. The rate for $J_q$ in Lemma 18 is consistent with the one obtained by [15]. In [10], $J_q^2 = O(q^{-3})$ is given instead, yielding in [10, Theorem 3] the rate $\log d \times \frac{\log\log n}{\log n}$, which is the one we actually report in Table 1.

Table 1:
Comparison of convergence rates established by different functional quantitative central limit theorems for several activation functions. Bear in mind that two different metrics ($d_2 \leq W_2$) are involved: $W_2$ for [10, 15], and $d_2$ for this paper. The parameters $\alpha$ and $\beta$ must satisfy $\alpha > 1/2$ and $\beta > \log\sqrt{3}$.