Asymptotic normality of modified LS estimator for mixture of nonlinear regressions

We consider a mixture with varying concentrations in which each component is described by a nonlinear regression model. A modified least squares estimator is used to estimate the regression parameters. Asymptotic normality of the derived estimators is demonstrated. This result is applied to the construction of confidence sets. The performance of the confidence sets is assessed by simulations.


Introduction
Nonlinear regression models are widely used in the analysis of statistical data [14,16]. In many applications the observed data are derived from a mixture of components with different dependencies between the variables in different components. In this case a finite mixture model can be used to describe the data [15,19,2]. If the concentrations of the components in the mixture differ across observations, then the model of mixture with varying concentrations (MVC) can be applied [1,12,11]. Parametric models of nonlinear regression mixtures were considered in [5,4]. Estimation in linear regression MVC models was also studied in [6,8].
In this paper we adopt a semiparametric approach based on the modified least squares (mLS) technique. The consistency of mLS estimators in regression MVC models was demonstrated in [9]. Our aim is to derive conditions for the asymptotic normality of mLS estimators and to construct confidence sets for the true values of the parameters.
The rest of the paper is organized as follows. In Section 2 we introduce the regression mixture model and the mLS estimator for the regression parameters. Asymptotic behavior of the estimator is discussed in Section 3. Confidence ellipsoids for the parameters are constructed in Section 4. Results of simulations are presented in Section 5. Concluding remarks are made in Section 6.

The model and estimator
In this paper we apply regression techniques to data described by the model of mixture with varying concentrations. This means that each observed subject belongs to one of M different sub-populations (i.e., components of the mixture). We observe n such subjects O_1, …, O_n. The index of the component to which O_j belongs is denoted by κ_j. These indices are not observed, but the probabilities p^k_{j;n} = P{κ_j = k} are known. These probabilities are called the mixing probabilities or concentrations of the components at the j-th observation.
For each subject O_j one observes a set of numerical variables ξ_{j;n} = ξ_j = (Y_j, X^1_j, …, X^m_j), where Y_j is the response and X_j = (X^1_j, …, X^m_j)^T is the vector of independent variables in the regression model

Y = g(X; ϑ) + ε,

where g is a known regression function, ϑ = (ϑ_1, …, ϑ_d)^T is a vector of unknown regression coefficients, and ε is an unobservable regression error. In fact, the coefficients of the model can be different for different components: ϑ = ϑ^(k) if κ_j = k. The distribution of ε can also depend on κ_j. These dependencies are described by the following model:

Y_j = g(X_j; ϑ^(κ_j)) + ε^(κ_j)_j, j = 1, …, n. (1)

Here ϑ^(k) ∈ Θ^(k) ⊆ R^d is the vector of unknown regression coefficients corresponding to the k-th mixture component, and ε^(k)_j, j = 1, …, n, k = 1, …, M, are independent random variables with distributions depending on k but not on j.
We will assume that E ε^(k)_j = 0 and Var ε^(k)_j = σ^{2(k)} < ∞. (The values of σ^{2(k)} are unknown.) The vectors of independent variables X_j are considered as random vectors with distributions possibly depending on κ_j. It is assumed that the error term ε^(κ_j)_j and X_j are conditionally independent given κ_j. The vectors (ξ_j, κ_j) are independent for different j.
In what follows we will frequently use expectations and probabilities connected with different mixture components. To present them in a compact form, we introduce formal random vectors (Y^(m), X^(m), ε^(m)) which have the conditional distribution of (Y_j, X_j, ε^(κ_j)_j) given κ_j = m, i.e., the distribution of the m-th component. We will also denote by p_{;n} = (p^k_{j;n}, j = 1, …, n, k = 1, …, M) the n × M matrix of all concentrations for all observations and all components, with columns p^m_{;n} = (p^m_{1;n}, …, p^m_{n;n})^T. Similar notation is used for the weights matrix a_{;n} introduced below. We are interested in estimating the parameters ϑ^(k) for the different components. The considered estimators are based on the modified least squares (mLS) approach. Namely, we consider the weighted least squares functional

J^(k)(t) = Σ_{j=1}^n a^k_{j;n} (Y_j − g(X_j; t))², (2)

where t ∈ Θ^(k) is a formal parameter and a^k_{j;n} are weights aimed to single out the k-th mixture component and suppress the influence of all other components on the functional J^(k). In this presentation, we restrict ourselves to the minimax weights matrix defined as a_{;n} = p_{;n} Γ^{−1}_{;n}, where Γ_{;n} = p^T_{;n} p_{;n}. (It is assumed here that Γ_{;n} is nonsingular. See [11,12] for the minimax properties of these weights.) It is readily seen that

(a^k_{;n})^T p^m_{;n} = 1{m = k}. (3)
(Here 1{A} is the indicator function of an event A.) So, by (3),

E J^(k)(t) = J̄(t) = σ^{2(k)} + E (g(X^(k); ϑ^(k)) − g(X^(k); t))².

The minimum of J̄(t) is attained at t = ϑ^(k). Thus, if ϑ^(k) is the unique minimum point, one expects that under suitable conditions J^(k)(t) → J̄(t) by the law of large numbers, and argmin_{t ∈ Θ^(k)} J^(k)(t) → ϑ^(k) as n → ∞. If g is smooth enough, the argmin can be found as a solution to J̇^(k)(t) = 0. In what follows we define the mLS estimator θ̂^(k)_{;n} for ϑ^(k) as a statistic which is a solution to

J̇^(k)(t) = −2 Σ_{j=1}^n a^k_{j;n} (Y_j − g(X_j; t)) ġ(X_j; t) = 0, (4)

where ġ(x, t) = (∂g(x, t)/∂t_1, …, ∂g(x, t)/∂t_d)^T. If there are many solutions to (4) then θ̂^(k)_{;n} can be taken to be any of them, but it must be a measurable function of the observed data (Y_j, X_j), j = 1, …, n.
Note that the mLS estimator so defined can be a point of local minimum of J^(k)(t). But we still call it mLS since it was obtained by a modification of the LS technique.
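As a sketch, the minimax weights and the mLS estimator can be implemented as follows. The regression function g(x, t) = t_0 exp(t_1 x), the data-generating parameters, and the optimizer choice below are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def minimax_weights(p):
    """Minimax weights a_{;n} = p_{;n} Gamma^{-1}, with Gamma_{;n} = p^T p.
    p is the (n, M) matrix of concentrations; Gamma is assumed nonsingular."""
    return p @ np.linalg.inv(p.T @ p)

def mls_estimate(y, x, p, k, g, t0):
    """mLS estimate for component k: minimize
    J^(k)(t) = sum_j a^k_j (y_j - g(x_j, t))^2 over t.
    The weights a^k_j may be negative, so J is minimized directly."""
    a = minimax_weights(p)[:, k]
    return minimize(lambda t: np.sum(a * (y - g(x, t)) ** 2),
                    t0, method="Nelder-Mead").x

# Illustrative two-component example (hypothetical g and parameter values)
rng = np.random.default_rng(0)
n = 2000
p1 = rng.uniform(0.1, 0.9, size=n)
p = np.column_stack([p1, 1.0 - p1])             # concentrations p_{;n}
kappa = (rng.uniform(size=n) > p1).astype(int)  # hidden component labels
theta = np.array([[1.0, 1.0], [2.0, -1.0]])     # true parameters (hypothetical)
x = rng.uniform(0.0, 1.0, size=n)
y = theta[kappa, 0] * np.exp(theta[kappa, 1] * x) + rng.normal(0.0, 0.1, size=n)
g = lambda x, t: t[0] * np.exp(t[1] * x)
theta_hat = mls_estimate(y, x, p, 0, g, np.array([0.8, 0.8]))
```

Minimizing J^(k) directly (rather than via a residual-based least squares solver) matters because the minimax weights can be negative.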

Asymptotic behavior of mLS estimators
In this section, we consider the asymptotic behavior of θ̂^(k)_{;n} as the sample size n tends to infinity. Let us start with some general assumptions on the model.
In this paper we make no assumptions on connections between p_{;n} and p_{;m} for n ≠ m, and we do not assume that the concentrations tend to some limit as n → ∞. Assumptions are made only on the asymptotic behavior of certain averaged characteristics of the concentrations.
Note that if a significant fraction of the p^k_{j;n} is bounded away from zero, then the entries of the matrix Γ_{;n} = p^T_{;n} p_{;n} are of order n as n → ∞. In what follows we will assume that the limit matrix

Γ = lim_{n→∞} (1/n) Γ_{;n} (5)

exists and is nonsingular. Then the weights a^k_{j;n} are of order 1/n as n → ∞, and sums of the form Σ_{j=1}^n a^k_{j;n} a^m_{j;n} p^l_{j;n} p^i_{j;n} are of order 1/n as well. We will assume that the limits

⟨(a^k)² p^l⟩ = lim_{n→∞} n Σ_{j=1}^n (a^k_{j;n})² p^l_{j;n}, ⟨(a^k)² p^l p^i⟩ = lim_{n→∞} n Σ_{j=1}^n (a^k_{j;n})² p^l_{j;n} p^i_{j;n} (6)

exist and are finite. Define the estimating function

h(ξ_j; t) = (Y_j − g(X_j; t)) ġ(X_j; t), (7)

so that equation (4) takes the form

Σ_{j=1}^n a^k_{j;n} h(ξ_{j;n}; t) = 0. (8)

Note that (8) is an unbiased generalized estimating equation (GEE, see [17], Section 5.4).
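The claims about the order of the weights can be checked numerically. This small sketch, with an arbitrary two-component concentration scheme, verifies identity (3) and that n·max_j |a^k_{j;n}| stays bounded as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
scaled_max = []
for n in (100, 1000, 10000):
    p1 = rng.uniform(0.1, 0.9, size=n)        # concentrations of component 1
    p = np.column_stack([p1, 1.0 - p1])       # matrix p_{;n}
    a = p @ np.linalg.inv(p.T @ p)            # minimax weights a = p Gamma^{-1}
    # Identity (3): (a^k)^T p^m = 1{m = k}
    assert np.allclose(a.T @ p, np.eye(2))
    scaled_max.append(n * np.abs(a).max())    # weights are of order 1/n
```

The entries of `scaled_max` stay of the same magnitude across the three sample sizes, illustrating the 1/n order of the weights.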

Conditions for the consistency of θ̂^(k)_{;n} are presented in the next statement.

Theorem 1. Assume the following.
Proof. See [9].

Now consider the asymptotic normality of θ̂^(k)_{;n}. We will start with a result formulated in the more general terms of GEE estimation.
Assume that Ξ_{;n} = (ξ_{j;n}, j = 1, …, n) are random observations in a measurable space X described by the model of mixture with varying concentrations (MVC), i.e.,

P{ξ_{j;n} ∈ A} = Σ_{m=1}^M p^m_{j;n} F^(m)(A) for all measurable A ⊆ X, (9)

where F^(m) is the distribution of the m-th component, and let h : X × Θ^(k) → R^d be an estimating function such that E h(ξ^(k); ϑ^(k)) = 0. (I.e., h is an unbiased estimating function.) Any statistic θ̂^(k)_{;n} is called a GEE-estimator for ϑ^(k) if it is an a.s. solution to (8), i.e.,

H^(k)_{;n}(θ̂^(k)_{;n}) = Σ_{j=1}^n a^k_{j;n} h(ξ_{j;n}; θ̂^(k)_{;n}) = 0 a.s. (10)
Proof. The proof of the theorem is quite standard. Applying the Taylor expansion to the LHS of (10) one obtains

√n (θ̂^(k)_{;n} − ϑ^(k)) = −(Ḣ^(k)_{;n}(ζ))^{−1} √n H^(k)_{;n}(ϑ^(k)),

where ζ is an intermediate point between θ̂^(k)_{;n} and ϑ^(k). In view of Assumptions 2-5 and 7 of the theorem, in the same way as in [17], Theorem 5.14 and Lemma 5.3, it can be shown that, as n → ∞, Ḣ^(k)_{;n}(ζ) converges in probability to V^(k) and √n H^(k)_{;n}(ϑ^(k)) converges in distribution to a centered Gaussian vector. Then a straightforward calculation with (3) in mind yields the covariance matrix of the limit distribution.

This result is just the statement of the theorem.
Let us return to the regression mixture model (1). Obviously it is a special case of the MVC model (9). How can the matrices V^(m) and Z^(m,l) be represented for the regression mixture model?
Assume that h is defined by (7) and that the function g(x, t) is twice differentiable with respect to t, with gradient ġ(x, t) and Hessian g̈(x, t). Denote A_{l,k} = E ġ(X^(l); ϑ^(k)) ġ(X^(l); ϑ^(k))^T. Then

V^(k) = −A_{k,k},

Z^(k,k) = Σ_{l=1}^M ⟨(a^k)² p^l⟩ E [(Y^(l) − g(X^(l); ϑ^(k)))² ġ(X^(l); ϑ^(k)) ġ(X^(l); ϑ^(k))^T].

Confidence ellipsoids for regression parameters
Let us apply the results of Section 3 to the construction of asymptotic confidence sets for ϑ^(k). For any t ∈ R^d and any nonsingular S ∈ R^{d×d}, define

T^(k)_{;n}(t, S) = n (θ̂^(k)_{;n} − t)^T S^{−1} (θ̂^(k)_{;n} − t).
It is obvious that if Theorem 2 holds and S^(k,k) is nonsingular, then

T^(k)_{;n}(ϑ^(k), S^(k,k)) → χ²_d in distribution as n → ∞, (16)

where χ²_d is the χ²-distribution with d degrees of freedom. Note that (16) also holds if S^(k,k) is replaced by a consistent estimator Ŝ^(k,k)_{;n}. So

B^(k)_{;n} = {t ∈ R^d : T^(k)_{;n}(t, Ŝ^(k,k)_{;n}) ≤ Q^{χ²_d}(α)},

where Q^{χ²_d}(α) is the α-quantile of the χ²_d distribution, is an asymptotic α-level confidence set for ϑ^(k) in the sense that P{ϑ^(k) ∈ B^(k)_{;n}} → α as n → ∞. To accomplish the confidence set construction, we need convenient conditions for the nonsingularity of S^(k,k) and a consistent estimator of this matrix.
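The acceptance check behind this construction can be sketched as follows; the parameter values in the sanity check are arbitrary, and `chi2.ppf` supplies the quantile Q^{χ²_d}(α).

```python
import numpy as np
from scipy.stats import chi2

def in_ellipsoid(theta_hat, t, S, n, alpha=0.95):
    """True iff t lies in the asymptotic alpha-level confidence ellipsoid
    {t : n (theta_hat - t)^T S^{-1} (theta_hat - t) <= Q_{chi2_d}(alpha)}."""
    diff = np.asarray(theta_hat) - np.asarray(t)
    T = n * diff @ np.linalg.solve(S, diff)
    return T <= chi2.ppf(alpha, df=len(diff))

# Sanity check: if sqrt(n)(theta_hat - theta) is exactly N(0, S), the
# covering frequency should be close to alpha = 0.95.
rng = np.random.default_rng(2)
theta = np.array([1.0, -2.0])
S = np.array([[1.0, 0.3], [0.3, 2.0]])
n = 50
hits = [in_ellipsoid(theta + rng.multivariate_normal(np.zeros(2), S / n),
                     theta, S, n)
        for _ in range(2000)]
coverage = np.mean(hits)
```

Here the Gaussian limit is sampled directly, so the covering frequency matches the nominal level up to Monte Carlo error.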

Nonsingularity of S^(k,k)
Since, by (13), S^(k,k) = (V^(k))^{−1} Z^(k,k) ((V^(k))^{−1})^T, we need conditions for the nonsingularity of V^(k) and Z^(k,k).

Assumption I_{lk}. For all c ∈ R^d such that c ≠ 0,

P{c^T ġ(X^(l); ϑ^(k)) ≠ 0} > 0.

This assumption means that the functions ġ_1(·; ϑ^(k)), …, ġ_d(·; ϑ^(k)) are linearly independent a.s. with respect to the distribution of X for the l-th component.

Lemma 1. Assume that the matrix A_{l,k} = E ġ(X^(l); ϑ^(k)) ġ(X^(l); ϑ^(k))^T exists, is finite, and assumption I_{lk} holds. Then A_{l,k} is nonsingular.
Proof. Observe that A_{l,k} is the Gram matrix of the set of functions G = (ġ_1, …, ġ_d) in the L_2 space of functions on R^m with inner product (f, g) = E f(X^(l)) g(X^(l)).
Assumption I lk implies that the functions in G are linearly independent in this space. So, their Gram matrix is nonsingular.
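A quick Monte Carlo illustration of this Gram-matrix argument, using the hypothetical gradient functions ġ_1(x) = e^x and ġ_2(x) = x e^x (the partial derivatives of g(x, t) = t_1 e^{t_2 x} at t = (1, 1)) and assuming X^(l) uniform on [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=100_000)          # draws of X^(l) (assumed dist.)
G = np.column_stack([np.exp(x), x * np.exp(x)])  # values of g-dot_1, g-dot_2
A = G.T @ G / len(x)                             # Monte Carlo Gram matrix
# Linear independence of the functions implies a positive definite Gram matrix
min_eig = np.linalg.eigvalsh(A).min()
```

Since e^x and x e^x are linearly independent on [0, 1], the smallest eigenvalue of the estimated Gram matrix stays bounded away from zero.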

Theorem 3.
Assume that the matrix Z^(k,k) exists and is finite, assumption I_{kk} holds, and σ^{2(k)} > 0. Then S^(k,k) exists and is nonsingular.
Proof. From V^(k) = −A_{k,k} and the lemma above one readily obtains the nonsingularity of V^(k). Let us show the nonsingularity of Z^(k,k).
In what follows ≥ means the Loewner order for matrices, i.e., A ≥ B means that A − B is a positive semidefinite matrix.
Since Z^(k,k) ≥ σ^{2(k)} ⟨(a^k)² p^k⟩ A_{k,k}, while A_{k,k} ≥ 0 and det A_{k,k} ≠ 0, to prove the theorem it is enough to show that ⟨(a^k)² p^k⟩ > 0.
To do this, observe that by (3), Σ_{j=1}^n a^k_{j;n} p^k_{j;n} = 1.
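The remaining step can be completed, e.g., by the Cauchy-Schwarz inequality. The following sketch assumes the angle-bracket limit ⟨(a^k)² p^k⟩ = lim_n n Σ_j (a^k_{j;n})² p^k_{j;n}, as above.

```latex
% Since each p^k_{j;n} \le 1, we have \sum_{j=1}^n p^k_{j;n} \le n. Hence
1 = \Big(\sum_{j=1}^n a^k_{j;n} p^k_{j;n}\Big)^2
  \le \Big(\sum_{j=1}^n (a^k_{j;n})^2 p^k_{j;n}\Big)
      \Big(\sum_{j=1}^n p^k_{j;n}\Big)
  \le n \sum_{j=1}^n (a^k_{j;n})^2 p^k_{j;n},
% so n \sum_{j=1}^n (a^k_{j;n})^2 p^k_{j;n} \ge 1 for all n, and in the limit
\langle (a^k)^2 p^k \rangle \ge 1 > 0.
```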

Estimation of S^(k,k)
There are at least two ways to estimate S^(k,k). The first is based on the plug-in technique. Namely, we construct empirical counterparts to V^(k) and Z^(k,k) and substitute them into (13) to obtain an estimator for S^(k,k). Formula (15) suggests the estimator

V̂^(k)_{;n} = −Σ_{j=1}^n a^k_{j;n} ġ(X_j; θ̂^(k)_{;n}) ġ(X_j; θ̂^(k)_{;n})^T

for V^(k).
Estimation of Z^(k,k) is more complicated. We can estimate the component expectations E [(Y^(l) − g(X^(l); ϑ^(k)))² ġ(X^(l); ϑ^(k)) ġ(X^(l); ϑ^(k))^T] only through suitably weighted empirical averages over the whole sample, with ϑ^(k) replaced by θ̂^(k)_{;n}. We also replace the limits ⟨(a^k)² p^l⟩ and ⟨(a^k)² p^i p^l⟩ with their finite-sample approximations

α(k, l) = n Σ_{j=1}^n (a^k_{j;n})² p^l_{j;n}, α(k, i, l) = n Σ_{j=1}^n (a^k_{j;n})² p^i_{j;n} p^l_{j;n}.
Then the estimator Ẑ^(k,k)_{;n} of Z^(k,k) is obtained by combining these approximations, and the resulting plug-in estimator for S^(k,k) is

plugŜ^(k,k)_{;n} = (V̂^(k)_{;n})^{−1} Ẑ^(k,k)_{;n} ((V̂^(k)_{;n})^{−1})^T.

By the same methods as in Theorem 5.15 in [17] it can be shown that under the assumptions of Theorem 2 this estimator is consistent. The second approach to the estimation of S^(k,k) is based on the jackknife technique. Consider the dataset Ξ_{;−i,n} = (ξ_{1;n}, …, ξ_{i−1;n}, ξ_{i+1;n}, …, ξ_{n;n}), which consists of all observations from Ξ_{;n} without the i-th one. Similarly, the matrix p_{;−i,n} contains all rows of p_{;n} except the i-th one, Γ_{;−i,n} = p^T_{;−i,n} p_{;−i,n} and a_{;−i,n} = p_{;−i,n} Γ^{−1}_{;−i,n}. Let θ̂^(k)_{;−i,n} denote the mLS estimator calculated from Ξ_{;−i,n} with the weights a_{;−i,n}, and define the jackknife estimator

jkŜ^(k,k)_{;n} = n Σ_{i=1}^n (θ̂^(k)_{;−i,n} − θ̂^(k)_{;n}) (θ̂^(k)_{;−i,n} − θ̂^(k)_{;n})^T.
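A sketch of the jackknife computation follows. The quadratic-form scaling n·Σ_i used here is one conventional choice for a √n-normalized covariance and is an assumption; `fit` stands for any routine returning the mLS estimate from a (possibly reduced) sample.

```python
import numpy as np

def jackknife_S(y, x, p, k, fit):
    """Jackknife estimate of S^(k,k): refit without each observation i,
    with weights implicitly recomputed from the reduced concentration
    matrix p_{;-i,n} inside `fit`, and accumulate
    n * sum_i (theta_{-i} - theta)(theta_{-i} - theta)^T."""
    n = len(y)
    theta = fit(y, x, p, k)                 # full-sample mLS estimate
    S = np.zeros((theta.size, theta.size))
    for i in range(n):
        keep = np.arange(n) != i
        th_i = fit(y[keep], x[keep], p[keep], k)
        d = th_i - theta
        S += np.outer(d, d)
    return n * S
```

For models linear in t, `fit` can solve the weighted normal equations directly, which keeps the n refits cheap.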
Jackknife estimators for i.i.d. samples are considered in [18]. Consistency of the jackknife is demonstrated in [8] for samples from mixtures with varying concentrations in which the components are described by linear errors-in-variables regression models. We do not state consistency conditions for jkŜ^(k,k)_{;n}, but analyze its applicability to the construction of confidence ellipsoids in a small simulation study.

Simulation results
In the simulation study, the performance of the confidence ellipsoids constructed in Section 4 is tested on N = 1000 simulated samples in each experiment. In all the experiments, we constructed confidence ellipsoids for the regression parameters with nominal covering probability 95% and calculated the obtained covering frequencies, i.e., the percentage of ellipsoids which cover the true parameter vector.
The data were generated from a mixture of two components (i.e., M = 2) with mixing probabilities which were also obtained by random generation. Here ϑ^(m) = (ϑ^(m)_0, ϑ^(m)_1)^T is the vector of unknown regression parameters for the m-th component to be estimated. The true values of the parameters with which the data were generated are presented in Table 1. The ε^(m) are zero-mean regression errors independent of X^(m), and their distributions were different in different experiments.
In each experiment, we calculated (i) oracle 95% covering sets in which the true value of S^(k,k) is used; (ii) plug-in 95% confidence ellipsoids based on plugŜ^(k,k)_{;n}; (iii) jackknife ellipsoids based on jkŜ^(k,k)_{;n}. The ellipsoids were constructed for 1000 simulated samples and the covering frequencies were calculated. These frequencies are presented in the tables for each experiment.
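A condensed version of such an experiment can be scripted as below. The regression function g(x, t) = t_0 + t_1 x² (nonlinear in x but linear in t, so the mLS equations solve in closed form), the parameter values, the error law, and the plug-in-style sandwich estimator of S^(k,k) used here are all illustrative assumptions rather than the exact setup of the paper.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
n, N = 400, 300                                   # sample size, replications
theta_true = np.array([[1.0, 2.0], [3.0, -1.0]])  # hypothetical parameters

def covers_once():
    p1 = rng.uniform(0.1, 0.9, size=n)
    p = np.column_stack([p1, 1.0 - p1])           # random concentrations
    kappa = (rng.uniform(size=n) > p1).astype(int)
    x = rng.uniform(-1.0, 1.0, size=n)
    y = (theta_true[kappa, 0] + theta_true[kappa, 1] * x**2
         + rng.normal(0.0, 0.5, size=n))
    a = (p @ np.linalg.inv(p.T @ p))[:, 0]        # minimax weights, component 1
    G = np.column_stack([np.ones(n), x**2])       # gradient g-dot(x, t) = (1, x^2)
    th = np.linalg.solve(G.T @ (a[:, None] * G), G.T @ (a * y))
    r = y - G @ th                                # residuals
    V = G.T @ (a[:, None] * G)                    # sign-flipped V-hat (sandwiched away)
    Z = n * (G.T @ (((a * r) ** 2)[:, None] * G)) # robust middle matrix
    Vinv = np.linalg.inv(V)
    S = Vinv @ Z @ Vinv.T                         # sandwich estimate of S^(1,1)
    diff = th - theta_true[0]
    return n * diff @ np.linalg.solve(S, diff) <= chi2.ppf(0.95, df=2)

coverage = np.mean([covers_once() for _ in range(N)])
```

On this synthetic setup the covering frequency lands in the vicinity of the nominal 95%, mirroring the qualitative behavior reported in the tables.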

Experiment 1.
Here the error terms were zero-mean normal with variance σ^{2(k)} = 0.25 for k = 1, 2. The resulting covering frequencies are presented in Table 2. It seems that the accuracy of the plug-in confidence ellipsoids in this experiment is not high, but it is sufficient for practical purposes for sample sizes larger than 1000. The accuracy of the plug-in ellipsoids is nearly the same as that of the oracle covering sets, so the observed deviations of the covering frequencies from the nominal confidence probability cannot be explained by errors in the estimation of S^(k,k). The jackknife ellipsoids are almost as accurate as the plug-in ones.

Experiment 2.
Here we consider bounded regression errors, namely ε^(k) uniform on [−0.25, 0.25]. The resulting covering frequencies are presented in Table 3. The accuracy of the plug-in and jackknife ellipsoids is nearly the same as in Experiment 1. Paradoxically, the oracle ellipsoids perform somewhat worse than in Experiment 1.

Experiment 3.
Here we compare the accuracy of the ellipsoids in a regression with heavy-tailed errors. The errors have the distribution of η/10, where η has Student's t distribution with four degrees of freedom. The results are presented in Table 4. In this case, the accuracy of the jackknife ellipsoids seems significantly worse than in Experiments 1 and 2. The plug-in ellipsoids show nearly the same performance as in the previous experiments.

Conclusion
We presented theoretical results on the asymptotic normality of the modified least squares estimators for mixtures of nonlinear regressions. These results were applied to the construction of confidence ellipsoids for the regression coefficients. Simulation results show that the proposed ellipsoids can be used for sufficiently large samples.