Confidence ellipsoids for linear regression coefficients are constructed from observations drawn from a mixture with varying concentrations. Two approaches are discussed. The first is a nonparametric approach based on the weighted least squares technique. The second is approximate maximum likelihood estimation, with the EM-algorithm applied to compute the estimates.
Keywords: finite mixture model, linear regression, mixture with varying concentrations, nonparametric estimation, maximum likelihood, confidence ellipsoid, EM-algorithm

MSC: 62J05, 62G20

Introduction
This paper is devoted to the construction of confidence ellipsoids for the coefficients of a linear regression in the case when the statistical data are derived from a mixture with a finite number of components and the mixing probabilities (the concentrations of the components) differ between observations. These mixing probabilities are assumed to be known, but the distributions of the components are unknown. (Such mixture models are also known as mixtures with varying concentrations, see [1] and [7].)
The problem of estimating regression coefficients from such mixed observations was considered in the parametric setting in [4] and [5]. The authors of these papers assume that the distributions of the regression errors and regressors are known up to some unknown parameters. The models considered there are called finite mixtures of regression models. Versions of maximum likelihood estimates are used to estimate the unknown distribution parameters and regression coefficients under these models, and the EM-algorithm is used to compute the estimates. (This algorithm is also implemented in the R package mixtools, [2]; see [13] for the general theory of the EM-algorithm and its application to mixture models.)
In [6] a nonparametric approach was proposed under which no parametric models on the distributions of components are assumed. A weighted least squares technique is used to derive estimates for regression coefficients. Consistency and asymptotic normality of the estimates are demonstrated.
Note that in [6, 10, 11] a nonparametric approach to the analysis of mixtures with varying concentrations was developed in the case when the concentrations of components (mixing probabilities) are known. Some examples of real life data analysis under this assumption were considered in [10, 11].
Namely, in [11] an application to the analysis of the Ukrainian parliamentary elections of 2006 was considered. There the observed subjects were respondents of the Four Wave World Values Survey (conducted in Ukraine in 2006), and the mixture components were the populations of adherents of different electoral behavior. The mixing probabilities were obtained from the official results of voting in 27 regions of Ukraine.
In [10] an application to DNA microarray data analysis was presented. Here the subjects were nearly 3000 genes of the human genome. They were divided into two components according to the difference in their expression in two types of malignantly transformed tissues. The concentrations were defined as a posteriori probabilities for the genes to belong to a given component, calculated from observations of gene expression in sample tissues.
In this paper we will show how to construct confidence sets (ellipsoids) for regression coefficients under both parametric and nonparametric approaches. Quality of obtained ellipsoids is compared via simulations.
The rest of the paper is organized as follows. In Section 2 we present a formal description of the model. Nonparametric and parametric estimates of regression coefficients and their asymptotic properties are discussed in Sections 3 and 4. Estimation of the asymptotic covariances of these estimates is considered in Section 5. The confidence ellipsoids are constructed in Section 6. Results of simulations are presented in Section 7. In Section 8 we present a toy example of application to real life sociological data. Section 9 contains concluding remarks.
The model
We consider the model of a mixture with varying concentrations. This means that each observed subject O belongs to one of M different subpopulations (mixture components). The number of the component to which the subject belongs is denoted by κ(O)∈{1,2,…,M}. This characteristic of the subject is not observed. The vector of observed variables of O will be denoted by ξ(O). It is considered as a random vector with distribution depending on the subpopulation to which O belongs. A structural linear regression model will be used to describe these distributions. (See [14] for the general theory of linear regression.)
That is, we consider one variable Y=Y(O) in ξ(O)=(Y(O),X1(O),…,Xd(O))T as a response and all other ones X(O)=(X1(O),…,Xd(O))T as regressors in the model
Y(O)=∑i=1dbiκ(O)Xi(O)+ε(O),
where bik, i=1,…,d, k=1,…,M are unknown regression coefficients for the k-th component of the mixture, ε(O) is the error term. Denote by bk=(b1k,…,bdk)T the vector of the k-th component’s coefficients. We consider ε(O) as a random variable and assume that
E[ε(O)|κ(O)=m]=0,m=1,…,M,
and
σm2=Var[ε(O)|κ(O)=m]<∞.
(σm2 are unknown).
It is also assumed that the regression error term ε(O) and regressors X(O) are conditionally independent for fixed κ(O)=m, m=1,…,M.
The observed sample Ξn consists of values ξj=(Yj,XjT)T=ξ(Oj), j=1,…,n, where O1,…, On are independent subjects which can belong to different components with probabilities
pjm=P{κ(Oj)=m},m=1,…,M;j=1,…,n.
(all mixing probabilities pjm are known).
To describe completely the probabilistic behavior of the observed data we need to introduce the distributions of ε(O) and X(O) for different components. Let us denote
FX,m(A)=P{X(O)∈A|κ(O)=m} for any measurable A⊆Rd,

and

Fε,m(A)=P{ε(O)∈A|κ(O)=m} for any measurable A⊆R.
The corresponding probability densities fX,m and fε,m are defined by
FX,m(A)=∫AfX,m(x)dx,Fε,m(A)=∫Afε,m(x)dx
(for all measurable A).
The distribution of the observed ξj is a mixture of the component distributions with the mixing probabilities pjm, i.e.

P{Xj∈A}=∑m=1MpjmFX,m(A)
and the probability density fj(y,x) of ξj=(Yj,XjT)T at a point (y,xT)T∈Rd+1 is
fj(y,x)=∑m=1MpjmfX,m(x)fε,m(y−xTbm).
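For concreteness, the sampling scheme just described can be sketched in code. The following is a minimal illustration of our own (all names are ours, not from the paper): the hidden label κ(Oj) is drawn with the known probabilities pjm, the regressors are taken Gaussian within each component, and the label is then discarded, mirroring the fact that it is unobserved.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_mixture_sample(n, p, b, mu, Sigma, sigma):
    # p     : (n, M) known mixing probabilities p_{jm}
    # b     : (M, d) regression coefficients b^m of the components
    # mu    : (M, d) regressor means, Sigma: (M, d, d) covariances
    # sigma : (M,) error standard deviations
    M, d = b.shape
    X = np.empty((n, d))
    Y = np.empty(n)
    for j in range(n):
        m = rng.choice(M, p=p[j])                      # hidden kappa(O_j)
        X[j] = rng.multivariate_normal(mu[m], Sigma[m])
        Y[j] = X[j] @ b[m] + sigma[m] * rng.standard_normal()
    return Y, X                                        # labels are NOT returned

# two components, two regressors
p = rng.uniform(size=(300, 2))
p /= p.sum(axis=1, keepdims=True)
b = np.array([[-0.5, 2.0], [2.0, -1.0]])
mu = np.array([[-2.0, 0.0], [4.0, 1.0]])
Sigma = np.stack([3.0 * np.eye(2), 2.0 * np.eye(2)])
Y, X = simulate_mixture_sample(300, p, b, mu, Sigma, np.array([1.0, 1.0]))
```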
In what follows we will discuss two approaches to the estimation of the parameters of interest bk, for a fixed k∈{1,…,M}.
The first one is the nonparametric approach. Under this approach we do not need to know the densities fX,m and fε,m; moreover, we do not even assume that these densities exist. The estimates are based on a modification of the least squares technique proposed in [6].
In the second, parametric approach we assume that the densities of components are known up to some unknown nuisance parameters ϑm∈Θ⊆RL:
fX,m(x)=f(x;ϑm),fε,m(x)=fε(x;ϑm).
In the most popular parametric normal mixture model these densities are normal, i.e.
fε,m∼N(0,σm2),fX,m(x)∼N(μm,Σm),
where μm∈Rd is the mean of X for the m-th component and Σm∈Rd×d is its covariance matrix. All these parameters are usually unknown, so in this case the unknown nuisance parameters are
ϑm=(μm,Σm,σm2),m=1,…,M.
Generalized least squares estimator
Let us consider the nonparametric approach to the estimation of the regression coefficients developed in [6]. It is based on the minimization of weighted least squares
Jk;n(b)=def1n∑j=1naj;nk(Yj−∑i=1dbiXji)2,
over all possible b=(b1,…,bd)T∈Rd.
Here ak=(a1;nk,…,an;nk) are the minimax weights for estimation of the k-th component’s distribution. They are defined by
aj;nk=1detΓn∑m=1M(−1)k+mγmk;npjm,
where γmk;n is the (mk)-th minor of the matrix
Γn=(1n∑j=1npjlpji)l,i=1M,
see [6, 10] for details.
Define X=def(Xji)j=1,…,n;i=1,…,d to be the n×d matrix of observed regressors, let Y=def(Y1,…,Yn)T be the vector of observed responses, and let A=defdiag(a1;nk,…,an;nk) be the diagonal weight matrix for the estimation of the k-th component. Then the stationarity condition
∂Jk;n(b)∂b=0
has the unique solution in b,
bˆLS(k,n)=def(XTAX)−1XTAY,
if the matrix XTAX is nonsingular.
Note that the weight vector ak defined by (4) can contain negative weights, so bˆLS(k,n) is not always the point of minimum of Jk;n(b). Nevertheless, in what follows we consider bˆLS(k,n) as a generalized least squares estimate for bk.
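A minimal numerical sketch of this estimator (our own illustration; the names are ours): since the inverse of a matrix is its adjugate divided by its determinant, the cofactor formula (4) for the weights amounts to taking the k-th row of Γn−1, which is how we compute them below.

```python
import numpy as np

def minimax_weights(p, k):
    # p: (n, M) matrix of mixing probabilities p_{jm}; returns a^k of (4).
    # The cofactor expansion in (4) builds row k of Gamma_n^{-1}, so
    # a_j^k = (Gamma_n^{-1} p_j)_k with Gamma_n = (1/n) p^T p.
    n = p.shape[0]
    Gamma = p.T @ p / n
    return p @ np.linalg.inv(Gamma)[k]

def gls_estimate(Y, X, p, k):
    # generalized least squares estimate for component k:
    # b_hat = (X^T A X)^{-1} X^T A Y with A = diag(a^k)
    a = minimax_weights(p, k)
    return np.linalg.solve(X.T @ (a[:, None] * X), X.T @ (a * Y))
```

By construction the weights satisfy (1/n)∑j aj;nk pjm = δkm, which is a convenient sanity check for any implementation.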
The asymptotic behavior of bˆLS(k,n) as n→∞ was investigated in [6]. To describe it we will need some additional notation.
Let us denote by
D(m)=defE[X(O)XT(O)|κ(O)=m]
the matrix of second moments of regressors for the m-th component.
The consistency conditions for the estimator bˆLS(k,n) are given by the following theorem.
Theorem 1 (Theorem 1 in [6]). Assume that:

1. D(m) and σm2 are finite for all m=1,…,M;

2. D(k) is nonsingular;

3. there exists C>0 such that detΓn>C for all n large enough.

Then bˆLS(k,n)⟶P bk as n→∞.
Let us introduce the following notation.
Dis(m)=defE[Xi(O)Xs(O)|κ(O)=m],

Lis(m)=def(E[Xi(O)Xs(O)Xq(O)Xl(O)|κ(O)=m])l,q=1d,

Mis(m,p)=def(Dil(m)Dsq(p))l,q=1d,

αs,q(k)=limn→∞1n∑j=1n(aj;nk)2pjspjq
(if this limit exists),
αs(k)=limn→∞1n∑j=1n(aj;nk)2pjs=∑q=1Mαs,q(k).
The following theorem (Theorem 2 in [6]) provides conditions for the asymptotic normality and describes the dispersion matrix of the estimator bˆLS(k,n).
Maximum likelihood estimation

In this section we discuss the parametric approach to the estimation of bk based on the papers [4, 5]. We will assume that the representation (2) holds with some unknown ϑm∈Θ⊆RL, m=1,…,M.
Then the set of all unknown parameters τ=(bk,β,ϑ) consists of
β=(bm,m=1,…,M,m≠k)
and
ϑ=(ϑ1,…,ϑM).
Here bk is our parameter of interest, β and ϑ are the nuisance parameters.
In this model the log-likelihood for the unknown τ by the sample Ξn can be defined as
L(τ)=∑j=1nL(ξj,pj,τ),
where pj=(pj1,…,pjM)T,
L(ξj,pj,τ)=ln(∑m=1MpjmfX,m(Xj;ϑm)fε,m(Yj−XjTbm)).
The general maximum likelihood estimator (MLE) τˆnMLE=(bˆk,MLE,βˆMLE,ϑˆMLE) for τ is defined as
τˆnMLE=argmaxτL(τ),
where the maximum is taken over all possible values of τ. Unfortunately, this estimator is not applicable to most common parametric mixture models, since the log-likelihood L(τ) usually is not bounded on the set of all possible τ.
For example, this is the case for the normal mixture model (3). Indeed, in this model L(τ)→∞ as σ12→0 and Y1−X1Tb1=0 with all other parameters held fixed.
The usual way to cope with this problem is to use the one-step MLE, which can be considered as one iteration of the Newton–Raphson algorithm for approximate calculation of the MLE, starting from some pilot estimate (see [15], Section 4.5.3). Namely, let τˆn(0) be a pilot estimate for τ. Let us consider τ as a vector of dimension P=d×M+M×L and denote its entries by τi:
τ=(τ1,…,τP).
Denote the gradient of L(τ) by
sn(τ)=∂L(τ)∂τ=(∂L(τ)∂τ1,…,∂L(τ)∂τP)T
and the Hessian of L(τ) by
γn(τ)=(∂2L(τ)∂τi∂τl)i,l=1P.
Then the one-step estimator for τ starting from τˆ(0) is defined as
τˆnOS=τˆ(0)−(γn(τˆ(0)))−1sn(τˆ(0)).
Theorem 4.19 in [15] provides general conditions under which τˆnOS constructed by an i.i.d. sample is consistent, asymptotically normal and asymptotically efficient.1
Note that in our setting the sample Ξn is not an i.i.d. sample. But one can consider it as i.i.d. if the vectors of concentrations (pj1,…,pjM) are generated by some stochastic mechanism as i.i.d. vectors. See [12] for an example of such stochastic concentrations models.
The limit distribution of the normalized one-step estimator is the same as that of the consistent version of the MLE.
So, if the assumptions of theorem 4.19 (or another analogous statement) hold, there is no need to use an iterative procedure to derive an estimate with asymptotically optimal performance. But on samples of moderate size τˆnOS may not be good enough.
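The one-step update itself can be sketched generically (our own illustration; `score` and `hessian` stand for hypothetical user-supplied routines computing sn(τ) and γn(τ), which depend on the chosen parametric model):

```python
import numpy as np

def one_step_estimate(tau0, score, hessian):
    # one Newton-Raphson step on the log-likelihood:
    # tau_OS = tau0 - gamma_n(tau0)^{-1} s_n(tau0)
    return tau0 - np.linalg.solve(hessian(tau0), score(tau0))

# toy check: for the concave "likelihood" L(t) = -(t - 2)^2, one Newton
# step from any starting point lands exactly at the maximizer t = 2
tau_os = one_step_estimate(np.array([0.0]),
                           lambda t: -2.0 * (t - 2.0),
                           lambda t: np.array([[-2.0]]))
```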
Another popular way to obtain a stable estimate for τ is to use some version of the EM-algorithm. The general EM-algorithm is an iterative procedure for approximate calculation of maximum likelihood estimates when information on some variables is missing. We describe here only the algorithm which calculates the EM estimates τˆnEM under the normal mixture model assumptions (3), cf. [4, 5].
The algorithm starts from some pilot estimate
τˆ(0)=(bˆm(0),σˆm2(0),μˆm(0),Σˆm(0),m=1,…,M)
for the full set of the model parameters.
Then for i=1,2,… the estimates are iteratively recalculated in the following way.
Assume that at the i-th iteration the estimates bˆm(i), σˆm2(i), μˆm(i), Σˆm(i), m=1,…,M, are obtained. Then the i-th stage weights are defined as
wjm(i)=wjm(ξj,τˆ(i))=pjmfX,m(Xj;μˆm(i),Σˆm(i))fε,m(Yj−XjTbˆm(i);σˆm2(i))∑l=1MpjlfX,l(Xj;μˆl(i),Σˆl(i))fε,l(Yj−XjTbˆl(i);σˆl2(i))
(note that wjm(i) is the posterior probability P{κj=m|ξj} calculated for τ=τˆ(i)).
Let w¯m=∑j=1nwjm(i). Then the estimators at the (i+1)-th iteration are defined as
μˆm(i+1)=1w¯m∑j=1nwjm(i)Xj,

Σˆm(i+1)=1w¯m∑j=1nwjm(i)(Xj−μˆm(i))(Xj−μˆm(i))T,

bˆm(i+1)=(∑j=1nwjm(i)XjXjT)−1∑j=1nwjm(i)YjXj,

σˆm2(i+1)=1w¯m∑j=1nwjm(i)(Yj−XjTbˆm(i))2.
The iterations are stopped when some stopping condition is fulfilled. For example, it can be
‖τˆ(i+1)−τˆ(i)‖<δ,
where δ is a prescribed target accuracy.
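One iteration of this recursion can be sketched as follows (a simplified illustration under the normal model (3); the data layout and names are our own, and the Gaussian densities are written out to keep the sketch self-contained). Following the displayed update formulas, the previous-iteration μˆm(i) and bˆm(i) are used in the Σ and σ2 updates.

```python
import numpy as np

def mvn_pdf(X, mu, Sigma):
    # density of N(mu, Sigma) evaluated at the rows of X
    d = len(mu)
    diff = X - mu
    quad = np.einsum("ni,il,nl->n", diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def em_step(Y, X, p, est):
    # one EM iteration: posterior weights w_{jm}, then weighted updates.
    # `est` is a list of per-component dicts with keys "b", "s2", "mu",
    # "Sigma" (this layout is our own choice).
    n, M = p.shape
    W = np.empty((n, M))
    for m, e in enumerate(est):
        resid = Y - X @ e["b"]
        W[:, m] = (p[:, m] * mvn_pdf(X, e["mu"], e["Sigma"])
                   * np.exp(-0.5 * resid ** 2 / e["s2"])
                   / np.sqrt(2 * np.pi * e["s2"]))
    W /= W.sum(axis=1, keepdims=True)            # posterior weights w_{jm}
    new = []
    for m, e in enumerate(est):
        w = W[:, m]
        wbar = w.sum()
        mu = (w @ X) / wbar
        Xc = X - e["mu"]                          # previous-iteration mu
        Sigma = (Xc * w[:, None]).T @ Xc / wbar
        b = np.linalg.solve((X * w[:, None]).T @ X, (X * w[:, None]).T @ Y)
        s2 = w @ (Y - X @ e["b"]) ** 2 / wbar     # previous-iteration b
        new.append({"b": b, "s2": s2, "mu": mu, "Sigma": Sigma})
    return new
```

Iterating `em_step` until ‖τˆ(i+1)−τˆ(i)‖<δ reproduces the stopping rule above.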
It is known that this procedure provides stable estimates which (for samples large enough) converge to the point of local maximum of L(τ) closest to the pilot estimator τˆ(0).
So, this estimator can be considered as an approximate version of a root of likelihood equation estimator (RLE).
The asymptotic behavior of τˆnOS and τˆnEM can be described in terms of Fisher’s information matrix I∗(n,τ)=(Iil∗(n,τ))i,l=1P, where
Iil∗(n,τ)=∑j=1nIil(pj,τ),Iil(p,τ)=E∂L(ξp,p,τ)∂τi∂L(ξp,p,τ)∂τl,
where p=(p1,…,pM), ξp is a random vector with the pdf
fp(y,x;τ)=∑m=1Mpmf(x;ϑm)fε(y−xTbm;ϑm).
Under the regularity assumptions (RR) of theorem 70.5 in [3],
(I∗(n,τ))1/2(τˆnMLE−τ)⟶WN(0,E),
where E is the P×P unit matrix.
Assumptions (RR) include the assumption of likelihood boundedness, so they do not hold for the normal mixture model. But if the pilot estimate τˆn(0) is √n-consistent, one needs only a local version of (RR) to derive the asymptotic normality of τˆnOS and τˆnEM, i.e. (RR) must hold in some neighborhood of the true value of τ. These local (RR) hold for the normal mixture model if σm2>0 and Σm are nonsingular for all m=1,…,M.
To use all these results for the construction of an estimator for bk we need √n-consistent pilot estimators for the parameter of interest and the nuisance parameters. They can be derived by the nonparametric technique considered in Section 3. To construct confidence ellipsoids we will also need estimators for the dispersion matrix V from (7) in the nonparametric case and for the information matrix I∗(n,τ) in the parametric case. These estimators are discussed in the next section.
Estimators for nuisance parameters and normalizing matrices
Let us start with the estimation of the dispersion matrix V in Theorem 2. In fact, we need to estimate consistently the matrices D and Σ.
Note that D=D(k)=(Dis(k))i,s=1d, where
Dis(k)=E[Xi(O)Xs(O)|κ(O)=k].
By theorem 4.2 in [10],
Dˆnis(k)=1n∑j=1najkXjiXjs
is a consistent estimate for Dis(k) if E[‖X(O)‖2|κ(O)=m]<∞ for all m=1,…,M and assumption 3 of Theorem 1 holds.
So one can use Dˆn(k)=(Dˆnis(k))i,s=1d as a consistent estimate for D if the assumptions of Theorem 1 hold.
Similarly, Lis(m) can be estimated consistently by
Lˆnis(m)=1n∑j=1najmXjiXjsXjXjT
under the assumptions of Theorem 2.
The same idea can be used to estimate σm2 by
σˆm;n2(0)=1n∑j=1najm(Yj−XjTbˆLS(m,n))2.
The coefficients αs,q(k) can be approximated by
αˆs,q(k)=1n∑j=1n(aj;nk)2pjspjq.
Now, replacing the true D(m), Lis(m), bm, σm2 and αs,q(k) in formula (8) by their estimators Dˆn(m), Lˆnis(m), bˆLS(m,n), σˆm;n2(0) and αˆs,q(k), one obtains a consistent estimator Σˆn for Σ.
Then
Vˆn=Dˆn−1ΣˆnDˆn−1
is a consistent estimator for V.
To get the pilot estimators for the normal mixture model one can use the same approach. Namely, we define
μˆm,n(0)=1n∑j=1najmXj,

Σˆm,n(0)=1n∑j=1najm(Xj−μˆm,n(0))(Xj−μˆm,n(0))T
as estimates for μm and Σm.
By theorem 4.3 from [10], μˆm,n(0), Σˆm,n(0) and σˆm;n2(0) are √n-consistent estimators for the corresponding parameters of the normal mixture model. This allows one to use them as pilot estimators for the one-step and EM estimators.
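These weighted-moment pilot estimators admit a direct sketch (our own illustration, reusing the minimax weights; the names and data layout are ours):

```python
import numpy as np

def pilot_estimates(Y, X, p, b_ls):
    # moment-type pilot estimates: the minimax weights a^m replace the
    # unobserved component indicators inside empirical moments.
    # b_ls[m] is the generalized LS estimate for component m.
    n, M = p.shape
    Gamma_inv = np.linalg.inv(p.T @ p / n)
    est = []
    for m in range(M):
        a = p @ Gamma_inv[m]                    # weights a^m
        mu = (a @ X) / n
        Xc = X - mu
        Sigma = (Xc * a[:, None]).T @ Xc / n
        s2 = a @ (Y - X @ b_ls[m]) ** 2 / n
        est.append({"mu": mu, "Sigma": Sigma, "s2": s2, "b": b_ls[m]})
    return est
```

Note that, because the weights can be negative, the estimates Σˆm,n(0) and σˆm;n2(0) are not guaranteed to be nonnegative definite on small samples, which is the same phenomenon as for Vˆn discussed below.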
Now let us consider the estimation of the Fisher information matrix in the case of the normal mixture model. Define

Iˆil(n,τ)=∑j=1n∂L(ξj,pj,τ)∂τi∂L(ξj,pj,τ)∂τl, Iˆ(n,τ)=(Iˆil(n,τ))i,l=1P, Iˆ(n)=Iˆ(n,τˆ),

where τˆ can be any consistent estimator for τ (e.g. τˆnOS or τˆnEM).
In the normal mixture model we will denote τ(l)=(b(l),μl,Σl,σl2), i.e. the set of all unknown parameters which describe the l-th mixture component.
Theorem 3. Assume that the normal mixture model (3) holds and:

1. σm2>0 and Σm are nonsingular for all m=1,…,M;

2. there exists c>0 such that pjm>c for all j=1,…,n, m=1,…,M, n=1,2,…;

3. τ(l)≠τ(m) for all l≠m, l,m=1,…,M.

Then:

1. there exist 0<c0<C1<∞ such that c0n≤‖I∗(n,τ)‖≤C1n for all n=1,2,…;

2. 1n‖I∗(n,τ)−Iˆ(n)‖→0 in probability as n→∞.
Here and below, for any square matrix I, the symbol ‖I‖ means the operator norm of I, i.e.
‖I‖=supu:‖u‖=1|uTIu|.
1. At first we will show that

uTI(p,τ)u>0

for any τ, any u∈RP with ‖u‖=1, and any p=(p1,…,pM) with pm>c for all m=1,…,M.
Recall that τ=(τ(1),…,τ(M)), where τ(m) corresponds to the parameters describing the m-th component. Let us divide u into analogous blocks u=(u(1)T,…,u(M)T)T.
Then
uTI(p,τ)u=EuT∂∂τL(ξp,p,τ)(∂∂τL(ξp,p,τ))Tu=E(uT∂∂τL(ξp,p,τ))2=E(∑m=1Mu(m)T∂∂τ(m)L(ξp,p,τ))2.
Note that ∂∂τ(m)L(ξp,p,τ) can be represented as
∂∂τ(m)L(ξp,p,τ)=A(τ(m),ξp)pmφτ(m)(ξp)fp(ξp;τ)
where φτ(m) is the normal pdf of the observation ξ from the m-th component, fp is the pdf of the mixture defined by (10), and A(τ(m),ξp) is a vector whose entries are polynomial functions of the entries of ξp.
Then
uTI(p,τ)u=E(∑m=1Mpmu(m)TA(τ(m),ξp)φτ(m)(ξp))21fp(ξp;τ)2.
Note that by assumptions 1 and 2 of the theorem, fp(ξ;τ)>0 for all ξ∈Rd+1. Then u(m)TA(τ(m),ξp) are polynomials of ξp, and φτ(m)(ξp) are exponentials of different (due to assumption 3) and nonsingular (due to assumption 1) quadratic forms of ξp.
Suppose that uTI(p,τ)u=0 for some u with ‖u‖=1. Then (20) implies
u(m)TA(τ(m),ξp)=0a.s.
for all m=1,…,M.
On the other hand, (21) implies
E(u(m)TA(τ(m),ξp)φτ(m)(ξp))2=u(m)TIτ(m)u(m)=0,
where Iτ(m) is the Fisher information matrix for the unknown τ(m) by one observation from the m-th component. By the assumption 1, Iτ(m) is nonsingular, so u(m)=0 for all m=1,…,M. This contradicts the assumption ‖u‖=1.
So, by contradiction, (18) holds. Since uTI(p,τ)u is a continuous function on the compact set of u with ‖u‖=1 and p satisfying assumption 2, from (18) we obtain uTI(p,τ)u>c0 for some c0>0. On the other hand, the representation (19) implies ‖I(p,τ)‖<C1.
Then from I∗(n,τ)=∑j=1nI(pj,τ) we obtain the first statement of the theorem.
2. To prove the second statement note that by the law of large numbers
Δn(τ)=1n(Iˆ(n,τ)−I∗(n,τ))=1n∑j=1n[∂L(ξj,pj,τ)∂τ(∂L(ξj,pj,τ)∂τ)T−E∂L(ξj,pj,τ)∂τ(∂L(ξj,pj,τ)∂τ)T]⟶P0,asn→∞,
since
E‖∂L(ξj,pj,τ)∂τ‖4≤C<∞
for all j.
Let B⊆RP be any open bounded neighborhood of τ. Note that
Esupτ∈B‖∂∂τ∂L(ξj,pj,τ)∂τ(∂L(ξj,pj,τ)∂τ)T‖<C2<∞.
From this together with Δn(τ)⟶P0 we obtain
supτ∈B‖Δn(τ)‖⟶P0
(applying the same technique as in lemma 5.3 from [15]).
The last equation together with τˆn⟶Pτ implies the second statement of the Theorem. □
Confidence ellipsoids for bk
Let Ξn be any random dataset of size n with distribution depending on an unknown parameter b∈Rd. Recall that a set Bα=Bα(Ξn)⊂Rd is called an asymptotic confidence set of significance level α if
limn→∞P{b∉Bα(Ξn)}=α.
We will construct confidence sets for the vector of regression coefficients b=bk from the sample Ξn from a mixture described in Section 2. In the nonparametric case the set will be defined via the statistic
SLS(β)=n(β−bˆLS(k,n))TVˆn−1(β−bˆLS(k,n)).
In the parametric case we take the matrix Iˆ(n) defined by (17) and consider its inverse Iˆ(n)−1=(I−(i,m))i,m=1P.
Note that by (16) and (17) the elements Iˆim of Iˆ(n) correspond to coordinates τi and τm of the vector of unknown parameters τ. Let us take the set of indices lm, m=1,…,d such that τlm=bmk and consider the matrix
[Iˆ(n)−1](k)=(I−(li,lm))i,m=1d.
So, the matrix [Iˆ(n)−1](k) contains the elements of Iˆ(n)−1 corresponding to bk only.
Then we invert this matrix once more:
Iˆk(n)+=([Iˆ(n)−1](k))−1.
This matrix is used to construct the statistic which defines the confidence set:
SOS(β)=(β−bˆOS(k,n))TIˆk(n)+(β−bˆOS(k,n))
or
SEM(β)=(β−bˆEM(k,n))TIˆk(n)+(β−bˆEM(k,n)).
Here bˆOS(k,n) and bˆEM(k,n) are the parts of the estimators τˆnOS and τˆnEM which estimate bk.
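The double inversion producing Iˆk(n)+ can be sketched as follows (our own illustration; `idx` holds the positions l1,…,ld of the coordinates of bk inside τ):

```python
import numpy as np

def block_precision(info_hat, idx):
    # I_k(n)^+ = ([I_hat(n)^{-1}]_(k))^{-1}: invert the full information
    # matrix, extract the rows/columns idx corresponding to b^k,
    # and invert that block again
    block = np.linalg.inv(info_hat)[np.ix_(idx, idx)]
    return np.linalg.inv(block)

# sanity check: for a block-diagonal information matrix the double
# inversion returns the corresponding diagonal block itself
A = np.array([[4.0, 1.0], [1.0, 3.0]])
info = np.block([[A, np.zeros((2, 2))], [np.zeros((2, 2)), 2.0 * np.eye(2)]])
recovered = block_precision(info, [0, 1])
```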
In what follows the symbol ⋆ means any of symbols LS, OS or EM. The confidence set Bα⋆(Ξn) is defined by
Bα⋆(Ξn)={β∈Rd:S⋆(β)≤Qχd2(1−α)},
where Qχd2(1−α) is the (1−α)-quantile of χ2 distribution with d degrees of freedom.
In the parametric case Iˆk(n)+ is a positive definite matrix, so Bα⋆(Ξn) defined by (22) is the interior of an ellipsoid centered at bˆ⋆(k,n).
In the nonparametric case the matrix Vˆn may fail to be positive definite for small n, so the set BαLS(Ξn) can be unbounded. We discuss some remedies for this problem in Section 7.
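A membership test for the nonparametric ellipsoid can be sketched as follows (our own illustration; the χ2 quantiles for small d are hard-coded to keep the sketch dependency-free):

```python
import numpy as np

# 0.95-quantiles of the chi-squared distribution with d degrees of
# freedom (alpha = 0.05)
CHI2_095 = {1: 3.841, 2: 5.991, 3: 7.815}

def in_ls_ellipsoid(beta, b_hat, V_hat, n):
    # S_LS(beta) = n (beta - b_hat)^T V_hat^{-1} (beta - b_hat),
    # accept iff it does not exceed the (1 - alpha)-quantile
    diff = beta - b_hat
    s = n * diff @ np.linalg.solve(V_hat, diff)
    return bool(s <= CHI2_095[diff.size])
```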
Under the assumptions of Theorem 2, limn→∞P{bk∉BαLS(Ξn)}=α.
Theorem 2 and consistency of Vˆn imply that SLS(bk)⟶Wχd2, so
P{bk∉BαLS(Ξn)}=P{SLS(bk)>Qχd2(1−α)}→α
as n→∞. □
Under the assumptions of Theorem 3, limn→∞P{bk∉BαOS(Ξn)}=α.
By theorem 70.5 from [3] one obtains the asymptotic normality of the local MLE
τˆnlMLE=argmaxτ∈DL(τ),
where D is a sufficiently small neighborhood of the true τ. Then the convergence
I∗(n,τ)−1/2(τˆnOS−τ)⟶WN(0,E)
can be obtained from the asymptotic normality of τˆnlMLE by the technique of theorem 4.19 from [15].
Let us denote
S0OS(β)=(β−bˆOS(k,n))TIk(n)+(β−bˆOS(k,n)),
where Ik(n)+ is the theoretical counterpart of Iˆk(n)+:
Ik(n)+=([I∗(n,τ)−1](k))−1.
Then by (23), S0OS(bk)⟶Wχd2.
Note that (23) and the first statement of Theorem 3 imply
ζn=bˆOS(k,n)−bk=Op(n−1/2).
The second statement of Theorem 3 implies
1n‖Iˆk(n)+−Ik(n)+‖⟶P0.
So
S0OS(bk)−SOS(bk)=ζnT1n(Iˆk(n)+−Ik(n)+)ζn⟶P0
and SOS(bk)⟶Wχd2.
This completes the proof. □
Results of simulations
We carried out a small simulation study to assess the performance of the parametric and nonparametric confidence ellipsoids described above. A two-component mixture (M=2) of simple regressions was simulated. The regression models were of the form
Y=b0κ+b1κX+εκ,
where X and Y are the observed regressor and response, κ is the unobserved number of the component, and for an observation from the k-th component the regressor is distributed as N(μk,Σk2) and the regression error εk has zero mean and variance σk2.
The mixing probabilities were simulated by the following stochastic model:
pj;nm=ujm∑s=1Mujs,

where the ujm are independent and uniformly distributed on [0,1].
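This mixing-probability scheme is straightforward to reproduce (our own sketch; names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_concentrations(n, M):
    # p_{jm} = u_{jm} / sum_s u_{js} with independent U[0, 1] entries
    u = rng.uniform(size=(n, M))
    return u / u.sum(axis=1, keepdims=True)

p = simulate_concentrations(1000, 2)
```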
For each sample size n we generated 1000 samples. Parametric (EM) and nonparametric (LS) confidence ellipsoids were constructed by each sample. The parametric ellipsoids were based on EM-estimates which used the LS-estimates as the pilot ones and Iˆk(n)+ as the matrix for the quadratic form in SEM.
The nonparametric confidence ellipsoids were based on the LS-estimates. As mentioned in Section 6, the matrix Vˆn may fail to be positive definite; then the corresponding confidence set is unbounded. In the case of simple regression (24) this drawback can be cured by using the improved weights bj+ defined in [8] instead of ajk in (12)–(14). This technique was used in our simulation study.
All the ellipsoids were constructed with the nominal significance level α=0.05. The frequencies of covering the true bk by the constructed ellipsoids and their mean volumes were calculated in each simulation experiment.
Experiment 1

The values of the parameters for this experiment are presented in Table 1. The errors εk were Gaussian. This is a “totally separated” model, in which the observations can be visually divided into two groups corresponding to the two mixture components (see the left panel of Fig. 1).
Typical scatterplots of data in Experiment 1 (left) and Experiment 2 (right)
Parameters for simulation in Experiments 1 and 2
k      1      2
μk     −2     4
Σk     3      2
σk     1      1
b0k    −3     0.5
b1k    −0.5   2
Experiment 1 results (k is the number of the component)

Covering frequencies

n      LS k=1   LS k=2   EM k=1   EM k=2
100    0.954    0.821    0.946    0.951
10³    0.975    0.914    0.947    0.952
10⁴    0.988    0.951    0.95     0.95
10⁵    0.951    0.963    0.952    0.953
10⁶    0.936    0.949    0.936    0.951

Average volume of ellipsoids

n      LS k=1     LS k=2     EM k=1     EM k=2
100    298×10⁶    262×10⁶    2.177543   0.243553
10³    1364       394        0.186303   0.021234
10⁴    0.476327   0.317320   0.018314   0.002062
10⁵    0.041646   0.030047   0.001845   0.000207
10⁶    0.004121   0.002988   0.000185   0.000021
Covering frequencies and mean volumes of the ellipsoids for different sample sizes n are presented in Table 2. They demonstrate sufficient accordance with the nominal significance level for sample sizes greater than 1000. The extremely large mean volumes of the LS-ellipsoids are due to the poor performance of the estimates Vˆn for small and moderate sample sizes n.
The parametric confidence sets are significantly smaller than the nonparametric ones.
Experiment 2

To see how the standard deviations of the regression errors affect the performance of our algorithms, we reduced them to σk=0.25 in the second experiment, keeping all other parameters unchanged. A typical scatterplot of such data is presented in the right panel of Fig. 1.
The results of this experiment are presented in Table 3. They are compared graphically with the results of Experiment 1 in Fig. 2. The covering frequencies do not change significantly. In comparison with Experiment 1, the average volumes decreased significantly for the EM-ellipsoids but not for the LS ones.
Experiment 2 results (k is the number of the component)

Covering frequencies

n      LS k=1   LS k=2   EM k=1   EM k=2
100    0.886    0.922    0.950    0.943
10³    0.942    0.910    0.945    0.948
10⁴    0.954    0.946    0.951    0.955
10⁵    0.962    0.958    0.943    0.950
10⁶    0.955    0.937    0.961    0.942

Average volume of ellipsoids

n      LS k=1         LS k=2         EM k=1          EM k=2
100    6560085466     43879747920    0.01022148      0.01199164
10³    98.92863       182799.39295   0.0008286214    0.0011644647
10⁴    0.1061491      0.2465760      7.846928×10⁻⁵   1.196875×10⁻⁴
10⁵    0.009584066    0.021603033    7.870776×10⁻⁶   1.191609×10⁻⁵
10⁶    0.0009045894   0.0021206426   7.875141×10⁻⁷   1.189581×10⁻⁶
Average volumes of ellipsoids in Experiment 1 (■) and Experiment 2 (▲). Solid lines for EM, dashed lines for LS (First component)
Experiment 3

Here we consider another set of parameters (see Table 4). The regression errors are Gaussian. In this model the subjects cannot be classified uniquely by their observed variables (see the left panel of Fig. 3).
Typical scatterplots of data in Experiment 3 (left) and Experiment 4 (right)
Parameters for simulation in Experiments 3 and 4

k      1     2
μk     0     1
Σk     2     2
σk     0.5   0.5
b0k    0.5   −0.5
b1k    2     −13
The results are presented in Table 5. Again, the EM-ellipsoids outperform the LS ones.
Experiment 3 results (k is the number of the component)

Covering frequencies

n      LS k=1   LS k=2   EM k=1   EM k=2
100    0.920    0.928    0.949    0.935
10³    0.953    0.943    0.948    0.946
10⁴    0.951    0.957    0.954    0.945
10⁵    0.947    0.963    0.942    0.961
10⁶    0.945    0.951    0.948    0.939

Average volume of ellipsoids

n      LS k=1         LS k=2         EM k=1          EM k=2
100    294.5016       28837.3340     0.05897494      0.06088698
10³    0.6088472      0.6274452      0.005250821     0.004937218
10⁴    0.05837274     0.05594969     0.0005024635    0.0004993278
10⁵    0.005604424    0.00551257     4.987135×10⁻⁵   5.024126×10⁻⁵
10⁶    0.0005625693   0.0005550716   4.978973×10⁻⁶   5.029275×10⁻⁶
Experiment 4

In this experiment the parameters are the same as in Experiment 3, but the regression errors are not Gaussian. We let εk=√(3/5)σkη, where η has the Student t distribution with 5 degrees of freedom. So the errors here have the same variances as in Experiment 3, but their distributions are heavy-tailed. Note that 5 is the minimal number of degrees of freedom for which the assumption E[(εk)4]<∞ of Theorem 2 holds.
A typical data scatterplot for this model is presented on the right panel of Fig. 3. It is visually indistinguishable from the typical pattern of the Gaussian model from Experiment 3, presented on the left panel.
Results of this experiment are presented in Table 6. Note that in this case the covering proportion of the EM-ellipsoids does not tend to the nominal 1−α=0.95 for large n. The covering proportion of the LS-ellipsoids is much closer to 0.95. So heavy tails of the regression error distributions deteriorate the performance of the (Gaussian model based) EM-ellipsoids but not of the nonparametric LS-ellipsoids.
An application to sociological data analysis
To demonstrate the possibilities of the developed technique, we present a toy example of construction of confidence ellipsoids in a statistical analysis of the dependence between the school performance of students and the political attitudes of their adult environment. The analysis was based on two data sets. The first one contains the results of the External Independent Testing in Ukraine in 2016 (EIT-2016). EIT is a set of exams for high school graduates for admission to universities. Data on EIT-20162
Taken from the official site of the Ukrainian Center for Educational Quality Assessment, https://zno.testportal.com.ua/stat/2016.
contain individual scores of examinees with some additional information, including the region of Ukraine in which the examinee’s school was located. The scores range from 100 to 200 points.
Experiment 4 results (k is the number of the component)

Covering frequencies

n      LS k=1   LS k=2   EM k=1   EM k=2
100    0.912    0.915    0.943    0.937
10³    0.948    0.945    0.949    0.959
10⁴    0.937    0.945    0.929    0.953
10⁵    0.947    0.951    0.915    0.930
10⁶    0.961    0.953    0.634    0.763

Average volume of ellipsoids

n      LS k=1         LS k=2         EM k=1          EM k=2
100    997.1288       584.8507       0.06740671      0.06419959
10³    0.7006367      0.6127510      0.005262779     0.004971307
10⁴    0.05798962     0.05624429     0.0004850319    0.0004884329
10⁵    0.005621060    0.005574176    4.667732×10⁻⁵   4.746252×10⁻⁵
10⁶    0.0005616700   0.0005566846   4.666926×10⁻⁶   4.745760×10⁻⁶
We considered the information on the scores in two subjects: Ukrainian language and literature (Ukr) and Mathematics (Math). EIT-2016 contains data on these scores for nearly 246 000 examinees. It is obvious that the Ukr and Math scores should be dependent, and the simplest way to model this dependence is the linear regression:

Ukr=b0+b1Math+ε.
We suppose that the coefficients b0 and b1 may depend on the political attitudes of the adult environment in which the student was brought up. Say, in a family of adherents of Ukrainian independence one expects more interest in the Ukrainian language than in an environment critical toward the existence of the Ukrainian state.
Of course, EIT-2016 does not contain any information on political issues. So we used a second data set with the official data on the results3
See the site of the Central Election Commission (Ukraine), http://www.cvk.gov.ua/vnd_2014/.
of the Ukrainian Parliament elections of 2014 to get approximate proportions of adherents of different political choices in different regions of Ukraine.
29 political parties and blocks took part in the elections. The voters could also vote against all or abstain from voting. We divided the whole population of voters into three components:
(1) Persons who voted for parties which then created the ruling coalition (BPP, People’s front, Fatherland, Radical party, Self Reliance). This is the component of persons with positive attitudes to the pro-European Ukrainian state.
(2) Persons who voted for the Opposition block, voters against all, and voters for small parties which were under the 5% threshold at these elections. These are voters critical of the pro-European line of Ukraine but taking part in the political life of the state.
(3) Persons who did not take part in the voting. These are persons who did not consider the Ukrainian state as their own or are not interested in politics at all.
We used the results of elections to calculate the proportions of each component in each region of Ukraine where the voting was held. These proportions were taken as estimates for the probabilities that a student from a corresponding region was brought up in the environment of a corresponding component. That is, they were considered as the mixing probabilities.
The LS- and EM-ellipsoids for b0 and b1 obtained from these data are presented in Fig. 4. The ellipsoids were constructed with the significance level α=0.05/3≈0.01667, so by the Bonferroni rule they form simultaneous confidence sets with α=0.05. Since the ellipsoids do not intersect in either case, one concludes that the vectors of regression coefficients (b0i,b1i), i=1,…,3, are significantly different for the different components.
Note that the EM approach leads to estimates significantly different from the LS ones. This may suggest that the normal mixture model (3) does not hold for these data. Does the nonparametric model hold for them? The analysis of this problem and a meaningful sociological interpretation of these results lie beyond the scope of this article.
LS (left) and EM (right) confidence ellipsoids for the EIT data. Components: (1) dotted line, (2) dashed line, (3) solid line. b0 on the horizontal axis, b1 on the vertical axis
Concluding remarks
We considered two approaches to the construction of confidence sets for the coefficients of linear regression in the mixture model with varying mixing probabilities. Both approaches demonstrate sufficient agreement between the nominal and real significance levels for sufficiently large samples when the data satisfy the underlying assumptions of the confidence set construction technique. The parametric approach needs significant additional a priori information in comparison with the nonparametric one, but it utilizes this information to provide much smaller confidence sets than in the nonparametric case.
On the other hand, the nonparametric estimators proved to be a good initial approximation for the construction of parametric estimators via the EM-algorithm. Nonparametric confidence sets also perform adequately in cases when the assumptions of the parametric model are broken.
Acknowledgments
We are thankful to the anonymous referees for their attention to our work and fruitful comments.
The research was supported in part by the Taras Shevchenko National University of Kyiv scientific grant N 16ВФ038-02.
References

[1] Autin, F., Pouet, Ch.: Test on the components of mixture densities.
[2] Benaglia, T., Chauveau, D., Hunter, D.R., Young, D.S.: mixtools: An R Package for Analyzing Finite Mixture Models.
[3] Borovkov, A.A.
[4] Faria, S., Soromenho, G.: Fitting mixtures of linear regressions.
[5] Grün, B., Leisch, F.: Fitting finite mixtures of linear regression models with varying & fixed effects in R. In: Rizzi, A., Vichi, M. (eds.)
[6] Liubashenko, D., Maiboroda, R.: Linear regression by observations from mixture with varying concentrations.
[7] Maiboroda, R.
[8] Maiboroda, R., Kubaichuk, O.: Asymptotic normality of improved weighted empirical distribution functions.
[9] Maiboroda, R.E., Sugakova, O.V.
[10] Maiboroda, R., Sugakova, O.: Statistics of mixtures with varying concentrations with application to DNA microarray data analysis.
[11] Maiboroda, R.E., Sugakova, O.V., Doronin, A.V.: Generalized estimating equations for mixtures with varying concentrations.
[12] Maiboroda, R., Sugakova, O.: Sampling bias correction in the model of mixtures with varying concentrations.
[13] McLachlan, G.
[14] Seber, G.A.F., Lee, A.J.
[15] Shao, J.
[16] Titterington, D.M., Smith, A.F., Makov, U.E.