Confidence ellipsoids for regression coefficients by observations from a mixture

Confidence ellipsoids for linear regression coefficients are constructed by observations from a mixture with varying concentrations. Two approaches are discussed. The first one is the nonparametric approach based on the weighted least squares technique. The second one is an approximate maximum likelihood estimation with application of the EM-algorithm for the estimates calculation.


Introduction
This paper is devoted to the technique of construction of confidence ellipsoids for coefficients of linear regression in the case, when statistical data are derived from a mixture with finite number of components and the mixing probabilities (the concentrations of components) are different for different observations. These mixing probabilities are assumed to be known, but the distributions of the components are unknown. (Such mixture models are also known as mixtures with varying concentrations, see [1] and [7]).
The problem of estimation of regression coefficients by such mixed observations was considered in the parametric setting in [4] and [5]. The authors of these papers assume that the distributions of regression errors and regressors are known up to some unknown parameters. The models considered in these papers are called finite mixtures of regression models. Some versions of maximum likelihood estimates are used for the estimation of unknown parameters of distributions and regression coefficients under these models. The EM-algorithm is used for computation of the estimates. (This algorithm is also implemented in R package mixtools, [2]. See [13] for the general theory of EM-algorithm and its application to mixture models).
In [6] a nonparametric approach was proposed under which no parametric models on the distributions of components are assumed. A weighted least squares technique is used to derive estimates for regression coefficients. Consistency and asymptotic normality of the estimates are demonstrated.
Note that in [6,10,11] a nonparametric approach to the analysis of mixtures with varying concentrations was developed in the case when the concentrations of components (mixing probabilities) are known. Some examples of real life data analysis under this assumption were considered in [10,11].
Namely, in [11] an application to the analysis of the Ukrainian parliamentary elections-2006 was considered. Here the observed subjects were respondents of the Four Wave World Values survey (conduced in Ukraine in 2006) and the mixture components were the populations of different electoral behavior adherents. The mixing probabilities were obtained from the official results of voting in 27 regions of Ukraine.
In [10] an application to DNA microarray data analysis was presented. Here the subjects were nearly 3000 of genes of the human genome. They were divided into two components by the difference of their expression in two types of malignantly transformed tissues. The concentrations were defined as a posteriori probabilities for the genes to belong to a given component, calculated by observations on the genes expression in sample tissues.
In this paper we will show how to construct confidence sets (ellipsoids) for regression coefficients under both parametric and nonparametric approaches. Quality of obtained ellipsoids is compared via simulations.
The rest of the paper in organized as follows. In Section 2 we present a formal description of the model. Nonparametric and parametric estimates of regression coefficients and their asymptotic properties are discussed in Sections 3 and 4. Estimation of asymptotic covariances of these estimates is considered in Section 5. The confidence ellipsoids are constructed in Section 6. Results of simulations are presented in Section 7. In Section 8 we present a toy example of application to a real life sociological data. Section 9 contains concluding remarks.

The model
We consider the model of mixture with varying concentrations. It means that each observed subject O belongs to one of M different subpopulations (mixture components). The number of component which the subject belongs to is denoted by κ(O) ∈ {1, 2, . . . , M }. This characteristic of the subject is not observed. The vector of observed variables of O will be denoted by ξ(O). It is considered as a random vector with the distribution depending on the subpopulation which O belongs to.
A structural linear regression model will be used to describe these distributions. (See [14] for general theory of linear regression).
That is, we consider one variable . . , X d (O)) T as a response and all other ones X(O) = (X 1 (O), . . . , X d (O)) T as regressors in the model where b k i , i = 1, . . . , d, k = 1, . . . , M are unknown regression coefficients for the k-th component of the mixture, ε(O) is the error term. Denote by b k = (b k 1 , . . . , b k d ) T the vector of the k-th component's coefficients. We consider ε(O) as a random variable and assume that It is also assumed that the regression error term ε(O) and regressors X(O) are conditionally independent for fixed κ(O) = m, m = 1, . . . , M .
The observed sample Ξ n consists of values (all mixing probabilities p m j are known). To describe completely the probabilistic behavior of the observed data we need to introduce the distributions of ε(O) and X(O) for different components. Let us denote The corresponding probability densities f X,m and f ε,m (x) are defined by The distribution of observed ξ j is a mixture of distributions of components with the mixing probabilities p m j , e.g.
and the probability density In what follows we will discuss two approaches to the estimation of the parameters of interest b k , for a fixed k ∈ {1, . . . , M }. The first one is the nonparametric approach. Under this approach we do not need to know the densities f X,m and f ε,m . Moreover we even do not assume the existence of these densities. The estimates are based on some modification of the least squares tehnique proposed in [6].
In the second, parametric approach we assume that the densities of components are known up to some unknown nuisance parameters ϑ m ∈ Θ ⊆ R L : In the most popular parametric normal mixture model these densities are normal, i.e.
where µ m ∈ R d is the mean of X for the m-th component and Σ m ∈ R d×d is its covariance matrix. All the parameters are usually unknown. So, in this case the unknown nusance parameters are

Generalized least squares estimator
Let us consider the nonparametric approach to the estimation of the regression coefficients developed in [6]. It is based on the minimization of weighted least squares Here a k = (a k 1;n , . . . , a k n;n ) are the minimax weights for estimation of the k-th component's distribution. They are defined by where γ mk;n is the (mk)-th minor of the matrix if the matrix X T AX is nonsingular. Note that the weight vector a k defined by (4) contains negative weights, sô b LS (k, n) is not allways the point of minimum of J k;n (b). But in what follows we will considerb LS (k, n) as a generalized least squares estimate for b k .
The asymptotic behavior ofb LS (k, n) as n → ∞ was investigated in [6]. To describe it we will need some additional notation.
Let us denote by the matrix of second moments of regressors for the m-th component. The consistency conditions for the estimatorb LS (k, n) are given by the following theorem. 2. D (k) is nonsingular.
3. There exists C > 0 such that det Γ n > C for all n large enough.
Thenb LS (k, n) Let us introduce the following notation. The following theorem provides conditions for the asymptotic normality and describes the dispersion matrix of the estimatorb LS (k, n).
Theorem 2 (Theorem 2 in [6]). Assume that 3. There exists C > 0 such that det Γ n > C for all n large enough.

Parametric approach
In this section we discuss the parametric approach to the estimation of b k based on papers [4,5]. We will assume that the representation (2)  Here b k is our parameter of interest, β and ϑ are the nuisance parameters.
In this model the log-likelihood for the unknown τ by the sample Ξ n can be defined as The general maximum likelihood estimator (MLE)τ MLE where the maximum is taken over all possible values of τ . Unfortunately, this estimator is not applicable to most common parametric mixture models, since the loglikelihood L(τ ) usually is not bounded on the set of all possible τ .
For example, it is so in the normal mixture model (3). Really, in this model L(τ ) → ∞ as σ 2 1 → 0 and Y 1 − X T 1 b 1 = 0 with all other parameters being arbitrary fixed.
The usual way to cope with this problem is to use the one-step MLE, which can be considered as one iteration of the Newton-Raphson algorithm of approximate calculation of MLE, starting from some pilot estimate (see [15], section 4.5.3). Namely, letτ (0) n be some pilot estimate for τ . Let us consider τ as a vector of dimension P = d × M + M × L and denote its entries by τ i : Denote the gradient of L(τ ) by .
Then the one-step estimator for τ starting fromτ (0) is defined aŝ Theorem 4.19 in [15] provides general conditions under whichτ OS n constructed by an i.i.d. sample is consistent, asymptotically normal and asymptotically efficient. 1 The limit distribution of the normalized one-step estimate is the same as of the consistent version of MLE.
So, if the assumptions of theorem 4.19 (or other analogous statement) hold, there is no need to use an iterative procedure to derive an estimate with asymptotically optimal performance. But on samples of moderate sizeτ OS n can be not good enough. Another popular way to obtain a stable estimate for τ is to use some version of EM-algorithm. A general EM-algorithm is an iterative procedure for approximate calculation of maximum likelihood estimates when information on some variables is missed. We describe here only the algorithm which calculates EM estimatesτ EM n under the normal mixture model assumptions (3), cf. [4,5].
The algorithm starts from some pilot estimatê for the full set of the model parameters. 1 Note that in our setting the sample Ξn is not an i.i.d. sample. But one can consider it as i.i.d. if the vectors of concentrations (p 1 j , . . . , p M j ) are generated by some stochastic mechanism as i.i.d. vectors. See [12] for an example of such stochastic concentrations models.
Then for i = 1, 2, . . . the estimates are iteratively recalculated in the following way.
Assume that on the i-iteration estimatesb (i) ,σ . . , M are obtained. Then the i-th stage weights are defined as . Then the estimators of the i + 1 iteration are defined aŝ The iterations are stopped when some stopping condition is fulfilled. For example, it can be where δ is a prescribed target accuracy. It is known that this procedure provide stable estimates which (for sample large enough) converge to the point of local minimum of L(τ ) which is the closest to the pilot estimatorτ (0) . So, this estimator can be considered as an approximate version of a root of likelihood equation estimator (RLE).
The asymptotic behavior ofτ OS n andτ EM n can be described in terms of Fisher's information matrix I * (n, τ ) = (I * il (n, τ )) P i,l=1 , where where p = (p 1 , . . . , p m ), ξ p is a random vector with the pdf Under the regularity assumptions (RR) of theorem 70.5 in [3], where E is the R × R unit matrix. Assumptions (RR) include the assumption of likelihood boundedness, so they do not hold for the normal mixture model. But if the pilot estimateτ To use all these results for construction of an estimator for b k we need √ nconsistent pilot estimators for the parameter of interest and nuisance parameters. They can be derived by the nonparametric technique considered in Section 3. To construct confidence ellipsoids we will also need estimators for the dispersion matrix V from (7) in the nonparametric case and estimators for the information matrix I * (n, τ ) in the parametric case. These estimators are discussed in the next section.

Estimators for nuisance parameters and normalizing matrices
Let us start with the estimation of the dispersion matrix V in Theorem 2. In fact, we need to estimate consistently the matrices D and Σ. Similarly, L is(m) can be estimated consistently bŷ under the assumptions of Theorem 2. The same idea can be used to estimate σ 2 m bŷ The coefficients α s,q , one obtains a consistent estimatorΣ n for Σ. ThenV is a consistent estimator for V.
To get the pilot estimators for the normal mixture model one can use the same approach. Namely, we definê whereτ can be any consistent estimator for τ (e.g.τ OS n orτ EM n ). In the normal mixture model we will denote τ (l) = (b (l) , µ l , Σ l , σ 2 l ), i.e. the set of all unknown parameters which describe the l-th mixture component.
Note that ∂ ∂τ (m) L(ξ p , p, τ ) can be represented as where ϕ τ (m) is the normal pdf of the observation ξ from the m-th component, f p is the pdf of the mixture defined by (10), A(τ (m) , ξ p ) is a vector with entries which are polynomial functions from the entries of ξ p . Then Note that by the assumptions 1 and 2 of the theorem f p (ξ; τ ) > 0 for all ξ ∈ R d+1 . Then u T (m) A(τ (m) , ξ p ) are polynomials of ξ p and ϕ τ (m) (ξ p ) are exponentials of different (due to assumption 3) and nonsingular (due to assumption 1) quadratic forms of ξ p .
Suppose that u T I(p, τ )u for some u with u = 1. Then (20) implies for all m = 1, . . . , M . On the other hand, (21) implies where I τ (m) is the Fisher information matrix for the unknown τ (m) by one observation from the m-th component. By the assumption 1, I τ (m) is nonsingular, so u (m) = 0 for all m = 1, . . . , M . This contradicts the assumption u = 1. So, by contradiction, (18) holds. Since u T I(p, τ )u is a continuous function on the compact set of u : u = 1 and p satisfying assumption 2, from (18) we obtain u T I(p, τ )u > c 0 for some c 0 > 0. On the other hand, the representation (19) implies I(p, τ ) < C 1 Then from I * (n, τ ) = n j=1 I(p j , τ ) we obtain the first statement of the theorem. 2. To prove the second statement note that by the law of large numbers  Let Ξ n be any random dataset of size n with distribution dependent of an unknown parameter b ∈ R d . Recall that a set B α = B α (Ξ n ) ⊂ R d is called an asymptotic confidence set of the significance level α if We will construct confidence sets for the vector of regression coefficients b = b k by the sample from a mixture Ξ n described in Section 2. In the nonparametric case the set will be defined by statistics of the form In the parametric case we take the matrixÎ(n) defined by (17) and consider its inverse matrixÎ(n) −1 = (I − (i, m)) R i,m=1 . Note that by (16) and (17) the elementsÎ im ofÎ(n) correspond to coordinates τ i and τ m of the vector of unknown parameters τ . Let us take the set of indices l m , m = 1, . . . , d such that τ lm = b k m and consider the matrix So, the matrix [Î(n) −1 ] (k) contains the elements ofÎ(n) −1 corresponding to b (k) only.
Then we invert this matrix once more: This matrix is used to construct the statistics which defines the confidence set: Hereb OS (k, n) andb EM (k, n) are the parts of the estimatorsτ OS In what follows the symbol ⋆ means any of symbols LS , OS or EM . The confidence set B ⋆ α (Ξ n ) is defined by where Q χ 2 d (1−α) is the (1−α)-quantile of χ 2 distribution with d degrees of freedom. In the parametric caseÎ k (n) + is a positively defined matrice, so B ⋆ α (Ξ n ) defined by (22) is the interior of an ellipsoid centered atb ⋆ (k, n).
In the nonparametric case the matrixV n can be not positively defined for small n, so the set B LS α (Ξ n ) can be unbounded. We will discuss some remedial actions for this problem in Section 7.
Proof. Theorem 2 and consistency ofV n imply that Theorem 5. Under the assumptions of Theorem 3, Sketch proof. By theorem 70.5 from [3] one obtains the asymptotic normality of the local MLE estimateτ where D is a sufficiently small neighborhood of the true τ . Then the convergence can be obtained from the asymptotic normality ofτ lMLE n by the technique of theorem 14.19 from [15].
Let us denote where I k (n) + is the theoretical counterpart ofÎ k (n) + : Then by (23), Note that (23) and the first statement of Theorem 3 imply The second statement of Theorem 3 implies This completes the proof.

Results of simulations
We carried out a small simulation study to assess performance of the parametric and nonperametric confidence intervals described above. A two component mixture (M = 2) of simple regressions was simulated. The regression models were of the form where X k ∼ N (µ k , Σ 2 k ) and Y are the observed regressor and response, κ is the unobserved number of components, ε k is the regression error. The error ε k has zero mean and variance σ 2 k .
The mixing probabilities were simulated by the following stochastic model: where u m j are independent uniformly distributed on [0, 1]. For each sample size n we generated 1000 samples. Parametric (EM) and nonparametric (LS) confidence ellipsoids were constructed by each sample. The parametric ellipsoids were based on EM-estimates which used the LS-estimates as the pilot ones andÎ k (n) + as the matrix for the quadratic form in S EM .
The nonparametric confidence ellipsoids were based on the LS-estimates. As it was mentioned in Section 6, the matrixV n can be not positively defined. Then the corresponding confidence set will be unbounded. In the case of simple regression (24) this drawback can be cured by the use of improved weights b + j defined in [8] instead of a k j in (12)- (14). This technique was used in our simulation study. All the ellipsoids were constructed with the nominal confidence level α = 0.05. The frequencies of covering true b k by the constructed ellipsoids and their mean volume were calculated in each simulation experiment.  Table 1. The errors ε k were Gaussian. This is a "totally separated" model in which the observations can be visually divided into two groups corresponding to different mixture components (see the left panel at Fig. 1).
Covering frequencies and mean volumes of the ellipsoids for different sample sizes n are presented in Table 2. They demonstrate sufficient accordance with the nominal significance level for sample sizes greater then 1000. Extremely large mean volumes for the LS-ellipsoids are due to poor performance of the estimatesV n for small and moderate sample sizes n.  The parametric confidence sets are significantly smaller then the nonparametric ones.

Experiment 2.
To see how the standard deviations of regression errors affect the performance of our algorithms we reduced them to σ k = 0.25 in the second experiment, keeping all other parameters unchanged. A typical scatterplot of such data is presented on the right panel of Fig. 1.
The results of this experiment are presented in Table 3. They are compared graphically to the results of Experiment 1 in Fig. 2. The covering frequencies are not significantly changed. In comparison to Experiment 1, the average volumes decreased significantly for EM-ellipsoids but not for the LS ones.

Experiment 3.
Here we consider another set of parameters (see Table 4). The regression errors are Gaussian. In this model the subjects cannot be classified uniquely by their observed variables (see the left panel in Fig. 3).
The results are presented in Table 5. Again, the EM-ellipsoids outperform the LS ones.  Experiment 4. In this experiment the parameters are the same as in Experiment 3, but the regression errors are not Gaussian. We let ε k = 3/5σ k η, where η has the Student-T distribution with 5 degrees of freedom. So the errors here have the same variances as in Experiment 3, but their distributions are heavy-tailed. Note that 5 is the minimal number of degrees of freedom for which the assumption E(ε k ) 4 of Theorem 2 holds. A typical data scatterplot for this model is presented on the right panel of Fig. 3. It is visually indistinguishable from the typical pattern of the Gaussian model from Experiment 3, presented on the left panel.
Results of this experiment are presented in Table 6. Note that in this case the covering proportion of the EM-ellipsoids does not tend to the nominal 1 − α = 0.95 for large n. The covering proportion of LS-ellipsoids is much nearer to 0.95. So the heavy tails of distributions of the regression errors deteriorate performance of (Gaussian model based) EM-ellipsoids but not of nonparametric LS-ellipsoids.

An application to sociological data analysis
To demonstrate possibilities of the developed technique, we present a toy example of construction of confidence ellipsoids in statistical analysis of dependence between Table 4. Parameters for simulation in Experiments 3 and 4 school performance of students and political attitudes of their adult environment. The analysis was based on two data sets. The first one contains results of the External independent testing in Ukraine in 2016 -EIT-2016. EIT is a a set of exams for high schools graduates for admission to universities. Data on EIT-2016 2 contain individual scores of examinees with some additional information including the region of Ukraine at which the examinee's school was located. The scores range from 100 to 200 points. We considered the information on the scores on two subjects: Ukrainian language and literature (Ukr) and on Mathematics (Math). EIT-2016 contains data on these scores for nearly 246 000 examinees. It is obvious that Ukr and Math scores should be dependent and the simplest way to model this dependency is the linear regression: We suppose that the coefficients b 0 and b 1 may depend on the political attitudes of the adult environment in which the student was brought up. Say, in a family of Ukrainian independence adherents one expects more interest to Ukrainian language than in an environment critical toward the Ukrainian state existence. Of course EIT-2016 does not contain any information on political issues. So we used the second data set with the official data on the results 3 of the Ukrainian Parliament elections-2014 to get approximate proportions of adherents of different political choices in different regions of Ukraine.
29 political parties and blocks took part in the elections. The voters were able also to vote against all or not to take part in the voting. We divided all the population of voters into three components: (1) Persons who voted for parties which then created the ruling coalition (BPP, People's front, Fatherland, Radical party, Self Reliance). This is the component of persons with positive attitudes to the pro-European Ukrainian state.
(2) Persons who voted for the Opposition block, voters against all, and voters for small parties which where under 5% threshold at these elections. These are voters critical to the pro-European line of Ukraine but taking part in the political life of the state.
(3) Persons who did not take part in the voting. These are persons who did not consider Ukrainian state as their own one or are not interested in politics at all.
We used the results of elections to calculate the proportions of each component in each region of Ukraine where the voting was held. These proportions were taken as estimates for the probabilities that a student from a corresponding region was brought up in the environment of a corresponding component. That is, they were considered as the mixing probabilities.
The LS-and EM-ellipsoids for b 0 and b 1 obtained by these data are presented on Fig. 4. The ellipsoids were constructed with the significance level α = 0.05/3 ≈ 0.01667, so by the Bonferroni rule, they are unilateral confidence sets with α = 0.05. Since the ellipsoids are not intersecting in both cases, one concludes that the vectors of regression coefficients (b i 0 , b i 1 ), i = 1, . . . , 3 are significantly different for different components. Note that the EM approach leads to estimates significantly different from the LS ones. This may suggest that the normal mixture model (3) does not hold for the data. Does the nonparametric model hold for them? Analysis of this problem and meaningful sociological interpretation of these results lie beyond the scope of this article.

Concluding remarks
We considered two approaches to the construction of confidence sets for coefficients of the linear regression in the mixture model with varying mixing probabilities. Both approaches demonstrate sufficient agreement of nominal and real significance levels for sufficiently large samples when the data satisfy underlying assumptions of the confidence set construction technique. The parametric approach needs a significant additional a priori information in comparison with the nonparametric one. But it utilizes this information providing much smaller confidence sets than in the nonparametric case.
On the other hand, the nonparametric estimators proved to be a good initial approximation for the construction of parametric estimators via the EM-algorithm. Nonparametric confidence sets also perform adequately in the cases when the assumptions of parametric model are broken.