Jackknife covariance matrix estimation for observations from mixture

A general jackknife estimator for the asymptotic covariance of moment estimators is considered in the case when the sample is taken from a mixture with varying concentrations of components. Consistency of the estimator is demonstrated. A fast algorithm for its calculation is described. The estimator is applied to construction of confidence sets for regression parameters in the linear regression with errors in variables. An application to sociological data analysis is considered.


Introduction
Finite Mixture Models (FMM) are widely used in the analysis of biological, economic and sociological data. For a comprehensive survey of different statistical techniques based on FMMs, see [9]. Mixtures with Varying Concentrations (MVC) is a subclass of these models in which the mixing probabilities are not constant, but vary for different observations (see [4,5]).
In this paper we consider the application of the jackknife technique to the estimation of the asymptotic covariance matrix (ACM), i.e. the covariance matrix of the limit distribution of an asymptotically normal estimator, in the case when the data are described by the MVC model. The jackknife is a well-known resampling technique usually applied to i.i.d. samples (see Section 5.5.2 in [11], Chapter 4 in [15], Chapter 2 in [12]). On jackknife estimates of variance for censored and dependent data, see [14]. Its modification for the MVC model, in which the observations are still independent but not identically distributed, requires some effort.
We obtain a general theorem on the consistency of jackknife estimators of the ACM for moment estimators in MVC models and apply this result to construct confidence sets for regression coefficients in linear errors-in-variables models for MVC data. On general errors-in-variables models, see [2,3,8]. The model and the estimators of the regression coefficients considered in this paper were proposed in [6], where the asymptotic normality of these estimators is shown.
The rest of the paper is organized as follows. In Section 2 we introduce the MVC model and describe the estimation technique for these models based on weighted moments. In Section 3 the jackknife estimates for the ACM are introduced and conditions for their consistency are formulated. Section 4 is devoted to an algorithm for fast computation of the jackknife estimates. In Section 5 we apply the previous results to construct confidence sets for linear regression coefficients in errors-in-variables models with MVC. In Section 6 results of simulations are presented. In Section 7 we present results of applying the proposed technique to the analysis of sociological data. Proofs are placed in Section 8. Section 9 contains concluding remarks.

Mixtures with varying concentrations
We observe variables of $n$ independent subjects, $\xi_j = \xi(O_j)$. The probability $p^m_{j;n}$ of obtaining the $j$-th subject from the $m$-th component can be regarded as the concentration of the $m$-th component in the mixture at the moment when the $j$-th observation was made. The concentrations are known and can vary between observations.
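So the distribution of $\xi_j$ is described by the model of mixture with varying concentrations:

$$\mathsf{P}\{\xi_j \in A\} = \sum_{m=1}^{M} p^m_{j;n}\, F^{(m)}_\xi(A). \tag{1}$$

We will denote by $\mu^{(k)}$ the vector of means of the $k$-th component distribution $F^{(k)}_\xi$ and estimate it by the weighted mean

$$\bar\xi^{(k)}_{;n} = \sum_{j=1}^{n} a^{(k)}_{j;n}\,\xi_j, \tag{2}$$

where $a^{(k)}_{j;n}$ are some weights dependent on the components' concentrations, but not on the observed $\xi_j = \xi_{j;n}$. (In what follows, the subscript $;n$ indicates that the corresponding quantity is considered for the sample size $n$. In most cases this subscript is dropped to simplify notation.) To obtain unbiased estimates in (2) one needs to select weights satisfying the assumption

$$\sum_{j=1}^{n} a^{(k)}_{j;n}\, p^m_{j;n} = \mathbb{1}\{k = m\} \quad \text{for all } m = 1,\dots,M. \tag{3}$$

Then $\mathsf{E}\,\bar\xi^{(k)}_{;n} = \mu^{(k)}$. We write $p = p_{;n}$ for the $n \times M$ matrix with entries $p^m_j$, with columns $p^{(m)} = (p^m_1,\dots,p^m_n)^T$ and rows $p_j = (p^1_j,\dots,p^M_j)^T$, and the same notation is used for the matrix $a$.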
In this notation the unbiasedness condition (3) reads

$$a^T p = E, \tag{4}$$

where $E$ means the $M \times M$ unit matrix. There can be many choices of $a$ satisfying (4). In [4,5] minimax weights are considered, defined by

$$a = p\,\Gamma^{-1}, \tag{5}$$

where $\Gamma = \Gamma_{;n} = p^T p$ is the Gram matrix of the set of concentration vectors $p^{(1)},\dots,p^{(M)}$. (In fact, in [4] and [5] the weights are defined as $\tilde a = na$ and $\bar\xi^{(k)} = \frac{1}{n}\sum_{j=1}^{n}\tilde a^{(k)}_j \xi_j$. In this paper we adopt a notation that simplifies the formulas for fast estimator calculation in Section 4.) In what follows, we assume that these vectors are linearly independent, so $\det\Gamma > 0$ and $\Gamma^{-1}$ exists. See [5] on the optimal properties of the estimates for concentration distributions based on the minimax weights (5).
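The minimax weights can be illustrated numerically. The following is a minimal sketch, assuming the weights have the form $a = p\,\Gamma^{-1}$ with $\Gamma = p^T p$; the function name `minimax_weights`, the sample size and the two-component concentrations $j/n$, $1 - j/n$ are illustrative choices, not part of the method's definition. The sketch checks that the unbiasedness condition holds up to rounding error.

```python
import numpy as np

def minimax_weights(p):
    """Minimax weights a = p Gamma^{-1} with Gamma = p^T p, for an (n, M) matrix p."""
    gamma = p.T @ p
    # Gamma is symmetric, so solving Gamma Z = p^T and transposing gives p Gamma^{-1}.
    return np.linalg.solve(gamma, p.T).T

n = 200
j = np.arange(1, n + 1)
p = np.column_stack([j / n, 1 - j / n])  # illustrative two-component concentrations
a = minimax_weights(p)

# Unbiasedness condition: a^T p should equal the M x M unit matrix.
unbiasedness_gap = np.abs(a.T @ p - np.eye(p.shape[1])).max()
print(unbiasedness_gap)
```

Solving the linear system instead of forming $\Gamma^{-1}$ explicitly is a numerical convenience only; for small $M$ either approach would do.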
To describe the asymptotic behavior of $\bar\xi^{(k)}_{;n}$ as $n \to \infty$, we will calculate its covariance matrix.
Notice that […] Assume that the limits […] (6) exist. Then the limits […] exist also, since due to (3) we have $\sum_{l=1}^{M} p^l_j = 1$ for all $j$. So, under this assumption, […] This theorem is a simple corollary of Theorem 4.3 in [5].

Jackknife estimation of ACM of moment estimators
In what follows, we will consider unknown parameters of the component distribution $F^{(k)}_\xi$ which can be represented in the form

$$\vartheta = \vartheta^{(k)} = H(\mu^{(k)}),$$

where $H : \mathbb{R}^d \to \mathbb{R}^q$ is some known function. A natural estimator for such a parameter from the sample $\xi_1,\dots,\xi_n$ is

$$\hat\theta = \hat\theta^{(k)}_{;n} = H(\bar\xi^{(k)}_{;n}). \tag{9}$$

The asymptotic behavior of this estimator is described by the following theorem.
This theorem is a simple consequence of our Theorem 1 and Theorem 3 in Section 5, Chapter 1 of [1].
So, $V_\infty$ defined by (10) is the ACM of the estimator $\hat\theta^{(k)}$ (the covariance matrix of the limit normal distribution of the normalized difference between the estimator and the estimated parameter). If it were known, one could use it to construct tests for hypotheses on $\vartheta^{(k)}$ or to derive a confidence set for $\vartheta^{(k)}$. In fact, for most estimators the ACM is unknown. Usually some estimate of $V_\infty$ is used to replace its true value in statistical algorithms.
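The jackknife is one of the most general techniques of ACM estimation. Let $\hat\theta$ be any estimator of $\vartheta$ from the data $\xi_1,\dots,\xi_n$:

$$\hat\theta = \hat\theta(\xi_1,\dots,\xi_n).$$

Consider estimates of the same form which are calculated from all observations except one:

$$\hat\theta_{i-} = \hat\theta(\xi_1,\dots,\xi_{i-1},\xi_{i+1},\dots,\xi_n).$$

Then the jackknife estimator for $V_\infty$ is defined by

$$\hat V_{;n} = n\sum_{i=1}^{n}(\hat\theta_{i-} - \hat\theta)(\hat\theta_{i-} - \hat\theta)^T. \tag{11}$$

In our case $\hat\theta = H(\bar\xi^{(k)})$, so $\hat\theta_{i-} = H(\bar\xi^{(k)}_{i-})$ with $\bar\xi^{(k)}_{i-} = \sum_{j \ne i} a^{(k)}_{j,i-}\,\xi_j$. Here $a_{i-} = p_{i-}\Gamma_{i-}^{-1}$ is the matrix of minimax weights calculated from the concentration matrix $p_{i-}$, which is obtained from $p$ by replacing its $i$-th row with zeros, and $\Gamma_{i-} = p_{i-}^T p_{i-}$. Notice that 0 is placed at the $i$-th row of $p_{i-}$ as a placeholder only, to preserve the numbering of the rows in $p_{i-}$ and $a_{i-}$, which corresponds to the numbering of subjects in the sample.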
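The leave-one-out construction can be sketched directly, at quadratic cost in $n$. This is an illustration under stated assumptions, not a reference implementation: the normalization $n\sum_i d_i d_i^T$ with $d_i = \hat\theta_{i-} - \hat\theta$ is assumed, the $i$-th row is dropped rather than replaced by a zero placeholder (which gives the same weighted means), and the names `weighted_mean`, `jackknife_acm_naive` and the simulated data are hypothetical.

```python
import numpy as np

def weighted_mean(p, xi, k):
    """Weighted mean for component k using minimax weights a = p Gamma^{-1}."""
    a = np.linalg.solve(p.T @ p, p.T).T
    return a[:, k] @ xi

def jackknife_acm_naive(p, xi, k, H):
    """Naive leave-one-out jackknife estimate of the ACM of H(weighted mean).

    Assumed normalization: n * sum_i d_i d_i^T, d_i = theta_{i-} - theta_hat.
    """
    n = p.shape[0]
    theta = np.atleast_1d(H(weighted_mean(p, xi, k)))
    V = np.zeros((theta.size, theta.size))
    for i in range(n):
        keep = np.arange(n) != i  # drop the i-th subject
        theta_i = np.atleast_1d(H(weighted_mean(p[keep], xi[keep], k)))
        d = theta_i - theta
        V += np.outer(d, d)
    return n * V

# Illustrative data: two components with different means, concentrations j/n and 1 - j/n.
rng = np.random.default_rng(0)
n = 300
j = np.arange(1, n + 1)
p = np.column_stack([j / n, 1 - j / n])
from_first = rng.random(n) < p[:, 0]
xi = np.where(from_first[:, None],
              rng.normal(0.0, 1.0, size=(n, 2)),
              rng.normal(3.0, 1.0, size=(n, 2)))
V_hat = jackknife_acm_naive(p, xi, k=0, H=lambda m: m)
print(V_hat)
```

Each pass of the loop recomputes the Gram matrix from scratch, which is what makes the direct approach quadratic in $n$.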

Assumption
∞ . For proof see Section 8.

Fast calculation algorithm for jackknife estimator
Direct calculation of $\hat V_{;n}$ by (11)-(15) needs $\sim Cn^2$ elementary operations. Here we consider an algorithm which reduces the computational complexity to $\sim Cn$ operations (linear complexity). Notice that $\Gamma_{i-} = \Gamma - p_i p_i^T$, so

$$\Gamma_{i-}^{-1} = \Gamma^{-1} + \frac{\Gamma^{-1} p_i p_i^T \Gamma^{-1}}{1 - p_i^T \Gamma^{-1} p_i}. \tag{16}$$

(Formula (16) can be verified directly by checking that $\Gamma_{i-}^{-1}\Gamma_{i-} = E$. It is also a corollary of the Sherman-Morrison-Woodbury formula, see A.9.4 in [13].) Let us denote […] (Zero at the $i$-th row is a placeholder, as in the matrix $p_{i-}$.) Applying (16) one obtains […] Then the following algorithm allows one to calculate $\hat V^{(m)}$ for all $m = 1,\dots,M$ at once in $\sim Cn$ operations.
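The effect of the rank-one update can be sketched as follows. This is not the paper's algorithm verbatim (its steps are not reproduced here) but a minimal illustration, assuming minimax weights $a = p\,\Gamma^{-1}$: expanding $\Gamma_{i-}^{-1}p_{i-}^T\xi$ with the Sherman-Morrison formula gives $\bar\xi^{(k)}_{i-} = \bar\xi^{(k)} - a^{(k)}_i(\xi_i - p_i^T\bar\Xi)/(1 - h_i)$, where $\bar\Xi$ is the $M \times d$ matrix of full-sample weighted means and $h_i = p_i^T\Gamma^{-1}p_i$. The symbols $h_i$, $\bar\Xi$ and the function names are introduced here for illustration only.

```python
import numpy as np

def loo_weighted_means(p, xi):
    """All leave-one-out weighted means at once, via the rank-one update.

    Returns an array of shape (n, M, d); entry [i, k] is the weighted mean
    for component k computed without the i-th subject.
    """
    gamma_inv = np.linalg.inv(p.T @ p)      # M x M, computed once
    a = p @ gamma_inv                       # (n, M) minimax weights
    means = a.T @ xi                        # (M, d) full-sample weighted means
    h = np.einsum("ij,ij->i", a, p)         # h_i = p_i^T Gamma^{-1} p_i
    resid = (xi - p @ means) / (1.0 - h)[:, None]
    return means[None, :, :] - a[:, :, None] * resid[:, None, :]

def loo_weighted_means_direct(p, xi, i):
    """Reference computation: drop subject i and recompute from scratch."""
    keep = np.arange(p.shape[0]) != i
    pk = p[keep]
    return np.linalg.solve(pk.T @ pk, pk.T @ xi[keep])

rng = np.random.default_rng(1)
n = 50
j = np.arange(1, n + 1)
p = np.column_stack([j / n, 1 - j / n])     # illustrative concentrations
xi = rng.normal(size=(n, 3))
fast = loo_weighted_means(p, xi)
max_diff = max(np.abs(fast[i] - loo_weighted_means_direct(p, xi, i)).max()
               for i in range(n))
print(max_diff)
```

For fixed $M$ and $d$ the fast version uses a constant number of operations per subject, so applying $H$ to each row and accumulating the outer products keeps the total cost linear in $n$.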

Regression with errors in variables
In this section we consider a mixture of simple linear regressions with errors in variables. A modification of orthogonal regression estimation technique was proposed for the regression coefficients estimation in [6]. We will show how the jackknife ACM estimators from Section 3 can be applied in this case to construct confidence sets for the regression coefficients.
Recall the errors-in-variables regression model in the context of mixture with varying concentrations.
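We consider the case when each subject $O$ has two variables of interest, $x(O)$ and $y(O)$. These variables are related by a strict linear dependence with coefficients depending on the component that $O$ belongs to:

$$y(O) = b^{(m)}_0 + b^{(m)}_1 x(O) \quad \text{if } O \text{ belongs to the } m\text{-th component}, \tag{20}$$

where $b^{(m)}_0$ and $b^{(m)}_1$ are the regression coefficients for the $m$-th component.
The true values of $x(O)$ and $y(O)$ are unobservable. These variables are observed with measurement errors:

$$X(O) = x(O) + \varepsilon_X(O), \qquad Y(O) = y(O) + \varepsilon_Y(O). \tag{21}$$

Here we assume that the errors $\varepsilon_X(O)$ and $\varepsilon_Y(O)$ are conditionally independent given that $O$ belongs to the $m$-th component, and that in this case they have the same variance $\sigma^2_{(m)}$, for all $m = 1,\dots,M$. So the distributions of $\varepsilon_X(O)$ and $\varepsilon_Y(O)$ can be different, but their variances are the same for a given subject. We assume that $\sigma^2_{(m)} > 0$, $m = 1,\dots,M$, and that these variances are unknown.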
As in Section 2, we observe a sample $(X_j, Y_j)$, $j = 1,\dots,n$ […] In the case of a homogeneous sample, when there is no mixture, the classical way to estimate $b_0$ and $b_1$ is orthogonal regression. That is, the estimator is taken as the minimizer of the total least squares functional, which is the sum of squared minimal Euclidean distances from the observation points to the regression line. The modification of this technique for mixtures with varying concentrations proposed in [6] leads to the following estimators for $b^{(k)}_0$ and $b^{(k)}_1$: […] (23). Conditions for consistency and asymptotic normality of these estimators are given in Theorems 5.1 and 5.2 of [6]. For example, under the assumptions of Theorem 4, $\sqrt{n}\,(\hat\theta^{(k)}_{;n} - \vartheta^{(k)})$ is asymptotically normal, where $\vartheta^{(k)} = (b^{(k)}_0, b^{(k)}_1)^T$. The ACM $V_\infty$ of the estimator is given by formula (21) in [6]. This formula is rather complicated and involves theoretical moments of the unobservable variables $x(O)$, $\varepsilon_X(O)$ and $\varepsilon_Y(O)$. So it is natural to estimate $V_\infty$ by the jackknife technique, which does not require knowing or estimating these moments.
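The estimator $\hat\theta^{(k)}_{;n}$ of $\vartheta^{(k)} = (b^{(k)}_0, b^{(k)}_1)^T$ can be represented in terms of Section 3 if we expand the space of observable variables by including quadratic terms. That is, we consider the sample $\xi_j$, $j = 1,\dots,n$, consisting of $X_j$, $Y_j$ and their second-order terms. Then the estimator (23) can be represented in form (9) with a suitable function $H$, and we can construct the jackknife estimator $\hat V^{(k)}_{;n}$ for $V^{(k)}_\infty$ by (11).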
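To illustrate how an orthogonal regression estimator becomes a function $H$ of first and second moments, here is a minimal sketch of the classical homogeneous-sample case with equal error variances. The ordering of the moment vector $(\overline{X}, \overline{Y}, \overline{X^2}, \overline{XY}, \overline{Y^2})$ and the function name are assumptions made for this example; the exact form of the mixture estimator in [6] is not reproduced here and may differ.

```python
import math

def orthogonal_regression_from_moments(m):
    """Classical orthogonal regression (equal error variances) from moments.

    m = (mean X, mean Y, mean X^2, mean XY, mean Y^2); this ordering is an
    assumption of the sketch. Returns (b0, b1).
    """
    mx, my, mxx, mxy, myy = m
    s_xx = mxx - mx * mx
    s_xy = mxy - mx * my
    s_yy = myy - my * my
    if s_xy == 0.0:
        raise ValueError("slope is not identified when the covariance is zero")
    diff = s_yy - s_xx
    b1 = (diff + math.sqrt(diff * diff + 4.0 * s_xy * s_xy)) / (2.0 * s_xy)
    b0 = my - b1 * mx
    return b0, b1

# Noise-free points on the line y = 1 + 2x: the estimator recovers the line.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x for x in xs]
n = len(xs)
moments = (
    sum(xs) / n,
    sum(ys) / n,
    sum(x * x for x in xs) / n,
    sum(x * y for x, y in zip(xs, ys)) / n,
    sum(y * y for y in ys) / n,
)
b0, b1 = orthogonal_regression_from_moments(moments)
print(b0, b1)
```

In the mixture setting one would pass weighted moments for the $k$-th component instead of plain sample moments; this substitution is the idea described in the text, sketched here only for the homogeneous case.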
Theorem 4. Assume that the following conditions hold.

Assumption
This theorem is a simple combination of Theorem 3 and Theorem 5.2 from [6].
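In what follows we assume that $V^{(k)}_\infty$ […] We can construct a confidence set (ellipsoid) for the unknown parameter $\vartheta^{(k)}$ by applying Theorem 4 in the usual way. Namely, for any $t \in \mathbb{R}^2$ let

$$T_{;n}(t) = n\,(t - \hat\theta^{(k)}_{;n})^T (\hat V^{(k)}_{;n})^{-1} (t - \hat\theta^{(k)}_{;n}).$$

Then, under the assumptions of Theorem 4, if $\det V^{(k)}_\infty \neq 0$,

$$T_{;n}(\vartheta^{(k)}) \to \eta \quad \text{weakly as } n \to \infty, \tag{24}$$

where $\eta$ is a random variable (r.v.) with chi-square distribution with 2 degrees of freedom. Consider

$$B_{\alpha;n} = \{t \in \mathbb{R}^2 : T_{;n}(t) \le Q^{\eta}(1-\alpha)\}, \tag{25}$$

where $Q^{\eta}(\alpha)$ means the quantile of level $\alpha$ for the r.v. $\eta$. By (24), $\mathsf{P}\{\vartheta^{(k)} \in B_{\alpha;n}\} \to 1 - \alpha$, so $B_{\alpha;n}$ is an asymptotic confidence set for $\vartheta^{(k)}$ with significance level $\alpha$ (asymptotic coverage probability $1 - \alpha$).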
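Membership in such a confidence ellipsoid is easy to check numerically. The sketch below assumes the statistic has the usual quadratic form $T_{;n}(t) = n\,(t - \hat\theta)^T \hat V^{-1} (t - \hat\theta)$ and uses the fact that the chi-square distribution with 2 degrees of freedom has distribution function $1 - e^{-x/2}$, so its quantile of level $1 - \alpha$ equals $-2\ln\alpha$. The function name and the numbers in the usage lines are illustrative only.

```python
import math

def in_confidence_ellipsoid(t, theta_hat, V_hat, n, alpha=0.05):
    """Check whether t lies in the asymptotic confidence ellipsoid.

    Assumes T_n(t) = n * (t - theta_hat)^T V_hat^{-1} (t - theta_hat) for a
    2 x 2 matrix V_hat given as nested sequences.
    """
    d0 = t[0] - theta_hat[0]
    d1 = t[1] - theta_hat[1]
    (v00, v01), (v10, v11) = V_hat
    det = v00 * v11 - v01 * v10
    if det <= 0.0:
        raise ValueError("V_hat must be positive definite")
    # Quadratic form d^T V_hat^{-1} d, written out for the 2 x 2 case.
    quad = (v11 * d0 * d0 - (v01 + v10) * d0 * d1 + v00 * d1 * d1) / det
    threshold = -2.0 * math.log(alpha)  # chi-square(2) quantile of level 1 - alpha
    return n * quad <= threshold

# Illustrative values (not taken from the paper).
theta_hat = (1.0, 2.0)
V_hat = ((1.0, 0.0), (0.0, 1.0))
inside = in_confidence_ellipsoid((1.2, 2.0), theta_hat, V_hat, n=100)
outside = in_confidence_ellipsoid((1.3, 2.0), theta_hat, V_hat, n=100)
print(inside, outside)
```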

Results of simulation
To assess performance of the proposed technique we performed a small simulation study. In the following three experiments we calculated covering frequencies of confidence sets for regression coefficients in the model (20)-(22) constructed by (25) and corresponding one-dimensional confidence intervals.
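In all experiments, for each sample size from $n = 100$ to $n = 5000$ we generated $B = 1000$ samples and calculated estimates of the parameters and the corresponding confidence sets. The one-dimensional confidence intervals for $b^{(k)}_i$ were calculated by the standard formula

$$\left[\hat b^{(k)}_{i;n} - \lambda_{\alpha/2}\sqrt{\hat v^{(k)}_{ii;n}/n},\ \ \hat b^{(k)}_{i;n} + \lambda_{\alpha/2}\sqrt{\hat v^{(k)}_{ii;n}/n}\right],$$

where $\hat v^{(k)}_{ii;n}$ is the $i$-th diagonal entry of the matrix $\hat V^{(k)}_{;n}$ and $\lambda_{\alpha/2}$ is the quantile of level $1 - \alpha/2$ for the standard normal distribution. The significance level for the sets and intervals was taken as $\alpha = 0.05$ (nominal coverage probability 0.95).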
Then the numbers of cases when the confidence set covers the true value of the estimated parameter were calculated and divided by B. These are the covering frequencies reported in the tables below.
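The interval and the covering frequency can be sketched as follows, assuming the interval has the standard form $\hat b \pm \lambda_{\alpha/2}\sqrt{\hat v_{ii}/n}$, where $\hat v_{ii}$ is the corresponding diagonal entry of the estimated ACM. The function names and the numbers in the usage lines are hypothetical and serve only as an illustration.

```python
from statistics import NormalDist

def normal_interval(estimate, v_ii, n, alpha=0.05):
    """Interval estimate +/- lambda * sqrt(v_ii / n), lambda = N(0,1) quantile of level 1 - alpha/2."""
    lam = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = lam * (v_ii / n) ** 0.5
    return estimate - half_width, estimate + half_width

def covering_frequency(intervals, true_value):
    """Share of intervals (lo, hi) that contain the true parameter value."""
    hits = sum(lo <= true_value <= hi for lo, hi in intervals)
    return hits / len(intervals)

# Illustrative numbers only.
lo, hi = normal_interval(estimate=2.0, v_ii=4.0, n=100)
freq = covering_frequency([(0.0, 1.0), (2.0, 3.0), (0.5, 2.0)], true_value=0.75)
print(lo, hi, freq)
```

In a simulation study the list of intervals would hold one interval per generated sample, so that the returned share is the covering frequency over the $B$ replications.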
In all the experiments we considered a two-component mixture ($M = 2$) with the concentrations of components $p^1_{j;n} = j/n$, $p^2_{j;n} = 1 - j/n$.
The regression coefficients were taken as b (1)

Experiment 1.
In this experiment we let $\varepsilon_X, \varepsilon_Y \sim N(0, 0.25)$. The variance of the errors is so small that the regression coefficients can be estimated without difficulty even for small sample sizes. The covering frequencies for confidence sets are presented in Table 1. It seems that they approach the nominal covering probability 0.95 with satisfactory accuracy for sample sizes $n \ge 1000$.

Experiment 2.
In this experiment we enlarged the variance of the error terms, taking $\sigma^2 = 2$. All other parameters were the same as in Experiment 1. The results are presented in Table 2. It seems that the increase in the error variance does not degrade the covering accuracy of the confidence sets.

Experiment 3.
Here we consider the case when the error distributions are heavy-tailed. We generate the data with $\varepsilon_X$ and $\varepsilon_Y$ having Student's t distribution with df = 14 degrees of freedom. (This is the smallest df for which the assumptions of Theorem 4 hold.) Covering frequencies are presented in Table 3. It seems that the accuracy of covering decreased slightly, but this decrease is insignificant for practical purposes.

Sociologic analysis of EIT data
We would like to demonstrate the advantages of the proposed technique by applying it to the analysis of the External Independent Testing (EIT) data (see [7]). EIT is a set of exams for high school graduates in Ukraine which must be passed for admission to universities. We use data on EIT-2016 from the official site of the Ukrainian Center for Educational Quality Assessment. In this presentation we consider only the data on scores in two subjects: Ukrainian language and literature (Ukr) and Mathematics (Math). The scores range from 100 to 200 points. (We have excluded the data on persons who failed one of the exams or did not take these exams at all.) EIT-2016 contains such data on 246 thousand examinees. The information on the region (Oblast) of Ukraine in which each examinee attended high school is also available in EIT-2016.
Our aim is to investigate how the dependence between Ukr and Math scores differs for examinees who grew up in different environments. This can be, e.g., an environment of adherents of Ukrainian culture and the Ukrainian state, or an environment of persons critical of Ukrainian independence. EIT-2016 does not contain information on such issues. So we use data on the Ukrainian Parliament (Verhovna Rada) election results to deduce approximate proportions of adherents of different political choices in different regions of Ukraine.
We divided adherents of the 29 parties and blocks that took part in the elections into three large groups, which are the components of our mixture:
(1) Pro-Ukrainian persons, who voted for the parties that then formed the ruling coalition (BPP, Batkivschyna, Narodny Front, Radicals and Samopomich).
(2) Contra-Ukrainian persons, who voted for the Opposition block, voted against all, or voted for small parties which were under the 5% threshold in these elections.
(3) Neutral persons, who did not take part in the voting.
Combining these data with EIT-2016 we obtain the sample $(X_j, Y_j)$, $j = 1,\dots,n$, where $X_j$ is the Math score of the $j$-th examinee and $Y_j$ is his/her Ukr score. The concentrations of components $(p^1_j, p^2_j, p^3_j)$ are taken as the frequencies of adherents of the corresponding political choice in the region where the $j$-th examinee attended high school.
In [7] the authors propose to use the classical linear regression model (in which the error appears in the response only) to describe the dependence between $X_j$ and $Y_j$ in these data. But the errors-in-variables model can be more adequate, since the causes which distort the ideal functional dependence $\mathrm{Ukr} = b_0 + b_1\,\mathrm{Math}$ can affect both Math and Ukr scores, causing random deviations of, maybe, the same dispersion for each variable.
So, in this presentation, we assumed that the data are described by the model […] In this model we calculated the confidence intervals at the level $\alpha = 0.05/3 \approx 0.0167$ in order to obtain the overall level $\alpha = 0.05$ in comparisons of the three intervals derived for the three different components. The results are presented in Table 4. We observe that the obtained intervals are rather narrow. They don't intersect for different […] Surely, these explanations are too simple to be absolutely correct. We consider them only as examples of hypotheses which can be deduced from the data by the proposed technique.
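The level adjustment used above is the usual Bonferroni-type division of the overall level by the number of intervals being compared. A minimal sketch, with hypothetical function names and made-up interval endpoints (they are not the values from Table 4):

```python
def bonferroni_level(alpha, m):
    """Per-interval level that gives overall level alpha for m intervals."""
    return alpha / m

def pairwise_disjoint(intervals):
    """True if no two intervals (lo, hi) intersect."""
    ordered = sorted(intervals)
    # After sorting by lower endpoint it suffices to compare neighbours.
    return all(prev[1] < cur[0] for prev, cur in zip(ordered, ordered[1:]))

per_interval = bonferroni_level(0.05, 3)

# Made-up intervals for three components, for illustration only.
disjoint = pairwise_disjoint([(0.10, 0.20), (0.30, 0.40), (0.21, 0.29)])
overlapping = pairwise_disjoint([(0.10, 0.25), (0.20, 0.30)])
print(per_interval, disjoint, overlapping)
```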

Proofs
To demonstrate Theorem 3 we need three lemmas. Below the symbols $C$ and $c$ denote finite constants, possibly different in different places. Proof. By definition, $\frac{1}{n}\Gamma_{;n} \to \Gamma_\infty$, so there exists $c > 0$ such that $\det\Gamma_{;n} > c$ for all $n$ large enough. This together with $|p^m_{j;n}| < 1$ implies […] (Here $\|\cdot\|$ means the operator norm.) Taking into account that $a_{;n} = p_{;n}\Gamma^{-1}_{;n}$, we obtain the first statement of the lemma.
Proof. Let $\eta_1,\dots,\eta_n$ be independent random variables with $\mathsf{E}\,\eta_i = 0$. Let us denote $B_n = \sum_{j=1}^{n}\mathsf{E}(\eta_j)^2$. Then the last formula in the proof of Theorem 7.2 and Theorem 7.3 in [10] imply the following proposition.

Proposition 1. If
for n large enough.
Let $\eta_j = \pm n a^{(k)}_{j;n}\xi^l_j$. Then $B_n = n^2\sum_{j=1}^{n}(a^{(k)}_{j;n})^2 \operatorname{Var}\xi^l_j \sim Cn$ by Lemma 1. Assumption 2 implies that the assumption of Proposition 1 holds. So, […] This implies the statement of the lemma.
Proof. By the Chebyshev inequality we obtain that for some $0 < R < \infty$, […] Then for $\alpha\beta > 1$, […] The lemma is proved.
Proof of Theorem 3. Let ξ j = ξ j − E ξ j . Then Let us denote Let us denoteṼ We will show thatṼ ;n → V ∞ as n → ∞ in probability (30) and n V ;n −Ṽ ;n → 0 as n → ∞ in probability.
These two equations imply the statement of the theorem. We start from (30). Let us calculate EṼ ;n . Notice that By Assumption 2 of the theorem, sup i E ξ i (ξ i ) T < C, and by Lemma 1, By the same way as in (9), we obtain EṼ ;n → H (μ (k) ) (k) ∞ (H (μ (k) )) T = V ∞ .

Conclusions
We introduced a modification of the jackknife technique for ACM estimation for moment estimators based on observations from mixtures with varying concentrations. A fast algorithm which implements this technique is proposed. Consistency of the derived estimator is demonstrated. Results of simulations demonstrate its practical applicability for sample sizes n > 1000.