Testing hypotheses on moments by observations from a mixture with varying concentrations

A mixture with varying concentrations is a modification of a finite mixture model in which the mixing probabilities (concentrations of mixture components) may be different for different observations. In the paper, we assume that the concentrations are known and the distributions of components are completely unknown. A nonparametric technique is proposed for testing hypotheses on functional moments of components.


Introduction
Finite mixture models (FMMs) arise naturally in statistical analysis of biological and sociological data [11,13]. The model of mixture with varying concentrations (MVC) is a modification of the FMM where the mixing probabilities may be different for different observations. Namely, we consider a sample of subjects $O_1, \dots, O_N$, where each subject belongs to one of the subpopulations (mixture components) $P_1, \dots, P_M$. The true subpopulation to which the subject $O_j$ belongs is unknown, but we know the probabilities $p^m_{j;N} = \mathsf{P}[O_j \in P_m]$ (mixing probabilities, or concentrations of $P_m$ in the mixture at the $j$th observation, $j = 1, \dots, N$, $m = 1, \dots, M$). For each subject $O$, a variable $\xi(O)$ is observed, which is considered as a random element in a measurable space $X$ equipped with a $\sigma$-algebra $\mathcal{F}$. Let $\xi_{j;N} = \xi(O_j)$, and let $F_m(A) = \mathsf{P}[\xi(O) \in A \mid O \in P_m]$, $A \in \mathcal{F}$, be the distribution of the observed variable on the $m$th component. Then
$$\mathsf{P}[\xi_{j;N} \in A] = \sum_{m=1}^{M} p^m_{j;N} F_m(A). \tag{1}$$
The observations $\xi_{j;N}$ are assumed to be independent for $j = 1, \dots, N$. We consider the nonparametric MVC model where the concentrations $p^m_{j;N}$ are known but the component distributions $F_m$ are completely unknown. Such models were applied to analyze gene expression level data [8] and data on sensitive questions in sociology [12]. An example of sociological data analysis based on MVC is presented in [9]. In this paper, we consider adherents of different political parties in Ukraine as subpopulations $P_i$. Their concentrations are deduced from the 2006 parliament election results in different regions of Ukraine. Individual voters are considered as subjects; their observed characteristics are taken from the Four-Wave World Values Survey held in Ukraine in 2006. (Note that the political choices of the surveyed individuals were unknown. So, each subject must be considered as selected from a mixture of different $P_i$.) For example, one of the observed characteristics is the satisfaction with personal income (in points from 1 to 10).
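The data-generating mechanism of the MVC model can be illustrated by a short simulation. The following is only a sketch: the function name `sample_mvc`, the two Gaussian components, and the linearly varying concentrations are hypothetical choices made for illustration and are not taken from the paper.

```python
import random

def sample_mvc(concentrations, samplers, rng):
    """Draw one observation per subject from a mixture with varying concentrations.

    concentrations[j][m] is the known probability that subject j belongs to
    component m; samplers[m](rng) draws from the component distribution F_m
    (unknown in practice, chosen by hand here).
    """
    observations = []
    for probs in concentrations:
        # Pick the hidden component of subject j with probabilities p_j.
        m = rng.choices(range(len(samplers)), weights=probs, k=1)[0]
        observations.append(samplers[m](rng))
    return observations

# Hypothetical two-component example: N(0, 1) and N(3, 1).
rng = random.Random(0)
N = 1000
concentrations = [[j / N, 1 - j / N] for j in range(1, N + 1)]
samplers = [lambda r: r.gauss(0.0, 1.0), lambda r: r.gauss(3.0, 1.0)]
xi = sample_mvc(concentrations, samplers, rng)
```

Only `concentrations` and `xi` would be available to the statistician; the component labels drawn inside the loop are discarded, which mirrors the assumption that the true subpopulation of each subject is unknown.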
A natural question in the analysis of such data is homogeneity testing for different components. For example, if $X = \mathbb{R}$, then we may ask if the means or variances (or both) of the distributions $F_i$ and $F_k$ are the same for some fixed $i$ and $k$ or if the variances of all the components are the same.
In [8], a test is proposed for the hypothesis of two-means homogeneity. In this paper, we generalize the approach from [8] to a much richer class of hypotheses, including different statements on means, variances, and other generalized functional moments of component distributions.
Hypotheses of equality of MVC component distributions, that is, $F_i \equiv F_k$, were considered in [6] (a Kolmogorov-Smirnov-type test is proposed) and [1] (tests based on wavelet density estimation). The technique of our paper also allows testing such hypotheses using the "grouped $\chi^2$" approach.
Parametric tests for different hypotheses on mixture components were also considered in [4,5,13].
The rest of the paper is organized as follows. We describe the considered hypotheses formally and discuss the test construction in Section 2. Section 3 contains auxiliary information on the functional moments estimation in MVC models. In Section 4, the test is described formally. Section 5 contains results of the test performance analysis by a simulation study and an example of real-life data analysis. Technical proofs are given in Appendix A.

Problem setting
In the rest of the paper, we use the following notation.
The zero vector from $\mathbb{R}^k$ is denoted by $O_k$. The unit $k \times k$ matrix is denoted by $I_{k \times k}$, and the $k \times m$ zero matrix by $O_{k \times m}$. Convergences in probability and in distribution are denoted by $\xrightarrow{P}$ and $\xrightarrow{d}$, respectively.
Angle brackets with subscript $N$ denote averaging of an array over all the observations, for example,
$$\langle a \rangle_N = \frac{1}{N} \sum_{j=1}^{N} a_{j;N}.$$
Multiplication, summation, and other similar operations inside the angle brackets are applied to the arrays componentwise, so that
$$\langle a b \rangle_N = \frac{1}{N} \sum_{j=1}^{N} a_{j;N} b_{j;N}, \qquad \langle a^2 \rangle_N = \frac{1}{N} \sum_{j=1}^{N} (a_{j;N})^2,$$
and so on. Angle brackets without subscript mean the limit of the corresponding averages as $N \to \infty$ (assuming that this limit exists):
$$\langle a \rangle = \lim_{N \to \infty} \langle a \rangle_N.$$

We formally introduce random elements $\eta_m \in X$ with distributions $F_m$, $m = 1, \dots, M$. Consider a set of $K \le M$ measurable functions $g_k : X \to \mathbb{R}^{d_k}$, $k = 1, \dots, K$. Let $\bar g^m_k$ be the (vector-valued) functional moment of the $m$th component with moment function $g_k$, that is,
$$\bar g^m_k = \mathsf{E}[g_k(\eta_m)] = \int_X g_k(x)\, F_m(dx).$$
Fix a measurable function $T : \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \times \dots \times \mathbb{R}^{d_K} \to \mathbb{R}^L$. For data described by the MVC model (1), we consider testing a null hypothesis of the form
$$H_0 : T(\bar g^1_1, \dots, \bar g^K_K) = O_L \tag{3}$$
against the general alternative $T(\bar g^1_1, \dots, \bar g^K_K) \ne O_L$.

Example 1. Consider a three-component mixture ($M = 3$) with $X = \mathbb{R}$. We would like to test the hypothesis $H^\sigma_0 : \operatorname{Var} \eta_1 = \operatorname{Var} \eta_2$ (i.e., the variances of the first and second components are the same). This hypothesis can be reformulated in the form (3) by letting $g_1(x) = g_2(x) = (x, x^2)^T$ and $T((y_{11}, y_{12})^T, (y_{21}, y_{22})^T) = (y_{12} - (y_{11})^2) - (y_{22} - (y_{21})^2)$, so that $T$ is the difference of the two variances and $L = 1$.

Example 2. Let $X = \mathbb{R}$. Consider the hypothesis of mean homogeneity $H^\mu_0 : \mathsf{E}\,\eta_1 = \dots = \mathsf{E}\,\eta_M$. Then the choice of $g_m(x) = x$, $T(y_1, \dots, y_M) = (y_1 - y_2, y_2 - y_3, \dots, y_{M-1} - y_M)^T$ reduces $H^\mu_0$ to the form (3).

Example 3. Let $X$ be a finite discrete space: $X = \{x_1, \dots, x_r\}$. Consider the distribution homogeneity hypothesis $H^\equiv_0 : F_1 \equiv F_2$. To present it in the form (3), we can use $g_i(x) = (\mathbf{1}\{x = x_k\},\ k = 1, \dots, r - 1)^T$ for $i = 1, 2$ and $T(y_1, y_2) = y_1 - y_2$ ($y_i \in \mathbb{R}^{r-1}$ for $i = 1, 2$). In the case of continuous distributions, $H^\equiv_0$ can be discretized by data grouping.
To test $H_0$ defined by (3), we adopt the following approach. Let there be some consistent estimators $\hat g^m_{k;N}$ for $\bar g^m_k$. Assume that $T$ is continuous. Consider the statistic $\hat T_N = T(\hat g^1_{1;N}, \dots, \hat g^K_{K;N})$. Then, under $H_0$, $\hat T_N \approx O_L$, and a large departure of $\hat T_N$ from zero is evidence in favor of the alternative.
To measure this departure, we use a Mahalanobis-type distance. If $\sqrt{N}\, \hat T_N$ is asymptotically normal with a nonsingular asymptotic covariance matrix $D$, then, under $H_0$, $N \hat T_N^T D^{-1} \hat T_N$ is asymptotically $\chi^2_L$-distributed. In fact, $D$ depends on the unknown component distributions $F_i$, so we replace it by a consistent estimator $\hat D_N$.
The resulting statistic is $\hat s_N = N \hat T_N^T \hat D_N^{-1} \hat T_N$, and the test rejects $H_0$ if $\hat s_N > Q_{\chi^2_L}(1 - \alpha)$, where $\alpha$ is the significance level, and $Q_G(\alpha)$ denotes the quantile of level $\alpha$ for distribution $G$.
Possible candidates for the role of the estimators $\hat g^m_{k;N}$ and $\hat D_N$ are considered in the next section.

Estimation of functional moments
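Let us start with the nonparametric estimation of $F_m$ by the weighted empirical distribution of the form
$$\hat F_{m;N}(A) = \frac{1}{N} \sum_{j=1}^{N} a^m_{j;N} \mathbf{1}\{\xi_{j;N} \in A\},$$
where $a^m_{j;N}$ are some nonrandom weights to be selected "in the best way." Denote $e_m = (\mathbf{1}\{k = m\},\ k = 1, \dots, M)^T$, $p_{j;N} = (p^1_{j;N}, \dots, p^M_{j;N})^T$, and
$$\Gamma_N = \frac{1}{N} \sum_{j=1}^{N} p_{j;N}\, p_{j;N}^T = \bigl(\langle p^l p^m \rangle_N\bigr)_{l,m=1}^{M}.$$
Assume that $\Gamma_N$ is nonsingular. It is shown in [8] that, in this case, the weight array $a^m_{j;N} = p_{j;N}^T \Gamma_N^{-1} e_m$, $j = 1, \dots, N$, yields the unbiased estimator with minimal assured quadratic risk.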
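The weights can be computed directly from the known concentrations. The sketch below assumes that $\Gamma_N$ is the Gram matrix of the concentration arrays, $\Gamma_N = (\langle p^k p^l \rangle_N)_{k,l=1}^{M}$; the helper names `solve` and `minimax_weights` and the two-component concentration design are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("Gamma_N is (numerically) singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def minimax_weights(concentrations, m):
    """Weights a^m_j = p_j^T Gamma_N^{-1} e_m for component m (0-based index)."""
    N = len(concentrations)
    M = len(concentrations[0])
    # Assumed definition: Gamma_N[k][l] = <p^k p^l>_N.
    gamma = [[sum(p[k] * p[l] for p in concentrations) / N for l in range(M)]
             for k in range(M)]
    e_m = [1.0 if k == m else 0.0 for k in range(M)]
    c = solve(gamma, e_m)  # c = Gamma_N^{-1} e_m
    return [sum(pk * ck for pk, ck in zip(p, c)) for p in concentrations]

# Hypothetical design: the concentration of component 0 grows linearly in j.
N = 200
concentrations = [[j / N, 1 - j / N] for j in range(1, N + 1)]
a0 = minimax_weights(concentrations, 0)
```

With these weights, $\langle a^m p^k \rangle_N = \mathbf{1}\{k = m\}$, which is what makes the weighted empirical distribution unbiased for $F_m$. In this design some entries of `a0` are negative, so the weighted empirical distribution is not a probability distribution.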
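The simple estimator $\hat g^m_{i;N}$ for $\bar g^m_i$ is defined as
$$\hat g^m_{i;N} = \int_X g_i(x)\, \hat F_{m;N}(dx) = \frac{1}{N} \sum_{j=1}^{N} a^m_{j;N}\, g_i(\xi_{j;N}).$$
To formulate the asymptotic normality result for the simple moment estimators, we need some additional notation.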
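As a minimal sketch, the simple estimator is a weighted average of the transformed observations. The function name `simple_moment` and the toy numbers are hypothetical; the weights would normally be the $a^m_{j;N}$ computed from the concentrations.

```python
def simple_moment(weights, observations, g):
    """Simple estimator (1/N) * sum_j a_j * g(xi_j) of a scalar functional moment."""
    N = len(observations)
    return sum(a * g(x) for a, x in zip(weights, observations)) / N

# With unit weights the estimator reduces to the ordinary sample mean.
mean_estimate = simple_moment([1.0, 1.0, 1.0], [1.0, 2.0, 3.0], lambda x: x)

# With a negative weight, the estimate of a second moment can be negative.
second_moment_estimate = simple_moment([2.0, -1.0], [0.1, 3.0], lambda x: x * x)
```

The second call shows why estimates built on signed weights may fall outside the range of the estimated quantity in small samples.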
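We consider the set of all moments $\bar g^k_k$, $k = 1, \dots, K$, as one long vector belonging to $\mathbb{R}^d$, $d := d_1 + \dots + d_K$:
$$\bar g = \bigl((\bar g^1_1)^T, \dots, (\bar g^K_K)^T\bigr)^T.$$
The corresponding estimators also form a long vector
$$\hat g_N = \bigl((\hat g^1_{1;N})^T, \dots, (\hat g^K_{K;N})^T\bigr)^T.$$
We denote the matrices of mixed second moments of $g_k(x)$, $k = 1, \dots, K$, and the corresponding estimators as
$$\bar g^m_{k,l} = \mathsf{E}\bigl[g_k(\eta_m)\, g_l(\eta_m)^T\bigr], \qquad \hat g^m_{k,l;N} = \frac{1}{N} \sum_{j=1}^{N} a^m_{j;N}\, g_k(\xi_{j;N})\, g_l(\xi_{j;N})^T. \tag{5b}$$
We consider the function $T$ as a function of a $d$-dimensional argument, that is, $T(y) = T(y^1, \dots, y^K)$ for $y = ((y^1)^T, \dots, (y^K)^T)^T \in \mathbb{R}^d$, $y^k \in \mathbb{R}^{d_k}$. Let us define the following matrices (assuming that the limits exist):
$$\alpha_{r,s;N} = \bigl(\langle a^k a^l p^r p^s \rangle_N\bigr)_{k,l=1}^{K}, \quad \alpha_{r,s} = \lim_{N \to \infty} \alpha_{r,s;N}, \qquad \beta_{m;N} = \bigl(\langle a^k a^l p^m \rangle_N\bigr)_{k,l=1}^{K}, \quad \beta_m = \lim_{N \to \infty} \beta_{m;N}.$$
Then the asymptotic covariance matrix of the normalized estimate $\sqrt{N}(\hat g_N - \bar g)$ is the matrix $\Sigma$ consisting of the blocks
$$\Sigma^{(k,l)} = \sum_{m=1}^{M} \beta^{k,l}_m\, \bar g^m_{k,l} - \sum_{r,s=1}^{M} \alpha^{k,l}_{r,s}\, \bar g^r_k (\bar g^s_l)^T, \qquad k, l = 1, \dots, K,$$
where $\beta^{k,l}_m$ and $\alpha^{k,l}_{r,s}$ are the $(k,l)$ entries of $\beta_m$ and $\alpha_{r,s}$.

Theorem 2. Assume that: (i) The functional moments $\bar g^m_k$, $\bar g^m_{k,l}$ exist and are finite for $k, l = 1, \dots, K$, $m = 1, \dots, M$.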
Thus, to construct a test for $H_0$, we need a consistent estimator for $\Sigma$. The matrices $\alpha_{r,s;N}$ and $\beta_{m;N}$ are natural estimators for $\alpha_{r,s}$ and $\beta_m$. It is also natural to estimate $\bar g^m_{k,l}$ by $\hat g^m_{k,l;N}$ defined in (5b). In view of Theorem 1, these estimators are consistent under the assumptions of Theorem 2. But they can possess undesirable properties for moderate sample sizes. Indeed, note that $\hat F_{m;N}$ is not a probability distribution itself since the weights $a^m_{j;N}$ are negative for some $j$. Therefore, for example, the simple estimator of the second moment of some component can be negative, the estimator (5b) of the positive semidefinite matrix $\bar g^m_{k,k}$ may fail to be positive semidefinite, and so on. Due to the asymptotic normality result, this is not too troublesome for the estimation of $\bar g$. But it causes serious difficulties when one uses an estimator of the asymptotic covariance matrix $D$ based on $\hat g^m_{k,l;N}$ in order to calculate $\hat s_N$. In [10], a technique for improving $\hat F_{m;N}$ and $\hat h^m_N$ is developed that allows one to derive estimators with more adequate finite-sample properties if $X = \mathbb{R}$.
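So, assume that $\xi(O) \in \mathbb{R}$ and consider the weighted empirical CDF
$$\hat F_{m;N}(x) = \frac{1}{N} \sum_{j=1}^{N} a^m_{j;N} \mathbf{1}\{\xi_{j;N} < x\}.$$
It is not a nondecreasing function, and it can attain values outside $[0, 1]$ since some $a^m_{j;N}$ are negative. The transform
$$\tilde F^+_{m;N}(x) = \sup_{y < x} \hat F_{m;N}(y)$$
yields a monotone function $\tilde F^+_{m;N}(x)$, but it still can be greater than 1 at some $x$. So, define
$$\hat F^+_{m;N}(x) = \min\bigl(1, \tilde F^+_{m;N}(x)\bigr),$$
and, symmetrically, $\hat F^-_{m;N}(x) = \max\bigl(0, \inf_{y \ge x} \hat F_{m;N}(y)\bigr)$. Any CDF that lies between $\hat F^-_{m;N}(x)$ and $\hat F^+_{m;N}(x)$ can be considered as an improved version of $\hat F_{m;N}(x)$. We will use only one such improvement, denoted by $\hat F^\pm_{m;N}(x)$, which combines $\hat F^-_{m;N}(x)$ and $\hat F^+_{m;N}(x)$.

Note that all three considered estimators $\hat F^*_{m;N}$ ($*$ means any symbol from $+$, $-$, or $\pm$) are piecewise constant on intervals between successive order statistics of the data. Thus, they can be represented as
$$\hat F^*_{m;N}(x) = \frac{1}{N} \sum_{j=1}^{N} b^{m*}_{j;N} \mathbf{1}\{\xi_{j;N} < x\},$$
where $b^{m*}_{j;N}$ are some random weights that depend on the data. The corresponding improved estimator for $\bar g^m_i$ is
$$\hat g^{m*}_{i;N} = \int g_i(x)\, \hat F^*_{m;N}(dx) = \frac{1}{N} \sum_{j=1}^{N} b^{m*}_{j;N}\, g_i(\xi_{j;N}).$$
Let $h : \mathbb{R} \to \mathbb{R}$ be a measurable function. (II) Assume that: Then $\hat h^{m\pm}_N \to \bar h^m$ in probability.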
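The monotonization step can be sketched in a few lines. The code below implements only one plausible reading of the "+" improvement (a running supremum of the weighted empirical CDF, clipped at 1); the exact transform used in [10] is an assumption here, and the function name `improved_cdf_plus` and the toy weights are hypothetical.

```python
def improved_cdf_plus(weights, observations):
    """Return sorted points and a monotone, [0, 1]-valued version of the weighted CDF.

    values[i] is the estimate just after the i-th order statistic.
    """
    N = len(observations)
    pairs = sorted(zip(observations, weights))
    points, values = [], []
    running, best = 0.0, 0.0
    for x, a in pairs:
        running += a / N               # raw weighted empirical CDF (may decrease)
        best = max(best, running)      # running supremum makes it nondecreasing
        points.append(x)
        values.append(min(1.0, best))  # clip at 1
    return points, values

# Toy example with signed weights that sum to N = 4.
points, values = improved_cdf_plus([-0.5, 2.0, -1.0, 3.5], [1.0, 2.0, 3.0, 4.0])
```

The raw weighted CDF in this toy example takes the values -0.125, 0.375, 0.125, 1.0, while the improved version is nondecreasing and stays in $[0, 1]$. Differences of successive entries of `values`, multiplied by $N$, play the role of the data-dependent weights $b^{m+}_{j;N}$.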

Construction of the test
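We first state an asymptotic normality result for $\hat T_N$. Denote by $T'(y)$ the $L \times d$ matrix of partial derivatives of $T$ at $y \in \mathbb{R}^d$, and let $D = T'(\bar g)\, \Sigma\, T'(\bar g)^T$.

Theorem 4. Assume that: (ii) The assumptions of Theorem 2 hold.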
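For the proof, see Appendix. Note that (iii) implies the nonsingularity of $\Sigma$. Now, to estimate $D$, we can use $\hat D_N = T'(\tilde g_N)\, \hat\Sigma_N\, T'(\tilde g_N)^T$, where $\tilde g_N$ is any consistent estimator for $\bar g$, and $\hat\Sigma_N$ is obtained from $\Sigma$ by replacing $\alpha_{r,s}$ and $\beta_m$ with $\alpha_{r,s;N}$ and $\beta_{m;N}$, the moments $\bar g^m_k$ with the corresponding components of $\tilde g_N$, and $\bar g^m_{k,l}$ with $\tilde g^m_{k,l;N}$, where $\tilde g^m_{k,l;N}$ is any consistent estimator for $\bar g^m_{k,l}$. For example, we can use
$$\tilde g^m_{k,l;N} = \hat g^{m\pm}_{k,l;N} = \frac{1}{N} \sum_{j=1}^{N} b^{m\pm}_{j;N}\, g_k(\xi_{j;N})\, g_l(\xi_{j;N})^T$$
if $X = \mathbb{R}$ and the assumptions of Theorem 3 hold for all $h(x) = g^i_l(x)\, g^n_k(x)$, $l, k = 1, \dots, K$, $i = 1, \dots, d_l$, $n = 1, \dots, d_k$, where $g_l(x) = (g^1_l(x), \dots, g^{d_l}_l(x))^T$.

Now let the test statistic be $\hat s_N = N (\hat T_N)^T \hat D_N^{-1} \hat T_N$. For a given significance level $\alpha$, the test $\pi_{N,\alpha}$ accepts $H_0$ if $\hat s_N \le Q_{\chi^2_L}(1 - \alpha)$ and rejects $H_0$ otherwise.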
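As a worked illustration of how $T'$ enters the statistic, consider the variance-equality hypothesis of Example 1, taking $T$ as the difference of the two variances (an assumption about the intended scalar form, so that $L = 1$ and $d = 4$). The notation $\tilde g_N$, $\hat\Sigma_N$ follows the text; the derivation below is a sketch, not a statement from the paper.

```latex
\begin{align*}
T(y) &= \bigl(y_{12} - y_{11}^2\bigr) - \bigl(y_{22} - y_{21}^2\bigr),
  \qquad y = (y_{11}, y_{12}, y_{21}, y_{22})^T, \\
T'(y) &= \bigl(-2 y_{11},\; 1,\; 2 y_{21},\; -1\bigr), \\
\hat D_N &= T'(\tilde g_N)\, \hat\Sigma_N\, T'(\tilde g_N)^T
  \quad (\text{a scalar, since } L = 1), \\
\hat s_N &= \frac{N\, \hat T_N^{\,2}}{\hat D_N}.
\end{align*}
```

In this scalar case the test compares $\hat s_N$ with the $(1 - \alpha)$-quantile of the $\chi^2_1$ distribution, which is approximately 3.84 for $\alpha = 0.05$.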
The p-value of the test (i.e., the attained significance level) can be calculated as $p = 1 - G(\hat s_N)$, where $G$ is the CDF of the $\chi^2_L$-distribution.

Theorem 5. Let the assumptions of Theorem 4 hold. Moreover, assume the following:

To ensure assumption (ii) of Theorem 2, we need $\mathsf{E}[|\eta_m|^{2+\delta}] < \infty$ for some $\delta > 0$ and all $m = 1, \dots, M$. In view of Theorem 1, this assumption also implies the consistency of $\hat g_N$ and $\hat g^m_{k,l;N}$. If one uses $\hat g^\pm_N$ and $\hat g^{m\pm}_{k,l;N}$ as the estimators $\tilde g_N$ and $\tilde g^m_{k,l;N}$ in $\hat D_N$, then a more restrictive assumption $\mathsf{E}[|\eta_m|^{4+\delta}] < \infty$ is needed to ensure their consistency by Theorem 3.
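The decision rule and the p-value can be sketched with the standard library only. The function names `chi2_sf` and `moment_test` are hypothetical, the $\chi^2$ survival function is computed by a textbook series for the regularized incomplete gamma function (adequate for moderate values of the statistic, not for extreme tails), and `moment_test` covers only the scalar case $L = 1$.

```python
import math

def chi2_sf(x, df):
    """Survival function 1 - G(x) of the chi-square distribution with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, z).
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-16 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def moment_test(T_hat, D_hat, N, alpha=0.05):
    """Scalar case L = 1: s = N * T^2 / D; reject H0 when the p-value is below alpha."""
    s = N * T_hat ** 2 / D_hat
    p = chi2_sf(s, 1)
    return s, p, p < alpha

# Hypothetical numbers: T_hat = 0.3, D_hat = 2.0, N = 100.
s, p, reject = moment_test(0.3, 2.0, 100)
```

Rejecting when $p < \alpha$ is equivalent to rejecting when $\hat s_N > Q_{\chi^2_L}(1 - \alpha)$, because the limit distribution is continuous.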

Simulation study
To assess the performance of the proposed test on samples of moderate size, we conducted a small simulation study. Three-component mixtures ($M = 3$) were simulated with sample sizes $N = 50, 100, 250, 500, 750, 1000, 2000$, and $5000$. Three modifications of the $\pi_{N;\alpha}$ test were applied to each sample. In the first modification, (ss), simple estimators were used to calculate both $\hat T_N$ and $\hat D_N$. In the second modification, (si), simple estimators were used in $\hat T_N$, and the improved ones were used in $\hat D_N$. In the last modification, (ii), improved estimators were used in both $\hat T_N$ and $\hat D_N$. Note that the modification (ii) has no theoretical justification since, as far as we know, there are no results on the limit distribution of $\sqrt{N}(\hat g^\pm_N - \bar g)$. All tests were used with the nominal significance level $\alpha = 0.05$. In the figures, frequencies of errors of the tests are presented. In the plots, the three modifications are distinguished by markers: △ corresponds to (si) and • to (ii), and the remaining marker corresponds to (ss).

Experiment A1.
In this experiment, we consider testing the mean homogeneity hypothesis $H^\mu_0$. The means were taken as $\mu_m = 0$, $m = 1, 2, 3$, so $H^\mu_0$ holds. To mask the equality of means, different variances of components were taken, namely $\sigma^2_1 = 1$, $\sigma^2_2 = 4$, and $\sigma^2_3 = 9$. The resulting first-type error frequencies are presented on the left panel of Fig. 1. For the (ss) test, for small $N$, there were 1.4% cases of incorrect covariance matrix estimates ($\hat D_N$ was not positive definite). Incorrect estimates were absent for large $N$.

Experiment A2.
Here we also tested $H^\mu_0$ for components with the same variances as in A1. But $\mu_1 = 2$ and $\mu_2 = \mu_3 = 0$, so $H^\mu_0$ does not hold. The frequencies of the second-type error are presented on the right panel of Fig. 1. The percentage of incorrect estimates $\hat D_N$ is 1.6% for (ss) and small $N$.

Experiment B1.
In this and the next experiment, we tested $H^\sigma_0 : \sigma^2_1 = \sigma^2_2$. The data were generated with $\mu_1 = 0$, $\mu_2 = 3$, $\mu_3 = -2$, $\sigma^2_1 = \sigma^2_2 = 1$, and $\sigma^2_3 = 4$, so $H^\sigma_0$ holds. The frequencies of the first-type error are presented on the left panel of Fig. 2. The percentage of incorrect $\hat D_N$ in (ss) varies from 19.4% for small $N$ to 0% for large $N$.

Experiment B2.
Now $\mu_m$ and $\sigma^2_3$ are the same as in B1, but $\sigma^2_1 = 1$ and $\sigma^2_2 = 4$, so $H^\sigma_0$ does not hold. The frequencies of the second-type error are presented on the right panel of Fig. 2. The percentage of incorrect $\hat D_N$ in (ss) was 15.5% for small $N$ and decreased to 0% for large $N$.
The presented results show reasonable agreement of the observed significance levels of the tests with their nominal level 0.05 when the sample sizes were larger than 500. The power of the tests increases to 1 as the sample sizes grow. It is interesting to note that the (ii) modification, although theoretically not justified, demonstrates the smallest first-type error and rather good power. From these results, the (si) modification of the test seems the most prudent one.

Example of a sociological data analysis
Consider the data discussed in [9]. It consists of two parts. The first part is the data from the Four-Wave World Values Survey (FWWVS) held in Ukraine by the European Values Study Foundation (www.europeanvalues.nl) and World Values Survey Association (www.worldvaluessurvey.org) in 2006. They contain answers of $N = 4006$ Ukrainian respondents to different questions about their social status and attitudes to different human values. We consider here the level of satisfaction with personal income (subjective income) as our variable of interest $\xi$, so $\xi_{j;N}$ is the subjective income of the $j$th respondent.
Our aim is to analyze differences in the distribution of $\xi$ on populations of adherents of different political parties. Namely, we use the data on results of the Ukrainian Parliament elections held in 2006. Forty-six parties took part in the elections. The voters could also vote against all or not take part in the voting. We divided the whole population of Ukrainian voters into three large groups (political subpopulations): $P_1$, which contains adherents of the Party of Regions (PR, 32.14% of votes), $P_2$ of Orange Coalition supporters (OC, which consisted of the "BJUT" and "NU" parties, 36.24%), and $P_3$ of all others, including the persons who voted against all or did not take part in the poll (Other).
Political preferences of respondents are not available in the FWWVS data, so we used official results of the elections by 27 regions of Ukraine (see the site of the Ukrainian Central Elections Commission www.cvk.gov.ua) to estimate the concentrations $p^m_{j;N}$ of the considered political subpopulations in the region where the $j$th respondent voted.
Means and variances of $\xi$ over different subpopulations were estimated by the data (see Table 1). Different tests were performed to test their differences. The results are presented in Table 2. Here $\mu_m$ means the expectation, and $\sigma^2_m$ means the variance of $\xi$ over the $m$th subpopulation. Degrees of freedom for the limit $\chi^2$ distribution are placed in the "df" column.
These results show that the hypothesis of homogeneity of all variances must be definitely rejected. The variances of $\xi$ for PR and OC adherents are different, but the tests failed to observe significant differences in the pairs of variances PR-Other and OC-Other. For the means, all the tests agree that PR and OC have the same mean $\xi$, whereas the mean of Other is different from the common mean of PR and OC.

Concluding remarks
We developed a technique that allows one to construct testing procedures for different hypotheses on functional moments of mixtures with varying concentrations. This technique can be applied to test the homogeneity of means or variances (or both) of some components of the mixture. Performance of different modifications of the test procedure is compared in a small simulation study. The (ss) modification showed the worst first-type error and the highest power. The (ii) test has the best first-type error and the worst power. It seems that the (si) modification can be recommended as a golden mean.

Acknowledgement
The authors are thankful to the anonymous referee for fruitful comments.

Appendix A
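Proof of Theorem 2. Note that
$$\sqrt{N}(\hat g_N - \bar g) = S_N := \frac{1}{\sqrt{N}} \sum_{j=1}^{N} \zeta_{j;N}, \qquad \zeta_{j;N} = \Bigl(a^k_{j;N}\bigl(g_k(\xi_{j;N}) - \mathsf{E}[g_k(\xi_{j;N})]\bigr)^T,\ k = 1, \dots, K\Bigr)^T.$$
We will apply the CLT with the Lyapunov condition (see Theorem 11 from Chapter 8 and Remark 4 in Section 4.8 in [2]) to $S_N$. It is readily seen that $\zeta_{j;N}$, $j = 1, \dots, N$, are independent for fixed $N$ and $\mathsf{E}[\zeta_{j;N}] = 0$.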
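Let $\Sigma_{j;N} = \operatorname{Cov}(\zeta_{j;N})$. Then $\Sigma_{j;N}$ consists of the blocks
$$\Sigma^{(k,l)}_{j;N} = a^k_{j;N} a^l_{j;N} \Bigl(\sum_{m=1}^{M} p^m_{j;N}\, \bar g^m_{k,l} - \sum_{r,s=1}^{M} p^r_{j;N} p^s_{j;N}\, \bar g^r_k (\bar g^s_l)^T\Bigr).$$
It is readily seen that $\frac{1}{N} \sum_{j=1}^{N} \Sigma_{j;N} \to \Sigma$ as $N \to \infty$. To apply the CLT, we only need to verify the Lyapunov condition
$$\frac{1}{N^{1+\delta/2}} \sum_{j=1}^{N} \mathsf{E}\,|\zeta_{j;N}|^{2+\delta} \to 0. \tag{12}$$
Note that assumption (iii) implies $\sup_{1 \le j \le N,\ 1 \le m \le M,\ N > N_0} |a^m_{j;N}| < C_1$ for some $N_0$ and $C_1 < \infty$; thus,
$$\mathsf{E}\,|\zeta_{j;N}|^{2+\delta} \le (2 C_1)^{2+\delta}\, \mathsf{E}\,|g(\xi_{j;N})|^{2+\delta}, \tag{13}$$
where $g(x) = (g_1(x)^T, \dots, g_K(x)^T)^T$. Since $|g(\xi_{j;N})|^2 = \sum_{k=1}^{K} |g_k(\xi_{j;N})|^2$ and, by the Hölder inequality,
$$|g(\xi_{j;N})|^{2+\delta} \le K^{\delta/2} \sum_{k=1}^{K} |g_k(\xi_{j;N})|^{2+\delta},$$
we obtain
$$\mathsf{E}\,|g(\xi_{j;N})|^{2+\delta} \le K^{\delta/2} \sum_{k=1}^{K} \sum_{m=1}^{M} p^m_{j;N}\, \mathsf{E}\,|g_k(\eta_m)|^{2+\delta} \le C_2,$$
where the constant $C_2$ does not depend on $j$ and $N$. This, together with (13), yields (12). By the CLT we obtain $S_N \xrightarrow{d} N(O_d, \Sigma)$.

Therefore, By the Glivenko-Cantelli-type theorem for weighted empirical distributions (which can be derived, e.g., as a corollary of Theorem 2.4.2 in [7]) For any $h : \mathbb{R} \to \mathbb{R}$ and any interval $A \subseteq \mathbb{R}$, let $V_A(h)$ be the variation of $h$ on $A$. Take $A = (c_-, c_+)$. Then, under the assumptions of the theorem,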

Part (II).
Note that if the assumptions of this part hold for some A = (c − , c + ), then they will also hold for any new c − , c + such that A ⊂ (c − , c + ). Thus, we may assume that F m (c − ) < 1/4 and F m (c + ) > 3/4.
Consider the random event $B^1_N$ where Then $J_2 \xrightarrow{P} 0$ as in Part (I). Let us assume that the event $B^1_N$ occurred and bound If $h(x)$ is bounded as $x \to -\infty$, then we can take $c_- = -\infty$ and obtain $J_1 = 0$. Consider the case of unbounded $h$. Since $h$ is monotone, we have $h(x) \to +\infty$ or $h(x) \to -\infty$ as $x \to -\infty$. We will consider the first case; the reasoning for the second one is analogous. Thus, $h(x) \to +\infty$ as $x \to -\infty$, and we can take $h(x) > 0$ for $x < c_-$. By the inequality (16) in [10], where $\bar F(x) = \sum_{m=1}^{M} F_m(x)$, $C_1 < \infty$. Let us take $x_0, \dots, x_n, \dots$ such that $h(x_j) = 2^j h(c_-)$. By assumption (ii) and the Chebyshev inequality,

Proof of Theorem 4. This theorem is a simple corollary of Theorem 2 and the continuity theorem for weak convergence (Theorem 3B in Chapter 1 of [3]).
Therefore, $\hat s_N$ converges in distribution to the same limit as $s_N$, that is, to $\chi^2_L$.