## 1 Introduction

Mixture distributions are used in many fields of science for modeling data drawn from different subpopulations. An important medical application is the clustering of gene expression data to discover novel subgroups of a disease. This is a high-dimensional problem, and it is common to perform variable selection to obtain a subset of genes whose expression contributes to separating the subgroups. For two-class problems this may be carried out by fitting a univariate mixture distribution to each gene and singling out variables for which the overlap between the component distributions is small enough [27]. There are also multivariate methods for variable selection; these are more computationally demanding but take into account possible correlations between the genes and therefore reduce the loss of information inherent in the univariate approach, where each gene is modeled separately [9]. Further applications of mixtures can be found in image analysis [25], outlier detection [16], remote sensing [18], and epidemiology [24].

Karl Pearson [22] used the method of moments in a first attempt to estimate the parameters of a mixture distribution. Since then, the computational difficulty of the problem and the increasing number of applications have sparked a vast amount of theoretical and applied research. Maximum likelihood inference was simplified by the introduction of the expectation-maximization (EM) algorithm in the 1970s [7] and is to this day the most applied and studied approach, see [19] and references therein. Various modifications of the basic likelihood method have been proposed, aiming to overcome drawbacks resulting from the unbounded likelihood function and the sensitivity to outliers. For example, in [11] a family of divergences was introduced that is used as a generalization of the likelihood. There are also several variants of the EM-algorithm available, such as stochastic versions [4] and constrained formulations [15]. Minimum distance estimation is another family of parametric methods that has been applied extensively to mixtures, in particular due to its robustness against imprecise distributional assumptions [28, 5, 6]. Semiparametric techniques have also attracted much interest in this field, for example, estimation of location mixtures where only symmetry is imposed on the density components [2, 17]. For a comprehensive introduction to inference for mixtures, we refer to the monograph [26].

In this paper, we propose a hybrid approach for estimating the five parameters of a mixture of two densities that are symmetric about their means. The approach combines the method of moments with a minimum distance estimator based on a quadratic measure of deviation between the fitted and empirical distribution functions. The motivation behind our approach is to develop a robust algorithm that produces accurate estimates even when the parametric shape of the mixture distribution is misspecified, which is common in practice.

The paper is organized as follows. In Section 2, we introduce the hybrid estimator and describe how it is obtained from empirical data. Section 3 is devoted to a simulation study where the proposed estimator is evaluated and compared to a conventional maximum likelihood estimator obtained via the EM-algorithm. We consider the methods’ performance in estimating the unknown partition of a data set containing observations from two populations (model-based clustering), which is an important application of mixture distributions. We also evaluate the methods’ ability to estimate the mixing proportion. In Section 4, we report the results of a case study where the methods are applied to gene expression data from patients with acute leukemia. In Section 5, we discuss the results and draw some conclusions.

## 2 The moment-distance hybrid method

In this section we present the novel moment-distance hybrid estimator (HM-estimator) and describe how it can be used for model-based clustering. We consider the problem where the real-valued random variable

*X* has a two-component mixture distribution $F(\cdot )$ with density

\[ f(\cdot )=pf_{1}(\cdot )+(1-p)f_{2}(\cdot ),\]

where $0<p<1$ is the mixing proportion and $f_{i}(\cdot )=f_{i}(\cdot |\mu _{i},{\sigma _{i}^{2}})$ is the density of a random variable $X_{i}$ completely specified by its mean $\mu _{i}$ and variance ${\sigma _{i}^{2}}$, $i=1,2$. We assume that the third moment $E|X{|}^{3}$ is finite, and that the component densities $f_{1}(\cdot )$ and $f_{2}(\cdot )$ are *symmetric* about their means. A mixture of two bounded and symmetric densities has these properties, for example a two-component normal mixture. Let $\theta =(p,\mu _{1},\mu _{2},{\sigma _{1}^{2}},{\sigma _{2}^{2}})$ denote the parameter vector for $F(\cdot )$.

The HM-estimator, denoted $\hat{\theta }_{\mathit{HM}}$, is an estimator of *θ* that combines the method of moments and the minimum distance method. The method of moments is used to reduce the parameter space, and the minimum distance approach, which aims to minimize the distance between the fitted model and the empirical distribution, is used to obtain the estimator $\hat{\theta }_{\mathit{HM}}$.

### 2.1 Definition of the HM-estimator

An estimate $\hat{\theta }=(\hat{p},\hat{\mu }_{1},\hat{\mu }_{2},{\hat{\sigma }_{1}^{2}},{\hat{\sigma }_{2}^{2}})$ of *θ* based on a sample $x_{1},\dots ,x_{n}$ is called *relevant* if

\[\begin{array}{r@{\hskip0pt}l}& \displaystyle 0<\hat{p}<1,\\{} & \displaystyle {\hat{\sigma }_{1}^{2}},{\hat{\sigma }_{2}^{2}}>0,\\{} & \displaystyle \min (x_{1},\dots ,x_{n})\le \hat{\mu }_{i}\le \max (x_{1},\dots ,x_{n}),\hspace{1em}i=1,2.\end{array}\]

Let *Ω* denote the set of all relevant estimates of *θ*. The method of moments is applied to reduce *Ω* to a subset ${\varOmega ^{\prime }}$ of lower dimension. The first three moments of *X* can be expressed as

##### (1)

\[\begin{array}{r@{\hskip0pt}l}\displaystyle \nu _{1}& \displaystyle :=E(X)=p\mu _{1}+(1-p)\mu _{2},\end{array}\]

##### (2)

\[\begin{array}{r@{\hskip0pt}l}\displaystyle \nu _{2}& \displaystyle :=E\big({X}^{2}\big)=p\big({\mu _{1}^{2}}+{\sigma _{1}^{2}}\big)+(1-p)\big({\mu _{2}^{2}}+{\sigma _{2}^{2}}\big),\end{array}\]

##### (3)

\[\begin{array}{r@{\hskip0pt}l}\displaystyle \nu _{3}& \displaystyle :=E\big({X}^{3}\big)=p\big({\mu _{1}^{3}}+3\mu _{1}{\sigma _{1}^{2}}\big)+(1-p)\big({\mu _{2}^{3}}+3\mu _{2}{\sigma _{2}^{2}}\big),\end{array}\]

where the last equality relies on the symmetry of the component densities $f_{1}(\cdot )$ and $f_{2}(\cdot )$, see Appendix A.1 for details. Following the method of moments, we replace the parameters in (1)–(3) by their estimators while equating the theoretical moments $\{\nu _{i}\}$ with their sample counterparts

\[ \hat{\nu }_{1}=\frac{1}{n}{\sum \limits_{i=1}^{n}}x_{i},\hspace{2em}\hat{\nu }_{2}=\frac{1}{n}{\sum \limits_{i=1}^{n}}{x_{i}^{2}},\hspace{2em}\hat{\nu }_{3}=\frac{1}{n}{\sum \limits_{i=1}^{n}}{x_{i}^{3}}.\]

These sample moments can be highly variable, so below we suggest replacing them with the more robust trimmed means, denoted $\{{\hat{\nu }_{k}^{\ast }}\},k=1,2,3$; see Section 3.1 for further details. We get the following underdetermined system of equations:

##### (4)

\[\begin{array}{r@{\hskip0pt}l}\displaystyle {\hat{\nu }_{1}^{\ast }}& \displaystyle =\hat{p}\hat{\mu }_{1}+(1-\hat{p})\hat{\mu }_{2},\end{array}\]

##### (5)

\[\begin{array}{r@{\hskip0pt}l}\displaystyle {\hat{\nu }_{2}^{\ast }}& \displaystyle =\hat{p}\big({\hat{\mu }_{1}^{2}}+{\hat{\sigma }_{1}^{2}}\big)+(1-\hat{p})\big({\hat{\mu }_{2}^{2}}+{\hat{\sigma }_{2}^{2}}\big),\end{array}\]

##### (6)

\[\begin{array}{r@{\hskip0pt}l}\displaystyle {\hat{\nu }_{3}^{\ast }}& \displaystyle =\hat{p}\big({\hat{\mu }_{1}^{3}}+3\hat{\mu }_{1}{\hat{\sigma }_{1}^{2}}\big)+(1-\hat{p})\big({\hat{\mu }_{2}^{3}}+3\hat{\mu }_{2}{\hat{\sigma }_{2}^{2}}\big).\end{array}\]

The set ${\varOmega ^{\prime }}\subseteq \varOmega $ consists of all relevant estimates which solve the system (4)–(6). We define the HM-estimator as an element of ${\varOmega ^{\prime }}$ with a minimum distance criterion as

##### (7)

\[ \hat{\theta }_{\mathit{HM}}=\underset{\hat{\theta }\in {\varOmega ^{\prime }}}{\operatorname{arg\,min}}d\big(F(\cdot |\hat{\theta }),F_{n}(\cdot )\big),\]

where the function $d(F(\cdot |\hat{\theta }),F_{n}(\cdot ))$ measures the distance between the fitted model distribution $F(\cdot |\hat{\theta })$ and the empirical distribution $F_{n}(\cdot )$ of the sample. In this paper we use an $L_{2}$-type measure given by

\[ d\big(F(\cdot |\theta ),F_{n}(\cdot )\big)=\frac{1}{n}{\sum \limits_{i=1}^{n}}{\big(F(x_{i}|\theta )-F_{n}(x_{i})\big)}^{2}=\frac{1}{n}{\sum \limits_{i=1}^{n}}{\big(F(x_{(i)}|\theta )-i/n\big)}^{2},\]

where $x_{(1)},\dots ,x_{(n)}$ is the ordered sample. It should be noted that this choice of distance is mainly due to its computational simplicity and that a number of different measures can be considered (see [26], chap. 4). Next we describe how the HM-estimator is obtained in practice, via a reformulation of definition (7) that is more useful for computation.
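For concreteness, the $L_{2}$-type distance is straightforward to evaluate once the mixture CDF can be computed. The following sketch is our illustration, not the authors' code; it assumes normal components, so the mixture CDF is available through the error function:

```python
import math

def norm_cdf(x, mu, sigma2):
    # CDF of N(mu, sigma2) via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * sigma2)))

def mixture_cdf(x, p, mu1, mu2, s1, s2):
    # Two-component mixture distribution F(x | theta).
    return p * norm_cdf(x, mu1, s1) + (1.0 - p) * norm_cdf(x, mu2, s2)

def l2_distance(theta, sample):
    # d(F(.|theta), F_n) = (1/n) * sum_i (F(x_(i)|theta) - i/n)^2,
    # evaluated over the ordered sample.
    p, mu1, mu2, s1, s2 = theta
    xs = sorted(sample)
    n = len(xs)
    return sum((mixture_cdf(x, p, mu1, mu2, s1, s2) - (i + 1) / n) ** 2
               for i, x in enumerate(xs)) / n
```

The HM-estimator then picks, among the candidate moment solutions, the parameter vector with the smallest such value.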

### 2.2 How to compute the HM-estimator

In this subsection we describe how the HM-estimator (7) can be obtained in practice. To get a representation of the solutions of the system (4)–(6), we reparametrize the problem by introducing the proportion $\hat{r}$, defined in equation (8). Equations (4), (5), and (8) can be used to eliminate $\hat{\mu }_{2}$, ${\hat{\sigma }_{1}^{2}}$, ${\hat{\sigma }_{2}^{2}}$ in equation (6), and as a result we obtain

##### (9)

\[\begin{array}{r@{\hskip0pt}l}\displaystyle \beta _{3}& \displaystyle {\hat{\mu }_{1}^{3}}+\beta _{2}{\hat{\mu }_{1}^{2}}+\beta _{1}\hat{\mu }_{1}+\beta _{0}=0,\end{array}\]

where the coefficients are functions of $\hat{p}$ and $\hat{r}$ such that

\[\begin{array}{r@{\hskip0pt}l}\displaystyle \beta _{0}& \displaystyle =-{\hat{\nu }_{3}^{\ast }}+\frac{3{\hat{\nu }_{1}^{\ast }}{\hat{\nu }_{2}^{\ast }}\hat{r}}{\hat{p}+\hat{r}-\hat{p}\hat{r}}+{\big({\hat{\nu }_{1}^{\ast }}\big)}^{3}\frac{3\hat{p}-2(\hat{p}+\hat{r}-\hat{p}\hat{r})}{{(1-\hat{p})}^{2}(\hat{p}+\hat{r}-\hat{p}\hat{r})},\\{} \displaystyle \beta _{1}& \displaystyle =3{\hat{\nu }_{2}^{\ast }}\frac{\hat{r}-\hat{p}+\hat{r}-\hat{p}\hat{r}}{(1-\hat{p})(\hat{p}+\hat{r}-\hat{p}\hat{r})}-{\big({\hat{\nu }_{1}^{\ast }}\big)}^{2}\frac{3\hat{p}(2\hat{p}-2(\hat{p}+\hat{r}-\hat{p}\hat{r})+1)}{{(1-\hat{p})}^{2}(\hat{p}+\hat{r}-\hat{p}\hat{r})},\\{} \displaystyle \beta _{2}& \displaystyle =3{\hat{\nu }_{1}^{\ast }}\frac{\hat{p}(2\hat{p}-{\hat{p}}^{2}+{\hat{p}}^{2}\hat{r}-\hat{r})}{{(1-\hat{p})}^{2}(\hat{p}+\hat{r}-\hat{p}\hat{r})},\\{} \displaystyle \beta _{3}& \displaystyle =-\frac{\hat{p}(2\hat{p}-{\hat{p}}^{2}+{\hat{p}}^{2}\hat{r}-\hat{r})}{{(1-\hat{p})}^{2}(\hat{p}+\hat{r}-\hat{p}\hat{r})}.\end{array}\]

If $\hat{p}$ and $\hat{r}$ are given, we see that (9) is a cubic equation for $\hat{\mu }_{1}$, and so the reparameterized system has at most three solutions that correspond to relevant estimates. Furthermore, by combining (4), (5), and (8), the estimated parameters $\hat{\mu }_{2}$ and ${\hat{\sigma }_{1}^{2}}$ are obtained from equations (10) and (11).

Define *M* to be the set of all pairs $(\hat{p},\hat{r})$ for which at least one relevant estimate exists, and let $T(\hat{p},\hat{r})$ contain all relevant estimates corresponding to the pair $(\hat{p},\hat{r})\in M$. From the definitions of ${\varOmega ^{\prime }}$, *M*, and $T(\hat{p},\hat{r})$ it follows that

##### (12)

\[ \underset{\hat{\theta }\in {\varOmega ^{\prime }}}{\min }d\big(F(\cdot |\hat{\theta }),F_{n}(\cdot )\big)=\underset{(\hat{p},\hat{r})\in M}{\min }g(\hat{p},\hat{r}),\]

where the function

\[ g(\hat{p},\hat{r})=\underset{\hat{\theta }\in T(\hat{p},\hat{r})}{\min }d\big(F(\cdot |\hat{\theta }),F_{n}(\cdot )\big),\hspace{1em}(\hat{p},\hat{r})\in M,\]

is straightforward to compute since $T(\hat{p},\hat{r})$ contains at most three elements and these are solutions of the polynomial equations (8)–(11). Equation (12) reformulates the problem of deriving the HM-estimator (7) as a minimization problem for the bivariate function $g(\hat{p},\hat{r})$, $(\hat{p},\hat{r})\in M$. Let $(\hat{p}_{\mathit{HM}},\hat{r}_{\mathit{HM}})$ denote the point that minimizes $g(\hat{p},\hat{r})$ over $(\hat{p},\hat{r})\in M$. Then the HM-estimator $\hat{\theta }_{\mathit{HM}}$ is the element of $T(\hat{p}_{\mathit{HM}},\hat{r}_{\mathit{HM}})$ that attains the minimum distance.

The minimization of $g(\hat{p},\hat{r})$ can be carried out using a numerical optimization algorithm. Here we use the simplex algorithm in [21], which is implemented in the optim routine in `R` [23]. The starting value is found as the minimizer of $g(\hat{p},\hat{r})$ over a finite grid of values. A schematic description of how the HM-estimator is computed is given in Figure 1.

##### Fig. 1.

A schematic description of how the HM-estimator is obtained. The user provides a grid with values $(\hat{p},\hat{r})$. The method of moments is used to obtain all relevant estimates corresponding to the grid-points. The minimum distance approach is used to select the “best” of those estimates, which gives the starting point $({\hat{p}}^{(0)},{\hat{r}}^{(0)})$. Local grid optimization, minimizing the function $g(\hat{p},\hat{r})$, with $({\hat{p}}^{(0)},{\hat{r}}^{(0)})$ as the starting value gives the HM-estimator. Note that the optimization step may be unnecessary if the grid is very dense
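The two-stage scheme of Figure 1 — a coarse grid start followed by local optimization — can be sketched generically. In this sketch (ours, not the paper's R implementation), `solve` is a hypothetical stand-in for a routine returning the relevant solutions $T(\hat{p},\hat{r})$ of the moment system, and `distance` a stand-in for $d(F(\cdot |\hat{\theta }),F_{n}(\cdot ))$:

```python
import itertools

def g(p, r, solve, distance):
    # g(p, r): the smallest distance d over the candidate estimates in
    # T(p, r); returns None when no relevant estimate exists for this pair.
    return min((distance(theta) for theta in solve(p, r)), default=None)

def grid_start(solve, distance, p_grid, r_grid):
    # First stage of Figure 1: evaluate g over the user-supplied grid and
    # return the best pair (p0, r0), to be used as the starting value for
    # the local (e.g. Nelder-Mead simplex) optimization.
    best_val, best_arg = float("inf"), None
    for p, r in itertools.product(p_grid, r_grid):
        val = g(p, r, solve, distance)
        if val is not None and val < best_val:
            best_val, best_arg = val, (p, r)
    return best_arg, best_val
```

With a dense enough grid the second, local stage may already be unnecessary, as the figure caption notes.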

### 2.3 Model-based clustering

The mixture density $f(\cdot )$ is typically used to model a data set $x_{1},\dots ,x_{n}$ where $n_{1}$ of the values are observations from component $f_{1}(\cdot )$ and the remaining $n_{2}=n-n_{1}$ are observations from $f_{2}(\cdot )$. For such a sample, we can introduce a 0–1 vector $\mathbf{z}=(z_{1},\dots ,z_{n})$ that correctly assigns each observation to either $f_{1}(\cdot )$ (1’s) or $f_{2}(\cdot )$ (0’s). This vector defines a true partition of the observations with respect to the density components. Usually the components represent distinct subpopulations.
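Anticipating the responsibilities defined in (13) below, the soft assignment is a one-line Bayes computation once an estimate $\hat{\theta }$ is available. A minimal sketch (ours, assuming normal components):

```python
import math

def normal_pdf(x, mu, sigma2):
    # Density of N(mu, sigma2).
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def responsibilities(sample, p, mu1, mu2, s1, s2):
    # z_i estimates the probability that x_i was drawn from component f1,
    # i.e. the posterior membership probability of eq. (13).
    z = []
    for x in sample:
        a = p * normal_pdf(x, mu1, s1)
        b = (1.0 - p) * normal_pdf(x, mu2, s2)
        z.append(a / (a + b))
    return z
```

Thresholding $\hat{z}_{i}$ at 0.5 turns the soft partition into a hard one when a hard clustering is required.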

The true partition defined by $\mathbf{z}=(z_{1},\dots ,z_{n})$ is unobservable but can be estimated with the posterior membership probabilities, also known as the *responsibilities*, denoted by $\hat{\mathbf{z}}=(\hat{z}_{1},\dots ,\hat{z}_{n})$, where

##### (13)

\[ \hat{z}_{i}=\frac{\hat{p}f_{1}(x_{i}|\hat{\theta }_{1})}{f(x_{i}|\hat{\theta })}=\frac{\hat{p}f_{1}(x_{i}|\hat{\theta }_{1})}{\hat{p}f_{1}(x_{i}|\hat{\theta }_{1})+(1-\hat{p})f_{2}(x_{i}|\hat{\theta }_{2})},\hspace{1em}i=1,\dots ,n.\]

The value $\hat{z}_{i}$ lies in the interval $[0,1]$ and estimates the probability that $x_{i}$ is an observation from component $f_{1}(\cdot )$. The vector $\hat{\mathbf{z}}=(\hat{z}_{1},\dots ,\hat{z}_{n})$ defines a so-called *soft* partition of the data $x_{1},\dots ,x_{n}$, which serves as an approximation of the true partition given by $\mathbf{z}=(z_{1},\dots ,z_{n})$. The responsibilities in (13) can be obtained for any estimator $\hat{\theta }$ of *θ*, e.g. the proposed HM-estimator or the maximum likelihood (ML) estimator used in the numerical studies in Sections 3 and 4.

## 3 Simulation study

This section presents a simulation study where the proposed hybrid method (HM) is compared with a conventional ML-estimator derived via the EM-algorithm. We investigate the methods’ performance in model-based clustering and their accuracy in estimating the mixing proportion. Particular attention is paid to the consequences of calculating the estimators under incorrect model assumptions.

### 3.1 Data and estimation

In the simulations, we restrict ourselves to the case where the component densities $f_{1}(\cdot )$ and $f_{2}(\cdot )$ belong to the same family of distributions. The estimators are calculated under the assumption that $f_{1}(\cdot )$ and $f_{2}(\cdot )$ are normal densities, which is a common assumption in practice. The data are generated from normal mixtures, for which the assumption is true, and also from mixtures of Laplace, logistic and contaminated Gaussian distributions (for details, see Appendix A.2). For the contaminated Gaussian distribution, we set the larger prior probability to $\alpha =0.9$ and the variance proportion parameter to $\eta =16$. The experiment thus includes both modest and large departures from the normal mixture assumption, allowing us to analyze the robustness of the methods with respect to imprecise model specifications.
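The exact parametrization of the contaminated Gaussian used here is deferred to Appendix A.2, but a common form consistent with the parameters $\alpha $ and $\eta $ above mixes a nominal normal with a variance-inflated one; the following sketch is ours, under that assumption:

```python
import math

def normal_pdf(x, mu, sigma2):
    # Density of N(mu, sigma2).
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def contaminated_normal_pdf(x, mu, sigma2, alpha=0.9, eta=16.0):
    # With probability alpha the observation comes from N(mu, sigma2), with
    # probability 1 - alpha from the variance-inflated N(mu, eta * sigma2).
    # The result is symmetric about mu but heavier-tailed than a normal.
    return alpha * normal_pdf(x, mu, sigma2) + (1.0 - alpha) * normal_pdf(x, mu, eta * sigma2)
```

Because the density stays symmetric about $\mu $, such components satisfy the symmetry assumption of Section 2 while still departing sharply from normality in the tails.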

Besides varying the family of the component densities, we consider six configurations of the parameter vector

*θ* which correspond to a variety of shapes of the mixture distributions. In addition to these configurations, a negative control with a non-mixture distribution was added. The values are given in Table 1 and displayed graphically in Figure 2. Three sample sizes are considered: $n=50$, 100, and 500.

##### Table 1.

The configurations (i)–(vii) of the parameter vector

*θ* used in the simulations

| Parameter | (i) | (ii) | (iii) | (iv) | (v) | (vi) | (vii) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $\mu _{1}$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| $\mu _{2}$ | 2 | 3 | 3 | 4 | 4 | 3 | 0 |
| ${\sigma _{1}^{2}}$ | 1 | 1 | 1 | 4 | 4 | 9 | 1 |
| ${\sigma _{2}^{2}}$ | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| *p* | 0.50 | 0.50 | 0.25 | 0.50 | 0.25 | 0.50 | 0.10–0.50 |

##### Fig. 2.

The mixture distributions used in the simulations: four distribution families – mixtures of normal, logistic, Laplace, and contaminated normal distributions – and six parameter configurations (i)–(vi)

The mixture data were generated as follows: first we simulated the (true) partition vector $\mathbf{z}=(z_{1},\dots ,z_{n})$ from the 0–1 variable *Z* with $P(Z=1)=p$ and $P(Z=0)=1-p$. Then ${x_{1}^{(1)}},\dots ,{x_{n}^{(1)}}$ and ${x_{1}^{(2)}},\dots ,{x_{n}^{(2)}}$ were simulated from the component densities $f_{1}(\cdot )$ and $f_{2}(\cdot )$, respectively. Finally, a sample $x_{1},\dots ,x_{n}$ from the mixture density $f(\cdot )$ was obtained as

\[ x_{i}=z_{i}{x_{i}^{(1)}}+(1-z_{i}){x_{i}^{(2)}},\hspace{1em}i=1,\dots ,n.\]

We generated $N=500$ data sets for each considered scenario (mixture family, parameter configuration, and sample size), and for each scenario we obtained 500 independent realizations ${\hat{\theta }}^{(1)},\dots ,{\hat{\theta }}^{(500)}$ of an estimator $\hat{\theta }$, which were used for the statistical evaluation of the performance of the considered methods in clustering and in estimation of the mixing proportion.

### Computing the estimators

The hybrid estimator $\hat{\theta }_{\mathit{HM}}$ was obtained as described in the previous section. The starting value for the simplex method was found as the minimizer of $g(\hat{p},\hat{r})$ over a two-dimensional grid constructed from 10 values of $\hat{p}$ and 200 values of $\hat{r}$. The values of $\hat{p}$ and $\hat{r}$ in the grid were evenly distributed in the intervals $(0,1)$ and $[0,20]$, respectively. We used trimmed versions of the sample moments $\hat{\nu }_{k},k=1,2,3$: the 2.5% smallest and the 2.5% largest of the values ${x_{1}^{k}},\dots ,{x_{n}^{k}}$ were removed, and the mean ${\hat{\nu }_{k}^{\ast }}$ of the resulting trimmed sample was used as the estimator of the *k*th moment $\nu _{k}$.

The maximum likelihood estimator $\hat{\theta }_{\mathit{ML}}$ was calculated with the EM-algorithm using the mixtools package in R [1]. The EM-algorithm converges to a local maximum of the likelihood function that depends on its starting value. We chose ten starting values randomly as described in [14], and the estimator $\hat{\theta }_{\mathit{ML}}$ was taken as the maximizer of the likelihood among the corresponding points of convergence.
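The trimming rule described above (drop the 2.5% smallest and 2.5% largest of the powered values, then average the rest) can be sketched as follows; the helper name is ours:

```python
def trimmed_moment(sample, k, trim=0.025):
    # Robust estimate of the k-th moment nu_k: sort x_i^k, drop the trim
    # fraction at each end, and average what remains.
    powered = sorted(x ** k for x in sample)
    n = len(powered)
    cut = int(n * trim)
    kept = powered[cut:n - cut] if n - 2 * cut > 0 else powered
    return sum(kept) / len(kept)
```

With `trim=0` this reduces to the ordinary sample moment $\hat{\nu }_{k}$.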

### 3.2 Evaluation of the methods’ clustering performance

For a simulated dataset $x_{1},\dots ,x_{n}$ the true partition $\mathbf{z}=(z_{1},\dots ,z_{n})$ was known. The true partition was approximated by the soft partitions $\hat{\mathbf{z}}_{\mathit{HM}}$ and $\hat{\mathbf{z}}_{\mathit{ML}}$, calculated from the corresponding estimates $\hat{\theta }_{\mathit{HM}}$ and $\hat{\theta }_{\mathit{ML}}$, respectively, using Equation (13). To quantify the accuracy of an approximate partition, we used the *Fuzzy Adjusted Rand Index* (FARI) proposed in [3]. The FARI for **z** and its approximation $\hat{\mathbf{z}}$ – written as FARI($\hat{\mathbf{z}},\mathbf{z}$) – is a number in the interval $[-1,1]$ measuring their closeness; the higher the index, the better the approximation $\hat{\mathbf{z}}$. A brief description of this index is given in Appendix A.3.

Let

\[ \varDelta _{\mathit{FARI}}=\mathrm{FARI}(\mathbf{z},\hat{\mathbf{z}}_{\mathit{HM}})-\mathrm{FARI}(\mathbf{z},\hat{\mathbf{z}}_{\mathit{ML}})\]

denote the difference between the indices. Note that a positive difference $\varDelta _{\mathit{FARI}}>0$ implies that the partition obtained via the HM-estimator was more accurate than the partition obtained via the ML-estimator.

To determine if there was a significant difference between the methods’ clustering performance, we applied the *t*-test and the sign-test to the pairwise differences ${\varDelta _{\mathit{FARI}}^{(1)}},\dots ,{\varDelta _{\mathit{FARI}}^{(500)}}$ for the 500 simulated samples. We also compared the methods under the convention that a difference in FARI below a certain threshold is negligible, which was achieved by applying the sign-test only to the differences that satisfied $|{\varDelta _{\mathit{FARI}}^{(i)}}|>0.1$.

The considered scenarios corresponded to clustering problems of varying difficulty, and as part of our evaluations we quantified these difficulties using the index $\mathrm{FARI}_{\mathit{opt}}=\mathrm{FARI}(\mathbf{z},\hat{\mathbf{z}}_{\mathit{opt}})$. Here $\hat{\mathbf{z}}_{\mathit{opt}}$ denotes the optimal partition obtained when the true component densities and parameter values in (13) were used. The index corresponds to the clustering performance obtained under correct model assumptions and a perfect estimator of *θ*. For each scenario, we used the mean of ${\mathrm{FARI}_{\mathit{opt}}^{(1)}},\dots ,{\mathrm{FARI}_{\mathit{opt}}^{(500)}}$ for the 500 simulated samples to measure the difficulty of the problem and as a reference value for the corresponding FARI obtained by the HM- and ML-estimators.

### 3.3 Evaluation of the methods’ ability to estimate the mixing proportion

We compared the methods in terms of their accuracy for estimating the mixing proportion *p*. Details on how we defined the point estimators of *p* are given in Appendix A.4.

The following standard characteristics for evaluating an estimator $\hat{p}$ of *p* based on *N* simulations were used:

##### (14)

\[\begin{array}{r@{\hskip0pt}l}\displaystyle \mathit{mean}& \displaystyle =\frac{1}{N}{\sum \limits_{i=1}^{N}}{\hat{p}}^{(i)},\\{} \displaystyle \hat{\mathit{bias}}& \displaystyle =\frac{1}{N}{\sum \limits_{i=1}^{N}}\big({\hat{p}}^{(i)}-p\big),\\{} \displaystyle \hat{\mathit{MSE}}& \displaystyle =\frac{1}{N}{\sum \limits_{i=1}^{N}}{\big({\hat{p}}^{(i)}-p\big)}^{2}.\end{array}\]

These characteristics were calculated for the simulated estimates ${\{{\hat{p}_{\mathit{HM}}^{(i)}}\}_{i=1}^{N}}$ and ${\{{\hat{p}_{\mathit{ML}}^{(i)}}\}_{i=1}^{N}}$ obtained from the HM- and ML-estimators, respectively.

To determine if there was a significant difference in efficiency between the methods, we applied the *t*-test to the difference in estimated mean squared error $\varDelta _{\mathit{MSE}}=\hat{\mathit{MSE}}_{\mathit{ML}}-\hat{\mathit{MSE}}_{\mathit{HM}}$, where the subscripts HM and ML refer to the methods used in (14). Note that a positive value of $\varDelta _{\mathit{MSE}}$ suggests that the hybrid method was the more efficient estimator of *p*.

### 3.4 Results

This section includes a detailed treatment of the results for sample size $n=50$. The results for sample sizes $n=100$ and $n=500$ led to similar conclusions, and are given in Appendix A.5.

The HM- and ML-estimators were evaluated mainly by their ability to cluster the samples in agreement with the true partition vector

**z** and their ability to estimate the proportion parameter *p*. The corresponding estimators of the difference in means $\mu _{2}-\mu _{1}$ were compared in a similar way as for the mixing proportion *p*. This was done for data generated from a normal mixture distribution (i.e. under correct model assumptions), a logistic mixture distribution (modest violation of the model assumptions), and a Laplace mixture distribution or a contaminated Gaussian distribution (serious violations of the model assumptions). For each case, six values of the parameter vector were evaluated, see Table 1 and Figure 2.

#### 3.4.1 Clustering performance

The relative performance of the methods was evaluated by considering the mean difference in FARI ($\overline{\varDelta }_{\mathit{FARI}}$), where a positive value indicated an advantage of the HM-estimator, the proportion of samples that were more accurately clustered by the HM-estimator than by the ML-estimator ($\mathit{prop}_{\mathit{HM}}$), and the proportion of the observed considerable differences in FARI which were in favor of the HM-estimator ($\mathit{propC}_{\mathit{HM}}$), see Section 3.2 for further details.

Scenarios (i) and (vi) were hard clustering problems in the sense that the mean optimal FARI was low in all the cases: $\mathit{mean}_{\mathit{opt}}\in [0.30,0.58]$, whereas the other scenarios (ii, iii, iv, v) corresponded to relatively easy clustering problems: $\mathit{mean}_{\mathit{opt}}\in [0.60,0.81]$, Table 2.

In the case where data were generated from normal mixtures most of the observed differences were significant but of varying magnitude: $\overline{\varDelta }_{\mathit{FARI}}\in [-0.02,0.08]$, $\mathit{prop}_{\mathit{HM}}\hspace{0.1667em}\in [0.35,0.68]$ and $\mathit{propC}_{\mathit{HM}}\hspace{0.1667em}\in [0.40,0.92]$. The HM-estimator performed significantly better than the ML-estimator for scenarios (i, ii, iv, vi) and significantly worse for (iii, v), but the differences were rather small for (iii, v, vi), Figure 3 and Table 3.

##### Table 2.

The average clustering performance of the hybrid method (HM) and the maximum likelihood (ML) method. 500 samples with 50 observations each were generated from four mixture distributions (normal, logistic, Laplace and contaminated Gaussian) with the parameter configurations (i)–(vi). The fuzzy adjusted Rand index (FARI) was obtained for each sample and estimator. The mean FARI was observed for each scenario, and the mean of the optimal FARI (opt.) obtained using the true mixture distribution serves as a reference

Mean of fuzzy adjusted Rand index, $n=50$

| Mixture | Config. | $\mathit{HM}$ | $\mathit{ML}$ | $\mathit{opt}.$ |
| --- | --- | --- | --- | --- |
| Normal | (i) | 0.28 | 0.21 | 0.35 |
| | (ii) | 0.59 | 0.52 | 0.68 |
| | (iii) | 0.54 | 0.56 | 0.70 |
| | (iv) | 0.57 | 0.53 | 0.60 |
| | (v) | 0.64 | 0.65 | 0.70 |
| | (vi) | 0.29 | 0.27 | 0.30 |
| Logistic | (i) | 0.41 | 0.24 | 0.46 |
| | (ii) | 0.71 | 0.60 | 0.71 |
| | (iii) | 0.65 | 0.52 | 0.71 |
| | (iv) | 0.64 | 0.53 | 0.65 |
| | (v) | 0.67 | 0.57 | 0.73 |
| | (vi) | 0.34 | 0.26 | 0.39 |
| Laplace | (i) | 0.32 | 0.22 | 0.40 |
| | (ii) | 0.66 | 0.57 | 0.70 |
| | (iii) | 0.59 | 0.54 | 0.70 |
| | (iv) | 0.60 | 0.53 | 0.63 |
| | (v) | 0.67 | 0.64 | 0.72 |
| | (vi) | 0.30 | 0.27 | 0.33 |
| Contaminated | (i) | 0.44 | 0.26 | 0.58 |
| | (ii) | 0.78 | 0.64 | 0.80 |
| | (iii) | 0.74 | 0.65 | 0.81 |
| | (iv) | 0.73 | 0.60 | 0.71 |
| | (v) | 0.75 | 0.67 | 0.70 |
| | (vi) | 0.39 | 0.33 | 0.42 |

With the data generated from logistic mixtures, the HM-estimator outperformed the ML-estimator for all parameter configurations, and most of the observed differences and evaluation measures were significant; $\overline{\varDelta }_{\mathit{FARI}}\in [0.03,0.10],\hspace{0.1667em}\mathit{prop}_{\mathit{HM}}\in [0.50,0.76]$, and $\mathit{propC}_{\mathit{HM}}\in [0.61,0.94]$, Figure 4 and Table 3. The largest differences were observed for scenarios (i, ii, iv), whereas the differences in scenarios (iii, v) were quite moderate. The magnitudes of the differences were similar to those obtained for normal mixture data, but in this case all of them indicated an advantage of the HM-estimator.

##### Fig. 3.

The FARI observed for the hybrid and maximum likelihood methods. 500 samples, each with 50 observations, are generated from normal mixture distributions with the parameter configurations (i)–(vi). Samples for which the hybrid method performs considerably better (worse) than the maximum likelihood estimator are in the upper (lower) shaded area. Points inside the white area mark samples that correspond to inconsiderable differences. A difference is regarded as considerable if the absolute difference in the methods’ FARI exceeds 0.1

##### Table 3.

The relative clustering performance of the hybrid method (HM) and the maximum likelihood (ML) method. 500 samples, each with 50 observations, are generated from four mixture distributions (normal, logistic, Laplace, and contaminated Gaussian) with the parameter configurations (i)–(vi). The fuzzy adjusted Rand index (FARI) is observed for each sample and estimator. For each scenario we observe: the mean of the differences between the observed average FARI values for the HM- and ML-estimators ($\overline{\varDelta }_{\mathit{FARI}}$) and the proportion of times the HM-estimator has a higher FARI value than the ML-estimator ($\mathit{prop}_{\mathit{HM}}$). A positive value of $\overline{\varDelta }_{\mathit{FARI}}$ indicates a mean difference in favor of the HM-estimator. In the third column we observe the number of times $n_{\mathit{HM}}$ ($n_{\mathit{ML}}$) the hybrid method performs considerably better (worse) than the maximum likelihood estimator. Here $\mathit{propC}_{\mathit{HM}}$ denotes the proportion of the samples with considerable differences for which the HM-estimator is superior. A difference is defined to be considerable if the distance between the methods’ FARI is larger than 0.1. For each evaluation measure we test whether the methods have the same average performance; the p-values relate to those tests

Comparison of HM and ML for soft clustering, $n=50$

| Mixture | Config. | $\overline{\varDelta }_{\mathit{FARI}}$ | p-value | $\mathit{prop}_{\mathit{HM}}$ | p-value | $\mathit{propC}_{\mathit{HM}}$ | p-value | $n_{\mathit{HM}}$ | $n_{\mathit{ML}}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Normal | (i) | 0.07 | 0.00 | 0.67 | 0.00 | 0.92 | 0.00 | 143 | 13 |
| | (ii) | 0.08 | 0.00 | 0.68 | 0.00 | 0.86 | 0.00 | 126 | 20 |
| | (iii) | −0.02 | 0.02 | 0.42 | 0.00 | 0.40 | 0.01 | 84 | 124 |
| | (iv) | 0.04 | 0.00 | 0.50 | 0.96 | 0.81 | 0.00 | 82 | 19 |
| | (v) | −0.01 | 0.04 | 0.35 | 0.00 | 0.47 | 0.63 | 51 | 57 |
| | (vi) | 0.02 | 0.00 | 0.55 | 0.02 | 0.83 | 0.00 | 69 | 14 |
| Logistic | (i) | 0.10 | 0.00 | 0.76 | 0.00 | 0.94 | 0.00 | 200 | 14 |
| | (ii) | 0.08 | 0.00 | 0.68 | 0.00 | 0.91 | 0.00 | 127 | 12 |
| | (iii) | 0.05 | 0.00 | 0.50 | 0.89 | 0.61 | 0.00 | 138 | 89 |
| | (iv) | 0.07 | 0.00 | 0.65 | 0.00 | 0.94 | 0.00 | 136 | 8 |
| | (v) | 0.03 | 0.00 | 0.56 | 0.01 | 0.67 | 0.00 | 111 | 55 |
| | (vi) | 0.03 | 0.00 | 0.62 | 0.00 | 0.80 | 0.00 | 73 | 18 |
| Laplace | (i) | 0.17 | 0.00 | 0.85 | 0.00 | 0.98 | 0.00 | 268 | 5 |
| | (ii) | 0.12 | 0.00 | 0.72 | 0.00 | 0.97 | 0.00 | 147 | 4 |
| | (iii) | 0.13 | 0.00 | 0.62 | 0.00 | 0.79 | 0.00 | 184 | 49 |
| | (iv) | 0.12 | 0.00 | 0.73 | 0.00 | 0.98 | 0.00 | 175 | 4 |
| | (v) | 0.10 | 0.00 | 0.67 | 0.00 | 0.81 | 0.00 | 192 | 45 |
| | (vi) | 0.08 | 0.00 | 0.78 | 0.00 | 0.89 | 0.00 | 161 | 21 |
| Contaminated | (i) | 0.17 | 0.00 | 0.81 | 0.00 | 0.96 | 0.00 | 264 | 11 |
| | (ii) | 0.14 | 0.00 | 0.71 | 0.00 | 0.93 | 0.00 | 174 | 13 |
| | (iii) | 0.08 | 0.00 | 0.55 | 0.04 | 0.72 | 0.00 | 161 | 61 |
| | (iv) | 0.12 | 0.00 | 0.75 | 0.00 | 0.94 | 0.00 | 212 | 14 |
| | (v) | 0.08 | 0.00 | 0.66 | 0.00 | 0.85 | 0.00 | 182 | 31 |
| | (vi) | 0.06 | 0.00 | 0.65 | 0.00 | 0.82 | 0.00 | 157 | 34 |

In the cases where the data were simulated from a mixture of Laplace or contaminated Gaussian distributions, the HM-estimator outperformed the ML-estimator and all the observed differences were significant: $\overline{\varDelta }_{\mathit{FARI}}\in [0.06,0.17]$, $\mathit{prop}_{\mathit{HM}}\in [0.55,0.85]$ and $\mathit{propC}_{\mathit{HM}}\in [0.72,0.98]$, Figures 5–6 and Table 3. Overall, the differences were more distinct than in the cases of normal and logistic mixtures.

Configuration (vii) defined a non-mixture distribution for which the desired result would be an average FARI value around zero and few high FARI values. Overall, both methods performed as expected and no clear differences between the methods were observed, with the exception that the ML method was more variable in the case $n=500$, see Figures 8–10 in Appendix A.7.

##### Fig. 4.

The FARI observed for the hybrid and maximum likelihood methods. 500 samples, each with 50 observations, are generated from logistic mixture distributions with the parameter configurations (i)–(vi). Samples for which the hybrid method performs considerably better (worse) than the maximum likelihood estimator are in the upper (lower) shaded area. Points inside the white area mark samples that correspond to inconsiderable differences. A difference is regarded as considerable if the absolute difference in the methods’ FARI exceeds 0.1

##### Fig. 5.

The FARI observed for the hybrid and maximum likelihood methods. 500 samples, each with 50 observations, are generated from Laplace mixture distributions with the parameter configurations (i)–(vi). Samples for which the hybrid method performs considerably better (worse) than the maximum likelihood estimator are in the upper (lower) shaded area. Points inside the white area mark samples that correspond to inconsiderable differences. A difference is regarded as considerable if the absolute difference in the methods' FARI exceeds 0.1.

##### Fig. 6.

The FARI observed for the hybrid and maximum likelihood methods. 500 samples, each with 50 observations, are generated from contaminated Gaussian mixture distributions with the parameter configurations (i)–(vi). Samples for which the hybrid method performs considerably better (worse) than the maximum likelihood estimator are in the upper (lower) shaded area. Points inside the white area mark samples that correspond to inconsiderable differences. A difference is regarded as considerable if the absolute difference in the methods' FARI exceeds 0.1.

#### 3.4.2 Estimation of the proportion parameter *p*

The methods' ability to estimate the mixing proportion *p* was evaluated using the mean, bias and mean squared error (MSE) of the corresponding point estimators of *p*, and their relative efficiency was analyzed via the estimated difference in MSE; see Section 3.3 for further details.

The HM-estimator (of the proportion parameter *p*) had lower MSE than the ML-estimator in almost all considered scenarios, and most of the observed differences were significant, Table 4. The largest differences were observed when the model assumptions were seriously violated and the data were generated by a Laplace mixture, and the smallest differences were observed for data generated by a normal mixture, Table 4. The observed MSE varied between the six parameter vectors, with scenarios (i) and (vi) having the highest MSE values. Recall that these were the most difficult scenarios in terms of clustering. Furthermore, the advantage of the HM-estimator was most prominent for scenario (i), which is also in agreement with the clustering results. Investigating the precision of the methods via the magnitude of the observed bias revealed that the ML-estimator was more precise than the HM-estimator when the distributional assumption was valid, and that the methods had similar precision when the assumption was violated, Table 4. The results obtained for estimating the difference in means $\mu _{2}-\mu _{1}$ resembled those obtained for the mixing proportion *p*, see Tables 11–13 in Appendix A.6.

##### Table 4.

The accuracy of the HM- and ML-estimators with regard to estimating the proportion parameter *p*. 500 samples, each with 50 observations, are generated from four mixture distributions (normal, logistic, Laplace and contaminated Gaussian) with the parameter configurations (i)–(vi). The parameter *p* is estimated for each sample. For each scenario we observe: the true parameter value (*p*), the average estimate (mean), the average deviation from the true value (bias), the mean squared error (MSE), the difference between the ML-MSE and HM-MSE values ($\varDelta _{\hat{\mathit{MSE}}}$) and a *p*-value for the test that this difference is significant. A positive (negative) value of $\varDelta _{\hat{\mathit{MSE}}}$ indicates that the HM-estimator is more (less) efficient than the ML-estimator of *p*.

Estimation of the mixing proportion, $n=50$

| Data | | True *p* | Mean (HM) | Mean (ML) | $\hat{\mathit{bias}}$ (HM) | $\hat{\mathit{bias}}$ (ML) | $\hat{\mathit{MSE}}$ (HM) | $\hat{\mathit{MSE}}$ (ML) | $\varDelta _{\hat{\mathit{MSE}}}$ | p-val |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Normal | (i) | 0.50 | 0.543 | 0.483 | 0.043 | −0.017 | 0.052 | 0.084 | 0.033 | 0.000 |
| | (ii) | 0.50 | 0.517 | 0.508 | 0.017 | 0.008 | 0.017 | 0.036 | 0.019 | 0.000 |
| | (iii) | 0.25 | 0.331 | 0.280 | 0.081 | 0.030 | 0.028 | 0.027 | −0.000 | 0.889 |
| | (iv) | 0.50 | 0.471 | 0.481 | −0.029 | −0.019 | 0.014 | 0.023 | 0.009 | 0.000 |
| | (v) | 0.25 | 0.238 | 0.246 | −0.012 | −0.004 | 0.011 | 0.013 | 0.002 | 0.104 |
| | (vi) | 0.50 | 0.362 | 0.423 | −0.138 | −0.077 | 0.040 | 0.041 | 0.001 | 0.617 |
| Logistic | (i) | 0.50 | 0.553 | 0.518 | 0.053 | 0.018 | 0.043 | 0.081 | 0.037 | 0.000 |
| | (ii) | 0.50 | 0.504 | 0.503 | 0.004 | 0.003 | 0.013 | 0.029 | 0.016 | 0.000 |
| | (iii) | 0.25 | 0.314 | 0.338 | 0.064 | 0.088 | 0.022 | 0.040 | 0.017 | 0.000 |
| | (iv) | 0.50 | 0.488 | 0.528 | −0.012 | 0.028 | 0.012 | 0.024 | 0.012 | 0.000 |
| | (v) | 0.25 | 0.254 | 0.297 | 0.004 | 0.047 | 0.010 | 0.020 | 0.011 | 0.000 |
| | (vi) | 0.50 | 0.400 | 0.489 | −0.100 | −0.011 | 0.034 | 0.035 | 0.001 | 0.627 |
| Laplace | (i) | 0.50 | 0.533 | 0.511 | 0.033 | 0.011 | 0.029 | 0.074 | 0.044 | 0.000 |
| | (ii) | 0.50 | 0.484 | 0.498 | −0.016 | −0.002 | 0.011 | 0.025 | 0.015 | 0.000 |
| | (iii) | 0.25 | 0.295 | 0.383 | 0.045 | 0.133 | 0.014 | 0.048 | 0.034 | 0.000 |
| | (iv) | 0.50 | 0.491 | 0.561 | −0.009 | 0.061 | 0.011 | 0.025 | 0.014 | 0.000 |
| | (v) | 0.25 | 0.280 | 0.371 | 0.030 | 0.121 | 0.012 | 0.035 | 0.024 | 0.000 |
| | (vi) | 0.50 | 0.445 | 0.536 | −0.055 | 0.036 | 0.029 | 0.050 | 0.021 | 0.000 |
| Contaminated | (i) | 0.50 | 0.517 | 0.491 | 0.017 | −0.009 | 0.046 | 0.083 | 0.037 | 0.000 |
| | (ii) | 0.50 | 0.488 | 0.502 | −0.012 | 0.002 | 0.008 | 0.025 | 0.016 | 0.000 |
| | (iii) | 0.25 | 0.276 | 0.362 | 0.026 | 0.112 | 0.011 | 0.036 | 0.024 | 0.000 |
| | (iv) | 0.50 | 0.491 | 0.563 | −0.009 | 0.063 | 0.011 | 0.024 | 0.014 | 0.000 |
| | (v) | 0.25 | 0.257 | 0.359 | 0.007 | 0.109 | 0.009 | 0.024 | 0.014 | 0.000 |
| | (vi) | 0.50 | 0.426 | 0.527 | −0.074 | 0.027 | 0.033 | 0.040 | 0.006 | 0.051 |
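The accuracy measures reported in Table 4 (mean, bias, MSE and $\varDelta _{\hat{\mathit{MSE}}}$) can be sketched as follows, given a list of point estimates from repeated samples and the true parameter value. The function names are illustrative; the significance test for $\varDelta _{\hat{\mathit{MSE}}}$ (described in Section 3.3) is omitted here.

```python
def accuracy_summary(estimates, true_value):
    """Mean, bias and MSE of a point estimator over repeated samples.

    Illustrative sketch, not the authors' code.
    """
    n = len(estimates)
    mean = sum(estimates) / n
    bias = mean - true_value
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return mean, bias, mse

def delta_mse(est_hm, est_ml, true_value):
    """ML-MSE minus HM-MSE; a positive value favors the HM-estimator."""
    return (accuracy_summary(est_ml, true_value)[2]
            - accuracy_summary(est_hm, true_value)[2])
```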

## 4 Case study: clustering of acute leukemia data

### 4.1 Description of the data

In [13] a microarray experiment on human mRNA samples for measuring gene expression levels in two subtypes of acute leukemia is described, namely acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). The experiment contained $n=72$ samples, of which 47 were of type ALL and 25 were of type AML, and the expression levels of 6,817 genes were observed. In this study we used the preprocessed and filtered version of this data considered in [8], which contains the expression values of 3,571 genes.

### 4.2 Identification of differentially expressed genes

We applied a supervised procedure to obtain a subset of the 3,571 genes that were differentially expressed with respect to the ALL/AML grouping. For each gene *i* we calculated the normal mixture clustering ${\hat{\mathbf{z}}}^{(i)}$ in (13), with parameter estimates obtained under known group memberships (i.e. the sample means and variances). Then we used the measure $\mathrm{FARI}(\mathbf{z},{\hat{\mathbf{z}}}^{(i)})$, where **z** is the true ALL/AML grouping expressed as a 0–1 vector, to quantify the extent to which the mean expression of gene *i* differs between the groups. The 342 genes that met the criterion were regarded as truly differentially expressed genes and included in our test set. Alternatively, we could have used the two-sample *t*-test to make this selection.

We applied the HM and ML clustering methods to each of the 342 test variables to compare their ability to divide the 72 cancer samples into the ALL and AML groups. The analysis was carried out as described in Section 3.2.
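Equation (13) is not reproduced here, but the kind of per-gene computation described above can be sketched as follows: fuzzy two-group memberships from a two-component normal model whose parameters are fixed at the sample means, variances and proportions of the known ALL/AML groups. The function name and the exact normalization are our assumptions, not the paper's procedure.

```python
import math

def posterior_memberships(x, labels):
    """Fuzzy two-group memberships for a single gene's expression values.

    Illustrative sketch, not the paper's eq. (13): component parameters
    are fixed at the sample statistics of the two known groups, and each
    sample's posterior probability of belonging to group 1 is returned.
    """
    g0 = [v for v, l in zip(x, labels) if l == 0]
    g1 = [v for v, l in zip(x, labels) if l == 1]

    def stats(g):
        m = sum(g) / len(g)
        var = sum((v - m) ** 2 for v in g) / (len(g) - 1)
        return m, var

    (m0, v0), (m1, v1) = stats(g0), stats(g1)
    p1 = len(g1) / len(x)

    def phi(v, m, var):  # normal density
        return math.exp(-(v - m) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    return [p1 * phi(v, m1, v1)
            / (p1 * phi(v, m1, v1) + (1 - p1) * phi(v, m0, v0))
            for v in x]
```

The resulting membership vector would then be compared with the true 0–1 grouping via the FARI.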

### 4.3 Results

Independently of the method, the overall performance was rather poor; most of the clusterings had a FARI between 0 and 0.4, Figure 7. This implies that it is hard to cluster the samples accurately on the basis of single genes.

The observed differences between the HM and ML methods were all significant and in favor of the hybrid method. For 213 of the 342 test genes the HM method clustered the samples more accurately than the ML method (i.e. $\mathit{prop}_{\mathit{HM}}=0.623$, p-value $<{10}^{-5}$). The mean difference in FARI (i.e. $\overline{\varDelta }_{\mathit{FARI}}$) was 0.020 (p-value $<{10}^{-5}$). Moreover, of the 38 genes for which there was a *considerable difference* between the methods (i.e. an absolute difference larger than 0.1), the HM clustering was superior to the ML clustering in 32 of the cases (i.e. $\mathit{propC}_{\mathit{HM}}=0.842$, p-value $<{10}^{-4}$). For notation, see Section 3.4.1.

##### Fig. 7.

Clustering results for the cancer data. The fuzzy adjusted Rand indices (FARI) observed for the hybrid and maximum likelihood methods. Data were taken from a microarray experiment on gene expression levels in two types of acute leukemia: ALL and AML. 342 genes were measured across 72 samples. Genes for which the hybrid method performed considerably better (worse) than the maximum likelihood estimator are in the upper (lower) shaded area. Here a difference was defined to be considerable if the absolute difference in FARI between the methods was larger than 0.1.
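The p-values quoted for $\mathit{prop}_{\mathit{HM}}$ and $\mathit{propC}_{\mathit{HM}}$ are consistent with an exact one-sided binomial (sign) test of "no difference between the methods"; the sketch below assumes that is the test used, which the text does not state explicitly.

```python
from math import comb

def sign_test_p(successes, n):
    """One-sided exact binomial tail P(X >= successes) for X ~ Bin(n, 1/2).

    Illustrative sketch; the paper's exact testing procedure is described
    in Section 3.3.
    """
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n

# 213 of 342 genes favored HM; 32 of 38 considerable differences favored HM
p_prop = sign_test_p(213, 342)
p_propc = sign_test_p(32, 38)  # p_propc ≈ 1.2e-05, below the 10^-4 threshold quoted above
```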

## 5 Discussion and conclusion

We consider a univariate clustering problem, arising in many applications, where the data are generated from a two-component mixture distribution and the aim is to group samples of the same type. This problem is commonly solved using the EM-algorithm under the assumption that the observations are generated by a mixture of two normal densities. Although this is a powerful method, it is also sensitive to incorrectly specified distributions. Furthermore, the assumption that the data approximately follow a normal mixture is rather restrictive, which makes the EM-approach infeasible in many applications.

The use of hybrid methods in mixture problems is, to the best of our knowledge, rather unexplored. The variant we propose can be motivated as follows: the method of moments is general in the sense that the parametric family can be left unspecified (it is enough to assume that the component densities are symmetric and have finite moments), while the minimum distance method is robust against symmetric departures from the assumed normal mixture distribution.

The results suggest that the proposed HM-estimator has a considerably better ability to cluster the samples than the ML-estimator, in particular if the assumption of a normal mixture is incorrect. This result is observed for both simulated and real data, and holds independently of the sample size. A slight advantage for the HM-estimator is also observed in the case where the Gaussian mixture assumption is valid.

We also consider estimation of the mixing proportion *p*, a problem that has attracted much interest in the literature [20]. Our results show that the HM-estimator is more robust and efficient than the ML-estimator for estimating *p* for a wide range of mixture distributions and sample sizes. This is consistent with several related studies on minimum distance inference for *p* [6, 5].

It should be noted that the HM-estimator can easily be adapted to any parametric mixture of symmetric densities, not just the normal mixture distribution. Furthermore, we can consider a less restrictive assumption that allows the component distributions to be of several types. For example, we may use the composite assumption that the data are generated by a mixture of two normal distributions, a mixture of two Laplace distributions, or a mixture of one normal and one Laplace distribution. In this case parameter estimates can be obtained via the proposed hybrid method, either by extending the distance function, or by deriving the HM-estimator for each assumed mixture distribution and taking the estimator with the best fit to the empirical data as the final estimator. Further studies are needed to show that this approach is reasonable.

A general drawback of the method of moments is that the estimating equations sometimes lack solutions, and our variant is no exception. However, this problem is usually overlooked and does not seem to be of practical importance; see [26] for a discussion. Another concern in our case is that there may be no relevant estimates close to the true parameter vector. For example, there are no solutions of the moment equations with $\hat{p}=1/2$ and $\hat{\sigma }_{1}=\hat{\sigma }_{2}$, regardless of the data values. This issue did not seem to have a major impact in our simulation study, which includes cases with $p=1/2$ and $\sigma _{1}=\sigma _{2}$, but it should be considered in future studies.

We use the FARI to evaluate the performance of the clustering methods because it has a higher resolution than the ordinary adjusted Rand index, and is therefore better at separating approaches whose clustering performance is relatively similar.
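For contrast, the ordinary (crisp) adjusted Rand index that the FARI generalizes can be computed from the contingency table of two hard partitions using the standard Hubert–Arabie formula. This sketch is a textbook implementation, not the FARI itself.

```python
from math import comb

def adjusted_rand_index(a, b):
    """Ordinary adjusted Rand index between two hard labelings of the
    same items. Standard formula; the fuzzy variant (FARI) differs."""
    labels_a, labels_b = sorted(set(a)), sorted(set(b))
    table = {(i, j): 0 for i in labels_a for j in labels_b}
    for x, y in zip(a, b):
        table[(x, y)] += 1
    n = len(a)
    sum_ij = sum(comb(v, 2) for v in table.values())
    sum_a = sum(comb(sum(table[(i, j)] for j in labels_b), 2) for i in labels_a)
    sum_b = sum(comb(sum(table[(i, j)] for i in labels_a), 2) for j in labels_b)
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Because hard partitions give the index in coarse steps, many near-ties arise when two methods cluster similarly, which is the resolution problem the FARI addresses.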

We propose to robustify the hybrid estimator by using trimmed (5% removed) versions of the sample moments. This enables high performance also in the presence of outliers, which are often encountered in real datasets and are modeled here by the Laplace and contaminated Gaussian mixtures. For some of our simulations we also applied the HM method without trimming; overall the results are usually better with the 5% trimming, but there are some exceptions (data not shown). Moreover, one could argue that the ML-estimator might perform better if some of the extreme observations were removed prior to estimation. The 5% trimming used in our simulations is merely for illustration and should not be taken as a general recommendation; how to choose the trimming level on the basis of data is a topic for future research.
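The trimming described above can be sketched as follows. Whether "5% removed" means 2.5% from each tail or 5% in total is an implementation detail not specified here; the sketch assumes symmetric trimming, and the function names are ours.

```python
def trimmed_sample(x, trim=0.05):
    """Drop the trim/2 fraction of smallest and largest observations.

    Illustrative sketch assuming symmetric trimming, not the authors'
    exact trimming rule.
    """
    xs = sorted(x)
    k = int(len(xs) * trim / 2)
    return xs[k:len(xs) - k] if k > 0 else xs

def sample_moment(x, order):
    """Central sample moment of the given order (order 1 returns the mean)."""
    m = sum(x) / len(x)
    if order == 1:
        return m
    return sum((v - m) ** order for v in x) / len(x)
```

A trimmed moment is then simply `sample_moment(trimmed_sample(data), order)`, which limits the influence of extreme observations on the moment equations.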

In most applications several variables are observed, and common practice is to base the clustering on all, or at least several, of the observed variables. For high-dimensional genomic data this type of approach has been shown to be difficult, and non-informative variables need to be removed in order to succeed [10]. It would be interesting to derive a variable selection procedure that utilizes the robustness of the hybrid approach for selecting informative variables in high-dimensional unsupervised classification problems. An interesting generalization of this work is thus to adapt it to the multivariate case where several variables are observed.

To conclude, the proposed moment-distance hybrid method has good clustering performance, is robust against incorrect model assumptions and can easily be applied to a wide range of problems.