Consistency of the total least squares estimator in the linear errors-in-variables regression

This paper deals with a homoskedastic errors-in-variables linear regression model and properties of the total least squares (TLS) estimator. We partly revise the consistency results for the TLS estimator previously obtained by the author [18] and present complete and comprehensive proofs of the consistency theorems. A theoretical foundation for the construction of the TLS estimator and its relation to the generalized eigenvalue problem is explained. In particular, the uniqueness of the estimate is proved. The Frobenius norm in the definition of the estimator can be replaced by the spectral norm, or by any other unitarily invariant norm; the consistency results remain valid.


Introduction
We consider a functional linear errors-in-variables model. Let {a_i^0, i ≥ 1} be a sequence of unobserved nonrandom n-dimensional vectors. The elements of the vectors are true explanatory variables or (in other terminology) true regressors. We observe m n-dimensional random vectors a_1, …, a_m and m d-dimensional random vectors b_1, …, b_m. They are thought of as the true vectors a_i^0 and X_0^⊤ a_i^0, respectively, plus additive errors:

a_i = a_i^0 + ã_i,   b_i = X_0^⊤ a_i^0 + b̃_i,   i = 1, …, m,   (1)

where ã_i and b̃_i are random measurement errors in the regressor and in the response. A nonrandom matrix X_0 is estimated based on the observations a_i, b_i, i = 1, …, m. This problem is related to finding an approximate solution to the incompatible linear equations AX ≈ B (an "overdetermined" system, because the number of equations exceeds the number of variables), where A = [a_1, …, a_m]^⊤ is an m × n matrix and B = [b_1, …, b_m]^⊤ is an m × d matrix. Here X is an unknown n × d matrix.

Preprint submitted to VTeX / Modern Stochastics: Theory and Applications, October 24, 2018. www.vmsta.org
In the linear errors-in-variables regression model (1), the total least squares (TLS) estimator is widely used. It is a multivariate equivalent of the orthogonal regression estimator. We are looking for conditions that provide consistency or strong consistency of the estimator. It is assumed that the measurement errors c̃_i = (ã_i^⊤, b̃_i^⊤)^⊤, i = 1, 2, …, are independent and have the same covariance matrix Σ, which may be singular. In particular, some of the regressors may be observed without errors. (If the matrix Σ is nonsingular, the proofs can be simplified.) An intercept can be introduced into (1) by augmenting the model with a constant error-free regressor.
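For intuition, in the basic special case Σ = σ²I (all errors uncorrelated with equal variance), the classical TLS estimate can be read off the SVD of the compound data matrix [A, B]. The following sketch with numpy and simulated data (all names, dimensions, and noise levels are illustrative, not from the paper) recovers X_0 from noisy observations:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 500, 3, 2
X0 = rng.standard_normal((n, d))               # true parameter matrix
A0 = rng.standard_normal((m, n))               # true regressors
B0 = A0 @ X0                                   # true responses: B0 = A0 X0
sigma = 0.1
A = A0 + sigma * rng.standard_normal((m, n))   # noisy regressors
B = B0 + sigma * rng.standard_normal((m, d))   # noisy responses

# Classical TLS solution via the SVD of the compound matrix C = [A, B]:
# the d trailing right singular vectors span the estimated subspace
C = np.hstack([A, B])
_, _, Vt = np.linalg.svd(C, full_matrices=False)
V = Vt.T
V12 = V[:n, n:]                 # top block of the d trailing singular vectors
V22 = V[n:, n:]                 # bottom d x d block; must be nonsingular
X_tls = -V12 @ np.linalg.inv(V22)
print(np.linalg.norm(X_tls - X0))   # small for large m
```

The normalization by the bottom block mirrors the reduction of X_ext to the form with −I in the lower block described later in the paper.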
Sufficient conditions for consistency of the estimator are presented in Gleser [5], Gallo [4], and Kukush and Van Huffel [10]. In [18], the consistency results are obtained under less restrictive conditions than in [10]; in particular, a restrictive growth condition on the matrix A_0^⊤ A_0 is dropped, where A_0 = [a_1^0, …, a_m^0]^⊤ is the matrix A without measurement errors. Hereafter, λ_min and λ_max denote the minimum and maximum eigenvalues of a matrix whose eigenvalues are all real. The matrix A_0^⊤ A_0 is symmetric (and positive semidefinite); hence, its eigenvalues are real (and nonnegative).
The model where some variables are explanatory and the others are responses is called explicit. The alternative is the implicit model, where all the variables are treated equally. In the implicit model, an n-dimensional linear subspace of R^{n+d} is fitted to an observed set of points. Some n-dimensional subspaces can be represented in the form {(a, b) ∈ R^{n+d} : b = X^⊤ a} for some n × d matrix X; such subspaces are called generic. The other subspaces are called non-generic. The true points lie on a generic subspace {(a, b) : b = X_0^⊤ a}. A consistently estimated subspace must be generic with high probability. We state our results for the explicit model, but use the ideas of the implicit model in the definition of the estimator, as well as in the proofs.
We allow errors in different variables to correlate. Our problem is a minor generalization of the mixed LS-TLS problem, which is studied in [20, Section 3.5]. In the latter problem, some explanatory variables are observed without errors; the other explanatory variables and all the response variables are observed with errors. The errors have the same variance and are uncorrelated. The basic LS model (where the explanatory variables are error-free and the response variables are error-ridden) and the basic TLS model (where all the variables are observed with error and the errors are uncorrelated) are marginal cases of the mixed LS-TLS problem. By a linear transformation of variables, our model can be transformed into a mixed LS-TLS, basic LS, or basic TLS problem. (We do not handle the case where there are more error-free variables than explanatory variables.) Such a transformation does not always preserve the sets of generic and non-generic subspaces. The mixed LS-TLS problem can be transformed into the basic TLS problem, as shown in [6].
The Weighted TLS and Structured TLS estimators are generalizations of the TLS estimator to the cases where the error covariance matrices differ across observations or where the errors of different observations are dependent; more precisely, the independence condition is replaced with a condition on the "structure of the errors". The consistency of these estimators is proved in Kukush and Van Huffel [10] and Kukush et al. [9]. Relaxing the conditions for consistency of the Weighted TLS and Structured TLS estimators is an interesting topic for future research. For generalizations of the TLS problem, see the monograph [13] and the review [12].
In the present paper, for a multivariate regression model with multiple response variables, we consider two versions of the TLS estimator, which minimize different norms of the weighted residual matrix. (The two estimators coincide for the univariate regression model.) The common way to construct the estimator is to minimize the Frobenius norm. The estimator that minimizes the Frobenius norm also minimizes the spectral norm. Any estimator that minimizes the spectral norm is consistent under the conditions of our consistency theorems (see Theorems 3.5-3.7 in Section 3). We also provide a sufficient condition for uniqueness of the estimator that minimizes the Frobenius norm.
In this paper, for the results on consistency of the TLS estimator stated in [18], we provide complete and comprehensive proofs and present all necessary auxiliary and complementary results. For the convenience of the reader, we first present a sketch of the proof; detailed proofs are postponed to the appendix. Moreover, the paper contains new results on the relation between the TLS estimator and the generalized eigenvalue problem.
The structure of the paper is as follows. In Section 2 we introduce the model and define the TLS estimator. The consistency theorems, for different moment conditions on the errors and different senses of consistency, are stated in Section 3, and their proofs are sketched in Section 5. Section 4 states the existence and uniqueness of the TLS estimator. Auxiliary theoretical constructions and theorems are presented in Section 6. Section 7 explains the relationship between the TLS estimator and the generalized eigenvalue problem; the results of Section 7 are used in the construction of the TLS estimator and in the proof of its uniqueness. Detailed proofs are moved to the appendix (Section 8).

Notations
First, we list the general notation. For a vector v = (x_k)_{k=1}^n, ‖v‖ = (∑_{k=1}^n x_k^2)^{1/2} is its Euclidean norm. For linear subspaces V_1 and V_2 of R^n of equal dimension dim V_1 = dim V_2, sin ∠(V_1, V_2) = ‖P_{V_1} − P_{V_2}‖ = ‖P_{V_1}(I − P_{V_2})‖ is the greatest sine of the canonical angles between V_1 and V_2. See Section 6.2 for more general definitions.
Now we list the model-specific notations. The notations (except for the matrix Σ) come from [9]. They are listed here only for reference; they are introduced elsewhere in this paper, in Sections 1 and 2.
n is the number of regressors, i.e., the number of explanatory variables for each observation; d is the number of response variables for each observation; m is the number of observations, i.e., the sample size.
C_0 = [A_0, B_0] is an m × (n + d) nonrandom matrix. The left-hand block A_0 of size m × n consists of the true explanatory variables, and the right-hand block B_0 of size m × d consists of the true response variables.

C̃ = [Ã, B̃] is the matrix of errors. It is an m × (n + d) random matrix.

C = [A, B] = C_0 + C̃ is the matrix of observations. It is an m × (n + d) random matrix.
Σ is the covariance matrix of the errors in one observation: for every i, it is assumed that E c̃_i = 0 and E c̃_i c̃_i^⊤ = Σ. The matrix Σ is symmetric, positive semidefinite, nonrandom, and of size (n + d) × (n + d). It is assumed known when we construct the TLS estimator.
X 0 is the matrix of true regression parameters. It is a nonrandom n × d matrix and is a parameter of interest.
X̂ is the TLS estimator of the matrix X_0.
X̂_ext is a matrix whose column space span X̂_ext is considered an estimator of the subspace span X_ext^0. The matrix X̂_ext is of size (n + d) × d. For fixed m and Σ, X̂_ext is a Borel measurable function of the matrix C.
While in the consistency theorems m tends to ∞, all matrices in this list except Σ, X_0 and X_ext^0 silently depend on m. For example, in the expressions "lim_{m→∞} λ_min(A_0^⊤ A_0) = +∞" and "X̂ → X_0 almost surely", the matrices A_0 and X̂ depend on m.

Statistical model
It is assumed that the matrices A_0 and B_0 satisfy the relation

B_0 = A_0 X_0.

They are observed with measurement errors Ã and B̃, that is,

A = A_0 + Ã,   B = B_0 + B̃.

The matrix X_0 is a parameter of interest. Rewrite the relation in an implicit form. Let the m × (n + d) block matrices C_0, C̃, C ∈ R^{m×(n+d)} be constructed by binding the respective versions of the matrices A and B:

C_0 = [A_0, B_0],   C̃ = [Ã, B̃],   C = [A, B] = C_0 + C̃.

The entries of the matrix C̃ are denoted δ_ij; the rows are c̃_i. Throughout the paper the following three conditions are assumed to be true: the rows c̃_i of the matrix C̃ are mutually independent random vectors, E c̃_i = 0, and E c̃_i c̃_i^⊤ = Σ.
This example is taken from [1, Section 1.1], but the notation in Example 2.1 and elsewhere in the paper is different: our notation uses a^0, and X_0 = (β_0, β_1)^⊤. Remark 2.1. For some matrices Σ, condition (6) is satisfied for any n × d matrix X_0. If the matrix Σ is nonsingular, then condition (6) is satisfied. If the errors in the explanatory variables and in the response are uncorrelated, i.e., if the matrix Σ has a block-diagonal form with a nonsingular block Σ_bb, then condition (6) is satisfied. For example, in the basic mixed LS-TLS problem Σ is diagonal, Σ_bb is nonsingular, and so (6) holds true. If the null-space of the matrix Σ (which equals (span Σ)^⊥ because Σ is symmetric) lies inside the subspace spanned by the first n (of n + d) standard basis vectors, then condition (6) is also satisfied. On the other hand, if rk Σ < d, then condition (6) is not satisfied.
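The cases listed in Remark 2.1 can be illustrated numerically. Here we read condition (6) as the requirement that the matrix Σ X_ext^0, with X_ext^0 = (X_0^⊤, −I_d)^⊤, have full column rank d; this reading is an assumption of ours, chosen because it is consistent with every case above (the helper name `condition6_holds` is illustrative, not from the paper):

```python
import numpy as np

def condition6_holds(Sigma, X0):
    """Assumed reading of condition (6): rk(Sigma @ X_ext0) == d,
    where X_ext0 = [X0; -I_d]."""
    n, d = X0.shape
    X_ext0 = np.vstack([X0, -np.eye(d)])
    return np.linalg.matrix_rank(Sigma @ X_ext0) == d

n, d = 3, 2
X0 = np.arange(1.0, 1.0 + n * d).reshape(n, d)

# nonsingular Sigma: condition (6) holds for any X0
assert condition6_holds(np.eye(n + d), X0)

# block-diagonal Sigma with nonsingular Sigma_bb (error-free regressors)
Sigma = np.diag([0.0, 0.0, 0.0, 1.0, 1.0])
assert condition6_holds(Sigma, X0)

# rk(Sigma) < d: condition (6) must fail
Sigma_low = np.zeros((n + d, n + d)); Sigma_low[0, 0] = 1.0
assert not condition6_holds(Sigma_low, X0)
```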

Total least squares (TLS) estimator
First, find the m × (n + d) matrix ∆ at which the constrained minimum

min ‖∆ (Σ^†)^{1/2}‖_F   subject to rk(C − ∆) ≤ n and ∆ (I − P_Σ) = 0   (7)

is attained. Hereafter Σ^† is the Moore-Penrose pseudoinverse of the matrix Σ, and P_Σ = ΣΣ^† is the orthogonal projector onto the column space of Σ. Now, show that the minimum in (7) is attained. The constraint rk(C − ∆) ≤ n is satisfied if and only if all the minors of C − ∆ of order n + 1 vanish. Thus the set of all ∆ that satisfy the constraints (the constraint set) is defined by binom(m, n+1) · binom(n+d, n+1) + 1 algebraic equations, and so it is closed. The constraint set is nonempty almost surely because it contains C. The functional ‖∆ (Σ^†)^{1/2}‖_F is a pseudonorm on R^{m×(n+d)}, but it is a norm on the linear subspace {∆ : ∆ (I − P_Σ) = 0}, where it induces the natural subspace topology. The constraint set is closed in this subspace (with this norm), and whenever it is nonempty (i.e., almost surely), it has a minimal-norm element.
Notice that under condition (6) the constraint set is always nonempty, not just almost surely. This follows from Proposition 7.9.
For the matrix ∆ that is a solution to the minimization problem (7), consider the row space span (C − ∆)^⊤ of the matrix C − ∆. Its dimension does not exceed n. An orthogonal basis of it can be completed to an orthogonal basis of R^{n+d}, and the complement consists of n + d − rk(C − ∆) ≥ d vectors. Choose d linearly independent vectors from the complement and bind them (as column vectors) into an (n + d) × d matrix X̂_ext. The matrix X̂_ext satisfies the equation

(C − ∆) X̂_ext = 0.   (8)

If the lower d × d block of the matrix X̂_ext is nonsingular, then by a linear transformation of columns (i.e., by right-multiplying by some nonsingular matrix) the matrix X̂_ext can be transformed to the form (X̂^⊤, −I)^⊤, and the matrix X̂ satisfies the equation

(C − ∆) (X̂^⊤, −I)^⊤ = 0.   (9)

(Otherwise, if the lower block of the matrix X̂_ext is singular, our estimation fails. Note that whether the lower block of the matrix X̂_ext is singular may depend not only on the observations C, but also on the choice of the matrix ∆ at which the minimum in (7) is attained and of the d vectors that make up the matrix X̂_ext. We will show that the lower block of the matrix X̂_ext is nonsingular with high probability regardless of the choice of ∆ and X̂_ext.) The columns of the matrix X̂_ext should span the eigenspace (generalized invariant subspace) of the matrix pencil ⟨C^⊤ C, Σ⟩ that corresponds to the d smallest generalized eigenvalues. That the columns of the matrix X̂_ext span the generalized invariant subspace corresponding to finite generalized eigenvalues is written in matrix notation as follows: C^⊤ C X̂_ext = Σ X̂_ext D for some d × d matrix D. Possible problems that may arise in the course of solving the minimization problem (7) are discussed in [18]. We should mention that our two-step definition (7) & (9) of the TLS estimator is slightly different from the conventional definition in [20, Sections 2.3.2 and 3.2] or in [10]. In those papers, the problem from which the estimator X̂ is found is equivalent to the following:

minimize ‖∆ (Σ^†)^{1/2}‖_F over the pairs (∆, X) subject to (C − ∆)(X^⊤, −I)^⊤ = 0 and ∆ (I − P_Σ) = 0,   (10)

where the optimization is performed over ∆ and X that satisfy the constraints in (10).
If our estimation defined by (7) and (9) succeeds, then the minimum values in (7) and (10) coincide, and the minimum in (10) is attained at the pair (∆, X̂) that solves (7) & (9). Conversely, if our estimation succeeds for at least one choice of ∆ and X̂_ext, then all the solutions to (10) can be obtained with different choices of ∆ and X̂_ext. However, strange things may happen if our estimation always fails. Besides (7), consider the optimization problem

min ‖∆ (Σ^†)^{1/2}‖_2   subject to the constraints of (7),   (11)

where ‖·‖_2 is the spectral norm. It will be shown that every ∆ that minimizes (7) also minimizes (11). We can construct an optimization problem that generalizes both (7) and (11). Let ‖·‖_U be a unitarily invariant norm on m × (n + d) matrices. Consider the optimization problem

min ‖∆ (Σ^†)^{1/2}‖_U   subject to the constraints of (7).   (12)

Then every ∆ that minimizes (7) also minimizes (12), and every ∆ that minimizes (12) also minimizes (11). If ‖·‖_U is the Frobenius norm, then the optimization problems (7) and (12) coincide, and if ‖·‖_U is the spectral norm, then the optimization problems (11) and (12) coincide. Remark 2.2. A solution to problem (7) or (11) does not change if the matrix Σ is multiplied by a positive scalar factor. Thus, instead of assuming that the matrix Σ is known completely, we can assume that Σ is known up to a scalar factor.
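The relation to the generalized eigenvalue problem suggests a direct computation when Σ is nonsingular: take the eigenvectors of the pencil ⟨C^⊤C, Σ⟩ for the d smallest generalized eigenvalues and normalize the lower block to −I. A sketch with scipy on simulated data (dimensions, seed, and noise covariance are illustrative assumptions):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
m, n, d = 400, 3, 2
X0 = rng.standard_normal((n, d))
A0 = rng.standard_normal((m, n))
C0 = np.hstack([A0, A0 @ X0])                 # true compound matrix, C0 X_ext0 = 0

# nonsingular error covariance, so the pencil <C^T C, Sigma> is definite
L = 0.05 * rng.standard_normal((n + d, n + d))
Sigma = L @ L.T + 0.01 * np.eye(n + d)
C = C0 + rng.multivariate_normal(np.zeros(n + d), Sigma, size=m)

# eigenvectors for the d smallest generalized eigenvalues of <C^T C, Sigma>
# span the estimated subspace span(X_ext)
w, V = eigh(C.T @ C, Sigma)                   # eigenvalues in ascending order
X_ext = V[:, :d]
X_hat = -X_ext[:n] @ np.linalg.inv(X_ext[n:]) # normalize the lower block to -I
print(np.linalg.norm(X_hat - X0))
```

The lower-block normalization is exactly the column transformation that turns X_ext into the form with −I described above.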

Known consistency results
In this section we briefly review known consistency results. One of the conditions for consistency of the TLS estimator is the convergence of (1/m) A_0^⊤ A_0 to a nonsingular matrix. It is required, for example, in [5]. The condition is relaxed in the paper by Gallo [4].
and the measurement errors c̃_i are identically distributed with finite fourth moments. The theorem can be generalized to multivariate regression. The condition that the errors in different observations have the same distribution can be dropped. Instead, Kukush and Van Huffel [10] assume that the fourth moments of the error distributions are bounded.
Here is the strong consistency theorem, whose conclusion is that X̂ → X_0 as m → ∞, almost surely.
In the following consistency theorem the moment condition imposed on the errors is relaxed.
In the next theorem strong consistency is obtained for r < 2.
Then X̂ → X_0 as m → ∞, almost surely.
The key point of the proof is the application of our own theorems on perturbation bounds for generalized eigenvectors (Lemmas 6.5 and 6.6; see also [18]). The conditions were relaxed by a renormalization of the data.
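The consistency discussed in this section can also be observed numerically. For the basic TLS case (Σ proportional to I, computed via the SVD as in the introduction), the estimation error shrinks as the sample size m grows; the simulation below is an illustrative sketch, with all parameters chosen by us:

```python
import numpy as np

def tls(A, B):
    """Basic TLS (errors i.i.d., Sigma proportional to I) via the SVD."""
    n, d = A.shape[1], B.shape[1]
    _, _, Vt = np.linalg.svd(np.hstack([A, B]), full_matrices=False)
    V = Vt.T
    return -V[:n, n:] @ np.linalg.inv(V[n:, n:])

rng = np.random.default_rng(2)
n, d, sigma = 2, 1, 0.5
X0 = np.array([[1.0], [-2.0]])
errs = []
for m in (100, 1000, 10000):
    A0 = rng.standard_normal((m, n))
    A = A0 + sigma * rng.standard_normal((m, n))   # errors in the regressors
    B = A0 @ X0 + sigma * rng.standard_normal((m, d))  # errors in the response
    errs.append(np.linalg.norm(tls(A, B) - X0))
print(errs)   # errors shrink as the sample size grows
```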

Existence and uniqueness of the estimator
When we speak of a sequence {A_m, m ≥ 1} of random events parametrized by the sample size m, we say that a random event occurs with high probability if the probability of the event tends to 1 as m → ∞, and we say that a random event occurs eventually if almost surely there exists m_0 such that the random event occurs whenever m > m_0, that is, P(lim inf_{m→∞} A_m) = 1.

Theorem 4.1.

1. The constrained minimum in (7) is attained. If ∆ satisfies the constraints in (7) (in particular, if the matrix ∆ is a solution to the optimization problem (7)), then the linear equation (8) has a solution X̂_ext that is a full-rank matrix.

2. The optimization problem (7) has a unique solution ∆.

3. For any ∆ that is a solution to (7), equation (9) (which is a linear equation in X) has a unique solution.
Theorem 4.2.

1. The constrained minimum in (11) is attained. If ∆ satisfies the constraints in (11), then the linear equation (8) has a solution X̂_ext that is a full-rank matrix.

2. Under the conditions of Theorem 3.5, the following random event occurs with high probability: for any ∆ that is a solution to (11), equation (9) has a solution X̂. (Equation (9) might have multiple solutions.) The solution is a consistent estimator of X_0, i.e., X̂ → X_0 in probability.

3. Under the conditions of Theorem 3.6 or 3.7, the following random event occurs eventually: for any ∆ that is a solution to (11), equation (9) has a solution X̂. The solution is a strongly consistent estimator of X_0, i.e., X̂ → X_0 almost surely.
Remark 4.2-1. Theorem 4.2 can be generalized in the following way: all references to (11) can be changed into references to (12). Thus, if the Frobenius norm in the definition of the estimator is replaced with any unitarily invariant norm, the consistency results remain valid.
Sketch of the proof of Theorems 3.5-3.7

Denote by N the normalizing matrix used throughout the proofs. Under the conditions of any of the consistency theorems in Section 3, λ_min(A_0^⊤ A_0) → ∞; hence the matrix N is nonsingular for m large enough. The matrix N is used as the denominator in the law of large numbers. It is also used for rescaling the problem: the condition number of N^{−1/2} C_0^⊤ C_0 N^{−1/2} is at most 2.
The proofs of the consistency theorems differ from one another, but they have the same structure and common parts. First, the law of large numbers holds either in probability or almost surely, depending on the theorem being proved. The proof varies for different theorems.
The inequalities (54) and (57) imply that whenever convergence (13) occurs, the sine of the angle between the vectors X̂_ext and X_ext^0 (in the univariate regression), or the largest of the sines of the canonical angles between the column spans of the matrices X̂_ext and X_ext^0 (in the multivariate regression), tends to 0 as the sample size m increases:

sin ∠(X̂_ext, X_ext^0) → 0.   (14)

To prove (14), we use some algebra, the fact that X_ext^0 (in the univariate model) or the columns of X_ext^0 (in the multivariate model) are the minimum-eigenvalue eigenvectors of the matrix N (see inequality (52)), and the eigenvector perturbation theorems, Lemma 6.5 or Lemma 6.6.
Then, by Theorem 8.3, we conclude that the lower d × d block of X̂_ext is nonsingular, whence X̂ → X_0 (in probability or almost surely, depending on the theorem).

Relevant classical results
We use some classical results; however, we state them in a form convenient for our study and provide proofs for some of them.

Generalized eigenvectors and eigenvalues
In this paper we deal with real matrices. Most theorems in this section can be generalized to matrices with complex entries by requiring that the matrices be Hermitian rather than symmetric, and by inserting complex conjugation where necessary.
Theorem 6.1 (Simultaneous diagonalization of a definite matrix pair). Let A and B be n × n symmetric matrices such that for some α and β the matrix αA + βB is positive definite. Then there exist a nonsingular matrix T and diagonal matrices Λ and M such that

T^⊤ A T = Λ,   T^⊤ B T = M.

If in this decomposition T = [u_1, u_2, …, u_n], Λ = diag(λ_1, …, λ_n), and M = diag(µ_1, …, µ_n), then the numbers λ_i/µ_i ∈ R ∪ {∞} are called generalized eigenvalues, and the columns u_i of the matrix T are called the right generalized eigenvectors of the matrix pencil ⟨A, B⟩, because the following relation holds true:

µ_i A u_i = λ_i B u_i.

In Theorem 6.1, λ_i and µ_i cannot both be equal to 0 for the same i, while in Theorem 6.2 they can. On the other hand, in Theorem 6.1 λ_i and µ_i can be any real numbers, while in Theorem 6.2 λ_i ≥ 0 and µ_i ≥ 0. Theorem 6.2 is proved in [15]. Remark 6.2-1. If the matrices A and B are symmetric and positive semidefinite, then

rk ⟨A, B⟩ = rk(A + B),   (17)

where rk ⟨A, B⟩ is the determinantal rank of the matrix pencil ⟨A, B⟩. (For square n × n matrices A and B, the determinantal rank characterizes whether the matrix pencil is regular or singular: the matrix pencil ⟨A, B⟩ is regular if rk ⟨A, B⟩ = n, and singular if rk ⟨A, B⟩ < n.) The inequality rk ⟨A, B⟩ ≥ rk(A + B) follows from the definition of the determinantal rank. For all k ∈ R and for all vectors x such that (A + B)x = 0, we have x^⊤ A x + x^⊤ B x = 0, and because of the positive semidefiniteness of the matrices A and B, x^⊤ A x ≥ 0 and x^⊤ B x ≥ 0. Thus, x^⊤ A x = x^⊤ B x = 0. Again, due to the positive semidefiniteness of A and B, Ax = Bx = 0 and (A + kB)x = 0. Thus, rk(A + kB) ≤ rk(A + B) for all k ∈ R, and (17) is proved. Remark 6.2-2. Let A and B be positive semidefinite matrices of the same size such that rk(A + B) = rk(B). The representation (16) might not be unique, but there exists a representation (16) such that T_1^⊤ T_2 = 0. (Here, if the matrix B is nonsingular, then T_1 is an n × 0 empty matrix; if B = 0, then T_2 is an n × 0 matrix. In these marginal cases, T_1^⊤ T_2 is an empty matrix and is considered to be a zero matrix.)
The desired representation can be obtained from [2] for S = 0 (in de Leeuw's notation). This representation is constructed as follows. Let the columns of the matrix T_1 form an orthonormal basis of Ker(B) = {v : Bv = 0}. There exists an n × rk(B) matrix F such that B = F F^⊤. Let the columns of the matrix L be the orthonormal eigenvectors of the matrix F^† A (F^†)^⊤. Then set T_2 = (F^†)^⊤ L. Note that the notation S, F and L is borrowed from [2] and is used only here; elsewhere in the paper, the matrix F has a different meaning.
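For a pair where B itself is positive definite (the case α = 0, β = 1 of Theorem 6.1), the simultaneous diagonalization can be computed with a standard generalized symmetric eigensolver; `scipy.linalg.eigh` returns T with T^⊤ B T = I (so M = I) and T^⊤ A T diagonal. A small numerical check on random matrices (sizes and seed are illustrative):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n)); A = A + A.T              # symmetric
Bh = rng.standard_normal((n, n)); B = Bh @ Bh.T + np.eye(n)  # positive definite

# eigh solves A u = w B u; eigenvectors are B-orthonormal,
# giving Lambda = diag(w) and M = I in the notation of Theorem 6.1
w, T = eigh(A, B)
assert np.allclose(T.T @ B @ T, np.eye(n), atol=1e-8)
assert np.allclose(T.T @ A @ T, np.diag(w), atol=1e-8)

# generalized eigenvector relation: mu_i A u_i = lambda_i B u_i (mu_i = 1 here)
for i in range(n):
    assert np.allclose(A @ T[:, i], w[i] * (B @ T[:, i]), atol=1e-8)
```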
Proof. Let us verify the Moore-Penrose conditions: and the fact that the matrices ( (18) and (19) can be verified directly; and the symmetry properties can be reduced to the equality ). Since

Angle between two linear subspaces
Then there exists an orthogonal n × n matrix U such that the bases of V_1 and V_2 take the canonical form (21)-(22). Here rectangular diagonal matrices are allowed. If in (21) there are more cosines than sines (i.e., if k_2 + k_1 > n), then the excessive cosines should be equal to 1, so that the columns of the bidiagonal matrix in (21) are unit vectors (which are orthogonal to each other). Here the columns of U are the vectors of some convenient "new" basis in R^n, so U is a transition matrix from the standard basis to the "new" basis; the columns of the matrix products under span(···) in (21) and (22) are the vectors of the bases of the subspaces V_1 and V_2; the bidiagonal matrix in (21) and the diagonal matrix in (22) are the transition matrices from the "new" basis in R^n to the bases of V_1 and V_2, respectively. The angles θ_k are called the canonical angles between V_1 and V_2. They can be selected so that 0 ≤ θ_k ≤ π/2 (to achieve this, we might have to reverse some vectors of the bases).
Denote by P_{V_1} the matrix of the orthogonal projector onto V_1. The singular values of the matrix P_{V_1}(I − P_{V_2}) are equal to sin θ_k (k = 1, …, k_1); besides them, there is the singular value 0 of multiplicity n − k_1.
Denote by sin ∠(V_1, V_2) the greatest of the sines of the canonical angles:

sin ∠(V_1, V_2) = ‖P_{V_1}(I − P_{V_2})‖.

This can be generalized to dim V_1 ≥ 1. (If dim V_1 < dim V_2, then ‖P_{V_1}(I − P_{V_2})‖ may or may not be equal to 1, but always ‖P_{V_2}(I − P_{V_1})‖ = 1; see the proof of Lemma 8.2 in the appendix.)
We will often omit "span" in the arguments of the sine. Thus, for n-row matrices X_1 and X_2, sin ∠(X_1, X_2) = sin ∠(span X_1, span X_2). Due to equation (23), sin ∠(V_12, V_2) = 1; thus, the subspace V_12 has the desired properties.
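The canonical angles and the quantity sin ∠(V_1, V_2) = ‖P_{V_1}(I − P_{V_2})‖ can be checked numerically; the sketch below (dimensions and perturbation size are illustrative) compares the projector-based formula with `scipy.linalg.subspace_angles` for two subspaces of equal dimension, for which ‖P_{V_1} − P_{V_2}‖ gives the same value:

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(4)
n, k = 6, 2
X1 = rng.standard_normal((n, k))
X2 = X1 + 0.1 * rng.standard_normal((n, k))   # slightly perturbed basis

def proj(X):
    """Orthogonal projector onto span(X)."""
    Q, _ = np.linalg.qr(X)
    return Q @ Q.T

# sine of the largest canonical angle as the spectral norm ||P1 (I - P2)||
P1, P2 = proj(X1), proj(X2)
sin_max = np.linalg.norm(P1 @ (np.eye(n) - P2), 2)

# compare with the canonical angles computed by scipy (descending order)
theta = subspace_angles(X1, X2)
assert np.isclose(sin_max, np.sin(theta).max(), atol=1e-8)
# equal-dimension case: ||P1 - P2|| gives the same value
assert np.isclose(np.linalg.norm(P1 - P2, 2), sin_max, atol=1e-8)
```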

Perturbation of eigenvectors and invariant spaces
be attained at the point x*. Remark 6.5-1. The function f(x) may or may not attain its minimum. We now state the multivariate generalization of Lemma 6.5. We will not generalize Remark 6.5-1; instead, we will check that the minimum is attained when we use Lemma 6.6 (see Proposition 7.10).
attain its minimum. Then for any point X where the minimum is attained,

Rosenthal inequality
In the following theorems, a random variable ξ is called centered if E ξ = 0.
Theorem 6.7. Let ν ≥ 2 be a nonrandom real number. Then there exist α ≥ 0 and β ≥ 0 such that for any set of centered mutually independent random variables {ξ_i, i = 1, …, m}, m ≥ 1, the following inequality holds true:

E |∑_{i=1}^m ξ_i|^ν ≤ α ∑_{i=1}^m E |ξ_i|^ν + β (∑_{i=1}^m E ξ_i^2)^{ν/2}.

Here the first inequality is due to Marcinkiewicz and Zygmund [11, Theorem 13].
The second inequality follows from the fact that for ν ≤ 2 and nonnegative x_i, (∑_i x_i)^{ν/2} ≤ ∑_i x_i^{ν/2}.

Generalized eigenvalue problem for positive semidefinite matrices

In this section we explain the relationship between the TLS estimator and the generalized eigenvalue problem. The results of this section are important for constructing the TLS estimator. Proposition 7.9 is used to establish the uniqueness of the TLS estimator.
Lemma 7.1. Let A and B be n × n symmetric positive semidefinite matrices with a simultaneous diagonalization (see Theorem 6.2 for its existence). For i = 1, …, n denote ν_i = λ_i/µ_i, and assume that ν_1 ≤ ν_2 ≤ · · · ≤ ν_n. Then ν_i admits the variational characterization (27): ν_i is the smallest number λ ≥ 0 such that there exists an i-dimensional subspace V ⊂ R^n on which the quadratic form x ↦ x^⊤(A − λB)x is negative semidefinite. The minimum in (27) is attained for V being the linear span of the first i columns of the matrix T (i.e., the linear span of the eigenvectors of the matrix pencil ⟨A, B⟩ that correspond to the i smallest generalized eigenvalues).

In the next proposition, the matrix X is assumed to be of full rank; span M denotes the column space of a matrix M.

2. Let the constraints in (28) be compatible. Then the least element of the partially ordered set (in the Loewner order) {∆ Σ^† ∆^⊤ : ∆(I − P_Σ) = 0 and (C − ∆)X = 0} is attained for ∆ = C X (X^⊤ Σ X)^† X^⊤ Σ and is equal to C X (X^⊤ Σ X)^† X^⊤ C^⊤. This means the following:

2b. For any ∆ which satisfies the constraints ∆(I − P_Σ) = 0 and (C − ∆)X = 0, the inequality ∆ Σ^† ∆^⊤ ≥ C X (X^⊤ Σ X)^† X^⊤ C^⊤ holds in the Loewner order.

Remark 7.2-1. If the constraints are compatible, the least element (and the unique minimum) is attained at a single point, namely ∆ = C X (X^⊤ Σ X)^† X^⊤ Σ. In the left-hand side of (34), the minima are attained for this same ∆ for all k (the k sets where the minima are attained have a nonempty intersection; we will show that the intersection consists of a single element). One can choose a stack of subspaces such that V_k is the element where the minimum in the right-hand side of (34) is attained for all k = 1, …, d.

In Propositions 7.5 to 7.9, we will use the notation from the simultaneous diagonalization of the matrices C^⊤ C and Σ, where Λ = diag(λ_1, …, λ_{n+d}) and M = diag(µ_1, …, µ_{n+d}). If Remark 6.2-2 is applicable, let the simultaneous diagonalization be constructed accordingly. For k = 1, …, n + d denote ν_k = λ_k/µ_k, and let the ν_k be arranged in ascending order.
Corollary. In the minimization problem (11), the constrained minimum is equal to √ν_d. Proposition 7.6. In the minimization problem (7), the constrained minimum is equal to (ν_1 + · · · + ν_d)^{1/2}. Whenever the minimum in (7) is attained for some matrix ∆, the minimum in (11) is attained for the same ∆.
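In the special case Σ = I (so Σ^† = I and P_Σ = I), these minimum values can be checked against the Eckart-Young theorem: the generalized eigenvalues of ⟨C^⊤C, I⟩ are the squared singular values of C, and the optimal ∆ consists of the d trailing singular components. A numerical sketch (sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, d = 20, 3, 2
C = rng.standard_normal((m, n + d))

# Sigma = I: generalized eigenvalues of <C^T C, I> are the squared
# singular values of C, here sorted ascending (nu_1 <= ... <= nu_{n+d})
nu = np.sort(np.linalg.eigvalsh(C.T @ C))

# Eckart-Young: the optimal Delta is the sum of the d trailing
# singular components, so that C - Delta is the best rank-n approximation
U, s, Vt = np.linalg.svd(C, full_matrices=False)
Delta = (U[:, n:] * s[n:]) @ Vt[n:]
assert np.linalg.matrix_rank(C - Delta) <= n

# Frobenius-norm minimum (7): sqrt(nu_1 + ... + nu_d)
assert np.isclose(np.linalg.norm(Delta, 'fro'), np.sqrt(nu[:d].sum()), atol=1e-6)
# spectral-norm minimum (11): sqrt(nu_d)
assert np.isclose(np.linalg.norm(Delta, 2), np.sqrt(nu[d - 1]), atol=1e-6)
```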
Let M_1 and M_2 be m × n matrices. Then

Proposition 7.8. Consider the optimization problem (12) with an arbitrary unitarily invariant norm ‖·‖_U. Then:

1. Any minimizer ∆ of the optimization problem (7) also minimizes (12).
Proposition 7.9. For any ∆ at which the minimum in (7) is attained, the corresponding solution X̂_ext of the linear equations (8) satisfies conditions (37). Conversely, if ν_d < +∞ and the matrix X̂_ext satisfies conditions (37), then there exists a common solution ∆ to the minimization problem (7) and the linear equations (8).
Proposition 7.10. Let ⟨C^⊤ C, Σ⟩ be a definite matrix pencil. Then for any ∆ at which the minimum in (11) is attained, the corresponding solution X̂_ext of the linear equations (8) is a point where the minimum of (38) is attained. It is also a point where the minimum of (39) is attained.
If the conditions of Theorem 3.5, 3.6 or 3.7 hold true, then λ_min(A_0^⊤ A_0) → ∞, and thus the matrix A_0^⊤ A_0 is nonsingular for m large enough. In what follows, assume that Σ is a singular but nonzero matrix. Let F = (F_1^⊤, F_2^⊤)^⊤ be an (n + d) × (n + d − rk Σ) matrix whose columns form a basis of the null-space Ker(Σ) = {x : Σx = 0} of the matrix Σ.
2. Now we prove that the columns of the matrix [I_n, X_0] F are linearly independent. Assume the contrary: for some v ∈ R^{n+d−rk Σ} \ {0}, [I_n, X_0] F v = 0. Furthermore, F v ≠ 0 because v ≠ 0 and the columns of F are linearly independent. Hence, by (41), F_2 v ≠ 0.
Equality (42) implies that the columns of the matrix Σ X_ext^0 are linearly dependent, which contradicts condition (6). The contradiction means that the columns of the matrix [I_n, X_0] F are linearly independent.

5.
It remains to prove the implication. The matrices C^⊤ C and Σ are positive semidefinite. Suppose that x^⊤(C^⊤ C + Σ)x = 0; we prove that x = 0. Since x^⊤(C^⊤ C + Σ)x = 0, we have Cx = 0 and Σx = 0. The vector x belongs to the null-space of the matrix Σ; therefore, x = F v for some vector v ∈ R^{n+d−rk Σ}, and (43) follows. As the matrix A_0^⊤ A_0 is nonsingular and the columns of the matrix [I_n, X_0] F are linearly independent, the columns of the matrix A_0^⊤ A_0 [I_n, X_0] F are linearly independent as well. Hence, (43) implies v = 0, and so x = F v = 0.
We have proved that the equality x^⊤(C^⊤ C + Σ)x = 0 implies x = 0. Thus, the positive semidefinite matrix C^⊤ C + Σ is nonsingular, and hence positive definite.

Eigenvalues and common eigenvectors of N and C_0^⊤ C_0
The rank-deficient positive semidefinite symmetric matrix C_0^⊤ C_0 can be factorized with an orthogonal matrix U; the matrix N then has an eigendecomposition with the same eigenvectors. The matrix N is nonsingular as soon as A_0^⊤ A_0 is nonsingular. Hence, under the conditions of Theorem 3.5, 3.6, or 3.7, the matrix N is nonsingular for m large enough.
Since C_0 X_ext^0 = 0, the columns of X_ext^0 are eigenvectors of C_0^⊤ C_0 with eigenvalue 0. As soon as N is nonsingular, the matrices N^{−1/2} and N^{−1/2} C_0^⊤ C_0 N^{−1/2} have eigendecompositions with the same eigenvectors, and their eigenvalues satisfy relations (47)-(50). These properties will be used in Sections 8.2 and 8.3.

Multivariate regression (d ≥ 1)
What follows is valid for both univariate (d = 1) and multivariate (d > 1) regression. Due to (44), N ≥ λ_min(A_0^⊤ A_0) I in the Loewner order; thus inequality (51) holds in the Loewner order. Hence, with inequality (45), we get the corresponding bound. Equation (24) is used to determine the sine. The TLS estimator X̂_ext is defined as a solution to the linear equations (8) for the ∆ that brings the minimum to (7). By Proposition 7.6, the same ∆ brings the minimum to (11). By Proposition 7.10, the functions (38) and (39) attain their minima at the point X̂_ext. Therefore, the minimum of the function (56) is attained for M = N^{1/2} X̂_ext. Now, apply Lemma 6.6 on perturbation bounds for a generalized invariant subspace. The unperturbed matrix (denoted A in Lemma 6.6) is N^{−1/2} C_0^⊤ C_0 N^{−1/2}; its null-space is the column space of the matrix N^{1/2} X_ext^0 (which is denoted X_0 in Lemma 6.6). The perturbed matrix (A + Ã in Lemma 6.6) is N^{−1/2} (C^⊤ C − mΣ) N^{−1/2}. The matrix B in Lemma 6.6 equals N^{−1/2} Σ N^{−1/2}. The norm of the perturbation is denoted ε (it is the norm of Ã in Lemma 6.6). The (n + d) × d matrix which brings the minimum to (56) is N^{1/2} X̂_ext. The other conditions of Lemma 6.6 are (47), (48), and (53). Again, with (55), (45) and (46), we obtain the required bounds.

Proof of the convergence ǫ → 0
In this section, we prove the convergence ε → 0: in probability for Theorem 3.5, and almost surely for Theorems 3.6 and 3.7. As ε = ‖M_1 + M_1^⊤ + M_2‖ ≤ 2‖M_1‖ + ‖M_2‖, the convergences M_1 → 0 and M_2 → 0 imply ε → 0.

End of the proof of Theorem 3.5. The right-hand side can be simplified, since E c̃_j N^{−1} c̃_i^⊤ = 0 for i ≠ j and E c̃_i N^{−1} c̃_i^⊤ = tr(Σ N^{−1}). The first factor in the right-hand side is bounded due to (50), as tr(C_0 N^{−1} C_0^⊤) ≤ n for m large enough. Now, construct an upper bound for the second factor. The conditions of Theorem 3.5 imply that λ_max(A_0^⊤ A_0) → ∞; therefore, M_1 → 0 in probability as m → ∞. Next, we prove that M_2 → 0 in probability as m → ∞; we apply the Rosenthal inequality (case 1 ≤ ν ≤ 2; Theorem 6.8) to construct a bound for E‖M_2‖^r.

By the conditions of Theorem 3.5, the sequence
End of the proof of Theorem 3.6.
By the Rosenthal inequality (case ν ≥ 2; Theorem 6.7),

Construct an upper bound for the first summand:

The asymptotic relation can be proved similarly; in order to prove it, we use the boundedness of the sequence

The conditions of Theorem 3.6 imply that Σ_{m=m_0}^∞ E ‖M_1‖^{2r} < ∞, whence M_1 → 0 as m → ∞, almost surely. Now, prove that M_2 → 0 almost surely. In order to construct a bound for E ‖M_2‖^r, use the Rosenthal inequality (case ν ≥ 2; Theorem 6.7) as well as (58).
End of the proof of Theorem 3.7. The proof of the asymptotic relation from Theorem 3.6 is still valid. The almost sure convergence M_1 → 0 as m → ∞ is proved in the same way as in Theorem 3.6. Now, show that M_2 → 0 as m → ∞, almost surely. Under the condition of Theorem 3.7,

whence, with (58), M_2 → 0 as m → ∞, a.s.
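The convergence statements above can be illustrated by simulation: for i.i.d. regressors, λ_min(A_0^⊤ A_0) grows like m, and the TLS estimate approaches X_0 as m → ∞. A Monte Carlo sketch assuming Σ = σ² I; the dimensions, noise level, and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 2, 1, 0.1
X0 = np.array([[1.0], [-2.0]])

def tls_fit(A, B):
    # Classical TLS via the SVD of [A, B]; assumes covariance sigma^2 * I.
    _, _, Vt = np.linalg.svd(np.hstack([A, B]))
    V = Vt.T
    return -V[:n, n:] @ np.linalg.inv(V[n:, n:])

errs = []
for m in (100, 10000):
    A0 = rng.standard_normal((m, n))              # true regressors
    A = A0 + sigma * rng.standard_normal((m, n))  # observed regressors
    B = A0 @ X0 + sigma * rng.standard_normal((m, d))
    errs.append(np.linalg.norm(tls_fit(A, B) - X0))
```

The estimation error for m = 10000 should be roughly an order of magnitude smaller than for m = 100, in line with the m^{-1/2} rate one expects under these conditions.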

Proof of the uniqueness theorems
Proof of Theorem 4.1. The random events 1, 2 and 3 are defined in the statement of this theorem on page 256. The random event 1 always occurs; this was proved in Section 2.2, where the estimator X̂_ext is defined. In order to prove the rest, we first construct the random event (59), which occurs either with high probability or eventually. Then we prove that, whenever (59) occurs, we have the existence and uniqueness asserted in the random event 3, and then prove that the random event 2 occurs. Now, we construct a modified version X̂_ext^mod of the estimator X̂_ext in the following way. If there exist solutions (∆, X̂_ext) to (7) & (8) such that sin∠(X̂_ext, X^0_ext) ≥ (1 + ‖X_0‖²)^{-1/2}, let X̂_ext^mod come from one of such solutions. Otherwise, if for every solution (∆, X̂_ext) to (7) & (8) we have sin∠(X̂_ext, X^0_ext) < (1 + ‖X_0‖²)^{-1/2}, let X̂_ext^mod come from one of these solutions. In any case, we construct X̂_ext^mod in such a way that it is a random matrix; this is possible, as follows from [17].
From the proof of Theorem 3.5 it follows that sin∠(X̂_ext^mod, X^0_ext) → 0 in probability as m → ∞. From the proof of Theorem 3.6 or 3.7 it follows that sin∠(X̂_ext^mod, X^0_ext) → 0 almost surely. Then the random event (59) occurs either with high probability or almost surely. Whenever the random event (59) occurs, for any solution ∆ to (7) and the corresponding full-rank solution X̂_ext to (8) (which always exists) it holds that sin∠(X̂_ext, X^0_ext) < (1 + ‖X_0‖²)^{-1/2}, whence, due to Theorem 8.3, the bottom d × d block of the matrix X̂_ext is nonsingular. Right-multiplying X̂_ext by a nonsingular matrix, we can transform it into the form (X̂; −I). The constructed matrix X̂ is a solution to equation (9) for the given ∆. Thus, we have just proved that if the random event (59) occurs, then for any ∆ which is a solution to (7), equation (9) has a solution. Now, prove the uniqueness of X̂. Let (∆_1, X̂_1) and (∆_2, X̂_2) be two solutions to (7) & (9). Show that X̂_1 = X̂_2. (If we can do this for ∆_1 = ∆_2, then the random event 3 occurs.) Denote X̂_ext,1 = (X̂_1; −I) and X̂_ext,2 = (X̂_2; −I). By Proposition 7.9, span X̂_ext,1 ⊂ span{u_k : ν_k ≤ ν_d} and span X̂_ext,2 ⊂ span{u_k : ν_k ≤ ν_d}, where ν_k and u_k are the generalized eigenvalues (arranged in ascending order) and the respective eigenvectors of the matrix pencil ⟨C^⊤C, Σ⟩.
Assume by contradiction that span X̂_ext,1 ≠ span X̂_ext,2. Then d < d^* = dim span{u_k : ν_k ≤ ν_d} (the notation d_* and d^* comes from the proof of Proposition 7.9). By Lemma 6.4, there exists a d-dimensional subspace V_12 for which span{u_k : ν_k < ν_d} ⊂ V_12 ⊂ span{u_k : ν_k ≤ ν_d} and sin∠(V_12, X^0_ext) = 1. Bind a basis of the d-dimensional subspace V_12 ⊂ R^{n+d} into the (n + d) × d matrix X̂_ext,3, so span X̂_ext,3 = V_12. Again by Proposition 7.9, for some matrix ∆, (∆, X̂_ext,3) is a solution to (7) & (9). Then sin∠(X̂_ext,3, X^0_ext) = 1 ≥ (1 + ‖X_0‖²)^{-1/2}. Then sin∠(X̂_ext^mod, X^0_ext) ≥ (1 + ‖X_0‖²)^{-1/2}, which contradicts (59). Thus, the random event 3 occurs. Now prove that the random event 2 occurs. Let ∆_1 and ∆_2 be two solutions to the optimization problem (7). Whenever the random event (59) occurs, the respective solutions X̂_1 and X̂_2 to equation (9) exist. By the already proved uniqueness, they are equal, i.e., X̂_1 = X̂_2. Then both ∆_1 and ∆_2 are solutions to the optimization problem for the fixed X̂_ext,1 = (X̂_1; −I) = (X̂_2; −I). By Proposition 7.2 and Remark 7.2-1, the least element in the optimization problem (28) for X = X̂_ext,1 is attained for a unique matrix ∆. Since it is attained, it is attained for both ∆_1 and ∆_2. Hence, ∆_1 = ∆_2. Thus, the random event 2 occurs.
We proved that the random event 1 always occurs, and that the random events 2 and 3 occur whenever (59) occurs, which happens either with high probability or eventually, as desired.
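The normalization step used in this proof (right-multiplying a full-rank X̂_ext whose bottom d × d block is nonsingular, so as to reach the form (X̂; −I)) can be sketched numerically; the helper name and the example data are illustrative:

```python
import numpy as np

def normalize_ext(X_ext, d):
    """Right-multiply X_ext by a nonsingular matrix to reach the form (X; -I).

    Requires the bottom d x d block to be nonsingular, which Theorem 8.3
    guarantees when sin angle(X_ext, X0_ext) < (1 + ||X0||^2)^{-1/2}.
    """
    top, bottom = X_ext[:-d, :], X_ext[-d:, :]
    # X_ext @ (-inv(bottom)) = (-top @ inv(bottom); -I), so X = -top @ inv(bottom).
    return -top @ np.linalg.inv(bottom)

# A full-rank X_ext whose span equals that of (X0; -I):
X0 = np.array([[2.0], [3.0]])
M = np.array([[5.0]])                    # an arbitrary nonsingular column mix
X_ext = np.vstack([X0, -np.eye(1)]) @ M
X_hat = normalize_ext(X_ext, d=1)
```

Since the column space is unchanged by the nonsingular right factor, the recovered X̂ equals X_0.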
Remark 8.1. This uniqueness of the solution ∆ to the optimization problem (7) agrees with the uniqueness result in [6]. The solution is unique if ν_d < ν_{d+1}.
Proof of Theorem 4.2. 1. In Theorem 4.1, the event 1 occurs always, not just with high probability or eventually. The solution ∆ to (7) exists and also solves (11) due to Proposition 7.6. Thus, the first sentence of Theorem 4.2 is true. The second sentence of Theorem 4.2 has already been proved, since the constraints in the optimization problems (7) and (11) are the same.
2 & 3. The proof of consistency of the estimator defined with (11) & (9) and of the existence of the solution is similar to the proof for the estimator defined with (7) & (9) in Theorems 3.5-3.7 and 4.1. The only difference is skipping the use of Proposition 7.6. Notice that we do not prove the uniqueness of the solution because we cannot use Proposition 7.9.
To Remark 4.2-1. The amended Theorem 4.2 can be proved similarly. In the proof of part 1, read "The solution ∆ to (7) . . . solves (12) due to Proposition 7.8." In the proof of parts 2 and 3, read "The only difference is using Proposition 7.8, part 2 instead of Proposition 7.6."

Proofs of auxiliary results

8.5 Proof of lemmas on perturbation bounds for invariant subspaces
Proof of Lemma 6.5 and Remark 6.5-1. For the proof of Lemma 6.5 itself, see parts 2 and 3 of the proof below. For the proof of Remark 6.5-1, see parts 2, 3 and 4 below. Part 1 is a mere discussion of why the conditions of Remark 6.5-1 are more general than those of Lemma 6.5.
In the proof, we assume that {x : x ⊤ Bx > 0} is the domain of the function f (x). The assumption affects the definition of lim x→x * f (x), and inf f is the infimum of f (x) over the domain.
1. At first, clarify the conditions of Remark 6.5-1. As it is, the existence of a point x at which the lim inf is attained is assumed in Remark 6.5-1. Now, prove that, under the preceding condition of Remark 6.5-1, there exists a vector x ≠ 0 that satisfies (61). The function f(x) is homogeneous of degree 0. Hence, all values attained by f(x) on its domain {x : x^⊤ B x > 0} are also attained on the bounded set {x : ‖x‖ = 1, x^⊤ B x > 0}. (In equations (62) and (63), we assume that f(x_*) makes sense.) Hence (62) follows from (63) and thus holds true either way. Taking the limit in the relation f(x) ≥ inf f, we obtain the opposite inequality. Thus, the equality (25) holds true for some x_* ∈ F. Note that ‖x_*‖ = 1, so x_* ≠ 0.

2. Prove that, under the conditions of Lemma 6.5 or Remark 6.5-1,

Under the conditions of Lemma 6.5 the function f(x) is well-defined at x_0 and attains its minimum at x_*, so f(x_*) ≤ f(x_0).
Under the conditions of Remark 6.5-1 we consider three cases concerning the value of x_*^⊤ B x_*.
But on the domain of f(x) the inequality x^⊤ B x > 0 holds true. Since x_* is a limit point of the domain of f(x), the inequality x_*^⊤ B x_* ≥ 0 holds true, and Case 1 is impossible.
which cannot equal inf f(x). The contradiction obtained implies that x_*^⊤(A + Ã)x_* ≤ 0.
Then the function f(x) is well-defined at x_*, and
Notation. If A and B are symmetric matrices of the same size, and furthermore the matrix B is positive definite, denote

This notation is used in the proof of Lemma 6.6.
Let X ∈ R^{n×d_1} be a matrix of full rank, and let V be a d_2-dimensional subspace of R^n. Then

Proof. Using the min-max theorem, the relation span X = span(P_{span X}), and simple properties of orthogonal projectors, construct the inequality

On the other hand,

Thus,

Hence the subspaces span X and V^⊥ have a nontrivial intersection, i.e., there exists w ≠ 0, w ∈ span X ∩ V^⊥. Then P_{span X}(I − P_V)w = w, whence ‖P_{span X}(I − P_V)‖ ≥ 1. On the other hand, ‖P_{span X}(I − P_V)‖ ≤ ‖P_{span X}‖ · ‖I − P_V‖ ≤ 1. Thus, ‖P_{span X}(I − P_V)‖ = 1. This completes the proof.
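The dimension-count argument of this proof is easy to check numerically. A sketch, assuming d_1 > d_2 so that span X must intersect V^⊥ nontrivially; the sizes and data are illustrative:

```python
import numpy as np

def proj(M):
    # Orthogonal projector onto the column space of M.
    Q, _ = np.linalg.qr(M)
    return Q @ Q.T

rng = np.random.default_rng(2)
n, d1, d2 = 6, 3, 2            # d1 > d2, so span X meets the complement of V
X = rng.standard_normal((n, d1))
V = rng.standard_normal((n, d2))

# Spectral norm of P_spanX (I - P_V); the lemma's argument gives exactly 1:
# any w in span X intersected with V-perp is fixed by both factors.
norm = np.linalg.norm(proj(X) @ (np.eye(n) - proj(V)), ord=2)
```

Here span X is 3-dimensional and V^⊥ is 4-dimensional inside R^6, so their intersection has dimension at least 1, forcing the norm to equal 1.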
Proof of Lemma 6.6. The matrix B is positive semidefinite, the matrix X_0^⊤ B X_0 is positive definite, and the matrix X_0 is of full rank d (hence, n ≥ d). The matrix A satisfies the inequality A ≥ λ_{d+1}(A)(I − P_{span X_0}) in the Loewner order.
Let X be a point where the functional f (x) defined in (26) attains its minimum.
Using the relations

the desired inequality follows from (64):

8.6 Comparison of sin∠(X̂_ext, X^0_ext) and ‖X̂ − X_0‖

In the next theorem and in its proof, the matrices A, B and Σ have a different meaning than elsewhere in the paper.

then: 1) the matrix B is nonsingular;

Proof. 1. Split the matrix P^⊥_{(X_0; −I)}, which is the orthogonal projector along the column space of the matrix (X_0; −I), into four blocks:

Up to the end of the proof, P_1 means the upper-left n × n block of the (n + p) × (n + p) matrix. Let X_0 = U Σ V^⊤ be a singular value decomposition of the matrix X_0 (here Σ is a diagonal n × d matrix, and U and V are orthogonal matrices). Then

The n × n matrix I − Σ(Σ^⊤ Σ + I)^{-1} Σ^⊤ is diagonal; its diagonal entries are 1/(1 + σ_i²), where σ_i are the singular values of X_0, together with 1 for the remaining entries. Those diagonal entries comprise all the eigenvalues of P_1.

2. Due to equation (23), the square of the largest of the sines of the canonical angles between the subspaces V_1 and V_2 is equal to

3. Prove the first statement of Theorem 8.3 by contradiction. Suppose that the matrix B is singular. Then there exist f ∈ R^d \ {0} and u = Af ∈ R^n such that Bf = 0, where V_1 ⊂ R^{n+d} is the column space of the matrix (A; B). As the columns of the matrix (A; B) are linearly independent, (u; 0) ≠ 0. Then, by (66), which contradicts condition (65).

4. Prove inequality (67). (Later on, we will show that the second statement of Theorem 8.3 follows from (67).) There exists such a vector
Notice that z ≠ 0 because B^{-1}f ≠ 0 and the columns of the matrix (A; B) are linearly independent. Thus,

By (66),

5. Prove that the second statement of Theorem 8.3 follows from (67). The function s(δ) is strictly increasing on [0, +∞), with s(0) = 0 and lim_{δ→+∞} s(δ) = (1 + ‖X_0‖²)^{-1/2}. Therefore, inequality (67) implies the implication:

The equivalent contrapositive implication is as follows:

The inverse function to s(δ) in (68) is

Substitute δ = δ(sin∠((A; B), (X_0; −I))) into (69) and obtain the following statement:

whence the second statement of Theorem 8.3 follows. In part 5 of the proof, condition (65) is used twice. First, it is one of the conditions of the first statement of the theorem: without it, the matrix B might be singular. Second, the function δ(s) is defined on the interval [0, (1 + ‖X_0‖²)^{-1/2}).

Here T_{i1} is the matrix constructed of the first i columns of T, and T_{i2} is the matrix constructed of the last n − i + 1 columns of T. Denote by V_1 and V_2 the column spaces of the matrices T_{i1} and T_{i2}, respectively. Then dim V_1 = i and dim V_2 = n − i + 1.
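Part 5 above relates sin∠((X̂; −I), (X_0; −I)) to ‖X̂ − X_0‖ monotonically. The following sketch computes the largest canonical angle via QR and SVD (a standard construction, not the paper's formula (68)) and illustrates that the two quantities shrink together; the data are illustrative:

```python
import numpy as np

def sin_max_angle(X, Y):
    """Sine of the largest canonical angle between span X and span Y.

    The cosines of the canonical angles are the singular values of
    Qx.T @ Qy for orthonormal bases Qx, Qy; the largest sine is
    sqrt(1 - smallest cosine squared).
    """
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.sqrt(max(0.0, 1.0 - s.min() ** 2))

X0 = np.array([[1.0], [0.5]])
ext = lambda X: np.vstack([X, -np.eye(X.shape[1])])

# As ||X - X0|| shrinks, so does the sine of the angle between the
# extended matrices, which is what part 5 extracts from (67).
sines = [sin_max_angle(ext(X0 + t * np.ones_like(X0)), ext(X0))
         for t in (0.5, 0.05, 0.005)]
```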

The proof of the fact that

In other words, if ν_i < ∞, then the relations (70) hold true for λ = ν_i and V = V_1.

The proof of the fact that ν_i is a lower bound of the set

In other words, if there exists a subspace V ⊂ R^n such that the relations (70) hold true, then ν_i ≤ λ.
For j ≥ i, due to the inequality ν_j ≥ ν_i > 0 and the conditions of the lemma, the case λ_j = 0 is impossible; thus λ_j > 0. Prove the inequality λ_j − λµ_j > 0. If µ_j > 0,

Since this holds for all v ∈ V_2 \ {0}, the restriction of the quadratic form A − λB onto the linear subspace V_2 is positive definite. On the one hand, since (A − λB)|_V ≤ 0 and (A − λB)|_{V_2} > 0, the subspaces V and V_2 have a trivial intersection. On the other hand, since dim V + dim V_2 = n + 1 > n, the subspaces V and V_2 cannot have a trivial intersection. We obtain a contradiction.
Hence ν_i ≤ λ, and ν_i is a lower bound of the set.

Since the n × n covariance matrix Σ is positive semidefinite, for every k × n matrix M the equality span(M Σ M^⊤) = span(M Σ) holds true. This can be proved with the use of the matrix square root.
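The identity span(MΣM^⊤) = span(MΣ) can be verified numerically through ranks: with S = Σ^{1/2}, both subspaces coincide with span(MS). A sketch with an intentionally singular Σ; the sizes and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
k, n = 3, 5
M = rng.standard_normal((k, n))
L = rng.standard_normal((n, 2))
Sigma = L @ L.T                 # singular positive semidefinite, rank 2

# Equal ranks plus the inclusion span(M Sigma M') in span(M Sigma)
# yield equality of the two subspaces.
r1 = np.linalg.matrix_rank(M @ Sigma @ M.T)
r2 = np.linalg.matrix_rank(M @ Sigma)
r3 = np.linalg.matrix_rank(np.hstack([M @ Sigma @ M.T, M @ Sigma]))
```

Generically all three ranks equal rank(Σ) = 2 here; the stacked rank r3 confirms that the first span is contained in the second.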
In what follows, for a fixed (n + d) × d matrix X denote

where C is an m × (n + d) matrix and Σ is an n × n positive semidefinite matrix.

1. Sufficiency. Relation (30) is a sufficient condition for compatibility of the constraints in (28). Let span(X^⊤ C^⊤) ⊂ span(X^⊤ Σ). Then X^⊤ C^⊤ = X^⊤ Σ M for some matrix M. The constraints ∆(I − P_Σ) = 0 and (C − ∆)X = 0 are satisfied for ∆ = M^⊤ Σ, so they are compatible.

Remark 7.2-1. The least point is attained for a unique ∆.
It is enough to show that if ∆ satisfies the constraints and ∆Σ^†∆^⊤ = CX(X^⊤ΣX)^†X^⊤C^⊤, then ∆ = ∆_pm. Indeed, if ∆ satisfies the constraints ∆(I − P_Σ) = 0 and (C − ∆)X = 0, and ∆Σ^†∆^⊤ = CX(X^⊤ΣX)^†X^⊤C^⊤, then due to (72)

As Σ^† is a positive semidefinite matrix, (∆ − ∆_pm)Σ^† = 0 and (∆ − ∆_pm)P_Σ = (∆ − ∆_pm)Σ^†Σ = 0. Add the equality ∆(I − P_Σ) = 0 (which is one of the constraints) and subtract the equality ∆_pm(I − P_Σ) = 0 (which is one of the equalities (31) and holds true due to part 2a of the theorem). We obtain ∆ = ∆_pm.

Proof of Proposition 7.3. 1. Necessity. Since the matrices C^⊤C and Σ are positive semidefinite, the matrix pencil ⟨C^⊤C, Σ⟩ is definite if and only if the matrix C^⊤C + Σ is positive definite. As the columns of the matrix X are linearly independent, the matrix X(

If the constraints are compatible, then the condition (30) holds true, whence

Since span(X^⊤ΣX) = R^d, the matrix X^⊤ΣX is nonsingular.

Sufficiency. If the matrix X^⊤ΣX is nonsingular, then

Thus the condition (30), which is the necessary and sufficient condition for compatibility of the constraints, holds true.
Proof of Proposition 7.4. Construct a simultaneous diagonalization of the matrices X^⊤C^⊤CX and X^⊤ΣX (according to Theorem 6.2) that satisfies Remark 6.2-2:

The notations Λ, M, T = [T_1, T_2], µ_i, λ_i, ν_i are taken from Theorem 6.2, Remark 7.2-1, and Lemma 7.1. The subspace span(X^⊤C^⊤) = span((T^{-1})^⊤Λ) is spanned by the columns of the matrix (T^{-1})^⊤ that correspond to nonzero λ_i's. Similarly, the subspace span(X^⊤Σ) = span((T^{-1})^⊤M) is spanned by the columns of the matrix (T^{-1})^⊤ that correspond to nonzero µ_i's. Note that the columns of the matrix (T^{-1})^⊤ are linearly independent. The condition span(X^⊤C^⊤) ⊂ span(X^⊤Σ) is satisfied if and only if λ_i = 0 for all i such that µ_i = 0 (that is, ν_i < ∞, i = 1, . . . , d, where the notation ν_i = λ_i/µ_i comes from Theorem 6.2). Thus, due to Proposition 6.3,

Construct the chain of equalities:

Equality (a) follows from Proposition 7.2, because the matrix CX(X^⊤ΣX)^†X^⊤C^⊤ is the least value of the expression ∆Σ^†∆^⊤ under the constraints ∆(I − P_Σ) = 0 and (C − ∆)X = 0. Equality (b) follows from the relation between the characteristic polynomials of two products of two rectangular matrices: because CXT is an m × d matrix and M^†T^⊤X^⊤C^⊤ is a d × m matrix, the matrix CXTM^†T^⊤X^⊤C^⊤ has all the eigenvalues of the matrix M^†T^⊤X^⊤C^⊤CXT = M^†Λ and, besides them, the eigenvalue 0 of multiplicity m − d. All these eigenvalues are nonnegative.
Equality (c) holds true due to Lemma 7.1.
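For a positive definite Σ, the generalized eigenvalues ν_i = λ_i/µ_i of a pencil such as ⟨C^⊤C, Σ⟩ can be computed by reducing to an ordinary symmetric eigenproblem via a Cholesky factor. A sketch with illustrative data; the paper also treats singular Σ, where some ν_i may be infinite, and that case is not handled here:

```python
import numpy as np

def gen_eigvals(A, B):
    """Generalized eigenvalues nu_i of the symmetric pencil <A, B>.

    Sketch for positive definite B: with B = L L', the pencil's
    eigenvalues are the ordinary eigenvalues of inv(L) A inv(L)'.
    """
    L = np.linalg.cholesky(B)
    Linv = np.linalg.inv(L)
    return np.sort(np.linalg.eigvalsh(Linv @ A @ Linv.T))

rng = np.random.default_rng(4)
k = 4
G = rng.standard_normal((k, k))
A = G @ G.T                     # positive semidefinite, like C'C
B = np.eye(k)                   # Sigma = I for the sketch
nu = gen_eigvals(A, B)
```

With B = I the generalized eigenvalues reduce to the ordinary eigenvalues of A, which gives a simple consistency check.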
Since the columns of the matrix X are linearly independent, there is a one-to-one correspondence between subspaces of span X and subspaces of R^d: if V is a subspace of span X, then there exists a unique subspace V_1 ⊂ R^d, and for those V and V_1,

• the restriction of the quadratic form C^⊤C − λΣ to the subspace V is negative semidefinite if and only if the restriction of the quadratic form X^⊤C^⊤CX − λX^⊤ΣX to the subspace V_1 is negative semidefinite.
Hence, equality (d) holds true. Equation (34) is proved. As to Remark 7.4-1, the minimum in the left-hand side of (34) is attained for ∆ = ∆ pm . The minimum in the right-hand side of (34) is attained if the subspace V is a linear span of k columns of the matrix XT that correspond to the k least ν i 's.
Proof of Proposition 7.5. By Lemma 7.1 and Proposition 7.4, the inequality (37) is equivalent to the obvious inequality

From the proof it follows that if ν_d = ∞, then for any (n + d) × d matrix X of rank d the constraints in (28) are not compatible. Now prove that if ν_d < ∞ and X = [u_1, u_2, . . . , u_d], then the inequality in Proposition 7.5 becomes an equality. Indeed, then the constraints in (28) are compatible because they are satisfied for ∆ = CTDT^{-1}, where M_d = diag(µ_1, . . . , µ_d) and Λ_d = diag(λ_1, . . . , λ_d) are principal submatrices of the matrices M and Λ, respectively.
We have

where the inequalities hold true due to the positive semidefiniteness of Σ and due to Proposition 7.5.
If ν_d = ∞, then the constraints ∆(I − P_Σ) = 0 and rk(C − ∆) ≤ n are not compatible. Otherwise, the equality in (73) is attained for ∆ = ∆_em := CX(X^⊤ΣX)^†X^⊤Σ, where the matrix X consists of the first d columns of the matrix T, with T coming from the decomposition (35).
Thus, if the constraints in (7) are compatible, then the minimum is equal to (Σ_{k=1}^d ν_k)^{1/2} and is attained at ∆_em. Otherwise, if the constraints are incompatible, then by contraposition to the second statement of Proposition 7.5, ν_d = +∞ and (Σ_{k=1}^d ν_k)^{1/2} = +∞. If the minimum in (7) is attained at ∆, then the inequality (73) becomes an equality, whence (74) holds; in particular,

Remember that ν_d is the minimum value in (11). Thus, the minimum in (11) is attained at ∆, although it may also be attained elsewhere.
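In the special case Σ = I, the minimum value (ν_1 + · · · + ν_d)^{1/2} is the Eckart-Young bound: the Frobenius distance from C to the nearest matrix of rank n, with the ν_k being the ordinary eigenvalues of C^⊤C. A numerical sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, d = 40, 3, 2
C = rng.standard_normal((m, n + d))

# Best rank-n approximation of C in the Frobenius norm (Eckart-Young):
U, s, Vt = np.linalg.svd(C, full_matrices=False)
C_hat = (U[:, :n] * s[:n]) @ Vt[:n, :]
delta = C - C_hat

# With Sigma = I the generalized eigenvalues nu_k of <C'C, Sigma> are the
# ordinary eigenvalues of C'C; the d smallest give the minimum value.
nu = np.sort(np.linalg.eigvalsh(C.T @ C))
min_value = np.sqrt(nu[:d].sum())
```

The Frobenius norm of the optimal perturbation delta matches the predicted minimum value up to rounding.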
Proof of Proposition 7.7. 1. The monotonicity follows from the results of [14]. A unitarily invariant norm is a symmetric gauge function of the singular values, and a symmetric gauge function is monotone in its non-negative arguments (see [14, ineq. (2.5)]).
Proof of Proposition 7.9. We can assume that µ_i ∈ {0, 1} in (35). The set of matrices ∆ that satisfy (8) depends only on span X_ext and does not change under linear transformations of the columns of X_ext.
By linear transformations of the columns, the matrix T^{-1}X_ext can be transformed to the reduced column echelon form. Thus, there exists an (n + d) × d matrix T_5 in the column echelon form such that span X_ext = span(TT_5). Notice that rk T_5 = rk X_ext = d.
Denote by d_* and d^* the first and the last of the indices i such that ν_i = ν_d. Then

Necessity. Let ∆ be a point where the constrained minimum in (7) is attained. Then the equalities (74)-(75) from the proof of Proposition 7.6 hold true. Thus, due to Propositions 7.4 and 7.5, for all k = 1, . . . , d

According to Remark 7.4-1, we can construct a stack of subspaces such that dim V_k = k and the restriction of the quadratic form C^⊤C − ν_kΣ to the subspace V_k is negative semidefinite, for all k ≤ d. Now, prove that

Suppose the contrary: span{u_i : ν_i < ν_d} ⊄ span X_ext. Then there exists i < d_* such that u_i ∉ span X_ext, and, as a consequence, u_i ∉ V_{max{j : ν_j ≤ ν_i}}. Find the least k such that u_k ∉ V_{max{j : ν_j ≤ ν_k}}. Let k_* and k^* denote the first and the last indices i such that ν_i = ν_k. Since u_k ∉ V_{k^*}, u_k ∉ V_{k^*} ∩ span{u_{k_*}, . . . , u_{n+d}},

dim span(V_{k^*} ∩ span{u_{k_*}, . . . , u_{n+d}}, u_k) = k^* − k_* + 2.
Sufficiency. Remember that T = [u_1, . . . , u_{n+d}] is an (n + d) × (n + d) matrix of generalized eigenvectors of the matrix pencil ⟨C^⊤C, Σ⟩, with the respective generalized eigenvalues arranged in ascending order. By means of linear operations on the columns, the matrix T^{-1}X_ext can be transformed into the reduced column echelon form. In other words, there exists a d × d nonsingular matrix T_8 such that the (n + d) × d matrix

is in the reduced column echelon form. The equality (83) implies that span X_ext = span(TT_5).
If condition (37) holds, then in the representation (84) the matrix T_5 has the following block structure,

where T_61 is a (d^* − d_* + 1) × (d − d_* + 1) reduced column echelon matrix. (Any of the blocks except T_61 may be an "empty matrix".) Since the columns of T_5 are linearly independent, the columns of T_61 are linearly independent as well. Hence the matrix T_61 may be appended with columns such that the resulting matrix T_6 = [T_61, T_62] is nonsingular. Perform the Gram-Schmidt orthogonalization of the columns of the matrix T_6 by constructing an upper-triangular matrix such that the subspace is spanned by the first d columns of the matrix T_new. It can be easily verified that span(X_ext^⊤ C^⊤) = span(T_8^⊤ T_5^⊤ Λ) and span(X_ext^⊤ Σ) = span(T_8^⊤ T_5^⊤ M). The condition span(X_ext^⊤ C^⊤) ⊂ span(X_ext^⊤ Σ) holds true if (and only if) ν_d < ∞. Thus, due to Proposition 7.2, if the condition ν_d < ∞ holds true, then the constraints ∆(I − P_Σ) = 0 and (C − ∆)X_ext = 0 are compatible.
Proof of Proposition 7.10. Remember that if ν_d < ∞, then the constraints in (11) are compatible, and the minimum is attained and is equal to ν_d; see Proposition 7.5. Otherwise, if ν_d = ∞, then the constraints in (11) are incompatible.
Transform the expression for the functional (38):

Here we used the fact that the nonzero eigenvalues of a product of two matrices do not change when the factors are swapped, and we also used Propositions 7.2 and 7.3. By Proposition 7.5, Q_1(X) ≥ ν_d. If the minimum in (11) & (8) is attained (say, at some point (∆, X̂_ext)), then the constraints in the right-hand side of (85) are compatible for X = X̂_ext (in particular, ∆ is a matrix that satisfies the constraints). Then, by Proposition 7.3, the matrix X̂_ext^⊤ Σ X̂_ext is nonsingular. Thus, for X = X̂_ext, the minimum in the right-hand side of (85) is attained at ∆_1 = ∆ (because ∆ satisfies the stronger constraints of (85) and brings the minimum to the same functional with the weaker constraints of (11)). Hence Q_1(X̂_ext) = ν_d, which is the minimum value of Q_1.
Transform the expression for the functional (39):

Hence, the functionals (38) and (39) attain their minimal values at the same points.

Conclusion
The linear errors-in-variables model is considered. The errors are assumed to have the same covariance matrix for each observation and to be independent between different observations; however, some variables may be observed without errors. Detailed proofs of the consistency theorems for the TLS estimator, which were first stated in [18], are presented. It is proved that the final estimator X̂ of the explicit-notation regression coefficients (i.e., of X_0 in (1) or (2), as opposed to the estimator X̂_ext of X^0_ext in equation (3), which sets the relationship between the regressors and the response variables implicitly) is unique, either with high probability or eventually. This means that, in the classification used in [8], the TLS problem belongs to the 1st class F_1 (the solution is unique and "generic"), with high probability or eventually.
As a by-product, we get that if in the definition of the estimator the Frobenius norm is replaced by the spectral norm, then the consistency theorems still hold true. The disadvantage of using the spectral norm is that the estimator X̂ is then not unique. (The set of solutions to the minimal spectral norm problem contains the set of solutions to the TLS problem. On the other hand, it is possible that the minimal spectral norm problem has solutions but the TLS problem has none; this is a TLS problem of the 3rd class F_3, and the probability of this random event tends to 0.) The results can be generalized to any unitarily invariant matrix norm. I do not know whether they hold true for non-invariant norms such as the maximum absolute entry, which is studied in [7].