The area under the receiver operating characteristic curve (AUC) is a suitable measure of the quality of classification algorithms. We use the theory of U-statistics to derive new confidence intervals for it. The new confidence intervals take into account that only the total sample size used to calculate the AUC can be controlled, while the number of members of the case group and the number of members of the control group are random. We show that the new confidence intervals can be used not only to evaluate the quality of the fitted model, but also to judge the quality of the classification algorithm itself. We also show that two popular confidence intervals for the AUC, namely DeLong’s interval and the Mann–Whitney intervals due to Sen, coincide.
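To make the U-statistic view of the AUC concrete, the following is a minimal Python sketch of the empirical AUC as a two-sample Mann–Whitney statistic, together with the standard DeLong variance estimate. It illustrates the classical interval mentioned in the abstract, not the new intervals derived in the paper. The function name `auc_delong` and the example scores are hypothetical.

```python
import numpy as np

def auc_delong(case_scores, control_scores):
    """Empirical AUC and DeLong variance estimate (illustrative sketch).

    Assumes higher scores indicate cases and that each group has at
    least two members, so the sample variances are defined.
    """
    cases = np.asarray(case_scores, dtype=float)
    controls = np.asarray(control_scores, dtype=float)
    m, n = cases.size, controls.size

    # Mann-Whitney kernel: 1 if case > control, 0.5 on ties, 0 otherwise.
    diff = cases[:, None] - controls[None, :]
    kernel = (diff > 0).astype(float) + 0.5 * (diff == 0)

    # The AUC is the average of the kernel over all case-control pairs.
    auc = kernel.mean()

    # Structural components: per-case and per-control averages.
    v10 = kernel.mean(axis=1)  # one value per case
    v01 = kernel.mean(axis=0)  # one value per control

    # DeLong variance estimate of the AUC.
    var = v10.var(ddof=1) / m + v01.var(ddof=1) / n
    return auc, var

# Hypothetical scores, for illustration only.
auc, var = auc_delong([0.9, 0.8, 0.4], [0.5, 0.3])
```

An approximate 95% interval under a normal approximation would then be `auc ± 1.96 * sqrt(var)`. This sketch treats both group sizes as fixed, which is exactly the assumption the abstract's new intervals relax.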
This paper deals with a homoskedastic errors-in-variables linear regression model and the properties of the total least squares (TLS) estimator. We partly revise the consistency results for the TLS estimator previously obtained by the author [18]. We present complete proofs of the consistency theorems. The theoretical foundation for the construction of the TLS estimator and its relation to the generalized eigenvalue problem are explained. In particular, the uniqueness of the estimate is proved. The Frobenius norm in the definition of the estimator can be replaced by the spectral norm, or by any other unitarily invariant norm, and the consistency results remain valid.
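As an illustration of how a TLS estimate can be computed, here is a minimal Python sketch of the classical SVD-based construction for the model Y ≈ XB. It assumes errors with the same variance in every column of X and Y. The more general setting in the paper, formulated through a generalized eigenvalue problem, is not reproduced here. The function name `tls_estimate` is hypothetical.

```python
import numpy as np

def tls_estimate(X, Y):
    """Classical total least squares estimate of B in Y ~ X B (sketch).

    Assumes at least as many observations as columns of [X, Y] and that
    the lower-right block of the right singular vectors is invertible.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if Y.ndim == 1:
        Y = Y[:, None]
    n = X.shape[1]

    # Right singular vectors of the stacked data matrix [X, Y].
    _, _, vt = np.linalg.svd(np.hstack([X, Y]), full_matrices=False)
    V = vt.T

    # Block of V belonging to the smallest singular values.
    V12 = V[:n, n:]
    V22 = V[n:, n:]

    # TLS solution: B = -V12 V22^{-1}.
    return -V12 @ np.linalg.inv(V22)

# Hypothetical noise-free data: the estimate recovers the true coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
b_true = np.array([2.0, -1.0])
b_hat = tls_estimate(X, X @ b_true).ravel()
```

When `V22` is singular, this construction yields no estimate, which is why the uniqueness question raised in the abstract matters.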
In this paper we develop a general framework for quantifying how binary risk factors jointly influence a binary outcome. Our key result is an additive expansion of odds ratios as a sum of marginal effects and interaction terms of varying order. These odds ratio expansions are used to estimate the excess odds ratio, attributable proportion and synergy index for a case-control dataset by maximum likelihood from a logistic regression model. The confidence intervals associated with these estimates of joint effects and interaction of risk factors rely on the delta method. Our methodology is illustrated with a large Nordic meta dataset for multiple sclerosis, which combines four studies with a total of 6265 cases and 8401 controls. It has three risk factors (smoking and two genetic factors) and a number of other confounding variables.
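For the simplest case of two binary risk factors, the three interaction measures can be sketched as follows. This minimal Python example uses the standard two-factor definitions. These are an assumption here, since the paper's general expansion covers interaction terms of higher order. The function name `interaction_measures` and the coefficient values are hypothetical, and the delta-method confidence intervals are not included.

```python
import math

def interaction_measures(b1, b2, b12):
    """Additive interaction measures from logistic regression coefficients.

    b1, b2 are the main-effect log odds ratios of the two risk factors
    and b12 is the coefficient of their product term (sketch only).
    """
    or10 = math.exp(b1)             # exposed to factor 1 only
    or01 = math.exp(b2)             # exposed to factor 2 only
    or11 = math.exp(b1 + b2 + b12)  # exposed to both factors

    # Additive expansion: or11 = 1 + (or10 - 1) + (or01 - 1) + excess.
    excess = or11 - or10 - or01 + 1.0

    # Share of the joint odds ratio attributed to interaction.
    attributable_proportion = excess / or11

    # Ratio of the joint excess to the sum of the marginal excesses.
    synergy_index = (or11 - 1.0) / ((or10 - 1.0) + (or01 - 1.0))
    return excess, attributable_proportion, synergy_index

# Hypothetical coefficients, for illustration only.
excess, ap, s = interaction_measures(math.log(2), math.log(3), math.log(1.5))
```

With these coefficients the odds ratios are 2, 3 and 9, so the excess odds ratio is 5, the attributable proportion is 5/9 and the synergy index is 8/3.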