Abstract
When data analysts train a classifier and check if its accuracy is significantly different from a half, they are implicitly performing a two-sample test. We investigate the statistical optimality of this indirect but flexible method in the high-dimensional setting of $d/n \to c \in (0,\infty)$. We provide a concrete answer for the case of distinguishing Gaussians with mean-difference $\delta$ and common (known or unknown) covariance $\Sigma$, by contrasting the indirect approach using variants of linear discriminant analysis (LDA) such as naive Bayes, with the direct approach using corresponding variants of Hotelling’s test. Somewhat surprisingly, the indirect approach achieves the same power as the direct approach in terms of $n,d,\delta,\Sigma$, and is only worse by a constant factor, achieving an asymptotic relative efficiency of $1/\pi$ for the balanced sample case. Other results of independent interest are provided, such as minimax lower bounds, and optimality of Hotelling’s test when $d=o(n)$. Simulation results validate our theory, and we present practical takeaway messages along with several open problems.
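The indirect approach described above can be sketched in a few lines. The following is a minimal illustration (not the paper's experimental setup): draw two Gaussian samples differing by a mean shift, fit the naive-Bayes/identity-covariance variant of LDA on a training split, and test whether held-out accuracy differs from one half using a normal approximation to the binomial null. The dimensions, shift size, and split are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): d features, n points per class
d, n = 50, 500
delta = 0.5 / np.sqrt(d)  # per-coordinate mean shift, so ||delta|| = 0.5

X = rng.normal(0.0, 1.0, size=(n, d))    # sample from N(0, I)
Y = rng.normal(delta, 1.0, size=(n, d))  # sample from N(delta * 1, I)

# Split each sample into training and held-out halves
Xtr, Xte = X[: n // 2], X[n // 2 :]
Ytr, Yte = Y[: n // 2], Y[n // 2 :]

# Naive-Bayes LDA with identity covariance: classify by the sign of the
# projection onto the estimated mean-difference direction
w = Ytr.mean(axis=0) - Xtr.mean(axis=0)
b = 0.5 * (Ytr.mean(axis=0) + Xtr.mean(axis=0)) @ w

# Held-out accuracy, averaged over the two (equal-sized) classes
acc = 0.5 * ((Xte @ w < b).mean() + (Yte @ w >= b).mean())

# Under H0 (no difference), held-out labels are coin flips, so the
# accuracy has mean 1/2 and standard deviation 1 / (2 * sqrt(m))
m = len(Xte) + len(Yte)
z = (acc - 0.5) * 2.0 * np.sqrt(m)

print(f"held-out accuracy = {acc:.3f}, z-statistic = {z:.2f}")
```

Rejecting when `z` exceeds the standard normal quantile gives the classifier-accuracy two-sample test; the direct alternative is to compute the corresponding Hotelling-type statistic on the same data.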
URL
http://arxiv.org/abs/1602.02210