A Semi-Supervised Kernel Two-Sample Test

Jan 6, 2025 · Gyumin Lee, Ilmun Kim, Shubhanshu Shekhar · 1 min read
Abstract
Recent advances in statistics and machine learning have spurred the development of semi-supervised methods that integrate labeled and unlabeled data, and statistical inference in this setting has attracted growing attention. The kernel Maximum Mean Discrepancy (MMD) test is a widely used two-sample test for detecting distributional differences. However, standard kernel-MMD tests typically rely on computationally expensive permutation procedures to calibrate their rejection thresholds. Moreover, incorporating additional covariates introduces a further complication: under the null hypothesis, the covariate distributions need not match, which violates the exchangeability assumption required by permutation tests. To address these challenges, we extend kernel-based two-sample testing to a semi-supervised setting using sample splitting and studentization. We establish that our test statistic is asymptotically Normal under the null. We further show that, with or without unlabeled data, the test statistic is approximately Normal under the alternative, which facilitates accurate power analysis. We also show that using unlabeled data increases the test’s power, even though the test is already consistent without it. We derive an explicit power expression for bilinear kernels to substantiate these findings and validate the proposed method’s improved performance through numerical simulations.
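The role of sample splitting and studentization can be illustrated with a minimal sketch. The code below is not the paper's semi-supervised statistic: it ignores covariates and unlabeled data entirely, and the function names, the Gaussian kernel, the fixed bandwidth, and the half/half split are all assumptions made for illustration. It only shows how a witness function estimated on one half of the data, evaluated on the other half and divided by its standard error, can be compared with a Normal quantile instead of a permutation threshold.

```python
import numpy as np
from statistics import NormalDist


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def split_studentized_mmd(x, y, bandwidth=1.0):
    """Illustrative sample-split, studentized kernel two-sample statistic."""
    n1, m1 = len(x) // 2, len(y) // 2
    x1, x2 = x[:n1], x[n1:]
    y1, y2 = y[:m1], y[m1:]

    # Witness function estimated on the second halves only:
    # g(z) = mean_k k(z, x2_k) - mean_l k(z, y2_l).
    def witness(z):
        kx = gaussian_kernel(z, x2, bandwidth).mean(axis=1)
        ky = gaussian_kernel(z, y2, bandwidth).mean(axis=1)
        return kx - ky

    # Evaluate the witness on the first halves; given the second halves,
    # these values are independent within each sample.
    gx, gy = witness(x1), witness(y1)
    estimate = gx.mean() - gy.mean()
    std_err = np.sqrt(gx.var(ddof=1) / n1 + gy.var(ddof=1) / m1)
    return estimate / std_err


def reject_null(x, y, alpha=0.05, bandwidth=1.0):
    """One-sided test against a standard Normal quantile (no permutations)."""
    threshold = NormalDist().inv_cdf(1.0 - alpha)
    return bool(split_studentized_mmd(x, y, bandwidth) > threshold)
```

Calling `reject_null(x, y)` on two arrays of shape `(n, d)` and `(m, d)` returns a Boolean decision at level `alpha`. The bandwidth is fixed here for simplicity; in practice it would typically be chosen from the data.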

This paper is still in preparation. It will be available through the OpenReview link on my “bio” page (the link with the book icon).