Hard contamination scenarios
Question
Where do robust covariance estimators work, and where should users be cautious?
Design
This benchmark creates several contamination mechanisms:
mean-shift outliers;
clustered contamination;
variance contamination;
leverage contamination;
heavy-tailed inliers.
These scenarios are intentionally not all favorable to MCD. The goal is to teach users when robust covariance assumptions match the data and when the geometry is ambiguous.
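The five mechanisms can be sketched with NumPy as follows. This is only an illustration, not the benchmark's actual generator: the function name `make_scenario`, the shift and scale constants, the sample size, and the choice to model heavy-tailed inliers as a multivariate t with 3 degrees of freedom and no planted outliers are all assumptions.

```python
import numpy as np


def make_scenario(name, n=1000, p=5, contamination=0.2, seed=0):
    """Return (X, is_outlier) for one illustrative contamination mechanism.

    Hypothetical helper: constants and shapes are assumptions, not the
    settings used by benchmarks/hard_contamination_scenarios.py.
    """
    rng = np.random.default_rng(seed)
    n_out = int(round(contamination * n))
    is_outlier = np.zeros(n, dtype=bool)
    is_outlier[:n_out] = True

    # Clean reference model: standard normal rows (identity covariance).
    X = rng.standard_normal((n, p))

    if name == "mean_shift":
        # Every outlier is displaced by the same vector.
        X[is_outlier] += 6.0
    elif name == "clustered":
        # Outliers form a tight cluster away from the bulk.
        center = np.zeros(p)
        center[0] = 6.0
        X[is_outlier] = center + 0.2 * rng.standard_normal((n_out, p))
    elif name == "variance":
        # Same center as the inliers, inflated scale.
        X[is_outlier] *= 5.0
    elif name == "leverage":
        # Outliers are extreme along a single coordinate only.
        X[is_outlier, 0] += 10.0 * rng.choice([-1.0, 1.0], size=n_out)
    elif name == "heavy_tail_inliers":
        # Assumption: no planted outliers; every row is multivariate t (3 dof),
        # built as a normal vector divided by a shared chi-square scale.
        is_outlier[:] = False
        scale = rng.chisquare(3, size=(n, 1))
        X = X / np.sqrt(scale / 3.0)
    else:
        raise ValueError(f"unknown scenario: {name}")
    return X, is_outlier


for name in ["mean_shift", "clustered", "variance", "leverage", "heavy_tail_inliers"]:
    X, is_outlier = make_scenario(name)
    print(name, X.shape, int(is_outlier.sum()))
```

Returning the ground-truth mask alongside the data is what makes support-based diagnostics possible later: a robust fit can be scored against labels that the estimator itself never sees.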
Results table
| scenario | contamination | method | rel_fro_error | seconds | support_purity | outlier_leakage | support_size | radial_kurtosis |
|---|---|---|---|---|---|---|---|---|
| mean_shift | 0.2 | robustcov FastMCD | 0.1352 | 0.046352 | 1.0000 | 0.0000 | 786 | 142.4016 |
| mean_shift | 0.2 | sklearn MinCovDet | 0.1107 | 0.123946 | 1.0000 | 0.0000 | 786 | |
| variance | 0.2 | robustcov RegTyler | 0.1087 | 0.000875 | | | | 225.0001 |
| variance | 0.2 | robustcov FastMCD | 0.1341 | 0.044778 | 1.0000 | 0.0000 | 790 | 341.7775 |
| heavy_tail_inliers | 0.2 | robustcov FastMCD | 0.0886 | 0.046087 | | | | 7.7035 |
| heavy_tail_inliers | 0.2 | sklearn MinCovDet | 0.1205 | 0.114427 | | | | |
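The column names `rel_fro_error`, `support_purity`, `outlier_leakage`, and `support_size` suggest metrics along the following lines. The definitions below are assumptions inferred from the names, not taken from the benchmark source: relative Frobenius error of the estimated covariance, and the shares of the fitted support made up of true inliers and true outliers. `radial_kurtosis` is not attempted here because its definition is not stated on this page.

```python
import numpy as np


def rel_fro_error(cov_hat, cov_true):
    """Frobenius norm of the error, relative to the norm of the true covariance."""
    return np.linalg.norm(cov_hat - cov_true, "fro") / np.linalg.norm(cov_true, "fro")


def support_metrics(support, is_outlier):
    """Score a boolean support mask against ground-truth outlier labels.

    Assumed definitions: purity is the share of support points that are
    inliers, leakage is the share of support points that are outliers.
    """
    support = np.asarray(support, dtype=bool)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    size = int(support.sum())
    purity = float((support & ~is_outlier).sum() / size)
    leakage = float((support & is_outlier).sum() / size)
    return {"support_size": size, "support_purity": purity, "outlier_leakage": leakage}


# Toy check: a slightly perturbed identity covariance and a 3-point support
# that contains one labeled outlier.
print(rel_fro_error(np.diag([1.1, 0.9, 1.0]), np.eye(3)))
print(support_metrics([1, 1, 1, 0], [0, 0, 1, 1]))
```

Under these definitions the support metrics only exist for estimators that select a subset of observations, which would be consistent with the empty support cells in the RegTyler row.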
Interpretation
Mean-shift and variance contamination are favorable settings for robust covariance: every MCD row reported for these two scenarios has a support purity of 1.0000 and an outlier leakage of 0.0000. Clustered and leverage scenarios can be genuinely ambiguous, because without domain knowledge the algorithm may not be able to distinguish a bad cluster from a legitimate subpopulation. This is why robust-distance plots and diagnostic reports are part of the package rather than optional decoration.
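A robust-distance check of the kind mentioned here can be sketched with scikit-learn's `MinCovDet`, one of the two MCD implementations in the table. The snippet does not use the robustcov API, which is not shown on this page. The data generator, the shift of 6.0, and the chi-square 0.975 cutoff are assumptions chosen for illustration; the cutoff is a common convention rather than a setting confirmed by the benchmark.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
n, p, n_out = 500, 4, 100

# Favorable case: 20% mean-shift outliers on top of standard normal inliers.
X = rng.standard_normal((n, p))
is_outlier = np.zeros(n, dtype=bool)
is_outlier[:n_out] = True
X[is_outlier] += 6.0

mcd = MinCovDet(random_state=0).fit(X)

# MinCovDet.mahalanobis returns squared distances under the robust fit.
d2 = mcd.mahalanobis(X)

# Assumed convention: flag points beyond the 0.975 chi-square quantile with p dof.
cutoff = chi2.ppf(0.975, df=p)
flagged = d2 > cutoff

print("flagged points:", int(flagged.sum()))
print("planted outliers flagged:", int((flagged & is_outlier).sum()), "of", n_out)
print("planted outliers inside the support:", int((mcd.support_ & is_outlier).sum()))
```

In a clustered or leverage scenario the same printout is where ambiguity would show up: a large group of flagged points may be contamination or a real subpopulation, and the distances alone cannot say which.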
Run it yourself
python benchmarks/hard_contamination_scenarios.py --csv results/hard_scenarios.csv
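Once the CSV exists, a short standard-library script can summarize it. The sketch below assumes, without confirmation, that the file uses the same column names as the results table; to stay runnable on its own it reads a few rows copied from that table instead of the real file. The helper name `lowest_error_per_scenario` is hypothetical.

```python
import csv
import io

# Rows copied from the results table. The real CSV is assumed (not confirmed)
# to carry the same column names, possibly with more columns.
sample = """scenario,contamination,method,rel_fro_error,seconds
mean_shift,0.2,robustcov FastMCD,0.1352,0.046352
mean_shift,0.2,sklearn MinCovDet,0.1107,0.123946
variance,0.2,robustcov RegTyler,0.1087,0.000875
variance,0.2,robustcov FastMCD,0.1341,0.044778
"""


def lowest_error_per_scenario(handle):
    """Map each scenario to the (method, rel_fro_error) pair with the lowest error."""
    best = {}
    for row in csv.DictReader(handle):
        err = float(row["rel_fro_error"])
        scenario = row["scenario"]
        if scenario not in best or err < best[scenario][1]:
            best[scenario] = (row["method"], err)
    return best


for scenario, (method, err) in lowest_error_per_scenario(io.StringIO(sample)).items():
    print(f"{scenario}: {method} ({err:.4f})")
```

To read the generated file instead, pass `open("results/hard_scenarios.csv", newline="")` in place of the `io.StringIO` object. A lowest-error summary is only a starting point, since the support and distance diagnostics carry the information about whether a fit can be trusted.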