Hard contamination scenarios

Question

Where do robust covariance estimators work, and where should users be cautious?

Design

This benchmark simulates five contamination mechanisms:

  • mean-shift outliers;

  • clustered contamination;

  • variance contamination;

  • leverage contamination;

  • heavy-tailed inliers.

These scenarios are intentionally not all favorable to the minimum covariance determinant (MCD) estimator. The goal is to show users when the assumptions behind robust covariance estimation match the data and when the geometry is ambiguous.
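To make the five mechanisms concrete, here is a minimal NumPy sketch of how such data could be generated. It is not the benchmark's own generator: the function name `make_scenario`, the shift of 6, the scale factor of 5, the cluster spread of 0.2, and the 3 degrees of freedom are all illustrative assumptions, and the benchmark script may define each mechanism differently.

```python
import numpy as np


def make_scenario(kind, n=1000, p=5, contamination=0.2, seed=0):
    """Return (X, is_outlier) for one hypothetical contamination mechanism.

    All constants below are assumptions chosen for illustration only.
    """
    rng = np.random.default_rng(seed)
    n_out = int(round(contamination * n))
    X = rng.standard_normal((n, p))  # start from clean N(0, I) rows
    is_outlier = np.zeros(n, dtype=bool)
    is_outlier[:n_out] = True

    if kind == "mean_shift":
        # Outliers keep the inlier shape but move to a distant center.
        X[:n_out] += 6.0
    elif kind == "clustered":
        # Outliers form one tight cluster away from the bulk.
        center = np.zeros(p)
        center[0] = 6.0
        X[:n_out] = center + 0.2 * rng.standard_normal((n_out, p))
    elif kind == "variance":
        # Outliers share the inlier center but have inflated spread.
        X[:n_out] *= 5.0
    elif kind == "leverage":
        # Outliers are extreme along a single coordinate only.
        X[:n_out, 0] += 10.0 * rng.choice([-1.0, 1.0], size=n_out)
    elif kind == "heavy_tail_inliers":
        # No planted outliers: every row is multivariate t with 3 d.o.f.
        mixing = rng.chisquare(3, size=(n, 1)) / 3.0
        X = X / np.sqrt(mixing)
        is_outlier[:] = False
    else:
        raise ValueError(f"unknown scenario: {kind}")
    return X, is_outlier


X, is_outlier = make_scenario("mean_shift")
print(X.shape, int(is_outlier.sum()))
```

Returning the ground-truth mask alongside the data is what allows support-based metrics to be scored afterwards. In the heavy-tailed case the mask is all False, because no point is an outlier by construction.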

Results table

Hard contamination scenarios

| scenario | contamination | method | rel_fro_error | seconds | support_purity | outlier_leakage | support_size | radial_kurtosis |
|---|---|---|---|---|---|---|---|---|
| mean_shift | 0.2 | robustcov FastMCD | 0.1352 | 0.046352 | 1.0000 | 0.0000 | 786 | 142.4016 |
| mean_shift | 0.2 | sklearn MinCovDet | 0.1107 | 0.123946 | 1.0000 | 0.0000 | 786 | |
| variance | 0.2 | robustcov RegTyler | 0.1087 | 0.000875 | | | | 225.0001 |
| variance | 0.2 | robustcov FastMCD | 0.1341 | 0.044778 | 1.0000 | 0.0000 | 790 | 341.7775 |
| heavy_tail_inliers | 0.2 | robustcov FastMCD | 0.0886 | 0.046087 | | | | 7.7035 |
| heavy_tail_inliers | 0.2 | sklearn MinCovDet | 0.1205 | 0.114427 | | | | |
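The column names rel_fro_error, support_purity, and outlier_leakage are not defined on this page. The sketch below shows one plausible reading of each, written with NumPy only. These definitions are assumptions inferred from the names, not the benchmark's confirmed formulas, so the script itself remains the authority.

```python
import numpy as np


def rel_fro_error(cov_hat, cov_true):
    """Frobenius norm of the error, relative to the norm of the true covariance."""
    return float(np.linalg.norm(cov_hat - cov_true, "fro")
                 / np.linalg.norm(cov_true, "fro"))


def support_purity(support, is_outlier):
    """Assumed meaning: fraction of the support points that are true inliers."""
    support = np.asarray(support, dtype=bool)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    return float(np.mean(~is_outlier[support]))


def outlier_leakage(support, is_outlier):
    """Assumed meaning: fraction of the true outliers that fall inside the support."""
    support = np.asarray(support, dtype=bool)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    if not is_outlier.any():
        return float("nan")  # undefined when no outliers were planted
    return float(np.mean(support[is_outlier]))


# Toy example: five points, the estimator kept the first three in its support.
support = np.array([True, True, True, False, False])
is_outlier = np.array([False, False, True, True, False])

print(rel_fro_error(1.1 * np.eye(2), np.eye(2)))
print(support_purity(support, is_outlier))
print(outlier_leakage(support, is_outlier))
```

Under these readings, a purity of 1.0000 together with a leakage of 0.0000 means that the estimator's support contains no planted outliers at all.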

Interpretation

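Mean-shift and variance contamination are favorable settings for robust covariance: in the results above, both MCD implementations reach a support purity of 1.0000 and an outlier leakage of 0.0000 in the mean_shift scenario, as does robustcov FastMCD in the variance scenario. Clustered and leverage scenarios can be genuinely ambiguous: the algorithm may not be able to distinguish a bad cluster from a legitimate subpopulation without domain knowledge. This is why robust-distance plots and diagnostic reports are part of the package rather than optional decoration.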
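As a rough illustration of what a robust-distance diagnostic computes, the sketch below evaluates squared Mahalanobis distances against a given location and covariance and flags points beyond a chi-square cutoff. It does not use the package's own plotting or reporting API. The helper names are hypothetical, the cutoff uses the Wilson-Hilferty approximation to avoid a SciPy dependency, and the known inlier parameters stand in for a robust fit (in practice these would come from an estimator, for example the `location_` and `covariance_` attributes of sklearn's MinCovDet).

```python
from statistics import NormalDist

import numpy as np


def squared_robust_distances(X, location, covariance):
    """Squared Mahalanobis distance of each row of X to (location, covariance)."""
    centered = np.asarray(X, dtype=float) - location
    solved = np.linalg.solve(covariance, centered.T).T
    return np.einsum("ij,ij->i", centered, solved)


def chi2_cutoff(p, quantile=0.975):
    """Wilson-Hilferty approximation to the chi-square quantile with p d.o.f."""
    z = NormalDist().inv_cdf(quantile)
    a = 2.0 / (9.0 * p)
    return p * (1.0 - a + z * np.sqrt(a)) ** 3


rng = np.random.default_rng(0)
inliers = rng.standard_normal((800, 3))
shifted = rng.standard_normal((200, 3)) + 6.0  # mean-shift outliers (assumed shift)
X = np.vstack([inliers, shifted])

# Stand-in for a robust fit: the known inlier parameters.
d2 = squared_robust_distances(X, np.zeros(3), np.eye(3))
flagged = d2 > chi2_cutoff(p=3)

print("flagged inliers: ", int(flagged[:800].sum()))
print("flagged outliers:", int(flagged[800:].sum()))
```

With a well-separated mean shift the flagged set lines up with the planted outliers. In a clustered or leverage scenario the same plot can show a second group whose status only domain knowledge can settle.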

Run it yourself

python benchmarks/hard_contamination_scenarios.py --csv results/hard_scenarios.csv
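
Once the CSV exists, a short script can summarize it. The sketch below assumes that the CSV columns carry the same names as the results table (scenario, method, rel_fro_error). That is an assumption about the script's output and has not been checked against it. The helper name `lowest_error_per_scenario` is hypothetical.

```python
import csv
from pathlib import Path


def lowest_error_per_scenario(rows):
    """Map each scenario to the (method, rel_fro_error) pair with the smallest error."""
    best = {}
    for row in rows:
        error = float(row["rel_fro_error"])
        scenario = row["scenario"]
        if scenario not in best or error < best[scenario][1]:
            best[scenario] = (row["method"], error)
    return best


# Assumed output location, taken from the --csv argument above.
csv_path = Path("results/hard_scenarios.csv")
if csv_path.exists():
    with csv_path.open(newline="") as handle:
        summary = lowest_error_per_scenario(csv.DictReader(handle))
    for scenario, (method, error) in summary.items():
        print(f"{scenario}: {method} ({error:.4f})")
```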