Robust preprocessing before classification
==========================================

A robust covariance estimate is not always the final model.  It can also serve as a preprocessing step that flags suspicious training rows before a standard classifier is fitted.

Result at a glance
------------------

In this run, filtering removes 39 training rows.  The classifier trained on the filtered data scores slightly lower than the one trained on the raw data.  That outcome is reported as is, because robust filtering is not automatically beneficial.

What the data represent
-----------------------

The example uses a noisy supervised classification problem.  ``robustcov`` distance scores are computed on the training features, and rows with high distances are removed before the classifier is refitted.
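
The filter-then-refit step can be sketched as follows.  This is a minimal illustration under stated assumptions, not the code in the example script: scikit-learn's ``MinCovDet`` stands in for a ``robustcov`` estimator, the chi-squared 97.5% cutoff is an illustrative choice, and the data are synthetic.

.. code-block:: python

   import numpy as np
   from scipy.stats import chi2
   from sklearn.covariance import MinCovDet
   from sklearn.linear_model import LogisticRegression

   rng = np.random.default_rng(0)

   # Synthetic two-class training set (an assumption, not the gallery data).
   n, p = 300, 4
   X_train = rng.normal(size=(n, p))
   y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)
   # Contaminate the first 15 rows with heavy-tailed noise; labels stay as they were.
   X_train[:15] += 5.0 * rng.standard_cauchy(size=(15, p))

   # Stand-in robust estimator; a robustcov estimator would take its place.
   mcd = MinCovDet(random_state=0).fit(X_train)
   d2 = mcd.mahalanobis(X_train)  # squared robust distances, one per row

   # Illustrative cutoff: 97.5% quantile of chi-squared with p degrees of freedom.
   cutoff = chi2.ppf(0.975, df=p)
   keep = d2 <= cutoff

   # Fit one classifier on all rows and one on the retained rows.
   clf_raw = LogisticRegression(max_iter=1000).fit(X_train, y_train)
   clf_filtered = LogisticRegression(max_iter=1000).fit(X_train[keep], y_train[keep])

   n_removed = int((~keep).sum())
   print(f"removed {n_removed} of {n} training rows")

Both classifiers should then be scored on the same held-out data, which the filter never sees.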

Why this estimator
------------------

``RegularizedCauchy`` or ``AutoRobustScatter`` is useful when the training set may contain heavy-tailed contamination.  The goal is not to win every classifier benchmark, but to diagnose influential or suspicious rows.
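
To see why a Cauchy-type estimator resists heavy-tailed contamination, the sketch below runs the textbook fixed-point iteration for the multivariate Cauchy (Student-t with one degree of freedom) location and scatter estimate in plain NumPy.  It illustrates the general technique and is not the implementation behind ``RegularizedCauchy``; the function name ``cauchy_scatter`` and the small ridge term are assumptions made for this example.

.. code-block:: python

   import numpy as np

   def cauchy_scatter(X, n_iter=100, ridge=1e-8, tol=1e-8):
       """Fixed-point iteration for a multivariate Cauchy location/scatter fit."""
       n, p = X.shape
       mu = np.median(X, axis=0)
       sigma = np.cov(X, rowvar=False) + ridge * np.eye(p)
       for _ in range(n_iter):
           diff = X - mu
           # Squared Mahalanobis distances under the current estimate.
           d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma), diff)
           # Cauchy weights: rows far from the bulk get weights near zero.
           w = (p + 1.0) / (1.0 + d2)
           mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
           diff = X - mu_new
           sigma_new = (w[:, None] * diff).T @ diff / n + ridge * np.eye(p)
           converged = np.abs(sigma_new - sigma).max() < tol
           mu, sigma = mu_new, sigma_new
           if converged:
               break
       return mu, sigma, w

   rng = np.random.default_rng(0)
   X = rng.normal(size=(200, 3))
   X[:10] += 50.0  # gross contamination in the first ten rows
   mu, sigma, w = cauchy_scatter(X)
   print("largest weight among contaminated rows:", w[:10].max())
   print("smallest weight among clean rows:", w[10:].min())

Because each row's weight shrinks as its distance grows, the contaminated rows contribute very little to the final location and scatter, and the same distances can be used to rank rows for inspection.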

Reproduce the result
--------------------

.. code-block:: bash

   python examples/use_case_ml_preprocessing.py

Output from the run
-------------------

.. literalinclude:: ../_static/gallery/ml_preprocessing/output.txt
   :language: text

Figures and diagnostics
-----------------------

.. image:: ../_static/gallery/ml_preprocessing/accuracy_comparison.png
   :alt: Robust preprocessing before classification — accuracy comparison
   :width: 760px


.. image:: ../_static/gallery/ml_preprocessing/distance_profile.png
   :alt: Robust preprocessing before classification — distance profile
   :width: 760px


How to read the result
----------------------

In the accuracy plot, compare the classifier's accuracy before and after filtering.  If performance drops, the removed rows may be hard-but-valid training examples rather than harmful contamination.
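
One possible check on what the filter discards is to tabulate the removed rows by class.  The sketch below assumes a boolean mask ``keep`` and a label array ``y_train``; both names and the toy values are hypothetical, as is the helper ``removal_rate_by_class``.

.. code-block:: python

   import numpy as np

   def removal_rate_by_class(y, keep):
       """Return {class label: fraction of that class removed by the filter}."""
       y = np.asarray(y)
       keep = np.asarray(keep, dtype=bool)
       return {
           label.item(): float((~keep[y == label]).mean())
           for label in np.unique(y)
       }

   # Toy labels and mask for illustration only.
   y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
   keep = np.array([True, True, True, False, True, False, False, False])
   rates = removal_rate_by_class(y_train, keep)
   print(rates)

A removal rate that is much higher for one class than for the others suggests the filter may be cutting into a legitimate part of that class, which is worth inspecting before the rows are dropped.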

What this does not prove
------------------------

A single run does not show that filtering helps or hurts in general, so use this workflow with cross-validation.  Never filter using test labels, and do not assume that every outlier is an error.
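
A cross-validated comparison might look like the sketch below, in which the filter is fitted on each training fold only and the held-out fold is scored untouched.  As assumptions for illustration, scikit-learn's ``MinCovDet`` again stands in for a ``robustcov`` estimator, the cutoff is a chi-squared 97.5% quantile, the data are synthetic, and ``cv_accuracy`` is a hypothetical helper.

.. code-block:: python

   import numpy as np
   from scipy.stats import chi2
   from sklearn.covariance import MinCovDet
   from sklearn.linear_model import LogisticRegression
   from sklearn.model_selection import StratifiedKFold

   def cv_accuracy(X, y, filter_rows, n_splits=5, seed=0):
       """Mean held-out accuracy; the optional filter sees training folds only."""
       cutoff = chi2.ppf(0.975, df=X.shape[1])  # illustrative threshold
       folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
       scores = []
       for train_idx, test_idx in folds.split(X, y):
           X_tr, y_tr = X[train_idx], y[train_idx]
           if filter_rows:
               # Fit the stand-in robust estimator on training features only.
               mcd = MinCovDet(random_state=seed).fit(X_tr)
               keep = mcd.mahalanobis(X_tr) <= cutoff
               X_tr, y_tr = X_tr[keep], y_tr[keep]
           clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
           # The held-out fold is scored as-is, with no rows removed.
           scores.append(clf.score(X[test_idx], y[test_idx]))
       return float(np.mean(scores))

   rng = np.random.default_rng(0)
   X = rng.normal(size=(400, 4))
   y = (X[:, 0] - X[:, 1] > 0).astype(int)
   acc_raw = cv_accuracy(X, y, filter_rows=False)
   acc_filtered = cv_accuracy(X, y, filter_rows=True)
   print(f"raw: {acc_raw:.3f}  filtered: {acc_filtered:.3f}")

Keeping the filter inside the fold loop means neither the distance threshold nor the removed rows can depend on the data used for scoring.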