Optional OpenMP acceleration
============================

``robustcov`` can use OpenMP in its C++ extension when the compiler and build
environment provide it. OpenMP is optional: builds without OpenMP still work and
fall back to the serial kernels.

What is parallelized?
---------------------

The current OpenMP path targets the loops that are shared by several estimators:

* robust / Mahalanobis distance evaluation over observations;
* column means and covariance accumulation over selected subsets;
* Tyler and regularized Tyler scatter accumulation;
* FastMCD random-start evaluation and final candidate polishing.

These loops are useful first targets because they run many times inside
FastMCD C-steps, reweighting, Tyler iterations, diagnostics, and benchmark runs.
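
As a concrete instance of the first bullet, in standard notation (the symbols here are
illustrative and not taken from the ``robustcov`` source), the squared Mahalanobis distance of
observation :math:`x_i` from a location estimate :math:`\mu` under a scatter estimate
:math:`\Sigma` is

.. math::

   d_i^2 = (x_i - \mu)^{\top} \Sigma^{-1} (x_i - \mu), \qquad i = 1, \dots, n.

Each :math:`d_i^2` depends only on its own observation, so the loop over :math:`i` can be
split across threads without coordination. Accumulating means and covariances, by contrast,
combines contributions from all rows and therefore needs a reduction across threads.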

Thread-control API
------------------

Use the process-wide helpers when you want global control:

.. code-block:: python

   import robustcov as rc

   print(rc.has_openmp())
   print(rc.get_num_threads())
   rc.set_num_threads(4)

Estimators that call the C++ backend also accept ``n_jobs``:

.. code-block:: python

   # Assumes rc is imported as above and X holds the data matrix.
   est = rc.FastMCD(n_init=500, n_jobs=4, random_state=0).fit(X)
   ty = rc.RegularizedTyler(alpha=0.1, n_jobs=4).fit(X)

For temporary changes, use ``thread_limit``:

.. code-block:: python

   with rc.thread_limit(2):
       est = rc.FastMCD(n_init=1000, random_state=0).fit(X)
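
A scoped limit of this kind is commonly written as a context manager that saves the current
thread count, applies the new one, and restores the old value on exit. The sketch below
illustrates that pattern with a stand-in counter. ``_state``, ``get_threads``,
``set_threads``, and ``scoped_thread_limit`` are hypothetical names used only for
illustration; they do not describe how ``robustcov`` implements ``thread_limit``.

.. code-block:: python

   from contextlib import contextmanager

   # Stand-in for a process-wide thread setting (hypothetical, not robustcov state).
   _state = {"threads": 8}

   def get_threads():
       return _state["threads"]

   def set_threads(n):
       if n < 1:
           raise ValueError("thread count must be >= 1")
       _state["threads"] = n

   @contextmanager
   def scoped_thread_limit(n):
       """Temporarily set the thread count, restoring the old value on exit."""
       previous = get_threads()
       set_threads(n)
       try:
           yield
       finally:
           # Runs even if the body raises, so the limit never leaks.
           set_threads(previous)

   with scoped_thread_limit(2):
       inside = get_threads()   # limit is active here
   after = get_threads()        # previous value is back

The ``try``/``finally`` is the important design choice: without it, an exception inside the
block would leave the lower limit in place for the rest of the process.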

Benchmarking scaling
--------------------

Run:

.. code-block:: bash

   python benchmarks/openmp_scaling.py --n 8000 --p 20 --threads 1 2 4 --csv results/openmp_scaling.csv

Interpret scaling carefully. On small inputs, multiple threads can be slower than
one because thread launch and reduction overhead dominate the run time. Gains
should be most visible for larger ``n``, larger ``p``, and many FastMCD starts.
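
One way to read the timings is to convert them to speedup and parallel efficiency relative
to the single-thread run. The sketch below assumes wall-clock times per thread count have
already been collected. The numbers are made up for illustration, and ``scaling_table`` is a
hypothetical helper, not part of ``robustcov`` or of ``openmp_scaling.py``.

.. code-block:: python

   def scaling_table(timings):
       """Return {threads: (speedup, efficiency)} relative to the 1-thread time."""
       baseline = timings[1]
       table = {}
       for threads, seconds in sorted(timings.items()):
           speedup = baseline / seconds
           efficiency = speedup / threads
           table[threads] = (speedup, efficiency)
       return table

   # Illustrative wall-clock times in seconds (not measured results).
   timings = {1: 8.0, 2: 5.0, 4: 4.0}

   table = scaling_table(timings)
   for threads, (speedup, efficiency) in table.items():
       print(f"{threads} threads: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")

Efficiency that falls well below 100% as threads are added is the usual sign that overhead,
rather than useful work, is taking a growing share of the run time at that problem size.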

BLAS and OpenMP interaction
---------------------------

Some NumPy/SciPy builds also use threaded BLAS. If BLAS and the OpenMP kernels each
start many threads, the machine is oversubscribed (more runnable threads than cores)
and performance can drop. For clean OpenMP benchmarks, set BLAS thread counts
explicitly, for example:

.. code-block:: bash

   OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 python benchmarks/openmp_scaling.py
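
The same limits can be applied from inside a Python script, provided the variables are set
before NumPy, SciPy, or any other BLAS-backed library is first imported, since threaded
libraries typically read them when they initialize. A minimal sketch using only the standard
library (the ``blas_limits`` name is illustrative):

.. code-block:: python

   import os

   # Pin the BLAS thread pools to one thread so only the OpenMP kernels scale.
   # This must run before the first import of NumPy/SciPy in this process.
   blas_limits = {
       "OPENBLAS_NUM_THREADS": "1",
       "MKL_NUM_THREADS": "1",
   }
   os.environ.update(blas_limits)

   # Thread count for OpenMP regions; 4 mirrors the shell example above.
   os.environ["OMP_NUM_THREADS"] = "4"

   # Import numerical libraries only after the environment is configured.

Setting the variables in the shell, as shown earlier, is the more robust option because it
cannot be defeated by an import that happens earlier than expected.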

Determinism
-----------

Random starts are generated serially, so the starts drawn for a given
``random_state`` are the same regardless of thread count. Parallel reductions can
still introduce tiny floating-point differences because the summation order
changes; these should be numerically negligible.
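
The effect is easy to reproduce in plain Python: floating-point addition is not associative,
so summing the same values in per-thread chunks and then combining the partial sums can
differ from a single left-to-right pass in the last few bits. The chunked reduction below
only mimics what a parallel reduction does; it is an illustration, not the library's kernel,
and ``naive_sum`` is a hypothetical helper.

.. code-block:: python

   import math
   import random

   def naive_sum(xs):
       """Plain left-to-right accumulation, with no compensation."""
       total = 0.0
       for x in xs:
           total += x
       return total

   rng = random.Random(0)
   values = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]

   # Serial reduction: one pass over all values.
   serial = naive_sum(values)

   # Mimic a 4-thread reduction: each "thread" sums one contiguous chunk,
   # then the partial sums are combined.
   n_chunks = 4
   chunk = len(values) // n_chunks
   partials = [naive_sum(values[i * chunk:(i + 1) * chunk]) for i in range(n_chunks)]
   chunked = naive_sum(partials)

   # The two results need not be bit-identical, but they agree to round-off.
   print(serial == chunked)
   print(math.isclose(serial, chunked, abs_tol=1e-9))

When comparing results across thread counts, prefer a tolerance-based check such as
``math.isclose`` or ``numpy.allclose`` over exact equality.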


Benchmark integration
---------------------

The benchmark report includes OpenMP scaling by default:

.. code-block:: bash

   OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 \
   python benchmarks/make_report.py --outdir results/report

The generated report includes ``openmp_scaling.csv`` and ``openmp_scaling.png``. As with the
standalone scaling benchmark, the benefit of extra threads is most visible for larger ``n``,
larger ``p``, and more FastMCD starts.

Pasting Python snippets directly into bash is a common mistake. To check the OpenMP helpers,
use a heredoc (``python - <<'PY'``) or an interactive Python session:

.. code-block:: bash

   python - <<'PY'
   import robustcov as rc
   print(rc.has_openmp())
   print(rc.get_num_threads())
   PY