OpenMP scaling benchmark
========================

Question
--------

Does optional OpenMP parallelism improve speed on larger workloads?

Design
------

The benchmark runs the same estimator at several thread counts.  Pin the BLAS thread count to one
so that the OpenMP and BLAS thread pools do not oversubscribe the CPU.

.. code-block:: bash

   OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 \
   python benchmarks/openmp_scaling.py --n 8000 --p 20 --threads 1 2 4
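
The thread sweep described above can be sketched as a small driver that launches one subprocess
per thread count, so that each run starts with its own ``OMP_NUM_THREADS``.  This is an
illustrative sketch only, not the contents of ``benchmarks/openmp_scaling.py``; the helper names
``thread_env`` and ``time_command`` are hypothetical, and the timed command is a placeholder.

.. code-block:: python

   import os
   import subprocess
   import sys
   import time


   def thread_env(omp_threads):
       """Copy the environment, set the OpenMP thread count, pin BLAS to 1."""
       env = dict(os.environ)
       env["OMP_NUM_THREADS"] = str(omp_threads)
       env["OPENBLAS_NUM_THREADS"] = "1"
       env["MKL_NUM_THREADS"] = "1"
       return env


   def time_command(cmd, omp_threads, repeats=3):
       """Run cmd `repeats` times; return the best wall-clock time in seconds."""
       env = thread_env(omp_threads)
       best = float("inf")
       for _ in range(repeats):
           start = time.perf_counter()
           subprocess.run(cmd, env=env, check=True)
           best = min(best, time.perf_counter() - start)
       return best


   if __name__ == "__main__":
       # Placeholder workload (hypothetical); substitute the real benchmark command.
       cmd = [sys.executable, "-c", "pass"]
       for n_threads in (1, 2, 4):
           print(n_threads, time_command(cmd, n_threads))

Timing a whole subprocess includes interpreter startup and import time, so this approach suits
workloads that run long enough for that fixed cost to be negligible.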

Scaling table
-------------

.. csv-table:: OpenMP scaling
   :file: ../_static/benchmarks/openmp_scaling.csv
   :header-rows: 1

Plot
----

.. image:: ../_static/benchmarks/openmp_scaling.png
   :alt: OpenMP scaling plot
   :width: 760px

Interpretation
--------------

OpenMP helps most when the workload has enough rows, enough features, or enough random starts to
pay for threading overhead.  Small examples may not speed up because thread startup and scheduling
costs dominate.  In larger benchmark settings, robust distance evaluation, covariance accumulation,
Tyler updates, and FastMCD candidate evaluation can all benefit.
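
One way to read scaling results is through speedup (single-thread time divided by multi-thread
time) and parallel efficiency (speedup divided by thread count).  The sketch below assumes the
timings are available as a mapping from thread count to seconds; the function name
``scaling_summary`` is hypothetical, and the numbers are made up for illustration, not measured.

.. code-block:: python

   def scaling_summary(timings):
       """Map thread count -> (speedup, efficiency) relative to the 1-thread time.

       `timings` maps thread count to wall-clock seconds and must contain key 1.
       """
       base = timings[1]
       summary = {}
       for threads, seconds in sorted(timings.items()):
           speedup = base / seconds
           summary[threads] = (speedup, speedup / threads)
       return summary


   # Made-up timings in seconds, for illustration only (not benchmark results).
   example = {1: 8.0, 2: 5.0, 4: 4.0}
   for threads, (speedup, efficiency) in scaling_summary(example).items():
       print(f"{threads} threads: speedup {speedup:.2f}x, efficiency {efficiency:.2f}")

Efficiency that falls well below one as threads are added is the signature of the overhead
described above: the workload is too small to keep the extra threads busy.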

Practical advice
----------------

Use explicit environment variables for reproducible timing:

.. code-block:: bash

   export OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1

Inside Python, users can also control the package thread count:

.. code-block:: python

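   import numpy as np
   import robustcov as rc

   # Placeholder data; substitute your own (n_samples, n_features) array.
   X = np.random.default_rng(0).standard_normal((8000, 20))

   print(rc.has_openmp())  # check for OpenMP support before timing
   rc.set_num_threads(4)
   est = rc.FastMCD(n_jobs=4, random_state=0).fit(X)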
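
When the shell environment is not under your control (for example in a notebook), the same
variables can be set from Python.  The sketch below assumes this happens at the very top of the
script, because OpenMP and BLAS runtimes generally read these variables once, when they
initialize; the name ``THREAD_ENV`` is illustrative.

.. code-block:: python

   import os

   # Set thread limits before importing any numerical library: changing these
   # variables after the runtimes have initialized usually has no effect.
   THREAD_ENV = {
       "OMP_NUM_THREADS": "4",
       "OPENBLAS_NUM_THREADS": "1",
       "MKL_NUM_THREADS": "1",
   }
   os.environ.update(THREAD_ENV)

   # Import NumPy, the package, and other numerical code only after this point.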