Kaggle and external dataset roadmap

The built-in gallery intentionally uses sklearn datasets and synthetic generators so that examples run without network access, API keys, or large downloads. The next step is to add optional Kaggle and other external-data examples that run only when invoked explicitly with a local data path.

Why optional?

External datasets are useful for adoption, but they should not make tests or documentation builds fragile. Large downloads, license terms, credentials, and changing dataset URLs should stay outside core package validation.
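
One way to keep such checks out of the default test run is to skip them unless a local data path is configured. The sketch below uses only the standard-library unittest module; the environment variable name ROBUSTCOV_CREDITCARD_CSV and the test class are hypothetical and not part of robustcov.

```python
import os
import unittest
from pathlib import Path

# Hypothetical variable name; left unset on CI and in documentation builds.
DATA_ENV_VAR = "ROBUSTCOV_CREDITCARD_CSV"


def external_data_path():
    """Return the configured local CSV path, or None when it is unavailable."""
    value = os.environ.get(DATA_ENV_VAR)
    if not value:
        return None
    path = Path(value)
    return path if path.is_file() else None


@unittest.skipUnless(
    external_data_path(),
    f"set {DATA_ENV_VAR} to a local CSV to run external-data tests",
)
class CreditCardExampleTest(unittest.TestCase):
    def test_file_has_label_column(self):
        # Only reads the header line, so the check stays cheap on large files.
        with external_data_path().open() as handle:
            self.assertIn("Class", handle.readline())
```

With the variable unset, the test is reported as skipped instead of failing, so core validation does not depend on the dataset being present.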

Candidate external examples

Kaggle-ready use-case candidates
Application	Dataset family	Why it fits robustcov
Credit-card fraud	Kaggle Credit Card Fraud Detection	Extreme class imbalance and tabular anomaly screening are natural robust-distance use cases.
IEEE-CIS transaction fraud	IEEE-CIS Fraud Detection competition data	High-dimensional transaction features make robust preprocessing and screening useful.
Industrial equipment monitoring	Predictive-maintenance / sensor fault datasets	Robust distance profiles are easy for engineers to inspect; compare honestly against IsolationForest.
Medical tabular screening	UCI/Kaggle diagnostic datasets	Robust distances provide interpretable patient-level anomaly scores.

Suggested implementation pattern

External examples should follow this pattern:

python examples_external/kaggle_credit_card_fraud.py --data /path/to/creditcard.csv

The script should never download data silently. It should print the expected schema, save metrics, create plots, and write a short Markdown summary into results/external/.
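
A minimal skeleton of this pattern might look like the following, assuming the public Kaggle credit-card CSV layout (Time, V1 to V28, Amount, Class). The `--out` flag, the function names, and the output file names are illustrative assumptions. The robust-distance scoring and plotting steps are left as a placeholder because the robustcov API is not described here.

```python
import argparse
import csv
import json
from pathlib import Path

# Assumed layout of the Kaggle credit-card fraud CSV.
EXPECTED_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def missing_columns(header):
    """Return the expected columns that are absent from the CSV header."""
    present = set(header)
    return [name for name in EXPECTED_COLUMNS if name not in present]


def run(data_path, out_dir):
    """Validate a local CSV, then write metrics.json and summary.md to out_dir."""
    data_path = Path(data_path)
    if not data_path.is_file():
        # No download fallback: report what the user needs to supply.
        raise FileNotFoundError(
            f"{data_path} not found. Obtain the dataset manually and pass it "
            f"with --data. Expected columns: {', '.join(EXPECTED_COLUMNS)}"
        )

    with data_path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        missing = missing_columns(reader.fieldnames or [])
        if missing:
            raise ValueError(f"Missing expected columns: {missing}")
        n_rows = 0
        n_positive = 0
        for row in reader:
            n_rows += 1
            n_positive += int(float(row["Class"]) == 1.0)

    metrics = {
        "n_rows": n_rows,
        "n_positive": n_positive,
        "positive_rate": n_positive / n_rows if n_rows else 0.0,
    }
    # Placeholder: robust-distance scoring, baselines, and plots would go here.

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))
    summary = [
        "# Credit-card fraud example",
        "",
        f"- Input file: `{data_path.name}`",
        f"- Rows: {n_rows}",
        f"- Positive (fraud) rows: {n_positive}",
    ]
    (out_dir / "summary.md").write_text("\n".join(summary) + "\n")
    return metrics


def main(argv=None):
    parser = argparse.ArgumentParser(description="Credit-card fraud example on local data")
    parser.add_argument("--data", required=True, help="path to a local creditcard.csv")
    parser.add_argument("--out", default="results/external", help="output directory")
    args = parser.parse_args(argv)
    print("Expected columns:", ", ".join(EXPECTED_COLUMNS))
    print(json.dumps(run(args.data, args.out), indent=2))
    return 0
```

To use this as a command-line script, end the file with a call such as `raise SystemExit(main())` under the usual `if __name__ == "__main__":` guard. Making `--data` a required argument means the script fails fast with a usage message when no local file is given.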

Documentation strategy

Each external dataset should get a page with:

where to obtain the data;
expected file names and columns;
preprocessing notes;
baseline comparisons;
robustcov output and plots;
limitations and licensing caveats.