Text / embedding outlier screening
==================================

Embedding spaces often contain topical clusters and occasional off-topic documents. Robust scatter estimators provide a simple way to rank unusual embeddings without training a supervised classifier.

Result at a glance
------------------

The example selects ``StudentTScatter`` and recovers all injected outlier embeddings in the lightweight simulation. This makes a good entry point for document review, moderation, and search-quality diagnostics.

What the data represent
-----------------------

The data are synthetic embedding-like vectors: a central topic cloud plus a small group of off-topic points. The goal is to mimic the geometry of sentence or document embeddings without requiring external models.

Why this estimator
------------------

``AutoRobustScatter`` chooses among robust scatter candidates. Student-t scatter is often a good compromise for diffuse, heavy-tailed embedding clouds.

Reproduce the result
--------------------

.. code-block:: bash

   python examples/use_case_text_embedding_outliers.py

Output from the run
-------------------

.. literalinclude:: ../_static/gallery/text_embedding_outliers/output.txt
   :language: text

Figures and diagnostics
-----------------------

.. image:: ../_static/gallery/text_embedding_outliers/distance_panel.png
   :alt: Text / embedding outlier screening: distance panel
   :width: 760px

How to read the result
----------------------

Use the distance panel as a ranked review queue. The top-scoring embeddings are candidates for off-topic or low-quality items; the threshold should usually be calibrated by review capacity.

What this does not prove
------------------------

Real embeddings can be strongly multimodal. For multiple legitimate topics, prefer cluster-aware robust distances or segment the corpus before fitting.
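
The "segment the corpus before fitting" advice can be sketched as follows. This is a minimal illustration, not this package's API: it assumes scikit-learn is available and uses ``KMeans`` and ``MinCovDet`` as stand-ins for the segmentation step and the robust scatter estimator. The helper names ``make_two_topic_embeddings`` and ``per_cluster_robust_distances`` are hypothetical, and the data are synthetic, with two legitimate topics plus a small off-topic group.

.. code-block:: python

   import numpy as np
   from sklearn.cluster import KMeans
   from sklearn.covariance import MinCovDet


   def make_two_topic_embeddings(seed=0, n_per_topic=200, n_outliers=10, dim=8):
       """Synthetic stand-in for embeddings: two topic clouds plus off-topic points."""
       rng = np.random.default_rng(seed)
       topic_a = rng.normal(0.0, 1.0, size=(n_per_topic, dim))
       topic_b = rng.normal(6.0, 1.0, size=(n_per_topic, dim))
       # Off-topic group placed far from both topic centers (illustrative choice).
       off_center = np.where(np.arange(dim) < dim // 2, 10.0, -10.0)
       off_topic = off_center + rng.normal(0.0, 0.5, size=(n_outliers, dim))
       X = np.vstack([topic_a, topic_b, off_topic])
       is_outlier = np.zeros(X.shape[0], dtype=bool)
       is_outlier[-n_outliers:] = True
       return X, is_outlier


   def per_cluster_robust_distances(X, n_clusters, random_state=0):
       """Segment with k-means, then score each point against its own segment.

       Returns the segment labels and the squared robust Mahalanobis distance
       of every point to the robust center of the segment it was assigned to.
       """
       labels = KMeans(
           n_clusters=n_clusters, n_init=10, random_state=random_state
       ).fit_predict(X)
       distances = np.empty(X.shape[0])
       for k in range(n_clusters):
           mask = labels == k
           # Robust location/scatter fitted on this segment only.
           mcd = MinCovDet(random_state=random_state).fit(X[mask])
           # MinCovDet.mahalanobis returns squared distances.
           distances[mask] = mcd.mahalanobis(X[mask])
       return labels, distances


   X, is_outlier = make_two_topic_embeddings()
   labels, d2 = per_cluster_robust_distances(X, n_clusters=2)

   # Ranked review queue: largest per-segment distances first.
   queue = np.argsort(d2)[::-1]
   print("top of review queue:", queue[:10])
   print("injected outliers  :", np.flatnonzero(is_outlier))

Scoring each point against its own segment keeps a second legitimate topic from being ranked as one large outlier group. The number of segments is itself an assumption here and would need to be chosen, or validated, on the real corpus.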