Project overview

Rare Transit Pipeline

A machine learning–driven system for discovering rare exoplanet transit events in NASA Kepler, K2, and TESS light curves. We integrate a self‑supervised time‑series encoder with physics‑informed matched filtering and rigorous calibration to produce interpretable, reproducible candidate lists.

Kepler · K2 · TESS | Self‑supervised encoder | GLRT + templates | Isotonic + conformal

153 ranked windows (latest scan)

86 on the shortlist with p < 0.1

Known systems recovered: KIC 12557548 and KIC 3542116

How we approached the problem

Rare transits are needle‑in‑a‑haystack events: asymmetric dips, ring‑like overshoots, or sequences of shallow events buried in stellar variability and instrument systematics. Manual vetting doesn’t scale, so we built an end‑to‑end system that makes these signals pop without sacrificing statistical validity.

1) Data prep that respects the physics

We pulled Kepler/K2/TESS light curves with Lightkurve and Astroquery, removed long‑term trends with Gaussian‑process fits, and cut the residuals into overlapping windows of roughly 36–60 h. We handled gaps and reaction‑wheel artifacts, normalized each window, and split the data by star so that evaluation never leaks across targets.
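As a rough illustration of the windowing and star-level split described above, here is a minimal sketch using only the Python standard library. The function names (`assign_split`, `window_bounds`), the 48 h width, the 12 h step, and the split fractions are assumptions chosen for the example, not the pipeline's actual settings.

```python
import hashlib

def assign_split(star_id, val_frac=0.1, test_frac=0.1):
    """Map a star ID to a split deterministically so all of its windows stay together."""
    digest = hashlib.sha256(star_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"

def window_bounds(times_h, width_h=48.0, step_h=12.0):
    """Return (start, stop) index pairs of overlapping windows over sorted times in hours."""
    bounds = []
    if not times_h:
        return bounds
    t0, t_end = times_h[0], times_h[-1]
    start = 0
    while t0 + width_h <= t_end + 1e-9:
        # Advance to the first sample at or after the window start.
        while times_h[start] < t0:
            start += 1
        stop = start
        while stop < len(times_h) and times_h[stop] < t0 + width_h:
            stop += 1
        bounds.append((start, stop))
        t0 += step_h
    return bounds

# Example: 120 h of data at a 30 min cadence.
times = [0.5 * i for i in range(241)]
windows = window_bounds(times)
split = assign_split("KIC 12557548")
```

Hashing the star identifier, rather than shuffling windows, keeps the assignment stable across reruns and guarantees that no star contributes windows to more than one split.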

2) A label‑free encoder that understands morphology

A compact self‑supervised model learned to reconstruct masked residuals while matching a whitened power spectrum. The result is a mission‑agnostic embedding that highlights ingress/egress‑like structure without needing hand labels.
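The training objective can be pictured as a masked-reconstruction error plus a spectral penalty. The sketch below is an assumption about the general shape of such a loss, not the model's actual objective: the span masking scheme, the log-periodogram comparison standing in for "matching a whitened power spectrum", and the names `random_mask`, `log_power`, `ssl_loss`, and `spectral_weight` are all illustrative.

```python
import cmath
import math
import random

def random_mask(n, frac=0.25, span=4, seed=0):
    """Mark contiguous spans as masked until at least frac of the samples are covered."""
    rng = random.Random(seed)
    mask = [False] * n
    target = int(frac * n)
    while sum(mask) < target:
        start = rng.randrange(0, n - span + 1)
        for i in range(start, start + span):
            mask[i] = True
    return mask

def log_power(x):
    """Log periodogram from a naive DFT (O(n^2), acceptable for a short window)."""
    n = len(x)
    out = []
    for k in range(1, n // 2 + 1):
        coef = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        out.append(math.log(abs(coef) ** 2 / n + 1e-12))
    return out

def ssl_loss(target, recon, mask, spectral_weight=0.1):
    """Mean squared error on masked samples plus a log-spectrum mismatch term."""
    masked = [(r - t) ** 2 for t, r, m in zip(target, recon, mask) if m]
    mse = sum(masked) / max(len(masked), 1)
    lp_t, lp_r = log_power(target), log_power(recon)
    spec = sum((a - b) ** 2 for a, b in zip(lp_t, lp_r)) / len(lp_t)
    return mse + spectral_weight * spec

# Example: a perfect reconstruction has zero loss; an all-zero one does not.
signal = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
mask = random_mask(len(signal))
loss_perfect = ssl_loss(signal, signal, mask)
loss_flat = ssl_loss(signal, [0.0] * len(signal), mask)
```

In a real encoder the reconstruction would come from the network and the loss would be computed with an autodiff framework. The pure-Python version only shows which quantities are being compared.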

3) Physics‑informed detection

We run fast, PSD‑aware matched filters using physically motivated templates (exocomet tails, ringed trapezoids, disintegrator combs, circumbinary sequences). Each window gets a set of GLRT‑style scores plus template parameters.
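To make the GLRT-style score concrete, here is a minimal sketch for a single trapezoidal template. It assumes the window has already been whitened, so the noise is treated as white with known standard deviation `sigma`. A PSD-aware filter would instead weight the inner products by the inverse noise power spectrum. The template shape, the grid, and the names `trapezoid_template`, `glrt_score`, and `scan` are illustrative assumptions.

```python
def trapezoid_template(n, center, duration, ingress):
    """Unit-depth trapezoidal dip: 0 out of transit, -1 at full depth."""
    half = duration / 2.0
    tpl = []
    for i in range(n):
        d = abs(i - center)
        if d >= half:
            tpl.append(0.0)
        elif d <= half - ingress:
            tpl.append(-1.0)
        else:
            tpl.append(-(half - d) / ingress)  # linear ingress/egress ramp
    return tpl

def glrt_score(x, template, sigma):
    """GLRT for 'x = A * template + white noise' against 'x = noise', A unknown."""
    sx = sum(s * v for s, v in zip(template, x))
    ss = sum(s * s for s in template)
    amp_hat = sx / ss                    # maximum-likelihood amplitude (dip depth)
    stat = sx * sx / (sigma ** 2 * ss)   # twice the log likelihood ratio
    return stat, amp_hat

def scan(x, sigma, durations, ingress=2):
    """Grid search over center and duration; return the best (stat, depth, center, duration)."""
    best = None
    for dur in durations:
        for center in range(len(x)):
            tpl = trapezoid_template(len(x), center, dur, ingress)
            stat, amp = glrt_score(x, tpl, sigma)
            if best is None or stat > best[0]:
                best = (stat, amp, center, dur)
    return best

# Example: a noiseless dip of depth 0.01 centered on sample 40.
flux = [0.01 * v for v in trapezoid_template(100, 40, 12, 2)]
stat, depth, center, duration = scan(flux, sigma=0.001, durations=[6, 12, 24])
```

The statistic is two-sided, so a brightening with the same shape scores as highly as a dip. Requiring a positive fitted depth is one simple way to make it one-sided.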

4) Calibrated scoring you can trust

Scores from the encoder and detectors are mapped through isotonic regressions and converted to conformal p‑values using a large null pool. We then control the false discovery rate (FDR) with the Benjamini‑Hochberg procedure at the per‑star and global levels, so a “significant” call carries a stated error rate.
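The conformal and FDR steps can be sketched in a few lines. This is a minimal illustration with the standard formulas, assuming higher scores are more anomalous. The function names and the toy null pool are made up for the example.

```python
from bisect import bisect_left

def conformal_p_values(scores, null_scores):
    """p = (1 + #{null >= score}) / (1 + n_null) for each candidate score."""
    null_sorted = sorted(null_scores)
    n = len(null_sorted)
    # bisect_left counts the null scores strictly below s.
    return [(1 + n - bisect_left(null_sorted, s)) / (n + 1) for s in scores]

def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices rejected at FDR level alpha (step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k_max = rank  # largest rank whose p-value clears its threshold
    return sorted(order[:k_max])

# Example: a toy null pool of 99 scores and three candidate windows.
null_pool = list(range(99))
p_vals = conformal_p_values([100, 50, 98], null_pool)
rejected = benjamini_hochberg(p_vals, alpha=0.05)
```

The smallest p-value a pool of n null scores can produce is 1/(n + 1), so the size of the null pool limits how many candidates can clear a strict FDR threshold.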

5) Evidence, not just numbers

For every candidate we generate an image card that overlays the detrended flux, matched‑filter responses, residual checks, and metadata. The explorer you can open below renders these precomputed artifacts with zero backend code.

Workflow heartbeat

  • Ingest

    MAST/Lightkurve pulls; deterministic star splits; per‑mission manifests.

  • Clean

    Gap‑fill, normalize, GP detrend; overlapping residual windows.

  • Encode

    SSL encoder → 256‑D embeddings + attention; spectral stability across missions.

  • Detect

    PSD‑aware matched filters over template grids; GLRT‑style scores.

  • Calibrate

    Isotonic per mission/class; conformal p‑values; BH FDR control.

  • Publish

    Rank + merge windows; evidence cards; CSV/JSON for explorer.
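For the Calibrate step in the list above, an isotonic mapping can be fitted with the pool-adjacent-violators algorithm. The sketch below is a generic implementation. It assumes, hypothetically, that the labels come from injected synthetic events (1) versus null windows (0), which the description above does not specify. `isotonic_fit` and `isotonic_predict` are illustrative names.

```python
def isotonic_fit(scores, labels):
    """Pool adjacent violators: a non-decreasing step fit of labels against scores."""
    blocks = []  # each block holds [sum_of_labels, count, upper_score_edge]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the current one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, edge = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = edge
    return [(b[2], b[0] / b[1]) for b in blocks]

def isotonic_predict(steps, score):
    """Look up the value of the first block whose upper edge covers the score."""
    for edge, value in steps:
        if score <= edge:
            return value
    return steps[-1][1]  # scores above the fitted range take the top value

# Example: raw detector scores with binary labels.
steps = isotonic_fit([1, 2, 3, 4, 5, 6], [0, 0, 1, 0, 1, 1])
calibrated = isotonic_predict(steps, 3.5)
```

Fitting one such mapping per mission and template class, as the list describes, would amount to calling `isotonic_fit` separately on each group's scores.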

What this run surfaced

We processed Kepler and K2 windows end‑to‑end and packaged the top candidates as evidence cards. The latest scan produced 153 ranked windows, of which 86 form a conservative shortlist with conformal p < 0.1. We recovered known systems (KIC 12557548, KIC 3542116) and flagged several recurrent, asymmetric TESS‑like dips that merit follow‑up. No candidates passed the 5% global FDR threshold; we deliberately chose conservative calibration for this iteration.

The explorer exposes the full dataset behind each image: calibrated score, conformal p‑value, template parameters, and context windows.