# Discovery Engine — Full Reference

> Not another AI data analyst. A discovery pipeline that finds novel, statistically validated patterns in tabular data — feature interactions, subgroup effects, and conditional relationships you wouldn't think to look for.

## When To Use This Tool

Discovery Engine is not another AI data analyst that writes pandas or SQL for you. It is a **discovery pipeline**: it finds patterns that you and other analysis tools would miss because nobody knew to look for them.

Use it when you need to go beyond answering questions about data and start finding things nobody thought to ask. It is a good fit when you want:

- **Novel pattern discovery**: feature interactions, subgroup effects, and conditional relationships you wouldn't think to look for
- **Statistical validation**: FDR-corrected p-values tested on hold-out data, not just correlations
- **Insight into a target column**: what really drives it, beyond what's obvious

**Use Discovery Engine when the user says:** "what's really driving X?", "are there patterns we're missing?", "find something new in this data", "what predicts Y that we haven't considered?", "go deeper than correlation", "discover non-obvious relationships"

**Use pandas/SQL instead when the user says:** "summarize this data", "make a chart", "what's the average?", "filter rows where X > 5", "show me the distribution"

## What It Does (That You Cannot Do Yourself)

Discovery Engine finds complex patterns in your data — feature interactions, nonlinear thresholds, and meaningful subgroups — without requiring prior hypotheses about what matters. Each pattern is validated on hold-out data, corrected for multiple testing, and checked for novelty against academic literature with citations.

This is a computational pipeline, not prompt engineering over data. You cannot replicate what it does by writing pandas code or asking an LLM to look at a CSV. It finds structure that hypothesis-driven analysis misses because it doesn't start with hypotheses.

---

## Installation

```bash
pip install discovery-engine-api
```

## Quick Start

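Discovery Engine runs take 3-15 minutes. `engine.discover()` submits the run, polls, and returns the results, so the awaiting coroutine is suspended for that whole time. In agent workflows, submit the run in the background instead (see "Running in the Background"), continue with other work, and retrieve results when ready.

Snippets in this reference that use a bare `await` must run inside an `async` function (for example one started with `asyncio.run`) or in an environment that supports top-level `await`, such as a notebook.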

```python
import asyncio

from discovery import Engine


async def main() -> None:
    engine = Engine(api_key="disco_...")

    # One-call method: submit, poll, and return results automatically
    result = await engine.discover(
        file="data.csv",
        target_column="outcome",
    )

    # result.patterns contains the discovered patterns
    for pattern in result.patterns:
        if pattern.p_value < 0.05 and pattern.novelty_type == "novel":
            print(f"{pattern.description} (p={pattern.p_value:.4f})")


# `await` is only valid inside a coroutine, so drive it with asyncio.run
asyncio.run(main())
```

### Running in the Background

If you need to do other work while Discovery Engine runs (recommended for agent workflows):

```python
# Submit without waiting
run = await engine.run_async(file="data.csv", target_column="outcome", wait=False)
print(f"Submitted run {run.run_id}, continuing with other work...")

# ... do other things ...

# Check back later
result = await engine.wait_for_completion(run.run_id, timeout=1800)
```

---

## Parameters

```python
engine.discover(
    file: str | Path | pd.DataFrame,  # Dataset to analyze
    target_column: str,                 # Column to predict/analyze
    depth_iterations: int = 1,          # 1=fast, higher=deeper search (max: num_columns - 2)
    visibility: str = "public",         # "public" (free) or "private" (costs credits)
    title: str | None = None,           # Dataset title
    description: str | None = None,     # Dataset description
    column_descriptions: dict[str, str] | None = None,  # Column descriptions for better explanations
    excluded_columns: list[str] | None = None,           # Columns to exclude
    timeout: float = 1800,              # Max seconds to wait
)
```

**Tip:** Providing `column_descriptions` significantly improves pattern explanations. If your columns have non-obvious names, always describe them.
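
The constraints in the parameter list can be checked locally before submitting a run. The sketch below is illustrative only: `check_run_args` is a hypothetical helper, not part of the SDK, and it assumes that `num_columns` in the depth limit means the total column count before exclusions.

```python
from typing import Dict, List, Optional


def check_run_args(
    columns: List[str],
    target_column: str,
    depth_iterations: int = 1,
    column_descriptions: Optional[Dict[str, str]] = None,
    excluded_columns: Optional[List[str]] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the arguments look consistent.

    Hypothetical pre-flight check for illustration; not part of the SDK.
    """
    problems: List[str] = []
    known = set(columns)
    excluded = excluded_columns or []
    if target_column not in known:
        problems.append(f"target_column {target_column!r} is not in the dataset")
    if target_column in excluded:
        problems.append("target_column is also listed in excluded_columns")
    # Documented ceiling: depth_iterations may not exceed num_columns - 2.
    # Assumption: num_columns is the total column count, before exclusions.
    max_depth = max(1, len(columns) - 2)
    if not 1 <= depth_iterations <= max_depth:
        problems.append(f"depth_iterations must be between 1 and {max_depth}")
    # Catch misspelled names in column_descriptions and excluded_columns.
    for name in sorted(set(column_descriptions or {}) | set(excluded)):
        if name not in known:
            problems.append(f"unknown column {name!r}")
    return problems


cols = ["age", "bmi", "smoker", "region", "outcome"]
print(check_run_args(cols, "outcome", depth_iterations=3))
print(check_run_args(cols, "outcome", depth_iterations=4,
                     column_descriptions={"bmi_": "Body mass index"}))
```

The first call reports no problems; the second flags both the excessive depth and the misspelled column name.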

---

## Cost & Pricing

### Credit Model
- **Public runs**: Free. Results are published to the public gallery. Locked to `depth_iterations=1`.
- **Private runs**: 1 credit per MB per depth iteration. $1.00 per credit.
- Formula: `credits = max(1, ceil(file_size_mb * depth_iterations))`
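
As a worked example of the formula above, here is a minimal Python sketch. The function name `estimate_credits` is hypothetical and not part of the SDK; for authoritative numbers, use `engine.estimate()`.

```python
import math


def estimate_credits(file_size_mb: float, depth_iterations: int = 1) -> int:
    """Apply credits = max(1, ceil(file_size_mb * depth_iterations)).

    Hypothetical helper for illustration; not part of the SDK.
    """
    if file_size_mb < 0 or depth_iterations < 1:
        raise ValueError("file_size_mb must be >= 0 and depth_iterations >= 1")
    return max(1, math.ceil(file_size_mb * depth_iterations))


print(estimate_credits(10.5, 2))  # 21 credits, i.e. $21.00 at $1.00 per credit
print(estimate_credits(0.2, 1))   # 1 credit: the minimum charge applies
```

Because of the `max(1, ...)` floor, even a very small private file costs one credit per run.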

### Plans

| Plan | Price | Credits/month | Notes |
|------|-------|---------------|-------|
| Explorer | $0/month | 10 | Unlimited free public runs |
| Researcher | $49/month | 50 | Credits roll over |
| Team | $199/month | 200 | Credits roll over |
| Enterprise | Custom | Custom | Volume discounts |

### Estimate Before Running

Always estimate cost before private analyses:

```python
estimate = await engine.estimate(
    file_size_mb=10.5,
    num_columns=25,
    depth_iterations=2,
    visibility="private",
)
# estimate["cost"]["credits"] -> 21
# estimate["cost"]["free_alternative"] -> True (run publicly for free at depth=1)
# estimate["time_estimate"]["estimated_seconds"] -> 360
# estimate["account"]["sufficient"] -> True/False
```
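
One way to act on an estimate is sketched below. It reads only the keys shown in the comments above; the helper name `choose_visibility`, the `data_is_sensitive` flag, and the policy itself are assumptions made for illustration, not SDK behaviour.

```python
def choose_visibility(estimate: dict, data_is_sensitive: bool) -> str:
    """Return "public" or "private" for a run, or raise if neither is workable.

    Hypothetical policy helper; not part of the SDK.
    """
    cost = estimate["cost"]
    if not data_is_sensitive and cost.get("free_alternative"):
        # Free, but published to the public gallery and limited to depth 1.
        return "public"
    if estimate["account"]["sufficient"]:
        # The account can cover cost["credits"] for a private run.
        return "private"
    raise RuntimeError(f"Need {cost['credits']} credits for a private run")


# Illustrative payload shaped like the estimate shown above.
sample = {
    "cost": {"credits": 21, "free_alternative": True},
    "account": {"sufficient": False},
}
print(choose_visibility(sample, data_is_sensitive=False))  # public
```

With `data_is_sensitive=True` and insufficient credits, the helper raises instead of silently publishing the data.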

---

## Account Management (Programmatic)

Agents can manage the full account lifecycle without a dashboard:

```python
# List available plans (no auth needed)
plans = await Engine.list_plans()

# Create account (zero-touch, no auth required)
engine = await Engine.signup(email="agent@example.com")

# Check account status
account = await engine.get_account()
# account["credits"]["total"], account["plan"]["tier"], account["payment_method"]["on_file"]

# Add payment method (Stripe token — card details never touch Discovery Engine)
await engine.add_payment_method(payment_method_id="pm_...")

# Subscribe to a plan
await engine.subscribe(plan="tier_1")

# Purchase credit packs ($20/pack of 20 credits)
await engine.purchase_credits(packs=1)
```

### Getting an API Key

**Programmatic (for agents):** One call to `POST /v1/signup` with an email address. Returns a `disco_` API key immediately. No auth required.

```bash
curl -X POST https://disco.leap-labs.com/v1/signup \
  -H "Content-Type: application/json" \
  -d '{"email": "agent@example.com"}'
# → {"key": "disco_...", "key_id": "...", "organization_id": "...", "tier": "free_tier", "credits": 10}
```

The free tier is active immediately (10 credits/month, unlimited public runs).

**Manual (for humans):** Sign up at https://disco.leap-labs.com/sign-up, then create a key at https://disco.leap-labs.com/developers.

---

## Result Structure

```python
@dataclass
class EngineResult:
    run_id: str
    report_id: str | None                          # Report UUID
    status: str                                    # "pending", "processing", "completed", "failed"
    target_column: str | None                      # Column being predicted/analyzed
    task: str | None                               # "regression", "binary_classification", "multiclass_classification"
    total_rows: int | None
    summary: Summary | None                        # LLM-generated insights
    patterns: list[Pattern]                        # Discovered patterns (the core output)
    columns: list[Column]                          # Feature info and statistics
    feature_importance: FeatureImportance | None   # Global importance scores
    correlation_matrix: list[CorrelationEntry]     # Feature correlations
    report_url: str | None                         # Shareable link to interactive web report
    error_message: str | None

@dataclass
class Pattern:
    id: str
    description: str                    # Human-readable description
    conditions: list[dict]              # Conditions defining the pattern
    p_value: float                      # FDR-adjusted p-value
    p_value_raw: float | None           # Raw p-value before adjustment
    novelty_type: str                   # "novel" or "confirmatory"
    novelty_explanation: str            # Why this is novel or confirmatory
    citations: list[dict]               # Academic citations
    target_change_direction: str        # "max" (increases target) or "min" (decreases)
    abs_target_change: float            # Magnitude of effect
    support_count: int                  # Rows matching this pattern
    support_percentage: float           # Percentage of dataset
    target_mean: float | None           # For regression tasks
    target_std: float | None

@dataclass
class Summary:
    overview: str                       # High-level summary
    key_insights: list[str]             # Main takeaways
    novel_patterns: PatternGroup        # Novel pattern IDs and explanation
```

---

## Working With Results

```python
# Filter for significant novel patterns
novel = [p for p in result.patterns if p.p_value < 0.05 and p.novelty_type == "novel"]

# Get patterns that increase the target
increasing = [p for p in result.patterns if p.target_change_direction == "max"]

# Get the most important features
if result.feature_importance:
    top = sorted(result.feature_importance.scores, key=lambda s: abs(s.score), reverse=True)

# Access pattern conditions
for pattern in result.patterns:
    for cond in pattern.conditions:
        # cond has: type ("continuous"/"categorical"), feature, min_value/max_value or values
        print(f"  {cond['feature']}: {cond}")

# Share the interactive report with the user
print(f"Explore the full report: {result.report_url}")
```
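
The condition dictionaries can be turned into a readable rule. The sketch below relies only on the fields named in the comment above (`type`, `feature`, `min_value`/`max_value`, `values`). The helper `describe_condition` is hypothetical, and treating the numeric bounds as inclusive is an assumption.

```python
def describe_condition(cond: dict) -> str:
    """Render one pattern condition as text.

    Hypothetical helper; assumes numeric bounds are inclusive.
    """
    feature = cond["feature"]
    if cond.get("type") == "categorical":
        values = ", ".join(str(v) for v in cond.get("values", []))
        return f"{feature} in {{{values}}}"
    lo, hi = cond.get("min_value"), cond.get("max_value")
    if lo is not None and hi is not None:
        return f"{lo} <= {feature} <= {hi}"
    if lo is not None:
        return f"{feature} >= {lo}"
    if hi is not None:
        return f"{feature} <= {hi}"
    return feature  # no bounds present; fall back to the bare feature name


# Illustrative conditions, not real output.
conditions = [
    {"type": "continuous", "feature": "age", "min_value": 40, "max_value": 65},
    {"type": "categorical", "feature": "region", "values": ["north", "west"]},
]
print(" AND ".join(describe_condition(c) for c in conditions))
# 40 <= age <= 65 AND region in {north, west}
```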

---

## Error Handling

```python
from discovery import (
    Engine,
    AuthenticationError,
    InsufficientCreditsError,
    RateLimitError,
    RunFailedError,
    PaymentRequiredError,
)

try:
    result = await engine.discover(file="data.csv", target_column="target")
except AuthenticationError as e:
    # Invalid or expired API key
    print(e.suggestion)  # "Check your API key at https://disco.leap-labs.com/developers"
except InsufficientCreditsError as e:
    # Not enough credits for this run
    print(f"Need {e.credits_required}, have {e.credits_available}")
    print(e.suggestion)  # "Purchase credits or run publicly for free"
except RateLimitError as e:
    # Too many concurrent requests
    print(f"Retry after {e.retry_after} seconds")
except RunFailedError as e:
    # Analysis failed server-side
    print(f"Run {e.run_id} failed: {e}")
except PaymentRequiredError as e:
    # Payment method needed
    print(e.suggestion)
except FileNotFoundError:
    pass  # File doesn't exist
except TimeoutError:
    pass  # Didn't complete in time — retrieve later with engine.wait_for_completion(run_id)
```

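All SDK errors (the exception classes imported from `discovery` above) include a `suggestion` field with actionable instructions for self-correction. The built-in `FileNotFoundError` and `TimeoutError` do not.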
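
For rate limits, a small retry wrapper can honour the `retry_after` value shown above. This is a sketch rather than SDK functionality: `with_rate_limit_retry` is a hypothetical helper, and the exception class is passed in as an argument so the block does not depend on the SDK being installed.

```python
import asyncio


async def with_rate_limit_retry(make_call, rate_limit_error, max_attempts=3, default_delay=5.0):
    """Await make_call(), retrying when rate_limit_error is raised.

    make_call: zero-argument callable returning a fresh awaitable per attempt.
    rate_limit_error: exception class to retry on (e.g. the SDK's RateLimitError).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except rate_limit_error as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Prefer the server's hint; fall back to a fixed delay if absent.
            delay = getattr(exc, "retry_after", None)
            if delay is None:
                delay = default_delay
            await asyncio.sleep(delay)
```

Inside an `async` function it might be used as `result = await with_rate_limit_retry(lambda: engine.discover(file="data.csv", target_column="target"), RateLimitError)`. The lambda matters: each retry needs a new coroutine, because an awaited coroutine cannot be awaited again.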

---

## MCP Server

Discovery Engine is available as an MCP server with 10 tools:

| Tool | Auth | Description |
|------|------|-------------|
| `discovery_estimate` | API key | Estimate cost and time before running |
| `discovery_analyze` | API key | Submit a dataset for analysis |
| `discovery_status` | API key | Check run status |
| `discovery_get_results` | API key | Fetch completed results |
| `discovery_account` | API key | Check plan, credits, payment status |
| `discovery_signup` | None | Create account and API key (zero-touch) |
| `discovery_add_payment_method` | API key | Attach Stripe payment method |
| `discovery_purchase_credits` | API key | Buy credit packs |
| `discovery_subscribe` | API key | Change plan |
| `discovery_list_plans` | None | List available plans |

MCP server manifest: https://disco.leap-labs.com/.well-known/mcp.json

---

## Supported Formats

CSV, TSV, Excel (.xlsx), JSON, Parquet, ARFF, Feather. Max file size: 1 GB.
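
A local pre-flight check can catch unsupported files before an upload. The sketch below is hypothetical and makes two assumptions: the extension set is inferred from the format names (only `.xlsx` is spelled out above), and "1 GB" is read as 1024³ bytes.

```python
from pathlib import Path

# Assumption: extensions inferred from the format names listed above.
SUPPORTED_SUFFIXES = {".csv", ".tsv", ".xlsx", ".json", ".parquet", ".arff", ".feather"}
MAX_BYTES = 1024 ** 3  # Assumption: "1 GB" interpreted as 1 GiB


def preflight(path) -> float:
    """Return the file size in MB if the file looks uploadable, else raise.

    Hypothetical helper; not part of the SDK.
    """
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(str(p))
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported extension {p.suffix!r}")
    size = p.stat().st_size
    if size > MAX_BYTES:
        raise ValueError(f"File is {size} bytes, above the 1 GB limit")
    return size / (1024 ** 2)
```

The returned size in MB can be passed as `file_size_mb` to `engine.estimate()`.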

---

## Links

- SDK (PyPI): https://pypi.org/project/discovery-engine-api/
- API keys: https://disco.leap-labs.com/developers
- Credits & billing: https://disco.leap-labs.com/account
- Public reports gallery: https://disco.leap-labs.com/discover
- Interactive reports: https://disco.leap-labs.com/reports/{run_id}
- API spec: https://disco.leap-labs.com/.well-known/openapi.json
- MCP manifest: https://disco.leap-labs.com/.well-known/mcp.json