# Competing Hypotheses

Multiple investigators propose, test, and disprove theories like a scientific debate.

## At a Glance
| Field | Value |
|---|---|
| Best For | Ambiguous bugs, flaky tests, architectural decisions, root cause analysis |
| Team Shape | Lead (judge/arbiter) + 3-6 Investigators |
| Cost Profile | Medium (debate rounds consume tokens) |
| Complexity | Medium |
| Parallelism | Medium |
## When to Use
- A bug is ambiguous and the root cause is unclear
- Flaky or intermittent behavior defies simple reproduction
- You need to choose between multiple architectural approaches with real trade-offs
- You want to counter anchoring bias by forcing divergent investigation paths
## When NOT to Use
- The root cause is obvious and just needs fixing
- The decision has a clear “right answer” that does not require debate
- Cost is a primary concern – debate rounds are token-intensive
## How It Works
Each investigator proposes a hypothesis and a test plan to validate or disprove it. Investigators are encouraged to actively disprove each other’s theories, creating a structured debate. The lead acts as final arbiter, evaluating the evidence and declaring a consensus root cause or decision.
```mermaid
graph TD
    Judge[Lead<br/>Judge / Arbiter]
    H1[Hypothesis 1]
    H2[Hypothesis 2]
    H3[Hypothesis 3]
    Judge -->|spawn| H1
    Judge -->|spawn| H2
    Judge -->|spawn| H3
    H1 <-.->|debate| H2
    H2 <-.->|debate| H3
    H1 <-.->|debate| H3
    H1 -.->|evidence| Judge
    H2 -.->|evidence| Judge
    H3 -.->|evidence| Judge
```
1. The lead describes the symptom and spawns investigators.
2. Each investigator proposes a hypothesis, a test plan, predicted observations, and disproof criteria.
3. Investigators exchange messages to actively disprove each other's theories.
4. The lead evaluates the surviving hypotheses and declares a consensus root cause, a reproducer, a fix plan, and verification steps.
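The hypothesis-and-arbiter flow can be sketched in plain Python. The `Hypothesis` record and `arbitrate` helper below are illustrative names, not part of any particular agent framework; the key idea is that every hypothesis carries explicit disproof criteria, and the judge keeps only theories nobody could falsify.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """One investigator's theory, with explicit disproof criteria."""
    claim: str            # e.g. "stale cache read"
    test_plan: str        # how to validate or disprove the claim
    predicted: str        # what we expect to observe if the claim is true
    disproof: str         # observation that would falsify the claim
    evidence_for: list = field(default_factory=list)
    evidence_against: list = field(default_factory=list)

    def survives(self) -> bool:
        # A hypothesis survives only if no disproving evidence was found.
        return not self.evidence_against

def arbitrate(hypotheses):
    """Lead-as-judge: keep only hypotheses no investigator could disprove."""
    survivors = [h for h in hypotheses if h.survives()]
    if len(survivors) == 1:
        return survivors[0]   # consensus root cause
    return None               # zero or multiple survivors: debate another round
```

When `arbitrate` returns `None`, the lead would typically spawn another debate round or ask investigators to sharpen their disproof criteria.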
## Spawn Prompt

```text
Users report: "<symptom>".
Spawn 5 teammates to investigate different hypotheses.
Have them talk to each other to disprove each other's theories like a scientific debate.
End with: (1) consensus root cause, (2) reproducer, (3) fix plan, (4) verification steps.
```
## Task Breakdown Strategy
Structure tasks around the hypothesis lifecycle, not code modules:
- Hypothesis formation: Each investigator proposes a distinct theory
- Evidence gathering: Each investigator collects supporting and contradicting evidence
- Cross-examination: Investigators challenge each other’s evidence
- Convergence: Surviving hypotheses are refined into a consensus
Use task list dependencies so that “Fix” tasks only unblock after “Root cause confirmed” tasks complete.
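That gating can be sketched with a plain dict standing in for whatever task list your tooling provides (the task names here are hypothetical):

```python
# Each task is blocked until every task it depends on is done.
tasks = {
    "root-cause-confirmed": {"done": False, "deps": []},
    "fix": {"done": False, "deps": ["root-cause-confirmed"]},
    "verify": {"done": False, "deps": ["fix"]},
}

def unblocked(name: str) -> bool:
    """A task may start only once all of its dependencies are complete."""
    return all(tasks[d]["done"] for d in tasks[name]["deps"])

assert not unblocked("fix")                    # fix is gated behind the debate
tasks["root-cause-confirmed"]["done"] = True
assert unblocked("fix")                        # gate lifts once the root cause is confirmed
```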
## Configuration

- Agents: Use `investigator.md` agent definitions with instructions to propose and disprove hypotheses
- Hooks: Use task dependencies to gate fix tasks behind confirmed root cause
- Team size: 3-6 investigators; more hypotheses increase coverage but also cost and debate complexity
## Variations
- Architecture Decision variant: Instead of debugging, each investigator champions a different design approach and argues for it
- Red Team variant: One investigator is explicitly assigned to attack and disprove every other hypothesis
- Time-boxed variant: Limit debate rounds to control cost – after N rounds, the lead forces a decision
- Pipeline variant: Debate concludes with a root cause or decision, which flows into a Risky Refactor for controlled execution; see Composing Topologies
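The time-boxed variant reduces to a simple loop. In this sketch, `exchange_round` and `judge` are placeholders for however your agents actually communicate; hypotheses are plain dicts with a `"disproved"` flag.

```python
MAX_ROUNDS = 3  # illustrative budget cap; tune to your token budget

def run_debate(hypotheses, exchange_round, judge):
    """Time-boxed debate: after MAX_ROUNDS the lead forces a decision."""
    for _ in range(MAX_ROUNDS):
        exchange_round(hypotheses)   # investigators attack each other's theories
        survivors = [h for h in hypotheses if not h["disproved"]]
        if len(survivors) == 1:
            return survivors[0]      # early consensus, no more rounds needed
        hypotheses = survivors
    return judge(hypotheses)         # budget exhausted: the lead decides
```

The early return keeps cost proportional to how contested the question really is: easy cases converge in one round, and only genuinely ambiguous ones spend the full budget.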
## Trade-offs
Pros:
- Counters anchoring bias by forcing multiple investigation paths
- Cross-examination catches weak evidence and flawed reasoning
- Structured debate produces high-confidence conclusions
- Works well for problems where “the obvious answer” is often wrong
Cons:
- Token-intensive due to inter-agent messaging and debate rounds
- Requires clear facilitation from the lead to prevent circular arguments
- Medium parallelism – investigators need to read each other’s findings
- Overkill for straightforward bugs with clear symptoms
## Related Patterns
- Parallel Explorers – when you need discovery without the debate overhead
- Risky Refactor – when the debate concludes and the fix needs careful execution
- Quality-Gated – layer on top to ensure debate conclusions meet evidence standards