Scenario
Expected output
Final synthesis contains all required key facts for 8 of 10 test questions
Dataset
Scoring rubric
Full agent trace is logged and interpretable
Each agent has a tight focused prompt, no agent is doing multiple unrelated jobs
Handles sub-agent failures, contradictory source documents, and incomplete retrieval
Key facts from ground truth appear in final synthesis for >= 8/10 questions
Agent roles are distinct, state handoff between agents is explicit and inspectable
Language-free evaluation
Build your solution in any language or framework — Python, TypeScript, Go, Rust, Java, C#, or anything else. The dataset artifacts may be in one language; your implementation does not need to match. TryCrucible evaluates the behaviour of your system, the quality of your AI workflow, your verification strategy, and the reproducibility of your submission — not your language choice.
Submission requirements
- A public GitHub repository link
- A Dockerfile in the repo root — any language or framework; the evaluator builds and runs your container
- Your solution reads test_inputs.json from the working directory and writes results.json — standard I/O contract across all challenges
- A decisions.md — 3–5 sentences on the key architectural and AI-workflow choices you made
- The system must be fully reproducible — we clone, build, and run it against real test inputs
Evaluation contract
When you submit, the evaluator runs these steps in order:
- 1Clone your public GitHub repository
- 2Build your container from the Dockerfile in the repo root
- 3Mount test_inputs.json into the working directory
- 4Run your solution in a network-isolated sandbox (5 min limit, 512 MB RAM)
- 5Read results.json from the working directory
- 6Score correctness against hidden ground truth, then score architecture, AI workflow, robustness, and clarity
Input (provided by evaluator)
// test_inputs.json
[
{ "id": "t1", "input": { ... } },
{ "id": "t2", "input": { ... } }
]Output (written by your solution)
Create a free account to start
Already have one? Sign in