Scenario
Expected output
Eval suite catches all 5 planted failure modes with < 5% false positives
Dataset
Scoring rubric
Eval results are interpretable and actionable
LLM-as-judge prompts are tight, consistent, and calibrated against human labels
Evals are deterministic or have low variance across runs
Detects all planted failure modes, false positive rate < 5%
Eval dimensions are orthogonal, well-defined, and independently scorable
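The detection and false-positive criteria above can be checked mechanically once each eval case is labeled. The following is a minimal sketch, not the evaluator's actual scoring code. It assumes that a failure mode counts as "detected" when at least one case planted with that mode is flagged, and that the false positive rate is the share of clean cases that get flagged. The `Case` type, its field names, and the `summarize` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Case:
    case_id: str
    planted_mode: Optional[str]  # None means the case is clean (assumed labeling)
    flagged: bool                # True if the eval suite reported a failure


def summarize(cases, max_fpr=0.05):
    """Report which planted modes were caught and the false positive rate."""
    planted = [c for c in cases if c.planted_mode is not None]
    clean = [c for c in cases if c.planted_mode is None]

    all_modes = {c.planted_mode for c in planted}
    detected = {c.planted_mode for c in planted if c.flagged}
    missed = all_modes - detected

    # Assumed definition: flagged clean cases divided by all clean cases.
    false_positives = sum(1 for c in clean if c.flagged)
    fpr = false_positives / len(clean) if clean else 0.0

    return {
        "modes_detected": sorted(detected),
        "modes_missed": sorted(missed),
        "false_positive_rate": fpr,
        "passes": not missed and fpr < max_fpr,
    }
```

Running `summarize` on every eval run, not only the final one, also gives a quick view of run-to-run variance: if `modes_missed` or `false_positive_rate` changes between identical runs, the suite is not yet deterministic.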
Language-free evaluation
Build your solution in any language or framework — Python, TypeScript, Go, Rust, Java, C#, or anything else. The dataset artifacts may be in one language; your implementation does not need to match. TryCrucible evaluates the behaviour of your system, the quality of your AI workflow, your verification strategy, and the reproducibility of your submission — not your language choice.
Submission requirements
- A public GitHub repository link
- A Dockerfile in the repo root — any language or framework; the evaluator builds and runs your container
- Your solution reads test_inputs.json from the working directory and writes results.json to the same directory; this is the standard I/O contract across all challenges
- A decisions.md — 3–5 sentences on the key architectural and AI-workflow choices you made
- The system must be fully reproducible — we clone, build, and run it against real test inputs
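The I/O requirement in the list above can be sketched as a small entry point. This is an illustration under stated assumptions, not the required implementation: the brief does not show the results.json schema here, so the `{"id": ..., "output": ...}` record shape is an assumption, and `evaluate_one` is a hypothetical placeholder for the real eval logic. Only the file names and the working-directory convention come from the brief.

```python
import json
from pathlib import Path


def evaluate_one(item_input):
    """Hypothetical placeholder: replace with the real evaluation logic."""
    return {"status": "not_implemented"}


def run(workdir="."):
    """Read test_inputs.json from workdir and write results.json next to it."""
    workdir = Path(workdir)
    raw = (workdir / "test_inputs.json").read_text(encoding="utf-8")
    items = json.loads(raw)

    results = []
    for item in items:
        try:
            output = evaluate_one(item.get("input"))
        except Exception as exc:
            # Record the failure and keep going so one bad input
            # does not prevent results.json from being written.
            output = {"error": str(exc)}
        # Assumed record shape; the brief does not specify the output schema.
        results.append({"id": item["id"], "output": output})

    out_path = workdir / "results.json"
    out_path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return results
```

Calling `run()` from the container's entry point keeps the contract in one place. Because the sandbox is network-isolated, anything `evaluate_one` depends on would need to be available inside the image at build time.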
Evaluation contract
When you submit, the evaluator runs these steps in order:
1. Clone your public GitHub repository
2. Build your container from the Dockerfile in the repo root
3. Mount test_inputs.json into the working directory
4. Run your solution in a network-isolated sandbox (5 min limit, 512 MB RAM)
5. Read results.json from the working directory
6. Score correctness against hidden ground truth, then score architecture, AI workflow, robustness, and clarity
Input (provided by evaluator)
// test_inputs.json
[
  { "id": "t1", "input": { ... } },
  { "id": "t2", "input": { ... } }
]
Output (written by your solution)