Scenario
Expected output
Eval suite correctly labels answer faithfulness and gives useful explanations for 24 held-out cases
Dataset
Scoring rubric
Eval report shows label, evidence, and rationale per answer
LLM-as-judge prompt requires quotes or evidence spans from context
Handles partial support, subtle contradictions, and unsupported extra claims
Labels match ground truth for at least 20/24 held-out cases
Faithfulness, contradiction, and omission checks are separate and interpretable
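Two of the rubric items above lend themselves to mechanical checks: that every quote the judge cites actually appears in the context, and that at least 20 of the 24 labels match ground truth. The following is a minimal sketch under stated assumptions. The function names, the whitespace and case normalization, and the dictionary shapes are all hypothetical and not part of the challenge specification.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def ungrounded_quotes(context: str, quotes: list[str]) -> list[str]:
    """Return cited quotes that are empty or do not appear in the context.

    Assumption: a quote counts as grounded if it is a substring of the
    context after normalization. A stricter or fuzzier rule may suit
    the real dataset better.
    """
    haystack = normalize(context)
    return [q for q in quotes if not normalize(q) or normalize(q) not in haystack]


def meets_threshold(predicted: dict[str, str], truth: dict[str, str],
                    required: int = 20) -> bool:
    """True if at least `required` predicted labels match ground truth.

    The default of 20 mirrors the rubric's 20/24 bar; the id -> label
    dictionaries are an assumed representation.
    """
    correct = sum(1 for case_id, label in truth.items()
                  if predicted.get(case_id) == label)
    return correct >= required
```

A judge response whose evidence list fails `ungrounded_quotes` could be retried or downgraded before its label is accepted, which keeps the rationale tied to the context rather than to the judge's own paraphrase.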
Language-free evaluation
Build your solution in any language or framework — Python, TypeScript, Go, Rust, Java, C#, or anything else. The dataset artifacts may be in one language; your implementation does not need to match. TryCrucible evaluates the behaviour of your system, the quality of your AI workflow, your verification strategy, and the reproducibility of your submission — not your language choice.
Submission requirements
- A public GitHub repository link
- A Dockerfile in the repo root — any language or framework; the evaluator builds and runs your container
- Your solution reads test_inputs.json from the working directory and writes results.json to the same directory; this is the standard I/O contract across all challenges
- A decisions.md — 3–5 sentences on the key architectural and AI-workflow choices you made
- The system must be fully reproducible — we clone, build, and run it against real test inputs
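The file contract in the requirements above can be sketched as a small entry point. This is an illustration only: the shape of each entry in results.json is not shown in this section, so the `{"id": ..., "output": ...}` records below are an assumption, and `evaluate_case` is a hypothetical placeholder for the real evaluation logic.

```python
import json
from pathlib import Path


def evaluate_case(case_input: dict) -> dict:
    """Hypothetical placeholder; a real solution would label the answer here."""
    return {"label": "unknown"}


def run(workdir: Path = Path(".")) -> int:
    """Read test_inputs.json, process each case, and write results.json.

    Returns the number of results written. The output record shape is an
    assumption, not the evaluator's documented schema.
    """
    cases = json.loads((workdir / "test_inputs.json").read_text(encoding="utf-8"))
    results = []
    for case in cases:
        try:
            output = evaluate_case(case["input"])
        except Exception as exc:
            # Record the failure so one bad case does not lose the others.
            output = {"error": str(exc)}
        results.append({"id": case["id"], "output": output})
    (workdir / "results.json").write_text(
        json.dumps(results, indent=2), encoding="utf-8"
    )
    return len(results)
```

The container's entry point would call `run()` with the default working directory. Writing one record per input id, even on failure, makes it easier to see which cases broke when the run is reproduced.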
Evaluation contract
When you submit, the evaluator runs these steps in order:
1. Clone your public GitHub repository
2. Build your container from the Dockerfile in the repo root
3. Mount test_inputs.json into the working directory
4. Run your solution in a network-isolated sandbox (5 min limit, 512 MB RAM)
5. Read results.json from the working directory
6. Score correctness against hidden ground truth, then score architecture, AI workflow, robustness, and clarity
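Because step 4 enforces a 5 minute limit, a solution may want to stop starting new work before the limit is reached so that results.json still gets written. The sketch below shows one way to do that. The 270 second budget, the `skipped` marker, and the function name are assumptions chosen for illustration, not values from the evaluator.

```python
import time


def process_with_deadline(cases, handler, budget_seconds=270.0,
                          clock=time.monotonic):
    """Run `handler` on each case until the time budget is spent.

    Cases reached after the budget is exhausted are marked as skipped
    instead of being processed, so the caller can still write a complete
    results file. `clock` is injectable to make the behaviour testable.
    """
    start = clock()
    results = []
    for case in cases:
        if clock() - start >= budget_seconds:
            # Out of time: emit a placeholder rather than risk a timeout.
            results.append({"id": case["id"], "output": None, "skipped": True})
            continue
        results.append({"id": case["id"], "output": handler(case["input"])})
    return results
```

The guard only checks the clock between cases, so a single slow case can still overrun; a per-case timeout would be needed to bound that as well.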
Input (provided by evaluator)
// test_inputs.json
[
{ "id": "t1", "input": { ... } },
{ "id": "t2", "input": { ... } }
]
Output (written by your solution)