ARBITER

Live AI judge for hackathons. Watches demos in real-time. Scores with multi-model ensemble. Roasts prompt injection attempts on stage.

"Twenty-five teams walked in with 24 hours of code and a live demo slot. What they didn't expect: an AI judge watching every second."

GitHub Repo Event Results Security Assessment

25demos judged

3AI models

1451tests

7languages

Real-Time Observation

Connects to Gemini Live API, streams audio and video, generates observations as presenters speak. Captures what they say and what they show.

Multi-Model Scoring

Gemini, Claude, and Groq independently evaluate each demo. Scores aggregated with outlier detection. Python-side arithmetic prevents LLM manipulation.

4-Layer Injection Defense

Regex denylist, semantic classifier, multi-language detection (7 languages), dual-LLM privilege separation. Red-teamed by 3 AI agents post-event.

British-Accented Commentary

Generates persona-driven reviews delivered via Cartesia TTS. Sharp, fair, and entertaining. Each sentence emotion-tagged for voice modulation.

Theatrical Score Reveal

Animated criterion-by-criterion reveal on the audience display. Dramatic pacing with score bars, justifications, and total score.

Cross-Team Deliberation

After all demos, compares every team against every other team. Produces evidence-based rankings with cross-references and narrative summary.

Quick Start

No hardware needed for rehearsal mode.

git clone https://github.com/basicScandal/arbiter.git
cd arbiter && uv sync
uv run python -m src.main --rehearsal

Documentation

How We Built Arbiter— architecture, what broke live, lessons learned
Red Team Report— 11 prompt injection findings from 3 AI agents
Red Team Slides— 10-slide HTML presentation
Architecture— system design and module overview
Operator Guide— running the system at an event
Judge Instructions— how human judges score alongside Arbiter