Adversarial evaluation toolkit for frontier language models. Tests four failure-prone reasoning categories using an LLM-as-judge framework.
Built by Zalina Dezhina, PhD โ AI Evaluation ScientistBased on real evaluation methodology developed at Mercor for frontier AI systems.
Select an adversarial task and evaluate how a model handles it. Each task is designed to surface a specific reasoning failure mode.