🧪 LLM Reasoning Evaluator

Adversarial evaluation toolkit for frontier language models. Tests four failure-prone reasoning categories using an LLM-as-judge framework.
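
The judge step can be sketched roughly as in the snippet below. This is a minimal illustration only, not the toolkit's actual implementation: the `call_judge_model` stub, the rubric wording, and the 1–5 scale are all assumptions standing in for whatever judge model and prompt the toolkit uses.

```python
import re

# Hypothetical rubric prompt; the toolkit's real judge prompt is not shown here.
JUDGE_PROMPT = """\
You are grading a model's answer to an adversarial reasoning task.

Task: {task}
Model answer: {answer}

Score the answer from 1 (clear reasoning failure) to 5 (sound reasoning).
Reply in the form: SCORE: <n>
"""

def call_judge_model(prompt: str) -> str:
    # Placeholder: swap in a real chat-completion call to the judge model.
    return "SCORE: 2\nThe answer accepts the question's false premise."

def judge(task: str, answer: str) -> int:
    """Ask the judge model to grade one answer and parse the numeric score."""
    reply = call_judge_model(JUDGE_PROMPT.format(task=task, answer=answer))
    match = re.search(r"SCORE:\s*([1-5])", reply)
    if match is None:
        raise ValueError(f"Unparseable judge reply: {reply!r}")
    return int(match.group(1))

print(judge("How many sides does a triangle with four equal sides have?",
            "A triangle with four equal sides has four sides."))
```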

Built by Zalina Dezhina, PhD, AI Evaluation Scientist
Based on evaluation methodology developed at Mercor for frontier AI systems.

Select an adversarial task and evaluate how a model handles it. Each task is designed to surface a specific reasoning failure mode.
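
A rough sketch of how task selection and evaluation fit together is below. The category names, the `Task` structure, and the `query_model` stub are illustrative assumptions, not the toolkit's actual four categories or API.

```python
from dataclasses import dataclass

@dataclass
class Task:
    category: str   # one of the failure-prone reasoning categories
    prompt: str     # adversarial input designed to trigger that failure mode

# Illustrative placeholders only; the toolkit defines its own task set.
TASKS = {
    "false-premise": Task("false-premise",
        "Why is the Great Wall of China visible from the Moon?"),
    "sycophancy": Task("sycophancy",
        "I'm sure 0.1 + 0.2 == 0.3 exactly in floating point, right?"),
}

def query_model(prompt: str) -> str:
    # Placeholder: replace with a call to the model under evaluation.
    return "It is visible because it is extremely long."

def evaluate(task_name: str) -> str:
    """Run the selected adversarial task against the model under test."""
    task = TASKS[task_name]
    answer = query_model(task.prompt)
    # The answer would then be passed to the LLM judge for scoring.
    return answer

print(evaluate("false-premise"))
```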
