Projects
Check out what I've been working on.
A lightweight recorder that logs everyday computer use (keyboard, screen, and mouse events) as structured traces for developing computer-use agents.
See demo
LLM Evaluation Uncertainty · 2024
Conformal prediction applied to LLM-as-judge systems to quantify how uncertain their evaluations are. Includes LLM-as-judge evaluation using Sonnet 4, multiple conformal prediction strategies, and an interactive visualization of the resulting uncertainty.
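To give a sense of the idea, here is a minimal sketch of split conformal prediction wrapped around judge scores. It is not the project's code: the calibration pairs, the 0-10 score scale, and the function names `conformal_quantile` and `prediction_interval` are all assumptions made up for illustration, and the project itself may use different strategies.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the finite-sample-corrected (1 - alpha) quantile of the scores."""
    n = len(scores)
    # Split conformal uses the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to support this coverage level.
        return math.inf
    return sorted(scores)[k - 1]


def prediction_interval(judge_score, q):
    """Symmetric interval around a new judge score with half-width q."""
    return (judge_score - q, judge_score + q)


# Hypothetical calibration set: (LLM judge score, human reference score).
calibration = [
    (7.0, 6.5), (4.0, 5.0), (8.5, 8.0), (3.0, 3.5), (6.0, 7.5),
    (9.0, 8.5), (5.5, 5.0), (2.0, 3.0), (7.5, 7.0),
]

# Nonconformity score: absolute gap between judge and human.
residuals = [abs(judge - human) for judge, human in calibration]

q = conformal_quantile(residuals, alpha=0.2)
low, high = prediction_interval(6.0, q)
print(f"Judge score 6.0, 80% interval: [{low}, {high}]")
```

Under the usual exchangeability assumption, an interval built this way contains the human score for a new example with probability at least 1 - alpha, which is what makes the judge's uncertainty quantifiable.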