The Human Intelligence Platform for Training and Evaluating AI

Expert human intelligence to train, test, and validate AI - across frontier models and real-world enterprise workflows.

About us

Huzzle AI is a human intelligence platform for training and evaluating AI.
We partner with frontier AI labs and enterprises building agents that operate in real-world professional environments.

Our focus is on producing human data for long-horizon professional tool use and computer interaction — the kinds of trajectories modern agents need to perform reliably in production.

What we've built

  1. A platform & operational process for building RL environments and outcome-verifiable tasks
  2. An expert talent pool (300k+ experts; customers include Apple, Lazard, and the FT)
  3. An AI recruiter that sources, assesses, matches, and contracts thousands of experts compliantly (100k+ interviews; avg. NPS: 88)
>  2) and 3) have been trading successfully for years, so we focus all our energy on 1)


Team

We are 20 engineers and 10 operators, backed by founders and senior technical leaders from Meta, Hugging Face, Applied Intuition, and Magic.

Our operations are based in the US and Europe, allowing us to support frontier labs as well as compliance-heavy European enterprises.

Core principles

What differentiates Huzzle in the space

Case studies

Example work

In a recent clinical reasoning project with 35 licensed medical doctors, cohort-based upskilling delivered a ~16% improvement in annotation quality over a baseline from leading data providers. The average inter-annotator agreement score was ~87%.
+16%
Uplift in annotation quality
Example Task

A 62-year-old male presents with progressive shortness of breath, bilateral leg edema, and recent weight gain. He has a history of hypertension and type 2 diabetes. Based on the clinical presentation and available information, determine the most likely diagnosis, outline the key differential diagnoses, and describe the next diagnostic steps you would take to confirm your assessment.

Grading rubric

Responses are evaluated on diagnostic accuracy, quality of reasoning, and clinical prioritization. High-quality answers correctly identify the most likely diagnosis, consider relevant differentials, and justify them using symptoms and risk factors. Strong responses also propose appropriate, guideline-aligned next steps (e.g. imaging, labs) and avoid irrelevant or unsafe recommendations.
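The inter-annotator agreement figure above can be understood as pairwise percent agreement: the average fraction of items on which each pair of annotators assigns the same label. A minimal sketch, using hypothetical labels rather than data from the project:

```python
from itertools import combinations

def pairwise_agreement(labels_by_annotator):
    """Mean fraction of items on which each pair of annotators agrees."""
    scores = []
    for a, b in combinations(labels_by_annotator, 2):
        matches = sum(x == y for x, y in zip(a, b))
        scores.append(matches / len(a))
    return sum(scores) / len(scores)

# Hypothetical diagnosis labels from three annotators on five cases
ann = [
    ["CHF", "CHF", "CKD", "CHF", "PE"],
    ["CHF", "CHF", "CKD", "CKD", "PE"],
    ["CHF", "CHF", "CKD", "CHF", "PE"],
]
print(round(pairwise_agreement(ann), 2))  # → 0.87
```

Production-grade projects typically also report chance-corrected metrics such as Cohen's or Fleiss' kappa alongside raw agreement.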

We completed a project capturing multimodal data from real-world desktop and browser workflows. Participants performed multi-step tasks involving navigation and tool use, producing long-horizon interaction traces alongside step-by-step reasoning.
55 steps / task
Avg. trajectory length
Example Task

You are given access to an internal CRM system. A customer reports that their contract was renewed at an incorrect price. Navigate the system to locate the customer record, identify the source of the pricing error, and update the contract to reflect the correct terms. Document the steps taken and any checks performed to ensure the change is accurate.

Grading rubric

Responses are evaluated on task completion, tool navigation, and procedural accuracy. High-quality answers follow a logical sequence of actions, correctly use the relevant interface elements, and resolve the issue without introducing new errors. Strong responses clearly document actions taken, verify the outcome, and demonstrate awareness of system constraints and potential side effects.

We built a realistic environment modeled on a leading medical CRM to capture end-to-end clinician workflows, including information retrieval, navigation, decision-making, and state transitions.
91%
Step-level correctness
Example Task

You are using a medical CRM to manage outpatient appointments. A patient calls to report worsening symptoms and requests an earlier follow-up. Review the patient’s record, assess recent notes and test results, reschedule the appointment to an appropriate time slot, and flag the case for clinician review according to protocol. Record all actions in the system.

Grading rubric

Responses are evaluated on correct system usage, clinical safety, and adherence to workflow protocols. High-quality answers accurately locate and interpret relevant patient data, perform the appropriate CRM actions (rescheduling, flagging, documentation), and respect escalation rules. Strong responses demonstrate cautious judgment, clear documentation, and avoid unauthorized clinical decisions.

Talent network

Access over 300,000 academics and professionals

We’ve invested five years in building an engaged talent pool. Today, this serves as the foundation for our data operations platform.

"I’m thrilled to be part of Huzzle’s human data project! Huzzle made the entire process seamless - from onboarding to understanding the labelling tasks. I felt supported at every step and truly valued for my domain knowledge. It’s exciting to know my input helps train models that could shape the future of healthcare."

Dr. Guru Dutt Tyagi
M.B.B.S. (Gold Medalist)