Under Review
Xinyang Han (UCB), Yiyou Sun (UCB)

Human Resources

High-frequency, repetitive, rule-driven work managing the full employee lifecycle—a strong fit for agent automation and execution-focused benchmarks.

Part 1

HR Overview

HR is often compared to a company's "circulatory system": it does not directly generate revenue, but when it breaks (hiring stalls, payroll errors, compliance incidents), the entire organization suffers. Keeping it running takes work that is unusually high-frequency, repetitive, and rule-driven.

The Employee Lifecycle

HR manages the full employee journey through six core functional blocks. The persistent "source of truth" is the HRIS (Human Resource Information System).

  • Hiring / Recruiting: Find and attract the right talent
  • Onboarding: Paperwork & provisioning
  • Payroll & Tax: Pay people correctly
  • Benefits Administration: Health, retirement, open enrollment
  • Performance Management: Assess and calibrate
  • Offboarding: Exit process & compliance

HRIS: The Source of Truth

The Human Resource Information System stores employee master data and feeds all other modules. Data flows from Hiring → Onboarding → HRIS → Payroll/Benefits/Performance.

Common Tools: Workday, SAP SuccessFactors, Oracle HCM, BambooHR

Key Roles & Automation Potential

Different HR roles have varying levels of automation potential based on how deterministic and rule-driven their daily work is.

Best Starting Points for AgentHLE

The most deterministic, rules-driven work clusters in the Payroll Specialist, Benefits Specialist, and HR Coordinator roles, making them natural starting points for benchmarks.

Automation Potential: Very High (~90%), High, Medium, Low

Payroll Specialist

Payroll

Automation Potential: Very High (~90%)
Typical Daily Work: Collect time data, compute payroll, verify deductions, file tax forms
Why This Automation Potential: Highly deterministic rules, numeric precision required, regulatory compliance
Common Tools: Gusto, ADP, Paychex

HR Coordinator

Onboarding / Admin

Automation Potential: Very High (~90%)
Typical Daily Work: Collect documents, run onboarding checklists, maintain employee records
Why This Automation Potential: Form processing, deadline-driven, compliance requirements
Common Tools: BambooHR, Workday

Benefits Specialist

Benefits

Automation Potential: High
Typical Daily Work: Manage plans, determine eligibility, handle open enrollment
Why This Automation Potential: Rule-driven eligibility, regulatory knowledge (ACA, COBRA), plan administration
Common Tools: Benefitfocus, Employee Navigator

Recruiter

Hiring

Automation Potential: High
Typical Daily Work: Source candidates, screen resumes, coordinate interviews, follow-ups
Why This Automation Potential: High-volume screening, communication heavy, external data sources
Common Tools: Greenhouse, Lever, LinkedIn Recruiter

HR Generalist

Cross-functional

Automation Potential: Medium
Typical Daily Work: Employee support, compliance, enrollments, policy questions
Why This Automation Potential: Varied tasks, employee-facing, policy interpretation

Compensation Analyst

Payroll / Comp

Automation Potential: Medium
Typical Daily Work: Salary benchmarking, compensation bands, equity analysis
Why This Automation Potential: Data analysis, market research, fairness evaluation

HRIS Analyst

Systems

Automation Potential: Medium
Typical Daily Work: System configuration, reporting, data quality maintenance
Why This Automation Potential: Technical configuration, report building, data integrity

L&D Specialist

Development

Automation Potential: Medium
Typical Daily Work: Course assignment, completion tracking, training evaluation
Why This Automation Potential: Compliance training tracking, learning paths, certification management
Common Tools: Docebo, Cornerstone

HR Business Partner

Strategy

Automation Potential: Low
Typical Daily Work: Org design, change management, business partnering
Why This Automation Potential: Strategic thinking, relationship building, complex judgment

Part 2

Where LLM Agents Fit

Identifying high-confidence surfaces where AI agents can reliably execute specifiable, reviewable tasks—while respecting the boundary between deterministic operations and judgment-heavy decisions.

The "Human-Last" Boundary

HR contains both deterministic operations (payroll, eligibility, forms) and judgment-heavy work (performance calibration, interpersonal conflict, leadership decisions).

"Agents can reliably execute what humans can specify and verify, while humans retain judgment for ambiguous, high-stakes decisions."

High-Confidence Agent Surfaces (v1): Most Benchmarkable

Rules + Forms Execution

Compute fields, validate constraints, generate official artifacts with deterministic rules.

Eligibility & Compliance Checks

Deterministic pass/fail checks with auditable evidence based on regulatory rules.

Data Plumbing & Reconciliation

HRIS ↔ Payroll ↔ Benefits data consistency checks, diffing, and reporting.

Artifact Packaging

Produce standardized outputs with explanations and traceable intermediate steps.
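
As a minimal sketch of the data plumbing and reconciliation surface above, the snippet below diffs employee records between two systems and reports every mismatch. The record layout, the field names, and the `diff_records` helper are hypothetical; they stand in for whatever the real HRIS and payroll exports contain.

```python
def diff_records(hris, payroll, fields):
    """Compare two {employee_id: record} mappings and list every mismatch.

    Each issue is a tuple: (employee_id, field_or_reason, hris_value, payroll_value).
    """
    issues = []
    for emp_id in sorted(set(hris) | set(payroll)):
        if emp_id not in payroll:
            issues.append((emp_id, "missing_in_payroll", None, None))
        elif emp_id not in hris:
            issues.append((emp_id, "missing_in_hris", None, None))
        else:
            for field in fields:
                hris_value = hris[emp_id].get(field)
                payroll_value = payroll[emp_id].get(field)
                if hris_value != payroll_value:
                    issues.append((emp_id, field, hris_value, payroll_value))
    return issues


# Synthetic records (hypothetical layout).
hris = {
    "E001": {"state": "CA", "salary": 85000},
    "E002": {"state": "NY", "salary": 72000},
    "E003": {"state": "TX", "salary": 64000},
}
payroll = {
    "E001": {"state": "CA", "salary": 85000},
    "E002": {"state": "NJ", "salary": 72000},
}

report = diff_records(hris, payroll, ["state", "salary"])
for row in report:
    print(row)
```

Returning one row per discrepancy, rather than a single pass/fail flag, keeps the result reviewable: each row names the employee, the field, and both conflicting values.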

Pain Points → Agent Opportunities

HR's defining trait is operational "busywork" with real consequences. The pain points below share the same properties: clear rules, repeated operations, and a high cost of mistakes. That combination makes them excellent candidates for agents.

Hiring

Pain: Constant follow-ups; recruiter search limits constrain scale

Agent help: Automated screening, scheduling, follow-up tracking

Onboarding

Pain: Many required forms (I-9, W-4, direct deposit); missing any triggers compliance risk

Agent help: Form validation, deadline tracking, checklist automation

Payroll

Pain: Multi-state taxation edge cases; errors immediately damage trust

Agent help: Deterministic calculation, error detection, verification

Benefits

Pain: ACA/COBRA/HSA eligibility rules are strict; manual determination is error-prone

Agent help: Rule-based eligibility determination with evidence

Why Agent Evaluation Should Emphasize Evidence

The hard part is making end-to-end automation reliable under real-world interruptions (missing data, regulation changes, people not responding). This is exactly why agent evaluation should emphasize tool use, evidence, and verification.

Part 3

Example Tasks

Benchmarkable tasks that evaluate end-to-end execution on a computer: given raw inputs and constraints, the agent must use real tools/APIs to produce deliverables that a reviewer can verify.

Design Principles

Must Use Software

Tasks require tools/APIs/scripts; pure LLM reasoning is not acceptable.

Raw I/O Only

Define only inputs, outputs, and acceptance criteria; agent chooses the path.

Operational Scoring

Numeric matching or deterministic pass/fail; automated reviewer verification.

Scalable Data

Prioritize public rules + synthetic data generation for scalability.

Workflow Depth

Each workflow supports ~20–30 parametrized task instances.

Core Tasks (3)

Payroll (Core)

Payroll Calculation (Pay Stub Generation)

A Payroll Specialist computes pay for a pay period, generating accurate pay stubs with all required fields.
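
A pay stub calculation of this kind might look like the sketch below. It is illustrative only: the 40-hour overtime threshold, the 1.5x multiplier, the flat withholding rate, and the `compute_pay_stub` helper are assumptions chosen for the example, not actual tax tables or any vendor's API. Amounts use `Decimal` so results stay exact to the cent.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_cents(amount):
    """Round a Decimal amount to whole cents."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def compute_pay_stub(hours, hourly_rate, withholding_rate, pretax_deduction):
    """Compute a simplified weekly pay stub (all rules here are assumptions)."""
    regular_hours = min(hours, Decimal("40"))
    overtime_hours = max(hours - Decimal("40"), Decimal("0"))
    gross = to_cents(
        regular_hours * hourly_rate
        + overtime_hours * hourly_rate * Decimal("1.5")
    )
    taxable = gross - pretax_deduction
    withholding = to_cents(taxable * withholding_rate)  # placeholder flat rate
    net = gross - pretax_deduction - withholding
    return {
        "gross_pay": gross,
        "pretax_deduction": pretax_deduction,
        "withholding": withholding,
        "net_pay": net,
    }


stub = compute_pay_stub(
    hours=Decimal("45"),
    hourly_rate=Decimal("30.00"),
    withholding_rate=Decimal("0.20"),
    pretax_deduction=Decimal("100.00"),
)
print(stub)
```

With these inputs, gross pay is 1425.00 (40 regular hours plus 5 overtime hours), withholding is 265.00, and net pay is 1060.00. A real task instance would replace the flat rate with the applicable published withholding rules.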

Benefits (Core)

Benefits Eligibility Determination

HR determines whether an employee is eligible for a specific benefit. Many eligibility rules are strictly defined by law.
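
One way to make such a determination auditable is to return the evidence for every rule alongside the verdict. The sketch below assumes a hypothetical plan with two rules, a minimum average of weekly hours and a waiting period after hire; the thresholds, field names, and the `check_eligibility` helper are illustrative and are not taken from any specific regulation or plan document.

```python
from datetime import date


def check_eligibility(employee, as_of, min_weekly_hours=30, waiting_period_days=60):
    """Evaluate hypothetical plan rules and keep per-rule evidence."""
    tenure_days = (as_of - employee["hire_date"]).days
    evidence = [
        {
            "rule": "min_weekly_hours",
            "observed": employee["avg_weekly_hours"],
            "required": min_weekly_hours,
            "passed": employee["avg_weekly_hours"] >= min_weekly_hours,
        },
        {
            "rule": "waiting_period_days",
            "observed": tenure_days,
            "required": waiting_period_days,
            "passed": tenure_days >= waiting_period_days,
        },
    ]
    return {
        "eligible": all(item["passed"] for item in evidence),
        "evidence": evidence,
    }


employee = {"hire_date": date(2024, 1, 15), "avg_weekly_hours": 32}
result = check_eligibility(employee, as_of=date(2024, 3, 1))
print(result["eligible"])
for item in result["evidence"]:
    print(item)
```

Here the employee meets the hours rule but has only 46 days of tenure, so the result is not eligible, and the evidence list shows exactly which rule failed and by how much.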

Tax (Core)

W-2 Generation (Per-Box Values)

At year end, generate W-2 forms where many boxes follow explicit IRS definitions.
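
Per-box scoring for this task could be sketched as below. The box values are synthetic, and the `compare_boxes` helper and exact-match rule are assumptions for illustration; the snippet shows only how a submission might be compared against ground truth box by box, not how the box values themselves are derived.

```python
from decimal import Decimal


def compare_boxes(expected, submitted):
    """Return {box: True/False} for an exact per-box comparison."""
    results = {}
    for box, want in expected.items():
        got = submitted.get(box)
        results[box] = got is not None and Decimal(got) == Decimal(want)
    return results


# Synthetic ground truth and submission, keyed by W-2 box number.
expected = {"1": "58000.00", "2": "6960.00", "3": "60000.00", "4": "3720.00"}
submitted = {"1": "58000.00", "2": "6960.00", "3": "60000.00", "4": "3270.00"}

results = compare_boxes(expected, submitted)
score = sum(results.values()) / len(results)
print(results)
print(score)
```

In this example three of the four boxes match, so the score is 0.75 and the reviewer can see that box 4 is the one that differs.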

Optional Tasks

Time Off (Optional)

PTO Balance Calculation

Calculate current PTO balances based on policy rules, accrual rates, caps, and usage history.
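
A PTO balance calculation with an accrual cap might be sketched as follows. The policy here is an assumption made for the example: hours accrue at the start of each pay period, the cap is applied at accrual time, and usage is subtracted afterwards. Real policies differ on all three points, and the `pto_balance` helper is hypothetical.

```python
from decimal import Decimal


def pto_balance(starting_balance, accrual_per_period, periods, cap, usage_by_period):
    """Roll a PTO balance forward period by period under an assumed policy."""
    balance = starting_balance
    for period in range(1, periods + 1):
        # Accrue first, never exceeding the cap.
        balance = min(balance + accrual_per_period, cap)
        # Then subtract any hours used in this period.
        balance -= usage_by_period.get(period, Decimal("0"))
    return balance


balance = pto_balance(
    starting_balance=Decimal("70"),
    accrual_per_period=Decimal("4"),
    periods=6,
    cap=Decimal("80"),
    usage_by_period={3: Decimal("16"), 5: Decimal("8")},
)
print(balance)
```

The cap matters in period 3: the balance would reach 82 hours but is held at 80 before the 16 hours of usage are deducted, so the final balance is 68 rather than the 70 a naive sum would give.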

Recruiting (Optional)

Interview Schedule Validation

Validate a proposed interview schedule against calendars and constraints (not generation—validation only).
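
The core of such a validation is an interval-overlap check. The sketch below assumes calendars are already available as in-memory lists of busy intervals; the data layout and the `find_conflicts` helper are hypothetical, and a real task would fetch the calendars through a calendar tool or API.

```python
from datetime import datetime


def find_conflicts(proposed, busy):
    """List every proposed slot that overlaps an interviewer's busy interval.

    proposed: list of (interviewer, start, end)
    busy: {interviewer: [(start, end), ...]}
    """
    conflicts = []
    for person, start, end in proposed:
        for busy_start, busy_end in busy.get(person, []):
            # Half-open intervals: back-to-back meetings do not overlap.
            if start < busy_end and busy_start < end:
                conflicts.append((person, start, end, busy_start, busy_end))
    return conflicts


day = (2025, 3, 10)
busy = {
    "alice": [(datetime(*day, 10, 0), datetime(*day, 11, 0))],
    "bob": [(datetime(*day, 13, 0), datetime(*day, 14, 0))],
}
proposed = [
    ("alice", datetime(*day, 10, 30), datetime(*day, 11, 30)),
    ("bob", datetime(*day, 14, 0), datetime(*day, 15, 0)),
]

conflicts = find_conflicts(proposed, busy)
is_valid = not conflicts
print(is_valid)
for conflict in conflicts:
    print(conflict)
```

Alice's slot overlaps her existing meeting, so the schedule is invalid; Bob's slot starts exactly when his meeting ends and is accepted.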

Data Availability for Benchmarking

Data Type | Availability | Examples
Job descriptions / resumes | ★★★★★ | Kaggle resume/JD datasets
Skills / occupation taxonomy | ★★★★★ | O*NET, ESCO
Wage statistics | ★★★★★ | BLS (API available)
Tax rules | ★★★★★ | IRS Publication 15-T + state authorities
Benefits regulations | ★★★★★ | IRS Q&A (67 cases), DOL guidance
HR software sandboxes | ★★★☆☆ | Gusto demo, Lever sandbox, BambooHR
Real company data | ★☆☆☆☆ | Requires privacy partnerships

Key insight: The bottleneck is private company data. For benchmarking, this is manageable—synthetic data + public regulations/forms can cover the deterministic core.

HR Software Ecosystem

The SMB tier is often the most developer-friendly, with accessible APIs and sandboxes for benchmarking.

Gusto
BambooHR
Workday
ADP
SAP SuccessFactors
Oracle HCM Cloud
Paylocity
Paycom
Rippling
Merge.dev
Finch
OrangeHRM
Frappe HR
Greenhouse
Lever
Ashby

Task Comparison Summary

Task | Required Software | Scoring | Scale | Priority
Payroll calculation | Gusto API | Net Pay ± $0.01 | ~30 tasks | Core
Benefits eligibility | BambooHR / Finch | Pass/Fail | ~30 tasks | Core
W-2 generation | TaxBandits / fire-1099 | Per-box match | ~30 tasks | Core
PTO balance | BambooHR Time Off | Numeric match | ~15 tasks | Optional
Interview validation | Cronofy Calendar | Valid/Invalid + conflicts | ~15 tasks | Optional

Review Agent Design (Scoring Layers)

L1: Deterministic Checks

What to check:

  • Output schema validation
  • Numeric values match ground truth
  • Evidence of required API/tool calls

L2: Evidence Validation

What to check:

  • Intermediate artifacts support final output
  • API responses correctly parsed
  • Computation steps traceable in the execution trace
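
The L1 deterministic checks could be sketched roughly as below, using the payroll task's net pay tolerance of $0.01 from the task comparison summary. The required field set, the tool name `payroll_api.run_payroll`, and the `l1_check` helper are hypothetical placeholders, not part of any defined reviewer interface.

```python
from decimal import Decimal

# Hypothetical output schema for a payroll task.
REQUIRED_FIELDS = {"employee_id", "gross_pay", "net_pay"}


def l1_check(output, ground_truth, tool_calls, required_tool,
             tolerance=Decimal("0.01")):
    """Run schema, numeric, and tool-use checks on one agent submission."""
    schema_ok = REQUIRED_FIELDS <= set(output)
    numeric_ok = schema_ok and (
        abs(Decimal(output["net_pay"]) - Decimal(ground_truth["net_pay"]))
        <= tolerance
    )
    tool_ok = required_tool in tool_calls
    return {
        "schema": schema_ok,
        "numeric": numeric_ok,
        "tool_use": tool_ok,
        "passed": schema_ok and numeric_ok and tool_ok,
    }


output = {"employee_id": "E001", "gross_pay": "1425.00", "net_pay": "1060.00"}
ground_truth = {"net_pay": "1060.01"}
tool_calls = ["payroll_api.run_payroll"]  # names extracted from the agent trace

verdict = l1_check(output, ground_truth, tool_calls, "payroll_api.run_payroll")
print(verdict)
```

Reporting each check separately, instead of only the final `passed` flag, lets a reviewer distinguish a malformed output from a wrong number or a missing tool call.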

Contribute to Human Resources

We seek high-level, representative contributions—not exhaustive documentation. Share your expertise in any of these areas:

Our Commitments to Contributors

  • Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
  • Partner Review: Industry partners can review and approve task specifications before public release.
  • Data Control: Contributors can exclude sensitive or proprietary data from submissions.