Under Review

Xinyang Han (UCB), Yiyou Sun (UCB)

Human Resources

High-frequency, repetitive, rule-driven work managing the full employee lifecycle—a strong fit for agent automation and execution-focused benchmarks.

Part 1

HR Overview

HR is often compared to a company's "circulatory system": it does not directly generate revenue, but if it breaks (can't hire, payroll errors, compliance incidents), the entire organization suffers. This makes HR work unusually high-frequency, repetitive, and rule-driven.

The Employee Lifecycle

HR manages the full employee journey through six core functional blocks. The persistent "source of truth" is the HRIS (Human Resource Information System).

Hiring / Recruiting: Find and attract the right talent

Onboarding: Paperwork & provisioning

Payroll & Tax: Pay people correctly

Benefits Administration: Health, retirement, open enrollment

Performance Management: Assess and calibrate

Offboarding: Exit process & compliance

HRIS: The Source of Truth

The Human Resource Information System stores employee master data and feeds all other modules. Data flows from Hiring → Onboarding → HRIS → Payroll/Benefits/Performance.

Workday, SAP SuccessFactors, Oracle HCM, BambooHR

Key Roles & Automation Potential

Different HR roles have varying levels of automation potential based on how deterministic and rule-driven their daily work is.

Best Starting Points for Agents' Last Exam

The most deterministic, rules-driven work clusters in Payroll Specialist, Benefits Specialist, and HR Coordinator—making them natural starting points for benchmarks.

Automation Potential scale: Very High (~90%), High, Medium, Low

Payroll Specialist

Payroll

Automation Potential: Very High (~90%)

Typical Daily Work

Collect time data, compute payroll, verify deductions, file tax forms

Why This Automation Potential

Highly deterministic rules, numeric precision required, regulatory compliance

Common Tools

Gusto, ADP, Paychex

HR Coordinator

Onboarding / Admin

Automation Potential: Very High (~90%)

Typical Daily Work

Collect documents, run onboarding checklists, maintain employee records

Why This Automation Potential

Form processing, deadline-driven, compliance requirements

Common Tools

BambooHR, Workday

Benefits Specialist

Benefits

Automation Potential: High

Typical Daily Work

Manage plans, determine eligibility, handle open enrollment

Why This Automation Potential

Rule-driven eligibility, regulatory knowledge (ACA, COBRA), plan administration

Common Tools

Benefitfocus, Employee Navigator

Recruiter

Hiring

Automation Potential: High

Typical Daily Work

Source candidates, screen resumes, coordinate interviews, follow-ups

Why This Automation Potential

High-volume screening, communication-heavy, external data sources

Common Tools

Greenhouse, Lever, LinkedIn Recruiter

HR Generalist

Cross-functional

Automation Potential: Medium

Typical Daily Work

Employee support, compliance, enrollments, policy questions

Why This Automation Potential

Varied tasks, employee-facing, policy interpretation

Compensation Analyst

Payroll / Comp

Automation Potential: Medium

Typical Daily Work

Salary benchmarking, compensation bands, equity analysis

Why This Automation Potential

Data analysis, market research, fairness evaluation

HRIS Analyst

Systems

Automation Potential: Medium

Typical Daily Work

System configuration, reporting, data quality maintenance

Why This Automation Potential

Technical configuration, report building, data integrity

L&D Specialist

Development

Automation Potential: Medium

Typical Daily Work

Course assignment, completion tracking, training evaluation

Why This Automation Potential

Compliance training tracking, learning paths, certification management

Common Tools

Docebo, Cornerstone

HR Business Partner

Strategy

Automation Potential: Low

Typical Daily Work

Org design, change management, business partnering

Why This Automation Potential

Strategic thinking, relationship building, complex judgment

Part 2

Where LLM Agents Fit

Identifying high-confidence surfaces where AI agents can reliably execute specifiable, reviewable tasks—while respecting the boundary between deterministic operations and judgment-heavy decisions.

The "Human-Last" Boundary

HR contains both deterministic operations (payroll, eligibility, forms) and judgment-heavy work (performance calibration, interpersonal conflict, leadership decisions).

"Agents can reliably execute what humans can specify and verify, while humans retain judgment for ambiguous, high-stakes decisions."

High-Confidence Agent Surfaces (v1)

Most Benchmarkable

Rules + Forms Execution

Compute fields, validate constraints, generate official artifacts with deterministic rules.

Examples

Payroll calculation from time data

W-4 form field validation

Tax withholding computation

Benefits enrollment form processing

Eligibility & Compliance Checks

Deterministic pass/fail checks with auditable evidence based on regulatory rules.

Examples

ACA full-time status determination

COBRA eligibility verification

HSA contribution limits check

401(k) vesting calculation
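
As a minimal sketch of a pass/fail check that carries its own evidence, the function below applies the ACA monthly measurement threshold of 130 hours of service in a calendar month. The function name, the record layout, and the evidence fields are illustrative assumptions, not any vendor's API, and look-back measurement periods are deliberately out of scope.

```python
from dataclasses import dataclass

# ACA monthly measurement method: 130 hours of service in a calendar month.
ACA_MONTHLY_HOURS_THRESHOLD = 130


@dataclass(frozen=True)
class EligibilityResult:
    """Hypothetical result record: the verdict plus the numbers behind it."""
    employee_id: str
    month: str
    hours_of_service: float
    threshold: float
    is_full_time: bool


def aca_full_time_status(employee_id, month, daily_hours):
    """Sum one month of daily hours and compare against the threshold."""
    total = sum(daily_hours)
    return EligibilityResult(
        employee_id=employee_id,
        month=month,
        hours_of_service=total,
        threshold=ACA_MONTHLY_HOURS_THRESHOLD,
        is_full_time=total >= ACA_MONTHLY_HOURS_THRESHOLD,
    )


result = aca_full_time_status("E001", "2024-03", [8.0] * 16 + [2.0])
print(result)
```

Returning the summed hours and the threshold alongside the verdict lets a reviewer re-derive the decision without rerunning the agent.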

Data Plumbing & Reconciliation

HRIS ↔ Payroll ↔ Benefits data consistency checks, diffing, and reporting.

Examples

Employee data sync verification

Payroll vs timesheet reconciliation

Benefits enrollment audit

Year-end data validation
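
A consistency check of this kind can be sketched as a plain diff between two snapshots. The example below assumes each system has already been exported to a dictionary keyed by employee ID; the function name, field names, and report layout are hypothetical.

```python
def reconcile(hris_records, payroll_records, fields):
    """Diff two {employee_id: {field: value}} snapshots.

    Reports employees missing on either side and field-level mismatches
    for employees present in both.
    """
    report = {"missing_in_payroll": [], "missing_in_hris": [], "mismatches": []}
    for emp_id in sorted(set(hris_records) | set(payroll_records)):
        if emp_id not in payroll_records:
            report["missing_in_payroll"].append(emp_id)
        elif emp_id not in hris_records:
            report["missing_in_hris"].append(emp_id)
        else:
            for field in fields:
                hris_value = hris_records[emp_id].get(field)
                payroll_value = payroll_records[emp_id].get(field)
                if hris_value != payroll_value:
                    report["mismatches"].append({
                        "employee_id": emp_id,
                        "field": field,
                        "hris": hris_value,
                        "payroll": payroll_value,
                    })
    return report


hris = {"E1": {"salary": 50000}, "E2": {"salary": 60000}, "E3": {"salary": 70000}}
payroll = {"E1": {"salary": 50000}, "E2": {"salary": 61000}, "E4": {"salary": 55000}}
print(reconcile(hris, payroll, ["salary"]))
```

Sorting the employee IDs keeps the report deterministic, which matters when the report itself is a scored deliverable.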

Artifact Packaging

Produce standardized outputs with explanations and traceable intermediate steps.

Examples

W-2 generation with audit trail

Pay stub creation

Onboarding document packages

Compliance reports

Pain Points → Agent Opportunities

HR's defining trait is operational "busywork" with real consequences. The pain points below share the same properties: clear rules, repeated operations, and a high cost of mistakes, which makes them excellent targets for agents.

Hiring

Pain: Constant follow-ups; recruiter search limits constrain scale

Agent help: Automated screening, scheduling, follow-up tracking

Onboarding

Pain: Many required forms (I-9, W-4, direct deposit); missing any triggers compliance risk

Agent help: Form validation, deadline tracking, checklist automation

Payroll

Pain: Multi-state taxation edge cases; errors immediately damage trust

Agent help: Deterministic calculation, error detection, verification

Benefits

Pain: ACA/COBRA/HSA eligibility rules are strict; manual determination is error-prone

Agent help: Rule-based eligibility determination with evidence

Why Agent Evaluation Should Emphasize Evidence

The hard part is making end-to-end automation reliable under real-world interruptions (missing data, regulation changes, people not responding). This is exactly why agent evaluation should emphasize tool use, evidence, and verification.

Part 3

Example Tasks

Benchmarkable tasks that evaluate end-to-end execution on a computer: given raw inputs and constraints, the agent must use real tools/APIs to produce deliverables that a reviewer can verify.

Design Principles

Must Use Software

Tasks require tools/APIs/scripts; pure LLM reasoning is not acceptable.

Raw I/O Only

Define only inputs, outputs, and acceptance criteria; agent chooses the path.

Operational Scoring

Numeric matching or deterministic pass/fail; automated reviewer verification.

Scalable Data

Prioritize public rules + synthetic data generation for scalability.

Workflow Depth

Each workflow supports ~20–30 parametrized task instances.

Core Tasks (3)

Payroll (Core)

Payroll Calculation (Pay Stub Generation)

A Payroll Specialist computes pay for a pay period, generating accurate pay stubs with all required fields.
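
To make the shape of this task concrete, here is a minimal gross-to-net sketch for an hourly employee. It assumes overtime at 1.5x, employee-side Social Security at 6.2% and Medicare at 1.45%, and takes federal income tax withholding as an input rather than deriving it from IRS Publication 15-T. It ignores the Social Security wage base, Additional Medicare Tax, state taxes, and pre-tax deductions. The function and field names are illustrative only.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")
SOCIAL_SECURITY_RATE = Decimal("0.062")   # employee share
MEDICARE_RATE = Decimal("0.0145")         # employee share
OVERTIME_MULTIPLIER = Decimal("1.5")


def to_cents(amount):
    """Round a Decimal to whole cents, halves rounding up."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def compute_pay_stub(hourly_rate, regular_hours, overtime_hours, federal_withholding):
    """Build a simplified pay stub; all arguments are decimal strings."""
    rate = Decimal(hourly_rate)
    regular_pay = to_cents(rate * Decimal(regular_hours))
    overtime_pay = to_cents(rate * OVERTIME_MULTIPLIER * Decimal(overtime_hours))
    gross = regular_pay + overtime_pay
    social_security = to_cents(gross * SOCIAL_SECURITY_RATE)
    medicare = to_cents(gross * MEDICARE_RATE)
    federal = to_cents(Decimal(federal_withholding))
    return {
        "regular_pay": regular_pay,
        "overtime_pay": overtime_pay,
        "gross_pay": gross,
        "social_security": social_security,
        "medicare": medicare,
        "federal_withholding": federal,
        "net_pay": gross - social_security - medicare - federal,
    }


stub = compute_pay_stub("25.00", "80", "5", "210.00")
print(stub["net_pay"])
```

Using Decimal with an explicit rounding rule, rather than floats, is what makes a cent-level tolerance meaningful; the rounding mode itself is an assumption and should follow the payroll provider's convention.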

Benefits (Core)

Benefits Eligibility Determination

HR determines whether an employee is eligible for a specific benefit. Many eligibility rules are strictly defined by law.

Tax (Core)

W-2 Generation (Per-Box Values)

At year end, generate W-2 forms where many boxes follow explicit IRS definitions.
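
As one illustration of why the boxes differ, the sketch below derives Boxes 1, 3, and 5 from annual totals. It assumes traditional 401(k) deferrals reduce Box 1 only, health premiums paid through a Section 125 cafeteria plan reduce all three boxes, and Box 3 is capped at the Social Security wage base. The wage base changes yearly, so it is passed in by the caller. Tips, other exclusions, and remaining boxes are omitted, and all names here are hypothetical.

```python
from decimal import Decimal


def w2_wage_boxes(gross_wages, traditional_401k, section125_premiums, ss_wage_base):
    """Derive simplified W-2 wage boxes from annual totals (decimal strings)."""
    gross = Decimal(gross_wages)
    deferrals = Decimal(traditional_401k)
    premiums = Decimal(section125_premiums)
    # Section 125 premiums are excluded from both income and FICA wages.
    fica_wages = gross - premiums
    return {
        # Box 1 also excludes traditional 401(k) elective deferrals.
        "box1_wages": gross - deferrals - premiums,
        # Box 3 is limited to the annual Social Security wage base.
        "box3_social_security_wages": min(fica_wages, Decimal(ss_wage_base)),
        # Box 5 has no cap.
        "box5_medicare_wages": fica_wages,
    }


# The wage base below is a placeholder value, not a statement about any tax year.
print(w2_wage_boxes("200000", "20000", "5000", "150000"))
```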

Optional Tasks

Time Off (Optional)

PTO Balance Calculation

Calculate current PTO balances based on policy rules, accrual rates, caps, and usage history.
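
A balance calculation of this kind might look like the sketch below. The policy it encodes is an assumption made for illustration: a fixed accrual is added each pay period, the balance is capped immediately after accrual, and usage is deducted afterward. Real policies differ in ordering, carryover, and negative-balance rules.

```python
from decimal import Decimal


def pto_balance(opening_balance, accrual_per_period, cap, used_per_period):
    """Walk the pay periods in order and return the closing balance in hours.

    used_per_period lists the hours taken in each period, as decimal strings.
    """
    balance = Decimal(opening_balance)
    accrual = Decimal(accrual_per_period)
    cap_hours = Decimal(cap)
    for period, used in enumerate(used_per_period, start=1):
        # Accrue first, then apply the cap (assumed policy order).
        balance = min(balance + accrual, cap_hours)
        balance -= Decimal(used)
        if balance < 0:
            raise ValueError(f"usage exceeds balance in period {period}")
    return balance


print(pto_balance("70", "4", "80", ["0", "0", "8", "0"]))  # 76
```

In the example, the third period's accrual would reach 82 hours, so the cap trims it to 80 before the 8 hours of usage are deducted.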

Recruiting (Optional)

Interview Schedule Validation

Validate a proposed interview schedule against calendars and constraints (not generation—validation only).
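
The core of such a validator is an interval-overlap test. The sketch below assumes calendars have already been fetched into plain Python structures; the input layout and the conflict messages are hypothetical, and constraints beyond calendar overlap (time zones, room capacity, interviewer load) are left out.

```python
from datetime import datetime


def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals [start, end) intersect."""
    return a_start < b_end and b_start < a_end


def validate_schedule(proposed, busy):
    """Check proposed slots against each interviewer's existing events.

    proposed: list of (interviewer, start, end)
    busy: {interviewer: [(start, end), ...]}
    """
    conflicts = []
    for interviewer, start, end in proposed:
        if end <= start:
            conflicts.append((interviewer, start, end, "slot does not end after it starts"))
            continue
        for busy_start, busy_end in busy.get(interviewer, []):
            if overlaps(start, end, busy_start, busy_end):
                conflicts.append((interviewer, start, end, "overlaps existing event"))
                break
    return {"valid": not conflicts, "conflicts": conflicts}


busy = {"alice": [(datetime(2024, 5, 6, 10, 0), datetime(2024, 5, 6, 11, 0))]}
proposed = [
    ("alice", datetime(2024, 5, 6, 10, 30), datetime(2024, 5, 6, 11, 30)),
    ("bob", datetime(2024, 5, 6, 10, 0), datetime(2024, 5, 6, 11, 0)),
]
print(validate_schedule(proposed, busy))
```

Treating intervals as half-open means back-to-back meetings (one ending at 11:00, the next starting at 11:00) do not count as a conflict.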

Data Availability for Benchmarking

Data Type	Availability	Examples
Job descriptions / resumes	★★★★★	Kaggle resume/JD datasets
Skills / occupation taxonomy	★★★★★	O*NET, ESCO
Wage statistics	★★★★★	BLS (API available)
Tax rules	★★★★★	IRS Publication 15-T + state authorities
Benefits regulations	★★★★★	IRS Q&A (67 cases), DOL guidance
HR software sandboxes	★★★☆☆	Gusto demo, Lever sandbox, BambooHR
Real company data	★☆☆☆☆	Requires privacy partnerships

Key insight: The bottleneck is private company data. For benchmarking, this is manageable—synthetic data + public regulations/forms can cover the deterministic core.

HR Software Ecosystem

The SMB tier is often most developer-friendly with accessible APIs and sandboxes for benchmarking.

Gusto

BambooHR

Workday

ADP

SAP SuccessFactors

Oracle HCM Cloud

Paylocity

Paycom

Rippling

Merge.dev

Finch

OrangeHRM

Frappe HR

Greenhouse

Lever

Ashby

Task Comparison Summary

Task	Required Software	Scoring	Scale	Priority
Payroll calculation	Gusto API	Net Pay ± $0.01	~30 tasks	Core
Benefits eligibility	BambooHR / Finch	Pass/Fail	~30 tasks	Core
W-2 generation	TaxBandits / fire-1099	Per-box match	~30 tasks	Core
PTO balance	BambooHR Time Off	Numeric match	~15 tasks	Optional
Interview validation	Cronofy Calendar	Valid/Invalid + conflicts	~15 tasks	Optional

Review Agent Design (Scoring Layers)

L1: Deterministic Checks

What to check:

• Output schema validation
• Numeric values match ground truth
• Evidence of required API/tool calls

L2: Evidence Validation

What to check:

• Intermediate artifacts support final output
• API responses correctly parsed
• Computation steps are traceable in the execution trace
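
The first layer can be sketched as a single function that runs the three deterministic checks in order. The trace format (a list of steps with a "tool" key), the tool name, the field names, and the default tolerance of 0.01 are assumptions for illustration; they are not a specification of the actual review agent.

```python
from decimal import Decimal


def l1_check(output, ground_truth, trace, required_tools, tolerance="0.01"):
    """Run schema, numeric, and tool-evidence checks on one submission."""
    # 1. Schema: every ground-truth field must be present in the output.
    missing_fields = sorted(set(ground_truth) - set(output))

    # 2. Numeric match: each present field must be within the tolerance.
    tol = Decimal(tolerance)
    mismatched_fields = []
    for field in sorted(set(ground_truth) & set(output)):
        diff = abs(Decimal(str(output[field])) - Decimal(str(ground_truth[field])))
        if diff > tol:
            mismatched_fields.append(field)

    # 3. Evidence: every required tool must appear in the trace.
    called = {step["tool"] for step in trace}
    missing_tools = sorted(set(required_tools) - called)

    return {
        "missing_fields": missing_fields,
        "mismatched_fields": mismatched_fields,
        "missing_tools": missing_tools,
        "passed": not (missing_fields or mismatched_fields or missing_tools),
    }


verdict = l1_check(
    output={"gross_pay": "2187.50", "net_pay": "1810.15"},
    ground_truth={"gross_pay": "2187.50", "net_pay": "1810.16"},
    trace=[{"tool": "payroll_api"}],
    required_tools=["payroll_api"],
)
print(verdict)
```

Returning the individual failure lists, not just a boolean, gives the L2 layer and human reviewers something specific to inspect.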

Contribute to Human Resources

We seek high-level, representative contributions—not exhaustive documentation. Share your expertise in any of these areas:

Submit Landscape Understanding

Help us map sectors, roles, tasks, and tools in human resources. Share your perspective on the industry structure.

Submit a Workflow

Describe a specific professional task with tools, inputs, outputs, and how success is verified.

Our Commitments to Contributors

Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
Partner Review: Industry partners can review and approve task specifications before public release.
Data Control: Contributors can exclude sensitive or proprietary data from submissions.