🏦
Under Review
Xinyang Han (UCB), Yiyou Sun (UCB)

Finance

Workflow-driven, evidence-rich domain with strict audit/compliance requirements: ideal for execution-focused AI agent benchmarks.

Part 1

Finance Overview

Finance is one of the most promising domains for AI agent evaluation because the work is highly workflow-driven, produces abundant digital artifacts, and is constrained by strict correctness / auditability / compliance requirements.

Market Structure

Understanding the fundamental split between primary and secondary markets, and the buy-side/sell-side ecosystem.

Capital Flow

Primary Market

Companies raise capital directly from investors

IPO, Bond issuance, Private placements
Examples: Deal structuring, Underwriting, Pricing

Secondary Market

Already-issued securities trade among investors

Liquidity provision, Price discovery, Trading
Examples: NYSE, NASDAQ, HKEX, SSE
Ecosystem

Buy-Side

Invests on behalf of end investors to generate returns

Participants
Asset Managers, Hedge Funds, Private Equity, Pension Funds, Insurers
Interaction Loop

Sell-Side

Creates/distributes products and services

Participants
Investment Banks, Broker-Dealers, Research Shops, Market Makers

Career transition: A common path is moving from sell-side to buy-side after building expertise and relationships.

Organizational Structure & Key Roles

How finance institutions organize around front/middle/back office functions, and the representative roles where execution-heavy work occurs.

Office Structure
Revenue
Control
Operations

Front Office

Revenue generation

Investment Bankers, Traders, Research Analysts, Investment Managers

High compensation, high pressure, client-facing, long hours

Middle Office

Risk + compliance control

Risk Managers, Compliance Officers, Finance Controllers

Ensure front office stays within risk/compliance boundaries

Back Office

Operations support

Settlement, IT, Customer Support, HR

Ensure business runs reliably (books & records, reconciliation, reporting)

Key Roles

IBD Analyst

Investment Banking

Execution engine. Builds models (3-statement, DCF, LBO), drafts pitch decks, runs diligence, manages VDRs.


Hours
70–90+ / week
Entry Comp
$100K–$200K
Career Progression
Analyst (2-3 yr)
Associate (3-4 yr)
VP (3-4 yr)
MD

Portfolio Manager

Asset Management

Final decision-maker; portfolio construction, security selection, risk, fundraising/IR.


Hours
45–60 / week
Core Skills
Investment judgment, Risk management, Client relations

Quant Researcher

Quant Trading

Discovers signals/alphas; runs backtests; iterates hypotheses.


Core Skills
Statistics, Stochastic calculus, Python/R/C++

Research Analyst

Sell-Side Research

Maintains coverage universe; builds earnings/valuation models; publishes reports with ratings/targets.


Hours
~70 / week
Career Progression
Associate (2-3 yr)
Analyst
Part 2

Where LLM Agents Fit

Identifying high-confidence surfaces where AI agents can reliably execute specifiable, reviewable tasks, while respecting the "human-last" boundary for high-stakes decisions.

The "Human-Last" Boundary

Finance has a hard constraint: errors are expensive, and many decisions are regulated and audit-sensitive. The stable pattern is:

Agents can reliably execute what humans can specify and verify, but humans retain judgment for ambiguous, high-stakes decisions.
High-Confidence Agent Surfaces (v1)

Research → Evidence Packaging

Sell-side and buy-side research automation

High confidence
  • Pull data (prices, fundamentals, filings), build standardized tables
  • Maintain/update a valuation model (spreadsheet outputs, checks)
  • Draft structured summaries from predefined templates
  • Create plots, dashboards, and "what changed" diffs
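
As an illustration of the "what changed" diff mentioned above, here is a minimal sketch that compares two snapshots of model inputs. The function name `what_changed`, the metric names, and all numbers are hypothetical; a real pipeline would load the snapshots from filings or spreadsheets rather than hard-coding them.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, Optional[float], Optional[float]]


def what_changed(old: Dict[str, float], new: Dict[str, float],
                 tolerance: float = 1e-9) -> List[Change]:
    """Return (metric, old_value, new_value) for each added, removed, or changed metric."""
    changes: List[Change] = []
    for metric in sorted(set(old) | set(new)):
        before = old.get(metric)
        after = new.get(metric)
        if before is None or after is None:
            # Metric exists in only one snapshot: report it as added or removed.
            changes.append((metric, before, after))
        elif abs(after - before) > tolerance:
            changes.append((metric, before, after))
    return changes


# Hypothetical snapshots of a valuation model's inputs (made-up numbers).
previous = {"revenue": 1200.0, "ebitda": 300.0, "net_debt": 150.0}
current = {"revenue": 1250.0, "ebitda": 300.0, "capex": 80.0}

diff = what_changed(previous, current)
for metric, before, after in diff:
    print(f"{metric}: {before} -> {after}")
```

Because the output is a sorted list of explicit before/after pairs, a reviewer can check it line by line against the two source snapshots.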

Quant Research & Engineering

Strategy implementation and validation

High confidence
  • Implement a strategy specification in a backtest framework
  • Run backtests, produce standardized performance reports
  • Export trade logs and validate signal → trade consistency
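
One way to read the signal-to-trade consistency check is: replaying the exported trade log should reproduce the positions the signals asked for. The sketch below assumes a single instrument, a flat starting position, and integer target positions; the function names and the sample signal series are illustrative, not part of any specific backtest framework.

```python
from typing import Dict, List


def trades_from_signals(signals: List[int]) -> List[Dict]:
    """Emit one trade whenever the target position changes (starting flat)."""
    trades = []
    position = 0
    for day, target in enumerate(signals):
        delta = target - position
        if delta != 0:
            trades.append({"day": day,
                           "side": "buy" if delta > 0 else "sell",
                           "qty": abs(delta)})
            position = target
    return trades


def positions_from_trades(trades: List[Dict], n_days: int) -> List[int]:
    """Replay a trade log into an end-of-day position series."""
    by_day = {t["day"]: t for t in trades}
    positions = []
    position = 0
    for day in range(n_days):
        trade = by_day.get(day)
        if trade is not None:
            signed_qty = trade["qty"] if trade["side"] == "buy" else -trade["qty"]
            position += signed_qty
        positions.append(position)
    return positions


# Hypothetical daily target positions produced by a strategy (1 = long, 0 = flat).
signals = [0, 1, 1, 0, 1, 1, 1, 0]
trade_log = trades_from_signals(signals)
replayed = positions_from_trades(trade_log, len(signals))
consistent = replayed == signals
print(len(trade_log), "trades; consistent with signals:", consistent)
```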

Trading Operations / Execution

Order management and execution reporting

High confidence
  • Convert a structured instruction into broker API requests
  • Submit orders, poll status, handle errors, output execution reports
  • Generate post-trade position snapshots
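
Converting a structured instruction into a broker request can be sketched as a pure validation-and-mapping step, with no network call. The `Instruction` schema below is hypothetical, and the payload keys (`symbol`, `qty`, `side`, `type`, `time_in_force`, `limit_price`) are modeled on fields commonly used by REST order endpoints such as Alpaca's; exact names and allowed values should be checked against the broker's documentation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Instruction:
    """Hypothetical structured instruction handed to the agent."""
    symbol: str
    side: str                      # "buy" or "sell"
    quantity: int
    limit_price: Optional[float] = None


def to_order_request(instr: Instruction) -> Dict[str, str]:
    """Validate the instruction and map it to an order payload (assumed field names)."""
    if instr.side not in ("buy", "sell"):
        raise ValueError(f"unsupported side: {instr.side!r}")
    if instr.quantity <= 0:
        raise ValueError("quantity must be positive")
    if instr.limit_price is not None and instr.limit_price <= 0:
        raise ValueError("limit price must be positive")

    request = {
        "symbol": instr.symbol.upper(),
        "qty": str(instr.quantity),
        "side": instr.side,
        "type": "market" if instr.limit_price is None else "limit",
        "time_in_force": "day",    # assumption: day orders only
    }
    if instr.limit_price is not None:
        request["limit_price"] = f"{instr.limit_price:.2f}"
    return request


order = to_order_request(Instruction("aapl", "buy", 100, 189.5))
print(order)
```

Keeping validation separate from submission means malformed instructions are rejected before anything reaches the broker, and the payload itself becomes a reviewable artifact.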

Compliance / Middle-Office

Rules + documentation automation

Medium confidence
  • Map transactions to rule checklists
  • Produce auditable check outputs and exception reports (with escalation)
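
The checklist-and-exception pattern above might look like the following sketch. The three rules, the notional limit, the restricted symbol, and the transactions are all invented for illustration; real rule sets come from the firm's compliance policy.

```python
from typing import Any, Callable, Dict, List

Transaction = Dict[str, Any]

# Hypothetical rule checklist: each rule returns True when the transaction passes.
RULES: Dict[str, Callable[[Transaction], bool]] = {
    "notional_within_limit": lambda t: t["quantity"] * t["price"] <= 1_000_000,
    "symbol_not_restricted": lambda t: t["symbol"] not in {"XYZ"},
    "trader_id_present": lambda t: bool(t.get("trader_id")),
}


def run_checks(transactions: List[Transaction]) -> List[Dict[str, Any]]:
    """Apply every rule to every transaction and record which rules failed."""
    report = []
    for t in transactions:
        failed = [name for name, rule in RULES.items() if not rule(t)]
        report.append({
            "id": t["id"],
            "failed_rules": failed,
            "status": "ESCALATE" if failed else "PASS",
        })
    return report


transactions = [
    {"id": "T1", "symbol": "ABC", "quantity": 100, "price": 50.0, "trader_id": "u17"},
    {"id": "T2", "symbol": "XYZ", "quantity": 10, "price": 20.0, "trader_id": "u17"},
    {"id": "T3", "symbol": "ABC", "quantity": 50_000, "price": 30.0, "trader_id": ""},
]

check_report = run_checks(transactions)
exceptions = [row for row in check_report if row["status"] == "ESCALATE"]
for row in exceptions:
    print(row["id"], "->", ", ".join(row["failed_rules"]))
```

The agent only flags and escalates; deciding what to do with an exception stays with a human reviewer, in line with the "human-last" boundary.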
What's Already Mature in AI for Finance

NLP & Text Analysis

Earnings call / filings NLP

Sentiment extraction, Forward-looking statement tagging, Risk factor summarization, FinBERT models

Automated Reporting

Deck drafting, compliance reports

Pitch deck templating, ESG drafts, Compliance report generation

Quant + ML

Signal discovery and execution

Alpha signal generation, Execution optimization, Market microstructure

For AgentHLE, the key is not "does the model know finance", but:

Can the agent operate real tools end-to-end and produce verifiable artifacts?

Part 3

Example Tasks

Benchmarkable tasks that evaluate end-to-end execution on a computer: given raw inputs and constraints, the agent must use real tools to produce deliverables that a reviewer can verify.

Core Tasks (3)

Quant Research

Run a Quantitative Strategy in a Backtesting Framework

A quant researcher validates a strategy idea by running it through historical data.
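
A reviewer would typically verify a backtest through a standardized performance report. The sketch below computes three common metrics from an equity curve in plain Python rather than through Backtrader or LEAN. It assumes daily observations, a zero risk-free rate, and 252 trading days per year for annualization; the equity values are made up.

```python
import math
from typing import Dict, List


def performance_report(equity: List[float]) -> Dict[str, float]:
    """Total return, maximum drawdown, and annualized Sharpe ratio of an equity curve."""
    if len(equity) < 3:
        raise ValueError("need at least three equity points")

    returns = [equity[i] / equity[i - 1] - 1.0 for i in range(1, len(equity))]
    total_return = equity[-1] / equity[0] - 1.0

    # Maximum drawdown: largest fractional drop from a running peak.
    peak = equity[0]
    max_drawdown = 0.0
    for value in equity:
        peak = max(peak, value)
        max_drawdown = max(max_drawdown, 1.0 - value / peak)

    mean = sum(returns) / len(returns)
    variance = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    std = math.sqrt(variance)
    # Assumption: zero risk-free rate, 252 periods per year.
    sharpe = (mean / std) * math.sqrt(252) if std > 0 else float("nan")

    return {"total_return": total_return,
            "max_drawdown": max_drawdown,
            "sharpe": sharpe}


equity_curve = [100.0, 110.0, 99.0, 120.0]   # hypothetical portfolio values
perf = performance_report(equity_curve)
for name, value in perf.items():
    print(f"{name}: {value:.4f}")
```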

Data Extraction

Retrieve and Parse SEC Filings via edgartools (XBRL Extraction)

An analyst extracts structured financial metrics from annual reports.
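
To make the extraction step concrete, the sketch below pulls annual values for one concept out of a small hand-written payload. The payload layout (`facts` → `us-gaap` → concept → `units` → list of entries with `fy`, `fp`, `form`, `val`) is modeled on the JSON that SEC EDGAR publishes for company facts, but it is an assumption here, and the sketch deliberately does not call edgartools, whose API is not shown in this document. The company and figures are invented.

```python
from typing import Any, Dict

# Hypothetical payload, shaped like an XBRL "company facts" document.
SAMPLE_FACTS: Dict[str, Any] = {
    "entityName": "Example Corp",
    "facts": {"us-gaap": {"Revenues": {"units": {"USD": [
        {"fy": 2022, "fp": "FY", "form": "10-K", "val": 1000, "end": "2022-12-31"},
        {"fy": 2023, "fp": "Q1", "form": "10-Q", "val": 300, "end": "2023-03-31"},
        {"fy": 2023, "fp": "FY", "form": "10-K", "val": 1200, "end": "2023-12-31"},
    ]}}}},
}


def annual_values(facts: Dict[str, Any], concept: str, unit: str = "USD") -> Dict[int, float]:
    """Return {fiscal_year: value} for full-year 10-K entries of one concept."""
    entries = (facts["facts"]["us-gaap"]
               .get(concept, {})
               .get("units", {})
               .get(unit, []))
    annual: Dict[int, float] = {}
    for entry in entries:
        # Keep only full-year figures reported in annual filings.
        if entry.get("form") == "10-K" and entry.get("fp") == "FY":
            annual[entry["fy"]] = entry["val"]
    return dict(sorted(annual.items()))


revenue_by_year = annual_values(SAMPLE_FACTS, "Revenues")
print(revenue_by_year)
```

A missing concept returns an empty dictionary instead of raising, so the agent can report "not found" explicitly rather than silently guessing a value.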

Trading Operations

Execute a Trade Instruction via Alpaca API (Paper Trading)

A trader executes a PM instruction in a trading system.
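
After execution, the deliverable a reviewer checks is usually a position snapshot reconciled against the original instruction. The sketch below works on a list of fills and makes no Alpaca calls; the fill records, the flat starting position, and the field names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def position_snapshot(fills: List[Dict]) -> Dict[str, int]:
    """Net signed quantity per symbol, assuming the account started flat."""
    positions: Dict[str, int] = defaultdict(int)
    for fill in fills:
        signed_qty = fill["qty"] if fill["side"] == "buy" else -fill["qty"]
        positions[fill["symbol"]] += signed_qty
    # Drop symbols that netted out to zero.
    return {sym: qty for sym, qty in sorted(positions.items()) if qty != 0}


def average_fill_price(fills: List[Dict], symbol: str, side: str) -> Optional[float]:
    """Quantity-weighted average price of the fills matching symbol and side."""
    matching = [f for f in fills if f["symbol"] == symbol and f["side"] == side]
    total_qty = sum(f["qty"] for f in matching)
    if total_qty == 0:
        return None
    return sum(f["qty"] * f["price"] for f in matching) / total_qty


# Hypothetical execution report for the instruction "buy 100 AAPL".
fills = [
    {"symbol": "AAPL", "side": "buy", "qty": 60, "price": 190.0},
    {"symbol": "AAPL", "side": "buy", "qty": 40, "price": 191.0},
    {"symbol": "MSFT", "side": "buy", "qty": 10, "price": 400.0},
    {"symbol": "MSFT", "side": "sell", "qty": 10, "price": 401.0},
]
instruction = {"symbol": "AAPL", "side": "buy", "qty": 100}

snapshot = position_snapshot(fills)
avg_price = average_fill_price(fills, instruction["symbol"], instruction["side"])
fully_executed = snapshot.get(instruction["symbol"], 0) == instruction["qty"]
print(snapshot, avg_price, fully_executed)
```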

Alternative Tasks

Financial Modeling

Generate an Excel Financial Model via openpyxl

An IBD analyst builds a valuation model deliverable.
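
A valuation workbook is easiest to review when its headline number can be recomputed independently. The sketch below is one common discounted-cash-flow convention (end-of-year cash flows plus a Gordon-growth terminal value), written in plain Python rather than openpyxl; it is not the model the task prescribes, and the cash flows and rates are made up.

```python
from typing import List


def dcf_enterprise_value(cash_flows: List[float], discount_rate: float,
                         terminal_growth: float) -> float:
    """Present value of forecast cash flows plus a Gordon-growth terminal value."""
    if not cash_flows:
        raise ValueError("need at least one forecast cash flow")
    if discount_rate <= terminal_growth:
        raise ValueError("discount rate must exceed terminal growth")

    # Discount each forecast cash flow, assuming it arrives at the end of year t.
    pv_forecast = sum(cf / (1.0 + discount_rate) ** t
                      for t, cf in enumerate(cash_flows, start=1))

    # Terminal value at the end of the forecast horizon, then discounted back.
    terminal_value = (cash_flows[-1] * (1.0 + terminal_growth)
                      / (discount_rate - terminal_growth))
    pv_terminal = terminal_value / (1.0 + discount_rate) ** len(cash_flows)

    return pv_forecast + pv_terminal


# Hypothetical three-year forecast, 10% discount rate, zero terminal growth.
enterprise_value = dcf_enterprise_value([100.0, 110.0, 121.0], 0.10, 0.0)
print(f"{enterprise_value:.2f}")
```

A checker can compare this recomputed figure with the value stored in the spreadsheet cell, within a small tolerance.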

Portfolio Management

Portfolio Optimization via PyPortfolioOpt

A PM solves an allocation under constraints.
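
For intuition about allocation under constraints, the two-asset minimum-variance problem has a closed form: with variances σ_a², σ_b² and covariance σ_ab, the unconstrained weight on asset A is (σ_b² − σ_ab) / (σ_a² + σ_b² − 2σ_ab), and a per-asset weight cap can be applied by clamping, because the objective is convex in a single variable. The sketch below does not use PyPortfolioOpt or cvxpy, and the inputs are invented.

```python
from typing import Tuple


def min_variance_weights(var_a: float, var_b: float, cov_ab: float,
                         max_weight: float = 1.0) -> Tuple[float, float]:
    """Fully invested two-asset minimum-variance weights with a per-asset cap."""
    if not 0.5 <= max_weight <= 1.0:
        raise ValueError("max_weight must be between 0.5 and 1.0")
    denom = var_a + var_b - 2.0 * cov_ab
    if denom <= 0:
        raise ValueError("degenerate covariance inputs")

    w_a = (var_b - cov_ab) / denom
    # Both weights must lie in [1 - max_weight, max_weight]; clamp asset A accordingly.
    w_a = min(max(w_a, 1.0 - max_weight), max_weight)
    return w_a, 1.0 - w_a


def portfolio_variance(w_a: float, w_b: float, var_a: float,
                       var_b: float, cov_ab: float) -> float:
    """Variance of a two-asset portfolio."""
    return w_a ** 2 * var_a + w_b ** 2 * var_b + 2.0 * w_a * w_b * cov_ab


# Hypothetical annualized inputs: 20% and 30% volatility, covariance 0.006.
VAR_A, VAR_B, COV_AB = 0.04, 0.09, 0.006

weights = min_variance_weights(VAR_A, VAR_B, COV_AB)
capped = min_variance_weights(VAR_A, VAR_B, COV_AB, max_weight=0.6)
optimal_variance = portfolio_variance(*weights, VAR_A, VAR_B, COV_AB)
print(weights, capped, optimal_variance)
```

With these inputs the cap binds on the lower-volatility asset, so the capped portfolio holds 60% in A and 40% in B.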

Recommended Tool Stack (All Free)

Data retrieval, backtesting, trade execution, modeling outputs, and optimization, all using open-source or free-tier tools.

SEC EDGAR
edgartools
yfinance
Backtrader
QuantConnect LEAN
Alpaca Trading API
openpyxl
xlsxwriter
PyPortfolioOpt
cvxpy

Task Comparison Summary

| Task | Must-Use Software | Industry Rep. | Scoring | Scale | Priority |
| --- | --- | --- | --- | --- | --- |
| Backtest execution | Backtrader / LEAN | ★★★★★ | ★★★★★ | ★★★★★ | Core |
| SEC parsing | edgartools | ★★★★☆ | ★★★★★ | ★★★★★ | Core |
| Trade execution | Alpaca API | ★★★★★ | ★★★★★ | ★★★★★ | Core |
| Excel modeling | openpyxl | ★★★★☆ | ★★★★☆ | ★★★★★ | Optional |
| Portfolio optimization | PyPortfolioOpt | ★★★☆☆ | ★★★★★ | ★★★★★ | Optional |

Contribute to Finance

We seek high-level, representative contributions, not exhaustive documentation. Share your expertise in any of these areas:

Our Commitments to Contributors

  • Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
  • Partner Review: Industry partners can review and approve task specifications before public release.
  • Data Control: Contributors can exclude sensitive or proprietary data from submissions.