Finance
A workflow-driven, evidence-rich domain with strict audit and compliance requirements, ideal for execution-focused AI agent benchmarks.
Finance Overview
Finance is one of the most promising domains for AI agent evaluation because the work is highly workflow-driven, produces abundant digital artifacts, and is constrained by strict correctness, auditability, and compliance requirements.
Market Structure
Understanding the fundamental split between primary and secondary markets, and the buy-side/sell-side ecosystem.
Primary Market
Companies raise capital directly from investors
Secondary Market
Already-issued securities trade among investors
Buy-Side
Invests on behalf of end investors to generate returns
- Management fees (% of AUM)
- Performance fees
- Carried interest
Sell-Side
Creates/distributes products and services
- Advisory fees
- Underwriting fees
- Commissions & spreads
Career transition: A common path is moving from sell-side to buy-side after building expertise and relationships.
Organizational Structure & Key Roles
How finance institutions organize around front/middle/back office functions, and the representative roles where execution-heavy work occurs.
Front Office
Revenue generation
High comp, high pressure, client-facing, long hours
Middle Office
Risk + compliance control
Ensure front office stays within risk/compliance boundaries
Back Office
Operations support
Ensure business runs reliably (books & records, reconciliation, reporting)
IBD Analyst
Investment Banking
Execution engine. Builds models (3-statement, DCF, LBO), drafts pitch decks, runs diligence, manages VDRs.
Portfolio Manager
Asset Management
Final decision-maker; portfolio construction, security selection, risk, fundraising/IR.
Quant Researcher
Quant Trading
Discovers signals/alphas; runs backtests; iterates hypotheses.
Research Analyst
Sell-Side Research
Maintains coverage universe; builds earnings/valuation models; publishes reports with ratings/targets.
Where LLM Agents Fit
Identifying high-confidence surfaces where AI agents can reliably execute specifiable, reviewable tasks, while respecting the "human-last" boundary for high-stakes decisions.
The "Human-Last" Boundary
Finance has a hard constraint: errors are expensive, and many decisions are regulated and audit-sensitive. The stable pattern is:
Agents can reliably execute what humans can specify and verify, but humans retain judgment for ambiguous, high-stakes decisions.
Research → Evidence Packaging
Sell-side and buy-side research automation
- Pull data (prices, fundamentals, filings), build standardized tables
- Maintain/update a valuation model (spreadsheet outputs, checks)
- Draft structured summaries from predefined templates
- Create plots, dashboards, and "what changed" diffs
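The "what changed" diff in the last bullet can be sketched as a comparison of two metric snapshots. This is a minimal illustration, not a prescribed format: the function name `what_changed`, the metric names, and the tolerance parameter are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

DiffRow = Tuple[str, str, Optional[float], Optional[float]]

def what_changed(old: Dict[str, float], new: Dict[str, float],
                 rel_tol: float = 0.0) -> List[DiffRow]:
    """Return (metric, status, old_value, new_value) rows, sorted by metric."""
    rows: List[DiffRow] = []
    for key in sorted(set(old) | set(new)):
        if key not in old:
            rows.append((key, "added", None, new[key]))
        elif key not in new:
            rows.append((key, "removed", old[key], None))
        else:
            a, b = old[key], new[key]
            # Relative change against the old value; avoid dividing by zero.
            base = abs(a) if a != 0 else 1.0
            if abs(b - a) / base > rel_tol:
                rows.append((key, "changed", a, b))
    return rows

# Hypothetical snapshots of a model's outputs before and after an update.
prior = {"revenue": 1000.0, "ebitda": 250.0, "net_debt": 400.0}
latest = {"revenue": 1100.0, "ebitda": 250.0, "capex": 80.0}

diff = what_changed(prior, latest)
for row in diff:
    print(row)
```

Sorting the keys makes the diff deterministic, so a reviewer or an automated checker can compare it against an expected artifact line by line.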
Quant Research & Engineering
Strategy implementation and validation
- Implement a strategy specification in a backtest framework
- Run backtests, produce standardized performance reports
- Export trade logs and validate signal → trade consistency
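The workflow above can be sketched without any backtest framework. The example below is a hedged, standard-library illustration: the moving-average rule, the trade-log fields, and every function name are assumptions chosen for the sketch, not the API of Backtrader or LEAN. It derives signals, emits a trade log, replays the log, and checks that the replayed positions match the signals.

```python
from typing import Dict, List

def sma_signals(prices: List[float], window: int) -> List[int]:
    """1 when the close is above its trailing SMA, else 0 (0 during warm-up)."""
    signals = []
    for i, price in enumerate(prices):
        if i + 1 < window:
            signals.append(0)
        else:
            sma = sum(prices[i + 1 - window:i + 1]) / window
            signals.append(1 if price > sma else 0)
    return signals

def signals_to_trades(signals: List[int]) -> List[Dict]:
    """Emit one trade each time the target position changes."""
    trades, position = [], 0
    for day, target in enumerate(signals):
        if target != position:
            side = "buy" if target > position else "sell"
            trades.append({"day": day, "side": side, "qty": abs(target - position)})
            position = target
    return trades

def replay_positions(trades: List[Dict], n_days: int) -> List[int]:
    """Rebuild the daily position series from the trade log alone."""
    by_day = {t["day"]: t for t in trades}
    positions, position = [], 0
    for day in range(n_days):
        if day in by_day:
            t = by_day[day]
            position += t["qty"] if t["side"] == "buy" else -t["qty"]
        positions.append(position)
    return positions

def total_return(prices: List[float], positions: List[int]) -> float:
    """Compound daily returns earned on the previous day's closing position."""
    equity = 1.0
    for i in range(1, len(prices)):
        equity *= 1.0 + positions[i - 1] * (prices[i] / prices[i - 1] - 1.0)
    return equity - 1.0

# Hypothetical closing prices.
prices = [100.0, 101.0, 103.0, 102.0, 99.0, 98.0, 100.0, 104.0]
signals = sma_signals(prices, window=3)
trades = signals_to_trades(signals)
positions = replay_positions(trades, len(prices))

# Signal -> trade consistency: the trade log must reproduce the signals.
consistent = positions == signals
print(trades, consistent, round(total_return(prices, positions), 6))
```

Replaying the log independently of the signal code is what makes the check meaningful: a trade that was dropped or duplicated shows up as a mismatch.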
Trading Operations / Execution
Order management and execution reporting
- Convert a structured instruction into broker API requests
- Submit orders, poll status, handle errors, output execution reports
- Generate post-trade position snapshots
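As a rough sketch of the first and last bullets, the code below validates a structured instruction, maps it to an order payload, and nets a list of fills into a position snapshot. The instruction schema and the payload field names are assumptions modeled on common broker REST order formats. Nothing here calls a real broker, and the fills are hypothetical.

```python
import json
from typing import Dict, List

VALID_SIDES = {"buy", "sell"}

def build_order_request(instruction: Dict) -> Dict:
    """Validate a structured instruction and map it to an order payload."""
    side = str(instruction["side"]).lower()
    if side not in VALID_SIDES:
        raise ValueError(f"unsupported side: {instruction['side']!r}")
    qty = int(instruction["quantity"])
    if qty <= 0:
        raise ValueError("quantity must be positive")
    order = {
        "symbol": str(instruction["ticker"]).upper(),
        "qty": qty,
        "side": side,
        # A limit price in the instruction implies a limit order.
        "type": "limit" if "limit_price" in instruction else "market",
        "time_in_force": instruction.get("time_in_force", "day"),
    }
    if order["type"] == "limit":
        order["limit_price"] = float(instruction["limit_price"])
    return order

def position_snapshot(fills: List[Dict]) -> Dict[str, int]:
    """Net filled quantity per symbol; flat positions are dropped."""
    positions: Dict[str, int] = {}
    for fill in fills:
        signed = fill["qty"] if fill["side"] == "buy" else -fill["qty"]
        positions[fill["symbol"]] = positions.get(fill["symbol"], 0) + signed
    return {sym: qty for sym, qty in sorted(positions.items()) if qty != 0}

# Hypothetical instruction from a portfolio manager.
instruction = {"ticker": "aapl", "side": "BUY", "quantity": 10, "limit_price": 185.5}
order = build_order_request(instruction)
print(json.dumps(order, sort_keys=True))

# Hypothetical fills, as an execution report might list them.
fills = [
    {"symbol": "AAPL", "side": "buy", "qty": 10},
    {"symbol": "MSFT", "side": "buy", "qty": 5},
    {"symbol": "MSFT", "side": "sell", "qty": 5},
]
snapshot = position_snapshot(fills)
print(snapshot)
```

Rejecting malformed instructions before any request is built keeps errors on the agent's side of the API boundary, where they are cheap to report.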
Compliance / Middle-Office
Rules + documentation automation
- Map transactions to rule checklists
- Produce auditable check outputs and exception reports (with escalation)
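One way to picture the two bullets above is a checklist of small rule functions applied to each transaction, with every result recorded and failures collected for escalation. The rules, thresholds, symbols, and field names below are hypothetical; real rule sets come from the firm's compliance policy.

```python
from typing import Callable, Dict, List, Optional, Set

# A rule returns None on pass, or a short reason string on failure.
Rule = Callable[[Dict], Optional[str]]

def max_notional(limit: float) -> Rule:
    def check(txn: Dict) -> Optional[str]:
        notional = txn["qty"] * txn["price"]
        if notional <= limit:
            return None
        return f"notional {notional:.2f} exceeds {limit:.2f}"
    return check

def not_restricted(restricted: Set[str]) -> Rule:
    def check(txn: Dict) -> Optional[str]:
        if txn["symbol"] not in restricted:
            return None
        return f"{txn['symbol']} is on the restricted list"
    return check

def run_checklist(txns: List[Dict], rules: Dict[str, Rule]) -> List[Dict]:
    """Evaluate every rule on every transaction and record each result."""
    results = []
    for txn in txns:
        for name, rule in rules.items():
            reason = rule(txn)
            results.append({"txn_id": txn["id"], "rule": name,
                            "passed": reason is None, "reason": reason})
    return results

def exception_report(results: List[Dict]) -> List[Dict]:
    """Failed checks only, flagged for escalation to a human reviewer."""
    return [dict(r, escalate=True) for r in results if not r["passed"]]

# Hypothetical rule set and transactions.
rules = {
    "max_notional": max_notional(100_000.0),
    "restricted_list": not_restricted({"XYZ"}),
}
txns = [
    {"id": "T1", "symbol": "ABC", "qty": 100, "price": 50.0},
    {"id": "T2", "symbol": "XYZ", "qty": 10, "price": 20.0},
    {"id": "T3", "symbol": "ABC", "qty": 5000, "price": 50.0},
]

results = run_checklist(txns, rules)
exceptions = exception_report(results)
for exc in exceptions:
    print(exc)
```

Recording passes as well as failures gives the audit trail: the full `results` list shows that each rule was evaluated, while `exceptions` is the short list a human reviews.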
NLP & Text Analysis
Earnings call / filings NLP
Automated Reporting
Deck drafting, compliance reports
Quant + ML
Signal discovery and execution
For AgentHLE, the key is not "does the model know finance", but:
Can the agent operate real tools end-to-end and produce verifiable artifacts?
Example Tasks
Benchmarkable tasks that evaluate end-to-end execution on a computer: given raw inputs and constraints, the agent must use real tools to produce deliverables that a reviewer can verify.
Core Tasks (3)
Run a Quantitative Strategy in a Backtesting Framework
A quant researcher validates a strategy idea by running it through historical data.
Retrieve and Parse SEC Filings via edgartools (XBRL Extraction)
An analyst extracts structured financial metrics from annual reports.
Execute a Trade Instruction via Alpaca API (Paper Trading)
A trader executes a PM instruction in a trading system.
Alternative Tasks
Generate an Excel Financial Model via openpyxl
An IBD analyst builds a valuation model deliverable.
Portfolio Optimization via PyPortfolioOpt
A PM solves an allocation under constraints.
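To make "an allocation under constraints" concrete, here is a small worked sketch for the two-asset case, where the minimum-variance weight has a closed form and a per-asset cap can be applied by clamping. It is a stand-in for illustration only and does not use PyPortfolioOpt; the function names and the variance and covariance inputs are hypothetical.

```python
from typing import Tuple

def min_variance_weights(var_a: float, var_b: float, cov_ab: float,
                         max_weight: float = 1.0) -> Tuple[float, float]:
    """Long-only two-asset minimum-variance weights with a per-asset cap."""
    if not 0.5 <= max_weight <= 1.0:
        raise ValueError("max_weight must lie in [0.5, 1.0] for two assets")
    denom = var_a + var_b - 2.0 * cov_ab  # variance of (A - B), never negative
    if denom <= 0.0:
        # Degenerate case: every mix has the same variance, so split equally.
        return 0.5, 0.5
    w_a = (var_b - cov_ab) / denom  # unconstrained optimum
    # Variance is a convex quadratic in w_a, so clamping to the feasible
    # interval gives the constrained optimum.
    w_a = min(max(w_a, 1.0 - max_weight), max_weight)
    return w_a, 1.0 - w_a

def portfolio_variance(w_a: float, w_b: float,
                       var_a: float, var_b: float, cov_ab: float) -> float:
    return w_a ** 2 * var_a + w_b ** 2 * var_b + 2.0 * w_a * w_b * cov_ab

# Hypothetical annualized variances and covariance.
var_a, var_b, cov_ab = 0.04, 0.09, 0.006

free = min_variance_weights(var_a, var_b, cov_ab)
capped = min_variance_weights(var_a, var_b, cov_ab, max_weight=0.6)

print("unconstrained:", free, portfolio_variance(*free, var_a, var_b, cov_ab))
print("capped at 60%:", capped, portfolio_variance(*capped, var_a, var_b, cov_ab))
```

The cap binds in this example, so the capped portfolio has a higher variance than the unconstrained one. A reviewer can verify both the weights and the cost of the constraint from the printed numbers.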
Recommended Tool Stack (All Free)
Data retrieval, backtesting, trade execution, modeling outputs, and optimization, all using open-source or free-tier tools.
Task Comparison Summary
| Task | Must-Use Software | Industry Rep. | Scoring | Scale | Priority |
|---|---|---|---|---|---|
| Backtest execution | Backtrader / LEAN | [rating unreadable] | [rating unreadable] | [rating unreadable] | Core |
| SEC parsing | edgartools | [rating unreadable] | [rating unreadable] | [rating unreadable] | Core |
| Trade execution | Alpaca API | [rating unreadable] | [rating unreadable] | [rating unreadable] | Core |
| Excel modeling | openpyxl | [rating unreadable] | [rating unreadable] | [rating unreadable] | Optional |
| Portfolio optimization | PyPortfolioOpt | [rating unreadable] | [rating unreadable] | [rating unreadable] | Optional |
Contribute to Finance
We seek high-level, representative contributions, not exhaustive documentation. Share your expertise in any of these areas:
Submit Landscape Understanding
Help us map sectors, roles, tasks, and tools in finance. Share your perspective on the industry structure.
Submit a Workflow
Describe a specific professional task with tools, inputs, outputs, and how success is verified.
Our Commitments to Contributors
- Evaluation Only: All contributions are used exclusively for agent evaluation, never for model training.
- Partner Review: Industry partners can review and approve task specifications before public release.
- Data Control: Contributors can exclude sensitive or proprietary data from submissions.