Interactive Demo

Experience ARTEMIS

The Adaptive Reasoning and Evaluation Framework for Multi-agent Intelligent Systems. Watch structured debates unfold with built-in safety monitoring.

N-Agent Debates

Watch agents reason together

Unlike frameworks limited to two or three agents, ARTEMIS supports N-agent debates with structured jury scoring. Click play to watch a loan approval debate unfold.
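In code, an N-agent panel might be assembled like this. This is a minimal illustrative sketch: the `Agent` class and `jury_score` helper are stand-ins, not ARTEMIS's actual API.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    stance: str  # "approve" or "deny"

def jury_score(argument_strength: dict) -> dict:
    """Normalize per-stance argument strength into jury percentages."""
    total = sum(argument_strength.values())
    return {stance: round(100 * s / total, 1)
            for stance, s in argument_strength.items()}

# Four agents here, but the panel can hold any N.
agents = [
    Agent("Advocate", "approve"),
    Agent("Critic", "deny"),
    Agent("Risk Analyst", "approve"),
    Agent("Policy Agent", "approve"),
]

# Before round 1 no arguments exist, so the jury starts at 50/50.
print(jury_score({"approve": 1.0, "deny": 1.0}))  # {'approve': 50.0, 'deny': 50.0}
```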

Debate Topic

Should we approve loan application #4829?

$125,000 business expansion loan for a 3-year-old restaurant with mixed financials

Advocate (A): Argues FOR approval. Waiting to begin...
Critic (C): Argues AGAINST approval. Waiting to begin...
Risk Analyst (R): Evaluates risk factors. Waiting to begin...
Policy Agent (P): Checks compliance. Waiting to begin...
Jury Scoring Panel (Round 0/3): Approve 50% | Deny 50%. Verdict: Deliberating...
Hierarchical Argument Generation

H-L-DAG: Structured reasoning

Arguments are organized across strategic, tactical, and operational levels. Click any node to see how reasoning flows from high-level goals to concrete actions.

Strategic
  S1: Maximize portfolio return while managing risk exposure
Tactical
  T1: Evaluate business viability
  T2: Assess collateral adequacy
  T3: Verify compliance requirements
Operational
  O1: Check 3-year revenue trend
  O2: Verify cash flow ratio
  O3: Appraise property value
  O4: Run KYC checks
  O5: Verify fair lending
STRATEGIC Node S1

Goal: Maximize portfolio return while managing risk exposure

This strategic objective guides all downstream tactical and operational decisions. The lending decision must balance potential returns against risk factors.

Weight: 1.0 (Primary)
Evaluation: Causal Reasoning
Children: 3 tactical nodes
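The three-level structure above can be sketched as a small tree of argument nodes. The `ArgNode` type and `leaves` traversal are illustrative, not the framework's real data model.

```python
from dataclasses import dataclass, field

@dataclass
class ArgNode:
    node_id: str
    level: str   # "strategic" | "tactical" | "operational"
    text: str
    weight: float = 1.0
    children: list = field(default_factory=list)

# Rebuild the demo's H-L-DAG: one strategic goal fans out to tactical
# sub-goals, which fan out to concrete operational checks.
s1 = ArgNode("S1", "strategic",
             "Maximize portfolio return while managing risk exposure")
t1 = ArgNode("T1", "tactical", "Evaluate business viability")
t2 = ArgNode("T2", "tactical", "Assess collateral adequacy")
t3 = ArgNode("T3", "tactical", "Verify compliance requirements")
s1.children = [t1, t2, t3]
t1.children = [ArgNode("O1", "operational", "Check 3-year revenue trend"),
               ArgNode("O2", "operational", "Verify cash flow ratio")]
t2.children = [ArgNode("O3", "operational", "Appraise property value")]
t3.children = [ArgNode("O4", "operational", "Run KYC checks"),
               ArgNode("O5", "operational", "Verify fair lending")]

def leaves(node):
    """Collect the operational leaf nodes reachable from a node."""
    if not node.children:
        return [node.node_id]
    return [leaf for c in node.children for leaf in leaves(c)]

print(leaves(s1))  # ['O1', 'O2', 'O3', 'O4', 'O5']
```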
Safety Infrastructure

Real-time safety monitoring

ARTEMIS continuously monitors for sandbagging, deception, behavioral drift, and ethical boundary violations. All checks run in real-time during debates.

Sandbagging Detection

CLEAR

Detects when agents deliberately hide capabilities or underperform to manipulate outcomes.

Capability Utilization: 87%
Response Consistency: 94%
14:32:01 All agents operating at expected capability levels

Deception Monitoring

CLEAR

Identifies misleading arguments, cherry-picked evidence, or attempts to deceive other agents.

Argument Validity: 96%
Source Accuracy: 100%
14:31:58 No deceptive patterns detected in Round 2

Behavioral Drift

MONITORING

Tracks unexpected changes in agent behavior patterns compared to baseline.

Advocate Drift: 12%
Critic Drift: 3%
14:31:45 Advocate showing 12% drift - within acceptable threshold

Ethical Boundaries

ENFORCED

Ensures agents operate within defined ethical constraints and don't violate policy boundaries.

Fair lending compliance
No discriminatory reasoning
Privacy boundaries respected
Regulatory guidelines followed
4/4 Safety Checks Passing
0 Violations Detected
1 Active Monitoring Alert
247ms Avg Check Latency
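A minimal sketch of how the four checks might gate on the panel's metrics. The thresholds here are assumptions chosen so the demo's values pass; they are not ARTEMIS's published limits, and the real checks are richer than simple cutoffs.

```python
def run_safety_checks(metrics: dict) -> dict:
    """Return pass/fail for the four safety checks (illustrative thresholds)."""
    return {
        "sandbagging": metrics["capability_utilization"] >= 0.70
                       and metrics["response_consistency"] >= 0.80,
        "deception": metrics["argument_validity"] >= 0.90
                     and metrics["source_accuracy"] >= 0.95,
        "behavioral_drift": max(metrics["drift"].values()) <= 0.20,
        "ethical_boundaries": not metrics["violations"],
    }

# Values from the demo panel above.
panel = {
    "capability_utilization": 0.87,
    "response_consistency": 0.94,
    "argument_validity": 0.96,
    "source_accuracy": 1.00,
    "drift": {"advocate": 0.12, "critic": 0.03},
    "violations": [],
}

result = run_safety_checks(panel)
print(f"{sum(result.values())}/4 safety checks passing")  # 4/4 safety checks passing
```

Note the 12% advocate drift still passes here (it is under the 20% cutoff) while remaining the one active monitoring alert.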
L-AE-CR

Adaptive evaluation with causal reasoning

Unlike static evaluation metrics, ARTEMIS dynamically adjusts criteria weights based on debate context. Watch how weights shift as the debate progresses.

Evaluation Criteria

Context: Loan Approval Debate
Evidence Quality: 35% (strength and relevance of supporting data)
Logical Coherence: 25% (soundness of the reasoning chain)
Risk Assessment: 20% (thoroughness of risk consideration)
Policy Alignment: 15% (adherence to lending policies)
Argument Novelty: 5% (introduction of new perspectives)

Weight Adaptation Log

T+0s Initial weights set based on debate type: Financial Decision
T+12s Risk Assessment weight +5% (risk factors identified)
T+28s Evidence Quality weight +10% (conflicting data presented)
Weights adjust based on causal analysis of debate dynamics, not just static rules.
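The adaptation log above can be mimicked with a simple bump-and-renormalize rule. The starting weights and the renormalization step are assumptions, since the demo does not specify how the remaining criteria are rebalanced after a bump.

```python
def bump(weights: dict, criterion: str, delta: float) -> dict:
    """Raise one criterion's weight, then renormalize so all weights sum to 1."""
    w = dict(weights)
    w[criterion] += delta
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}

# Illustrative initial weights for a financial-decision debate.
weights = {"evidence_quality": 0.25, "logical_coherence": 0.25,
           "risk_assessment": 0.20, "policy_alignment": 0.20,
           "argument_novelty": 0.10}

weights = bump(weights, "risk_assessment", 0.05)   # T+12s: risk factors identified
weights = bump(weights, "evidence_quality", 0.10)  # T+28s: conflicting data presented

print({k: round(v, 3) for k, v in weights.items()})
```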
Consensus Protocols

Configurable agreement mechanisms

Choose how agents reach decisions: simple majority, weighted voting, unanimous consent, or custom protocols.

Weighted Majority (Active): Votes are weighted by agent expertise and confidence scores. Threshold: 60%; minimum agents: 3.
Unanimous Consent: All agents must agree for the decision to pass. Threshold: 100%; veto: any agent.
Quorum Vote: The decision requires a minimum participation threshold. Quorum: 75%; pass: 51%.
Expert Override: A domain expert can override if confidence exceeds the threshold. Override: 95% confidence; audit: required.

Current Vote Distribution

Advocate (A): APPROVE (weight 1.2x, confidence 87%)
Critic (C): DENY (weight 1.0x, confidence 72%)
Risk Analyst (R): APPROVE (weight 1.5x, confidence 91%)
Policy Agent (P): APPROVE (weight 1.3x, confidence 94%)
Weighted result: 68.4% APPROVE (threshold met)
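One plausible weighted-majority aggregation multiplies each agent's weight by its confidence and takes the stance with the larger mass. The demo reports 68.4% APPROVE; since ARTEMIS's exact formula is not shown, this sketch illustrates the mechanism rather than reproducing that figure.

```python
# (agent, choice, weight, confidence) from the vote distribution above.
votes = [
    ("Advocate", "APPROVE", 1.2, 0.87),
    ("Critic", "DENY", 1.0, 0.72),
    ("Risk Analyst", "APPROVE", 1.5, 0.91),
    ("Policy Agent", "APPROVE", 1.3, 0.94),
]

def weighted_majority(votes, threshold=0.60):
    """Aggregate weight-times-confidence mass per choice; pass if the winner
    clears the threshold share of total mass."""
    mass = {}
    for _, choice, weight, conf in votes:
        mass[choice] = mass.get(choice, 0.0) + weight * conf
    total = sum(mass.values())
    winner, score = max(mass.items(), key=lambda kv: kv[1])
    share = score / total
    return winner, share, share >= threshold

winner, share, passed = weighted_majority(votes)
# This scheme yields APPROVE at 83.5%, above the 60% threshold
# (it differs from the demo's 68.4%, which uses an unspecified formula).
print(winner, f"{share:.1%}", "THRESHOLD MET" if passed else "below threshold")
```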

Liked what you saw?

Now run debates on your own use cases

This demo shows a loan approval scenario. Imagine multi-agent reasoning applied to your specific domain challenges with your safety policies.

Configure agents for your domain
Define your safety thresholds and policies
Built on peer-reviewed research published on TD Commons

Or reach out to [email protected] to discuss your specific requirements