LLM Evaluation & Governance Workflow

Workflow for evaluating, approving, and monitoring LLM applications: offline test suites, risk tiering, production sampling, and a governance board process.

AI Architectures · Intermediate · Workflow Template

Architecture Diagram

AWS reference layout with grouped regions, numbered flows, and official service icons.

[Figure: LLM Evaluation & Governance on AWS. Pre-deploy eval gates and post-deploy monitoring.]
LLM Application Lifecycle: Use case canvas (Design) -> Prompt / app (Build, Amazon Bedrock) -> Eval suite (Test, Amazon Bedrock) -> Approval gate (AWS IAM) -> Production deploy (AWS Lambda) -> Monitor + re-eval (Amazon CloudWatch)
Evaluation & Safety Pipeline: Golden Q&A set (Amazon S3), LLM-as-judge (Amazon Bedrock), Safety tests (Amazon Macie), Audit trail (Amazon S3)
Flow labels: numbered steps 1-8, "pass/fail", "sample traffic"

All critical eval tests must pass before deploy · Continuous re-eval on prompt/model changes · Full S3 audit trail

Code preview

Replace {{PLACEHOLDERS}} with your environment values, then deploy to your stack.

# LLM Evaluation & Governance Workflow

> AI Architecture · {{ORGANIZATION_NAME}}

## Purpose

End-to-end workflow for evaluating, approving, and monitoring LLM applications in production.

## Governance Workflow

```
┌─────────┐   design   ┌─────────┐   eval    ┌─────────┐  approve  ┌─────────┐
│ Use Case│ ─────────▶ │  Build  │ ────────▶ │  Test   │ ────────▶ │ Deploy  │
│  Canvas │            │  Prompt │           │  Suite  │           │  Prod   │
└─────────┘            └─────────┘           └─────────┘           └────┬────┘
                                                                          │
                     ┌────────────────────────────────────────────────────┘
                     ▼
              ┌─────────────┐
              │  Monitor +  │
              │  Re-eval    │
              └─────────────┘
```

## Evaluation Pipeline

### Offline evaluation (pre-deploy)
1. Curate golden dataset: {{NUM_TEST_CASES}} Q&A pairs per use case
2. Run the prompts against both candidate models, {{MODEL_A}} and {{MODEL_B}}
3. Score with: exact match, LLM-as-judge, human rubric sample
4. Safety tests: jailbreak attempts, PII leakage, toxic output
5. Gate: all critical tests pass before staging deploy
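
The gate in step 5 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `EvalResult` record, the `critical` flag, and the test names are hypothetical, and exact match is shown with case and whitespace normalization as one possible convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """One scored test case (hypothetical record shape)."""
    name: str
    critical: bool
    passed: bool


def exact_match(expected: str, actual: str) -> bool:
    """Exact-match scorer; ignores case and surrounding whitespace (an assumption)."""
    return expected.strip().lower() == actual.strip().lower()


def gate(results):
    """Return (ok, failures). ok is True only if every critical test passed."""
    failures = [r.name for r in results if r.critical and not r.passed]
    return (not failures, failures)


results = [
    EvalResult("golden_qa_001", critical=False, passed=exact_match("Paris", " paris ")),
    EvalResult("jailbreak_basic", critical=True, passed=True),
    EvalResult("pii_leak", critical=True, passed=False),
]
ok, failures = gate(results)
print(ok, failures)
```

Non-critical failures do not block the deploy in this sketch; they would still be worth reporting alongside the gate decision.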

### Online evaluation (post-deploy)
1. Sample {{SAMPLE_RATE}}% production traffic
2. Log inputs/outputs (redacted) to evaluation store
3. Collect user feedback and implicit signals (e.g. task completion rate)
4. Weekly drift report to {{AI_GOVERNANCE_BOARD}}
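
Steps 1 and 2 can be sketched as follows. The sketch assumes each request carries a stable `request_id` (a hypothetical field) and samples deterministically by hashing it, so the same request always gets the same decision. The redaction shown covers email addresses only and stands in for whatever PII detection your stack provides.

```python
import hashlib
import re

# Illustrative pattern only: real redaction needs broader PII coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def should_sample(request_id: str, sample_rate_pct: float) -> bool:
    """Deterministically select roughly sample_rate_pct percent of requests."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < sample_rate_pct * 100


def redact(text: str) -> str:
    """Mask email addresses before the record reaches the evaluation store."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def build_log_record(request_id: str, prompt: str, output: str) -> dict:
    """Assemble the redacted record for a sampled request."""
    return {"request_id": request_id, "input": redact(prompt), "output": redact(output)}


record = build_log_record("req-123", "Email me at jane@example.com", "Sure, noted.")
print(record)
```

Hash-based sampling keeps the decision reproducible across retries and services, which random sampling does not.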

## Risk Tiering

| Tier | Criteria | Approval |
|------|----------|----------|
| T1 Low | Internal docs, no PII | Team lead |
| T2 Medium | Customer-facing, no financial impact | Director + security |
| T3 High | Regulated/financial decisions | Executive + legal |
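
The table can be encoded as a small classifier so the required approval is derived rather than chosen by hand. This is a sketch under assumptions: the `UseCase` fields are hypothetical, and the table does not say how to tier an internal use case that handles PII, so this example places it in T2.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Risk tiers; each value is the approval named in the table."""
    T1_LOW = "Team lead"
    T2_MEDIUM = "Director + security"
    T3_HIGH = "Executive + legal"


@dataclass(frozen=True)
class UseCase:
    """Hypothetical attributes taken from the use case canvas."""
    customer_facing: bool
    handles_pii: bool
    regulated_or_financial: bool


def classify(use_case: UseCase) -> Tier:
    """Map a use case to the highest tier whose criteria it meets."""
    if use_case.regulated_or_financial:
        return Tier.T3_HIGH
    # Assumption: PII in an internal tool is treated as T2 (not covered by the table).
    if use_case.customer_facing or use_case.handles_pii:
        return Tier.T2_MEDIUM
    return Tier.T1_LOW


chatbot = UseCase(customer_facing=True, handles_pii=False, regulated_or_financial=False)
tier = classify(chatbot)
print(tier.name, "requires approval from:", tier.value)
```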

## Required Artifacts

- [ ] Model card (training data, limitations, bias notes)
- [ ] Prompt version registry
- [ ] Rollback procedure
- [ ] Incident response for hallucination/PII events

## Metrics Dashboard

- Hallucination rate (human-labeled sample)
- Refusal rate on policy-violating requests
- Cost per 1K tokens
- Latency P50/P95
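
The arithmetic behind these metrics is simple enough to sketch. The example below uses the nearest-rank percentile definition, which is one of several common conventions, and the sample figures are made up for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def rate(flagged: int, total: int) -> float:
    """Fraction of items flagged, e.g. hallucinations in a labeled sample."""
    if total == 0:
        raise ValueError("rate over empty sample")
    return flagged / total


def cost_per_1k_tokens(total_cost: float, total_tokens: int) -> float:
    """Total spend normalized to 1,000 tokens."""
    return total_cost * 1000 / total_tokens


latencies_ms = [120, 80, 200, 95, 150, 110, 300, 90, 105, 130]  # illustrative data
print("P50:", percentile(latencies_ms, 50))
print("P95:", percentile(latencies_ms, 95))
print("Hallucination rate:", rate(3, 200))
print("Cost per 1K tokens:", cost_per_1k_tokens(2.5, 50_000))
```

With small samples, P95 under nearest-rank often equals the maximum, so the dashboard is more informative once the window holds a few hundred requests.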

## {{ORGANIZATION_NAME}} Contacts

- AI Governance: {{AI_GOVERNANCE_EMAIL}}
- Security review: {{SECURITY_EMAIL}}
- On-call: {{AI_ONCALL}}

How to use this architecture

  • Use in architecture review meetings or RFC documents
  • Map each component to your cloud accounts, teams, and tools
  • Replace {{PLACEHOLDERS}} with environment-specific values
  • Extend workflow steps with your org's SLAs and governance gates

Tags: llm, governance, evaluation, safety
Updated: Jul 2, 2026