LLM Evaluation & Governance Workflow
Workflow for evaluating, approving, and monitoring LLM applications — offline test suites, risk tiering, production sampling, and governance board process.
AI Architectures · Intermediate · Workflow Template
Architecture Diagram
AWS reference layout with grouped regions, numbered flows, and official service icons.
LLM Evaluation & Governance on AWS: pre-deploy eval gates and post-deploy monitoring
All critical eval tests must pass before deploy · Continuous re-eval on prompt/model changes · Full S3 audit trail
Code preview
Replace {{PLACEHOLDERS}} with your environment values, then deploy to your stack.
# LLM Evaluation & Governance Workflow
> AI Architecture · {{ORGANIZATION_NAME}}
## Purpose
End-to-end workflow for evaluating, approving, and monitoring LLM applications in production.
## Governance Workflow
```
┌──────────┐  design   ┌──────────┐   eval    ┌──────────┐  approve  ┌──────────┐
│ Use Case │ ────────▶ │  Build   │ ────────▶ │   Test   │ ────────▶ │  Deploy  │
│  Canvas  │           │  Prompt  │           │  Suite   │           │   Prod   │
└──────────┘           └──────────┘           └──────────┘           └────┬─────┘
                                                                          │
       ┌──────────────────────────────────────────────────────────────────┘
       ▼
┌─────────────┐
│  Monitor +  │
│   Re-eval   │
└─────────────┘
```
## Evaluation Pipeline
### Offline evaluation (pre-deploy)
1. Curate a golden dataset: {{NUM_TEST_CASES}} Q&A pairs per use case
2. Run the same prompts against both candidate models, {{MODEL_A}} and {{MODEL_B}}
3. Score with exact match, LLM-as-judge, and a human-rated rubric sample
4. Run safety tests: jailbreak attempts, PII leakage, toxic output
5. Gate: all critical tests must pass before staging deploy
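The gate in step 5 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: `EvalResult`, its `critical` flag, `exact_match`, and `gate` are hypothetical names, and the whitespace and case normalization is an assumption about what "exact match" should tolerate.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    passed: bool
    critical: bool  # hypothetical flag: True for tests that must block a deploy


def exact_match(expected: str, actual: str) -> bool:
    """Compare answers after collapsing whitespace and lowercasing (assumed policy)."""
    return " ".join(expected.split()).lower() == " ".join(actual.split()).lower()


def gate(results: list[EvalResult]) -> tuple[bool, list[str]]:
    """Return (ok, blocking); ok is True only if every critical test passed."""
    blocking = [r.name for r in results if r.critical and not r.passed]
    return (not blocking, blocking)


results = [
    EvalResult("golden_exact_match", exact_match("Paris", "  paris "), critical=True),
    EvalResult("jailbreak_suite", passed=False, critical=True),
    EvalResult("tone_rubric", passed=False, critical=False),
]
ok, blocking = gate(results)
print(ok, blocking)
```

Non-critical failures are reported elsewhere but do not block; only the names in `blocking` need to be resolved before the staging deploy.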
### Online evaluation (post-deploy)
1. Sample {{SAMPLE_RATE}}% of production traffic
2. Log redacted inputs and outputs to the evaluation store
3. Collect user feedback and implicit signals (e.g., task completion rate)
4. Send a weekly drift report to {{AI_GOVERNANCE_BOARD}}
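One way to approach steps 1 and 2 is deterministic, hash-based sampling followed by redaction before anything is written to the evaluation store. The sketch below makes several assumptions: `SAMPLE_RATE_PERCENT` stands in for {{SAMPLE_RATE}}, the function names are hypothetical, and the redaction covers email addresses only, which is far narrower than real PII handling would require.

```python
import hashlib
import re
from typing import Optional

SAMPLE_RATE_PERCENT = 5.0  # illustrative stand-in for {{SAMPLE_RATE}}

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def should_sample(request_id: str, rate_percent: float = SAMPLE_RATE_PERCENT) -> bool:
    """Hash the request ID into 10,000 buckets so the decision is repeatable."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < rate_percent * 100


def redact(text: str) -> str:
    """Mask email addresses only; a production redactor needs broader coverage."""
    return EMAIL_RE.sub("[EMAIL]", text)


def to_eval_record(request_id: str, prompt: str, completion: str) -> Optional[dict]:
    """Return a redacted record for sampled requests, or None when not sampled."""
    if not should_sample(request_id):
        return None
    return {
        "request_id": request_id,
        "input": redact(prompt),
        "output": redact(completion),
    }
```

Hashing the request ID rather than calling a random number generator means the same request is always either in or out of the sample, which makes the weekly report reproducible.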
## Risk Tiering
| Tier | Criteria | Approval |
|------|----------|----------|
| T1 Low | Internal docs, no PII | Team lead |
| T2 Medium | Customer-facing, no financial impact | Director + security |
| T3 High | Regulated/financial decisions | Executive + legal |
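The table can be encoded so that a review tool checks sign-offs automatically. The following is a sketch under stated assumptions: the role keys (`team_lead`, `director`, and so on) and function names are illustrative, and the rule that internal use cases touching PII escalate to T2 is an assumption, since the table does not cover that case.

```python
from enum import Enum


class RiskTier(Enum):
    T1_LOW = "T1 Low"
    T2_MEDIUM = "T2 Medium"
    T3_HIGH = "T3 High"


# Approver roles per tier, taken from the table; the role keys are illustrative.
REQUIRED_APPROVERS = {
    RiskTier.T1_LOW: {"team_lead"},
    RiskTier.T2_MEDIUM: {"director", "security"},
    RiskTier.T3_HIGH: {"executive", "legal"},
}


def classify(customer_facing: bool, handles_pii: bool, regulated_or_financial: bool) -> RiskTier:
    """Map use-case attributes to a tier, checking the highest tier first."""
    if regulated_or_financial:
        return RiskTier.T3_HIGH
    # Assumption: internal use cases that touch PII are escalated to T2,
    # because the table only defines T1 as "no PII".
    if customer_facing or handles_pii:
        return RiskTier.T2_MEDIUM
    return RiskTier.T1_LOW


def missing_approvals(tier: RiskTier, sign_offs: set[str]) -> set[str]:
    """Return the approver roles that have not yet signed off."""
    return REQUIRED_APPROVERS[tier] - sign_offs
```

For example, `missing_approvals(classify(True, False, False), {"director"})` returns `{"security"}`, so the deploy stays blocked until security signs off.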
## Required Artifacts
- [ ] Model card (training data, limitations, bias notes)
- [ ] Prompt version registry
- [ ] Rollback procedure
- [ ] Incident response for hallucination/PII events
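A prompt version registry can be as simple as a content-addressed history per prompt, which also gives the rollback procedure something concrete to revert to. This is a hypothetical in-memory sketch; `PromptRegistry`, `register`, and `previous` are illustrative names, and a real registry would persist to a database or version control.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PromptRegistry:
    # Maps prompt name -> ordered history of registered versions.
    _versions: dict = field(default_factory=dict)

    def register(self, name: str, template: str, model: str) -> str:
        """Record a prompt/model pair and return its content-derived version ID."""
        payload = json.dumps({"template": template, "model": model}, sort_keys=True)
        version_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
        history = self._versions.setdefault(name, [])
        # Skip the append when the latest entry is identical (idempotent re-register).
        if not history or history[-1]["id"] != version_id:
            history.append({"id": version_id, "template": template, "model": model})
        return version_id

    def previous(self, name: str) -> Optional[dict]:
        """Return the version before the latest one (the rollback target), if any."""
        history = self._versions.get(name, [])
        return history[-2] if len(history) >= 2 else None


registry = PromptRegistry()
v1 = registry.register("support_bot", "Answer politely: {question}", "model-a")
v2 = registry.register("support_bot", "Answer politely and cite sources: {question}", "model-a")
rollback_target = registry.previous("support_bot")
```

Because the ID is derived from both the template and the model name, a model swap counts as a new version and can trigger re-evaluation just like a prompt edit.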
## Metrics Dashboard
- Hallucination rate (measured on a human-labeled sample)
- Refusal rate on policy-violating requests
- Cost per 1K tokens
- Latency P50/P95
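The dashboard metrics can be computed from logged samples with the standard library alone. The sketch below assumes a nearest-rank percentile definition (dashboards often interpolate instead) and uses made-up sample values; the function names are hypothetical.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def rate(flags: list[bool]) -> float:
    """Fraction of True flags, e.g. hallucination labels or refusals."""
    return sum(flags) / len(flags) if flags else 0.0


def cost_per_1k_tokens(total_cost_usd: float, total_tokens: int) -> float:
    """Average spend per 1,000 tokens over the reporting window."""
    return total_cost_usd / total_tokens * 1000


# Illustrative values only; none of these figures come from a real system.
latencies_ms = [120, 80, 200, 150, 95, 400, 110, 130, 90, 105]
hallucination_labels = [True, False, False, False]  # human-labeled sample
refused_violations = [True, True, False, True]      # one flag per policy-violating request

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
hallucination_rate = rate(hallucination_labels)
refusal_rate = rate(refused_violations)
unit_cost = cost_per_1k_tokens(2.5, 1_250_000)
```

Note that refusal rate here is computed only over requests already known to violate policy; refusals of legitimate requests would need a separate metric.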
## {{ORGANIZATION_NAME}} Contacts
- AI Governance: {{AI_GOVERNANCE_EMAIL}}
- Security review: {{SECURITY_EMAIL}}
- On-call: {{AI_ONCALL}}
How to use this architecture
- Use in architecture review meetings or RFC documents
- Map each component to your cloud accounts, teams, and tools
- Replace {{PLACEHOLDERS}} with environment-specific values
- Extend workflow steps with your org's SLAs and governance gates
Tags: llm, governance, evaluation, safety
Updated: Jul 2, 2026