User Tools

Site Tools


wiki:ai:clearinghouse_sow_analyzer
Ready for approval 2026/02/12 17:42 by mcarver | Approver: @ai-us-principals

SOW Analyzer – Technical User Guide & Reference Manual

1. System Overview

The SOW Analyzer is a Microsoft Teams-integrated AI validation system that evaluates uploaded Statements of Work (SOWs) against predefined “golden” template documents.

Architecture Components:

  • Microsoft Teams Bot (Bot Framework)
  • Azure Web App (FastAPI backend)
  • Azure Document Intelligence (prebuilt-layout model)
  • Azure OpenAI (GPT-based semantic comparison engine)

The system supports:

  • .pdf
  • .doc
  • .docx

Uploaded documents are processed in-memory and are not persisted.


2. High-Level Architecture Flow

Teams User
   ↓
Teams Bot (teams_bot.py)
   ↓
File Download + Validation
   ↓
POST /analyze-sow (app.py)
   ↓
Azure Document Intelligence (layout extraction)
   ↓
Azure OpenAI GPT comparison
   ↓
Structured Response to Teams

3. Teams Bot (teams_bot.py)

3.1 Message Handling

Entry point:

async def on_turn(turn_context: TurnContext)

Behavior:

  • Detect file attachment
  • Extract filename and download URL
  • Validate supported file extension
  • Download file bytes
  • Validate file signature
  • Send multipart POST request to backend analyzer

3.2 Supported File Types

Defined in:

SUPPORTED_FILE_TYPES = {
    ".pdf": "application/pdf",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}

Extension gating:

if not any(filename.lower().endswith(ext) for ext in SUPPORTED_FILE_TYPES.keys()):

3.3 File Signature Validation

Function:

def _validate_file_signature(filename: str, data: bytes)

Validation Rules:

  • PDF → Must begin with %PDF
  • DOCX → Must begin with PK (ZIP header)
  • DOC → Must begin with OLE header (D0 CF 11 E0 A1 B1 1A E1)

Purpose:

  • Prevent file type spoofing
  • Enforce integrity of uploaded documents
  • Mitigate malicious upload risk

3.4 Download Logic

Function:

async def _download_bytes(url: str, filename: str) -> bytes

Security controls:

  • HTTP status validation
  • Empty file detection
  • File size limit enforcement (MAX_UPLOAD_BYTES)
  • Signature validation

3.5 Analyzer POST Logic

Function:

async def _post_to_analyzer(filename: str, file_bytes: bytes, content_type: str)

Behavior:

  • Builds multipart/form-data request
  • Sends file to ANALYZE_SOW_URL
  • Forwards correct MIME type
  • Returns JSON response

4. Backend API (app.py)

4.1 Endpoint Definition

@app.post("/analyze-sow")
async def analyze_sow(file: UploadFile = File(...))

Validation:

  • Content-Type allowlist
  • Extension allowlist
  • Empty file detection

Allowed MIME types:

  • application/pdf
  • application/msword
  • application/vnd.openxmlformats-officedocument.wordprocessingml.document
  • application/octet-stream (fallback)

4.2 Content-Type Mapping Logic

if filename.endswith(".docx"):
    content_type = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
elif filename.endswith(".doc"):
    content_type = "application/msword"
else:
    content_type = "application/pdf"

Purpose:

  • Ensure correct format handling by Document Intelligence
  • Override unreliable client-provided MIME types

5. Document Extraction (layout_client.py)

5.1 Extraction Function

def extract_layout_text(file_bytes: bytes, content_type: str) -> str

Uses:

  • Azure Document Intelligence
  • prebuilt-layout model

Call Pattern:

poller = client.begin_analyze_document(
    "prebuilt-layout",
    document=file_bytes,
    content_type=content_type
)

Output:

  • Structured page-aware text
  • Preserves headings, paragraphs, and tables
  • Returns full reconstructed textual representation

6. GPT Comparison Engine

6.1 Function

compare_against_best_template(
    uploaded_sow_text,
    golden_templates
)

Purpose:

  • Compare uploaded SOW text against approved template corpus
  • Identify missing clauses
  • Detect semantic deviations
  • Provide structured gap analysis

Characteristics:

  • No fine-tuning
  • No persistent training
  • Inference-only AI reasoning
  • Deterministic template baseline

7. Golden Template Management

Golden templates are loaded at application startup and held in memory.

Design Principles:

  • Immutable baseline
  • Controlled governance
  • Centralized review logic
  • No dynamic template training

8. Security Architecture

8.1 No Persistent Storage

Uploaded files:

  • Not written to disk
  • Not stored in blob storage
  • Not logged
  • Not cached

Processing is memory-bound only.


8.2 Input Hardening

Controls implemented:

  • Extension allowlist
  • MIME type enforcement
  • Signature verification
  • File size limits
  • HTTP status validation
  • Timeout configuration

8.3 AI Isolation

Service separation:

  • Teams Bot → Presentation layer
  • FastAPI → Orchestration layer
  • Document Intelligence → Extraction layer
  • Azure OpenAI → Reasoning layer

Benefits:

  • Clear responsibility boundaries
  • Easier auditing
  • Reduced lateral movement risk
  • Improved troubleshooting clarity

9. Configuration Reference

Required Environment Variables:

  • ANALYZE_SOW_URL
  • Azure Document Intelligence endpoint + key
  • Azure OpenAI endpoint + key
  • Model deployment name

Timeouts:

  • Download: 60s
  • Analyzer call: 180s

File Size Limit:

  • Defined via MAX_UPLOAD_BYTES

10. Error Handling Behavior

Error Categories:

  • Unsupported file type → User-facing validation message
  • Empty file → 400 response
  • Layout extraction error → 500 with diagnostic message
  • GPT comparison error → 500
  • Analyzer HTTP failure → surfaced to Teams user

Errors are intentionally explicit to simplify debugging in production.


11. Extension Points

Potential enhancements:

  • Risk scoring model
  • Structured JSON scoring output
  • Audit log integration
  • Template versioning
  • Clause-level similarity scoring
  • Confidence scoring metrics

12. Operational Characteristics

Performance Factors:

  • Document size
  • Azure Document Intelligence latency
  • GPT token size of uploaded document
  • Template corpus size

Scalability:

  • Horizontally scalable via Azure Web App scaling
  • Stateless backend design
  • No session persistence required

13. Compliance Considerations

  • No long-term data retention
  • No model fine-tuning on customer data
  • Deterministic template baseline
  • Azure-hosted inference services
  • Explicit file validation controls

14. Summary

The SOW Analyzer is a stateless, AI-driven SOW validation engine embedded directly in Microsoft Teams. It combines structured document extraction and semantic reasoning while maintaining strong governance controls and zero persistent storage of uploaded contractual documents.

It is designed for:

  • Secure enterprise AI adoption
  • Operational contract standardization
  • Reduced review cycle time
  • Enforced template compliance
  • Scalable AI document validation
wiki/ai/clearinghouse_sow_analyzer.txt · Last modified: by mcarver