Differences

This shows you the differences between two versions of the page.

wiki:ai:ai-operational-plan [2025/05/28 17:35] – created ddehamer
wiki:ai:ai-operational-plan [2025/05/28 17:37] (current) ddehamer
Line 1:
 ====== Operational Plan: Managed AI Services Team ======

-====== 1. Team Structure & Roles ======
+===== 1. Team Structure & Roles =====

 | Role | Responsibilities |
Line 11:
 | Customer Success Engineer | Handle service requests, documentation, usage monitoring |

-====== 2. Scope of Services ======
+===== 2. Scope of Services =====

 AI Platforms Only:
Line 37:
 - Issue response and remediation

-====== 3. Core Operations ======
+===== 3. Core Operations =====

-===== Provisioning & Deployment =====
+==== Provisioning & Deployment ====

 · Use IaC tools (Terraform, Bicep, Deployment Manager)
Line 47:
 · Bootstrap scripts for API/endpoint setup

-===== Automation & Shell Command Support =====
+==== Automation & Shell Command Support ====

 · Secure shell (SSH) access with audit logging
Line 55:
 · CI/CD pipelines for model deployment

-===== Monitoring & Observability =====
+==== Monitoring & Observability ====

 · System monitoring: CPU, GPU, disk
Line 65:
 · Alerts via Slack/Teams/email with PagerDuty/Opsgenie
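The alerting bullet above names delivery channels but not how alerts are routed to them. The following is a minimal sketch of one possible routing policy, not the team's actual configuration: the severity levels, the `ROUTES` table, and the `Alert` and `route` names are all hypothetical, and no real Slack, Teams, or PagerDuty API is called.

```python
from dataclasses import dataclass

# Hypothetical policy (an assumption, not taken from the plan):
# which channels receive an alert at each severity level.
ROUTES = {
    "info": ["email"],
    "warning": ["slack", "teams"],
    "critical": ["slack", "teams", "pagerduty"],
}


@dataclass
class Alert:
    source: str    # e.g. "gpu-node-3" (illustrative value)
    severity: str  # one of the keys in ROUTES
    message: str


def route(alert: Alert) -> list:
    """Return the channel names that should receive this alert."""
    try:
        # Copy the list so callers cannot mutate the shared policy.
        return list(ROUTES[alert.severity])
    except KeyError:
        raise ValueError(f"unknown severity: {alert.severity!r}") from None


if __name__ == "__main__":
    print(route(Alert("gpu-node-3", "critical", "GPU memory exhausted")))
```

Keeping the policy in a plain table makes it easy to review in a runbook and to change without touching the dispatch code.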
  
-===== Python Programming Services =====
+==== Python Programming Services ====

 · Support JupyterHub
Line 73:
 · Support SDKs: openai, boto3, google-cloud-aiplatform, transformers
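A support engineer may want a quick way to confirm which of the SDKs listed above are importable in a given environment. The sketch below uses only the standard library; the helper names `sdk_available` and `check_sdks` are hypothetical, and the mapping assumes the usual import paths (the `google-cloud-aiplatform` package is imported as `google.cloud.aiplatform`).

```python
from importlib.util import find_spec

# pip package name -> import name, for the SDKs listed above.
SDK_MODULES = {
    "openai": "openai",
    "boto3": "boto3",
    "google-cloud-aiplatform": "google.cloud.aiplatform",
    "transformers": "transformers",
}


def sdk_available(module_name: str) -> bool:
    """Return True if the module can be located, without importing it fully."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        # For dotted names, find_spec imports the parent package first and
        # raises ModuleNotFoundError (an ImportError) if the parent is missing.
        return False


def check_sdks(modules=SDK_MODULES) -> dict:
    """Map each pip package name to whether its module is importable."""
    return {pkg: sdk_available(mod) for pkg, mod in modules.items()}


if __name__ == "__main__":
    for pkg, ok in check_sdks().items():
        print(f"{pkg}: {'installed' if ok else 'missing'}")
```

This only reports whether a module can be found; it does not verify versions or credentials.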
  
-===== Issue Remediation Workflow =====
+==== Issue Remediation Workflow ====

 · Detection – Alert received
Line 85:
 · Postmortem – RCA documentation

-====== 4. Security and Access Control ======
+===== 4. Security and Access Control =====

 · RBAC and IAM per platform with least privilege
Line 95:
 · Data encryption at rest and in transit

-====== 5. Toolchain ======
+===== 5. Toolchain =====

 IaC: Terraform, Bicep, Deployment Manager
Line 109:
 CI/CD: GitHub, GitLab, Azure DevOps

-====== 6. SLA & Reporting ======
+===== 6. SLA & Reporting =====

 | Metric | Target |
Line 118:
 | Monthly Review | Cost, optimization, usage trends |

-====== 7. Knowledge Management ======
+===== 7. Knowledge Management =====

 · Maintain runbooks and playbooks
wiki/ai/ai-operational-plan.1748453750.txt.gz · Last modified: by ddehamer