| Role | Responsibilities |
| --- | --- |
| AI Platform Engineer | Deploy, monitor, and manage LLMs on cloud and on-prem infrastructure |
| DevOps Engineer | Automation scripts, CI/CD for model updates, shell-level ops, infrastructure |
| Python Developer | Code AI workflows, model wrappers, APIs, data pipelines |
| Cloud Ops Specialist | Platform-specific expertise (Azure AI, SageMaker, Vertex AI) |
| Monitoring & Incident Lead | Alerting systems, root cause analysis, remediation workflows |
| Customer Success Engineer | Handle service requests, documentation, usage monitoring |
AI Platforms in Scope:
- Azure: Azure OpenAI, Azure ML, Cognitive Services
- AWS: SageMaker, Bedrock, AI Services (Comprehend, Rekognition, etc.)
- Google: Vertex AI, PaLM, Generative AI Studio
On-Prem LLMs:
- Open models such as LLaMA, Mistral, and GPT-J hosted on GPU-equipped Linux VMs
Support Domains:
- Shell access, CLI tools (AWS CLI, gcloud, az)
- Python scripting and notebooks
- Automation (Bash, Terraform, Python)
- Monitoring (Prometheus, Grafana, Azure Monitor, CloudWatch, Stackdriver)
- Issue response and remediation
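The issue-response domain above can be sketched as a small Python helper that scans service log lines for error entries and counts each distinct message, so a remediation ticket can list the top offenders. The bracketed log format and severity tags are assumptions for illustration, not a prescribed standard:

```python
# Hypothetical issue-response helper: count error/critical messages
# in service logs. The "[SEVERITY] message" line format is an
# assumption for this sketch.
from collections import Counter

SEVERITIES = ("ERROR", "CRITICAL")

def summarize_errors(log_lines):
    """Count occurrences of each error/critical message in the log."""
    counts = Counter()
    for line in log_lines:
        for sev in SEVERITIES:
            tag = f"[{sev}]"
            if tag in line:
                # Keep only the message portion after the severity tag.
                counts[line.split(tag, 1)[1].strip()] += 1
                break
    return counts

log = [
    "[INFO] request served in 120 ms",
    "[ERROR] CUDA out of memory",
    "[ERROR] CUDA out of memory",
]
print(summarize_errors(log))  # Counter({'CUDA out of memory': 2})
```

A summary like this feeds directly into the investigation and remediation steps described later in this document.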
- Use IaC tools (Terraform, Bicep, Deployment Manager)
- Maintain VM templates with CUDA, PyTorch, HF Transformers
- Bootstrap scripts for API/endpoint setup
- Secure shell (SSH) access with audit logging
- Model lifecycle scripts (start, stop, etc.)
- CI/CD pipelines for model deployment
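A model lifecycle script can be sketched as a small controller with idempotent start/stop transitions. This is an illustrative skeleton, not the team's actual tooling: in production the transitions would shell out to `systemctl` or call a cloud API, which is indicated here only in a comment.

```python
# Hedged sketch of a model lifecycle controller. Real implementations
# would invoke systemctl or a cloud API where noted; this skeleton
# only records state transitions so the control flow is visible.
class ModelLifecycle:
    def __init__(self, name):
        self.name = name
        self.state = "stopped"

    def start(self):
        if self.state == "running":
            return self.state  # idempotent: already up
        # e.g. subprocess.run(["systemctl", "start", f"llm@{self.name}"])
        self.state = "running"
        return self.state

    def stop(self):
        if self.state == "stopped":
            return self.state  # idempotent: already down
        # e.g. subprocess.run(["systemctl", "stop", f"llm@{self.name}"])
        self.state = "stopped"
        return self.state

svc = ModelLifecycle("mistral-7b")
print(svc.start())  # running
```

Making start/stop idempotent keeps the scripts safe to re-run from CI/CD pipelines or remediation playbooks.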
- System monitoring: CPU, GPU, disk
- Model monitoring: latency, error rate
- Platform monitoring: Azure Monitor, CloudWatch, Stackdriver
- Alerts via Slack/Teams/email with PagerDuty/Opsgenie
- Support JupyterHub
- Maintain Python utility libraries (logging, retry, chaining)
- Support SDKs: openai, boto3, google-cloud-aiplatform, transformers
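One of the shared Python utilities mentioned above — retry — might look like the following sketch. The decorator name and fixed-delay backoff policy are assumptions for illustration:

```python
# Illustrative retry decorator for transient failures (e.g. flaky
# model endpoints). Fixed delay between attempts; exponential backoff
# would be a natural extension.
import functools
import time

def retry(attempts=3, delay=0.0, exceptions=(Exception,)):
    """Retry a callable up to `attempts` times, re-raising on the last failure."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for i in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if i == attempts - 1:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator

calls = {"n": 0}

@retry(attempts=3)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

print(flaky())  # ok (succeeds on the third attempt)
```

Centralizing helpers like this keeps retry and logging behavior consistent across workflow scripts and model wrappers.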
- Detection – Alert received
- Classification – Severity assessment
- Investigation – Logs, shell, diagnostics
- Remediation – Patch/redeploy
- Postmortem – RCA documentation
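The Classification step can be sketched as a function mapping an alert to a severity and its response-time target (P1: 15 min, P2: 1 hour, per the metrics table below). The alert fields and thresholds here are illustrative assumptions, not actual policy:

```python
# Hypothetical severity classifier for the Classification step.
# Alert field names and the 50% error-rate threshold are assumptions.
RESPONSE_TARGETS_MIN = {"P1": 15, "P2": 60}

def classify(alert):
    """Return (severity, response target in minutes) for an alert dict."""
    if alert.get("endpoint_down") or alert.get("error_rate", 0) > 0.5:
        sev = "P1"  # outage or majority of requests failing
    else:
        sev = "P2"
    return sev, RESPONSE_TARGETS_MIN[sev]

print(classify({"endpoint_down": True}))   # ('P1', 15)
print(classify({"error_rate": 0.1}))       # ('P2', 60)
```

Encoding the severity rules in code makes them testable and keeps paging decisions consistent across on-call shifts.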
- RBAC and IAM per platform with least privilege
- Bastion/JIT SSH for VM access
- Audit logs on shell, API, model usage
- Data encryption at rest and in transit
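Audit logging of shell, API, and model usage can be sketched as structured JSON-lines records. The field names below are illustrative assumptions; real deployments would match whatever schema the SIEM or log pipeline expects:

```python
# Sketch of a structured audit entry for shell/API/model usage.
# Field names are assumptions for illustration.
import json
from datetime import datetime, timezone

def audit_record(actor, action, resource):
    """Build one JSON-serializable audit entry with a UTC timestamp."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "resource": resource,
    }

entry = audit_record("alice", "ssh_login", "gpu-vm-01")
print(json.dumps(entry))
```

Structured (rather than free-text) records make audit trails queryable, which matters when reconstructing who touched a model endpoint during an incident.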
IaC: Terraform, Bicep, Deployment Manager
Monitoring: Prometheus, Grafana, ELK, Azure Monitor, CloudWatch, Stackdriver
Automation: GitHub Actions, Azure DevOps, Lambda, Cloud Functions
LLM Frameworks: Hugging Face, LangChain, OpenAI SDKs
Issue Mgmt: Jira, Confluence, PagerDuty/Opsgenie
CI/CD: GitHub, GitLab, Azure DevOps
| Metric | Target |
| --- | --- |
| Uptime per LLM Endpoint | ≥ 99.5% |
| Incident Response Time | P1: 15 min, P2: 1 hour |
| Model Deployment SLA | ≤ 4 hours from request |
| Weekly Reports | Usage, performance, incidents |
| Monthly Review | Cost, optimization, usage trends |
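As a quick sanity check on the uptime target: 99.5% over a 30-day month allows roughly 3.6 hours of downtime (0.5% of 720 hours). A minimal sketch of the calculation, assuming a 30-day reporting window:

```python
# Uptime-target arithmetic: 99.5% of a 720-hour month leaves a
# downtime budget of 720 * 0.005 = 3.6 hours.
MONTH_HOURS = 30 * 24  # 720-hour reporting window (assumption)

def uptime_pct(downtime_hours, period_hours=MONTH_HOURS):
    """Percentage uptime given total downtime over the period."""
    return 100.0 * (period_hours - downtime_hours) / period_hours

print(round(uptime_pct(3.6), 2))  # 99.5
```

Tracking the remaining downtime budget during the month gives early warning before an endpoint breaches its SLA.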
- Maintain runbooks and playbooks
- Central wiki/documentation
- Training tracks for LLM ops and automation