Pursue your passion and potential

Senior Site Reliability Engineer - AI Platform

Bengaluru, India

Caring. Connecting. Growing together.

With these values to guide us, our people are committed to making a meaningful difference in the lives of those we are honored to serve.

Integrity, Compassion, Inclusion, Relationships, Innovation, Performance

Requisition number: 2369780
Job category: Technology
Primary location: Bengaluru, Karnataka
Date posted: 06/24/2026
Overtime status: Exempt
Travel: No

Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health optimization on a global scale. Join us to start Caring. Connecting. Growing together.

We're looking for a hands-on Senior SRE - AI Platforms to build, operate, and continuously improve reliable, secure, and scalable AI platforms that support Generative AI, Large Language Models (LLMs), Agentic AI, and enterprise AI applications.

This role focuses on Site Reliability Engineering (SRE), Cloud Engineering, Platform Operations, DevOps, MLOps/LLMOps, and Production Engineering for AI workloads. The ideal candidate will help ensure AI platforms meet enterprise standards for availability, performance, security, observability, and operational excellence while partnering closely with AI/ML, Data Engineering, and Platform teams to operationalize AI solutions at scale.

This position is primarily focused on hands-on implementation, platform operations, automation, and reliability engineering, while contributing to architecture and platform standards established by senior technical leadership.

Primary Responsibilities:

AI Platform Operations & Reliability
- Operate and support highly available, scalable, and secure AI platforms supporting Generative AI, LLM, RAG, and Agentic AI workloads
- Implement Site Reliability Engineering (SRE) practices to improve platform availability, reliability, performance, and operational efficiency
- Support deployment and operation of AI services, model serving platforms, inference endpoints, AI gateways, and vector databases
- Monitor platform health, identify reliability risks, and implement remediation strategies
- Assist in defining and measuring platform SLIs, SLOs, error budgets, and operational KPIs

MLOps, LLMOps & Agentic Ops
- Support MLOps, LLMOps, and Agentic Ops capabilities including:
  - Model deployment
  - Inference operations
  - Model lifecycle management
  - Monitoring and observability
  - Governance controls
- Deploy and support LLM, RAG, AI Agent, and model-serving workloads across cloud and hybrid environments
- Assist with AI platform onboarding, release management, and production readiness activities
- Contribute to AI operational standards, automation frameworks, and deployment best practices

Cloud Infrastructure & Platform Engineering
- Implement and support cloud-native AI infrastructure across Azure, AWS, and/or GCP
- Deploy and manage Kubernetes-based environments supporting AI workloads
- Implement Infrastructure-as-Code (IaC) and environment automation using enterprise tooling
- Support containerized applications, AI services, and distributed platform components
- Assist with platform scalability, load balancing, auto-scaling, and resource optimization initiatives

Observability & Operational Excellence
- Build and maintain observability solutions covering:
  - Monitoring
  - Logging
  - Distributed tracing
  - Alerting
  - Dashboards
- Establish monitoring for AI workloads including:
  - Availability
  - Latency
  - Throughput
  - Resource consumption
  - Cost utilization
  - Service health
- Participate in incident response, troubleshooting, root cause analysis (RCA), and postmortem activities
- Drive continuous improvements to reduce operational toil and improve platform reliability

Security, Compliance & Governance
- Implement platform security controls including:
  - IAM
  - RBAC
  - Secrets management
  - Encryption
  - Secure networking
- Support governance, compliance, auditability, and security requirements across AI platforms
- Ensure AI systems align with enterprise security and operational standards

Collaboration & Continuous Improvement
- Partner with AI Engineers, Data Scientists, Data Engineers, Platform Engineers, and Cloud teams to support production AI systems
- Contribute to platform engineering standards, reusable automation, and operational best practices
- Support adoption of emerging cloud-native, AI platform, observability, and reliability technologies
- Share operational knowledge and mentor junior engineers where applicable

Comply with the terms and conditions of the employment contract, company policies and procedures, and any and all directives (such as, but not limited to, transfer and/or re-assignment to different work locations, change in teams and/or work shifts, policies regarding flexibility of work benefits and/or work environment, alternative work arrangements, and other decisions that may arise due to the changing business environment). The Company may adopt, vary or rescind these policies and directives in its absolute discretion and without any limitation (implied or otherwise) on its ability to do so.

Required Qualifications:

Bachelor's degree in Computer Science, Engineering, Information Technology, or related field
8+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, Cloud Engineering, Infrastructure Engineering, or Production Operations
Experience supporting production cloud-native platforms and distributed systems
Experience operating AI/ML, Generative AI, LLM, RAG, or model-serving platforms in production environments
Hands-on experience with Kubernetes, containers, and cloud-native infrastructure
Experience implementing Infrastructure-as-Code (Terraform, ARM/Bicep, CloudFormation, or equivalent)
Experience with CI/CD pipelines, deployment automation, and DevOps tooling
Experience with observability platforms supporting monitoring, logging, tracing, and alerting
Scripting and automation experience using Python, Bash, PowerShell, or similar technologies
Knowledge of IAM, RBAC, encryption, secrets management, and cloud security practices
Familiarity with MLOps, LLMOps, and AI operational practices
Solid understanding of SRE principles including:
- Reliability engineering
- Incident management
- Capacity planning
- Monitoring and observability
- Performance optimization
Understanding of:
- High-availability architectures
- Auto-scaling
- Load balancing
- Disaster recovery
- Performance tuning
Proven analytical, troubleshooting, communication, and collaboration skills

Preferred Qualifications:

Experience supporting enterprise AI platforms powering GenAI, LLM, RAG, and Agentic AI applications
Experience with AI platform technologies including MLflow, Kubeflow, Azure ML, SageMaker, Vertex AI, model serving platforms, and AI gateways
Experience operating Kubernetes platforms in enterprise-scale cloud environments
Experience implementing observability solutions using Prometheus, Grafana, Datadog, Splunk, OpenTelemetry, ELK, or similar technologies
Experience with distributed systems, event-driven architectures, Kafka, Spark, and real-time processing platforms
Experience implementing FinOps practices, resource optimization, and cloud cost management
Experience contributing to reliability engineering initiatives, operational improvements, and platform modernization programs
Experience working in healthcare, banking, insurance, financial services, or other highly regulated industries
Experience mentoring junior engineers and contributing to engineering best practices
Understanding of security and compliance frameworks in regulated environments
Familiarity with GitOps, platform engineering, self-service infrastructure, and automation frameworks

Technical Stack

Cloud Platforms: AWS, Azure, GCP
Containers & Orchestration: Docker, Kubernetes (AKS, EKS, GKE), Helm
Infrastructure as Code: Terraform, CloudFormation, ARM, Pulumi
CI/CD & GitOps: Jenkins, GitHub Actions, GitLab CI, ArgoCD
MLOps / LLMOps: MLflow, Kubeflow, SageMaker, Azure ML, Vertex AI
Agentic AI & AI Orchestration: LangChain, LangGraph, AI Agents, RAG Orchestration
Model Serving: KServe, Triton Inference Server, Seldon, Ray Serve, FastAPI
API & Platform Gateway: Kong, NGINX, Envoy, API Gateway
Service Mesh: Istio, Linkerd
Observability: Prometheus, Grafana, ELK Stack, Datadog, OpenTelemetry
Streaming & Messaging: Kafka, Event Hub, Pub/Sub
Data & Storage: S3, ADLS, GCS, Databricks, Snowflake, BigQuery
Security & Governance: IAM, RBAC, Vault, KMS, Encryption, Secrets Management
Networking & Reliability: DNS, CDN, Load Balancers, Traffic Routing, Failover Systems

At UnitedHealth Group, our mission is to help people live healthier lives and make the health system work better for everyone. We believe everyone, of every race, gender, sexuality, age, location and income, deserves the opportunity to live their healthiest life. Today, however, there are still far too many barriers to good health, which are disproportionately experienced by people of color, historically marginalized groups and those with lower incomes. We are committed to mitigating our impact on the environment and to enabling and delivering equitable care that addresses health disparities and improves health outcomes, an enterprise priority reflected in our mission.

Benefits

Our mission of helping people live healthier lives extends to our team members. Learn more about our range of benefits designed to help you live well.

Life

Resources and support to focus on what matters most to you, in every facet of your life.

Emotional

Education, tools and resources to help you reduce and manage stress, build resilience and more.

Physical

Health plans and other coverage to support wellness for you and your loved ones.

Financial

Benefits for today and to help you plan for the future, including your retirement.

Since joining Optum, my professional growth has been significant. The dynamic environment has enhanced my problem-solving abilities, and the company’s commitment to innovation and continuous learning motivates me to stay. Optum provides continuous training, mentorship, and a clear path for advancement, all while supporting a healthy work-life balance.