Senior Site Reliability Engineer - AI Platform

Bengaluru, India

Caring. Connecting. Growing together.

With these values to guide us, our people are committed to making a meaningful difference in the lives of those we are honored to serve.

Requisition number: 2369780
Job category: Technology
Primary location: Bengaluru, Karnataka
Date posted: 06/24/2026
Overtime status: Exempt
Travel: No

Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health optimization on a global scale. Join us to start Caring. Connecting. Growing together.

We're looking for a hands-on Senior SRE - AI Platforms to build, operate, and continuously improve reliable, secure, and scalable AI platforms that support Generative AI, Large Language Models (LLMs), Agentic AI, and enterprise AI applications.

This role focuses on Site Reliability Engineering (SRE), Cloud Engineering, Platform Operations, DevOps, MLOps/LLMOps, and Production Engineering for AI workloads. The ideal candidate will help ensure AI platforms meet enterprise standards for availability, performance, security, observability, and operational excellence while partnering closely with AI/ML, Data Engineering, and Platform teams to operationalize AI solutions at scale.

This position is primarily focused on hands-on implementation, platform operations, automation, and reliability engineering, while contributing to architecture and platform standards established by senior technical leadership.

Primary Responsibilities:

  • AI Platform Operations & Reliability
    • Operate and support highly available, scalable, and secure AI platforms supporting Generative AI, LLM, RAG, and Agentic AI workloads
    • Implement Site Reliability Engineering (SRE) practices to improve platform availability, reliability, performance, and operational efficiency
    • Support deployment and operation of AI services, model serving platforms, inference endpoints, AI gateways, and vector databases
    • Monitor platform health, identify reliability risks, and implement remediation strategies
    • Assist in defining and measuring platform SLIs, SLOs, error budgets, and operational KPIs
  • MLOps, LLMOps & Agentic Ops
    • Support MLOps, LLMOps, and Agentic Ops capabilities including:
      • Model deployment
      • Inference operations
      • Model lifecycle management
      • Monitoring and observability
      • Governance controls
    • Deploy and support LLM, RAG, AI Agent, and model-serving workloads across cloud and hybrid environments
    • Assist with AI platform onboarding, release management, and production readiness activities
    • Contribute to AI operational standards, automation frameworks, and deployment best practices
  • Cloud Infrastructure & Platform Engineering
    • Implement and support cloud-native AI infrastructure across Azure, AWS, and/or GCP
    • Deploy and manage Kubernetes-based environments supporting AI workloads
    • Implement Infrastructure-as-Code (IaC) and environment automation using enterprise tooling
    • Support containerized applications, AI services, and distributed platform components
    • Assist with platform scalability, load balancing, auto-scaling, and resource optimization initiatives
  • Observability & Operational Excellence
    • Build and maintain observability solutions covering:
      • Monitoring
      • Logging
      • Distributed tracing
      • Alerting
      • Dashboards
    • Establish monitoring for AI workloads including:
      • Availability
      • Latency
      • Throughput
      • Resource consumption
      • Cost utilization
      • Service health
    • Participate in incident response, troubleshooting, root cause analysis (RCA), and postmortem activities
    • Drive continuous improvements to reduce operational toil and improve platform reliability
  • Security, Compliance & Governance
    • Implement platform security controls including:
      • IAM
      • RBAC
      • Secrets management
      • Encryption
      • Secure networking
    • Support governance, compliance, auditability, and security requirements across AI platforms
    • Ensure AI systems align with enterprise security and operational standards
  • Collaboration & Continuous Improvement
    • Partner with AI Engineers, Data Scientists, Data Engineers, Platform Engineers, and Cloud teams to support production AI systems
    • Contribute to platform engineering standards, reusable automation, and operational best practices
    • Support adoption of emerging cloud-native, AI platform, observability, and reliability technologies
    • Share operational knowledge and mentor junior engineers where applicable
  • Comply with the terms and conditions of the employment contract, company policies and procedures, and any and all directives (such as, but not limited to, transfer and/or re-assignment to different work locations, change in teams and/or work shifts, policies regarding flexibility of work benefits and/or work environment, alternative work arrangements, and other decisions that may arise due to the changing business environment). The Company may adopt, vary or rescind these policies and directives in its absolute discretion and without any limitation (implied or otherwise) on its ability to do so.

Required Qualifications:

  • Bachelor's degree in Computer Science, Engineering, Information Technology, or a related field
  • 8+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, Cloud Engineering, Infrastructure Engineering, or Production Operations
  • Experience supporting production cloud-native platforms and distributed systems
  • Experience operating AI/ML, Generative AI, LLM, RAG, or model-serving platforms in production environments
  • Hands-on experience with Kubernetes, containers, and cloud-native infrastructure
  • Experience implementing Infrastructure-as-Code (Terraform, ARM/Bicep, CloudFormation, or equivalent)
  • Experience with CI/CD pipelines, deployment automation, and DevOps tooling
  • Experience with observability platforms supporting monitoring, logging, tracing, and alerting
  • Scripting and automation experience using Python, Bash, PowerShell, or similar technologies
  • Knowledge of IAM, RBAC, encryption, secrets management, and cloud security practices
  • Familiarity with MLOps, LLMOps, and AI operational practices
  • Solid understanding of SRE principles including:
    • Reliability engineering
    • Incident management
    • Capacity planning
    • Monitoring and observability
    • Performance optimization
  • Understanding of:
    • High-availability architectures
    • Auto-scaling
    • Load balancing
    • Disaster recovery
    • Performance tuning
  • Proven analytical, troubleshooting, communication, and collaboration skills

Preferred Qualifications:

  • Experience supporting enterprise AI platforms powering GenAI, LLM, RAG, and Agentic AI applications
  • Experience with AI platform technologies including MLflow, Kubeflow, Azure ML, SageMaker, Vertex AI, model serving platforms, and AI gateways
  • Experience operating Kubernetes platforms in enterprise-scale cloud environments
  • Experience implementing observability solutions using Prometheus, Grafana, Datadog, Splunk, OpenTelemetry, ELK, or similar technologies
  • Experience with distributed systems, event-driven architectures, Kafka, Spark, and real-time processing platforms
  • Experience implementing FinOps practices, resource optimization, and cloud cost management
  • Experience contributing to reliability engineering initiatives, operational improvements, and platform modernization programs
  • Experience working in healthcare, banking, insurance, financial services, or other highly regulated industries
  • Experience mentoring junior engineers and contributing to engineering best practices
  • Understanding of security and compliance frameworks in regulated environments
  • Familiarity with GitOps, platform engineering, self-service infrastructure, and automation frameworks

Technical Stack:

  • Cloud Platforms: AWS, Azure, GCP
  • Containers & Orchestration: Docker, Kubernetes (AKS, EKS, GKE), Helm
  • Infrastructure as Code: Terraform, CloudFormation, ARM, Pulumi
  • CI/CD & GitOps: Jenkins, GitHub Actions, GitLab CI, ArgoCD
  • MLOps / LLMOps: MLflow, Kubeflow, SageMaker, Azure ML, Vertex AI
  • Agentic AI & AI Orchestration: LangChain, LangGraph, AI Agents, RAG Orchestration
  • Model Serving: KServe, Triton Inference Server, Seldon, Ray Serve, FastAPI
  • API & Platform Gateway: Kong, NGINX, Envoy, API Gateway
  • Service Mesh: Istio, Linkerd
  • Observability: Prometheus, Grafana, ELK Stack, Datadog, OpenTelemetry
  • Streaming & Messaging: Kafka, Event Hub, Pub/Sub
  • Data & Storage: S3, ADLS, GCS, Databricks, Snowflake, BigQuery
  • Security & Governance: IAM, RBAC, Vault, KMS, Encryption, Secrets Management
  • Networking & Reliability: DNS, CDN, Load Balancers, Traffic Routing, Failover Systems

At UnitedHealth Group, our mission is to help people live healthier lives and make the health system work better for everyone. We believe everyone, of every race, gender, sexuality, age, location and income, deserves the opportunity to live their healthiest life. Today, however, there are still far too many barriers to good health which are disproportionately experienced by people of color, historically marginalized groups and those with lower incomes. We are committed to mitigating our impact on the environment and enabling and delivering equitable care that addresses health disparities and improves health outcomes, an enterprise priority reflected in our mission.

Benefits

Our mission of helping people live healthier lives extends to our team members. Learn more about our range of benefits designed to help you live well.

Life

Resources and support to focus on what matters most to you, in every facet of your life.

Emotional

Education, tools and resources to help you reduce and manage stress, build resilience and more.

Physical

Health plans and other coverage to support wellness for you and your loved ones.

Financial

Benefits for today and to help you plan for the future, including your retirement.

Since joining Optum, my professional growth has been significant. The dynamic environment has enhanced my problem-solving abilities, and the company’s commitment to innovation and continuous learning motivates me to stay. Optum provides continuous training, mentorship, and a clear path for advancement, all while supporting a healthy work-life balance.

Anurag J.

Senior Software Engineering Manager

We’re honored to be recognized for our exceptional work culture:

  • AGWF recognition award
  • 2025 Campus Forward Award from RippleMatch
  • LinkedIn Top Companies 2025
  • Forbes Best Large Employers in the United States 2024
  • America’s Greatest Workplaces 2024