Pursue your passion and potential
Lead Site Reliability Engineer - AI Platforms
Bengaluru, India
Caring. Connecting. Growing together.
With these values to guide us, our people are committed to making a meaningful difference in the lives of those we are honored to serve.
Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health optimization on a global scale. Join us to start Caring. Connecting. Growing together.
Primary Responsibilities:
- Collaborate with research, engineering, and product teams to translate cutting-edge AI advancements into production-ready capabilities. Uphold ethical AI principles by embedding fairness, transparency, and accountability throughout the model development lifecycle
- Comply with the terms and conditions of the employment contract, company policies and procedures, and any and all directives (such as, but not limited to, transfer and/or re-assignment to different work locations, change in teams and/or work shifts, policies regarding flexibility of work benefits and/or work environment, alternative work arrangements, and other decisions that may arise due to the changing business environment). The Company may adopt, vary or rescind these policies and directives in its absolute discretion and without any limitation (implied or otherwise) on its ability to do so
Required Qualifications:
- 8+ years of experience in SRE, DevOps, or Platform Engineering with large-scale systems
- Hands-on experience with observability, monitoring, logging, tracing, alerting, and production operations
- Experience deploying and operating AI inference services, RAG pipelines, vector databases, and AI serving platforms
- Experience building and supporting CI/CD pipelines, deployment automation, and platform operational workflows
- Experience implementing auto-scaling, load balancing, disaster recovery, failover, backup, and business continuity solutions
- Experience supporting multi-region, multi-cluster, and distributed cloud environments
- Experience working with event-driven architectures, messaging systems, and real-time processing workloads
- Experience optimizing platform performance, resource utilization, AI inference workloads, and operational costs
- Experience mentoring junior engineers and contributing to engineering best practices
- Experience supporting production AI/ML, Generative AI, LLM, or data-intensive platforms
- Experience with Kubernetes, containerization, and cloud-native deployment practices
- Experience deploying and supporting AI services, APIs, inference endpoints, and RAG-based solutions
- Experience with Infrastructure as Code (Terraform, CloudFormation, ARM, Pulumi, or equivalent)
- Experience with monitoring, logging, tracing, observability, and alerting platforms
- Experience implementing operational controls for backup, recovery, failover, and disaster recovery processes
- Experience with AWS, Azure, or GCP environments
- Experience supporting production incidents, troubleshooting, root cause analysis, and operational excellence initiatives
- Experience optimizing platform reliability, performance, resource utilization, and operational costs
- Proven experience in SRE, DevOps, Platform Engineering, Cloud Infrastructure, or Production Operations
- Proven experience supporting and operating production-scale AI/ML, Generative AI, and LLM-based platforms
- Solid experience implementing MLOps, LLMOps, model deployment, monitoring, and lifecycle management practices
- Solid experience with cloud-native technologies, Kubernetes, container orchestration, and Infrastructure as Code
- Knowledge of data security, governance, and compliance requirements for enterprise AI platforms
- Knowledge of cloud security, IAM, RBAC, encryption, secrets management, and security best practices
- Understanding of distributed systems, scalability, reliability, fault tolerance, and high-availability concepts
- Good understanding of distributed systems, high availability, scalability, fault tolerance, and reliability engineering principles
- Good understanding of security best practices including IAM, RBAC, encryption, secrets management, and Zero Trust principles
- Familiarity with MLOps, LLMOps, model deployment, monitoring, and AI application lifecycle management
- Familiarity with event-driven architectures, messaging systems, and streaming platforms
- Solid scripting and automation skills using Python, Bash, PowerShell, or equivalent technologies
- Proven troubleshooting, incident management, root cause analysis (RCA), and production support experience
- Proven ability to independently own platform services and reliability initiatives from implementation through operations
- Proven collaboration and stakeholder management skills across AI/ML, Data Engineering, Security, and Platform teams
Technical Stack:
- Cloud Platforms: AWS, Azure, GCP
- Containers & Orchestration: Docker, Kubernetes (AKS, EKS, GKE), Helm
- Infrastructure as Code: Terraform, CloudFormation, ARM, Pulumi
- CI/CD & GitOps: Jenkins, GitHub Actions, GitLab CI, ArgoCD
- MLOps / LLMOps: MLflow, Kubeflow, SageMaker, Azure ML, Vertex AI
- AI Platforms: LangChain, LangGraph, RAG Frameworks, AI Agents
- Model Serving: KServe, Triton, Seldon, Ray Serve, FastAPI
- API & Platform Gateway: Kong, NGINX, Envoy, API Gateway
- Service Mesh: Istio, Linkerd
- Observability: Prometheus, Grafana, ELK Stack, Datadog, OpenTelemetry
- Streaming & Messaging: Kafka, Event Hub, Pub/Sub
- Data & Storage: S3, ADLS, GCS, Databricks, Snowflake, BigQuery
- Security & Governance: IAM, RBAC, Vault, KMS, Encryption, Secrets Management
- Networking & Reliability: DNS, CDN, Load Balancers, Traffic Routing, Failover Systems
Preferred Qualifications:
- Experience with AI model serving platforms such as KServe, Triton, Seldon, or Ray Serve
- Experience with LangChain, LangGraph, RAG orchestration, and Agentic AI workflows
- Experience configuring API gateways, model gateways, and service mesh technologies
- Experience with Istio, Linkerd, or enterprise service mesh platforms
- Experience supporting multi-region and multi-cluster deployments
- Experience in Banking, Healthcare, Financial Services, or other regulated industries
- Knowledge of governance, compliance, and regulatory standards such as GDPR, HIPAA, SOC2, or ISO 27001
- Exposure to GPU-based AI infrastructure and inference workloads
- Exposure to FinOps, cloud cost optimization, and AI infrastructure cost management
- Exposure to Platform Engineering and Internal Developer Platforms (IDP)
At UnitedHealth Group, our mission is to help people live healthier lives and make the health system work better for everyone. We believe everyone, of every race, gender, sexuality, age, location and income, deserves the opportunity to live their healthiest life. Today, however, there are still far too many barriers to good health which are disproportionately experienced by people of color, historically marginalized groups and those with lower incomes. We are committed to mitigating our impact on the environment and enabling and delivering equitable care that addresses health disparities and improves health outcomes, an enterprise priority reflected in our mission.
Benefits
Our mission of helping people live healthier lives extends to our team members. Learn more about our range of benefits designed to help you live well.
Life
Resources and support to focus on what matters most to you, in every facet of your life.
Emotional
Education, tools and resources to help you reduce and manage stress, build resilience and more.
Physical
Health plans and other coverage to support wellness for you and your loved ones.
Financial
Benefits for today and to help you plan for the future, including your retirement.