Pursue your passion and potential

Senior Principal Infrastructure & Operations Eng - Remote or Hybrid

Minnetonka, Minnesota

Caring. Connecting. Growing together.

With these values to guide us, our people are committed to making a meaningful difference in the lives of those we are honored to serve.

Integrity, Compassion, Inclusion, Relationships, Innovation, Performance

Requisition number: 2370606
Job category: Technology
Primary location: Minnetonka, MN
Date posted: 07/02/2026
Overtime status: Exempt
Travel: No

Optum Tech is a global leader in health care innovation. Our teams develop cutting-edge solutions that help people live healthier lives and help make the health system work better for everyone. From advanced data analytics and AI to cybersecurity, we use innovative approaches to solve some of health care's most complex challenges. Your contributions here have the potential to change lives. Ready to build the next breakthrough? Join us to start Caring. Connecting. Growing together.

This Senior Principal role is accountable for advancing enterprise reliability across a complex, high-scale application portfolio by setting the technical direction, operating model, and leadership approach needed to improve stability, resilience, and operational performance.

You'll enjoy the flexibility to work remotely* from anywhere within the U.S. as you take on some tough challenges. For all hires in the Minneapolis or Washington, D.C. area, you will be required to work in the office a minimum of four days per week.

Primary Responsibilities:

Enterprise Reliability Leadership

Establish and execute a comprehensive reliability strategy across a portfolio of 510+ applications supporting critical business operations
Define and govern enterprise reliability standards, Service Level Objectives (SLOs), Service Level Indicators (SLIs), error budgets, resiliency requirements, and operational maturity models
Create a reliability operating model that spans modern cloud-native platforms, legacy systems, mainframe workloads, third-party hosted solutions, and SaaS applications
Serve as the executive leader accountable for enterprise application reliability, stability, recoverability, and operational risk reduction

AI-First SRE Transformation

Design and implement an AI-first approach to reliability engineering leveraging generative AI, AIOps, predictive analytics, autonomous remediation, intelligent alert management, and operational copilots
Identify opportunities to eliminate manual operational work through automation and machine-driven decision support
Establish AI-powered workflows for:
- Incident detection and triage
- Root cause analysis
- Event correlation
- Capacity forecasting
- Reliability risk identification
- Automated remediation
- Knowledge management
- Operational reporting
Deliver measurable reductions in Mean Time to Detect (MTTD), Mean Time to Resolve (MTTR), operational toil, and incident volume

Portfolio Reliability Management

Develop a portfolio-wide reliability framework capable of managing highly heterogeneous technology stacks including:
- Mainframe platforms
- Middleware and integration technologies
- Distributed applications
- Containerized workloads
- Public cloud platforms
- Vendor-hosted applications
- SaaS ecosystems
Establish application criticality tiers and reliability targets across the portfolio
Implement standardized observability and operational telemetry strategies regardless of technology platform

Team Leadership

Build, lead, and mentor a high-performing team of direct reports and contractors
Create an elite SRE organization of five or fewer highly skilled engineers capable of delivering enterprise-scale outcomes through leverage, automation, and platform capabilities
Recruit and develop engineers with solid expertise in software engineering, automation, observability, AI, and systems reliability
Foster a culture of ownership, innovation, operational excellence, and continuous improvement

Vendor and Partner Management

Drive reliability accountability across a complex ecosystem of vendors, managed service providers, and third-party technology partners
Establish operational performance expectations, reliability metrics, service level agreements, and governance mechanisms with external partners
Ensure vendors contribute actionable telemetry, operational transparency, and incident management discipline
Lead escalations and executive-level discussions related to service disruptions and reliability concerns

Observability and Platform Engineering

Define and implement enterprise observability standards across metrics, logs, traces, events, synthetic monitoring, and user experience monitoring
Drive platform engineering initiatives that simplify operational support and reduce application-specific operational burden
Establish self-service reliability capabilities for application teams

Incident and Resilience Management

Lead major incident management, post-incident review processes, and enterprise resilience initiatives
Drive systemic problem elimination through engineering-led root cause analysis and preventive action programs
Develop disaster recovery, business continuity, and resiliency testing strategies
Ensure reliability practices are embedded throughout the software development lifecycle

You'll be rewarded and recognized for your performance in an environment that will challenge you and give you clear direction on what it takes to succeed in your role as well as provide development for other roles you may be interested in.

Required Qualifications:

Undergraduate degree in applicable area of expertise or equivalent experience
10+ years of experience in technology operations, software engineering, infrastructure engineering, platform engineering, or Site Reliability Engineering
5+ years leading enterprise-scale SRE, reliability engineering, or production engineering organizations
Demonstrated experience owning reliability outcomes for portfolios exceeding 200 applications, preferably 500+
Proven success building and leading high-performing engineering teams
Experience managing direct reports, contractors, managed services providers, and vendor relationships
Deep understanding of modern SRE principles including:
- SLOs and SLIs
- Error budgets
- Reliability engineering
- Incident management
- Resilience engineering
- Capacity management
- Observability
Experience supporting diverse technology ecosystems spanning legacy platforms, mainframe, distributed systems, and cloud environments
Proven executive communication and stakeholder management skills

Preferred Qualifications:

Demonstrated implementation of AI-driven operations, AIOps, or autonomous operations capabilities at enterprise scale
Experience leveraging Generative AI, LLMs, operational copilots, agentic workflows, or predictive analytics to improve operational outcomes
Experience leading large-scale operational transformations with measurable business results
Demonstrated background in software engineering or platform engineering
Experience with cloud platforms such as AWS, Azure, or GCP
Demonstrated familiarity with observability platforms such as Datadog, Dynatrace, New Relic, Splunk, Grafana, OpenTelemetry, or similar technologies
Experience in highly regulated industries such as healthcare, financial services, insurance, or government

*All employees working remotely will be required to adhere to UnitedHealth Group's Telecommuter Policy.

Pay is based on several factors including but not limited to local labor markets, education, work experience, certifications, etc. In addition to your salary, we offer benefits such as a comprehensive benefits package, incentive and recognition programs, equity stock purchase and 401k contribution (all benefits are subject to eligibility requirements). No matter where or when you begin a career with us, you'll find a far-reaching choice of benefits and incentives. The salary for this role will range from $134,600 to $230,800 annually based on full-time employment. We comply with all minimum wage laws as applicable.

Application Deadline: This will be posted for a minimum of 2 business days or until a sufficient candidate pool has been collected. Job posting may come down early due to volume of applicants.

At UnitedHealth Group, our mission is to help people live healthier lives and make the health system work better for everyone. We believe everyone, of every race, gender, sexuality, age, location and income, deserves the opportunity to live their healthiest life. Today, however, there are still far too many barriers to good health which are disproportionately experienced by people of color, historically marginalized groups and those with lower incomes. We are committed to mitigating our impact on the environment and enabling and delivering equitable care that addresses health disparities and improves health outcomes, an enterprise priority reflected in our mission.

UnitedHealth Group is an Equal Employment Opportunity employer under applicable law and qualified applicants will receive consideration for employment without regard to race, national origin, religion, age, color, sex, sexual orientation, gender identity, disability, or protected veteran status, or any other characteristic protected by local, state, or federal laws, rules, or regulations.

UnitedHealth Group is a drug-free workplace. Candidates are required to pass a drug test before beginning employment.

