Sr. Site Reliability Engineer - Azure or GCP, Terraform, Python
Hyderabad, India
Caring. Connecting. Growing together.
With these values to guide us, our people are committed to making a meaningful difference in the lives of those we are honored to serve.
Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health optimization on a global scale. Join us to start Caring. Connecting. Growing together.
The Site Reliability Engineer (SRE) is a hybrid software-and-systems engineer who ensures that cloud-based systems are highly reliable, scalable, and efficient. Acting as a bridge between development, platform engineering, security, and IT operations, the SRE brings engineering rigor to operations. In this role, the SRE will focus on Google Cloud Platform (GCP) and Microsoft Azure environments, applying best practices from DevOps and DevSecOps. Key goals include automating infrastructure management, improving deployment workflows (GitOps/CI/CD), and proactively addressing operational issues (from incidents to vulnerabilities and secrets management). The result is an enterprise-grade practice that drives up reliability and security while driving down outages and manual toil.
Infrastructure as Code and GitOps are core to this role: all infrastructure changes are managed through code and Git, enabling consistent, auditable, and automated deployments. By leveraging CI/CD pipelines and even AI-Ops tooling, the SRE minimizes manual work and human error, enforcing the desired state of systems and enabling quick rollbacks when needed.
This SRE role embeds security into operations. The engineer will continuously run vulnerability scans, rotate secrets and certificates, and ensure compliance with security policies by design. By "shifting left" on security - integrating checks early in code and build stages - the SRE helps catch and prevent issues before they reach production.
The SRE's responsibilities span three key areas:
- Cloud Infrastructure & Reliability Engineering
- Git Workflows & CI/CD Pipeline Management
- Operations & Security (DevSecOps)
Primary Responsibilities:
- Cloud Infrastructure & Reliability Engineering
- Design, Provisioning & Automation: Architect and manage cloud infrastructure on GCP and Azure to meet reliability and performance goals. This includes using Infrastructure as Code tools (e.g. Terraform, ARM templates) to provision resources in a repeatable manner. The SRE designs systems for high availability (e.g. multi-zone/regional deployments) and disaster recovery, anticipating failures and planning failover strategies in advance. Automation is key - from auto-scaling configurations to scripted environment setups - to eliminate manual configuration drift and enable rapid, consistent deployments
- Reliability Management (SLIs/SLOs & Performance): Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for critical services (e.g. target uptime, response latency). The SRE continuously monitors these metrics and implements improvements to meet or exceed targets. For example, they might set an availability SLO of 99.9% and ensure architectures (load balancing, clustering, backup) support that goal. They also establish error budgets (tolerated downtime) to balance velocity and stability. The SRE conducts capacity planning and performance testing (load tests, stress tests) to validate that systems can scale and to find bottlenecks before they impact users. When performance issues are identified, the SRE works with engineering to optimize code or scale resources proactively.
- Monitoring & Incident Response: Implement robust monitoring and observability for cloud services. This involves setting up dashboards and telemetry using tools like Google Cloud Operations Suite, Azure Monitor, Prometheus/Grafana, and aggregated logging systems (e.g. Splunk). Alerting rules are configured on key indicators (error rates, latency, resource usage) to ensure early detection of anomalies. The SRE also establishes an on-call rotation and incident response process, being part of 24×7 support for critical systems. During incidents, the SRE will lead troubleshooting (often by examining logs, traces, metrics) and coordinate rapid recovery of services. They perform detailed root cause analysis (RCA) afterward and ensure post-mortems drive fixes to prevent recurrence. An important aspect is continuously tuning alerts to reduce noise and improve signal, and automating responses where possible (for example, auto-restart of failed services, or one-click rollback deployments).
- Resilience & Best Practices: Champion cloud reliability best practices across teams. This includes implementing fault-tolerant architectures (e.g. design for graceful degradation when dependencies fail), enforcing tagging and resource management policies, and validating that new architectures follow the company's reliability standards. The SRE may run chaos engineering exercises (simulated failures) to verify that systems behave as expected under stress. They also work with enterprise architecture and platform teams to review designs for reliability and compliance with well-architected frameworks in both Azure and GCP
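To make the 99.9% availability example concrete, the sketch below works out the error budget that such an SLO implies. The 30-day window and the function names are assumptions chosen for illustration, not figures or tooling from this posting.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability SLO tolerates.

    The 30-day default window is an assumption for illustration.
    """
    if not 0.0 < slo_percent < 100.0:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget
```

Under these assumptions, a 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime; a single 10-minute outage would leave about 77% of that budget for the rest of the window.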
- Git Workflows & CI/CD Pipeline Management
- Modernizing Version Control Workflows: Serve as the custodian of the source code repositories and development workflows. The SRE will evaluate and implement a modern Git workflow that suits the organization (for example, migrating from a legacy model to trunk-based development or a refined GitFlow strategy). The goal is to improve developer productivity and code integration frequency. This involves defining branching strategies, merge policies, and commit hygiene rules that reduce conflicts and ensure high-quality code merges. The SRE sets up guardrails in the Git process, such as pull request templates, mandatory code reviews, and status checks (e.g. automated tests must pass before merge). They will also manage repository structures (e.g. deciding on monorepo vs. multi-repo for various components) and access permissions, ensuring teams have appropriate, secure access to the codebases
- Continuous Integration/Continuous Deployment (CI/CD): Own and improve the CI/CD pipelines that take code from commit to production. The SRE will work with tools like GitHub Actions, Jenkins, Azure DevOps Pipelines, or Google Cloud Build to automate builds, testing, and deployments. Key responsibilities include:
- maintaining build scripts and configuration
- optimizing pipeline speed and reliability, and
- integrating quality gates (unit/integration tests, static code analysis, etc.) into the pipeline
- DevSecOps: The SRE ensures security scans (SAST/DAST, dependency vulnerability scans) and compliance checks are automated in the pipeline so that insecure code is caught early. They also introduce GitOps practices where appropriate - for example, using Git as the single source of truth for infrastructure definitions and employing tools that watch Git for changes to auto-apply infrastructure updates. This means Ops changes (like updating a Docker/Kubernetes config or a Terraform module) go through the same Git PR review process as application code, providing traceability and rollback capabilities.
- Repository Management & Collaboration: Beyond automation, the SRE acts as a coach and facilitator for engineering teams in using version control effectively. They create documentation and training on Git best practices (writing good commit messages, rebasing vs. merging, etc.) and help resolve complex merge conflicts or repository issues. If new tools or platforms (e.g. migrating to a cloud-based source repo or introducing a code analysis tool) are needed, the SRE evaluates and drives their adoption. By improving developer workflows and tooling, the SRE frees engineers to focus on features, with confidence that the path to production is smooth and fast
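As a rough illustration of the automated security gate described above, the sketch below decides whether a pipeline step should fail based on scan findings. The report format (a JSON list of objects with "id" and "severity" keys), the severity names, and the function names are all hypothetical; real scanners each have their own output schema.

```python
import json

# Hypothetical severity scale; real scanners may use different labels.
SEVERITY_RANK = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}


def blocking_findings(findings, threshold="HIGH"):
    """Return the findings at or above the threshold severity.

    Unknown severity labels are treated as blocking (fail closed).
    """
    limit = SEVERITY_RANK[threshold]
    highest = max(SEVERITY_RANK.values())
    return [
        f for f in findings
        if SEVERITY_RANK.get(str(f["severity"]).upper(), highest) >= limit
    ]


def gate(report_path, threshold="HIGH"):
    """Read a JSON report and return a process exit code (0 = pass, 1 = fail)."""
    with open(report_path) as fh:
        findings = json.load(fh)
    blockers = blocking_findings(findings, threshold)
    for f in blockers:
        print(f"BLOCKING {f['severity']}: {f['id']}")
    return 1 if blockers else 0
```

A pipeline step could call `sys.exit(gate("scan-report.json"))` so that a non-zero exit code stops the build; the file name here is a placeholder.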
- Operations & Security (DevSecOps)
- Incident & Problem Management: As part of operations, the SRE leads the effort to handle incidents and continually reduce their impact. This includes participating in an on-call rotation to ensure 24×7 coverage for critical systems. When alerts trigger, the SRE follows defined playbooks to mitigate user impact, such as rolling back a bad deployment or failing over to backup services. They communicate status to stakeholders and coordinate technical teams during major incidents (P1/P2). After incidents, the SRE drives problem management: holding blameless post-incident reviews, documenting the root cause, and following up on action items (e.g. fix bugs, add monitors, adjust capacity). Over time, this yields systemic improvements - for example, trend analysis might reveal a need for more resilience in a particular service, which the SRE will help implement. Key metrics like Mean Time to Detect (MTTD) and Mean Time to Repair (MTTR) are tracked to measure incident response performance, with a mandate to continuously lower these values.
- Vulnerability Management & Patching: The SRE takes ownership of the vulnerability remediation process for both infrastructure and application components. They utilize enterprise scanning tools to regularly scan cloud resources, container images, and code dependencies for known vulnerabilities. When critical, high, medium, or low severity vulnerabilities (CVEs) are reported (e.g. in VMs, containers, or libraries), the SRE works with the appropriate teams to prioritize and apply patches or updates in a timely manner. They maintain a view of the "security health" of the services, often via a dashboard of vulnerability scores or open security findings, and drive this number down. Part of this role is establishing automated workflows for vulnerability management - for instance, setting up CI jobs that fail a build if it introduces a high-severity vulnerability, or scripts that clean up unused resources (like old container images) that pose security risk. The SRE coordinates with Security Operations (SecOps) teams to ensure compliance with enterprise security policies and to get ahead of emerging threats (proactively applying hardening measures or library upgrades).
- Secrets, Access & Configuration Management: Given the focus on cloud and DevSecOps, the SRE is responsible for managing secrets and sensitive configurations in a secure manner. They implement and oversee the use of secret management tools such as Azure Key Vault, Google Secret Manager, or HashiCorp Vault to store API keys, credentials, certificates, and tokens. A strict secret rotation schedule is enforced - for example, ensuring that keys and certificates are rotated and not left expired or exposed (an internal practice might be rotating certain Azure service keys every 90 days). The SRE automates renewal of certificates and rollover of credentials to minimize service disruption. In addition, the SRE helps govern access control in cloud (IAM roles, resource policies) following least-privilege principles, and may assist in managing account permissions like ensuring service accounts and user roles are properly set up for new services
- Operational Tooling & Support: Assist in various operational tasks that ensure smooth running of the platform. This can range from supporting deployments (being on hand during production releases to ensure proper execution and post-deploy monitoring) to maintaining runbooks and knowledge bases for common procedures and troubleshooting steps. The SRE often writes or enhances automation scripts (in Python, Bash, PowerShell, etc.) to handle repetitive tasks like log rotation, data backups, or environment resets - reducing the "toil" for everyone. They also co-own platform operations such as cloud account governance. For example, the SRE might be a co-administrator of the Azure subscription or GCP project, responsible for cloud cost monitoring, tagging compliance, and working with central cloud teams on any account-level changes. In summary, the SRE's operations duty is about keeping the lights on in a secure and efficient manner, while continually pushing for improvements that make the ecosystem more self-healing and resilient
- Comply with the terms and conditions of the employment contract, company policies and procedures, and any and all directives (such as, but not limited to, transfer and/or re-assignment to different work locations, change in teams and/or work shifts, policies in regards to flexibility of work benefits and/or work environment, alternative work arrangements, and other decisions that may arise due to the changing business environment). The Company may adopt, vary or rescind these policies and directives in its absolute discretion and without any limitation (implied or otherwise) on its ability to do so
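The 90-day rotation example mentioned under secrets management can be sketched as a simple age check. The inventory shape (a mapping of secret name to last rotation date) and the function name are assumptions for illustration; in practice the dates would come from whichever secret manager is in use.

```python
from datetime import date, timedelta


def overdue_secrets(last_rotated, today, max_age_days=90):
    """Return the sorted names of secrets last rotated more than max_age_days ago.

    last_rotated maps a secret name to the date of its last rotation
    (a hypothetical inventory format).
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, rotated in last_rotated.items() if rotated < cutoff)
```

A scheduled job could run this check daily and open a ticket or trigger an automated rotation for each name it returns. A secret rotated exactly 90 days ago is not yet flagged under this sketch; whether the boundary is inclusive is a policy choice.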
Required Qualifications:
A successful SRE in this role will have a blend of solid technical expertise, operational experience, and collaborative skills
- Bachelor's degree in computer science or a related field, or equivalent practical experience. Certifications can be an advantage - for example, Google Cloud Professional Cloud DevOps Engineer, Microsoft Certified: DevOps Engineer Expert (Azure), AWS Certified DevOps Engineer - Professional, or similar cloud certifications that demonstrate competence in site reliability and cloud operations
- Work Experience
- For Grade 27, 5+ years of overall experience and 3+ years of relevant SRE experience are needed
- Excellent Python and Terraform (or equivalent) coding skills, demonstrated on high-complexity work
- Excellent communication skills with an ability to present ideas/suggestions and updates to managers and leaders
- Cloud Expertise: Deep understanding of public cloud platforms - particularly GCP and Azure. This includes knowledge of core services (compute, storage, networking, databases) and how to architect distributed systems on these platforms. Experience designing highly available and scalable solutions in a cloud environment is required. Familiarity with multi-cloud management is a plus (knowing the idiosyncrasies of both Azure and GCP)
- Infrastructure as Code & Automation: Hands-on skills with Infrastructure as Code tools such as Terraform (or Azure Resource Manager templates, Cloud Deployment Manager, etc.) to automate provisioning of cloud resources. Comfort with configuration management tools (e.g. Ansible, Chef) and scripting/programming (Python, PowerShell, Bash) to automate operational tasks is expected. Automation-first mindset is key - candidates should demonstrate how they've eliminated manual processes in prior roles
- CI/CD and DevOps Toolchain: Experience building and managing CI/CD pipelines - for example, using Azure DevOps, Jenkins, GitHub Actions, or Google Cloud Build - is essential. The candidate should understand how to design a pipeline for fast feedback and safe deployments (automated testing, artifact management, deployment strategies). Knowledge of containerization and orchestration (Docker, Kubernetes) and how they fit into delivery pipelines is a solid plus, as many cloud workloads use containers
- Observability & Troubleshooting: Solid abilities with monitoring, logging, and tracing tools. This includes setting up and querying telemetry in systems like Azure Monitor/Application Insights, GCP Operations (Stackdriver), Prometheus & Grafana, Datadog, or Splunk. The SRE should be adept at analyzing logs and traces to debug complex issues across distributed systems - for example, tracing an outage to a specific microservice instance or a database choke point. Experience with defining meaningful metrics and alerts (and avoiding alert fatigue) is important
- Security & DevSecOps: Good grasp of security fundamentals in cloud and software delivery. This includes identity and access management (Azure AD, GCP IAM), network security (firewalls, VPC/VNet configurations), and data protection. Experience with vulnerability scanning tools (for container images, code dependencies, etc.) and remediating findings is required. Familiarity with secrets management tools and practices for secret rotation and encryption is expected. An ideal candidate has worked in environments with compliance requirements and can ensure DevOps pipelines support audit needs (e.g. logging changes, approvals for production deploys)
- Incident Response & Resilience: Proven experience in managing high-severity production incidents, including the ability to remain calm under pressure and follow a structured process (engage relevant teams, document findings, etc.). Knowledge of incident management frameworks (like ITIL or SRE best practices around blameless post-mortems) is beneficial. The SRE should be skilled in conducting RCAs and implementing lasting fixes. Additionally, an understanding of resilience testing (chaos engineering, game days) and disaster recovery planning is a plus, indicating the candidate can proactively improve system robustness
- Collaboration & Communication: Proven excellent communication skills are a must. The SRE will frequently coordinate between developers, operations, security, and product teams, so the ability to articulate issues and solutions clearly (in writing and verbally) is critical. They should be able to translate technical findings into business impact for leadership updates during incidents. A collaborative mindset is important - the SRE should be seen as a partner to the dev and platform teams, not just an enforcer. Mentoring developers on reliability and sharing knowledge is part of the job.
At UnitedHealth Group, our mission is to help people live healthier lives and make the health system work better for everyone. We believe everyone, of every race, gender, sexuality, age, location and income, deserves the opportunity to live their healthiest life. Today, however, there are still far too many barriers to good health which are disproportionately experienced by people of color, historically marginalized groups and those with lower incomes. We are committed to mitigating our impact on the environment and enabling and delivering equitable care that addresses health disparities and improves health outcomes, an enterprise priority reflected in our mission.
Benefits
Our mission of helping people live healthier lives extends to our team members. Learn more about our range of benefits designed to help you live well.
Life
Resources and support to focus on what matters most to you, in every facet of your life.
Emotional
Education, tools and resources to help you reduce and manage stress, build resilience and more.
Physical
Health plans and other coverage to support wellness for you and your loved ones.
Financial
Benefits for today and to help you plan for the future, including your retirement.


