Observability Site Reliability Engineer (SRE)

Cognizant Technology Solutions - Phoenix, AZ

posted 16 days ago

Full-time - Mid Level

10,001+ employees

Professional, Scientific, and Technical Services

About the position

The Observability Site Reliability Engineer (SRE) is responsible for ensuring the reliability and scalability of services within the organization. This role focuses on improving Mean Time To Detect (MTTD) and Mean Time To Recover (MTTR) through the implementation of full-stack observability and automation of non-functional engineering via robust CI/CD pipelines.

Responsibilities

Develop and maintain SMART monitoring solutions to enable quicker problem detection and isolation.
Strategize and implement deployment models like Canary or Blue-Green to minimize downtime during deployments.
Utilize increased automation, reusable assets, and self-healing techniques to improve system reliability.
Build resiliency across application and infrastructure layers through Chaos Engineering.
Embed performance and scalability into application design and code from the initial stages.

Requirements

Proven experience in SRE or similar roles with a focus on observability.
Strong understanding of CI/CD pipelines and automation tools.
Experience with deployment models such as Canary or Blue-Green.
Knowledge of Chaos Engineering and its application in building resilient systems.
Ability to work collaboratively in a fast-paced environment.
Bachelor's degree in Computer Science, Engineering, or related field.
Minimum of 3 years in a Site Reliability Engineering role or similar.
Proficiency in monitoring tools and technologies.
Strong analytical and problem-solving skills.
Excellent communication and teamwork abilities.

Nice-to-haves

Cloud technologies: Support resources operating in GCP, Azure
Prior experience using a Commercial Observability/APM solution (Dynatrace, New Relic, Datadog, AppDynamics, Honeycomb)
Solid familiarity with Splunk, Elastic, OpenSearch, Prometheus, Grafana
Prior SRE role
Experience supporting and troubleshooting issues with critical business apps.
Sound knowledge of servers, infrastructure, load balancers, storage, etc.
Solid understanding of Unix/Linux and Windows
Technologies: Kubernetes, Containers, serverless
Languages/Programming: One or more of the following: Bash or ksh, PowerShell, or any other common programming language
Prior experience writing and utilizing Terraform.

Benefits

Collaborative and inclusive workplace environment.
Opportunities for career growth and development.
Support for diversity and inclusion initiatives.
