Site Reliability Engineer - Azure

$124,800 - $145,600/Yr

Motion Recruitment - Dallas, TX

posted 17 days ago

Full-time - Entry Level

Dallas, TX

Administrative and Support Services

About the position

The Site Reliability Engineer (SRE) position at a large financial and tax company in Dallas, TX, focuses on developing and utilizing tools to monitor key metrics of data systems, ensuring reliability and recoverability. The role involves enabling high availability through failover methods, implementing changes to minimize downtime, and facilitating testing of failover procedures. The SRE will also make recommendations for improving reliability patterns in existing and new systems.

Responsibilities

Utilize existing tools to create telemetry streams from each system maintained by DevOps.
Track trends of key metrics to build a repeatable snapshot of the current state of all systems within DevOps and predict failures.
Correlate data from disparate systems to determine underlying causes of issues occurring in seemingly unrelated parts of the enterprise.
Review existing logging and monitoring systems, cutting unnecessary logging and retuning improperly tuned monitor probes.
Develop a suite of dashboards and tools that enable the SRE to track all incoming metrics and surface the most pressing issues.
Continually improve these dashboards for real-time usefulness and after-the-fact analysis.
Generate 'Post Mortem' reports for unplanned outages or system failures.
Prepare 'Scope of Impact' reports for upcoming planned outages or system changes.
Work with other members of DevOps and the Infrastructure team to ensure underlying resources are ready for failover and to help plan for future growth.
Maintain failover documentation and standard operating procedures (SOPs).
Perform regularly scheduled failover testing in conjunction with the rest of the DevOps team, Infrastructure, and Business teams.
Continually seek to improve failover procedures.

Requirements

2+ years of experience in a related field.
Proven experience with monitoring tools such as Datadog, Splunk, New Relic, Prometheus, Grafana, Nagios, etc.
Basic understanding of computer programming and experience working with code, databases, and operating systems.
Ability to work with various groups within the business to communicate changes and the current status of system failures.

Nice-to-haves

A bachelor's degree in Computer Science, Data Science, Computer Information Systems, or a related field is preferred, but commensurate experience is acceptable.
Experience with or exposure to modern DevOps platforms and tools such as Azure, Kubernetes, and Terraform.

Benefits

Full-time position with a pay rate of $60-70/hr depending on experience.
