Site Reliability Engineer - Azure

$124,800 - $145,600/Yr

Motion Recruitment - Dallas, TX

posted 17 days ago

Full-time - Mid Level
Dallas, TX
Administrative and Support Services

About the position

The Site Reliability Engineer (SRE) position at a large financial and tax company in Dallas, Texas, focuses on developing and utilizing tools to monitor key metrics of data systems, ensuring high availability, and facilitating failover procedures. The role involves tracking system reliability, making recommendations for improvements, and generating reports on system performance and outages. This is a long-term contract position with a hybrid work model, requiring onsite presence 1-2 days per week.

Responsibilities

  • Utilize existing tools to create telemetry streams from each system that DevOps maintains.
  • Track trends of key metrics to build a repeatable snapshot of the current state of all systems within DevOps and predict failures.
  • Correlate data from disparate systems to determine the underlying causes of issues that may be occurring in seemingly unrelated parts of the enterprise.
  • Monitor existing logging and monitoring systems, reducing unnecessary logging and correcting improperly tuned monitor probes.
  • Develop a suite of dashboards and tools that enable the SRE to track all incoming metrics and surface the most pressing issues.
  • Continually improve these dashboards to make their information more useful in real time as well as for after-the-fact analysis.
  • Generate 'Post Mortem' reports for unplanned outages or system failures.
  • Prepare 'Scope of Impact' reports for upcoming planned outages or system changes.
  • Work with the other members of DevOps and the Infrastructure team to ensure that underlying resources are ready for failover and to help plan for future growth.
  • Maintain failover documentation and S.O.P.s.
  • Perform regularly scheduled failover testing in conjunction with the rest of the DevOps team, Infrastructure, and our Business teams.
  • Continually seek to improve our failover procedures.

Requirements

  • 2+ years of experience in a Site Reliability Engineer or similar role.
  • Proven experience with monitoring tools such as Datadog, Splunk, New Relic, Prometheus, Grafana, Nagios, etc.
  • Basic understanding of computer programming and experience working with code, databases, and operating systems.
  • Ability to interact with various groups within the business to communicate system changes and failures.
  • Experience working with data systems.

Nice-to-haves

  • A bachelor's degree in Computer Science, Data Science, Computer Information Systems, or a related field is preferred, but commensurate experience is acceptable in lieu of such a degree.
  • Experience with modern DevOps practices, including Azure, Kubernetes, and Terraform.
  • Awareness of different technologies available in the industry for system telemetry and uptime improvements.

Benefits

  • Medical Insurance - Four medical plans to choose from for you and your family
  • Dental & Orthodontia Benefits
  • Vision Benefits
  • Health Savings Account (HSA)
  • Health and Dependent Care Flexible Spending Accounts
  • Voluntary Life Insurance, Long-Term & Short-Term Disability Insurance
  • Hospital Indemnity Insurance
  • 401(k) with match, including pre-tax and post-tax options
  • Paid Sick Time Leave
  • Legal and Identity Protection Plans
  • Pre-tax Commuter Benefit
  • 529 College Saver Plan