Senior Site Reliability Engineer

$110,200 - $151,580/Yr

Federal Reserve Bank - Dallas, TX

posted 3 months ago

Part-time, Full-time - Senior

Monetary Authorities-Central Bank

About the position

As a Senior Cloud Reliability Engineer in the SRE Chapter, you will be accountable for implementing reliability practices through software for the cloud foundational product line in the Federal Reserve. The SRE Chapter is part of the Cloud Solutions & Services department and has overall responsibility for the reliability of the numerous cloud foundational environments in the Federal Reserve System (FRS). You will work as part of cloud foundational platform squads to demonstrate and champion site reliability culture and practices and to exert technical influence throughout your team. Your role will involve solving reliability issues of cloud platforms using software engineering principles; developing and maintaining automations, scripts, and code that automate manual work; and improving the reliability and stability of the cloud platform. You will also develop, integrate, and maintain synthetics (canaries) code to establish the health of the platform, and lead SLI, SLO, and error budget efforts in collaboration with the product team, instrumenting and visualizing them to proactively manage the stability of cloud platforms. In addition, you will implement observability (logs, metrics, traces) and monitoring for cloud foundational platforms, and define and conduct chaos experiments in collaboration with product owners. Finally, you will be responsible for developing and mentoring junior engineers on the team, along with other duties as assigned.

Responsibilities

Champion site reliability culture and practices within cloud foundational platform squads.
Solve reliability issues of cloud platforms using software engineering principles.
Develop and maintain automations, scripts, and code to automate manual work and improve reliability.
Develop, integrate, and maintain synthetics (canaries) code to establish platform health.
Lead SLI, SLO, and error budget efforts in collaboration with product teams.
Implement observability (logs, metrics, traces) and monitoring for cloud foundational platforms.
Define chaos experiments in collaboration with product owners and conduct them.
Develop and mentor junior engineers in the team.
Perform other duties as assigned.

Requirements

5-7 years of experience in the end-to-end enterprise software development life cycle, including maintenance and support.
3+ years of experience in Observability and SRE practices.
Bachelor's degree in Computer Science, Information Systems, or an equivalent background or experience.
Extensive knowledge of and experience working in AWS environments.
Software development experience with one of the following languages: Python or Go.
Experience with observability tools such as Dynatrace, Prometheus, Grafana, AWS CloudWatch, AWS Canary, and AWS EventBridge.
Expertise in automation and tooling.
Working experience in Agile and Scaled Agile environments.
Experience supporting infrastructure for large multi-service applications.
Knowledge of secure coding standards and the banking environment is a plus.

Nice-to-haves

Strong analytic and problem-solving skills.
Strong customer focus and communication skills.
Independent critical thinking and decision-making abilities.
Excellent written and oral communication abilities.

Benefits

Great medical benefits
Pension and 401(k) with employer match
Paid time off
Tuition reimbursement
Employee resource networks
Paid volunteer leave
Flexible work options
Onsite amenities that make working here fun!
