Principal Site Reliability Engineer

Fidelity Investments • Posted about 2 months ago

Full-time • Senior

Westlake, TX

About the position

The position involves building and operating highly resilient platforms in AWS cloud environments. The role requires coordination of systems using Infrastructure as Code tools such as IAM, ARM, Terraform, and Chef. The candidate will perform reliability engineering throughout the entire Software Development Lifecycle (SDLC) using programming languages such as Python, Node.js, or Java. Responsibilities include deploying and supporting distributed multi-tiered application systems using Kubernetes and Continuous Integration/Continuous Deployment (CI/CD) pipelines. The role also involves creating dashboards that capture the latency, availability, error, and saturation performance of applications using tools such as Splunk, Grafana, Prometheus, Catchpoint, and Datadog. Additionally, the candidate will create Service-Level Indicator/Service-Level Objective (SLI/SLO) dashboards and automate dashboard updates and the creation of new dashboards. The position requires identifying and resolving application issues using Datadog, Prometheus, and Splunk, as well as creating, maintaining, and tuning monitors using ELK, OpenSearch, and OpenTelemetry. The candidate will support applications hosted in AWS Cloud and Kubernetes, and build, deploy, automate, and support application services across multiple technology platforms, frameworks, and languages.

Responsibilities

Provides automated solutions for business and technology operational activities and manual tasks.
Analyzes the observability, resiliency, availability, and performance of applications.
Triages, deep dives, and executes root cause analysis.
Provides resolution of business and system issues through enhancement initiatives.
Resolves issues during critical outages to avoid negative business impact.
Contributes to product architectural solutions addressing high impact system issues.
Deploys and supports distributed multi-tiered application systems.
Manages the scalability and resiliency of applications.
Ensures daily business operations are not impacted by system issues.
Consults across the enterprise to plan for and implement enhancements to systems.
Establishes end-to-end flow of application systems to quickly identify and resolve critical business issues.
Tests the resiliency of application systems using Chaos Engineering techniques.
Mentors junior team members.

Requirements

Bachelor’s degree in Computer Information Systems, Computer Science, Engineering, Information Technology, Information Systems, Mathematics, Physics, or a closely related field and five (5) years of experience as a Principal Site Reliability Engineer.
Alternatively, a Master’s degree in a related field and three (3) years of experience as a Principal Site Reliability Engineer.
Demonstrated expertise in performing site reliability engineering to analyze observability, resiliency, availability, instrumentation, and performance of distributed applications.
Experience creating dashboards and monitors using Splunk, Grafana, Prometheus, Catchpoint, Telemetry, and Datadog.
Experience developing Kubernetes platforms and automations in public and private cloud (RKS, EKS, AKS) using Python, Shell Scripting, Git, Docker, and Kubernetes.
Experience automating business and technology operational activities using Jenkins Core, uDeploy, RunDeck, Ansible, and AWX.
Experience performing triage and root cause analysis in multi-tiered fund accounting application systems.