Lead Site Reliability Engineer (SRE)

T. Rowe Price Group - Baltimore, MD

posted 5 months ago

Full-time - Mid Level

Remote - Baltimore, MD

Funds, Trusts, and Other Financial Vehicles

About the position

At T. Rowe Price, we are seeking a highly motivated and experienced Lead Site Reliability Engineer (SRE) to join our CDO Technology Group. In this role, you will be instrumental in ensuring the availability, latency, performance, efficiency, and stability of our critical infrastructure that supports a variety of data platforms, applications, and services. You will work closely with development teams to implement and maintain reliable and scalable systems while adhering to industry best practices and security standards. Your contributions will directly impact the quality of service we provide to our clients and the overall success of our organization.

As a Lead SRE, you will proactively monitor our systems to identify potential issues that could affect availability. You will implement automated alerting mechanisms to notify the appropriate parties of outages or performance degradation. Collaborating with development teams, you will design and implement solutions that enhance system resilience and reduce downtime. You will analyze performance metrics to identify and resolve latency bottlenecks, ensuring that new features and code changes do not introduce performance regressions. Your responsibilities will also include developing and maintaining metrics dashboards to track key performance indicators (KPIs) for our critical systems, identifying performance trends, and recommending optimization strategies.

You will participate in the release planning process to ensure smooth software releases and develop automated deployment and rollback procedures to mitigate risks associated with updates. Additionally, you will design and maintain a comprehensive monitoring infrastructure to track system health and performance, respond to incidents, and analyze root causes to implement preventive measures. Staying abreast of emerging technologies and industry best practices, you will contribute to the continuous improvement of our practices and tools. You will collaborate with the reliability and infrastructure engineering team to build synergy in tooling for observability, tracing, and alerting, ensuring high availability and proper disaster recovery strategies are in place.

Responsibilities

Proactively monitor and identify potential issues impacting system availability.
Implement and maintain automated alerting mechanisms for outages or performance degradation.
Collaborate with development teams to enhance system resilience and reduce downtime.
Analyze performance metrics to identify and resolve latency bottlenecks.
Implement performance optimization techniques to improve system responsiveness.
Develop and maintain metrics dashboards to track key performance indicators (KPIs).
Identify performance trends and anomalies that signal potential issues or opportunities for improvement.
Optimize resource utilization and minimize unnecessary IT infrastructure expenditure.
Participate in the release planning process for smooth software releases.
Develop and implement automated deployment and rollback procedures.
Monitor the performance of new releases and address issues promptly.
Design and maintain a comprehensive monitoring infrastructure for system health.
Analyze monitoring data to proactively troubleshoot potential issues.
Respond promptly to incidents and work collaboratively to resolve them.
Analyze root causes of incidents and implement preventive measures.
Document incident responses and lessons learned to enhance incident handling processes.
Participate in capacity planning exercises to anticipate future workloads.
Stay updated on emerging technologies and industry best practices in site reliability engineering.

Requirements

Bachelor's degree in Computer Science, Information Technology, or a related field.
8+ years of experience as a Site Reliability Engineer or in an equivalent role.
Proven experience in monitoring, analyzing, and optimizing large-scale distributed systems.
Expertise in Linux systems administration, including managing servers and network configurations.
Strong scripting and automation skills, preferably in Bash, Python, or similar languages.
Familiarity with AWS.
Experience with DevOps tools and practices, such as GitLab CI/CD and Docker.
Excellent troubleshooting and problem-solving skills.
Ability to work independently and collaboratively, communicating technical concepts effectively.
A passion for maintaining high availability, performance, and reliability in a fast-paced environment.

Nice-to-haves

Experience with cloud-native technologies and microservices architecture.
Knowledge of container orchestration tools like Kubernetes.
Familiarity with configuration management tools such as Ansible or Puppet.

Benefits

Competitive salary and comprehensive benefits package.
Opportunity to work with cutting-edge technologies.
Collaborative and supportive work environment with a focus on continuous learning and professional development.
Flexible and remote work opportunities.
Health care benefits (medical, dental, vision).
Tuition assistance.
Wellness programs (fitness reimbursement, Employee Assistance Program).
