T. Rowe Price Group - Baltimore, MD
posted 4 months ago
At T. Rowe Price, we are seeking a highly motivated and experienced Lead Site Reliability Engineer (SRE) to join our CDO Technology Group. In this role, you will be instrumental in ensuring the availability, latency, performance, efficiency, and stability of the critical infrastructure that supports a variety of data platforms, applications, and services. You will work closely with development teams to implement and maintain reliable and scalable systems while adhering to industry best practices and security standards. Your contributions will directly impact the quality of service we provide to our clients and the overall success of our organization.
As a Lead SRE, you will proactively monitor our systems to identify potential issues that could affect availability. You will implement automated alerting mechanisms to notify the appropriate parties of outages or performance degradation. Collaborating with development teams, you will design and implement solutions that enhance system resilience and reduce downtime. You will analyze performance metrics to identify and resolve latency bottlenecks, ensuring that new features and code changes do not introduce performance regressions.
Your responsibilities will also include developing and maintaining metrics dashboards to track key performance indicators (KPIs) for our critical systems, identifying performance trends, and recommending optimization strategies. You will participate in the release planning process to ensure smooth software releases and develop automated deployment and rollback procedures to mitigate risks associated with updates. Additionally, you will design and maintain a comprehensive monitoring infrastructure that tracks system health and performance, respond to incidents, and analyze root causes to implement preventive measures. Staying abreast of emerging technologies and industry best practices, you will contribute to the continuous improvement of our practices and tools.
You will collaborate with the reliability and infrastructure engineering team to align tooling for observability, tracing, and alerting, and to ensure that high availability and sound disaster recovery strategies are in place.