Principal Site Reliability Engineer

Brightspeed - Charlotte, NC

posted 5 months ago

Full-time - Principal

About the position

Brightspeed is seeking a Principal Site Reliability Engineer to join our growing team. In this role, you will implement and maintain monitoring systems that track the performance and availability of our business-critical systems and infrastructure. You will use metrics to identify trends and potential issues so that our services stay reliable and scalable, and you will collaborate closely with development teams, operations, and other stakeholders to ensure that new services and features meet the highest standards of reliability and performance.

Your duties will include responding to system outages and performance issues, performing root cause analysis to prevent recurrence, and developing scripts and tools to automate repetitive tasks such as deployment, scaling, and monitoring. You will work on reducing latency and improving the speed of data transmission across our network, and you will define and measure Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure that our services meet required performance and availability targets. You will also conduct postmortems after incidents to identify areas for improvement, and work with lead application owners and internal change management to review code changes and support deployments.

This is a leadership role: you will lead a team of onshore and offshore site reliability engineers and mentor them in the support activities required for system reliability. The ability to communicate effectively with multiple audiences, including senior business and IT leadership, technology teams, and business teams, is essential for success in this position.

Responsibilities

Implement and maintain monitoring systems to track the performance and availability of business-critical systems and infrastructure.
Use metrics to identify trends and potential issues.
Respond to system outages and performance issues, performing root cause analysis to prevent recurrence.
Develop scripts and tools to automate repetitive tasks, such as deployment, scaling, and monitoring.
Work closely with development teams, operations, and other stakeholders to ensure that new services and features are reliable and scalable.
Work on reducing latency and improving the speed of data transmission across the network.
Define and measure Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure services meet required performance and availability targets.
Conduct postmortems after incidents to identify what went wrong and what can be improved.
Work with lead application owners and internal change management to review code changes and support deployments.
Lead the team of onshore and offshore site reliability engineers, mentoring them in the support activities required for system reliability.

Requirements

Master's degree in computer science, telecommunications, or a related field.
Minimum of 10 years of software engineering experience, including a minimum of 5 years as a site reliability engineer.
Proven track record of managing mission-critical customer-facing applications for reliability.
5+ years of experience supporting operations and maintenance for cloud-native applications in production that are fault-tolerant, self-healing, scalable, and highly available.
Excellent troubleshooting and problem-solving skills, with keen attention to detail to identify and resolve complex production issues.
Deep understanding of cloud computing platforms (GCP) and containerization technologies (e.g., Docker, Kubernetes).
Solid experience with core Kubernetes concepts such as Pods, Workloads, Services, Ingress/Egress, Deployments, ConfigMaps, HPA, Liveness Probes, and Secrets.
Strong knowledge of infrastructure as code tools (e.g., Terraform, Ansible, ArgoCD) and CI/CD pipelines.
Strong experience integrating code quality tools (SonarQube or Checkmarx) into CI/CD pipelines.
Strong experience with monitoring, logging, and observability tools such as Splunk, Google Cloud Logging, and Dynatrace.
Ability to work independently and as part of a collaborative team, effectively communicating technical concepts to both technical and non-technical stakeholders.
Proven written and verbal communication skills, including presentations using tools like PowerPoint.

Nice-to-haves

Certifications such as Google Professional Cloud DevOps Engineer or AWS Certified DevOps Engineer.

Benefits

Competitive medical, dental, vision, and life insurance.
Employee assistance program.
401(k) plan with company match.
Comprehensive benefits and paid time off programs promoting overall wellness through physical, emotional, and financial health.
