Zayo Group - Denver, CO

posted 19 days ago

Full-time - Senior
Denver, CO
Specialty Trade Contractors

About the position

Zayo is seeking a Principal Site Reliability Engineer (SRE) to ensure the uptime, performance, and scalability of its critical infrastructure. This role involves developing automation solutions, implementing monitoring systems, managing incidents, and collaborating with various teams to translate business needs into reliable technical solutions. The ideal candidate will have a strong background in Site Reliability Engineering and a passion for building efficient systems.

Responsibilities

  • Develop and implement automation solutions to streamline operations and reduce manual effort.
  • Design and implement effective monitoring and alerting systems to proactively identify and address issues.
  • Own the incident lifecycle, leading root cause analysis and resolution, and implementing preventative measures.
  • Be on-call to diagnose and resolve critical service outages.
  • Proactively identify and mitigate potential system risks, focusing on automation, monitoring, and tooling to ensure high service availability.
  • Design and implement solutions to ensure infrastructure can handle growing demands while maintaining optimal application performance.
  • Work closely with developers, product managers, and other engineers to translate business needs into robust and reliable technical solutions.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • Minimum of ten (10) years of experience in a Site Reliability Engineering or related role.
  • Strong understanding of system administration, Linux, and scripting languages (Python and various shells).
  • Expert at developing automation tools for monitoring, alerting, and deployment to ensure efficient and reliable operations.
  • Expert at designing and implementing monitoring systems at scale.
  • Expert at container orchestration (Kubernetes and Docker).
  • Experience with monitoring platforms such as SevOne, Assure1, and Nagios, as well as various vendor NMS platforms.
  • Previous work in large-scale distributed production environments.
  • Experience with a variety of cloud platforms and tools (AWS, Google Cloud, etc.).
  • Experience with a variety of monitoring and alerting tools (Prometheus, Grafana, Cacti, etc.).
  • Strong working knowledge of networking concepts and application protocols, especially TCP/IP, BGP, DNS, TLS, and HTTP/S.
  • Experience with infrastructure management tools such as Ansible, Terraform, and Puppet to deploy and manage infrastructure at scale.
  • Proven leadership skills, with the ability to mentor and inspire others.
  • Excellent problem-solving, analytical, and critical thinking skills.
  • A passion for automation and building efficient systems.

Nice-to-haves

  • Experience working with various vendor APIs (or NETCONF), including those from Nokia, Juniper, Fujitsu, Infinera, Cisco, and Ciena.
  • Experience with various network orchestration platforms such as Ciena Blue Planet MDSO, Cisco NSO, Nokia NSP, or others.

Benefits

  • Excellent Health, Dental & Vision Insurance
  • Retirement 401(k) Savings Plan
  • Fitness membership discounts
  • Generous paid time off policy including paid parental leave