Site Reliability Engineering (SRE) and NOC Manager

Unclassified - Beaverton, OR

posted 3 months ago

Full-time - Manager

About the position

DAT is seeking a Site Reliability Engineering (SRE) and NOC Manager to lead the teams that ensure the reliability, availability, and performance of our systems and services. In this role you will manage a team of Site Reliability Engineers and NOC engineers, combining technical expertise with leadership skills to develop and implement reliability strategies and operational practices.

You will oversee the development of best practices for gathering and reporting Site Reliability metrics, and you will manage our Incident Management processes. You will own our observability tooling and ensure that the SRE team effectively ingests and manages work across our products and services. The role also includes ownership of incident management tooling and processes, as well as end-to-end remediation reporting.

As a leader, you will mentor and guide your teams, providing support and career development opportunities. You will be expected to drive improvements in incident management processes and reporting, drawing on your hands-on experience triaging and remediating incidents. The ideal candidate has experience with observability tools such as New Relic, Datadog, or CloudWatch, and can debug code in languages such as Java, Python, Go, and C. The role calls for a proactive approach: you should be constantly looking for ways to improve our processes, systems, and overall reliability.

Responsibilities

Lead, mentor, and manage a team of Site Reliability Engineers and NOC engineers.
Develop and execute strategies to improve system reliability and performance.
Oversee the implementation of reliability strategies and operational practices.
Manage incident management processes and reporting.
Utilize observability tools to gather and report Site Reliability metrics.
Debug code in Java, Python, Go, and C as needed.
Drive improvements in incident management processes.

Requirements

Proven experience leading SRE and NOC teams.
Hands-on experience with incident triaging and remediation.
Experience with observability tools such as New Relic, Datadog, or CloudWatch.
Ability to debug code in Java, Python, Go, and C.
Strong leadership and mentoring skills.

Nice-to-haves

Experience in a SaaS technology environment.
Familiarity with transportation supply chain logistics.

Benefits

Equal employment opportunities (EEO) for all employees and applicants.
Support for career development opportunities.
Diversity and inclusion initiatives.