Unclassified - Beaverton, OR
posted 3 months ago
DAT is seeking a Site Reliability Engineering (SRE) and NOC Manager to lead the teams responsible for the reliability, availability, and performance of our systems and services. In this role you will manage a team of Site Reliability Engineers and NOC engineers, combining technical expertise with leadership skills to develop and implement reliability strategies and operational practices.

You will oversee the development of best practices for gathering and reporting site reliability metrics and manage our incident management processes. You will own our observability tooling and ensure that the SRE team effectively ingests and manages work across our products and services. The role also includes ownership of incident management tooling and processes, as well as end-to-end remediation reporting.

As a leader, you will mentor and guide your teams, providing support and career development opportunities. You will be expected to drive improvements in incident management processes and reporting, drawing on hands-on experience triaging and remediating incidents.

The ideal candidate has experience with observability tools such as New Relic, Datadog, or CloudWatch, and can debug code in languages such as Java, Python, Go, and C. This role requires a proactive approach: you should be constantly looking for ways to improve our processes, systems, and overall reliability.