Lead Site Reliability Engineer, Networking

DraftKings • posted 20 days ago

$148,000 - $185,000/Yr

Full-time • Senior

Boston, MA

Performing Arts, Spectator Sports, and Related Industries

About the position

We're defining what it means to build and deliver the most extraordinary sports and entertainment experiences. Our global team is trailblazing new markets, developing cutting-edge products, and shaping the future of responsible gaming. Here, "impossible" isn't part of our vocabulary. You'll face some of the toughest but most rewarding challenges of your career. They're worth it. Channeling your inner grit will accelerate your growth, help us win as a team, and create unforgettable moments for our customers.

As a Lead Site Reliability Engineer, you will drive key initiatives to enhance the reliability, scalability, and efficiency of our infrastructure. You'll collaborate across teams to architect infrastructure automation while mentoring other engineers to foster a culture of continuous learning and innovation. In this role, you will shape deployment strategies, performance tuning, and monitoring frameworks to support our rapid growth.

Responsibilities

Lead SRE initiatives across multiple projects and products, collaborating with cross-functional teams to shape platform and infrastructure engineering efforts across the organization.
Drive technical excellence by mentoring and guiding engineers, fostering a culture of continuous learning and innovation.
Architect and automate self-healing, fault-tolerant infrastructure with declarative configurations, GitOps, and event-driven automation for scalable deployments across public clouds and on-premises environments.
Design, develop, and maintain software-driven infrastructure automation to build internal tools and eliminate repetitive operational tasks.
Own and drive decisions on product deployment, performance tuning, monitoring, and alerting to ensure high availability and system efficiency in production.
Define key metrics and SLAs for new web services built to support our rapid traffic growth.
Design and implement monitoring and alerting strategies to enforce application SLAs.

Requirements

At least 6 years of experience managing distributed cloud environments (GCP, AWS, vSphere, Nutanix) and platform automation at scale.
Deep expertise in container orchestration (Kubernetes) and container runtimes (Docker, containerd), with the ability to design, scale, and troubleshoot complex workloads.
Expert-level understanding of networking and web concepts, with the ability to debug issues down to the packet level.
Strong experience developing software for automation and infrastructure tooling (Go, Python).
Strong understanding of Linux-based operating systems, including performance tuning, kernel debugging, and low-level system optimizations.
Experience with Infrastructure as Code (IaC) and configuration management tools (Terraform, Ansible, Chef, etc.), ensuring scalable and repeatable infrastructure provisioning.
Understanding of applications written in object-oriented languages (C#/.NET, Java).
Experience leading engineering teams and guiding technology roadmaps in large-scale, distributed environments.