ByteDance - Bellevue, WA

posted 2 months ago

Full-time - Mid Level
Bellevue, WA
Professional, Scientific, and Technical Services

About the position

As a Senior Site Reliability Engineer at ByteDance, you will play a crucial role in enhancing the lifecycle of our infrastructure services. This includes everything from the initial design and development phases to capacity planning, launch reviews, deployment, operation, and ongoing refinement of our systems. You will be responsible for designing and implementing software platforms and monitoring frameworks that support efficient, automated, and intelligent governance of our service-oriented architecture (SOA). Your work will directly contribute to scaling our systems sustainably through automation, while also evolving the reliability, efficiency, and velocity of our services by advocating for necessary changes. In this position, you will maintain services to meet our service-level agreements (SLAs) and service-level objectives (SLOs) by continuously measuring and monitoring the availability, performance, and overall health of our systems. You will also provide user support, respond to incidents, and conduct postmortems to analyze and learn from any issues that arise. Participation in technical operations and rotations will be expected, particularly in response to performance and reliability challenges. Additionally, you will have the opportunity to mentor junior Site Reliability Engineers and interns, fostering their growth and development within the team.

Responsibilities

  • Help improve the whole lifecycle of infrastructure services from inception and design, throughout development, capacity planning and launch reviews, to deployment, operation and refinement.
  • Design and implement software platforms and monitoring frameworks for efficient, automated and intelligent service-oriented architecture (SOA) governance.
  • Scale systems sustainably through mechanisms such as automation, and evolve system reliability, efficiency, and velocity by pushing for changes.
  • Maintain services to meet service-level agreements (SLAs) or service-level objectives (SLOs) by measuring and monitoring availability, performance, and overall system health.
  • Provide user support, incident response, and postmortems.
  • Participate in technical operations and rotations in response to performance and reliability issues.
  • Mentor junior SREs and interns.

Requirements

  • Must have a Master's degree in Computer Science, Engineering (any), Information Technology, Mathematics, Statistics, Physics, or a related field, and 2 years of related work experience; OR a Bachelor's degree in Computer Science, Engineering (any), Information Technology, Mathematics, Statistics, Physics, or a related field, and 5 years of post-bachelor's, progressive related work experience.
  • Of the required experience, must have 2 years of experience in developing Docker containers for multiple services and deploying, managing, and monitoring them in a Kubernetes cluster.
  • 2 years of experience in developing CI/CD tools in Python and writing Bash shell automation scripts.
  • 2 years of experience in automating and managing the full release lifecycle, including code development, build, testing, validation, and deployment.
  • 2 years of experience in configuring Linux virtual machines to execute builds and maintaining associated TCP/IP, routing, network topologies, storage, and operating systems.