Nvidia - Santa Clara, CA

posted 2 months ago

Full-time - Senior
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline focused on designing, building, and maintaining large-scale production systems with high efficiency and availability. This specialized field requires a deep understanding of systems, networking, coding, database management, capacity management, continuous delivery, deployment, and open-source cloud technologies such as Kubernetes and OpenStack.

The SRE team at NVIDIA is dedicated to ensuring that both internal and external GPU cloud services operate with maximum reliability and uptime, as promised to users. This involves enabling developers to implement changes to existing systems through careful preparation and planning, while also monitoring capacity, latency, and performance.

SRE is not just a role but a mindset and a set of engineering practices aimed at optimizing production systems. Much of the software development within the SRE team is geared towards automating manual tasks, enhancing performance, and increasing the efficiency of production systems. SREs are responsible for understanding the interconnections between various systems, employing a wide range of tools and methodologies to address diverse challenges. Key practices include minimizing reactive operational work, conducting blameless postmortems, and proactively identifying potential outages, all of which contribute to iterative improvements that enhance product quality and create a dynamic work environment.

The culture within the SRE team emphasizes diversity, intellectual curiosity, problem-solving, and openness, which are crucial for success. The organization values collaboration and encourages team members to think creatively and take calculated risks in a supportive, blame-free environment. NVIDIA promotes self-direction, allowing employees to engage in meaningful projects while providing the necessary support and mentorship for professional growth.

Responsibilities

  • Design, implement, and support operational and reliability aspects of large-scale Kubernetes clusters with a focus on performance at scale, real-time monitoring, logging, and alerting (a brief illustrative sketch follows this list).
  • Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation, and refinement.
  • Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management, and launch reviews.
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems.
  • Be part of an on-call rotation to support production systems.
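The first responsibility above touches on real-time monitoring and alerting for Kubernetes clusters. As a rough illustration only (not part of the posting itself), the sketch below uses the official Kubernetes Python client to flag nodes whose Ready condition is not True; the kubeconfig location, the choice of Python, and printing an alert instead of paging are all assumptions made for brevity.

    # Minimal, illustrative node-health check for a Kubernetes cluster.
    # Assumes the official "kubernetes" Python client and a reachable kubeconfig.
    from kubernetes import client, config

    def not_ready_nodes():
        # Load credentials from the local kubeconfig (use load_incluster_config()
        # when running inside the cluster).
        config.load_kube_config()
        v1 = client.CoreV1Api()
        flagged = []
        for node in v1.list_node().items:
            for cond in node.status.conditions or []:
                if cond.type == "Ready" and cond.status != "True":
                    flagged.append(node.metadata.name)
        return flagged

    if __name__ == "__main__":
        bad = not_ready_nodes()
        if bad:
            # A real SRE pipeline would route this to an alerting system, not stdout.
            print(f"ALERT: {len(bad)} node(s) not Ready: {', '.join(bad)}")
        else:
            print("All nodes report Ready.")

In practice a check like this would run continuously and feed a monitoring and alerting stack rather than being invoked by hand.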

Requirements

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
  • 5+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production.
  • Experience in one or more of the following programming languages: Python, Go, Perl, or Ruby.
  • In-depth knowledge of Linux, Networking, and Containers.

Nice-to-haves

  • Interest in crafting, analyzing, and fixing large-scale distributed systems.
  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
  • Ability to debug and optimize code and automate routine tasks.
  • Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

Benefits

  • Equity and benefits eligibility based on position and experience.