Nvidia - Santa Clara, CA

posted 12 days ago

Full-time - Mid Level
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

The Site Reliability Engineering (SRE) position at NVIDIA focuses on designing, building, and maintaining large-scale production systems with high efficiency and availability. This role combines software and systems engineering practices to ensure maximum reliability and uptime of GPU cloud services, while enabling developers to implement changes effectively. SREs at NVIDIA are responsible for automating processes, optimizing performance, and fostering a culture of continuous improvement and collaboration.

Responsibilities

  • Design, implement and support operational and reliability aspects of large-scale Kubernetes clusters.
  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement.
  • Support services before they go live through system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews.
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
  • Scale systems sustainably through automation and push for changes that improve reliability and velocity.
  • Practice sustainable incident response and conduct blameless postmortems.
  • Participate in an on-call rotation to support production systems.

Requirements

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
  • 5+ years of experience in a relevant field.
  • Experience with infrastructure automation and distributed systems design.
  • Experience in designing and developing tools for running large-scale private or public cloud systems in production.
  • Proficiency in one or more programming languages such as Python, Go, Perl, or Ruby.
  • In-depth knowledge of Linux, networking, and containers.

Nice-to-haves

  • Interest in crafting, analyzing, and fixing large-scale distributed systems.
  • Systematic problem-solving approach with strong communication skills.
  • Ability to debug and optimize code and automate routine tasks.
  • Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

Benefits

  • Equity options
  • Comprehensive health benefits
  • Flexible work hours
  • Diversity and inclusion programs
  • Professional development opportunities