Senior Site Reliability Engineer - DGX Cloud

$148,000 - $276,000/Yr

NVIDIA - Santa Clara, CA

posted 12 days ago

Full-time - Mid Level

Computer and Electronic Product Manufacturing

About the position

The Site Reliability Engineering (SRE) position at NVIDIA focuses on designing, building, and maintaining large-scale production systems with high efficiency and availability. This role combines software and systems engineering practices to ensure maximum reliability and uptime of GPU cloud services, while enabling developers to implement changes effectively. SREs at NVIDIA are responsible for automating processes, optimizing performance, and fostering a culture of continuous improvement and collaboration.

Responsibilities

Design, implement and support operational and reliability aspects of large-scale Kubernetes clusters.
Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement.
Support services before they go live through system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Scale systems sustainably through automation and push for changes that improve reliability and velocity.
Practice sustainable incident response and conduct blameless postmortems.
Participate in an on-call rotation to support production systems.

Requirements

BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
5+ years of experience in a relevant field.
Experience with infrastructure automation and distributed systems design.
Experience designing and developing tools for running large-scale private or public cloud systems in production.
Proficiency in one or more programming languages such as Python, Go, Perl, or Ruby.
In-depth knowledge of Linux, networking, and containers.

Nice-to-haves

Interest in crafting, analyzing, and fixing large-scale distributed systems.
Systematic problem-solving approach with strong communication skills.
Ability to debug and optimize code and automate routine tasks.
Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

Benefits

Equity options
Comprehensive health benefits
Flexible work hours
Diversity and inclusion programs
Professional development opportunities
