Senior Site Reliability Engineer - DGX Cloud

$148,000 - $276,000/Yr

Nvidia - Santa Clara, CA

posted 4 months ago

Full-time - Mid Level

5,001-10,000 employees

Computer and Electronic Product Manufacturing

About the position

Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline focused on designing, building, and maintaining large-scale production systems with high efficiency and availability. This role combines software and systems engineering practices to ensure that our internal and external GPU cloud services operate with maximum reliability and uptime. SRE at NVIDIA requires a deep understanding of various systems, networking, coding, database management, capacity management, continuous delivery, and deployment, as well as familiarity with open-source cloud technologies like Kubernetes and OpenStack.

The SRE team is responsible for the overall health of our services, ensuring that they run smoothly and efficiently. This involves careful preparation and planning to enable developers to make changes to existing systems while monitoring capacity, latency, and performance. SRE is not just a role but a mindset that emphasizes running better production systems through optimization and automation. Our software development efforts are geared towards eliminating manual work, enhancing performance, and increasing the efficiency of production systems.

SREs at NVIDIA are tasked with understanding the big picture of how our systems interrelate, employing a variety of tools and approaches to address a wide range of challenges. We prioritize practices such as minimizing reactive operational work, conducting blameless postmortems, and proactively identifying potential outages, all of which contribute to iterative improvements in product quality and create a dynamic work environment.

Our culture values diversity, intellectual curiosity, problem-solving, and openness, fostering collaboration and encouraging team members to think big and take risks in a supportive, blame-free atmosphere. We aim to create an environment that promotes self-direction on meaningful projects while providing the necessary support and mentorship for personal and professional growth.

Responsibilities

Design, implement, and support operational and reliability aspects of large-scale Kubernetes clusters with a focus on performance at scale, real-time monitoring, logging, and alerting.
Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation, and refinement.
Support services before they go live through activities such as system design consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews.
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
Practice sustainable incident response and conduct blameless postmortems.
Be part of an on-call rotation to support production systems.

Requirements

BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
5+ years of experience in a relevant field.
Experience with infrastructure automation and distributed systems design, including the design and development of tools for running large-scale private or public cloud systems in production.
Proficiency in one or more programming languages such as Python, Go, Perl, or Ruby.
In-depth knowledge of Linux, networking, and containers.

Nice-to-haves

Interest in crafting, analyzing, and fixing large-scale distributed systems.
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
Ability to debug and optimize code and automate routine tasks.
Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

Benefits

Equity options
Comprehensive health insurance
Retirement savings plan
Paid time off
Flexible working hours
Professional development opportunities
