Senior Site Reliability Engineer - Observability and Telemetry Platform

$148,000 - $419,750/Yr

NVIDIA - Santa Clara, CA

posted 10 days ago

Full-time - Senior

Computer and Electronic Product Manufacturing

About the position

The Senior Site Reliability Engineer (SRE) at NVIDIA designs, builds, and maintains large-scale production systems with high efficiency and availability. The role centers on observability and telemetry as the means of ensuring the reliability and uptime of GPU cloud services. SREs at NVIDIA combine software and systems engineering practices to automate processes, optimize performance, and improve system reliability while fostering a culture of diversity and collaboration.

Responsibilities

Design, implement, and support the operational and reliability aspects of a large-scale observability and telemetry collection platform.
Engage in and improve the whole lifecycle of services from inception and design through deployment, operation and refinement.
Support services before they go live through system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems.
Be part of an on-call rotation to support production systems.

Requirements

BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
5+ years of experience with infrastructure automation and distributed systems design.
5+ years of experience delivering foundational infrastructure and observability platforms.
Experience in one or more of the following: Python, Go, Perl or Ruby.
In-depth knowledge of Linux, networking, and containers.

Nice-to-haves

Interest in crafting, analyzing and fixing large-scale distributed systems.
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
Ability to debug and optimize code and automate routine tasks.
Experience using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.
Experience running Grafana, OpenTelemetry, Prometheus, and similar observability-focused tools.

Benefits

Equity and benefits eligibility.
