NVIDIA - Santa Clara, CA
posted 3 months ago
NVIDIA is seeking a Senior Site Reliability Engineer (SRE) to join its GeForce NOW (GFN) team. SREs at NVIDIA ensure that both internal and external GPU cloud gaming services maintain the reliability and uptime that users expect. The work involves enabling developers to implement changes to existing systems through careful preparation and planning, while monitoring capacity, latency, and performance. SREs are expected to understand how the various systems interconnect and to apply a wide range of tools and methodologies to complex problems.

The individual in this role will focus on Service Response and Workflows, driving the development of tools and services that uphold and improve service-level objectives (SLOs). Collaboration with Service Owners is essential to ensure the reliability of the GFN service, a key player in the rapidly evolving game streaming industry.

In this role, you will build tools that improve SRE observability and take part in the Kubernetes migration, including VMI setup and problem-solving. You will need to quickly debug and triage incidents and user-reported issues. A significant part of the job is automation: writing new scripts and tooling and improving existing ones, with the goal of automating 100% of daily tasks. You will also support services before launch through system design consulting, software platform development, capacity management, and launch reviews. Participation in an on-call rotation supporting production systems is required.