NVIDIA - Santa Clara, CA

posted 3 months ago

Full-time - Senior
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

NVIDIA is seeking a Senior Site Reliability Engineer (SRE) to join its GeForce NOW (GFN) team. SREs at NVIDIA ensure that internal and external GPU cloud gaming services maintain the reliability and uptime that users expect. The position involves enabling developers to change existing systems through careful preparation and planning, while monitoring capacity, latency, and performance. SREs at NVIDIA are responsible for understanding how systems interconnect, and they apply a wide range of tools and methodologies to complex challenges.

The individual in this role will focus on Service Response and Workflows, driving the development of tools and services that uphold and improve service SLOs. Collaboration with Service Owners is essential to ensure the reliability of the GFN service, a key player in the rapidly evolving game streaming industry.

In this role, you will build tools that improve SRE observability and take part in the Kubernetes migration, including VMI setup and problem-solving. You will need to debug and triage incidents and user-reported issues quickly. A significant part of the job is automating daily tasks, writing new scripts and tooling and improving existing ones, with the goal of 100% automation. You will also support services before launch through system design consulting, software platform development, capacity management, and launch reviews. Participation in an on-call rotation to support production systems is required.

Responsibilities

  • Build tools to improve SRE observability.
  • Participate in the Kubernetes migration, including VMI setup and problem-solving.
  • Rapidly debug and triage incidents and user-reported issues.
  • Write new scripts and tooling, and improve existing ones, to achieve 100% automation of daily tasks.
  • Support services before they go live through system design consulting, software platform development, capacity management, and launch reviews.
  • Participate in an on-call rotation to support production systems.

Requirements

  • MS or BS in Computer Science/Engineering or a related field, or equivalent experience.
  • 8+ years of Site Reliability Engineering experience working on large-scale distributed microservices in a production environment.
  • Strong background in Kubernetes with the ability to understand complex and highly available VMI setups on Kubernetes.
  • Experience leading significant production improvements including change management, post-mortem reviews, and workflow processes.
  • Proficiency in designing and delivering software automation in various programming languages.
  • Strong problem-solving skills and the ability to root-cause issues while seeking optimization and efficiency.

Nice-to-haves

  • Previous experience with Datadog, Prometheus, Alertmanager, or similar monitoring systems.
  • Experience with Jenkins (or similar CI/CD) setup, configuration, and deployment.
  • Excellent communication, presentation, social, and analytical skills; ability to communicate complex concepts clearly across different audiences.
  • Experience with StackStorm, Prometheus, and Kubernetes is a bonus.
  • Prior experience as an SRE or in Service Engineering is a significant advantage.

Benefits

  • Equity options
  • Comprehensive health benefits
  • Flexible work hours
  • Opportunities for professional development
  • Diversity and inclusion programs