Senior Site Reliability Engineer

$164,000 - $327,750/Yr

Nvidia - Santa Clara, CA

posted 2 months ago

Full-time - Senior
Computer and Electronic Product Manufacturing

About the position

NVIDIA is seeking a Senior Site Reliability Engineer to join the Infrastructure, Planning and Process (IPP) team, a global organization within NVIDIA. The team works with groups across NVIDIA Software, including Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Driverless Cars, to meet their infrastructure needs. Its cloud services run nearly half a million automated jobs daily across thousands of servers, making NVIDIA's software engineers worldwide significantly more efficient. The cloud environment is diverse: it hosts machines and devices running Windows, Linux, and Android on a range of hardware platforms, including NVIDIA GPUs and Tegra Processors.

As a Senior Site Reliability Engineer, you will develop frameworks and scripts to automate workflows and deployments in a private cloud environment that supports numerous compute servers equipped with NVIDIA GPUs. A key focus will be building and stabilizing the virtualization infrastructure, which includes ESXi, KVM, and Hyper-V. You will deploy and maintain a large farm of machines using current Configuration Management and Infrastructure Automation tools such as Chef, Ansible, and Terraform. You will also develop comprehensive monitoring systems that give a fast, reliable, real-time overview of the infrastructure subsystems, using tools like Zabbix, BigPanda, and Grafana.

The role includes on-call and rotational L1 support for continuous monitoring and remediation of the infrastructure, using PagerDuty. You will tackle complex challenges in infrastructure scaling and capacity planning, and debug operating system, networking, configuration, and performance issues. You will also assist in the rollout and deployment of new development features that support the latest NVIDIA hardware and technologies.

Responsibilities

  • Develop frameworks and scripts to automate workflows and deployments in a private cloud environment.
  • Build and stabilize virtualization infrastructure including ESXi, KVM, and Hyper-V.
  • Deploy and maintain a large farm of machines using Configuration Management & Infrastructure Automation tools (Chef, Ansible, Terraform).
  • Develop extensive monitoring systems for real-time infrastructure oversight (Zabbix, BigPanda, Grafana).
  • Participate in on-call & rotational L1 support for infrastructure monitoring and remediation (PagerDuty).
  • Address complex problems related to infrastructure scaling and capacity planning.
  • Analyze and debug operating system, networking, configuration, and performance issues.
  • Assist in the rollout and deployment of new development features supporting NVIDIA hardware and technologies.

Requirements

  • Bachelor's or Master's Degree in Computer Science or Software Engineering, or equivalent experience.
  • Proven experience working in large-scale enterprise production systems.
  • 6+ years of professional experience required.
  • Ability to debug and analyze source code to triage, root cause, and resolve infrastructure issues.
  • Experience working closely with platform engineering teams to understand hardware setups.
  • Familiarity with maintenance and setup of Linux and Windows hosts.
  • Hands-on coding experience with Python or Go.
  • Proficiency in Unix shell scripting.
  • Knowledge of Java and C programming languages.
  • Experience with version control systems like Perforce and Git.

Nice-to-haves

  • Experience with VM and hardware virtualization technologies like VMware, KVM, Hyper-V, Docker, and Kubernetes.
  • Background in automating bare metal and VM provisioning.
  • Experience supporting GPUs, embedded device development, driver development, and CUDA/TensorRT applications.
  • Development experience in Chef, Ansible, and infrastructure orchestration.

Benefits

  • Equity and benefits package based on location and experience.
  • Ongoing application acceptance for diverse candidates.