NVIDIA - Santa Clara, CA
posted 2 months ago
NVIDIA is seeking a Senior Site Reliability Engineer to join the Infrastructure, Planning and Process (IPP) team, a global organization within NVIDIA. The team collaborates with groups across NVIDIA Software, including Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Driverless Cars, to address their infrastructure needs. Its cloud services run nearly half a million automated jobs daily across thousands of servers, significantly enhancing the efficiency of NVIDIA's software engineers worldwide. The cloud environment is diverse, hosting a mix of machines and devices running operating systems such as Windows, Linux, and Android, on hardware platforms that include NVIDIA GPUs and Tegra processors.
As a Senior Site Reliability Engineer, you will develop frameworks and scripts to automate workflows and deployments within a private cloud environment that supports numerous compute servers equipped with NVIDIA GPUs. A key focus will be building and stabilizing the virtualization infrastructure, which includes ESXi, KVM, and Hyper-V. You will deploy and maintain a large farm of machines using the latest configuration management and infrastructure automation tools, such as Chef, Ansible, and Terraform. You will also develop comprehensive monitoring systems that give a fast, reliable, real-time overview of the infrastructure subsystems, using tools like Zabbix, BigPanda, and Grafana.
You will participate in on-call and rotational L1 support for continuous monitoring and remediation of the infrastructure, using PagerDuty. You will tackle complex challenges in infrastructure scaling and capacity planning, and debug operating system, networking, configuration, and performance issues. You will also assist in the rollout and deployment of new development features that support the latest NVIDIA hardware and technologies.