NVIDIA - Santa Clara, CA
posted 4 months ago
NVIDIA is looking for a seasoned Site Reliability Engineer (SRE) to join its multifaceted and fast-paced Infrastructure, Planning and Processes organization. As a Senior SRE, you will be part of a dynamic team that develops and maintains NVIDIA's internal cloud provisioning product for GPUs and Tegra systems. The position involves collaboration with business units across NVIDIA Software, including Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Driverless Cars, to address their infrastructure and systems needs. You will work closely with software engineering teams to deploy new products and manage our infrastructure, associated processes, and systems. Attention to detail, problem-solving abilities, and a solid knowledge base are essential for success in this role.

Your responsibilities will include Kubernetes system administration for DevOps and CI/CD, designing and implementing clusters and cluster segmentation, and managing internal and external networking for multiple clusters and environments. You will monitor system performance and troubleshoot issues related to CPU, memory, disk, and network utilization. You will also architect CI/CD pipelines for container build and deployment, develop tools for automating workflows, and maintain our infrastructure codebase. In addition, you will craft and implement critical metrics using various analytics methods and dashboards, participate in prototyping and developing cloud infrastructure for NVIDIA, and leverage AI techniques to extract useful signals from the data generated by machines and jobs.