NVIDIA - Santa Clara, CA
posted 2 months ago
NVIDIA is at the forefront of developments in Artificial Intelligence, High-Performance Computing (HPC), and Visualization. As a Senior Site Reliability Engineer focused on HPC storage, you will play a central role in designing, implementing, and optimizing on-prem HPC storage solutions, supplemented with cloud computing, to support NVIDIA's growing IT needs. You will work closely with engineering teams to align infrastructure with their evolving needs, document best practices, and contribute to the success of groundbreaking projects.

In this role, you will design and deploy scalable, efficient distributed storage solutions tailored for data-intensive applications, optimizing both performance and cost-effectiveness. You will develop tooling to automate the deployment and management of large-scale infrastructure environments, automate operational monitoring and alerting, and enable self-service consumption of resources. You will also document general procedures and practices, perform technology evaluations related to distributed file systems, and collaborate across teams to understand developers' workflows and gather their infrastructure requirements. Your influence will guide methodologies for building, testing, and deploying applications to ensure optimal performance and resource utilization.

This position requires a strong background in technology solutions and performance optimization for HPC applications, as well as experience with parallel or distributed filesystems, enterprise NAS solutions, and large-scale on-prem object storage clusters. You will also need programming experience in Python or Golang and strong operational experience in a leading cloud environment such as AWS, Azure, or GCP.