Nvidia - Santa Clara, CA

posted 2 months ago

Full-time - Senior
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

NVIDIA is at the forefront of developments in Artificial Intelligence, High-Performance Computing (HPC), and Visualization. As a Senior Site Reliability Engineer focused on HPC storage, you will play a crucial role in designing, implementing, and optimizing on-prem HPC storage solutions while leveraging cloud computing. You will craft and deploy distributed storage solutions, build automation tools, and keep our growing IT ecosystem running efficiently. You will work closely with engineering teams to align infrastructure with their evolving needs, document best practices, and contribute to the success of groundbreaking projects.

In this role, you will design and implement an on-prem HPC infrastructure, supplemented with cloud computing, to support the growing IT needs of NVIDIA. You will also design and implement scalable and efficient storage solutions tailored for data-intensive applications, optimizing both performance and cost-effectiveness. Additionally, you will develop tooling to automate the deployment and management of large-scale infrastructure environments, automate operational monitoring and alerting, and enable self-service consumption of resources. You will document general procedures and practices, perform technology evaluations related to distributed file systems, and collaborate across teams to understand developers' workflows and gather their infrastructure requirements. Your influence will guide methodologies for building, testing, and deploying applications to ensure optimal performance and resource utilization.

This position requires a strong background in technology solutions and performance optimization for HPC applications, as well as experience with parallel or distributed filesystems, enterprise NAS solutions, and large-scale on-prem object storage clusters. You will also need programming experience in Python or Golang and strong operational experience in a leading cloud environment such as AWS, Azure, or GCP.

Responsibilities

  • Design and implement an on-prem HPC infrastructure supplemented with cloud computing to support the growing IT needs of NVIDIA.
  • Design and implement scalable and efficient storage solutions tailored for data-intensive applications, optimizing performance and cost-effectiveness.
  • Develop tooling to automate the deployment and management of large-scale infrastructure environments, automate operational monitoring and alerting, and enable self-service consumption of resources.
  • Document general procedures and practices, and perform technology evaluations related to distributed file systems.
  • Collaborate across teams to understand developers' workflows and gather their infrastructure requirements.
  • Influence and guide methodologies for building, testing, and deploying applications to ensure optimal performance and resource utilization.

Requirements

  • BS in Computer Science (or equivalent experience) with 8+ years of relevant experience, MS with 5+ years of experience, or Ph.D. with 3 years of experience.
  • 8+ years of experience crafting technology solutions and resolving performance bottlenecks for HPC applications.
  • Experience with one or more parallel or distributed filesystems, such as Lustre or GPFS, is a must.
  • Experience with the design, deployment, and management of enterprise NAS solutions such as NetApp or Pure Storage.
  • Experience in designing and managing large-scale on-prem object storage clusters.
  • Programming or scripting experience in Python or Golang is a must.
  • Strong experience operating services in any of the leading cloud environments (AWS, Azure, or GCP).
  • Excellent communication and collaboration skills.

Nice-to-haves

  • Background with RDMA (InfiniBand or RoCE) fabrics.
  • Experience with multiple monitoring stacks such as Prometheus+Grafana, Elasticsearch+Kibana, Splunk, Zabbix, etc.
  • Familiarity with newer and emerging monitoring products.
  • Prior experience with HPC cluster management tools such as Slurm, PBS, LSF, etc.
  • Experience with containerization technologies such as Docker, Mesosphere DC/OS, or Kubernetes (k8s).

Benefits

  • Equity and benefits package based on location and experience.
  • Ongoing applications accepted for diverse candidates.