Nvidia - Durham, NC

posted 3 months ago

Full-time - Mid Level
Durham, NC
Computer and Electronic Product Manufacturing

About the position

As an SRE focused on metrics reporting at Nvidia, you will collaborate with cross-functional teams, including software engineers, data scientists, and operations personnel. Your primary responsibility will be to monitor and optimize our systems by collecting, analyzing, and presenting key performance indicators (KPIs) that drive operational excellence and inform strategic decisions. This position is integral to enhancing the use of our AI/ML and chip development infrastructure so that our engineering teams can develop at unprecedented speed.

In this role, you will be involved in the full tool life-cycle, from development and testing through deployment. You will work within a diverse team to provide operational and strategic metrics that empower engineers to improve productivity and efficiency. A significant aspect of your work will be to continuously improve our chip development process through better observability, directly contributing to overall quality and reducing the time to market for our next-generation chips.

Your contributions will affect not only the immediate team but also Nvidia's broader mission to amplify human creativity and intelligence through innovative technology. This is an opportunity to be part of a company at the forefront of AI and accelerated computing, tackling challenges that matter to the world.

Responsibilities

  • Develop, test, and deploy data collectors, pipelines, and services to enhance the use of our AI/ML and chip development infrastructure.
  • Participate in the full life-cycle of tool development, testing, and deployment.
  • Work in a diverse team to provide operational and strategic metrics that empower engineers to develop at high speeds.
  • Continuously improve the chip development process through better observability.
  • Directly contribute to the overall quality and improve time to market for next-generation chips.

Requirements

  • Experience in applying data analysis principles and influencing data-driven decisions.
  • Experience with turning raw data into actionable reports.
  • Hands-on experience with data processing and observability platforms such as Apache Spark, Elasticsearch/OpenSearch, Grafana, Prometheus, and other similar open-source tools.
  • Expert-level Python programming experience, including working with APIs.
  • Extensive experience with CI/CD tools such as Jenkins and/or GitLab.
  • Passion for improving the productivity of others.
  • Excellent planning and interpersonal skills.
  • Flexibility/adaptability working in a dynamic environment with changing requirements.
  • MS (preferred) or BS in Computer Science, Electrical Engineering, or related field or equivalent experience.
  • 5+ years of relevant experience.

Nice-to-haves

  • Hands-on experience running GPU-based workloads in a batch computing environment.
  • Passion for gathering and visualizing metrics and data.
  • Experience with chip design workflows, such as front end verification, back end workflows, or mixed signal workflows.
  • Experience with job schedulers (in particular IBM Spectrum LSF and/or Slurm).
  • Mastery of distributed system principles.

Benefits

  • Highly competitive salaries
  • Comprehensive benefits package
  • Equity eligibility
  • Ongoing application acceptance
  • Diversity and inclusion commitment