NVIDIA - Santa Clara, CA

posted 3 months ago

Full-time - Mid Level
Santa Clara, CA
5,001-10,000 employees
Computer and Electronic Product Manufacturing

About the position

NVIDIA is seeking an outstanding individual to join our platform SWQA team, where you will develop and execute test plans for the NVIDIA HGX/DGX/MGX platforms. This role involves working with the OS, firmware, and CUDA software stack, starting from design documentation. You will install and test various systems, including operating systems and system firmware, while ensuring the reliability and performance of our products. Your responsibilities will include driving support for root cause analysis on reliability and validation test failures and implementing mitigation strategies. You will build, develop, and debug both system-level and OS-level automation frameworks and tests. You will also review test results from partners and suppliers, prescribing additional reliability testing on components, systems, and packaging as necessary. Working within an agile software development team, you will uphold very high production quality standards and manage the bug lifecycle, collaborating across groups to drive solutions. This role is ideal for someone who thrives in a diverse work environment and possesses strong interpersonal skills, along with a commitment to continuous process improvement.

Responsibilities

  • Develop and execute test plans for NVIDIA HGX/DGX/MGX platforms across the OS, firmware, and CUDA software stack.
  • Install and test various systems, including operating systems and system firmware.
  • Drive support for root cause analysis on reliability and validation test failures.
  • Build, develop, and debug system-level and OS-level automation frameworks and tests.
  • Review partner and supplier test results and prescribe additional reliability testing as needed.
  • Work in an agile software development team with high production quality standards.
  • Manage the bug lifecycle and collaborate across groups to drive solutions.

Requirements

  • Bachelor's Degree (or equivalent experience) in a STEM field (Science, Technology, Engineering, Math, or Physics).
  • 5+ years of proven experience, or a Master's degree with 2+ years of meaningful work experience.
  • Proven experience in OS-level and server-level automation using Python, shell scripting, Ansible, Jenkins, C/C++, Java, and JavaScript.
  • Strong troubleshooting and debugging experience in various operating systems (Ubuntu, Red Hat, CentOS, SUSE, Fedora, Windows, etc.) in bare-metal and virtual environments.
  • Ability to write test plans focusing on functional, performance, stress, and negative testing.
  • Experience in developing CI/CD automation processes and contributing to DevOps with a passion for automation.
  • Good teamwork skills with the ability to work independently.
  • Strong experience in firmware, BMC/OpenBMC, network protocols, enterprise storage devices, PCIe buses, CPU and memory, ACPI, UEFI spec, and Redfish.

Nice-to-haves

  • Experience working with NVIDIA GPU hardware.
  • Solid understanding of virtualization in Linux (KVM, Docker orchestrated with Kubernetes).
  • Expertise in packaging software in Linux (RPM, deb).
  • Background in parallel programming, ideally with CUDA/OpenCL.

Benefits

  • Equity options
  • Comprehensive health benefits
  • Flexible work hours
  • Paid time off
  • Retirement savings plan