NVIDIA - Santa Clara, CA

posted 3 months ago

Full-time - Mid Level
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

NVIDIA is seeking a Senior Software QA Test Development Engineer to join our platform SWQA team. This role is pivotal in ensuring the reliability and performance of our cutting-edge HGX/DGX/MGX platforms. The successful candidate will develop and execute comprehensive test plans, derived from design documentation, that cover the OS, firmware, and CUDA software stack. The position requires a deep understanding of enterprise system integration, a strong background in operating systems, and experience in reliability testing using various telemetry methods. The ideal candidate will thrive in a diverse work environment, possess exceptional interpersonal skills, and demonstrate a commitment to continuous process improvement.

In this role, you will install and test various systems, including operating systems and firmware, and drive root cause analysis of reliability and validation test failures, identifying root causes and implementing effective mitigation strategies. You will also build and develop both front-end and back-end automation frameworks and tests at the system and OS levels. Collaboration is key: you will review partner and supplier test results and recommend additional reliability testing for components, systems, and packaging as necessary. Working within an agile software development team, you will uphold high production quality standards and manage the bug lifecycle, collaborating across groups to drive solutions.

This position is ideal for a dedicated, forward-thinking individual who is passionate about technology and eager to contribute to NVIDIA's mission as the leading AI computing company. If you are looking for a challenging and rewarding opportunity to work with some of the most experienced professionals in the industry, this role is for you.

Responsibilities

  • Develop and execute NVIDIA HGX/DGX/MGX platform test plans on OS, firmware, and CUDA software stack from design documentation.
  • Install and test various systems including operating systems, system firmware, and software stacks.
  • Drive root cause analysis of reliability and validation test failures and see mitigations through to completion.
  • Build, develop, and debug system and OS level automation front-end and back-end frameworks and tests.
  • Review partner and supplier test results and prescribe additional reliability testing for components, systems, and packaging as needed.
  • Work in an agile software development team with very high production quality standards.
  • Manage the bug lifecycle and collaborate across groups to drive solutions.

Requirements

  • Bachelor's Degree (or equivalent experience) in a STEM field (Science, Technology, Engineering, Math, or Physics).
  • 5+ years of proven experience, or a Master's Degree with 2+ years of meaningful work experience.
  • Proven OS- and server-level automation experience using Python, shell, Ansible, Jenkins, C/C++, Java, JavaScript.
  • Strong OS troubleshooting and debugging experience in bare-metal and KVM/VMware/Hyper-V environments.
  • Ability to write test plans focusing on functional, performance, stress, and negative testing.
  • Experience in developing CI/CD automation processes and DevOps contributions with a real passion for automation.
  • Strong teamwork skills and the ability to work independently.
  • Strong experience in firmware, BMC/OpenBMC, network protocols, internal/external enterprise storage devices, PCIe buses and devices, IO sub-devices, CPU and memory, ACPI, UEFI spec, Redfish.

Nice-to-haves

  • Experience working with NVIDIA GPU hardware.
  • Solid understanding of virtualization in Linux (KVM, Docker orchestrated with Kubernetes).
  • Expertise in packaging software in Linux (RPM, deb).
  • Background in parallel programming, ideally CUDA/OpenCL.

Benefits

  • Equity and benefits package.
  • Competitive salary based on location and experience.