Nvidia - Santa Clara, CA

posted 8 days ago

Full-time - Mid Level
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

As a Software Engineer in Data Center Rack and Power Management at NVIDIA, you will play a crucial role in designing and implementing innovative rack-level solutions for next-generation AI supercomputing platforms. This position focuses on driving power management solutions to scale AI infrastructure, collaborating with various stakeholders to ensure high-quality product delivery, and contributing to all phases of product development.

Responsibilities

  • Drive next-generation power management solutions for scaling AI infrastructure built on NVIDIA GPU and CPU solutions.
  • Collaborate with customers, product management, and architects to accurately define requirements and deliver high-quality products on accelerated schedules.
  • Develop architecture for power management at the server and rack levels, optimizing power consumption at the data center level.
  • Produce detailed architecture specifications and validate through POCs.
  • Educate partners on product architecture and incorporate their feedback.
  • Coordinate the development of comprehensive architecture specs and design documents.
  • Lead all aspects of product delivery by collaborating across teams.
  • Conduct code reviews, improve unit testing, and ensure a robust test plan is in place.
  • Support QA teams throughout the product life cycle to ensure successful implementation.
  • Effectively use Jira and other tools to articulate requirements and carry out plans.
  • Contribute to all phases of product development, from definition and design to implementation, debugging, testing, and early customer support.

Requirements

  • BS, MS, or PhD in EE/CS or a related field (or equivalent experience).
  • Minimum of 8 years of experience in building rack or server management solutions.
  • Experience evaluating power usage at the component level and reducing power consumption in server systems.
  • Understanding of power metrics retrieval from devices.
  • Expertise in firmware architecture and optimizing firmware for low-latency APIs.
  • Strong, proven skills in C/C++ and Python.
  • Proficient programming and debugging skills for server platforms.
  • Experience with SCM tools (e.g., Git, Perforce) and project management tools like Jira.
  • Excellent written and oral communication skills, a strong work ethic, and a strong sense of teamwork.
  • A self-starter who finds creative solutions to complex problems and is hands-on with coding.

Nice-to-haves

  • Proven track record of improving perf/watt or TCO/watt for data centers.
  • Experience developing OpenBMC solutions, ideally with commits upstreamed to the open-source repository.
  • Active OCP and DMTF contributor in relevant areas, with hands-on experience in x86 or ARM system architecture.

Benefits

  • Equity options
  • Comprehensive health insurance
  • Retirement savings plan
  • Paid time off
  • Flexible work hours
  • Professional development opportunities