AMD - Santa Clara, CA


Full-time - Senior
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

The DC GPU Fellow - AI Enablement - Application Performance Tuning is a leadership role focused on optimizing the deployment and operational capabilities of AMD's Instinct-based GPU systems in data center environments. This position requires extensive experience in network architecture, AI/ML deployments, and performance tuning, and is aimed at ensuring robust performance and an efficient transition from production qualification to large-scale deployment.

Responsibilities

  • Lead system-level triage and at-scale debugging of complex issues across hardware, firmware, and software, ensuring rapid resolution and system reliability.
  • Drive the ramp of Instinct-based, large-scale AI datacenter infrastructure on NPI platform hardware with ROCm, scaling up to the pod and cluster level.
  • Enhance tools and methodologies for large-scale deployments to meet customer uptime goals and exceed performance expectations.
  • Provide second-level support and maintenance for ROCm and its integration with third-party tools across AI and HPC ecosystems.
  • Engage with clients to deeply understand their technical needs, ensuring their satisfaction with tailored solutions.
  • Port and optimize a variety of machine learning based applications for AMD CPU and GPU systems.
  • Provide domain-specific knowledge to other groups at AMD, sharing lessons learned to drive continuous improvement.
  • Engage with AMD product groups to drive resolution of application and customer issues.
  • Develop and present training materials to internal audiences, at customer venues, and at industry conferences.

Requirements

  • Expertise in networking and performance optimization for large-scale AI/ML networks.
  • Demonstrated leadership in network architecture with hands-on experience in NVMe-oF, NVMe-TCP, InfiniBand, RoCEv2, and storage architecture.
  • Proven ability to influence design and technology roadmaps with a deep understanding of datacenter products and market trends.
  • Extensive system, storage, and network software development/deployment expertise with a track record of delivering large projects on time.
  • Direct co-development/deployment experience with strategic customers/partners.
  • Proven leadership in engaging customers with diverse technical disciplines.
  • Extensive experience and mastery of Linux, Python, and Ansible; C++ preferred.
  • Working experience with distributed pre-training, fine-tuning, and inference.
  • Familiarity with orchestrators/resource managers such as Slurm and Kubernetes (k8s).
  • Broad experience creating, adapting, and running workloads with widely used AI applications.
  • Strong system-level performance analysis skills for both CPUs and GPUs.
  • Excellent communication skills across audiences ranging from engineers to mid-management to the C-level.
  • Thought leader with patents or publications in relevant fields.
  • In-depth HPC, AI/ML experience.
  • Experience working with large customers such as Cloud Service Providers.
  • Ability to work well in a geographically dispersed team.
  • Certifications in Networking, Storage, AI/ML, or Cloud Technologies.

Nice-to-haves

  • Experience with large-scale AI/ML network deployments.
  • Knowledge of performance tuning and scalability improvements in AI/ML ecosystems.
  • Experience in benchmarking machine learning applications.

Benefits

  • Base pay dependent on skills and experience.
  • Eligibility for annual bonuses or sales incentives.
  • Opportunity to own shares of AMD stock through the Employee Stock Purchase Plan.
  • Competitive benefits package.