AMD - Santa Clara, CA


Full-time - Senior
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

The DC GPU Fellow - AI Enablement - Application Performance Tuning is a leadership role focused on optimizing the deployment and operational capabilities of AMD's Instinct-based GPU systems in data center environments. This position requires extensive experience in network architecture, AI/ML deployments, and performance tuning, and is aimed at ensuring robust performance and an efficient transition from production qualification to large-scale deployment.

Responsibilities

  • Lead system-level triage and at-scale debugging of complex issues across hardware, firmware, and software, ensuring rapid resolution and system reliability.
  • Drive the ramp of Instinct-based, large-scale AI datacenter infrastructure on NPI platform hardware with ROCm, scaling up to the pod and cluster level.
  • Enhance tools and methodologies for large-scale deployments to meet customer uptime goals and exceed performance expectations.
  • Provide second-level support and maintenance for ROCm and its integration with third-party tools across AI and HPC ecosystems.
  • Engage with clients to deeply understand their technical needs, ensuring their satisfaction with tailored solutions.
  • Port and optimize a variety of machine learning based applications for AMD CPU and GPU systems.
  • Provide domain-specific knowledge to other groups at AMD, sharing lessons learned to drive continuous improvement.
  • Engage with AMD product groups to drive resolution of application and customer issues.
  • Develop and present training materials to internal audiences, at customer venues, and at industry conferences.

Requirements

  • Expertise in networking and performance optimization for large-scale AI/ML networks.
  • Demonstrated leadership in network architecture with hands-on experience in NVMe-oF, NVMe-TCP, InfiniBand, RoCEv2, and storage architecture.
  • Proven ability to influence design and technology roadmaps with a deep understanding of datacenter products and market trends.
  • Extensive system, storage, and network software development/deployment expertise with a track record of delivering large projects on time.
  • Direct co-development/deployment experience with strategic customers/partners.
  • Proven leadership in engaging customers with diverse technical disciplines.
  • Extensive experience and mastery of Linux, Python, and Ansible; C++ preferred.
  • Working experience with distributed pre-training, fine-tuning, and inference.
  • Familiarity with orchestrators/resource managers such as Slurm and Kubernetes (k8s).
  • Broad experience creating, adapting, and running workloads with widely used AI applications.
  • Strong system-level performance analysis skills for both CPUs and GPUs.
  • Excellent communication skills across audiences ranging from engineers to mid-management to the C-level.
  • Thought leader with patents or publications in relevant fields.
  • In-depth HPC, AI/ML experience.
  • Experience working with large customers such as Cloud Service Providers.
  • Ability to work well in a geographically dispersed team.
  • Certifications in Networking, Storage, AI/ML, or Cloud Technologies.

Nice-to-haves

  • Experience with large-scale AI/ML network deployments.
  • Knowledge of performance tuning and scalability improvements in AI/ML ecosystems.
  • Experience in benchmarking machine learning applications.

Benefits

  • Base pay dependent on skills and experience.
  • Eligibility for annual bonuses or sales incentives.
  • Opportunity to own shares of AMD stock through the Employee Stock Purchase Plan.
  • Competitive benefits package.