AMD - Bellevue, WA
posted about 2 months ago
As a Principal Machine Learning Software Engineer at AMD, you will be at the forefront of transforming lives through advanced technology. Your primary focus will be low-level performance optimization of AMD-based machine learning infrastructure. This role is pivotal to the efficient deployment of state-of-the-art large models, which serve applications in data centers, artificial intelligence, gaming, and embedded systems. You will join a dynamic team dedicated to groundbreaking projects that push the limits of innovation and execution excellence.

In this position, you will optimize model execution, particularly GPU kernels, for both inference and training in multi-GPU and multi-node environments. Your work will directly influence AMD's ability to deliver cutting-edge AI solutions efficiently and at scale.

Your responsibilities will include developing and optimizing low-level GPU kernels to accelerate large machine learning models, maximizing computational efficiency and reducing execution time while maintaining model accuracy. You will also design and implement strategies for distributed model training and inference across multiple GPUs and nodes, addressing challenges in data and model parallelism. Performance profiling will be a key part of the role: you will analyze system and application performance to identify bottlenecks and improve hardware resource utilization. In addition, you will explore model quantization techniques to reduce memory and computation overhead, particularly for edge and cloud deployments.

You will collaborate with machine learning researchers, software engineers, and infrastructure teams to integrate optimized kernels into production systems, and you will write detailed documentation of your optimizations and best practices.