AMD - Austin, TX

posted 5 months ago

Full-time - Mid Level
Austin, TX
Computer and Electronic Product Manufacturing

About the position

At AMD, we are committed to transforming lives through our technology, and we are looking for a Solutions Architect with expertise in designing and building Clustered Systems. This role is pivotal in supporting the design and deployment of cutting-edge AI/ML training and inferencing systems. The ideal candidate will provide insights on at-scale system design and tuning mechanisms for large-scale compute runs, working with the latest accelerated computing and deep learning platforms. You will collaborate cross-functionally with various organizations within AMD and with customers to ensure successful and seamless deployments.

As a Solutions Architect, you will be responsible for providing solutions to deploy large-scale clustered systems, ensuring strong technical relationships with both internal and external engineering teams. You will also be tasked with developing essential collateral such as white papers, guides, presentations, and test data to facilitate effective communication regarding the deployment and scaling of clustered systems. Your role will involve solving complex problems related to multi-site deployments of AMD products and partnering with OEM partners, AMD Engineering, Product, and Sales teams to secure design wins for customers. Additionally, you will enable the development and growth of AMD product features through customer feedback and deployment evaluations.

Responsibilities

  • Provide solutions to deploy large-scale clustered systems.
  • Collaborate with cross-functional teams of customers, external partners, and internal groups from concept through prototype to deployment.
  • Solve complex problems involving multi-site deployments of AMD products.
  • Partner with OEM partners, AMD Engineering, Product, and Sales teams to secure design wins for customers.
  • Enable development and growth of AMD product features through customer feedback and deployment evaluations.

Requirements

  • 5+ years of experience in accelerated computing for datacenter/HPC solutions or related experience.
  • Strong background in performance analysis, system profiling, and high-performance computing.
  • Deep understanding of dense data center design and architecture including compute, storage, networking, cloud APIs, and IaaS.
  • Experience conducting system profiling and performance analysis with tools such as perftest and rccl_test to ensure systems operate at peak efficiency.
  • Solid understanding of accelerated computing scheduling and I/O stacks.
  • Experience with modern automation, development, and resource management tooling such as Ansible, Git, containers (Docker), and Kubernetes.
  • Knowledge of container networking, particularly Kubernetes, and experience with DevOps practices.
  • Proficiency in Linux-based networking technologies and protocols such as RDMA, RoCE, CNI-based container networking, InfiniBand, Ethernet, and NVLink, and familiarity with various network topologies, routing protocols, and network security practices.
  • Clear verbal and written communication skills, capable of effectively teaching others and contributing to a team's success through collaboration and open information sharing.

Nice-to-haves

  • A networker who collaborates with both intra-team and inter-team members, promotes knowledge sharing, and can turn that knowledge into standard operating procedures.
  • Skilled in the development of SOPs and team knowledge base management.
  • Experience working with engineering or research communities supporting high-performance computing or deep learning.

Benefits

  • Base pay depending on skills, qualifications, experience, and location.
  • Eligibility for incentives such as annual bonuses or sales incentives.
  • Opportunity to own shares of AMD stock and discounts on AMD stock through the Employee Stock Purchase Plan.
  • Competitive benefits package.