Adobe - Seattle, WA

posted 2 months ago

Full-time - Mid Level
Seattle, WA
Publishing Industries

About the position

The position involves designing, developing, and maintaining robust AI/ML infrastructure solutions to support the training and deployment of large-scale AI models at Adobe. This role is strategic and highly visible, aimed at evolving Adobe's Firefly generative AI models, which are set to revolutionize how creatives work. The successful candidate will collaborate with machine learning researchers and engineers to enhance the performance and scalability of AI models, ensuring efficient resource utilization and driving innovation in infrastructure practices.

Responsibilities

  • Design, develop, and maintain robust AI/ML infrastructure solutions using Kubernetes and Python on AWS cloud.
  • Implement and optimize distributed training frameworks leveraging GPUs to improve performance and scalability.
  • Improve resiliency, elasticity, and data loading, and support GPU optimization methods such as FP8, FSDP, and model parallelism.
  • Write high-quality, product-level code that is easy to maintain and test following standard methodologies.
  • Collaborate closely with ML Researchers and Machine Learning Engineers to accelerate the training of cutting-edge ML models.
  • Keep track of the latest innovations in academia and the open-source community to drive rapid adoption of pioneering technologies.
  • Help train better models by improving orchestration and scheduling, scaling the number of jobs, and enabling faster experimentation with AutoML.
  • Collaborate with data scientists and ML researchers to streamline the model training pipeline and ensure efficient resource utilization.
  • Drive innovation in infrastructure practices to support pioneering machine learning research and development.

Requirements

  • PhD or Master's in computer science or a related field and 5+ years of relevant industry experience.
  • Proven proficiency with Python and developing systems, frameworks, and SDKs.
  • Experience with infrastructure and understanding of model serving, training, orchestration, and management of GPU resources.
  • Experience with machine learning and distributed PyTorch.
  • Strong critical thinking, analytical and quantitative problem-solving ability.
  • Excellent communication and relationship skills, and the ability to work as a strong teammate.

Nice-to-haves

  • Experience with Kubeflow, MLflow, Ray, SageMaker, or similar.
  • Experience with NVIDIA HPC.
  • Experience with PyTorch Distributed, MPI, Megatron, Horovod, and other AI training frameworks.

Benefits

  • Competitive salary range of $170,900 to $325,200 annually based on location and experience.
  • Short-term incentives in the form of the Annual Incentive Plan (AIP).
  • Potential eligibility for long-term incentives in the form of a new hire equity award.