Snap Inc. - Fontana, CA

posted 8 days ago

Full-time - Principal
Fontana, CA
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

The Principal Machine Learning Engineer, ML Training Platform at Snap Inc. is responsible for designing, implementing, and scaling critical machine learning components and services that support the company's strategic initiatives. This role involves building a next-generation training framework for large-scale model training, optimizing model performance, and developing an AutoML platform to automate the machine learning lifecycle. The position requires collaborating across teams to meet product requirements and advocating for best practices in operational excellence and cost management.

Responsibilities

  • Design, implement, and scale critical machine learning components and services.
  • Build a next-generation training framework for large-scale model training.
  • Optimize training and model performance across various GPUs.
  • Develop an AutoML platform to accelerate model generation and automate the machine learning model lifecycle.
  • Work across teams to understand product requirements and deliver necessary solutions.
  • Advocate for best practices in availability, scalability, operational excellence, and cost management.
  • Provide technical direction that influences the entire company.

Requirements

  • BS in a technical field such as computer science, mathematics, or statistics, or equivalent experience.
  • 14+ years of industry machine learning experience.
  • Experience with GPU/TPU training and optimizations.
  • Strong understanding of machine learning approaches and algorithms.
  • Excellent programming and software design skills, including debugging and performance analysis.

Nice-to-haves

  • Master's/PhD in a technical field such as computer science.
  • Experience leading teams and driving technical roadmaps.
  • Experience with machine learning, recommendation and ranking systems, or vector similarity search.
  • Experience with TensorFlow, PyTorch, or related deep learning frameworks.
  • Experience with Docker, Kubernetes, Ray, NoSQL solutions, Memcache/Redis, Google/AWS services.
  • Experience with MLOps and managing the production machine learning lifecycle.

Benefits

  • Paid parental leave
  • Comprehensive medical coverage
  • Emotional and mental health support programs
  • Compensation packages that include equity in the form of RSUs