Twelve Labs · posted 4 months ago
Full-time • Senior
San Francisco, CA

About the position

As a Machine Learning Engineer, Distributed Training Infrastructure, you will be responsible for ensuring that compute performance and ease of use never delay our research timeline. You will own strategy and implementation for all compute and training infrastructure optimization, observability, scaling, and orchestration. You will collaborate closely with other engineers and scientists to define and implement your chosen roadmap. This role is a perfect fit for research-minded compute specialists who want to build SOTA video, vision, and video-language modeling systems!

Responsibilities

  • Own our compute strategy e2e
  • Partner with researchers to understand our future research roadmap and to identify scaling limitations which will most imminently block us from achieving our goals
  • Be a hands-on leader who is excited to debug perplexing node failures at odd hours
  • Mentor junior engineers/researchers, and hold a high bar around code quality / engineering best practices
  • Work across teams to understand and manage project priorities and product deliverables, evaluate trade-offs, and drive technical initiatives from ideation to execution to shipment

Requirements

  • 7+ years of industry experience
  • Experience owning large-scale distributed training efforts across thousands of accelerators
  • Experience with a wide range of HPC-related tools, and strong opinions about how we should build our stack
  • A passion for solving the most pressing technical challenges, as opposed to the most intellectually satisfying ones
  • Strong Python and infrastructure-as-code expertise
  • Excellent communication skills in written and spoken English

Benefits

  • An open and inclusive culture and work environment.
  • Work closely with a collaborative, mission-driven team on cutting-edge AI technology.
  • Full health, dental, and vision benefits
  • Extremely flexible PTO and parental leave policy. Office closed the week of Christmas and New Year's.
  • Remote-flexible, with offices in San Francisco and Seoul and a coworking stipend
  • Visa support (such as H-1B and OPT transfer for US employees)