Scale AI - New York, NY

posted 5 days ago

Full-time - Mid Level
New York, NY
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

As a software engineer on the ML Infrastructure team, you will develop the platform for orchestrating post-training and model evaluation jobs, supporting the continuous development of new data sources and experiments to enhance ML models. The role requires navigating cloud infrastructure challenges and addressing research challenges in benchmarking and tuning large language models (LLMs).

Responsibilities

  • Develop reusable platforms for running in-house and open-source LLM benchmarks.
  • Ensure correctness and performance of post-training and eval jobs on the platform.
  • Improve APIs for managing ML workflows.
  • Contribute to foundational infrastructure for model inference and training.
  • Participate in the team's on-call process to ensure service availability.
  • Own projects end-to-end, from requirements, scoping, and design through implementation.

Requirements

  • 4+ years of experience developing ML platforms.
  • Strong fundamentals in machine learning and backend system design.
  • Prior ML Infrastructure experience.
  • Comfortable with infrastructure and large-scale system design.
  • Ability to diagnose model performance and system failures.

Nice-to-haves

  • Experience building, deploying, and monitoring complex microservice architectures.
  • Experience working with a cloud technology stack (e.g., AWS or GCP).
  • Passion for working closely with researchers to drive business impact.
  • Experience training and/or benchmarking LLMs.
  • Experience with Python, Docker, Kubernetes, and infrastructure as code (e.g., Terraform).

Benefits

  • Comprehensive health, dental, and vision coverage
  • Retirement benefits
  • Learning and development stipend
  • Generous PTO
  • Commuter stipend