Scale AI - New York, NY

posted 3 months ago

Full-time - Mid Level
New York, NY
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

Scale is seeking an AI/ML Infrastructure Engineer to join our Machine Learning Infrastructure team, focusing on the development of our Training Platform. In this role, you will collaborate closely with Machine Learning researchers to understand their requirements and leverage your domain expertise, along with our compute resources, to increase experimentation throughput. The ideal candidate has a solid foundation in machine learning and backend system design, and prior experience in ML Infrastructure. Comfort with infrastructure and large-scale system design, as well as the ability to diagnose both model performance issues and system failures, is essential.

As an AI/ML Infrastructure Engineer, you will be responsible for building highly available, observable, performant, and cost-effective APIs for model training. You will also participate in our team's on-call process to ensure the availability of our services. This position requires you to own projects from start to finish, encompassing requirements gathering, scoping, design, and implementation, all within a highly collaborative and cross-functional environment. You will need to exercise good judgment in system and tool development, knowing when to build versus buy, with a keen eye for cost efficiency.

The ideal candidate should have at least 4 years of experience building machine learning training pipelines or inference services in a production environment. Familiarity with distributed training techniques such as DeepSpeed and FSDP is highly desirable, as is experience building, deploying, and monitoring complex microservice architectures. Proficiency in Python, Docker, Kubernetes, and Infrastructure as Code (e.g., Terraform) is also required. Additionally, experience with LLM inference latency optimization techniques and cloud technology stacks (e.g., AWS or GCP) would be advantageous.

Responsibilities

  • Build highly available, observable, performant, and cost-effective APIs for model training.
  • Participate in the team's on-call process to ensure the availability of services.
  • Own projects end-to-end, from requirements gathering and scoping through design and implementation, in a collaborative environment.
  • Exercise good judgment in building systems and tools, making build vs. buy tradeoffs with a focus on cost efficiency.

Requirements

  • 4+ years of experience building machine learning training pipelines or inference services in a production setting.
  • Experience with distributed training techniques such as DeepSpeed, FSDP, etc.
  • Experience building, deploying, and monitoring complex microservice architectures.
  • Proficiency in Python, Docker, Kubernetes, and Infrastructure as Code (e.g., Terraform).

Nice-to-haves

  • Experience with LLM inference latency optimization techniques, e.g., kernel fusion, quantization, dynamic batching, etc.
  • Experience working with a cloud technology stack (e.g., AWS or GCP).

Benefits

  • Comprehensive health, dental, and vision coverage
  • Retirement benefits
  • Learning and development stipend
  • Generous PTO
  • Commuter stipend