Machine Learning Operations Engineer

Unclassified - Cambridge, MA

posted 3 months ago

Full-time

About the position

FL97 is seeking a dedicated and skilled Machine Learning Operations (MLOps) Engineer to join our team. This role will focus on building and maintaining the private cloud infrastructure used to train large-scale machine learning models. You will be part of a dynamic, cross-functional team responsible for developing new artificial intelligence models that push the frontier of science. Working closely with biologists, bioinformaticians, software developers, machine learning scientists, and automation engineers, you will contribute to the development of ML models for a range of scientific applications.

The ideal candidate has a strong background in machine learning with a focus on MLOps, model training, and deployment, as well as either experience in the biotech industry or a record of scientific achievement.

Responsibilities include developing and managing a large cloud-based cluster with over 100 GPUs in support of FL97 machine learning scientists. You will implement MLOps practices to streamline the model development and deployment process, collaborating with cross-functional teams to integrate ML models into the data pipelines for our labs. You will also be responsible for implementing rigorous testing, documentation, and performance benchmarking to ensure the reliability and efficiency of the models developed.

At FL97, we are uniquely cross-functional and collaborative. We are actively reimagining the way teams work together and communicate. Therefore, we seek individuals with an inclusive mindset and a diversity of thought. Our teams thrive in unstructured and creative environments. All voices are heard because we know that experience comes in many forms, skills are transferable, and passion goes a long way. If this sounds like an environment you'd love to work in, even if you only have some of the experience listed below, please apply.

Responsibilities

Developing and managing a large cloud-based cluster with >100 GPUs in support of FL97 machine learning scientists.
Implementing MLOps practices to streamline the model development and deployment process.
Collaborating with cross-functional teams to integrate ML models into the data pipelines for our labs.
Implementing rigorous testing, documentation, and performance benchmarking.

Requirements

Master's degree (or equivalent experience) in computer science, computational biology, physics, or other quantitative disciplines.
Experience managing Kubernetes clusters with kubectl on cloud-based GPU infrastructure such as Lambda Labs or AWS.
Experience with MLOps practices and tools including version control, automated testing, and CI/CD.
Experience with GPU-accelerated ML computing in at least PyTorch, and robust experience in the Python data science ecosystem.
Knowledge of additional high-performance libraries such as Accelerate or DeepSpeed is a plus.
Experience managing large, containerized multi-GPU training runs for large language models on Ray, Dask, Kueue, Slurm, or similar tools.

Nice-to-haves

Experience in the biotech industry or a record of scientific achievement.
Familiarity with AI experimental design and simulation.
Experience with automated custom instrumentation.
Knowledge of generative molecular and material design.
