DataDirect Networks - Remote, OR

Full-time - Senior
Remote, OR
Computer and Electronic Product Manufacturing

About the position

DDN Storage is looking for a Senior ML Engineer to design, deploy, and optimize AI/ML training and advanced Retrieval-Augmented Generation (RAG) pipelines for high-performance AI applications. The role involves collaborating with data scientists and software developers to operationalize models using open-source tools, ensuring robust and efficient deployment of AI solutions. It is central to strengthening DDN's AI and data management capabilities and to sustaining the company's leadership in the industry.

Responsibilities

  • Design and deploy large-scale AI/ML training pipelines using open-source tools such as Apache Spark and Apache Airflow.
  • Integrate MLflow with DDN's Infinia product for tracking and managing machine learning experiments, model versioning, and deployment.
  • Implement and scale Retrieval-Augmented Generation (RAG) pipelines to enable efficient retrieval of knowledge for generative models.
  • Automate, monitor, and optimize end-to-end ML workflows and pipelines for production-grade applications.
  • Collaborate with cross-functional teams, including data science, engineering, and product, to operationalize AI/ML models.
  • Maintain and improve CI/CD pipelines for ML models, ensuring smooth transitions from research to production environments.
  • Utilize cloud platforms (AWS, GCP, or Azure) for scalable infrastructure management.
  • Monitor and troubleshoot pipeline performance issues, implementing solutions to optimize runtime and resource usage.
  • Ensure best practices in version control, containerization (Docker, Kubernetes), and infrastructure as code (Terraform, Ansible).
  • Keep up to date with the latest developments in MLOps, AI/ML frameworks, and tooling.

Requirements

  • Bachelor's or Master's degree in Computer Science, Data Science, Machine Learning, or a related field.
  • 8+ years of experience in machine learning operations (MLOps) or related roles.
  • Extensive experience with Apache Spark, Apache Airflow, and MLflow or equivalent tools.
  • Proven expertise in building and scaling AI/ML pipelines.
  • Strong understanding of machine learning frameworks and libraries (TensorFlow, PyTorch, NVIDIA NeMo).
  • Experience in deploying open-source vector databases at scale.
  • Solid understanding of cloud infrastructure (AWS, GCP, Azure) and distributed computing.
  • Proficiency with containerization tools (Docker, Kubernetes) and infrastructure as code.
  • Excellent problem-solving and troubleshooting skills, with attention to detail and performance optimization.
  • Strong communication and collaboration skills.

Nice-to-haves

  • Experience with large-scale data processing and storage solutions (Hadoop, Hive, HDFS).
  • Knowledge of NLP techniques and tools for model deployment.