This job is closed

We regret to inform you that the job you were interested in has been closed. Although this specific position is no longer available, we encourage you to continue exploring other opportunities on our job board.

Cloudberyl LLC - Dallas, TX

Posted 19 days ago

Full-time - Senior

About the position

The Senior Infrastructure Engineer will architect and deploy high-performance computing clusters with multi-GPU support for AI/ML workloads. The role covers GPU resource scheduling and optimization, infrastructure automation, and container orchestration to keep GPU-accelerated workloads running efficiently, along with performance tuning, security compliance, and collaboration with MLOps teams to integrate GPU clusters into CI/CD pipelines.

Responsibilities

  • Architect and deploy high-performance computing clusters with multi-GPU support for AI/ML workloads.
  • Implement and optimize GPU resource scheduling, job queuing, and distributed training setups.
  • Leverage NVIDIA CUDA to optimize performance for AI/ML models and workloads.
  • Fine-tune GPU configurations for multi-GPU systems, ensuring maximum throughput and minimal latency.
  • Build Infrastructure as Code (IaC) solutions using Terraform to automate the provisioning and management of on-premise infrastructure.
  • Create scalable templates for consistent resource deployment.
  • Deploy and manage container orchestration systems (e.g., Kubernetes, Docker Swarm) to run scalable GPU-accelerated workloads.
  • Monitor and troubleshoot issues in distributed systems with tools like NVIDIA DCGM, Prometheus, or similar.
  • Optimize AI/ML pipelines for distributed training across multi-GPU nodes.
  • Develop strategies to efficiently utilize NVLink, NCCL, and other NVIDIA technologies.
  • Set up robust monitoring and alerting systems to track GPU utilization, node health, and workload performance.
  • Collaborate with MLOps teams to integrate GPU clusters into CI/CD pipelines.
  • Implement security best practices for sensitive AI/ML workloads in an on-premise environment.
  • Ensure compliance with organizational policies and industry standards.
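To give a flavor of the scheduling and job-queuing work described above, here is a deliberately simplified sketch of FIFO job queuing against a fixed pool of GPUs. It is illustrative only: the `Job` and `GPUPool` names are invented for this example, and real schedulers such as Slurm or the Kubernetes scheduler handle far more (priorities, preemption, topology, MIG partitions).

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Job:
    name: str
    gpus_needed: int

@dataclass
class GPUPool:
    """Toy model of one node's GPUs: launches queued jobs in FIFO
    order whenever enough GPUs are free (hypothetical example)."""
    total_gpus: int
    free_gpus: int = field(init=False)
    queue: deque = field(default_factory=deque)
    running: dict = field(default_factory=dict)

    def __post_init__(self):
        self.free_gpus = self.total_gpus

    def submit(self, job: Job):
        self.queue.append(job)
        self.schedule()

    def schedule(self):
        # Launch jobs from the head of the queue while they fit.
        while self.queue and self.queue[0].gpus_needed <= self.free_gpus:
            job = self.queue.popleft()
            self.free_gpus -= job.gpus_needed
            self.running[job.name] = job

    def finish(self, name: str):
        # Completed jobs return their GPUs, which may unblock the queue.
        job = self.running.pop(name)
        self.free_gpus += job.gpus_needed
        self.schedule()

pool = GPUPool(total_gpus=8)
pool.submit(Job("train-a", 4))
pool.submit(Job("train-b", 4))
pool.submit(Job("train-c", 2))   # queued: no GPUs free yet
pool.finish("train-a")           # frees 4 GPUs; train-c can start
```

A production system would add priorities, fair-share accounting, and backfill, but the core invariant is the same: never oversubscribe the pool, and re-run the scheduling pass whenever capacity is released.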

Requirements

  • 7+ years in infrastructure engineering, with at least 5 years of direct experience in GPU-accelerated systems and NVIDIA CUDA.
  • Proven experience in deploying and managing multi-GPU systems for AI/ML workloads.
  • Proficiency with NVIDIA CUDA for GPU programming and performance tuning.
  • Hands-on experience with NVIDIA tools and libraries, including NVLink, NCCL, and cuDNN.
  • Familiarity with MIG (Multi-Instance GPU) configurations and multi-GPU scaling techniques.
  • Advanced knowledge of Terraform and scripting languages like Python or Bash for automation.
  • Proficiency with container orchestration tools like Kubernetes or similar.
  • Expertise in workload management systems and GPU monitoring tools (e.g., NVIDIA DCGM, Slurm).
  • Experience in deploying and optimizing distributed training frameworks (e.g., TensorFlow MultiWorkerMirroredStrategy, PyTorch DDP).
  • Strong understanding of networking, storage, and system architecture for high-performance compute environments.
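The distributed-training requirements above revolve around gradient all-reduce, the collective that NCCL accelerates over NVLink and that PyTorch DDP runs after each backward pass. Below is a pure-Python sketch of the ring all-reduce pattern, with plain lists standing in for per-GPU gradient buffers; it shows the two phases (scatter-reduce, then all-gather) conceptually and makes no use of real NCCL APIs.

```python
def ring_allreduce(grads):
    """Simulate ring all-reduce over n workers (each inner list stands
    in for one GPU's gradient buffer).  After the call, every worker
    holds the elementwise sum of all buffers.  Conceptual sketch only."""
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "buffer length must split into n equal chunks"
    c = size // n
    buf = [list(g) for g in grads]          # copy: leave inputs intact

    # Phase 1, scatter-reduce: in each of n-1 steps, worker r passes
    # one chunk to its ring neighbor, which accumulates it.  Afterwards
    # worker r holds the fully reduced chunk (r + 1) % n.
    for s in range(n - 1):
        for r in range(n):
            dst, k = (r + 1) % n, (r - s) % n
            for i in range(k * c, (k + 1) * c):
                buf[dst][i] += buf[r][i]

    # Phase 2, all-gather: circulate the reduced chunks around the ring
    # so every worker ends up with all of them.
    for s in range(n - 1):
        for r in range(n):
            dst, k = (r + 1) % n, (r + 1 - s) % n
            for i in range(k * c, (k + 1) * c):
                buf[dst][i] = buf[r][i]

    return buf

workers = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
reduced = ring_allreduce(workers)   # every worker: [12.0, 15.0, 18.0]
```

Each worker sends and receives only one chunk per step, which is why the ring pattern keeps bandwidth use roughly constant as the number of GPUs grows; NCCL implements the same idea (among other algorithms) in hardware-aware form.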

Nice-to-haves

  • Strong problem-solving abilities and critical thinking skills.
  • Excellent communication skills for cross-functional collaboration.
  • Leadership capabilities to guide junior engineers and manage projects.