This job is closed

We regret to inform you that the job you were interested in has been closed. Although this specific position is no longer available, we encourage you to continue exploring other opportunities on our job board.

Senior AI Infrastructure Engineer

$160,000 - $230,000/Yr

Together AI - San Francisco, CA

posted 4 days ago

Full-time - Senior
Remote - San Francisco, CA

About the position

As a Senior AI Infrastructure Engineer, you will build the next-generation, highly available, global, multi-cloud PaaS platform on open-source technologies to enable and accelerate Together AI's rapid growth. The system spans diverse environments (Kubernetes, VMs, bare-metal compute, and edge deployments) and provides a cohesive, reliable abstraction for running AI workloads across them. You will act as a technology thought leader, evangelize new, cutting-edge technologies, and solve complex problems. To be successful, you will need to be deeply technical and have excellent communication, collaboration, and diplomacy skills. You should have experience practicing infrastructure-as-code with tools like Terraform and Ansible, strong software development fundamentals, and strong systems knowledge and troubleshooting abilities.

Responsibilities

  • Perform architecture and research work for decentralized AI workloads
  • Work on the core, open-source Together AI platform
  • Create services, tools, and developer documentation
  • Create testing frameworks for robustness and fault-tolerance

Requirements

  • 5+ years of professional software development experience and proficiency in at least one backend programming language (Golang desired)
  • Demonstrated experience with high-performance or distributed cloud microservice architectures, ideally building and operating them at global scale across multiple cloud providers such as AWS, Azure, or GCP
  • Excellent understanding of low-level operating system concepts, including multi-threading, memory management, networking, storage, performance, and scale
  • Pragmatic, methodical, well-organized, detail-oriented, and self-starting
  • Experience with Kubernetes and containerization, VPNs, AI workloads, and blockchain-based protocols a plus
  • GPU programming, NCCL, and CUDA knowledge a plus
  • Experience with PyTorch or TensorFlow a plus
  • 5+ years of experience writing high-performance, well-tested, production-quality code

Nice-to-haves

  • Experience with Kubernetes and containerization
  • VPNs, AI workloads, and blockchain-based protocols
  • GPU programming, NCCL, and CUDA knowledge
  • Experience with PyTorch or TensorFlow

Benefits

  • Competitive compensation
  • Startup equity
  • Health insurance
  • Flexible remote work