About the position

As a Platform Engineer supporting Applied AI systems, you will build and maintain the infrastructure that powers AI-driven models, agents, and data pipelines. Your role will ensure that AI solutions run reliably, securely, and efficiently at scale, supporting model serving, agent runtimes, continuous deployment, and infrastructure automation. You will work on optimizing GPU utilization, containerized deployments, autoscaling, observability, and CI/CD improvements, ensuring that AI models and applications perform optimally across diverse workloads.

Responsibilities

  • Design and implement scalable AI model inference environments, optimizing GPU utilization and autoscaling strategies.
  • Develop and maintain infrastructure as code (IaC) using tools like Terraform, Pulumi, or CloudFormation.
  • Optimize model deployment pipelines with CI/CD automation, ensuring fast rollouts and seamless rollback mechanisms.
  • Monitor system performance, tracking latency spikes, GPU memory constraints, and workload distribution, and refine autoscaling policies accordingly.
  • Set up alerts, dashboards, and system observability tools (Prometheus, Grafana, Datadog) to improve reliability.
  • Implement failover mechanisms so that model-serving failures automatically trigger rerouting to healthy nodes.
  • Compare containerized deployments across AWS, GCP, and Azure, and optimize them for cost efficiency and performance.
  • Integrate security tooling that scans for vulnerabilities in AI models, data pipelines, and containerized environments.
  • Experiment with Kubernetes (K8s), Nomad, and container orchestration tools to improve resilience and scalability of AI workloads.
  • Collaborate with AI engineers and ML researchers to align infrastructure with evolving AI applications and model-serving needs.

Requirements

  • 5–8+ years of experience in DevOps, Platform Engineering, or Cloud Infrastructure with a focus on AI/ML workloads.
  • 3+ years of experience deploying and scaling AI models in production, including model-serving infrastructure and GPU optimization.
  • Expertise in containerization and orchestration tools (Docker, Kubernetes, Nomad).
  • Strong knowledge of cloud platforms (AWS, GCP, Azure) and of optimizing GPU-based model deployments (NVIDIA Triton, TensorRT, or ONNX Runtime).
  • Experience with CI/CD pipelines, infrastructure automation, and GitOps methodologies.
  • Proficiency in Infrastructure as Code (IaC) tools like Terraform, Pulumi, or CloudFormation.
  • Experience with observability and monitoring tools (Prometheus, Grafana, Datadog, New Relic).
  • Familiarity with load balancing, failover strategies, and high-availability architectures for AI workloads.
  • Understanding of security best practices for AI systems, including vulnerability scanning and compliance automation.
  • Strong scripting and automation skills using Python, Bash, or Go.

Nice-to-haves

  • Experience deploying LLMs and AI agent architectures in production.
  • Familiarity with multi-cloud model deployments and cost-optimization strategies.
  • Knowledge of GPU scheduling techniques and efficient AI workload orchestration.
  • Experience working with AI model-serving platforms (Triton, Ray Serve, or TensorFlow Serving).

Benefits

  • Medical, dental, and vision insurance.
  • Short- and long-term disability insurance.
  • Life insurance.
  • 401(k) available on the first day of the month after start date.
  • Flexible PTO.