Tesla
Posted about 2 months ago
Palo Alto, CA
Motor Vehicle and Parts Dealers

About the position

As a Site Reliability Engineer (SRE) on the AI Platform team, you will manage bleeding-edge bare-metal servers for Tesla's advanced generative AI platform. You will be responsible for the imaging, configuration management, observability, security, and scalability of these systems, and you will also manage the model benchmarks and their outputs. You should focus on automating whatever the AI Platform team needs and use various platforms to make it as easy as possible for the team's software engineers to run their services reliably on the bare-metal platform.

Responsibilities

  • Help image bare-metal servers
  • Build tooling around bare-metal imaging, evaluate its usage, and help ensure its reliability, availability, and security
  • Design software and systems that enable the generative AI platform at Tesla
  • Assist the AI Platform team with onboarding and integrating services into the Tesla stack (Kubernetes/VMware/bare-metal)
  • Ensure best practices and observability for services, including metrics, logging, tracing, and alerting
  • Automate configuration and deployment of services
  • Consult on and design infrastructure, systems and software architecture

Requirements

  • Experience with bare-metal imaging and management
  • Expert skills in Linux and its administration (Ubuntu 22.04/24.04)
  • Experience in a high-level language such as Go, Python, and/or Java
  • Observability (OpenTelemetry, Prometheus, Alertmanager, Grafana, Jaeger, and Splunk)
  • Infrastructure as Code (Ansible) and CI/CD pipeline experience (GitHub Actions, Jenkins)
  • Artifact management (Artifactory)
  • Strong bias for action over endless planning; willing to get your hands dirty and occasionally make mistakes
  • Habitual documenter and spreader of knowledge
  • Willing to mentor team members and other engineers who have less SRE experience
  • Comfortable being on an on-call rotation and troubleshooting live issues on NOC bridges and outage calls