Tesla - Palo Alto, CA

posted 10 days ago

Full-time - Mid Level
Transportation Equipment Manufacturing

About the position

As a Site Reliability Engineer on Tesla's Supercomputing/AI infrastructure team, you will maintain and enhance the platform that supports the Full Self-Driving (FSD), Tesla Bot, and Dojo engineering teams. You will manage AI infrastructure, monitor performance metrics, troubleshoot Linux systems, and maintain security, all to support neural network training at scale and make efficient use of compute resources.

Responsibilities

  • Support the AI/ML cluster infrastructure on both GPU and Dojo platforms, focusing on systems automation, configuration management and deployment at scale
  • Improve our monitoring & self-healing pipelines, as well as security posture
  • Optimize our server, storage and network performance
  • Develop new tools in Python, Golang or Bash/Shell
  • Use Infrastructure as Code best practices
  • Participate in a 24x7 on-call rotation

Requirements

  • Proficiency in Python, Golang and/or Bash
  • Proficiency with Linux fundamentals and performance optimizations
  • Experience with configuration management software (Ansible, etc.) and with systems monitoring & alerting tools (Prometheus, Grafana, Telegraf, Splunk, etc.)
  • Experience with containerization and orchestration technologies such as Kubernetes
  • Bachelor's Degree in Computer Science, Computer Engineering, Electrical Engineering, Physics, or proof of exceptional skills in a related field
  • 3+ years of additional equivalent experience or evidence of exceptional ability related to the position

Nice-to-haves

  • Experience with Slurm, LSF and storage management of parallel file systems
  • Experience with high-throughput low-latency networks, GPU-based computing systems, and/or high-performance storage systems

Benefits

  • Aetna PPO and HSA plans with $0 payroll deduction
  • Family-building, fertility, adoption and surrogacy benefits
  • Dental and vision plans with $0 paycheck contribution options
  • Company-paid HSA contribution when enrolled in the High Deductible Aetna medical plan with HSA
  • Healthcare and Dependent Care Flexible Spending Accounts (FSA)
  • LGBTQ+ care concierge services
  • 401(k) with employer match
  • Employee Stock Purchase Plans
  • Company-paid Basic Life, AD&D, short-term and long-term disability insurance
  • Employee Assistance Program
  • Sick and Vacation time (Flex time for salary positions), and Paid Holidays
  • Back-up childcare and parenting support resources
  • Voluntary benefits including critical illness, hospital indemnity, accident insurance, theft & legal services, and pet insurance
  • Weight Loss and Tobacco Cessation Programs
  • Tesla Babies program
  • Commuter benefits
  • Employee discounts and perks program