Tesla - Palo Alto, CA
posted 4 months ago
Tesla's Supercomputing/AI Infrastructure team is at the forefront of high-performance computing and machine learning infrastructure, which is essential to running our machine learning algorithms. This work spans a wide range of applications, including virtual simulations, Autopilot hardware, silicon design, and the Dojo supercomputer. As demand for data and optimized compute resources grows, our cluster builds are becoming larger and more complex, so the continued development and automation of deployment, monitoring, self-healing, and alerting are critical to the success of our engineering teams. The importance of this team and its contributions will only increase as we scale our Full Self-Driving (FSD) and Robotaxi initiatives.

As a Site Reliability Engineer, you will maintain and enhance our infrastructure so that engineering teams across Autopilot, AI, and Dojo have the tools and resources they need to maximize their productivity. Your responsibilities will include managing and operating our high-performance computing (HPC) clusters; monitoring compute, GPU, and network metrics; troubleshooting Linux systems; and tuning performance. You will also collaborate closely with our Data Center team to ensure the smooth operation of hundreds of servers and the successful deployment of new GPU capacity.

Your efforts will directly support neural network training at scale, streamline the development of FSD, and help Dojo achieve its goal of becoming the most powerful supercomputer to date.