Google • Posted 10 months ago
$278,000 - $399,000/Yr
Full-time • Senior
New York, NY
Web Search Portals, Libraries, Archives, and Other Information Services

This job is no longer available

As the Principal Site Reliability Engineer for ML Acceleration, you will be responsible for ensuring that Google's ML resources are delivered as quickly as possible. You will understand the end-to-end technical and logistical challenges of taking chips received from fabs around the world and turning them into highly connected ML supercomputers operating in gigawatt-scale data centers. This means reviewing all capacity acceleration programs and providing technical direction and decision-making so that usable ML capacity is maximized in the shortest delivery time. You will knit together different technical organizations to help them produce a globally optimal outcome for ML capacity. Acceleration is a multi-constrained problem, full of nuance, complexity, and hard trade-offs. You will work closely with technical and planning teams across Data Center Construction, Networking, and Machine Delivery to make critical decisions and drive strategy.

Behind everything our users see online is the architecture built by the Technical Infrastructure team to keep it running. From developing and maintaining our data centers to building the next generation of Google platforms, we make Google's product portfolio possible. We're proud to be our engineers' engineers and love voiding warranties by taking things apart so we can rebuild them. We keep our networks up and running, ensuring our users have the best and fastest experience possible.
