Meta - Austin, TX
posted 2 months ago
Meta is seeking a Hardware Systems Engineer to join our Release to Production (RTP) team, focusing on new product introduction (NPI) hardware. The RTP team is integral to the end-to-end hardware lifecycle of all Meta servers, which serve as the backbone of our rapidly scaling infrastructure. This role involves prototyping experimental hardware, conducting hands-on pre-production system and hardware debugging, and stress testing to ensure that system monitoring is production-ready. The RTP team also plays a crucial role in exploring, developing, and productizing high-performance software and hardware technologies for AI at datacenter scale.

As a Hardware Systems Engineer, you will collaborate closely with hardware/software co-design teams, hardware designers, networking teams, system manufacturers, component vendors, capacity engineering, production engineering, production services, and data center operations teams to enable the deployment of new systems in our production data centers. You will work across service and hardware architectures for new AI systems, build prototypes to demonstrate their value, facilitate go/no-go decisions, and optimize these systems in production.

In this role, you will interface with external vendors and internal teams, including hardware, mechanical, power, thermal, manufacturing, and software engineers, to understand system architecture and guide the development of Hardware Fault Management for various server products. You will apply your deep understanding of RAS (reliability, availability, serviceability) to enhance error reporting and handling mechanisms, improving operational quality and cost efficiency. You will champion engineering and operational excellence by establishing metrics and processes for regular assessment and improvement. Additionally, you will build visibility into hardware health through data visualization and implement systemic solutions to hardware health issues.
You will also proactively create experiments and tooling to detect and diagnose hardware, firmware, and software health issues. You will troubleshoot, diagnose, and root-cause system failures in collaboration with internal and external stakeholders, and drive discussions on test specifications and methodologies to continuously improve test quality.