Hardware Systems Engineer, RAS

$163,000 - $225,000/Yr

Meta - Seattle, WA

posted 2 months ago

Full-time - Mid Level
5,001-10,000 employees
Web Search Portals, Libraries, Archives, and Other Information Services

About the position

The Hardware Systems Engineer will join Meta's Release to Production (RTP) team, focusing on new product introduction (NPI) hardware. The role covers the end-to-end hardware lifecycle of Meta servers, including prototyping, debugging, and stress testing. The engineer will collaborate with various teams to develop and optimize high-performance software and hardware technologies for AI at datacenter scale, ensuring that new systems are production-ready and meet operational excellence standards.

Responsibilities

  • Interface with external vendors and internal teams to understand system architecture and develop Hardware Fault Management for server products.
  • Leverage knowledge of reliability, availability, and serviceability (RAS) to enhance error reporting and handling mechanisms.
  • Establish metrics and processes for regular assessment and improvement of engineering and operational excellence.
  • Develop data visualization tools to enhance visibility into hardware health issues.
  • Create experiments and tooling to detect and diagnose hardware, firmware, and software health issues.
  • Troubleshoot and diagnose system failures, isolating components and failure scenarios in collaboration with stakeholders.
  • Drive discussions on test specifications and methodologies to continuously improve test quality.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, or a relevant technical field, or equivalent practical experience.
  • 8+ years of work experience in domains such as ASIC development, compute hardware, or AI-ML hardware/software.
  • Knowledge of architecture and components of server, PC, or laptop products.
  • Development or debugging experience in hardware fault management, error reporting, or error handling.

Nice-to-haves

  • 10+ years of experience with AI systems such as GPU/ASIC accelerators, kernel development, or performance optimization.
  • Experience with disaggregated systems architecture at scale.
  • Understanding of the hardware development process and test plan scoping.
  • Experience troubleshooting at the system level across multiple components and hardware/firmware/software boundaries.

Benefits

  • Health insurance
  • Dental insurance
  • Vision insurance
  • 401(k) plan with matching
  • Paid holidays
  • Paid time off
  • Flexible scheduling
  • Professional development opportunities
  • Employee stock purchase plan
  • Tuition reimbursement