This job is closed

Hardware Systems Engineer, RAS

$163,000 - $225,000/Yr

Meta - Bellevue, WA

posted 2 months ago

Full-time - Mid Level
Bellevue, WA
5,001-10,000 employees
Web Search Portals, Libraries, Archives, and Other Information Services

About the position

The Hardware Systems Engineer will join Meta's Release to Production (RTP) team, focusing on new product introduction (NPI) hardware. The role covers the end-to-end hardware lifecycle of Meta servers, including prototyping, debugging, and stress testing. The engineer will collaborate with various teams to develop and optimize high-performance hardware and software technologies for AI at datacenter scale, ensuring that new systems are production-ready and meet operational excellence standards.

Responsibilities

  • Interface with external vendors and internal teams to understand system architecture and develop Hardware Fault Management for server products.
  • Leverage knowledge of reliability, availability, and serviceability (RAS) to enhance error reporting and handling mechanisms.
  • Establish metrics and processes for regular assessment and improvement of engineering and operational excellence.
  • Develop data visualization tools to enhance visibility into hardware health issues.
  • Create experiments and tooling to detect and diagnose hardware, firmware, and software health issues.
  • Troubleshoot and diagnose system failures, isolating components and failure scenarios in collaboration with stakeholders.
  • Drive discussions on test specifications and methodologies to continuously improve test quality.

Requirements

  • 8+ years of work experience in ASIC development, compute hardware, or AI-ML hardware/software.
  • Knowledge of server, PC, or laptop architecture and components.
  • Development or debugging experience in hardware fault management and error handling.

Nice-to-haves

  • 10+ years of experience with AI systems such as GPUs or ASICs, kernel development, and performance optimization.
  • Experience with disaggregated systems architecture at scale.
  • Understanding of the hardware development process and test plan scoping.
  • Experience troubleshooting at the system level across multiple components.

Benefits

  • Health insurance
  • Dental insurance
  • Vision insurance
  • 401(k) plan
  • Paid holidays
  • Paid time off
  • Flexible scheduling
  • Professional development opportunities
  • Employee stock purchase plan
  • Tuition reimbursement