Hardware Systems Engineer, RAS

$163,000 - $225,000/Yr

Meta - Austin, TX

posted 2 months ago

Full-time - Mid Level
Austin, TX
5,001-10,000 employees
Web Search Portals, Libraries, Archives, and Other Information Services

About the position

The Hardware Systems Engineer will join Meta's Release to Production (RTP) team, focusing on New Product Introduction (NPI) hardware. The role is central to keeping Meta's servers and data centers running efficiently, as they are the foundation of the company's rapidly scaling infrastructure. The engineer will work across the end-to-end hardware lifecycle, including prototyping, debugging, and optimizing systems for production deployment, particularly in the context of AI technologies.

Responsibilities

  • Interface with external vendors and internal teams to understand system architecture and develop Hardware Fault Management for server products.
  • Leverage knowledge of reliability, availability, and serviceability (RAS) to enhance error reporting and handling mechanisms.
  • Establish metrics and processes for regular assessment and improvement of engineering and operational excellence.
  • Develop data visualization tools to enhance visibility into hardware health issues.
  • Create experiments and tooling to detect and diagnose hardware, firmware, and software health issues.
  • Troubleshoot and diagnose system failures, isolating components and failure scenarios in collaboration with stakeholders.
  • Drive discussions on test specifications and methodologies to continuously improve test quality.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, or a relevant technical field, or equivalent practical experience.
  • 8+ years of work experience in domains such as ASIC development, compute hardware, or AI-ML hardware/software.
  • Knowledge of architecture and components of server, PC, or laptop products.
  • Development or debugging experience in hardware fault management, error reporting, or error handling.

Nice-to-haves

  • 10+ years of experience with AI systems such as GPUs or ASICs, kernel development, or performance optimization.
  • Experience with disaggregated systems architecture at scale.
  • Understanding of the hardware development process and test plan scoping.
  • Experience troubleshooting at the system level across multiple components and boundaries.

Benefits

  • Health insurance
  • Dental insurance
  • Vision insurance
  • 401(k) plan
  • Paid holidays
  • Paid time off
  • Flexible scheduling
  • Professional development opportunities
  • Employee stock purchase plan
  • Tuition reimbursement