This job is closed

Hardware Systems Engineer, RAS

$163,000 - $225,000/Yr

Meta - Menlo Park, CA

posted 2 months ago

Full-time - Mid Level
5,001-10,000 employees
Web Search Portals, Libraries, Archives, and Other Information Services

About the position

The Hardware Systems Engineer will join Meta's Release to Production (RTP) team, focusing on new product introduction (NPI) hardware. This role is crucial for ensuring the efficient operation of Meta's servers and data centers, which are foundational to the company's rapidly scaling infrastructure. The engineer will be involved in the end-to-end hardware lifecycle, including prototyping, debugging, and optimizing systems for production deployment, particularly in the context of AI technologies.

Responsibilities

  • Interface with external vendors and internal teams to understand system architecture and develop Hardware Fault Management for server products.
  • Leverage knowledge of reliability, availability, and serviceability (RAS) to enhance error reporting and handling mechanisms.
  • Establish metrics and processes for regular assessment and improvement of engineering and operational excellence.
  • Develop data visualization tools to enhance visibility into hardware health issues.
  • Create experiments and tooling to detect and diagnose hardware, firmware, and software health issues.
  • Troubleshoot and diagnose system failures, isolating components and failure scenarios in collaboration with stakeholders.
  • Drive discussions on test specifications and methodologies to continuously improve test quality.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, or a relevant technical field, or equivalent practical experience.
  • 8+ years of work experience in domains such as ASIC development, compute hardware, or AI-ML hardware/software.
  • Knowledge of architecture and components of server, PC, or laptop products.
  • Development or debugging experience in hardware fault management, error reporting, or error handling.

Nice-to-haves

  • 10+ years of experience with AI systems such as GPUs or ASICs, kernel development, or performance optimization.
  • Experience with disaggregated systems architecture at scale.
  • Understanding of the hardware development process and test plan scoping.
  • Experience troubleshooting at the system level across multiple components and hardware/firmware/software boundaries.

Benefits

  • Health insurance
  • 401k
  • Paid holidays
  • Flexible scheduling
  • Professional development
  • Tuition reimbursement
  • Employee stock purchase plan