Meta - Menlo Park, CA

posted 2 months ago

Full-time
Web Search Portals, Libraries, Archives, and Other Information Services

About the position

The DC Networking team at Meta is responsible for the development, deployment, and operation of the company's global data center networks. The team's work covers the entire network lifecycle, including hardware development, capacity planning, and the implementation of both distributed and centralized control systems. It spans many aspects of network management: modeling, provisioning, automation, monitoring, troubleshooting, analytics, and simulation, design, and failure analysis. We are seeking Software Engineers who are passionate about networking and have an aptitude for building scalable distributed systems. This position offers the opportunity to work on one of the most dynamic and fast-paced networks in the world, where you will develop innovative solutions to complex challenges and deploy them into production.

As a Software Engineer in Data Center Networking, you will design and implement drivers and firmware for network Ethernet adapter functions, as well as the transport stack for RDMA and control functions with the host and accelerators. You will also design and implement platform services for programming, monitoring, and controlling system components such as optics, PHY, FPGAs, sensors, and power management systems. Additionally, you will develop and enhance high-performance computing (HPC) collective communication and parallel computing libraries, including NCCL, RCCL, OneCCL, and MPI. Debugging complex, system-level, multi-component issues that span multiple layers, from the kernel to user-mode applications, will also be a key responsibility.

Responsibilities

  • Design and implement drivers and/or firmware for network Ethernet adapter functions, the RDMA transport stack, and control functions with the host and accelerators.
  • Design and implement platform services for programming, monitoring, and controlling system components (optics, PHY, FPGAs, sensors, fan control, power, etc.).
  • Develop and enhance HPC collective communication and parallel computing libraries such as NCCL, RCCL, OneCCL, and MPI.
  • Debug complex, system-level, multi-component issues that typically span multiple layers, from the kernel to user-mode applications.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.
  • 5+ years of experience in C/C++ and Python.
  • 5+ years of experience in systems programming, TCP/IP, HTTP/HTTPS, SPDY, DNS, and load balancers.
  • Experience with network devices (routers, switches, load balancers) and an understanding of network routing protocols.

Nice-to-haves

  • Experience with the Linux kernel, especially drivers and the network stack.
  • Working knowledge of transport stacks, particularly RDMA (RoCEv2).
  • Experience with QEMU and FPGA emulation environments.
  • Experience with parallel computing platforms such as CUDA, ROCm, and OpenCL.
  • Experience with platform services (programming, controlling, and monitoring optics, PHY, FPGAs, sensors, fan control, power, etc.), board support packages (BSP), operating systems, kernel, bootloader, power management, RTOS, and Linux.