AI Hardware Systems Engineer

$149,200 - $234,850/Yr

eBay - San Jose, CA

posted 4 months ago

Full-time - Mid Level
Remote - San Jose, CA
Professional, Scientific, and Technical Services

About the position

At eBay, we're more than a global ecommerce leader - we're changing the way the world shops and sells. Our platform empowers millions of buyers and sellers in more than 190 markets around the world. We're committed to pushing boundaries and leaving our mark as we reinvent the future of ecommerce for enthusiasts. Our customers are our compass, authenticity thrives, bold ideas are welcome, and everyone can bring their unique selves to work - every day. We're in this together, sustaining the future of our customers, our company, and our planet. Join a team of passionate thinkers, innovators, and dreamers - and help us connect people and build communities to create economic opportunity for all.

At eBay, we have started a new chapter in our iconic internet history of being the largest online marketplace in the world. We have more than 800 million listings with 80% of them selling new items, in over 400 markets around the world. The collection of services runs on a significant server and storage infrastructure, and the hardware engineering team is chartered to drive reliability, efficiency and performance of this layer. More and more of this workload is taking advantage of AI accelerators, and we are creating a role to focus on this area.

We are looking for a Systems Software Engineer to join our team to qualify and automate testing of new hardware technologies related to AI, as well as support some of our traditional qualifications efforts. This person will interface with internal eBay teams working on AI platforms, other platform teams, key technology and systems integration vendors, AI open source software communities, and with other members of the hardware engineering team.

Responsibilities

  • Work as part of the Hardware Engineering team to reduce the cost of purchasing and operating eBay's fleet of servers, saving millions of dollars a year.
  • Focus on AI hardware platforms to leverage AI in business operations.
  • Translate internal customer requests into requirements, and develop benchmarks and test suites to ensure our platforms meet their needs.
  • Evaluate the performance and reliability of new hardware platforms and hardware components using automated tests, with a strong focus on AI accelerators.
  • Expand and maintain the automation we use daily for testing and reliability work.
  • Develop performance test plans and experiments with our customer teams to ensure we use our hardware to its full capability.
  • Work with our customers to debug and address any reliability or performance issues they have with our server products.
  • Identify and suggest the ideal OS and BIOS settings for our systems.
  • Explore and propose new hardware/software technologies that improve performance or reduce cost of our products, particularly new AI accelerators.
  • Improve our monitoring and data collection tooling to ensure we're recording relevant information.

Requirements

  • 5-8 years of systems engineering experience using Linux as an operating system.
  • Understanding of how to configure servers to expose AI accelerators.
  • Experience with AI frameworks and platforms, ideally including benchmarking services or accelerators with tools such as PyTorch, DeepSpeed, or MLPerf.
  • Ability to explain how Linux utilizes various hardware components and what tunables it provides.
  • Proficiency in Python and Bash for automating tasks.
  • Experience using a revision control system like Git, and familiarity with concepts like branching and merging.
  • Ability to build and use containers using Docker or another technology.
  • Understanding of how to compile and build source code, especially the Linux kernel.
  • BS in Electrical Engineering or Computer Science with continued formal or informal education.

Nice-to-haves

  • Understanding of the differences between AI accelerators from multiple vendors and their architectures.
  • Familiarity with extending a monitoring framework like Prometheus to collect additional data from testing.
  • Familiarity with Kubernetes and cloud computing concepts.
  • Experience using various profiling and performance tools like perf, VTune, or Performance Co-Pilot.
  • Experience analyzing logs and working with data repositories to help drive technical decisions.
  • Experience deploying and configuring systems at scale using standard technologies such as PXE, Ansible, Salt, or Puppet.

Benefits

  • 401(k) eligibility
  • Various paid time off benefits, such as PTO and parental leave
  • Target bonus and restricted stock units (as applicable)