Nvidia - Santa Clara, CA

posted 2 months ago

Full-time - Principal
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

NVIDIA is seeking a Principal Engineer to join our Distributed Machine Learning team, which focuses on GPU-accelerated Apache Spark. In this role, you will develop solutions that let data scientists apply machine learning (ML) and deep learning (DL) algorithms to large datasets when training AI models. The goal is to improve the performance and usability of existing solutions so that data scientists can build AI models more efficiently and cost-effectively.

You will design and develop user-friendly APIs and libraries that optimize the use of existing DL/ML frameworks in GPU-enabled Spark clusters for distributed training and inference at scale. This includes creating GPU-accelerated ML libraries and improving our existing open source spark-rapids-ml library. You will demonstrate the superior performance of your solutions on industry-standard benchmarks and datasets, and make significant technical contributions to open source projects such as RAPIDS, XGBoost, spark-rapids-ml, and Apache Spark.

You will also collaborate with NVIDIA partners and customers to deploy distributed ML algorithms in cloud and on-premises environments, stay current with the latest advances in distributed ML systems and algorithms, and provide technical mentorship to a team of engineers.

Responsibilities

  • Design and develop new user-friendly APIs and libraries for distributed DL/ML training and inference in GPU-enabled Spark clusters.
  • Create GPU-accelerated ML libraries for distributed training and inference on Spark clusters, improving existing open source libraries.
  • Demonstrate superior performance of developed solutions on industry-standard benchmarks and datasets.
  • Make technical contributions to enhance capabilities of open source projects such as RAPIDS, XGBoost, spark-rapids-ml, and Apache Spark.
  • Collaborate with NVIDIA partners and customers on deploying distributed ML algorithms in cloud or on-premises environments.
  • Stay current with published advances in distributed ML systems and algorithms.
  • Provide technical mentorship to a team of engineers.

Requirements

  • BS, MS, or PhD in Computer Science, Computer Engineering, or closely related field (or equivalent experience).
  • 12+ years of work or research experience in software development.
  • 5+ years of experience as a technical lead in distributed machine learning and/or deep learning.
  • 3+ years of open source development experience.
  • 3+ years of hands-on experience with Spark MLlib, XGBoost, and/or PyTorch.
  • Knowledge of internals of Apache Spark MLlib.
  • Experience with Kubernetes, YARN, Spark, and/or Ray for distributed ML orchestration.
  • Proven technical skills in designing, implementing, and delivering high-quality distributed systems.
  • Excellent programming skills in C++, Scala, and Python.
  • Familiarity with agile software development practices.

Nice-to-haves

  • Familiarity with NVIDIA libraries (RAPIDS cuML, Spark-RAPIDS, NVTabular).
  • Familiarity with NVIDIA GPUs and CUDA (a strong plus).
  • Familiarity with Horovod, Petastorm, and other current or past distributed learning libraries.
  • Experience working with multi-functional teams across organizational boundaries and geographies.

Benefits

  • Equity options as part of compensation package.
  • Comprehensive health insurance coverage.
  • Flexible work hours and remote work options.
  • Paid time off and holidays.
  • Professional development opportunities.