NVIDIA - Santa Clara, CA
posted 2 months ago
NVIDIA is seeking a Principal Engineer to join our Distributed Machine Learning team, which focuses on GPU-accelerated Apache Spark. In this role, you will develop solutions that let data scientists apply machine learning (ML) and deep learning (DL) algorithms to large datasets when training AI models. The goal is to improve the performance and usability of existing solutions so that data scientists can build AI models more efficiently and cost-effectively.

You will design and develop user-friendly APIs and libraries that optimize the use of existing DL/ML frameworks in GPU-enabled Spark clusters for distributed training and inference at scale. This includes creating GPU-accelerated ML libraries for Spark clusters and improving our existing open source spark-rapids-ml library. You will demonstrate the performance of your solutions on industry-standard benchmarks and datasets, and make significant technical contributions to open source projects such as RAPIDS, XGBoost, spark-rapids-ml, and Apache Spark.

You will collaborate with NVIDIA partners and customers to help deploy distributed ML algorithms in both cloud and on-premises environments. You will also be expected to stay current with the latest advances in distributed ML systems and algorithms and to provide technical mentorship to a team of engineers.