Nvidia - Santa Clara, CA

Full-time - Senior
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

The Principal Engineer for Performance Analysis in AI Applications and Services will focus on optimizing distributed cloud-native accelerated video analytics applications. This role involves working closely with application teams to profile, identify bottlenecks, and enhance performance across CPU, GPU, and network accelerators in a Kubernetes environment. The engineer will drive performance initiatives, develop strategies for optimization, and standardize performance measurement processes to ensure efficient resource utilization and application performance.

Responsibilities

  • Plan, enable, and drive performance initiatives across Cloud Native application teams.
  • Review, develop, deploy, and manage tools and strategies to systematically run performance experiments.
  • Collect and organize performance data with key partners.
  • Work closely with application teams to understand application resource utilization characteristics and identify performance issues through profiling.
  • Learn and understand various accelerators in the system for application workloads and recommend end-to-end performance optimizations.
  • Advise developers and product teams on the best accelerators and systems for end-to-end system performance.
  • Improve and standardize performance measurement processes across applications and GPU systems.
  • Collaborate with GPU cloud-native teams at Nvidia to deploy optimal GPU resource sharing strategies in a Kubernetes environment.

Requirements

  • Master's degree or PhD in Computer Science or a related field, or equivalent experience.
  • 15+ years of experience in system design optimization, complexity analysis, software design on Unix/Linux systems, and diagnosing performance and application issues.
  • Experience in real-time streaming AI inference systems.
  • A history of working on distributed accelerated systems and solving sophisticated performance problems.
  • Deep hands-on experience with distributed systems based on Kubernetes.
  • Experience with on-prem and cloud systems and ability to work with partners across multiple teams.
  • Experience using, managing, and optimizing modern cloud and container-based enterprise computing architectures.
  • Strong verbal and written communication and teamwork skills.
  • Ability to multitask effectively in a fast-paced environment; action-oriented with strong analytical skills.

Nice-to-haves

  • Background with real-time computer vision AI inference and/or analytics platforms.
  • Experience in application issues, algorithms, and data structures.
  • Understanding of how AI services and deep learning systems function.
  • Exposure to scheduling and resource management systems.
  • Knowledge of GPU programming such as OpenCL or CUDA and knowledge of multi-node GPU setups, GPU clusters, or cloud computing.

Benefits

  • Equity options
  • Comprehensive health benefits
  • Flexible work hours
  • Diversity and inclusion programs
  • Professional development opportunities