Hewlett Packard Enterprise - Houston, TX

posted about 2 months ago

Full-time - Senior
Onsite - Houston, TX
Computer and Electronic Product Manufacturing

About the position

Hewlett Packard Enterprise (HPE) is seeking a Senior AI and Machine Learning Engineer to join our High Performance Computing, AI and Labs team. This role is primarily remote, allowing you to work from home while contributing to innovative solutions that accelerate our customers' digital transformation. As a global edge-to-cloud company, HPE is dedicated to helping organizations connect, protect, analyze, and act on their data and applications, enabling them to derive insights and outcomes swiftly in today's complex environment. Our culture is built on collaboration, diversity, and the pursuit of excellence, making it an ideal place for professionals looking to grow their careers.

In this position, you will focus on enhancing the performance of Large Language Models on HPE GPU servers and conduct system-level analyses of HPC and AI workloads across various HPE platforms. You will run machine learning and deep learning code on accelerated hardware, including NVIDIA and AMD GPUs, and on high-speed networks like InfiniBand. Your responsibilities will also include developing software and scripts to automate AI workloads, installing and configuring complex IT infrastructure components, and capturing and documenting performance data to understand workload behavior. You will communicate your findings effectively to both technical and non-technical colleagues, mentor junior staff, and collaborate with software and hardware partners to optimize systems and resolve performance issues.

This role requires a strong educational background, typically a Master's degree or PhD in Computer Science, Engineering, Information Technology, or a related field, along with typically three years of relevant experience in machine learning and artificial intelligence. You will need proficiency in AI and machine learning frameworks such as TensorFlow, PyTorch, and ONNX, as well as experience with high-performance computing servers and networking. Strong analytical skills and the ability to work independently in a semi-remote setting are essential for success in this role.

Responsibilities

  • Studies and improves performance of Large Language Models running on HPE GPU servers
  • Performs system level analysis of HPC & AI workloads on various HPE platforms
  • Runs ML/DL code on accelerated hardware like NVIDIA and AMD GPUs and high-speed networks like InfiniBand
  • Develops software and scripts to automate AI workloads and analyze performance data
  • Installs and configures complex IT infrastructure components (servers, storage, network)
  • Writes white papers and other guidance documents for AI workload and model selection
  • Captures and reviews system performance data, logs, traces to understand workload behavior
  • Communicates technical work clearly and presents it to non-technical colleagues
  • Works with software and hardware partners in optimizing systems and resolving performance issues
  • Documents and reports issues when testing and evaluating systems
  • Communicates project status and concerns to management in a timely manner
  • Mentors less-experienced staff members

Requirements

  • Master's degree or PhD in Computer Science, Engineering, Information Technology or Systems, or relevant field
  • Typically 3 years of experience in Machine Learning/Artificial Intelligence
  • Proficiency in one or more AI & Machine Learning frameworks or libraries (TensorFlow, PyTorch, ONNX, DeepSpeed, Horovod, TensorRT, NeMo)
  • Experience with containers and distributed deep learning and neural networks, including transformers used in generative AI projects
  • Experience with High Performance Computer Servers, High Performance Networking, and associated software
  • Experience with Weka I/O, NTFS and Lustre File Systems
  • Programming experience in Python or C/C++ is strongly desired
  • Strong analytical and critical thinking skills
  • Must be a self-starter, able to work with minimum supervision in a semi-remote setting

Nice-to-haves

  • Artificial Intelligence Technologies and performance benchmarking
  • Cross Domain Knowledge
  • Data Engineering
  • Data Science
  • Design Thinking
  • Development Fundamentals
  • Full Stack Development
  • IT Performance
  • Machine Learning Operations
  • Scalability Testing
  • Security-First Mindset

Benefits

  • Comprehensive suite of benefits supporting physical, financial, and emotional wellbeing
  • Programs for personal and professional development
  • Flexibility to manage work and personal needs
  • Inclusive work environment celebrating individual uniqueness