Hewlett Packard Enterprise

posted 2 months ago

Full-time - Senior
Remote
5,001-10,000 employees
Computer and Electronic Product Manufacturing

About the position

Hewlett Packard Enterprise (HPE) is seeking a Senior AI and Machine Learning Engineer to join our High Performance Computing, AI and Labs team. This role is designated as ‘Remote/Teleworker’, allowing you to primarily work from home. HPE is a global edge-to-cloud company that is committed to advancing the way people live and work. We focus on delivering innovative solutions that accelerate our customers' digital transformation, enabling them to tackle complex, data-intensive workloads. Our team combines deep expertise with the development of cutting-edge supercomputers, defining the next era of computing and delivering valuable insights and innovations.

As a Senior AI and Machine Learning Engineer, you will play a critical role in this mission, working on high-performance computing systems and AI workloads. In this position, you will be responsible for installing and configuring complex IT infrastructure components, including servers, storage, and networks. You will develop software scripts and configurations to automate deployment processes and study the performance of Large Language Models running on HPE GPU servers. Your role will involve performing system-level analysis of server workloads across various HPE platforms, including those running deep learning and machine learning code, utilizing accelerated hardware and high-speed networks like InfiniBand. You will also write white papers and guidance documents for AI workload and model selection, capturing and reviewing system performance data to understand workload behavior. Additionally, you will communicate technical work effectively to non-technical colleagues and provide guidance to less-experienced staff members.

This position requires a Master's degree or PhD in Computer Science, Engineering, Information Technology, or a relevant field, along with typically 3+ years of experience in the field. You will need to have a strong background in Machine Learning and Artificial Intelligence, experience with containers and distributed deep learning, and familiarity with High Performance Computer Servers and Networking. Programming experience in languages such as Python, C, C++, and Fortran is strongly desired. Strong analytical and critical thinking skills are essential, as well as the ability to work independently in a semi-remote setting.

HPE values diversity and inclusion, and we are committed to creating a workplace that reflects a variety of backgrounds and perspectives.

Responsibilities

  • Installs and configures complex IT infrastructure components (servers, storage, network)
  • Develops software scripts and configurations for automating deployment
  • Studies and improves the performance of Large Language Models run on HPE GPU servers
  • Performs system-level analysis of server workloads on various HPE platforms running DL and ML code, including accelerated hardware and high-speed networks like InfiniBand
  • Writes white papers and other guidance documents for AI workload and model selection
  • Captures and reviews system performance data, logs, and traces to understand workload behavior
  • Develops software and scripts that help analyze AI workload performance data
  • Communicates technical work clearly and provides summaries of work to non-technical colleagues
  • Works with software and hardware partners in optimizing systems and resolving performance issues
  • Documents and reports issues discovered when testing and evaluating the systems
  • Communicates project status and concerns to management in a timely manner
  • Provides guidance to less-experienced staff members

Requirements

  • Master's degree or PhD in Computer Science, Engineering, Information Technology or Systems, or relevant field
  • Typically 3+ years of experience
  • 3+ years of experience in Machine Learning/Artificial Intelligence
  • Experience working with containers, distributed deep learning, and neural networks, including transformers used in generative AI projects
  • Experience working with High Performance Computer Servers, High Performance Networking, and associated software
  • Experience working with Weka I/O, NFTS and Lustre File Systems
  • Programming experience in Python, C, C++, and Fortran is strongly desired
  • Strong analytical and critical thinking skills
  • Must be a self-starter, able to work with minimal supervision in a semi-remote setting

Nice-to-haves

  • Artificial Intelligence Technologies
  • Cross Domain Knowledge
  • Data Engineering
  • Data Science
  • Design Thinking
  • Development Fundamentals
  • Full Stack Development
  • IT Performance
  • Machine Learning Operations
  • Scalability Testing
  • Security-First Mindset

Benefits

  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Diversity, Inclusion & Belonging initiatives