University of Chicago - Chicago, IL

posted 4 months ago

Full-time
Chicago, IL
Educational Services

About the position

The University of Chicago Research Computing Center (RCC) is seeking a highly qualified HPC system engineer to join its system and operation team, which builds and manages RCC HPC systems and facility operations. The individual in this position will be involved in the management and administration of RCC hardware and software. This role is critical in designing automated, scalable, and rapidly deployable solutions for infrastructure development and server configuration.

The engineer will work independently to install, configure, and maintain operating systems, monitoring and alerting systems, utility software, and firewalls, applying best practices and systems knowledge. The position requires guiding maintenance for production servers as well as Windows and Linux servers. The responsibilities include installing, configuring, and maintaining large computer clusters/servers and software, and managing day-to-day operations of the systems, including systems administration, monitoring, and storage performance, up to and including network components. The engineer will also manage the system's network switch, parallel file system, and HPC software stack and tools.

This role involves diagnosing and resolving system operational problems quickly and effectively, coordinating with vendors to resolve hardware and software issues, and assisting users with access and help desk ticket requests or issues. Additionally, the engineer will be responsible for building and deploying open source software and software from vendors/partners, providing reliable and efficient backups/restores for all managed systems, and maintaining and monitoring the security of the HPC systems and servers. Documenting system administration procedures for routine and complex tasks is also a key responsibility, along with planning and installing necessary patches and upgrades for servers and their associated storage, network, communications, and peripheral sub-systems.

Responsibilities

  • Installing, configuring, and maintaining large computer clusters/servers and software.
  • Managing day-to-day operations of the systems including systems administration, monitoring, and storage performance up to and including network components.
  • Managing the system's network switch, parallel file system, and HPC software stack and tools.
  • Configuring the scheduling and queuing system.
  • Diagnosing and resolving system operational problems quickly and effectively.
  • Coordinating with vendors to resolve hardware and software problems.
  • Assisting users with access and other help desk ticket requests or issues.
  • Building and deploying open source software and software from vendors/partners.
  • Providing reliable and efficient backups/restores for all managed systems.
  • Maintaining and monitoring the security of the HPC systems and servers.
  • Documenting system administration procedures for routine and complex tasks.
  • Planning and installing necessary patches and upgrades for servers and their associated storage, network, communications, and peripheral sub-systems.
  • Installing and maintaining an appropriate level of intrusion detection, monitoring, and auditing software as required.
  • Tracking compliance and maintaining documentation for hardware, software, and service inventories for management reports.
  • Performing other related work as needed.

Requirements

  • A college or university degree in a related field.
  • Knowledge and skills developed through 5-7 years of work experience in a related job discipline.

Nice-to-haves

  • Bachelor's degree in Computer Science or closely related field.
  • A minimum of three years of Linux system administration experience in a large distributed computing environment.
  • At least two years of experience in HPC system administration or managing large HPC clusters.
  • Knowledge of Linux.
  • Experience scripting in one or more languages such as Python, Shell, or Perl.
  • Experience with Linux build automation tools such as Puppet, Ansible, Git, and Docker highly preferred.
  • Experience implementing automation and monitoring using shell scripting and other related tools strongly preferred.
  • Experience installing, configuring, and maintaining job management tools (such as Slurm, Moab, TORQUE, PBS) strongly preferred.
  • Experience with operating system deployment tools (e.g. xCAT, Rocks) strongly preferred.
  • Experience configuring, administering, and supporting network storage subsystems (e.g. IBM, NetApp, DataDirect Networks, LSI) strongly preferred.
  • Experience with one or more distributed file systems (GPFS, Lustre, Gluster, etc.) strongly preferred.
  • Experience configuring, installing, tuning and maintaining scientific application software strongly preferred.
  • Experience configuring, installing, maintaining and/or using performance monitoring and optimization tools strongly preferred.
  • Experience documenting implementations and system related tasks.