University of Chicago - Chicago, IL

posted 13 days ago

Full-time
Hybrid - Chicago, IL
Educational Services

About the position

The HPC Systems & Operations Manager at the University of Chicago will oversee the systems and operations team responsible for the design, configuration, deployment, and maintenance of High Performance Computing (HPC) systems within the Research Computing Center (RCC). This role involves hands-on management of production servers, ensuring the stability and integrity of HPC systems, and leading a team of professionals in delivering reliable services to support research activities.

Responsibilities

  • Lead the design, configuration, deployment, and management of RCC HPC systems.
  • Ensure the stability, integrity, and efficient operation of RCC HPC systems that support core organizational functions.
  • Monitor, maintain, and optimize HPC systems and software to improve performance and resource utilization.
  • Manage a growing team of HPC system administrators and systems programmers to ensure reliable service delivery.
  • Oversee the project management of the team's initiatives, ensuring that all projects receive the necessary management oversight and resources for successful completion.
  • Serve as the primary point of contact for other university units regarding systems and operations-related matters.
  • Diagnose and resolve system operational problems promptly and effectively, coordinating with vendors to address hardware and software issues.
  • Foster automation within HPC systems.
  • Troubleshoot and identify failed hardware, replace parts, and resolve system failures.
  • Develop and implement strategies for HPC data management, backup, disaster recovery, and security.
  • Create standard operating procedures for routine and complex system tasks.
  • Maintain and monitor the security of HPC systems and servers, implementing robust security measures as necessary.
  • Provide technical leadership, guidance, and support to the HPC systems and operations team.
  • Manage the team's progress by maintaining accurate, up-to-date logs and ensuring that all projects have the management oversight and approvals needed for successful completion.
  • Ensure the implementation of approved best practices and information technology policies that result in the highest quality systems administration.
  • Perform other related work as needed.

Requirements

  • A college or university degree in a related field (minimum).
  • Knowledge and skills developed through 7+ years of work experience in a related job discipline (minimum).

Nice-to-haves

  • Advanced degree strongly preferred.
  • A minimum of seven years of Linux system administration experience in a large, distributed computing environment.
  • At least three years' experience supporting Linux HPC clusters used for scientific research strongly preferred.
  • Experience with Linux system administration (e.g., RHEL, Rocky, CentOS).
  • Proficiency in the installation, maintenance, operation, tuning and troubleshooting of Linux and related systems and software.
  • Experience in installing, configuring, and maintaining a job scheduler/workload manager (such as SLURM, TORQUE, or PBS).
  • Experience in configuring, installing and troubleshooting MPI and OpenMP.
  • Experience with at least one HPC cluster management tool (e.g., xCAT, Confluent, Warewulf, or Bright).