University of Chicago - Chicago, IL

posted 10 days ago

Full-time - Principal
Hybrid - Chicago, IL
Educational Services

About the position

The Principal HPC System Administrator at the University of Chicago's Research Computing Center (RCC) is responsible for designing, configuring, deploying, and maintaining high-performance computing (HPC) systems. This role requires specialized knowledge and expertise to develop automated and scalable solutions for infrastructure and server configuration. The position involves hands-on management of HPC resources, ensuring optimal performance and security, and providing technical leadership on complex projects.

Responsibilities

  • Design, configure, deploy, and maintain large computer clusters, servers, and software.
  • Lead day-to-day operations, including systems administration, monitoring, and storage performance.
  • Manage the system's network switch, parallel file system, and HPC software stack and tools.
  • Monitor, maintain, and optimize HPC systems and software to improve performance and resource utilization.
  • Serve as the technical lead on complex projects and system-related tasks.
  • Configure, install, and maintain the job scheduler/workload manager.
  • Diagnose and resolve system operational problems promptly and effectively.
  • Use scripting/programming to enable system-level automation, monitoring, and problem detection.
  • Build and deploy open-source software as well as software from vendors/partners.
  • Develop and implement strategies for HPC data management, backup, disaster recovery, and security.
  • Create standard operating procedures for routine and complex system tasks.
  • Maintain and monitor the security of HPC systems and servers.
  • Troubleshoot and identify failed hardware, implement parts replacement, and resolve system failures.
  • Stay updated with the latest developments in HPC technologies and apply this knowledge to improve RCC systems.
  • Provide expertise in planning and installing necessary patches and upgrades for servers.

Requirements

  • A college or university degree in a related field, at minimum.
  • 7+ years of work experience in a related job discipline.
  • Experience with Linux system administration (e.g., RHEL, Rocky, CentOS).
  • Proficiency in the installation, maintenance, operation, tuning, and troubleshooting of Linux and related systems and software.
  • Experience in installing, configuring, and maintaining a job scheduler/workload manager (such as Slurm, TORQUE, or PBS).
  • Experience configuring, installing, and troubleshooting MPI and OpenMP.
  • Experience with at least one HPC cluster management tool (e.g., xCAT, Confluent, Warewulf, or Bright).
  • Experience in configuring, administering, and supporting network storage subsystems.
  • Hands-on experience with at least one parallel file system (e.g., Spectrum Scale (GPFS), Lustre, BeeGFS, or Ceph).
  • Direct experience working with InfiniBand and Gigabit Ethernet.
  • Experience with networking and security.
  • Experience with systems automation tools such as Ansible or Puppet.
  • Experience with versioning tools such as Git or Subversion.
  • Strong knowledge of scripting languages such as Python or Bash.

Nice-to-haves

  • Bachelor's degree in Computer Science or closely related field.
  • Ability to work well with faculty and researchers.
  • Ability to identify and gain expertise in appropriate new technologies and/or software tools.
  • Ability to function as part of an interactive team while demonstrating self-initiative.
  • Strong analytical skills and problem-solving ability.

Benefits

  • Health insurance
  • Paid holidays
  • Professional development opportunities
  • Flexible scheduling
  • Tuition reimbursement