University of Chicago - Chicago, IL

posted 14 days ago

Full-time - Senior
Hybrid - Chicago, IL
Educational Services

About the position

The University of Chicago is seeking a Principal HPC System Administrator to join the Research Computing Center (RCC). This role involves designing, configuring, deploying, and maintaining High Performance Computing (HPC) systems, as well as managing facility operations. The position requires specialized knowledge and expertise in infrastructure development and server configuration, with a focus on optimizing performance and resource utilization. The position is based in a hybrid work environment.

Responsibilities

  • Design, configure, deploy, and maintain large computer clusters, servers, and software.
  • Lead day-to-day operations, including systems administration, monitoring, and storage performance.
  • Manage the system's network switch, parallel file system, and HPC software stack and tools.
  • Monitor, maintain, and optimize HPC systems and software to improve performance and resource utilization.
  • Serve as the technical lead on complex projects and system-related tasks.
  • Configure, install, and maintain the job scheduler/workload manager.
  • Diagnose and resolve system operational problems promptly and effectively.
  • Coordinate with vendors to address hardware and software issues.
  • Use scripting/programming to enable system-level automation, monitoring, and problem detection.
  • Build and deploy open-source software as well as software from vendors/partners.
  • Develop and implement strategies for HPC data management, backup, disaster recovery, and security.
  • Create standard operating procedures for routine and complex system tasks.
  • Maintain and monitor the security of HPC systems and servers, implementing robust security measures.
  • Troubleshoot and identify failed hardware, implement parts replacement, and resolve system failures.
  • Stay updated with the latest developments in HPC technologies and apply this knowledge to improve RCC systems.
  • Solve complex problems to configure, install, upgrade, and maintain server applications and hardware.
  • Implement operating system enhancements to improve the reliability and performance of the system.
  • Provide expertise in planning and installing necessary patches and upgrades for servers and their associated storage, network, communications, and peripheral sub-systems.
  • Install and maintain an appropriate level of intrusion detection, monitoring, and auditing software as required.

Requirements

  • A college or university degree in a related field.
  • 7+ years of work experience in a related job discipline.
  • Bachelor's degree in Computer Science or closely related field (preferred).
  • A minimum of seven years of full-time Linux system administration experience in a large distributed computing environment (preferred).
  • Experience with Linux system administration (e.g., RHEL, Rocky, CentOS).
  • Proficiency in the installation, maintenance, operation, tuning, and troubleshooting of Linux and related systems and software.
  • Experience in installing, configuring, and maintaining a job scheduler/workload manager (such as Slurm, TORQUE, or PBS).
  • Experience configuring, installing, and troubleshooting MPI and OpenMP.
  • Experience with at least one HPC cluster management tool (e.g., xCAT, Confluent, Warewulf, or Bright).
  • Experience in configuring, administering, and supporting network storage subsystems.
  • Hands-on experience with at least one parallel file system (e.g., IBM Spectrum Scale (GPFS), Lustre, BeeGFS, or Ceph).
  • Direct experience working with InfiniBand, including a working knowledge of InfiniBand concepts, OFED layers, and subnet managers, as well as Gigabit Ethernet.
  • Experience with networking and security.
  • Experience with systems automation tools such as Ansible or Puppet.
  • Experience with versioning tools such as Git or Subversion.
  • Experience configuring, installing, maintaining, and using monitoring and optimization tools.
  • Strong knowledge of scripting languages such as Python or Bash.

Nice-to-haves

  • Ability to work well with faculty and researchers.
  • Ability to identify and gain expertise in appropriate new technologies and/or software tools.
  • Ability to function as part of an interactive team while demonstrating self-initiative to achieve project goals and the Research Computing Center's mission.
  • Strong analytical skills and problem-solving ability.

Benefits

  • Health insurance coverage
  • Paid holidays
  • Flexible scheduling
  • Professional development opportunities