University of Chicago - Chicago, IL

posted 6 days ago

Full-time - Senior
Hybrid - Chicago, IL
Educational Services

About the position

The Principal HPC System Administrator at the University of Chicago's Research Computing Center (RCC) designs, configures, deploys, and maintains high-performance computing (HPC) systems. The role calls for hands-on technical expertise across complex assignments, including the procurement and management of HPC hardware and software, with a focus on optimal performance and resource utilization. The position is hybrid, with three days onsite.

Responsibilities

  • Design, configure, deploy, and maintain large computer clusters, servers, and software.
  • Lead day-to-day operations, including systems administration, monitoring, and storage performance.
  • Manage the system's network switch, parallel file system, and HPC software stack and tools.
  • Monitor, maintain, and optimize HPC systems and software to improve performance and resource utilization.
  • Serve as the technical lead on complex projects and system-related tasks.
  • Configure, install, and maintain the job scheduler/workload manager.
  • Diagnose and resolve system operational problems promptly and effectively.
  • Use scripting/programming to enable system-level automation, monitoring, and problem detection.
  • Build and deploy open-source software as well as software from vendors/partners.
  • Develop and implement strategies for HPC data management, backup, disaster recovery, and security.
  • Create standard operating procedures for routine and complex system tasks.
  • Maintain and monitor the security of HPC systems and servers.
  • Troubleshoot and identify failed hardware, implement parts replacement, and resolve system failures.
  • Stay updated with the latest developments in HPC technologies and apply this knowledge to improve RCC systems.
  • Provide expertise in planning and installing necessary patches and upgrades for servers.

Requirements

  • A college or university degree in a related field.
  • 7+ years of work experience in a related job discipline.
  • Minimum of seven years of full-time Linux system administration experience in a large distributed computing environment.

Nice-to-haves

  • Bachelor's degree in Computer Science or closely related field.
  • Experience with Linux system administration (e.g., RHEL, Rocky, CentOS).
  • Proficiency in the installation, maintenance, operation, tuning, and troubleshooting of HPC systems.

Benefits

  • Health insurance
  • Retirement savings plan
  • Paid holidays
  • Flexible scheduling
  • Professional development opportunities