University of Chicago • Posted 3 months ago
$85,750 - $109,500/Yr
Full-time • Mid Level
Chicago, IL

About the position

The University of Chicago Research Computing Center (RCC), a unit in the Office of Research, provides high-end research computing resources to researchers at the University of Chicago. It is dedicated to enabling research by providing access to centrally managed High-Performance Computing (HPC), storage, and visualization resources. These resources include hardware, software, high-level scientific and technical user support, and the education and training required to help researchers make full use of modern HPC technology and local and national supercomputing resources.

RCC is seeking a skilled HPC System Administrator to join its Systems and Operations Team. This position will support the deployment, maintenance, and automation of RCC's HPC systems, including CPU/GPU clusters, storage, and networking infrastructure. The HPC System Administrator will assist in system-level administration, troubleshooting, performance tuning, and automation while collaborating with faculty and researchers to enable cutting-edge computational science.

The role participates in the design of automated, scalable, and rapidly deployable solutions for systems infrastructure and server configuration. It installs, configures, and maintains operating systems, monitoring and alerting systems, utility software, and firewalls, and it plans and executes hands-on maintenance for production servers as well as Windows and Linux servers.

This is a hybrid position requiring 3 days onsite.

Responsibilities

  • Administer, install, monitor, and maintain HPC systems, including compute nodes, storage, networking, and software stacks.
  • Develop and maintain automation tools for system provisioning, configuration management, and monitoring.
  • Assist in the implementation and management of distributed file systems (e.g., Lustre, BeeGFS, GPFS).
  • Install, configure, and optimize job scheduling and resource management tools (e.g., Slurm, LSF, PBS).
  • Assist in system security, patch management, and troubleshooting operational issues.
  • Contribute to performance benchmarking, system tuning, and capacity planning.
  • Deploy and maintain commonly used HPC applications and software stacks.
  • Document system administration procedures and contribute to knowledge-sharing initiatives.
  • Support researchers by providing technical expertise and resolving escalated support tickets.
  • Participate in vendor coordination, system procurement, and hardware/software lifecycle management.
  • Install, configure, and maintain operating systems on workstations and servers.
  • Perform software installations and upgrades to operating systems and layered software packages.
  • Monitor and tune systems to achieve optimum performance levels, acquiring higher-level skills in the process.
  • Maintain comprehensive supporting documentation for operating system, hardware, and software configuration.
  • Monitor primary responses to IT-related security incidents and violations.
  • Keep current with new security and network monitoring technologies and with applicable laws and regulations.
  • Perform other related work as needed.

Requirements

  • A college or university degree in a related field.
  • Knowledge and skills developed through 2-5 years of work experience in a related job discipline.

Nice-to-haves

  • Experience administering Linux-based HPC clusters, including job schedulers (e.g., Slurm, LSF, PBS).
  • Familiarity with high-speed networking (e.g., InfiniBand, Ethernet).
  • Scripting/programming skills (Python, Bash, or Perl).
  • Experience configuring, installing and troubleshooting MPI and OpenMP applications.
  • Experience configuring, installing, tuning and maintaining scientific applications on large-scale systems.
  • Experience with system automation tools (e.g., Ansible, Puppet).
  • Experience with system provisioning tools (e.g., xCAT, Confluent, Warewulf).
  • Knowledge of distributed storage systems (e.g., Lustre, BeeGFS, GPFS).
  • Experience with containerization (Docker, Singularity, Apptainer).
  • Experience configuring, installing, maintaining, and/or using infrastructure and performance monitoring and optimization tools (e.g., CheckMK, Grafana, Prometheus, Icinga).
  • Experience in setting up and executing benchmarks in an HPC environment and analyzing their results systematically.

Benefits

  • The University of Chicago offers a wide range of benefits programs and resources for eligible employees, including health, retirement, and paid time off.

Job Keywords

Hard Skills
  • Ansible
  • Bash
  • Docker
  • Icinga
  • Slurm