University of Chicago - Chicago, IL
posted 4 months ago
The University of Chicago Research Computing Center (RCC) is seeking a highly qualified HPC system engineer to join its systems and operations team, which builds and manages RCC HPC systems and facility operations. The individual in this position will be involved in the management and administration of RCC hardware and software. The role is critical to designing automated, scalable, and rapidly deployable solutions for infrastructure development and server configuration.

The engineer will work independently to install, configure, and maintain operating systems, applying best practices and systems knowledge to monitoring and alerting systems, utility software, and firewalls. The position requires guiding maintenance of production servers as well as Windows and Linux servers. Responsibilities include installing, configuring, and maintaining large computer clusters, servers, and software, and managing day-to-day operations of the systems, including systems administration, monitoring, and storage performance, up to and including network components. The engineer will also manage the system's network switch, parallel file system, and HPC software stack and tools.

The role involves diagnosing and resolving system operational problems quickly and effectively, coordinating with vendors to resolve hardware and software issues, and assisting users with access requests and help desk tickets. The engineer will also be responsible for building and deploying open source software and software from vendors and partners, providing reliable and efficient backups and restores for all managed systems, and maintaining and monitoring the security of the HPC systems and servers. Documenting system administration procedures for routine and complex tasks is a key responsibility, as is planning and installing necessary patches and upgrades for servers and their associated storage, network, communications, and peripheral subsystems.