University of California - Los Angeles, CA

posted 4 months ago

Full-time - Mid Level
Los Angeles, CA
Educational Services

About the position

The HPC System Administrator position at UCLA's Office of Advanced Research Computing (OARC) is a critical role that supports the university's mission of education, research, and service through innovative technology practices. The OARC High Performance Computing (HPC) Systems Research Technology Group (RTG) is responsible for supporting thousands of UCLA researchers and over 300 research groups. This is achieved through consultation and the operation of the Hoffman2 High Performance Research Cluster, which consists of approximately 1000 compute nodes, GPU nodes, high-speed networking, high-performance storage, and extensive hardware and software support infrastructure across multiple data centers.

As an HPC System Administrator, you will serve as a technical expert in the areas of systems and application software development, HPC cluster system administration, and management of the backup system environment. The role requires a strong understanding of HPC cluster architectures and concepts, as well as the ability to stay current with industry best practices. You will be expected to work independently or as part of a development team, effectively estimating time and effort required to complete tasks, and analyzing, benchmarking, debugging, and testing software in a technically sound manner.

The position requires the ability to communicate effectively with diverse stakeholders, including researchers, peers, and management, and to write well-organized, complete, and technically correct documents and procedures. The HPC System Administrator will also need to demonstrate problem-solving skills, the ability to prioritize tasks, and the capability to manage projects effectively. Flexibility in work schedules may be considered based on operational needs, and the position requires working from UCLA's Westwood campus as operational demands dictate.

Responsibilities

  • Support the operation and administration of the Hoffman2 High Performance Research Cluster.
  • Develop and maintain systems and application software for the HPC environment.
  • Manage the backup system environment and ensure data integrity.
  • Analyze, benchmark, debug, and test software in a technically sound manner.
  • Create high-quality system tools and software to enhance HPC capabilities.
  • Establish and maintain cooperative working relationships with staff, students, and vendors.
  • Communicate technical information effectively to diverse audiences, including researchers and management.
  • Write well-organized and grammatically correct documentation and procedures for technical and non-technical personnel.
  • Stay current with industry best practices in HPC cluster architectures and concepts.

Requirements

  • 3 years of experience with software and applications development, Linux system administration, and two or more modern programming languages (e.g., Python, C++, Java).
  • Expert knowledge of Python, SQL, bash, git, and associated build systems, libraries, and development tools.
  • Demonstrated knowledge of common programming paradigms (e.g., asynchronous, concurrent, and object-oriented).
  • Ability to analyze, benchmark, debug, and test software in a technically sound manner.
  • Detailed knowledge of Red Hat Enterprise Linux and related distributions.
  • Solid system administration skills including scripting, pipelines, and UNIX operating system fundamentals.
  • Working knowledge of protocols, applications, and formats including TCP/IP, HTTP, DHCP, SSH, NFS, JSON, XML, and HTML.
  • Demonstrated ability to troubleshoot and debug computing problems across multiple complex operating systems and software components.
  • Knowledge of validation, verification, and disaster recovery capabilities for both hardware and software.
  • Demonstrated skill in writing well-organized, complete, and technically correct documents and procedures.

Nice-to-haves

  • Master's degree in computer science, software engineering, or a related field.

Benefits

  • Comprehensive health insurance coverage starting on day one.
  • Flexible work schedules considered based on operational needs.
  • Access to professional development opportunities.
  • Support for ongoing education and skill development.