Advanced Software Talent - South San Francisco, CA

posted 2 months ago

Full-time
South San Francisco, CA
Professional, Scientific, and Technical Services

About the position

As the Delivery Lead, you will be the driving force behind building and maintaining robust site reliability engineering (SRE) functions for our growing data organization. This role is pivotal in ensuring smooth operations and high availability of our data infrastructure and services. You will act as the primary point of contact for all SRE-related activities, which requires a blend of technical expertise, project management skills, and a passion for data-driven solutions. Your leadership will be essential in developing comprehensive support and SRE processes from the ground up, ensuring that our systems are not only functional but also optimized for performance and reliability.

In this position, you will collaborate closely with data engineers, scientists, and IT teams to ensure seamless integration and optimal performance of data systems. You will own the incident management process, which includes responding to incidents, troubleshooting issues, and resolving them efficiently to minimize downtime and impact on data operations. Establishing proactive monitoring and alerting mechanisms will be a key responsibility, allowing you to identify and address potential issues before they escalate into significant problems.

Your role will also involve continuous performance tuning of our systems to meet the evolving needs of the data organization. You will drive automation initiatives aimed at streamlining support workflows and improving overall efficiency. Additionally, maintaining thorough documentation of processes, procedures, and system configurations will be crucial to ensure clarity and consistency across the team.

Responsibilities

  • Develop and implement comprehensive support and SRE processes from the ground up.
  • Partner with data engineers, scientists, and IT teams to ensure seamless integration and optimal performance of data systems.
  • Own incident response, troubleshooting, and resolution, minimizing downtime and impact on data operations.
  • Establish proactive monitoring and alerting mechanisms to identify and address potential issues before they escalate.
  • Continuously optimize system performance, scalability, and reliability to meet the evolving needs of the data organization.
  • Drive automation initiatives to streamline support workflows and improve efficiency.
  • Maintain thorough documentation of processes, procedures, and system configurations.

Requirements

  • Proven experience in building and managing support and SRE functions in a data-centric environment.
  • Strong experience with storage technologies such as NFS, HDFS, and Amazon S3, as well as dynamic resource management frameworks like Kubernetes.
  • Strong understanding of data infrastructure, cloud technologies, and DevOps principles.
  • Excellent communication and interpersonal skills to collaborate effectively with diverse teams.
  • Demonstrated ability to lead and execute complex projects with minimal supervision.
  • Proactive problem-solver with a passion for continuous improvement.
  • Experience in the biotech industry is a plus.