Senior SRE Engineer

$156,000 - $187,200/Yr

Trillium Staffing - Santa Clara, CA

posted 3 months ago

Full-time - Senior
Santa Clara, CA
Administrative and Support Services

About the position

Trillium Professional is seeking a Senior Site Reliability Engineer (SRE) to join our client's Infrastructure, Planning and Processes organization in Santa Clara, CA. This role is pivotal in developing and maintaining a sophisticated internal cloud provisioning product specifically designed for GPUs and Tegra systems. As a Senior SRE Engineer, you will be part of a dynamic team that collaborates with various business units within the company, including Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Driverless Cars, to address their infrastructure and systems needs.

In this fast-paced environment, you will be responsible for ensuring the availability and reliability of systems deployed in the company's internal cloud. Your role will involve monitoring system performance, troubleshooting issues related to CPU, memory, disk, and network utilization, and providing high-quality user support. You will also be tasked with monitoring key performance indicators (KPIs) to ensure that the team's service level agreements (SLAs) are met.

Additionally, you will manage and maintain production Kubernetes clusters, drive automation of monitoring processes to gain insights into application and system health, and develop tools necessary for automating workflows. Your expertise will be crucial in improving and maintaining the infrastructure codebase, implementing critical metrics using various analytics methods and dashboards, and participating in the prototyping and development of cloud infrastructure. You will also leverage AI techniques to extract valuable signals from the data generated by machines and jobs.

Responsibilities

  • Working on systems deployed in the company's internal cloud to ensure availability and reliability for end users.
  • Monitoring system performance and troubleshooting issues related to CPU, memory, disk, and network utilization.
  • Providing high-quality user support.
  • Monitoring KPIs and ensuring that the team's SLAs are met.
  • Managing and maintaining production Kubernetes clusters.
  • Driving automation of monitoring to gain insights into applications and system health.
  • Crafting and developing tools needed for automating workflows.
  • Developing, improving, and maintaining the infrastructure codebase.
  • Crafting and implementing critical metrics using various analytics methods and dashboards.
  • Participating in prototyping, crafting, and developing cloud infrastructure.
  • Applying AI techniques to extract useful signals about machines and jobs from generated data.

Requirements

  • Experience maintaining cloud infrastructure and highly available production environments.
  • Experience managing systems installed in data centers, and proficiency with BMC (Redfish), KVM, and IPMI tools.
  • Working knowledge of OpenStack.
  • Background in SQL databases such as MySQL and time-series databases such as Prometheus.
  • Strong knowledge of networking principles and protocols, including TCP/IP, DNS, DHCP, and VLANs.
  • Experience with data analytics/visualization tools like Kibana, Grafana, Splunk, etc.
  • Strong Ansible skills, with experience in Ansible AWX.
  • Strong background with Jenkins and/or other CI/CD systems.
  • Proficient with Kubernetes, Docker, and virtualization.
  • Proficient using source code management and binary repository systems like GitLab, GitHub, Artifactory, Perforce, etc.
  • Knowledge of monitoring systems such as Zabbix, Prometheus, PagerDuty, and/or similar systems.
  • Advanced knowledge of standard methodologies related to security.
  • 5+ years of proven experience in a relevant field.
  • Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience.

Nice-to-haves

  • Previous experience with SRE teams managing on-prem infrastructure.
  • Experience managing company hardware like GPUs and Tegras.
  • Ability to thrive in a multi-tasking environment with constantly evolving priorities.
  • Prior experience with large-scale operations teams.
  • Experience with Windows server infrastructure.
  • Outstanding interpersonal skills and communication with all levels of management.
  • Experience with using and improving data centers.
  • Ability to break down complex problems into simpler sub-problems and implement solutions using available resources.
  • Ability to design simple systems that can operate efficiently with minimal support.