Senior SRE Engineer

$156,000 - $187,200/Yr

Trillium Staffing - Santa Clara, CA

posted 4 months ago

Full-time - Mid Level
Santa Clara, CA
Administrative and Support Services

About the position

Trillium Professional is seeking a Senior Site Reliability Engineer (SRE) to join its Infrastructure, Planning and Processes organization in Santa Clara, California. This role is crucial for developing and maintaining sophisticated internal cloud provisioning products specifically designed for GPUs and Tegra systems. The successful candidate will be part of a dynamic team that collaborates with various business units, including Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Driverless Cars, to address their infrastructure and systems needs.

As a Senior SRE Engineer, you will be responsible for ensuring the reliability and availability of systems deployed in the company's internal cloud, providing high-quality user support, and monitoring system performance to troubleshoot issues related to CPU, memory, disk, and network utilization. In this fast-paced environment, you will manage and maintain production Kubernetes clusters, drive automation of monitoring processes, and develop tools to automate workflows. You will also be tasked with crafting and implementing critical metrics using various analytics methods and dashboards, as well as participating in the prototyping and development of cloud infrastructure.

A keen attention to detail, strong problem-solving abilities, and a solid knowledge base are essential for success in this role. The position requires a proactive approach to managing infrastructure and systems, ensuring that the team's service level agreements (SLAs) are met, and continuously improving the infrastructure codebase.

Responsibilities

  • Working on systems deployed in the company's internal cloud to ensure availability and reliability for end users.
  • Monitoring system performance and troubleshooting issues related to CPU, memory, disk, and network utilization.
  • Providing high-quality user support.
  • Monitoring KPIs and ensuring that the team's SLAs are met.
  • Managing and maintaining production Kubernetes clusters.
  • Driving automation of monitoring to gain more insight into applications and system health.
  • Designing and developing tools needed for automating workflows.
  • Developing, improving, and maintaining the infrastructure codebase.
  • Defining and implementing critical metrics using various analytics methods and dashboards.
  • Participating in prototyping, designing, and developing cloud infrastructure.
  • Applying AI techniques to extract useful signals about machines and jobs from the data generated.

Requirements

  • Experience maintaining cloud infrastructure and highly available production environments.
  • Experience managing systems installed in data centers; proficient with BMC (Redfish), KVM, and IPMI tools.
  • Working knowledge of OpenStack.
  • Background in SQL databases such as MySQL and time-series databases such as Prometheus.
  • Strong knowledge of networking principles and protocols, including TCP/IP, DNS, DHCP, and VLANs.
  • Experience with data analytics/visualization tools like Kibana, Grafana, Splunk, etc.
  • Strong Ansible skills; experience with Ansible AWX.
  • Strong background with Jenkins and/or other CI/CD systems.
  • Proficient with Kubernetes, Docker, and virtualization.
  • Proficient using source code management and binary repository systems like GitLab, GitHub, Artifactory, Perforce, etc.
  • Knowledge of monitoring systems such as Zabbix, Prometheus, PagerDuty, and/or similar systems.
  • Advanced knowledge of security best practices.
  • 5+ years of proven experience in a relevant field.
  • Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience.

Nice-to-haves

  • Previous experience with SRE teams managing on-prem infrastructure.
  • Experience managing company hardware like GPUs and Tegras.
  • Thrives in a multi-tasking environment with constantly evolving priorities.
  • Prior experience with large scale operations teams.
  • Experience with Windows server infrastructure.
  • Outstanding interpersonal skills and communication with all levels of management.
  • Experience using and improving data centers.
  • Ability to break complex problems down into simple sub-problems and reuse available solutions for most of them.
  • Ability to design simple systems that can work efficiently without needing much support.

Benefits

  • Health insurance participation
  • Retirement plans participation
  • Paid holidays
  • State required leave
  • Vacation days