Trillium Staffing - Santa Clara, CA
posted 3 months ago
Trillium Professional is seeking a Senior Site Reliability Engineer (SRE) to join our client's Infrastructure, Planning and Processes organization in Santa Clara, CA. This role is pivotal in developing and maintaining an internal cloud provisioning product designed specifically for GPUs and Tegra systems. As a Senior SRE, you will be part of a team that collaborates with business units across the company, including Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Driverless Cars, to address their infrastructure and systems needs.

In this fast-paced environment, you will be responsible for the availability and reliability of systems deployed in the company's internal cloud. You will monitor system performance, troubleshoot issues related to CPU, memory, disk, and network utilization, and provide high-quality user support. You will also track key performance indicators (KPIs) to ensure that the team's service level agreements (SLAs) are met.

Additionally, you will manage and maintain production Kubernetes clusters, drive automation of monitoring to gain insight into application and system health, and develop the tools needed to automate workflows. You will improve and maintain the infrastructure codebase, implement critical metrics using a range of analytics methods and dashboards, and participate in the prototyping and development of cloud infrastructure. You will also apply AI techniques to extract useful signals from the data generated by machines and jobs.