Trillium Staffing - Santa Clara, CA
posted 4 months ago
Trillium Professional is seeking a Senior Site Reliability Engineer (SRE) to join its Infrastructure, Planning and Processes organization in Santa Clara, California. The role is central to developing and maintaining internal cloud provisioning products designed for GPUs and Tegra systems. The successful candidate will join a team that works with several business units, including Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Driverless Cars, to address their infrastructure and systems needs.
As a Senior SRE, you will be responsible for the reliability and availability of systems deployed in the company's internal cloud, for providing high-quality user support, and for monitoring system performance to troubleshoot issues related to CPU, memory, disk, and network utilization. In this fast-paced environment, you will manage and maintain production Kubernetes clusters, drive the automation of monitoring processes, and develop tools to automate workflows. You will also define and implement key metrics using analytics methods and dashboards, and take part in prototyping and developing cloud infrastructure.
Keen attention to detail, strong problem-solving abilities, and solid technical knowledge are essential for success in this role. The position requires a proactive approach to managing infrastructure and systems, ensuring that the team's service level agreements (SLAs) are met, and continuously improving the infrastructure codebase.