Site Reliability Engineer

Trillium Staffing • Santa Clara, CA

425d • $166,400 - $187,200

About The Position

Trillium Professional is seeking a Site Reliability Engineer to join its team in Santa Clara. This role focuses on monitoring and recovering assets in a private cloud environment, with a specific emphasis on building and stabilizing virtualization infrastructure. The engineer will deploy and maintain a large farm of machines using configuration management and infrastructure automation tools, and will participate in on-call support for infrastructure issues.

Requirements

Bachelor's or Master's Degree in Computer Science or Software Engineering, or equivalent experience.
5+ years of professional experience required.
Strong system and platform debugging skills.
Virtualization experience (vSphere, Hyper-V, KVM, XenServer).
Familiarity with configuration management tools (Chef preferred, Ansible).
Experience working in large scale enterprise production systems.
Ability to debug and analyze system issues and code in order to triage, root-cause, and resolve problems in the infrastructure.
Familiarity with the setup and maintenance of Linux and Windows hosts.
Scripting proficiency in Python, Go, or Unix shell.
Experience with version control systems such as Perforce and Git.

Nice To Haves

Familiarity with private cloud setups (VMware, Dell, Apple).
Experience with virtualization and container technologies such as VMware, KVM, Hyper-V, Docker, and Kubernetes.
Background in automating bare metal and VM provisioning.
Experience supporting GPUs, embedded device development, driver development, and CUDA/TensorRT applications.
Development experience in Chef, Ansible, and infrastructure orchestration.

Responsibilities

Fleet monitoring & recovery of assets in a private cloud environment.
Building and stabilizing virtualization infrastructure of ESXi, KVM, and Hyper-V.
Deploying and maintaining a large farm of machines using Configuration Management & Infrastructure Automation tools (Chef, Ansible, Terraform).
Participating in on-call & rotational L1 support for round-the-clock monitoring and remediation of infrastructure issues (PagerDuty).
Analyzing and debugging operating system, networking, configuration, and performance problems.
Assisting in the roll-out and deployment of infrastructure configurations to support the latest hardware and technologies.
Contributing to the development of real-time infrastructure monitoring systems (Zabbix, BigPanda, Grafana).

Benefits

Competitive pay rate of $80 - $90/hour.

What This Job Offers

Job Type

Full-time

Industry

Administrative and Support Services

Education Level

Bachelor's degree
