Datacenter Site Reliability Engineer

Kforce - Los Angeles, CA

posted 3 months ago

Full-time

Los Angeles, CA

Administrative and Support Services

About the position

Kforce is immediately adding a full-time Datacenter Site Reliability Engineer in support of our industry-leading technology development client in Los Angeles, CA. Our client is seeking candidates who are driven to make positive changes to the way people live and work by creating breakthrough technology solutions.

The role involves a variety of responsibilities centered around ensuring the reliability and performance of data center operations. This includes data monitoring and alerting, data quality assurance, and anomaly detection. The engineer will be responsible for documenting team processes and policies, including methods of engagement and Service Level Objectives (SLOs). In this position, the engineer will analyze, design, and implement solutions at the system level to remove bottlenecks and improve edge service performance. Implementing monitoring and alerting systems to enhance issue detection and response is a key part of the role.

The engineer will work in a fast-paced environment and participate in technical operations and rotations in response to performance and reliability issues. Additionally, participation in on-call rotations is required, where the engineer will be responsible for resolving or escalating incoming events. A strong understanding of maintaining and operating a Linux and Kubernetes environment is essential for success in this role.

Responsibilities

Perform data monitoring and alerting, data quality assurance, and anomaly detection
Document team processes and policies, including methods of engagement and SLOs
Analyze, design, and implement solutions at the system level to remove bottlenecks and improve edge service performance
Implement monitoring and alerting to improve issue detection and response
Work in a fast-paced environment; participate in technical operations and rotations in response to performance and reliability issues
Participate in on-call rotations, resolving or escalating incoming events
Maintain and operate a Linux and Kubernetes environment

Requirements

Bachelor's degree or above, majoring in Computer Science or related fields, with at least 2 years of related work experience
3+ years of experience working with Unix/Linux systems from kernel to shell and beyond
Experience working with system libraries, file systems, and client-server protocols
Experience reading Python scripts for platform operations
Experience in networking technologies such as TCP/IP, BGP, DNS, etc. in a carrier-grade environment
Experience in developing and operating one or more of the following systems: OpenStack, Kubernetes, Nginx, ipvs, ELK stack, Hadoop, etc.

Benefits

Medical insurance
Dental insurance
Vision insurance
Health Savings Account (HSA)
Flexible Spending Account (FSA)
401(k)
Life insurance
Disability insurance
Accidental Death and Dismemberment (AD&D) insurance
Paid time off for salaried personnel
Paid sick leave for hourly employees on a Service Contract Act project
