SRE

Tek Ninjas - Fort Worth, TX

posted 3 months ago

Full-time - Mid Level

Fort Worth, TX

About the position

The Site Reliability Engineer (SRE) position at Tek Ninjas is a hybrid role based in either Tempe, AZ or Fort Worth, TX, with an estimated duration of 6 months. The contractor will primarily support the Voice Services product, focusing on Virtual Assistance/Chat, Interactive Voice Response (IVR), and Integrated Information Delivery Services (IIDS). The role requires a strong understanding of monitoring and alarming processes and experience with cloud service platforms. The SRE will establish a new cloud environment, determine system interactions and dependencies, and provide on-call support according to a rotation schedule and the severity of the issues encountered.

The SRE will also route defects to the appropriate internal or external teams for remediation, administer problem management policies, and work with monitoring teams to develop new alarms and alerts. The engineer will define monitoring types and thresholds, maintain documentation for Mean Time to Repair (MTTR), and participate in weekly calls to stay current on system changes. The SRE will also draw on experience with call center platforms and capacity planning to keep resource provisioning efficient and system performance steady.

The ideal candidate will hold a degree in Computer Science, Computer Engineering, or a related technical discipline, along with significant IT engineering and operations experience. A strong background in Kubernetes, Linux environments, and cloud infrastructures is essential, as is familiarity with databases and web servers. The ability to thrive in a fast-paced environment and manage multiple priorities is crucial for success in this position.

Responsibilities

Monitor and manage alarming processes and technology.
Work with cloud service platforms and establish new cloud environments.
Determine system interactions and dependencies within the Voice Services product.
Provide on-call support based on rotation and issue severity.
Route defects to appropriate internal/external teams for remediation.
Administer the organization's problem management policies.
Collaborate with monitoring teams to develop new alarms and alerts based on incidents.
Define monitoring types and thresholds for implementation.
Document and maintain playbooks for Mean Time to Repair (MTTR).
Participate in weekly calls to understand system changes.
Apply call center platform experience to system management.
Forecast and plan capacity for system and network resources.
Provision resources in a cost-efficient manner.

Requirements

Degree in Computer Science, Computer Engineering, Technology, Information Systems, or related technical discipline, or equivalent experience/training.
5+ years of IT Engineering and/or Operations experience with increasing levels of responsibility.
1-3 years of experience with Kubernetes clusters and hands-on experience with NGINX Ingress controllers.
1-3 years of experience in Linux environments.
Working knowledge of databases such as Oracle and SQL.
Knowledge of cloud infrastructures including AWS, Google Cloud, and IBM Cloud.
Familiarity with web servers such as NGINX and Apache.
Ability to work effectively in a fast-paced, changing environment.
Strong multi-tasking and prioritization skills.

Nice-to-haves

Experience with monitoring and alarming processes.
Knowledge of SSL certificates.
General networking knowledge.
Experience in forecasting and capacity planning.
