SRE

$79,040 - $87,360/Yr

INSPYR Solutions - Phoenix, AZ

posted 3 months ago

Full-time - Mid Level

Hybrid - Phoenix, AZ

Administrative and Support Services

About the position

The Site Reliability Engineer (SRE) is a critical role within our IT DevOps team, focused on ensuring the reliability and performance of our cloud-based services. The role requires a deep understanding of monitoring and alarming processes, as well as experience with cloud service platforms. The SRE will stand up new cloud environments, determine system interactions and dependencies within the Voice Services product, and provide on-call support based on a rotation schedule and the severity of issues. The successful candidate will also route defects to the appropriate internal or external teams for remediation and manage problem management policies.

Collaboration is key in this role: the SRE will work closely with monitoring teams to develop new alarms and alerts based on undetected incidents, ensuring that all relevant data triggering the alarms is captured. The SRE will define the monitoring types and thresholds to implement, maintain documentation for Mean Time to Repair (MTTR), and participate in weekly calls to stay current on system changes. Experience with call center platforms and with forecasting and capacity planning of system and network resources is essential, as is the ability to provision resources cost-effectively. The SRE also needs a solid understanding of SSL certificates and general networking, and the ability to multi-task and prioritize in a fast-paced environment.

Responsibilities

Knowledge and experience with monitoring and alarming processes and technology
Experience operating with cloud service platforms
Ability to stand up a new cloud environment
Determine system interactions and dependencies with other applications and components within the Voice Services product
On-call support as needed based on rotation and severity of issues
Route defects to appropriate internal/external squads for defect remediation
Responsible for managing AA problem management policies
Collaborate with monitoring teams to develop new alarms and alerts based on undetected incidents
Define monitoring types and thresholds to implement
Knowledge of SSL certificates
General networking knowledge
Document and maintain a playbook for obtaining Mean Time to Repair (MTTR)
Participate in weekly calls to understand changes being implemented to the systems
Call center platform experience
Experience with forecasting and capacity planning of system and network resources
Experience with cost-efficient provisioning of resources

Requirements

Degree or years of experience in Computer Science, Computer Engineering, Technology, Information Systems (CIS/MIS), Engineering or related technical discipline, or equivalent experience/training
5+ years of IT Engineering and/or Operations experience with increasing levels of responsibility
1-3 years of experience with Kubernetes clusters and a good understanding of containerization
Solid hands-on experience with NGINX Ingress controllers
1-3 years of experience with Linux environments
Working knowledge of databases: Oracle, SQL, etc.
Knowledge of Cloud infrastructures (AWS, Google, IBM, etc.)
Knowledge of Web Servers (Nginx, Apache, etc.)
Ability to work effectively in a fast-paced, changing environment
Ability to multi-task and prioritize

Benefits

Comprehensive medical benefits
Competitive pay
401(k) retirement plan
