SRE

$79,040 - $87,360/Yr

INSPYR Solutions - Phoenix, AZ

posted 3 months ago

Full-time - Mid Level
Hybrid - Phoenix, AZ
Administrative and Support Services

About the position

The Site Reliability Engineer (SRE) is a critical role within our IT DevOps team, focused on ensuring the reliability and performance of our cloud-based services. The role requires a deep understanding of monitoring and alarming processes, as well as experience with cloud service platforms. The SRE will stand up new cloud environments, determine system interactions and dependencies within the Voice Services product, and provide on-call support based on a rotation schedule and the severity of issues. The successful candidate will also route defects to the appropriate internal or external teams for remediation and oversee problem management policies.

Collaboration is key in this role. The SRE will work closely with monitoring teams to develop new alarms and alerts for incidents that previously went undetected, ensuring that all relevant data triggering the alarms is captured. The SRE will also define the monitoring types and thresholds to implement, maintain documentation for Mean Time to Repair (MTTR), and participate in weekly calls to stay updated on system changes.

Experience with call center platforms and with forecasting and capacity planning of system and network resources is essential, as is the ability to provision resources cost-effectively. The SRE also needs a solid understanding of SSL certificates and general networking, and the ability to multi-task and prioritize in a fast-paced environment.

Responsibilities

  • Knowledge and experience with monitoring and alarming processes and technology
  • Experience operating with cloud service platforms
  • Ability to stand up a new cloud environment
  • Determine system interactions and dependencies with other applications and components within the Voice Services product
  • On-call support as needed based on rotation and severity of issues
  • Route defects to appropriate internal/external squads for defect remediation
  • Responsible for managing AA problem management policies
  • Collaborate with monitoring teams to develop new alarms and alerts based on undetected incidents
  • Define monitoring types and thresholds to implement
  • Knowledge of SSL certificates
  • General networking knowledge
  • Document and maintain a playbook for obtaining Mean Time to Repair (MTTR)
  • Participate in weekly calls to understand changes being implemented to the systems
  • Call Center platform experience
  • Experience with forecasting and capacity planning of system and network resources
  • Experience with cost-efficient provisioning of resources

Requirements

  • Degree in Computer Science, Computer Engineering, Technology, Information Systems (CIS/MIS), Engineering, or a related technical discipline, or equivalent experience/training
  • 5+ years of IT Engineering and/or Operations experience with increasing levels of responsibility
  • 1-3 years of experience with Kubernetes clusters and a good understanding of containerization
  • Good hands-on experience with NGINX Ingress controllers
  • 1-3 years of experience with Linux environments
  • Working knowledge of databases: Oracle, SQL, etc.
  • Knowledge of Cloud infrastructures (AWS, Google, IBM, etc.)
  • Knowledge of Web Servers (Nginx, Apache, etc.)
  • Ability to work effectively in a fast-paced changing environment
  • Ability to multi-task and prioritize

Benefits

  • Comprehensive medical benefits
  • Competitive pay
  • 401(k) retirement plan