About the position

We are seeking an experienced and strategic Site Reliability Engineer (SRE) to drive the stability, reliability, and observability of our mission-critical systems. This role is crucial to ensuring high availability, performance, and operational excellence for our services. The SRE will be responsible for designing and implementing robust reliability frameworks, overseeing system monitoring and incident response, and leading key initiatives to improve system performance. This role requires a strong leadership mindset, balancing proactive risk mitigation with rapid incident response. The ideal candidate will work closely with engineering, operations, and leadership teams to define and uphold service-level objectives (SLOs) and optimize system resilience.

Responsibilities

  • Develop and enforce service-level indicators (SLIs) and objectives (SLOs) to measure and improve system health.
  • Implement and manage comprehensive observability strategies, ensuring real-time visibility into system performance, availability, and health.
  • Oversee incident management and response processes, ensuring quick mitigation of production issues and leading post-mortem investigations to drive systemic improvements.
  • Optimize system reliability through failure analysis, capacity planning, and proactive risk assessment.
  • Define and implement best practices for on-call management, reducing alert fatigue while ensuring critical issues are addressed efficiently.
  • Assist with writing root cause analyses (RCAs) by providing technical details of the incident.
  • Continuously refine operational runbooks, incident response plans, and system reliability guidelines to enhance organizational resilience.
  • Analyze system performance trends, production issues, and historical outages to proactively address weaknesses before they impact customers.
  • Drive cultural change within the organization, promoting a reliability-first mindset across all teams.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field.
  • 5+ years of experience in a Site Reliability Engineering, Production Engineering, or Systems Engineering role.
  • Proven expertise in managing high-availability, distributed systems in a production environment.
  • Deep understanding of observability practices, including monitoring, logging, and tracing with tools such as Prometheus, Grafana, Datadog, New Relic, or OpenTelemetry.
  • Extensive experience in incident response, RCAs, post-mortems, and continuous improvement processes.
  • Strong background in capacity planning, load balancing, and performance tuning for large-scale applications.
  • Experience with operational leadership, on-call management, and defining reliability strategies within complex environments.
  • Familiarity with networking, security best practices, and risk management strategies for distributed architectures.
  • Strong analytical and problem-solving skills to diagnose system failures and implement long-term solutions.

Nice-to-haves

  • Incident Management & Alerting: Experience with Jira Service Management, PagerDuty, Opsgenie, or equivalent tools.
  • Cloud Infrastructure Management: Hands-on expertise with AWS, GCP, or Azure.
  • Database Performance Optimization: Experience working with relational and NoSQL databases.
  • Capacity Planning & Scalability Strategies: Ability to assess and predict infrastructure needs for growth.
  • Technical Leadership & Communication: Proven ability to work cross-functionally and drive reliability initiatives at scale.