Unclassified - Atlanta, GA

posted 15 days ago

Full-time - Senior

About the position

The Senior Site Reliability Engineer (SRE) will be responsible for enhancing the reliability, performance, and security of our platform. This role involves deep troubleshooting, monitoring, and automation to ensure 24/7 operational readiness. The SRE will work closely with engineering teams to build tools, define metrics, and participate in incident postmortems, all while maintaining a high degree of reliability in the services provided.

Responsibilities

  • Review current workload patterns and prioritize areas of weakness within the platform through log and metric investigation.
  • Work with senior engineering and testing team members to build tools and recommend testing strategies for problem prevention and detection.
  • Employ deep troubleshooting skills to improve availability, performance, and security of services.
  • Perform in-depth postmortems on production incidents to assess business impact and facilitate learning.
  • Create dashboards and alerts for monitoring the IDP platform, defining key metrics and service level indicators.
  • Participate in the 24/7 on-call rotation.
  • Reduce toil by building software and automation for seamless application deployment and third-party tool integration.
  • Ensure the platform maintains a high degree of reliability, achieving at least three nines (99.9% availability).
  • Define non-functional requirements as part of the product lifecycle to influence new designs and standards.
  • Own technically complex issues that span DevOps, databases, networking, code, infrastructure, and people, and drive them to resolution.
  • Provide recommendations and feedback in design reviews and review sessions.

Requirements

  • Bachelor's degree in computer science, a related field, or equivalent practical experience.
  • Minimum 5 years of experience as a Software, Platform, SRE, or DevOps Engineer supporting large-scale SaaS production B2C or B2B cloud platforms.
  • Development skills in Java, TypeScript, and Python, with expertise in object-oriented programming (OOP).
  • Hands-on Azure Cloud experience, particularly with AKS, API Management, Azure Cache for Redis, Azure Blob Storage, Cosmos DB, Service Bus, and Azure Functions.
  • Proficiency in monitoring, APM, and profiling tools such as New Relic, Splunk, Prometheus, and Grafana.
  • Working experience with containers, Kubernetes, and Helm.
  • Functional knowledge of cloud networking, firewalls, ingress and egress controllers, and service mesh, plus experience with Auth0, secret management, Cloudflare, CDN, load balancer, cache, and firewall features.
  • Experience with ArgoCD, GitLab, CI/CD, Terraform, and Infrastructure as Code.
  • Strong communication skills and ability to explain technical concepts clearly and simply.
  • Willingness to dive into understanding, debugging, and improving any layer of the stack.

Benefits

  • Opportunity to work on bleeding-edge projects
  • Work with a highly motivated and dedicated team
  • Competitive salary
  • Flexible schedule
  • Benefits package (medical insurance)
  • Corporate social events
  • Professional development opportunities
  • Well-equipped office