Edible Arrangements - Atlanta, GA

posted 9 days ago

Full-time - Senior
Remote - Atlanta, GA
501-1,000 employees
Sporting Goods, Hobby, Musical Instrument, Book, and Miscellaneous Retailers

About the position

As a Senior Site Reliability Engineer (SRE), you will be responsible for ensuring the resilience and reliability of our e-commerce applications through monitoring, automation, and proactive site maintenance. You will leverage Datadog, Azure Application Insights, and other industry-standard tools to develop robust monitoring systems that enhance site awareness, detect and respond to incidents, and maintain high availability. This role involves collaboration across engineering teams to build a proactive approach to system health, site reliability, and incident management.

Responsibilities

  • Develop, implement, and manage monitoring and alerting systems using Datadog, Azure Application Insights, and other related technologies.
  • Integrate Datadog with .NET, Node.js, and React-based applications for comprehensive monitoring of application performance and health.
  • Establish proactive monitoring practices to reduce site outages and gain insight into system performance.
  • Design and implement Standard Operating Procedures (SOPs) for incident response and resolution.
  • Collaborate with engineering and product teams to execute comprehensive incident response plans.
  • Optimize Azure DevOps pipelines to address blockers, errors, and build issues.
  • Maintain and improve application performance and resilience through Azure services.
  • Execute SQL queries to assess and troubleshoot database performance and availability issues.
  • Work closely with developers to embed monitoring tools into the development cycle.
  • Create detailed documentation, including SOPs, best practices, and monitoring configurations.
  • Stay current with emerging monitoring technologies to enhance platform reliability and scalability.
  • Promote a culture of learning and proactive improvement through root cause analysis.

Requirements

  • 5+ years of experience in Site Reliability Engineering, preferably within an e-commerce or high-traffic web application environment.
  • Strong expertise with Datadog, including setting up integrations and creating custom metrics, dashboards, and alerts.
  • Proven experience with Azure Application Insights, Azure DevOps, and implementing monitoring solutions in cloud environments.
  • Hands-on experience managing and optimizing Azure App Services, Azure Front Door, Azure Application Gateway, and SQL databases.
  • Familiarity with SOP development for incident management and proactive monitoring.
  • Knowledge of CI/CD pipelines in Azure DevOps and experience in resolving build blockers.

Nice-to-haves

  • Advanced certifications in Azure (e.g., Azure DevOps Engineer Expert, Azure Solutions Architect).
  • Extensive experience with high-traffic e-commerce applications and a track record of ensuring uptime and resilience.
  • Experience with other monitoring and observability tools (e.g., Grafana, Prometheus).

Benefits

  • Onsite work environment with work-from-home flexibility.
  • Opportunities for personal and professional growth and development.
  • Health, dental, and vision insurance, a 401(k) plan, company-paid life insurance, and short-term disability coverage.
  • Paid time off, including sick days & holidays.