Ramy Infotech - Dallas, TX

posted 4 months ago

Full-time - Mid Level
Dallas, TX
Professional, Scientific, and Technical Services

About the position

As a Senior Site Reliability Engineer (SRE) specializing in Observability, you will implement and maintain observability solutions that improve the performance and reliability of our applications and infrastructure. Your primary responsibility will be to use tools such as Prometheus and Grafana to develop and manage dashboards that visualize key metrics and performance data. You will also optimize and configure licensing mechanisms for various observability tools so that they are used effectively across the organization.

In this role, you will bridge the gap between application development teams and SRE operations, facilitating communication and collaboration so that observability solutions meet the needs of all stakeholders. You will manage and optimize OpenShift and Linux environments, as well as Grafana Enterprise Metrics, to keep our observability solutions robust and scalable. Your expertise in MELT (Metrics, Events, Logs, and Traces) will be essential as you plan for long-term data migration to AWS S3, ensuring that our data is stored efficiently and securely.

You will configure and manage monitoring, alerts, and observability using a range of tools including Splunk, Netcool, ELK, and AIM. Your deep technical knowledge and operational experience with tools like AppDynamics, Datadog, Dynatrace, New Relic, and Sumo Logic will be critical in maintaining the effectiveness of our observability solutions. Additionally, you will write and manage complex queries and alert definitions, and establish design patterns for monitoring application uptime and performance.

As a thought leader in this space, you will provide strategic guidance on implementing and maintaining observability solutions, and you will onboard new teams and data sources into these systems. You will also create and maintain operational process documentation so that best practices are followed and knowledge is shared across the organization. Your ability to write code in languages such as Java, Python, Ruby, and Node.js will be essential in developing programs, config files, and complex queries that support our observability initiatives.

Responsibilities

  • Implement and maintain observability solutions using Prometheus as the backend and Grafana Enterprise Metrics (GEM) as the middle tier.
  • Develop and manage Grafana dashboards for visualizing metrics and performance data.
  • Optimize and configure licensing mechanisms for observability tools.
  • Write and manage complex queries and alert definitions.
  • Bridge the gap between application development teams and SRE operations.
  • Manage and optimize OpenShift, Linux environments, and Grafana Enterprise Metrics.
  • Utilize MELT (Metrics, Events, Logs, and Traces) and plan for long-term data migration to AWS S3.
  • Configure and manage monitoring, alerts, and observability using a range of tools including Splunk, Netcool, ELK, and AIM.
  • Maintain deep technical knowledge and operational experience with tools like AppDynamics, Datadog, Dynatrace, New Relic, Sumo Logic, Splunk, Prometheus, and Grafana.
  • Understand and write code (Java, Python, Ruby, Node.js, etc.), programs, config files, and complex queries.
  • Implement and manage Infrastructure as Code (IaC) using Terraform.
  • Manage and optimize cloud platforms (AWS/Azure) and Kubernetes environments.
  • Establish design patterns for monitoring and benchmarking application uptime and performance.
  • Provide thought leadership and strategy in implementing and maintaining observability solutions.
  • Onboard new teams and data sources into the observability solutions.
  • Create and maintain operational process documentation for observability solutions.
  • Optimize the Observability Suite for monitoring applications and infrastructure.
  • Write queries for alerts, dashboards, and reporting.

Requirements

  • 8+ years of experience with AWS, alert configuration, monitoring, the OpenTelemetry framework, Terraform, and scripting.
  • In-depth knowledge of observability tools such as Prometheus, Grafana, Splunk, Netcool, ELK, AIM, Sumo Logic, and New Relic.
  • Strong understanding of licensing mechanisms and MELT.
  • Experience with Cloud Platforms (AWS/Azure), Kubernetes, CI/CD (Jenkins), and Infrastructure as Code (Terraform).
  • Ability to read and write code in Java, Python, Ruby, Node.js, and other relevant languages.
  • Proven experience in creating dashboards, establishing design patterns, and understanding application flows in containerized/microservice environments.
  • Excellent communication skills and the ability to work effectively across teams.