Ramy Infotech - Dallas, TX
posted 4 months ago
As a Senior Site Reliability Engineer (SRE) specializing in Observability, you will play a crucial role in implementing and maintaining observability solutions that improve the performance and reliability of our applications and infrastructure. Your primary responsibility will be to use tools such as Prometheus and Grafana to develop and manage dashboards that visualize key metrics and performance data. You will also optimize and configure licensing for various observability tools, ensuring that they are used effectively across the organization.

In this role, you will bridge the gap between application development teams and SRE operations, facilitating communication and collaboration so that observability solutions meet the needs of all stakeholders. You will manage and optimize OpenShift and Linux environments, as well as Grafana Enterprise Metrics, to keep our observability solutions robust and scalable. Your expertise in MELT (Metrics, Events, Logs, and Traces) will be essential as you plan for long-term data migration to AWS S3, ensuring that our data is stored efficiently and securely.

You will be responsible for configuring and managing monitoring, alerts, and observability using a range of tools including Splunk, Netcool, ELK, and AIM. Your deep technical knowledge of and operational experience with tools like AppDynamics, Datadog, Dynatrace, New Relic, and Sumo Logic will be critical in maintaining the effectiveness of our observability solutions. Additionally, you will be expected to write and manage complex queries and alert definitions, and to establish design patterns for monitoring application uptime and performance.

As a thought leader in this space, you will provide strategic guidance on implementing and maintaining observability solutions and on onboarding new teams and data sources into these systems. You will also create and maintain operational process documentation so that best practices are followed and knowledge is shared across the organization. Your ability to write code in languages such as Java, Python, Ruby, and JavaScript (Node.js) will be essential in developing programs, config files, and complex queries that support our observability initiatives.