Computer World Services Corp. (CWS) - Atlanta, GA

posted 21 days ago

Full-time - Senior
Remote - Atlanta, GA
Professional, Scientific, and Technical Services

About the position

The Senior Systems Engineer - Observability (SSE) plays a crucial role in defining and implementing observability solutions for IT operations within the Office of Financial Research (OFR). This position focuses on creating a consolidated operating picture for monitoring and governance, collaborating with various teams to enhance data collection and analysis, and ensuring the effective migration and integration of observability tools. The SSE will serve as a subject matter expert, responsible for building an enterprise observability strategy and providing Tier 3 support for complex systems.

Responsibilities

  • Design, implement, and maintain high-performance and scalable observability solutions in a cloud environment.
  • Collaborate with cross-functional teams to gather requirements, architect solutions, and deploy logging and monitoring environments that align with business needs.
  • Configuration and maintenance of Datadog integrations, including Amazon, Cisco, CrowdStrike, Cribl Stream, Container, VMware, SNMP, journald, Okta, Python, Zscaler, Microsoft 365, Webhooks, and Palo Alto.
  • Configuration of telemetry logs through Cribl Stream, including syslog, SNMP traps, JSON, AWS CloudWatch, and AWS S3.
  • Development of custom data/telemetry pipelines including Grok parsing, GeoIP parsing, field remapping, and error tracking.
  • Ingest telemetry logs directly from cloud SaaS providers such as Zscaler, Okta, CrowdStrike, ServiceNow, and Microsoft 365.
  • Installation and configuration of the Datadog Agent and Datadog Synthetics Agent on Windows servers, Linux servers, and Docker/Kubernetes containers.
  • Configuration of the Datadog Agent to collect host logs, processes, custom metrics (including SNMP), and network performance monitoring (NPM) data.
  • Configuration of Synthetic testing to monitor infrastructure uptime SLAs and SLOs using private locations.
  • Configuration of service-related monitors based on metrics, logs, live processes, service checks, and anomalies/outliers, including monitoring of serverless workloads such as AWS Lambda functions.
  • Development of custom dashboards with a focus on reliability and performance of services.
  • Configuration and management of Service Catalog, including the definition of services and associated dashboards, monitors, SLOs, synthetic tests, metrics, and logs.
  • Configuration of incident management and service-based analytics, including integration with Jira and/or ServiceNow.
  • Maintain code repositories and versioning of any scripting or automation.
  • Provide technical leadership, oversight, governance, and direction for integrating with, and reporting on, observability pipelines.
  • Provide consultative services for integrating applications that must be observed or monitored, such as Hadoop HDFS, Hadoop MapReduce, and Hive.
  • Identify opportunities for monitoring improvement, including incorporating APM and RUM.
  • Update documentation and user guides as needed.
  • Collaborate with cross-functional teams.
  • Configure monitors and alerts to integrate with incident management tools.

Requirements

  • Undergraduate degree in an engineering or computer science discipline and/or equivalent experience/certification.
  • 7+ years of experience in information technology with hands-on technical/engineering roles.
  • 2+ years of experience working with Datadog, including hands-on experience administering and supporting a Datadog migration or implementation.
  • 3+ years of experience with AWS.
  • 3+ years of experience with data onboarding within a large-scale enterprise environment.
  • Experience in Datadog, including building dashboards, reports, and alerts to meet customer requirements.
  • Experience with Infrastructure & Monitoring as Code tools.
  • Experience configuring and supporting additional Datadog modules.
  • Solid understanding of networking and device configuration.
  • Experience with migrating from other monitoring platforms to Datadog.
  • Experience with Incident Response tools.
  • Knowledge of Agile and continuous integration practices.
  • Collaborative mindset and the ability to thrive in fast-paced environments.
  • Excellent verbal and written communication skills.

Nice-to-haves

  • Datadog, Cribl, and AWS certifications.

Benefits

  • Remote work flexibility
  • Equal employment opportunity
  • Affirmative action employer
  • Reasonable accommodations for individuals with disabilities