Computer World Services Corp. (CWS) - Mansfield, TX

posted 3 months ago

Full-time
Remote - Mansfield, TX
Professional, Scientific, and Technical Services

About the position

The mission of the Office of Financial Research (OFR) is to support the Financial Stability Oversight Council (FSOC) in promoting financial stability. This involves collecting data on behalf of FSOC, providing such data to FSOC and member agencies, standardizing the types and formats of data reported and collected, performing applied research and essential long-term research, developing tools for risk measurement and monitoring, and performing other related services. The results of the OFR's activities are made available to financial regulatory agencies, assisting member agencies in determining the types and formats of data authorized for collection.

The Senior Systems Engineer - Observability (SSE) will play a crucial role in defining and implementing infrastructure and application observability. This includes setting up governance, optimization, monitoring, and control for a consolidated common operating picture for IT operations. The SSE will collaborate with engineering, application, security operations, Service Desk, and enterprise/solution architects to develop and implement services, monitor, report, and automate processes where applicable.

This position serves as a subject matter expert in a complex array of full stack solutions, responsible for the migration of feeds from Splunk to Cribl, onboarding new feeds, and providing Tier 3 support. The SSE will also work with vendors on open tickets and operate within an Agile environment and enterprise change control systems. The role involves performing research, analysis, design, creation, and implementation to meet current and future requirements, as well as building and operationalizing an enterprise observability strategy.

Responsibilities

  • Design, implement, and maintain high-performance and scalable observability solutions in a cloud environment.
  • Collaborate with cross-functional teams to gather requirements, architect solutions, and deploy logging and monitoring environments that align with business needs.
  • Configuration and maintenance of Datadog integrations, including Webhooks, Amazon, Cisco, CrowdStrike, Cribl Stream, Container, VMware, SNMP, journald, Okta, Python, Zscaler, Microsoft 365, and Palo Alto.
  • Configuration of telemetry logs through Cribl Stream, including syslog, SNMP traps, JSON, AWS CloudWatch, and AWS S3.
  • Development of custom data/telemetry pipelines including Grok parsing, GeoIP parsing, field remapping, and error tracking.
  • Ingest telemetry logs directly from cloud SaaS providers such as Zscaler, Okta, CrowdStrike, ServiceNow, and Microsoft 365.
  • Installation and configuration of the Datadog Agent and Datadog Synthetics Agent on Windows servers, Linux servers, and Docker/Kubernetes containers.
  • Configuration of the Datadog Agent to collect host logs, processes, custom metrics (including SNMP), and network performance monitoring (NPM).
  • Configuration of Synthetic testing to monitor infrastructure uptime SLAs and SLOs using private locations.
  • Configuration of service-related monitors based on metrics, logs, live processes, service checks, and anomalies/outliers, including monitoring of serverless workloads such as AWS Lambda functions.
  • Development of custom dashboards with a focus on reliability and performance of services.
  • Configuration and management of Service Catalog, including the definition of services and associated dashboards, monitors, SLOs, synthetic tests, metrics, and logs.
  • Configuration of incident management and service-based analytics, including integration with Jira and/or ServiceNow.
  • Maintain code repositories and versioning of any scripting or automation.
  • Provide technical leadership, oversight, governance, and direction for integrating with, and reporting on, observability pipelines.
  • Provide consultative services to support the application integrations required to be observed/monitored, such as Hadoop HDFS, Hadoop MapReduce, and Hive.
  • Identify opportunities for monitoring improvement, including incorporating APM and RUM monitoring.
  • Update documentation and user guides as needed.
  • Collaborate with cross-functional teams.
  • Configure monitors and alerts to integrate with Incident Management tools.

Requirements

  • Undergraduate degree in an engineering or computer science discipline and/or equivalent experience/certification.
  • 7+ years of experience in information technology in hands-on technical/engineering roles, including 2+ years of experience working with Datadog and hands-on experience administering and supporting a Datadog migration or implementation.
  • 3+ years of experience with AWS.
  • 3+ years of experience with data onboarding within a large-scale enterprise environment.
  • Experience with Datadog, including building dashboards, reports, and alerts to meet customer requirements.
  • Experience with Infrastructure & Monitoring as Code tools.
  • Experience configuring and supporting additional Datadog modules.
  • Solid understanding of networking and device configuration.
  • Experience with migrating from other monitoring platforms to Datadog.
  • Experience with Incident Response tools.
  • Knowledge of Agile and continuous integration practices.
  • Collaborative mindset that thrives in fast-paced environments.
  • Excellent verbal and written communication skills including the ability to author and present materials ranging from detailed technical specifications to high-level concepts for senior audiences.

Nice-to-haves

  • Datadog, Cribl, and AWS certifications.