Computer World Services Corp. (Cws) - Mansfield, TX

posted 21 days ago

Full-time - Mid Level
Remote - Mansfield, TX
Professional, Scientific, and Technical Services

About the position

The Site Reliability Engineer (EMO Engineer) at Computer World Services Corp is responsible for defining and implementing observability solutions for IT operations, ensuring high-performance and scalable infrastructure in a cloud environment. This role involves collaboration with cross-functional teams to gather requirements, architect solutions, and deploy monitoring environments that align with business needs. The engineer will also provide Tier 3 support, manage observability pipelines, and lead the migration of data feeds to enhance operational efficiency.

Responsibilities

  • Design, implement, and maintain high-performance and scalable observability solutions in a cloud environment.
  • Collaborate with cross-functional teams to gather requirements, architect solutions, and deploy logging and monitoring environments that align with business needs.
  • Configuration and maintenance of Datadog integrations including Webhooks, Amazon, Cisco, CrowdStrike, Cribl Stream, Container, VMware, SNMP, journald, Okta, Python, Zscaler, Microsoft 365, Palo Alto.
  • Configuration of telemetry logs through Cribl Stream including syslog, SNMP traps, JSON, AWS CloudWatch, AWS S3.
  • Development of custom data/telemetry pipelines including Grok parsing, GeoIP parsing, field remapping, and error tracking.
  • Ingest telemetry logs directly from cloud SaaS providers such as Zscaler, Okta, CrowdStrike, ServiceNow, Microsoft 365.
  • Installation and configuration of the Datadog Agent and Datadog Synthetics Agent on Windows servers, Linux servers, and Docker/Kubernetes containers.
  • Configuration of the Datadog Agent to collect host logs, processes, custom metrics (including SNMP), and network performance monitoring (NPM) data.
  • Configuration of Synthetic testing to monitor infrastructure uptime SLAs and SLOs using private locations.
  • Configuration of service-related monitors based on metrics, logs, live processes, service checks, and anomalies/outliers, including monitoring of serverless workloads such as AWS Lambda functions.
  • Development of custom dashboards with a focus on reliability and performance of services.
  • Configuration and management of Service Catalog, including the definition of services and associated dashboards, monitors, SLOs, synthetic tests, metrics, and logs.
  • Configuration of incident management and service-based analytics including integration with JIRA and/or ServiceNow.
  • Maintain code repositories and versioning of any scripting or automation.
  • Provide technical leadership, oversight, governance, and direction for integrating with, and reporting on, observability pipelines.
  • Provide consultative services to support integrations for applications that need to be observed/monitored, such as Hadoop HDFS, Hadoop MapReduce, Hive.
  • Identify opportunities for monitoring improvement, including incorporating APM and RUM monitoring.
  • Update documentation and user guides as needed.
  • Collaborate with cross-functional teams.
  • Configure monitors & alerts to integrate with Incident Management tools.

Requirements

  • Undergraduate degree in an engineering or computer science discipline and/or equivalent experience/certification.
  • 7+ years of experience in information technology with hands-on technical/engineering roles.
  • 2+ years of experience working with Datadog, including hands-on experience administering and supporting a Datadog migration or implementation.
  • 3+ years of experience with AWS.
  • 3+ years of experience with data onboarding within a large-scale enterprise environment.
  • Experience in Datadog including building dashboards, reports, and alerts to meet customer requirements.
  • Experience with Infrastructure & Monitoring as Code tools.
  • Experience configuring and supporting additional Datadog modules.
  • Solid understanding of networking and device configuration.
  • Experience with migrating from other monitoring platforms to Datadog.
  • Experience with Incident Response tools.
  • Knowledge of Agile and continuous integration practices.
  • Collaborative mindset that thrives in fast-paced environments.
  • Excellent verbal and written communication skills.

Nice-to-haves

  • Datadog, Cribl, and AWS certifications.

Benefits

  • Remote work flexibility
  • Equal employment opportunity
  • Affirmative action employer
  • Reasonable accommodations for individuals with disabilities