Paycom Payroll - Oklahoma City, OK

posted 3 months ago

Full-time
Oklahoma City, OK
Professional, Scientific, and Technical Services

About the position

Site Reliability Engineers are dedicated full-time to creating software tools, metrics, and processes that improve the reliability of applications, sites, and systems in production. The Site Reliability Engineer is primarily responsible for ensuring the integrity, functionality, and reliability of applications and sites. This role involves developing software to detect unusual error activity and implementing workflows and processes designed to identify and reduce the overall number of application and system errors. Collaboration with software development teams as part of the Software Development Life Cycle (SDLC) is essential to design and implement availability, reliability, and error monitoring solutions in their applications. The Site Reliability Engineer takes responsibility for removing, isolating, or remediating error, debug, warning, and other messages in existing logs to improve overall log content and usefulness. Limiting system downtime is a critical aspect of this role, which involves defining and enforcing standards for incident response, error tracking, monitoring, and alerting with the goal of improving established reliability metrics. The engineer is expected to respond effectively to escalated site reliability issues at any time of day while on call, and to regularly research best practices and new technology for monitoring, alerting, error tracking and detection, and application performance.

Responsibilities

  • Develop software to detect unusual error activity.
  • Implement workflows and processes that are designed to identify and reduce the overall number of application/system errors.
  • Collaborate with software development as part of the SDLC to design and implement availability, reliability, and error monitoring solutions in their applications.
  • Take responsibility for removing, isolating, or remediating error, debug, warning, and other messages in existing logs to improve overall log content and usefulness.
  • Limit system downtime by defining and enforcing standards for incident response, error tracking, monitoring, and alerting, with the goal of improving established reliability metrics.
  • Respond effectively to escalated site reliability issues at any time of day while on call.
  • Conduct regular research on best practices and new technology for monitoring, alerting, error tracking and detection, and application performance.

Requirements

  • Bachelor's degree in Computer Science, MIS, or a related field.
  • 3+ years of experience using alerting and telemetry tools such as Grafana, Prometheus, Splunk, or Dynatrace.
  • 2+ years of experience with Splunk SPL.
  • 2+ years of experience with at least one programming language or platform such as PHP, Python, Java, or .NET.

Nice-to-haves

  • 1+ years of experience with CI/CD.
  • 1+ years of experience with containers and container orchestration tools such as Docker and Kubernetes.
  • 1+ years of experience with Prom.
  • 1+ years of experience with SQL.
  • Troubleshooting in a large-scale networked environment.
  • Knowledge of Paycom's applications, systems, and databases.