Burgeon IT Services - Plano, TX

posted 4 days ago

Full-time - Mid Level

About the position

The Site Reliability Engineer (SRE) role focuses on enhancing network and service availability through automation, tools, and processes. The position requires a solid foundation in software development, particularly in Python, and experience with Docker and Kubernetes. The SRE will work closely with the Network Operations Center (NOC) to resolve network events and integrate new tools and products into the NOC support teams. This role involves designing and implementing solutions, troubleshooting technical issues, mentoring team members, and ensuring high-quality software development practices.

Responsibilities

  • Drive sound system architecture, and guide and mentor the team on code development practices.
  • Manage safe feature branching strategies and version control.
  • Develop a proper workflow for team code review and deliver well-vetted, tested products.
  • Oversee application testing procedures and software deployment packaging.
  • Monitor infrastructure, inbound and outbound processes, web services, and application health.
  • Implement feature tracking and bug fixes.
  • Define standards for enterprise quality software that is robust, scalable, and maintainable.
  • Develop and maintain a catalog of reliability scripts, tools, and libraries for operational needs.
  • Monitor and analyze network performance, providing automation insights for network events.
  • Analyze data to diagnose and identify root causes for network-specific events.
  • Act as a Tier 3 escalation for issues from Tier 1 or Tier 2 related to the observability platform.
  • Collaborate with vendors and internal technical teams to incorporate technical solutions.
  • Define and implement strategies for network automation to improve operational efficiencies.
  • Manage a CI/CD pipeline for network development and testing.
  • Participate in documentation of application/network flows for support needs.
  • Provide technical guidance, training, and mentorship to NOC and engineering teams.
  • Develop and improve instrumentation for monitoring and logging service health and availability.
  • Participate in Major Incident bridges and contribute to formal root cause analysis (RCA) reports.

Requirements

  • Bachelor's degree in Computer Science or an IT-related field, or equivalent experience.
  • 3+ years of scripting experience in Python and JavaScript.
  • 3+ years of event-driven engineering experience, preferably with AIOps using AI/ML platforms/tools.
  • 3+ years of experience using source code management, CI/CD, and automation tools such as Git/GitLab, Terraform, Ansible, Chef, Puppet, and Jenkins.
  • 3+ years of experience building CI/CD pipelines, version control, and system testing with GitLab and Jenkins.
  • 3+ years of experience with OS-level containerization techniques using Docker, Wind River, VMware, Kubernetes, and Rancher.
  • 3+ years of experience with cloud platforms such as AWS, Azure, and Google Cloud Platform.
  • 5+ years of technical, hands-on experience in AWS Cloud Engineering, 5G O-RAN, 5G Core, and/or Data and Transport Engineering.
  • Strong ownership of work and results delivery.
  • Habitual use of code branching, versioning, feature lifecycle management, testing, packaging, and deployment practices.
  • Excellent communication skills and the ability to work well in a team.

Nice-to-haves

  • 5+ years of experience using platforms such as Datadog, Grafana, ServiceNow, SolarWinds, Cisco Vitria/Matrix, Innoeye, and the Atlassian stack (Crucible, Bitbucket, Jira, Confluence).
  • Experience gaining insight from log files with Loki, Elasticsearch, Prometheus, and Grafana.
  • Experience implementing systems tracing with services such as Tempo, Jaeger, and OpenTracing.
  • Intermediate understanding of REST APIs, Apache Spark, and Kafka.