CodeHunter, posted 6 days ago
Full-time - Mid Level
McLean, VA

About the position

CodeHunter is a dynamic and innovative tech company that specializes in cybersecurity. CodeHunter is an enterprise-grade malware hunting platform. In seconds, we identify unknown malware threats that are undetectable to current cybersecurity solutions. By automating the analysis process, we reduce dependency on manual efforts and provide actionable intelligence that protects your organization from threat actors. We are enthusiastic about pushing the boundaries of technology to create innovative solutions that solve real-world problems.

As we continue to grow, we are looking for a talented Mid-Level Site Reliability Engineer to join our team. As a Mid-Level Site Reliability Engineer at CodeHunter, you will own the availability, resiliency, and scaling of our SaaS product offering. We need a highly skilled, purposeful, and accountable Site Reliability Engineer (SRE) to lead the charge in establishing a world-class reliability program.

You will play a critical role in optimizing our systems, ensuring scalability, and maintaining the highest security, availability, and performance standards. You will work closely with our DevOps Engineering team to observe, measure, and deliver high-quality solutions that meet client contractual service level agreements (SLAs). This position offers an exciting opportunity to work on challenging projects, learn modern technologies, and make a foundational impact on growing our service level maturity.

Responsibilities

  • Lead the design and execution of a cutting-edge site reliability program that raises the bar for performance, scalability, and security.
  • Refine our DevSecOps practices, ensuring continuous improvement in monitoring, logging, and security.
  • Take full ownership of optimizing system performance, managing disaster recovery processes, and driving cost management for third-party cloud services (AWS, Azure).
  • Define SLIs, establish SLOs, and meet or exceed SLAs to guarantee system reliability, manage incidents with a sense of urgency, and conduct post-mortems to continuously improve our infrastructure.
  • Champion resilience and system uptime through chaos engineering, automated scaling, self-healing mechanisms, and future-proof capacity planning.
  • Develop and implement advanced monitoring and observability tools, while actively managing error budgets to meet organizational goals.
  • Automate CI/CD pipelines, infrastructure as code (IaC), and configurations to streamline our development processes.
  • Lead the DevOps Change Control Board (CCB), setting the standard of excellence in our change management processes.
  • Oversee the creation and evolution of a comprehensive internal knowledge base and develop training content to ensure seamless onboarding.
  • Drive zero-downtime deployments, utilizing blue-green and canary deployment strategies to ensure smooth updates.
  • Manage and optimize cloud platforms, containers (Docker, Kubernetes), and observability tools feeding critical insights to NOC/SOC and executive-level dashboards.
  • Stay at the forefront of industry best practices, emerging technologies, and innovations to drive continuous improvement.

Requirements

  • A proven history of exceeding expectations and delivering high-impact results in site reliability, performance optimization, and system scalability.
  • Expertise in cloud platforms (AWS, Azure), containers (Docker, Kubernetes), and modern monitoring/observability tools.
  • Deep experience with automation, DevSecOps practices, and infrastructure-as-code (IaC).
  • Strong leadership skills, with the ability to drive change, champion excellence, and mentor others.
  • A proactive problem-solver with a keen focus on both long-term vision and immediate execution.
  • Passion for continuous learning, staying updated on industry trends, and applying best practices to deliver exceptional reliability and performance.
  • Significant prior experience managing uptime for cloud infrastructure handling millions of HTTP requests per second globally.
  • Good knowledge of PowerShell, Python, and Bash.
  • 5-7+ years of professional site reliability engineering experience.
  • Excellent problem-solving and communication skills.
  • Ability to work collaboratively in a team environment and delegate responsibilities to team members.

Nice-to-haves

  • Dynamic environment creation on demand using Terraform or similar technology.

Benefits

  • 401K
  • Health coverage
  • Vision and dental coverage
  • Company-sponsored training
  • Parking or metro benefits
  • Catered lunches
  • Generous PTO policy

Hard Skills

  • Kubernetes: 2
  • Bash: 1
  • Docker: 1
  • Python: 1
  • Terraform: 1