Alloy - New York, NY
posted 2 months ago
Alloy is seeking a Site Reliability Engineer (SRE) to join our Infrastructure Team in New York City. This role is pivotal in ensuring that our services, which leading fintechs and top-tier banks rely on, maintain high uptime and exceed our Service Level Objectives (SLOs). As part of a team of five engineers, you will report to the Engineering Manager of Infrastructure and will be responsible for architecting and building infrastructure solutions that improve our operational reliability.

Your work will involve provisioning and managing a variety of AWS resources using Terraform, implementing solutions for deploying applications to Kubernetes in production, and helping to architect secure and reliable systems and deployment pipelines. You will be expected to write and review code comfortably, think pragmatically when justifying build-versus-buy decisions, and continuously look for opportunities to improve our infrastructure.

You will use tools such as Datadog, Splunk, or New Relic to identify latency issues in distributed systems and propose solutions to mitigate them. Participation in on-call rotations is part of the job, but your focus will be on building resilient, self-healing systems that minimize alerts. You will also write infrastructure as code (IaC) using Terraform, automate processes with AWS tools, GitHub Actions, and custom scripts, and support application developers by eliminating constraints in the deployment pipeline.

Continuous improvement will be a key focus: you will look for ways to improve uptime, autoscaling, and recovery times while suggesting new cloud services and optimizing costs. Your contributions will be crucial to maintaining a high standard of service delivery and operational excellence at Alloy.