NCR Atleos - Atlanta, GA
posted 4 months ago
We are looking for a Site Reliability Engineer (SRE) at NCR Atleos, headquartered in Atlanta, who will initially focus on production AppOps. The ideal candidate will build scalable systems, using automation best practices to improve reliability and velocity and to enable monitoring of the operational health of stacks throughout their life cycle. This includes metrics collection, aggregation, and visualization.
As a member of the SRE team, you will support NCR's Financial Services business unit, product, and technology teams to enhance the design and operation of systems, ensuring they are scalable, reliable, and efficient while maintaining high availability of products and services that reside primarily in the cloud. The SRE will play a crucial role in influencing the development and implementation of reliable production systems and services to meet emerging business needs, such as cloud-based SaaS.
SREs take pride in the resiliency and stability of production systems and are committed to innovation and operational improvement through the application of software engineering practices to operations. You will contribute by making our products easier to adopt and use through enhancements to the product, tools, processes, and documentation. The goal is to achieve six nines (99.9999%) or better availability.
In this role, you will be responsible for maintaining and scaling production services and servers for complex, high-throughput cloud services. You will own the intersection of development, quality, security, and operations, improving scalability, service reliability, capacity, and performance. Your responsibilities will include writing automation code for provisioning and operating infrastructure at massive scale, participating in disaster recovery planning and execution, and collaborating with development teams to create SLIs, SLOs, and SLAs.
You will also develop monitoring architecture, implement monitoring agents, build dashboards, manage escalations and alerts, and participate in incident management and root cause analysis (RCA).