Site Reliability Engineer

NCR Atleos - Atlanta, GA

posted 5 months ago

Full-time - Mid Level

Atlanta, GA

10,001+ employees

Credit Intermediation and Related Activities

About the position

We are looking for a Site Reliability Engineer (SRE) at NCR Atleos, headquartered in Atlanta, who will initially focus on production AppOps. The ideal candidate will build scalable systems using automation best practices to improve reliability and velocity and to enable monitoring of the operational health of stacks throughout their life cycle. This includes metrics collection, aggregation, and visualization.

As a member of the SRE team, you will support NCR's Financial Services business unit, product, and technology teams to enhance the design and operation of systems, ensuring they are scalable, reliable, and efficient while maintaining high availability of products and services primarily residing in the cloud. The SRE will play a crucial role in influencing the development and implementation of reliable production systems and services to meet emerging business needs, such as cloud-based SaaS. SREs take pride in the resiliency and stability of production systems while being committed to innovation and operational improvement through the application of software engineering practices to operations. You will facilitate innovation and operational improvement by making our products easier to adopt and use through enhancements to the product, tools, processes, and documentation. The goal is to achieve six nines (99.9999%) or better in availability/uptime.

In this role, you will be responsible for maintaining and scaling production services and servers for complex, high-throughput cloud services. You will bridge and own the union between development, quality, security, and operations, improving scalability, service reliability, capacity, and performance. Your responsibilities will include writing automation code for provisioning and operating infrastructure at massive scale, participating in disaster recovery planning and execution, and collaborating with development teams to create SLIs, SLOs, and SLAs. You will also develop monitoring architecture, implement monitoring agents, build dashboards, manage escalations and alerts, and participate in incident management and root cause analysis (RCA).

Responsibilities

Maintain and scale production services and servers for complex, high-throughput cloud services.
Bridge and own the union between development, quality, security, and operations.
Improve scalability, service reliability, capacity, and performance.
Write automation code for provisioning and operating infrastructure at massive scale.
Initiate and contribute to continuous improvement of software delivery processes and practices.
Use automation extensively to design, configure, manage, and monitor systems in support of product development teams.
Participate in disaster recovery planning and execution.
Maintain and patch servers supporting SaaS products, including Windows and Linux Servers.
Collaborate with all teams to ship code to production using CI/CD and AppSec tooling.
Create SLIs, SLOs, and SLAs in collaboration with development teams.
Provide timely assistance and remediation solutions during critical situations and production incidents.
Develop monitoring architecture, implement monitoring agents, build dashboards, manage escalations and alerts.
Participate in incident management and drive root cause analysis (RCA) and risk management processes.
Participate in a rotating on-call schedule during off-hours.

Requirements

BS degree in Computer Science or a related technical field, or 5 years of prior relevant experience.
Extensive experience in a DevOps/SRE role, with demonstrable experience deploying and managing large-scale production environments in Google Cloud Platform, AWS, or Azure and in multi-datacenter environments.
Experience developing and debugging code in languages such as Java, C, C++, .NET, Python, Ruby, Go, Shell, Perl, JavaScript.
2+ years deploying and supporting high traffic, scalable web applications/services.
2+ years with cloud virtualization and PaaS.
2+ years with AWS/Google Cloud Platform/Azure.
2+ years with Docker, Kubernetes and early versions of OpenShift.
Experience with Linux, shell scripting, PKI, TLS/SSL, networking, firewalls, load balancers, and backups.
Experience in designing, analyzing and running large-scale distributed systems.
Experience hosting and solving problems with public-facing services securely in Azure, AWS or Google Cloud Platform.
Experience with orchestration, automation, and configuration management tools such as Git, Fabric, and Ansible (or Puppet, Chef, Terraform, Helm, or related technologies).
Excellent analysis, debugging, root-cause identification, and troubleshooting skills.
Experience with Kubernetes, system virtualization, on-prem and/or hybrid cloud computing, cloud identity and security systems, cloud monitoring and logging, and/or local/cloud storage.
Experience with one or more CI tools, such as Jenkins, Artifactory, Harness, or Cloud Build.
Experience with application disaster recovery, migration, roll-back plans, expansion, routine deployments, and system upgrades.
Experience with log management, including aggregation, alerting, and graphing.

Nice-to-haves

Experience with Cassandra, Elasticsearch or Kafka.
Cloud certifications and exposure to Harness.

Benefits

Medical Insurance
Dental Insurance
Life Insurance
Vision Insurance
Short/Long Term Disability
Paid Vacation
401k
