Senior Site Reliability Engineer II

LexisNexis Risk Data Management - Alpharetta, GA

posted 5 months ago

Full-time - Mid Level

Remote - Alpharetta, GA

10,001+ employees

About the position

LexisNexis Risk Solutions is seeking a Senior Site Reliability Engineer II to join our global engineering team. This position is open to applicants based in Alpharetta, GA, or those who prefer a fully remote work environment within the United States. The successful candidate will play a crucial role in shaping the operations and support for critical applications, customers, and projects. This role requires collaboration with Development, QA, IT Operations, and Customer Operations teams, emphasizing effective communication and problem-solving skills in a fast-paced working environment. The position also involves a transition to DevOps practices, agile support, and deployment processes, making it essential for the candidate to be adaptable and forward-thinking.

As a Senior Site Reliability Engineer II, you will be responsible for leading complex reliability and toil reduction projects. This advanced professional role requires a deep understanding of system and application code, enabling you to make data-driven recommendations that balance customer, development, and operational needs. You will act as a subject matter expert, providing guidance to product and development teams to enhance reliability within a product group. Additionally, you will be involved in training and mentoring junior staff, ensuring a culture of learning and continuous improvement within the team.

The role includes an on-call rotation for off-peak hours to maintain 24/7/365 system availability, highlighting the importance of reliability in our operations. You will be expected to recommend service level objectives in partnership with product and development teams, master observability tools and techniques, and act as an escalation point during incidents. Your contributions will be vital in delivering resilient application stacks through Infrastructure as Code and other DevOps practices, as well as monitoring and supporting critical, high-revenue business applications.

Responsibilities

Recommend service level objectives in partnership with Product and Dev teams
Master observability tools and techniques
Act as an escalation point during incidents
Collaborate with development teams to troubleshoot systems and application performance issues
Improve the SRE framework
Champion shared services and platforms to drive reliability
Create disaster recovery plans including advanced fault injection
Advise on SRE training curriculum and content
Deliver resilient application stacks via 'Infrastructure as Code' and other DevOps practices
Monitor and provide ongoing support for critical, high-revenue business applications
Diagnose and resolve complex system and application issues
Work with diverse technical and non-technical teams, including Development, QA, IT Operations, Customer Operations, and Project Management teams
Write and maintain systems/application documentation for technical and non-technical audiences
Migrate existing applications to Cloud environments

Requirements

Professional experience working in the public cloud (AWS, Azure)
Experience with orchestration tools such as Terraform or CloudFormation
Experience with Continuous Integration/Delivery Tools such as GitLab, GitHub, Jenkins
Coding and scripting experience such as PowerShell, Bash, Python or equivalent
Configuration management tools such as Ansible, Puppet, Chef, or equivalents
Hands-on experience with Windows and Linux servers, including support and troubleshooting
Prior analytical and troubleshooting experience
Cloud architecture and system design skills applied to solving key business problems and supporting team goals
Experience migrating applications from on-premises to public cloud

Nice-to-haves

Experience working with containerized workloads such as Docker and Kubernetes
System and application monitoring tools such as Prometheus, Grafana, CloudWatch
Familiarity with Log Management tools such as Elastic Stack, Graylog or Splunk
Experience working with relational databases such as MySQL, MS SQL Server or similar
Use of secret management services such as HashiCorp Vault
Knowledge of change control and associated procedures
Hands-on experience performing static and dynamic application security testing and penetration assessments with tools such as SonarQube, Checkmarx, AppScan, Burp Suite, OWASP ZAP, WebInspect, Fortify, Veracode, Nessus

Benefits

Flexible work environment (remote options)
Investment in staff development
Collaborative and friendly work culture
Opportunities for personal and career development
Support for women in technology initiatives
Diversity and inclusion programs
Health and wellness programs
