Lead Site Reliability Engineer (Hybrid)

Cognizant Technology Solutions - Dallas, TX

posted 3 months ago

Full-time - Senior

10,001+ employees

Professional, Scientific, and Technical Services

About the position

Cognizant is seeking a highly qualified Lead Site Reliability Engineer (SRE) to join our Digital Engineering practice. This role is pivotal in developing and building scalable enterprise applications that meet the high demands of our clients. As a Lead SRE, you will be part of a dynamic digital software team dedicated to delivering high-quality, reliable, and maintainable code. You will collaborate closely with product managers, designers, and clients to make informed decisions that lead to the rapid delivery of valuable software solutions. Our engineers are committed to continuous improvement, regularly reflecting on our processes to identify areas for enhancement and to celebrate our successes. Success in this role is measured by the effectiveness of our team and the satisfaction of our customers.

In this position, you will be responsible for implementing robust monitoring and alerting systems that focus on symptoms rather than just outages. You will document your findings and turn them into repeatable actions and, ultimately, automation. You will improve deployment, change management, and release management processes to make them more efficient. You will also debug production issues across services and levels of the stack, propose solutions to improve system resiliency, availability, and security, and plan configuration changes at both the application and infrastructure levels. You will use monitoring insights to identify opportunities to improve system performance, and you will conduct thorough Root Cause Analysis (RCA) investigations to address incidents effectively.

As a Lead SRE, you will also play a key role in advancing our DevSecOps practices, accelerating delivery, and resolving technical challenges. You will provide input into strategic technology roadmaps and respond to customer incidents with a focus on support and resolution. This hybrid position is based in Fort Worth, Dallas, TX, and offers the opportunity to work in a collaborative and innovative environment.

Responsibilities

Configure monitoring and alerting to notify on symptoms rather than only on outages.
Document findings and turn them into repeatable actions and automation.
Streamline the deployment, change management, and release management processes to make them more efficient.
Debug production issues across services and levels of the stack.
Propose ideas and solutions within the product team to improve resiliency, availability, and security.
Plan and implement configuration change operations at both the application and infrastructure levels.
Actively look for opportunities to improve the availability and performance of the system based on monitoring and observation.
Complete Root Cause Analysis (RCA) investigations.
Improve DevSecOps practices and accelerate delivery, taking a lead role in solving technical issues.
Assist in providing inputs to develop strategic technology roadmaps.
Respond to incidents and provide support for customer incidents.

Requirements

12 to 15 years of overall experience.
7 to 8 years of solid SRE working experience.
Experience implementing GitHub, GitHub Actions CI/CD, and Azure DevOps (ADO) for automation.
Experience implementing monitoring and observability on Kubernetes, including AKS and Azure cloud.
Experience with monitoring and metrics in Dynatrace, Prometheus, Grafana, and integrations with Moogsoft/xMatters.
Experience with open-source logging infrastructure.
2 years of experience in an environment with Node.js and GraphQL (GQL).
Hands-on experience with Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) tools and platforms, and with containers and container orchestration platforms (Docker and Kubernetes).
Expertise in one or more cloud-native relational databases such as MySQL and PostgreSQL, and in NoSQL databases such as Cassandra and MongoDB.

Nice-to-haves

Experience with Terraform in Azure and on-prem infrastructure resources.
Experience with load balancing for applications, including proxies and CDNs, and with automating it.
Ability to script automated performance testing scenarios for APIs and web front ends and embed them in CI/CD pipelines.
Experience in the airline industry.
Familiarity with TypeScript and JavaScript.
Experience with databases and persistence frameworks: MongoDB, Oracle, object-relational mapping (ORM), and query performance tuning.
Experience with MongoDB schema design and the MongoDB aggregation framework.
Experience with web services: GraphQL, REST/SOAP (JSON/WSDL/XML).
Experience with database administration (SQL Server), system administration, network troubleshooting, and VM management.

Benefits

Cognizant is recognized as a Military Friendly Employer.
Cognizant Veterans Network assists Veterans in building and growing a career at Cognizant.
Cognizant is committed to creating a diverse environment and considers all applicants without regard to race, creed, color, national origin, ancestry, age, marital and family status, disabilities, sexual orientation or preference, veteran status or any other classification protected by law.
