Site Reliability Engineering (SRE)

Applab Systems - Princeton, NJ

posted 3 months ago

Full-time

Princeton, NJ

Professional, Scientific, and Technical Services

About the position

The Site Reliability Engineering (SRE) position is a full-time, permanent role based in O'Fallon, Missouri, requiring onsite presence. This role is critical in ensuring the reliability and performance of applications in a production environment. The SRE will be responsible for incident resolution, change implementation, and supporting various operational tasks that ensure the smooth functioning of applications. The position demands a proactive approach to monitoring and troubleshooting, with a strong emphasis on collaboration with development and support teams to address issues effectively.

Key responsibilities include reviewing and resolving incidents that arise from various monitoring alerts, including those from the Operation Command Center and Enterprise Monitoring Operations. The SRE will also be tasked with deploying application-related artifacts during approved release windows, reporting deployment issues, and coordinating with development teams for resolution. Additionally, the role involves resolving work orders related to business queries, performing traffic routing for infrastructure maintenance, and conducting detailed root cause analysis for high-severity incidents.

The SRE will support User Acceptance Testing (UAT) by the product team and assist in onboarding new customers to the platform. This includes configuring applications, testing file processing, and ensuring timely report delivery. The role also requires raising and reviewing change tickets, creating documentation in Confluence for new incidents and work orders, and collaborating with customers on ad-hoc queries.

Furthermore, the SRE will be expected to build automation scripts to enhance processes and reduce incidents, support the completion of Post Incident Reports, and participate in War Room calls for critical incidents. Flexibility in working hours, including shifts and weekend support, is essential for this position.

Responsibilities

Review and resolve incidents arising from Operation Command Center Alerts, Enterprise Monitoring Operations, OMNIBUS, and Splunk Alerts.
Deploy application-related artifacts to production environments during approved release windows.
Report issues with deployments and coordinate with Development Teams to resolve deployment issues.
Resolve work orders related to business/functional queries, ad-hoc testing, verification, and validation from Regional product and customer support teams.
Perform traffic routing in support of infrastructure maintenance.
Conduct detailed root cause analysis for high-severity incidents and implement preventive actions.
Support UAT by the Product team and Regional customer support team.
Configure applications/artifacts and assist in onboarding new customers to the platform.
Test newly onboarded customers' file processing and report delivery.
Raise new change tickets and arrange for approvals, including CAB approvals.
Review and approve change tickets.
Create Confluence pages for newly analyzed work orders and incidents with resolution steps.
Collaborate with customers on ad-hoc queries.
Work with Development/Testing teams for defect analysis using production-simulated data.
Build automation scripts to reduce incidents and improve processes.
Assist customers in filling out Post Incident Reports for high-impact incidents.
Participate in or initiate War Room calls for incidents that impact application availability or customer experience.
Work in shifts (morning and afternoon) and provide weekend support.

Requirements

Application/Production L2 Support experience is a must.
Proficiency in Unix front-end troubleshooting, Oracle SQL, and Java.
Experience with monitoring tools such as Splunk and Dynatrace.
Familiarity with DevOps tools.
