Applab Systems - Princeton, NJ
posted 3 months ago
The Site Reliability Engineering (SRE) position is a full-time, permanent, onsite role based in O'Fallon, Missouri. The role is critical to ensuring the reliability and performance of applications in a production environment. The SRE will be responsible for incident resolution, change implementation, and the operational tasks that keep applications running smoothly. The position demands a proactive approach to monitoring and troubleshooting, with a strong emphasis on collaborating with development and support teams to address issues effectively.

Key responsibilities include reviewing and resolving incidents raised by monitoring alerts, including those from the Operation Command Center and Enterprise Monitoring Operations. The SRE will also deploy application-related artifacts during approved release windows, report deployment issues, and coordinate with development teams to resolve them. Additionally, the role involves resolving work orders related to business queries, performing traffic routing for infrastructure maintenance, and conducting detailed root cause analysis for high-severity incidents.

The SRE will support User Acceptance Testing (UAT) conducted by the product team and assist in onboarding new customers to the platform. Onboarding includes configuring applications, testing file processing, and ensuring timely report delivery. The role also requires raising and reviewing change tickets, creating documentation in Confluence for new incidents and work orders, and collaborating with customers on ad-hoc queries.

Furthermore, the SRE will be expected to build automation scripts that improve processes and reduce incidents, support the completion of Post Incident Reports, and participate in War Room calls for critical incidents. Flexibility in working hours, including shifts and weekend support, is essential for this position.