Unclassified - Berkeley Heights, NJ
posted 2 months ago
As a Digital Apps Site Reliability Engineer (SRE) at GalaxE, you will play a crucial role in providing hands-on support for existing environments. Your responsibilities will span software installation, patching, upgrades, query writing, configuration, security, system monitoring and tuning, disaster recovery planning, and release deployments. You will be part of a 24x7 support team for production Internet applications, ensuring that issues are diagnosed and resolved promptly.
You will be the point of escalation for application support, particularly in diagnosing and resolving complex customer issues in the Portal and Web Services environments. You will drive technical and management bridges during incidents as necessary, leveraging your experience and organizational knowledge to reduce Mean Time to Recovery (MTTR). You will collaborate with Change Management and Release Managers to review proposed production changes and participate in all Production Support activities during incidents and outages. As a hands-on technical resource, you will be expected to resolve technical issues in both lower and upper environments and to recommend performance and capacity improvements.
Documentation is a key aspect of this role: you will document install defects, assign severity to problems, and conduct postmortems, including root cause analysis (RCA), after fallback. You will also participate in internal and external audits as required by management and work closely with Engineering to ensure that all relevant Key Performance Indicators (KPIs) are implemented within the monitoring framework. Additionally, you will escalate issues to technology, operations, and/or vendors as appropriate, and ensure that database and application controls and procedures remain compliant with Corporate IT risk requirements. You will also support Disaster Recovery tests and live recovery for all production environments.