Digital Apps SRE

Unclassified - Berkeley Heights, NJ

posted 3 months ago

Full-time - Mid Level

Remote - Berkeley Heights, NJ

About the position

As a Digital Apps Site Reliability Engineer (SRE) at GalaxE, you will provide hands-on support for existing environments. The work spans software installation, patch installation, upgrades, query writing, configuration, security, system monitoring and tuning, disaster recovery planning, and release deployments. You will be part of a 24x7 support team for production Internet applications, diagnosing and resolving issues promptly.

You will be the point of escalation for application support, particularly for complex customer issues in the Portal and Web Services environments. You will drive incident crisis technical bridges and management bridges as necessary, drawing on your experience and organizational knowledge to reduce Mean Time to Recovery (MTTR). You will collaborate with Change Management and Release Managers to review proposed change events for production and participate in all Production Support activities during incidents and outages.

As a hands-on technical resource, you will be expected to resolve all technical issues in both lower and upper environments and to recommend performance and capacity improvements. Documentation is a key part of the role: you will document install defects, assign severity to problems, and conduct postmortems with root cause analysis (RCA) after a fallback.

You will also participate in internal and external audits as required by management and work closely with Engineering to ensure that all relevant Key Performance Indicators (KPIs) are implemented within the monitoring framework. Additionally, you will escalate issues to technology, operations, and/or vendors as appropriate, ensure that database/application controls and procedures remain compliant with Corporate IT risk, and support Disaster Recovery tests and live recovery for all production environments.

Responsibilities

Provide hands-on support for existing environments including software installation, patch installation, upgrades, and configuration.
Perform system monitoring and tuning, disaster recovery planning, and release deployments.
Provide 24x7 support of production Internet applications on a rotating basis.
Act as a point of escalation for application support to diagnose and resolve complex customer issues.
Drive incident crisis technical bridges and management bridges as required to reduce MTTR.
Collaborate with Change Management and Release Managers to review proposed change events for production.
Participate in all Production Support activities during incidents and outages.
Resolve all technical issues within lower and upper environments and recommend performance and capacity improvements.
Document install defects and assign a severity to each problem encountered.
Conduct postmortems with root cause analysis (RCA) after a fallback.
Participate in internal and external audits as required by management.
Work closely with Engineering to implement relevant KPIs within the monitoring framework.
Escalate issues to technology, operations, and/or vendors as appropriate.
Ensure compliance of database/application controls and procedures with Corporate IT risk.
Support Disaster Recovery tests and live recovery for all production environments.

Requirements

Experience configuring web servers such as Nginx or Apache, including reverse proxies.
Proficient in Linux system administration, including managing and troubleshooting Linux systems and services, and bash scripting.
Experience with JBoss or WildFly administration.
Ability to work with third-party vendors.
Willingness to participate in On-Call rotation.

Nice-to-haves

Familiarity with containerization technologies such as Docker and Kubernetes.
Experience with Continuous Integration / Continuous Delivery tools like Azure DevOps and Jenkins.
Solid understanding of routing and networking concepts.
Experience working in an Agile development environment.
Familiarity with collaboration platforms such as Jira, Confluence, wikis, and ServiceNow.

Benefits

Diversity and inclusion initiatives
Opportunities for professional development
Flexible work arrangements
Health and wellness programs
Competitive salary and benefits package
