New York Life - Lebanon, NJ

posted 2 months ago

Full-time - Mid Level
Lebanon, NJ
Insurance Carriers and Related Activities

About the position

As part of Technology, you'll have the opportunity to contribute to groundbreaking initiatives that shape New York Life's digital landscape. Leverage cutting-edge technologies like Generative AI to increase productivity, streamline processes, and create seamless experiences for clients, agents, and employees. Your expertise fuels innovation, agility, and growth - driving the company's success.

The Enterprise Data Management (EDM) team is seeking a skilled Cloud Data Platform Site Reliability Engineer (SRE) to help build and maintain our core data, reporting, and analytics platform for the Insurance & Agency Group at New York Life. You will be responsible for ensuring the reliability, performance, and scalability of our cloud-based data infrastructure, leveraging AWS services to create a robust and secure environment.

Responsibilities

  • Develop and maintain monitoring, alerting, and logging systems to proactively detect and resolve incidents.
  • Perform root cause analysis and implement solutions to prevent recurrence.
  • Manage incident response, including on-call rotations, triaging, and escalation.
  • Create and manage Infrastructure as Code (IaC) using tools like Terraform.
  • Automate deployments, scaling, backups, and disaster recovery processes.
  • Develop and maintain CI/CD pipelines to ensure smooth deployment and rollback processes.
  • Analyze performance metrics and optimize infrastructure and application performance.
  • Define and enforce Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
  • Conduct capacity planning and scaling to manage anticipated loads.
  • Implement security best practices, including network security, IAM policies, and encryption.
  • Conduct security audits and compliance checks to ensure regulatory adherence.
  • Respond to security incidents and implement remediation measures.
  • Work with development teams to ensure services are reliable, scalable, and easily monitored.
  • Collaborate with cross-functional teams to design, build, and maintain cloud infrastructure.
  • Identify and implement improvements to operational processes and workflows.
  • Design, implement, and test disaster recovery and business continuity plans.
  • Ensure regular backups and replication to minimize data loss and downtime.

Requirements

  • 3+ years of experience as a Cloud Site Reliability Engineer.
  • 1+ years of experience with AWS services (S3, EC2, Glue, Redshift, RDS) in shared service or hybrid environments.
  • Proficiency in AWS services (EC2, S3, RDS, Lambda, VPC, CloudWatch, IAM, etc.).
  • Strong knowledge of scripting languages (Python, Bash, etc.) and automation tools (Terraform).
  • Experience with CI/CD tools and DevOps practices.
  • Familiarity with monitoring and logging tools.
  • Strong troubleshooting and problem-solving skills, with exposure to Machine Learning (ML) and Artificial Intelligence (AI) fields.
  • Exposure to industry-standard Data Governance processes and procedures.
  • Bachelor's degree in Computer Engineering, Computer Science, MIS, or a related field is preferred but not required.

Benefits

  • Leave programs
  • Adoption assistance
  • Student loan repayment programs
  • Annual discretionary bonus eligibility
  • Incentive program participation eligibility