American Express - New York, NY

posted 4 days ago

Full-time - Mid Level
New York, NY
Credit Intermediation and Related Activities

About the position

The Sr. Infrastructure Engineer - Network & Cloud at American Express is responsible for designing, implementing, and managing robust infrastructure solutions that ensure reliability, scalability, and performance across compute, storage, network, and cloud technologies. This role involves leading incident response efforts, performing root cause analysis, and developing automation tools while collaborating with engineering teams to enhance infrastructure systems. The position emphasizes continuous improvement and operational excellence within a diverse and inclusive tech environment.

Responsibilities

  • Ensure the reliability, availability, and performance of the entire infrastructure stack including compute, storage, network and cloud components.
  • Lead incident response efforts across the infrastructure stack, coordinating with Application Support, SRE, and Engineering teams to minimize MTTD and MTTR.
  • Perform root cause analysis for infrastructure-related incidents and implement corrective actions.
  • Develop and maintain automation tools for managing infrastructure resources.
  • Collaborate with Engineering teams to plan and execute system upgrades and maintenance.
  • Conduct capacity planning and resource management for all infrastructure components.
  • Participate in on-call rotations to provide 24x7 support for all critical infrastructure issues.
  • Design and implement disaster recovery plans and business continuity strategies.
  • Implement best practices for monitoring, logging, and alerting across the infrastructure.
  • Foster a culture of continuous improvement and operational excellence.
  • Analyze complex infrastructure problems, design scalable and resilient solutions, and lead the implementation of these solutions.
  • Collaborate with architects and other engineers to design and enhance the architecture of infrastructure systems, ensuring alignment with business needs and technology standards.

Requirements

  • Proven experience managing and optimizing a diverse infrastructure stack.
  • Extensive knowledge of cloud platforms (AWS, Azure, Google Cloud Platform) and infrastructure as code (Terraform, CloudFormation).
  • Familiarity with service mesh technologies (Istio, Linkerd).
  • Solid understanding of virtualization (VMware, Hyper-V), containerization (Docker), and container orchestration (Kubernetes).
  • Understanding of storage solutions (SAN, NAS, cloud storage) and backup systems.
  • Strong understanding of network protocols, routing, switching, and firewalls (Palo Alto).
  • Experience with load balancers (F5, HAProxy, Nginx) and network monitoring tools.
  • Experience with DNS management and troubleshooting.
  • Experience with network security best practices.
  • Proficiency in monitoring and observability tools (Prometheus, Grafana, Splunk).
  • Proficiency in at least one scripting language (Python, Bash) for automation.
  • Experience with CI/CD pipeline management and DevOps practices.
  • Strong understanding of disaster recovery and business continuity planning.
  • Experience with performance tuning and capacity planning.
  • Understanding of chaos engineering principles and practices.
  • Skills in cost optimization for cloud infrastructure.
  • Familiarity with Akamai.

Nice-to-haves

  • 10+ years of experience using cloud-native monitoring tools such as AWS CloudWatch, Azure Monitor, and Google Cloud Operations Suite.
  • Experience with packet capture tools like Wireshark for troubleshooting network issues.
  • Experience using traceroute utilities and performance analysis tools such as perf to identify and resolve bottlenecks.
  • Familiarity with tools such as ipconfig/ifconfig for viewing network configurations, flushing DNS, and diagnosing network issues.
  • Experience with SNMP-based tools for network device monitoring and performance management.
  • Experience using NetFlow for network traffic analysis.
  • Experience with tools like iostat, vmstat, and dstat for monitoring storage and system performance.
  • Experience with tools such as df, du, lsblk, and fdisk for managing and troubleshooting file systems and disk partitions.

Benefits

  • Competitive base salaries
  • Bonus incentives
  • 6% company match on retirement savings plan
  • Free financial coaching and financial well-being support
  • Comprehensive medical, dental, vision, life insurance, and disability benefits
  • Flexible working model with hybrid, onsite or virtual arrangements depending on role and business need
  • 20+ weeks paid parental leave for all parents, regardless of gender, offered for pregnancy, adoption or surrogacy
  • Free access to global on-site wellness centers staffed with nurses and doctors (depending on location)
  • Free and confidential counseling support through our Healthy Minds program
  • Career development and training opportunities