Capital One - Boston, MA

posted 4 months ago

Full-time - Senior
Boston, MA
Credit Intermediation and Related Activities

About the position

At Capital One, we believe that AI and machine learning represent the biggest opportunity in financial services today: a chance to revolutionize the industry with more real-time, personalized experiences than was ever possible. Our mission is to use the power of machine learning to deliver better financial services to our customers by creating trustworthy, reliable and human-in-the-loop systems. From informing customers about unusual charges to answering their questions in real time, our AI/ML capabilities are bringing humanity and simplicity to banking.

Because of our investments in public cloud infrastructure that provides on-demand compute and storage for machine learning, and our principled approach to building enterprise platforms led by our best talent, we are now positioned to harness the power of generative AI in a way that few other organizations can. Capital One's commitment to AI has sponsorship from the CEO, the Board of Directors, and the executive committee of the company. We are committed to building world-class applied science and engineering teams on the foundation of our industry-leading data and AI/ML capabilities, with breakthrough product experiences.

The Vice President, Platform Operations, will lead and optimize our Machine Learning & Artificial Intelligence platform operations. This executive will report to the Senior Vice President, Head of Machine Learning Experience, in our engineering organization. As this role will serve as a member of our leadership team within our Enterprise Data Machine Learning organization, it is of paramount importance that this individual values diverse perspectives, fosters collaboration, encourages innovative ideas, and can create a place where associates of all backgrounds can thrive by bringing their most authentic selves to work. As we continue to grow and expand, we are seeking a highly skilled and motivated Sr. Director of Platform Operations to lead and optimize our ML/AI platform operations.

Our Sr. Director, Platform Operations, will play a pivotal role in setting the roadmap and overseeing the day-to-day management of our AI and ML platforms. This includes setting strategies for and overseeing container management in public cloud (AWS), cloud resource provisioning, low latency and high availability of cloud resources, and cloud optimization. They will maintain a deep understanding of the technical aspects of the platform, including infrastructure, algorithms, APIs and integrations, and provide operations leadership to the engineering and production teams. Additionally, they will implement robust processes and operations dashboards to monitor, in real time, platform performance, user feedback, adherence to service level agreements (SLAs), observability, resiliency, and key operational metrics.

Collaboration with cyber, technology risk management, security and compliance teams will be essential to understand the company's cyber, risk and compliance requirements. The Sr. Director will work closely with product and engineering to ensure the platform adheres to industry best practices and to corporate cyber and tech risk management standards, and will implement automation and dashboards that visualize vulnerabilities, platform incidents, cloud controls compliance, and cloud resource utilization to enable proactive decision making and risk mitigation. Finally, they will build a high-performing operations team, recruiting world-class SREs, production engineers, and data engineers, and developing and retaining talent on the team.

Responsibilities

  • Set the roadmap and oversee the day-to-day management of AI and ML platforms.
  • Set strategies and oversee container management in public cloud (AWS).
  • Manage cloud resource provisioning, ensuring low latency and high availability of cloud resources.
  • Implement robust processes and operations dashboards to monitor platform performance and user feedback.
  • Collaborate with cyber, technology risk management, security and compliance teams to understand requirements.
  • Work closely with product and engineering to ensure adherence to industry best practices.
  • Implement automation and dashboards to visualize vulnerabilities and platform incidents.
  • Develop a long-term vision and roadmap for platform operations enhancements.
  • Build and lead a high-performing operations team, recruiting and retaining top talent.

Requirements

  • Bachelor's degree.
  • At least 9 years of experience managing platform operations, infrastructure operations, or Site Reliability Engineering in a public cloud environment.
  • At least 7 years of people management experience.

Nice-to-haves

  • Master's degree in a STEM field (Science, Technology, Engineering, or Mathematics).
  • 5+ years of experience managing large-scale, high-performance distributed systems as a Site Reliability Engineer or a product engineer.
  • 5+ years of experience setting up and scaling observability platforms and creating operational health dashboards.
  • 3+ years of experience building systems and solutions within a regulated environment.
  • 3+ years of experience in artificial intelligence, machine learning, or cloud infrastructure.
  • 3+ years of experience managing distributed, multi-tenant systems, microservices, and container orchestration (Kubernetes).
  • 5+ years of experience with the machine learning lifecycle and familiarity with major machine learning frameworks.

Benefits

  • Comprehensive health benefits.
  • Financial benefits including performance-based incentives and bonuses.
  • Inclusive workplace culture that values diversity and belonging.