Machine Learning Ops Engineer, Applied Machine Learning

Apple - Sunnyvale, CA

posted 2 months ago

Full-time - Mid Level

Computer and Electronic Product Manufacturing

About the position

Join Apple's Applied Machine Learning team as a Machine Learning Ops Engineer and help enable Generative AI across our applications and platforms. This position is for engineers who are passionate about infrastructure and distributed systems and who want to build world-class platforms and products at large scale across cloud environments. The Applied Machine Learning team has developed systems for numerous large-scale data science applications, working on high-impact projects that serve various lines of business within Apple. We use the latest open-source technologies and contribute back to these projects.

In this role, you will build LLM (Large Language Model) applications using open-source LLM app frameworks and managed cloud services such as AWS Bedrock and GCP Vertex AI. You will evaluate and port language models onto optimized infrastructure to reduce cost and improve performance, build tools to benchmark and compare embedding databases and LLMs, and build and support CI/CD tools that port and manage applications on AWS/GCP and Kubernetes. Your work will also include building automation for self-healing systems; troubleshooting application-specific, core network, system, and performance issues; and building a multi-tenant system that enforces data protection between different use cases.

The projects are challenging and fast-paced and support Apple's business by delivering innovative solutions. The ideal candidate is self-motivated, proactive, and solution-oriented, ready to tackle complex problems in a dynamic environment.

Responsibilities

Build LLM applications using open-source LLM app frameworks and AWS Bedrock/GCP Vertex AI
Evaluate and port Language Models onto optimized infrastructure to reduce cost and increase performance
Build tools to benchmark and compare various embedding databases and LLMs
Build & Support CI/CD tools to port & manage applications on AWS/GCP & Kubernetes
Build automation to enable self-healing systems
Troubleshoot application-specific, core network, system & performance issues
Build a multi-tenant system that enforces data protection between different use cases
Contribute to challenging, fast-paced projects that support Apple's business by delivering innovative solutions

Requirements

Bachelor's degree with 4+ years of experience
4+ years of experience in Python Programming
Extensive experience in deploying and managing applications on AWS/GCP & Kubernetes
Deep understanding of RAG-based pipelines for model inference and GuardRails
Experience in open-source LLM app frameworks like LangChain/LlamaIndex

Nice-to-haves

BS in computer science with 4+ years of experience, MS with 2+ years of experience, or related experience
Exposure to cloud managed services like AWS Bedrock/GCP Vertex AI
Good understanding of agents in GenAI
Strong experience in infrastructure templating tools like CloudFormation and Terraform
Experience in GitOps based deployment tools like Spinnaker/Flux/ArgoCD
Strong proficiency with Helm and Kustomize for managing Kubernetes applications and configurations
Experience in managing Embeddings using Vector databases
Exposure to prompt engineering
Experience in observability & traceability for Large Language Models
Experience in Performance tuning on operating systems like Linux
Exposure to various LLM infrastructure like GPUs, TPUs & Inferentia
Exposure to LLM runtimes like Triton and frameworks like TensorRT and vLLM
Exposure to general Java troubleshooting skills
