ByteDance - San Jose, CA

posted 2 days ago

- Mid Level

About the position

ByteDance, founded in 2012, is on a mission to inspire creativity and enrich life through its diverse suite of products, including TikTok and various platforms tailored for the China market. The company emphasizes the importance of creation, not just in its products but also within its teams, fostering an environment where challenges are seen as opportunities for learning and innovation.

The Speech team at ByteDance is dedicated to enhancing content interaction and creation through advanced speech and audio technologies, with research and development in areas such as natural language understanding and multimodal deep learning. We are seeking a Site Reliability Engineer (SRE) to join the team and focus on the reliability, scalability, and performance of our AI applications.

Responsibilities

  • Develop and implement monitoring solutions to track the performance and reliability of AI systems.
  • Respond to incidents, diagnose issues, and implement fixes to minimize downtime.
  • Automate repetitive tasks, streamline deployments, and create tools to improve the efficiency and reliability of AI operations.
  • Analyze and optimize the performance of AI applications and the underlying infrastructure, including algorithm tuning and resource management.
  • Forecast infrastructure needs and ensure that the AI applications have the necessary resources to handle future workloads.
  • Implement and maintain security best practices to protect data and applications, ensuring compliance with relevant regulations.
  • Create and maintain detailed documentation of infrastructure, processes, and procedures to ensure knowledge sharing and continuity.
  • Identify opportunities for process improvements and implement solutions to enhance the reliability and performance of AI systems.

Requirements

  • Strong background in software engineering and systems engineering.
  • Experience in maintaining and optimizing AI and machine learning infrastructure.
  • Proficiency in monitoring and incident response for AI systems.
  • Ability to automate tasks and streamline deployments effectively.
  • Experience in performance optimization of applications and infrastructure.
  • Knowledge of security best practices and compliance regulations.
  • Strong documentation skills for infrastructure and processes.