Site Reliability Engineer II

Microsoft - Aliso Viejo, CA

posted 3 months ago

Full-time - Mid Level

Publishing Industries

About the position

Microsoft is a company where passionate innovators come to collaborate, envision what can be, and take their careers further. This is a world of more possibilities, more innovation, more openness, and sky's-the-limit thinking in a cloud-enabled world.

We are looking to hire a Site Reliability Engineer II to join our Azure Data engineering team. The team is leading the transformation of analytics in the world of data with products like databases, data integration, big data analytics, messaging & real-time analytics, and business intelligence. The products in our portfolio include Microsoft Fabric, Azure SQL DB, Azure Cosmos DB, Azure PostgreSQL, Azure Data Factory, Azure Synapse Analytics, Azure Service Bus, Azure Event Grid, and Power BI. Our mission is to build the data platform for the age of AI, powering a new class of data-first applications and driving a data culture.

Within Azure Data, the databases team builds and maintains Microsoft's operational database systems. We store and manage data in a structured way to enable a multitude of applications across various industries. We are on a journey to enable developer-friendly, mission-critical, AI-enabled operational databases across relational, non-relational, and OSS offerings.

The Service Reliability Team is responsible for ensuring our critical services are running efficiently, securely, and with high reliability. We work with many different teams to improve service reliability by continually innovating tooling, automation services, and processes to make supporting our products scalable and efficient.

We do not just value differences or different perspectives; we seek them out and invite them in so we can tap into the collective power of everyone in the company. As a result, our customers are better served.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees, we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

Demonstrates expertise in distributed systems design and interactions between cloud technology layers and components.
Identifies and recommends optimal configurations for cloud technology solutions and modifies the code base to improve reliability and operability.
Develops an understanding of the code, features, and operations of specific products at scale to contribute to improvements in product availability and performance.
Researches and maintains awareness of industry trends and advances in distributed systems and cloud technologies.
Suggests changes or add-ons to product features or code to improve availability and performance based on telemetry data analysis.
Designs and implements Service Reliability services, tooling, and processes.
Generates software specifications, proof-of-concepts, and prototype solutions based on high-level feature requirements.
Engages with product engineering teams through code/design reviews and incident responses to propose improvements in code base and designs.
Independently develops code or scripts that automate operations processes across product components.
Identifies opportunities to leverage existing tools and automation to increase the velocity of product engineering teams.
Designs, develops, and maintains telemetry pipelines and monitoring tools for operations metrics.
Troubleshoots problems affecting availability, reliability, and performance of components and features.
Responds to incidents during on-call rotations and deploys fixes to resolve root causes.
Develops alerts and instrumentation to monitor product capacity and resource demands.

Requirements

Expertise in distributed systems design and cloud technology interactions.
Experience writing and modifying infrastructure code to improve system reliability.
Ability to analyze production telemetry data for insights into product performance.
Experience in designing and implementing automation for operational processes.
Knowledge of monitoring tools and telemetry pipelines for operational metrics.
Ability to troubleshoot and resolve incidents affecting product reliability.

Nice-to-haves

Familiarity with Azure cloud services and products.
Experience with big data analytics and business intelligence tools.
Knowledge of AI-enabled operational databases.

Benefits

Health insurance coverage
401k retirement savings plan
Paid holidays
Flexible scheduling options
Professional development opportunities
