Microsoft - Redmond, WA

posted 2 months ago

Full-time - Mid Level
Publishing Industries

About the position

Microsoft is seeking a Site Reliability Engineer II (SRE) to join our Silver Infrastructure and Sovereign Operations team. This pivotal role involves defining operations for new, existing, and emerging environments. We are looking for a candidate who thrives on solving complex issues, has a clear vision, and possesses the ability to execute end-to-end programs effectively.

As a Site Reliability Engineer II, you will be instrumental in defining operating models for deploying and managing systems within sovereign and air-gapped environments. This role offers the unique opportunity to collaborate with engineers dedicated to enabling a wide range of Azure services for both internal and external customers in highly secured and regulated industries. The systems, processes, and frameworks you develop will be essential in meeting the stringent security policy and assurance requirements of our diverse customer base in the public and private sectors. If you are passionate about operational excellence and have a track record of success in similar environments, we encourage you to apply and help shape the future of our operations.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees, we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

  • Defines and develops standardized, repeatable, scalable solutions to guarantee quality and efficient operations.
  • Drives the design, optimization, efficiency, and reliability of service management.
  • Communicates on a deeply technical level with software engineers, project management, and operations teams to improve and optimize products, improve infrastructure, reduce manual toil, and evolve services.
  • Drives efforts to collect, classify, and analyze data on a range of metrics.
  • Drives the refinement of products through data analytics and makes informed decisions in engineering products through data integration.
  • Drives efforts to integrate instrumentation for gathering telemetry data on system behavior such as performance, reliability, availability, and usage.
  • Drives sustaining feedback loops from telemetry resulting in subsequent designs.
  • Creates outputs of telemetry such as notifications or dashboards.
  • Applies debugging tools and examines logs, telemetry, and other sources to verify assumptions, writing and developing code proactively before issues occur and reactively as they occur.
  • Conducts retrospective debugging of solutions to identify root causes of problems.
  • Reviews and writes issue postmortems and shares insights with the team.
  • Builds, enhances, reuses, contributes to, and identifies new software developer tools/processes to support other programs and applications to create, debug, and maintain code for products.
  • Uses open source when appropriate.
  • Begins to develop skills in other tools/topics outside areas of experience.
  • Identifies internal tools and/or creates tools that will be useful for creating the product, determining if methods are still applicable for the current solution.
  • Shares best practices and teaches others about new tools and strategies.
  • Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor system/product/service for degradation, downtime, or interruptions.
  • Alerts stakeholders as to status and initiates actions to restore system/product/service for simple problems and complex problems when appropriate.
  • Responds within Service Level Agreement (SLA) timeframe.
  • Drives efforts to reduce incident and request volumes, looking globally at incidents and providing broad resolutions.
  • Escalates issues to appropriate owners.
  • Meets on-call responsibilities periodically to support 24x7 operations.

Requirements

  • 4+ years technical experience in software engineering, network engineering, or systems administration.
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration.
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration.
  • Active U.S. Government Top Secret Security Clearance.
  • Ability to meet Microsoft, customer and/or government security screening requirements.

Nice-to-haves

  • 3+ years of experience with PowerShell, C#, or C++.
  • Experience working on large-scale distributed services with on-call responsibilities.
  • Ability to build and influence broadly towards common goals and priorities.
  • Ownership for end-to-end project lifecycle with solid project management and communication skills.
  • Experience applying SRE principles in a large production environment.

Benefits

  • Health insurance coverage
  • Dental insurance coverage
  • Vision insurance coverage
  • 401(k) retirement savings plan
  • Paid holidays and vacation time
  • Flexible scheduling options
  • Professional development opportunities
  • Employee stock purchase plan
  • Tuition reimbursement
  • Mental health days
  • Wellness programs