This position involves identifying, designing, and implementing changes to existing services to improve reliability, performance, and standardization across all AWS services, microservices, and serverless services. You will proactively identify inefficient resource utilization and remediate it to improve platform stability and cost efficiency. You will work with product development teams to understand and support their services while working on SRE-related platforms. Key responsibilities include troubleshooting production issues, providing root cause analysis, and designing solutions that prevent recurrence. You will also plan and test for capacity growth, monitor services, create intelligent alarming for faster incident detection and resolution, and build automation and internal tools to improve processes. Hands-on experience with Datadog and Splunk and proficiency in Python, .NET, and Java are essential.
Job Type: Full-time
Career Level: Senior
Industry: Administrative and Support Services
Education Level: Bachelor's degree
Number of Employees: 501-1,000