The WorkWave Team is seeking an experienced Lead / Senior Lead Site Reliability Engineer (SRE) to drive reliability, scalability, and operational excellence across our cloud-based infrastructure. This role is crucial to ensuring high availability, effective monitoring, and streamlined deployment processes across environments, including AWS and hybrid systems. The Lead / Senior Lead SRE will work closely with cross-functional teams to optimize system reliability and efficiency, contributing to a robust infrastructure that supports business growth.
Responsibilities
Design, manage, and optimize scalable infrastructure across cloud environments with a focus on reliability, availability, and performance. Implement comprehensive monitoring and observability systems to ensure proactive issue detection and resolution.
Lead incident response for critical infrastructure issues across cloud platforms, drive root cause analysis, and implement corrective measures to minimize recurrence.
Collaborate with cross-functional teams to create efficient, automated CI/CD pipelines that support cloud, hybrid, and on-prem deployments, enabling smooth and reliable delivery.
Apply infrastructure as code (IaC) best practices, using tools that ensure consistent provisioning, configuration, and management of resources across cloud environments.
Ensure new services meet reliability and scalability requirements across all environments before deployment. Conduct capacity planning and performance tuning to adapt to business needs.
Develop and maintain comprehensive documentation for infrastructure, deployment workflows, monitoring configurations, and incident management procedures, providing clear guidance across teams.
Provide mentorship and technical guidance to team members, sharing knowledge of best practices in reliability engineering and infrastructure management.
Research and integrate new tools and technologies to improve the efficiency, scalability, and resilience of our SRE processes across cloud and hybrid infrastructures.
Qualifications
Bachelor’s or Master’s Degree in Computer Science, Information Technology, or a related field.
4-5+ years of experience in Site Reliability Engineering or DevOps with a focus on multi-environment infrastructure and cloud platforms.
Strong track record of managing and optimizing infrastructure in production environments, including incident management and system troubleshooting.
Proficiency in CI/CD pipeline automation and infrastructure as code practices across cloud and hybrid environments.
Skills and Competencies
We believe that coming together as a community, in person, is important for innovation, connection and fostering a sense of belonging. Our roles have the right balance of remote and in-office working to enable flexibility for managing your life along with ensuring a real connection with your colleagues and the broader IFS community.