IFS is hiring a

Lead / Senior Lead Site Reliability Engineer - WorkWave

Colombo, Sri Lanka
Full-Time

The WorkWave Team is seeking an experienced Lead / Senior Lead Site Reliability Engineer (SRE) to drive reliability, scalability, and operational excellence across our cloud-based infrastructure. This role is crucial to ensuring high availability, effective monitoring, and streamlined deployment processes across various environments, including AWS and hybrid systems. The Lead / Senior Lead SRE will work closely with cross-functional teams to optimize system reliability and efficiency, actively contributing to a robust infrastructure that supports business growth.

Responsibilities

  • Design, manage, and optimize scalable infrastructure across cloud environments with a focus on reliability, availability, and performance. Implement comprehensive monitoring and observability systems to ensure proactive issue detection and resolution.

  • Lead incident response for critical infrastructure issues across cloud platforms, drive root cause analysis, and implement corrective measures to minimize recurrence.

  • Collaborate with cross-functional teams to create efficient, automated CI/CD pipelines that support cloud, hybrid, and on-prem deployments, enabling smooth and reliable delivery.

  • Apply IaC best practices across environments using tools that ensure consistent provisioning, configuration, and management of resources in cloud environments.

  • Ensure new services meet reliability and scalability requirements across all environments before deployment. Conduct capacity planning and performance tuning to adapt to business needs.

  • Develop and maintain comprehensive documentation for infrastructure, deployment workflows, monitoring configurations, and incident management procedures, providing clear guidance across teams.

  • Provide mentorship and technical guidance to team members, sharing knowledge of best practices in reliability engineering and infrastructure management.

  • Research and integrate new tools and technologies to improve the efficiency, scalability, and resilience of our SRE processes across cloud and hybrid infrastructures.

Qualifications

  • Bachelor’s or Master’s Degree in Computer Science, Information Technology, or a related field.

  • 4-5+ years of experience in Site Reliability Engineering or DevOps with a focus on multi-environment infrastructure and cloud platforms.

  • Strong track record of managing and optimizing infrastructure in production environments, including incident management and system troubleshooting.

  • Proficiency in CI/CD pipeline automation and infrastructure as code practices across cloud and hybrid environments.

 

Skills and Competencies

  • Expertise in monitoring, observability, and incident management using tools like Grafana, AWS X-Ray, and CloudWatch, with a focus on RCA and proactive alerting.
  • Proficiency in automation and scripting (e.g., Python, Bash) and Infrastructure as Code (IaC) tools such as Terraform or AWS CloudFormation.
  • In-depth knowledge of AWS services for reliability, including Auto Scaling, Elastic Load Balancing, RDS, and S3, with a focus on high availability and fault tolerance.
  • Hands-on experience with CI/CD pipelines using AWS CodePipeline, CodeBuild, or third-party tools integrated with AWS services.
  • Excellent communication and collaboration skills to drive system reliability and foster cross-functional teamwork in a cloud-first environment.

We believe that coming together as a community, in person, is important for innovation, connection and fostering a sense of belonging. Our roles have the right balance of remote and in-office working to enable flexibility for managing your life along with ensuring a real connection with your colleagues and the broader IFS community.
