Position Summary: The Site Reliability Engineer (SRE) will be a key player in building and scaling our cloud infrastructure.
Responsibilities:
- Architect and manage Amazon Web Services (AWS) cloud environments, including EC2, VPC, S3, and other key resources, ensuring resilience, scalability, and cost-efficiency
- Lead the design, deployment, and optimization of Kubernetes clusters using AWS EKS, leveraging container orchestration to support the scalability of our applications
- Collaborate closely with our software engineers to streamline and enhance our CI/CD pipelines, infrastructure as code (IaC) practices, and containerization processes
- Implement and maintain monitoring and alerting systems (Datadog or similar) to ensure performance, reliability, and early detection of potential issues
- Manage and oversee high-impact incidents, swiftly troubleshooting and collaborating with cross-functional teams to restore services and ensure operational continuity
- Strategically plan capacity requirements by analyzing, forecasting, and optimizing cloud infrastructure for future growth while maintaining cost-effectiveness
- Develop and maintain automation tools that minimize manual tasks and elevate operational efficiency
- Ensure our cloud infrastructure adheres to best practices in security and compliance, safeguarding our platform and services
- Perform other duties as assigned
Requirements:
- Bachelor’s degree in Computer Science or a related field, or equivalent practical experience
- 5+ years of experience with AWS or other cloud platforms and cloud security
- AWS certification, such as Solutions Architect, with in-depth knowledge of AWS services like EC2, VPC, Lambda, and RDS, is a plus
- Knowledge of GitLab and Jenkins is a plus
- Proven experience managing Kubernetes clusters in production, especially with AWS EKS
Skills:
- Excellent communication skills
- Strong analytical and problem-solving skills with a drive for continuous improvement
- Ability to work with cutting-edge technologies, with a desire for continuous learning
- Strong monitoring capabilities to drive growth and performance
- Ability to contribute to a supportive, cross-functional work environment
Travel:
Limited travel, scheduled per the needs of the business
#LI-REMOTE