Site Reliability Engineer, Staff Engineer

Ensure high reliability and operational efficiency for AWS cloud infrastructure and collaborate with cross-functional teams to automate and enhance monitoring and support practices.

The Site Reliability Engineer (SRE) will provide L2/L3 support for AWS cloud infrastructure and production environments, ensuring high availability, reliability, and operational efficiency. This role focuses on automating operational tasks, monitoring systems, and collaborating with DevOps, Development, and Infrastructure teams to resolve issues and improve service performance.

Responsibilities:

  • Provide L2/L3 support for AWS cloud infrastructure and production environments.
  • Implement and maintain automation for operational tasks, deployments, and monitoring.
  • Monitor system health, troubleshoot incidents, and ensure high availability of services.
  • Develop and enhance scripts/tools to reduce manual effort and improve efficiency.
  • Work closely with DevOps, Development, and Infrastructure teams for issue resolution.
  • Participate in on-call rotations and incident management during US shift hours.
  • Maintain and improve monitoring, alerting, and logging systems.
  • Ensure adherence to SRE best practices for reliability, scalability, and performance.
  • Document runbooks, SOPs, and knowledge base articles.

Requirements:

  • Strong hands-on experience with AWS services (EC2, S3, RDS, Lambda, VPC, IAM, CloudWatch).
  • Experience in automation and scripting using Python, Shell, or PowerShell.
  • Familiarity with Infrastructure as Code tools (Terraform or CloudFormation).
  • Understanding of CI/CD pipelines and DevOps practices.
  • Experience with monitoring tools like CloudWatch, Grafana, Prometheus, or ELK.
  • Good understanding of Linux systems and networking concepts.
  • Exposure to containerization (Docker/Kubernetes).
  • Ability to troubleshoot production issues under pressure.
  • Excellent verbal and written communication skills.
  • Willingness to work a US time zone shift.

👋🏼 We're Nagarro.

We are a digital product engineering company that is scaling in a big way! We build products, services, and experiences that inspire, excite, and delight. We work at scale, across all devices and digital mediums, and our people exist everywhere in the world (19,500+ experts across 36 countries, to be exact). Our work culture is dynamic and non-hierarchical. We're looking for great new colleagues. That's where you come in!

By this point in your career, it is not just about the tech you know or how well you can code. It is about what more you want to do with that knowledge. Can you help your teammates proceed in the right direction? Can you tackle the challenges our clients face while always looking to take our solutions one step further to succeed at an even higher level? Yes? You may be ready to join us.
