Qode
Site Reliability Engineer
TLDR
Join a dynamic team in Pune managing multi-cloud infrastructure, with a focus on AWS and Azure, while collaborating globally and optimizing system performance and reliability.
Location: Pune, India
Workplace Type: Onsite
Shift: US Shift
About the Role
We are seeking an experienced Site Reliability Engineer to join our dynamic team in Pune. In this role, you will be instrumental in managing our multi-cloud infrastructure, focusing on AWS and Azure. You will be responsible for setting up and maintaining the infrastructure to support our cloud migration and future division expansion. This position offers a unique opportunity to work in a global environment, collaborate with Automotive and corporate IT teams, learn new skills, and shape the future direction of our infrastructure. The ideal candidate will have a strong background in cloud computing, infrastructure as code, and automation, with a proactive approach to problem-solving and performance optimization. You will be part of the Tech Ops / SRE Team, which operates in a sharing and learning culture to maintain continuous access to our products.
Key Responsibilities
- Gather and analyze metrics from operating systems and applications to assist in performance tuning and fault finding.
- Partner with development teams to improve services through rigorous testing and release procedures.
- Participate in system design consulting, platform management, and capacity planning.
- Create sustainable systems and services through automation.
- Balance feature development speed and reliability with well-defined service-level objectives.
- Manage day-to-day operations of AWS/Azure Infrastructure.
- Build and document automation processes for Infrastructure as a Service (IaaS) and Infrastructure as Code (IaC).
- Manage backup and patch management processes.
- Support architecture planning, migration, and installation for new projects.
- Lead the structural/architectural design of platforms, middleware, databases, and backups according to system requirements.
- Conduct technology capacity planning by reviewing current and future requirements.
- Plan and implement disaster recovery, including backup and recovery procedures.
- Manage day-to-day operations by troubleshooting issues, conducting root cause analysis (RCA), and developing fixes.
- Plan for and manage upgrades, migrations, maintenance, backups, installations, and configurations.
- Review technical performance and implement changes that improve efficiency and fine-tune performance.
- Develop shift rosters to ensure no disruption in the tower.
- Create and update SOPs, Data Responsibility Matrices, operations manuals, and daily test plans.
- Provide weekly status reports to client leadership and internal stakeholders.
- Leverage technology to develop Service Improvement Plans (SIP) through automation.
Required Skills & Qualifications
- Bachelor’s degree (or equivalent) in computer science or a related discipline, plus at least 7 years of relevant experience.
- Strong understanding and hands-on experience with EKS, including configuring, deploying, maintaining, troubleshooting, upgrading, and monitoring EKS on AWS.
- Hands-on experience with CI/CD pipelines and DevOps tooling, including Git-based version control (GitLab preferred), pipeline design and maintenance, automated builds, testing, and deployments for cloud-native and containerized workloads.
- Hands-on experience with Linux servers, Active Directory (AD), LDAP, DNS, network storage, and AWS services (EC2, FSx, Managed AD, Route 53, etc.).
- Ability to script and automate using tools and languages such as PowerShell, Python, Ansible, Terraform, and Bash.
- Familiarity with ITSM processes such as Incident, Problem, and Change Management (ServiceNow preferred).
- Proactive approach to identifying problems, performance bottlenecks, and areas for improvement.
- Strong interpersonal skills, analytical and problem-solving ability, along with strong written and verbal communication.
- Ability to communicate ideas in both technical and non-technical ways.
- A strong capacity for teamwork and a sense of ownership, with the ability to work independently and be self-driven.
- Experience in cloud infrastructure consulting.
Qode is a technology-driven platform that transforms how recruiters and candidates connect by leveraging data and automation. Our solutions streamline the hiring process through machine learning, creating private talent pools and automating workflows, ultimately enhancing the quality of candidate evaluation and decision-making. With our no-code tools, we empower organizations to develop tailored recruitment strategies without needing extensive technical skills.
Industry: Internet Software & Services