Job Title: Lead SRE (Site Reliability Engineer)
Location: Remote Work
Type: 6+ Month Contract to hire
Rate: $Open/hr.
Please forward your updated resume to deivy.malli@two95intl.com, and include your rate requirement, your contact details, and a suitable time when we can reach you.
Responsibilities
· Own uptime, SLAs, and overall reliability of cloud infrastructure and kiosks platform.
· Lead incident response, root-cause analysis, and drive actionable postmortems.
· Automate infrastructure, deployments, and operational tasks using modern IaC and scripting in collaboration with the Platform Engineering team.
· Maintain and improve monitoring, alerting, and observability (Grafana, Prometheus, New Relic, etc.).
· Manage, operate and recommend improvement of mo
· Execute and continuously improve disaster recovery and business continuity plans.
· Partner with platform engineering, QA, and development teams to ensure operational readiness.
· Establish and maintain runbooks, operational standards, and reliability best practices.
· Provide leadership, mentorship, and clear communication during both normal operations and incidents.
· Optimize cloud and Kubernetes environments for reliability, performance, and scalability.
Requirements
Qualifications
· 8+ years in SRE, DevOps, or Platform Engineering roles; 2+ years in a senior or lead capacity.
· Strong experience supporting production environments with strict SLAs and high uptime requirements.
· Deep knowledge of Kubernetes, containers, and cloud-native infrastructure.
· Proficiency in automation and scripting using Bash, Python, or Go.
· Hands-on experience with CI/CD pipelines and release engineering in modern environments.
· Expert-level familiarity with IaC tools (Terraform preferred).
· Strong understanding of monitoring, alerting, logging, and observability tooling.
· Experience implementing and managing GitOps workflows (ArgoCD or similar).
· Demonstrated ability to lead incidents and communicate effectively with technical and non-technical stakeholders.
· Solid understanding of disaster recovery planning, resilience practices, and system hardening.
Two95 International Inc. is a global technology firm specializing in enterprise solutions spanning BPM, Mobility, Cloud, Analytics, E-commerce, and Social Business. Our client base includes several Fortune 500 and mid-market companies across industries and geographies. With 20 years of knowledge and know-how in the IT field, we were chosen as an INC500 fastest-growing company in North America in 2013. Ranked 11th in Human Resources by INC500, we have also been nominated as the 3rd fastest-growing company in South Jersey by SJBM. We are ranked among the Top 20 IT Companies in New Jersey based on year-on-year growth over the last 3 years. Our team is seasoned and highly qualified, and our offices are located in New Jersey, Canada, and India. Our specialties: Direct Hire, Contingent Staffing, and Managed Outsourced Services.