Site-Reliability Engineer

AI overview

Manage high-availability applications in a hybrid environment while implementing cloud observability and building automation solutions for performance management.

Job Description:

• Minimum 3-5 years of service reliability/operations experience running large-scale, high-performance applications in a hybrid environment (on-prem and cloud).

• Minimum 3-5 years of experience writing automation scripts and building application performance management dashboards to manage transaction journeys.

• 2-4 years of experience working with programming languages such as Go, Python, Java or Rust.

• Working knowledge of one or more databases: Oracle, PL/SQL, SQL Server, Redis, ClickHouse, PostgreSQL, MongoDB or any time-series database.

• At least 2 years of experience transitioning platforms to the cloud and to containers: GCP, AWS and Rancher (or CloudFormation, Azure and OpenShift).

• Experience maintaining containerized apps in GKE/RKE/AKS environments.

• Experience implementing cloud observability using OpenTelemetry (OTel) to enable real-time monitoring, distributed tracing and incident resolution.

• Experience working with a GraphQL framework (Apollo, Prisma, Hasura, etc.).

• Experience applying knowledge of networking (TCP/IP, HTTP, DNS, load balancing and service mesh) to troubleshoot issues in high-pressure situations.

• Proven experience managing application availability, building creative solutions to automate repetitive activities, and improving gating and detection for applications at every touchpoint of a 24x7 high-availability platform exposed to critical clients and customers.

• Working knowledge of monitoring tools: Splunk, AppDynamics, Grafana/Prometheus and Dynatrace.

• Experience with tools like Rally, Confluence and other CI/CD extenders.

• Hands-on experience implementing in-memory caching solutions. Experience with Redis is a plus.

• Excellent debugging skills across a variety of technical platforms integrated through an API gateway.

• Hands-on experience with GCS, Cloud SQL, PL/SQL and Spanner.

• Experience monitoring and troubleshooting HashiCorp Vault environments, ensuring minimal downtime and rapid recovery from incidents.

• Working knowledge of Vertex AI, Gen AI and BigQuery.

Axiom is a global information technology, consulting and outsourcing company and services provider. Our IT solutions empower organizations and individuals throughout the world to maximize value and quality to succeed in today's challenging business environment. As a fast-growing new economy company, we focus our strengths to offer world-class solutions and services through the convergence of technology, innovation, expertise and experience. We provide software consulting, development and IT-enabled services to clients across the globe. We work towards delivering sustained value creation for customers, employees, industries and society at large. Core offerings include data warehousing, middleware development, product development and web-enablement of legacy applications in verticals like telecom, finance, healthcare, manufacturing, energy & utilities, and retail & distribution. Relentless exploration of technology horizons and a Global Delivery Model that is a judicious combination of onsite, offsite and offshore development offer a complete range of high-ROI business solutions spanning the consulting, technology, operations and process outsourcing value chain.
