Own the reliability, availability, performance, and scalability of customer-facing and employee-facing platforms. Partner with application, infrastructure, security, and NOC teams to engineer resilient services and automate operations across Azure and on-prem environments. Drive incident response and post-incident reviews, implement observability, and continuously improve service health through automation and best practices.
Responsibilities:
· Build and operate production platforms across Azure (e.g., AKS, App Services, Functions), Windows/Linux, and networking layers in partnership with Platform/Server/Network teams.
· Engineer end-to-end observability: metrics, logs, and traces via Azure Monitor, Application Insights, Log Analytics, Prometheus, Grafana, and centralized logging.
· Automate provisioning and configuration using Infrastructure as Code (Terraform/Bicep) and configuration management (Ansible/PowerShell DSC).
· Design and maintain CI/CD pipelines (Azure DevOps/GitHub Actions) with automated testing, canary/blue-green deployments, and change control alignment.
· Establish runbooks, SOPs, and self-healing automations to reduce MTTR and ticket volume from the NOC and Service Desk.
· Harden platform security (identity, secrets, certificates, network segmentation) leveraging Azure Key Vault, managed identities, and policy guardrails.
· Perform capacity planning, performance tuning, and cost optimization (FinOps) for compute, storage, and networking.
· Partner with Data/ETL teams to ensure reliability of batch and streaming jobs, scheduling, and dependencies.
· Create and maintain documentation (architecture, runbooks, dashboards) and support audits and compliance requirements.
Qualifications:
· Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
· 2–5+ years in SRE/DevOps/Platform Engineering with hands-on production ownership.
· Proficiency with Azure services (AKS, App Services, Functions, Azure Monitor, Log Analytics, Application Insights).
· Strong Kubernetes/Docker skills; Helm, ingress, service mesh (e.g., Istio/Linkerd) experience is a plus.
· IaC (Terraform or Bicep) and scripting (PowerShell and/or Python); Git-based workflows.
· CI/CD (Azure DevOps or GitHub Actions), artifact management, and release strategies (canary/blue-green).
· Observability tooling (Prometheus, Grafana, ELK/OpenSearch, Azure Monitor) and alert design to minimize noise.
· Experience with ITIL processes (incident, change, problem) and tools (ServiceNow/Jira).
· Knowledge of networking, DNS, TLS/certificates, load balancers, and security fundamentals.
· Excellent troubleshooting, communication, and cross-functional collaboration skills.
· Certifications such as Microsoft Azure Administrator/DevOps, CKA/CKAD, or ITIL Foundation are a plus.
All your information will be kept confidential according to EEO guidelines.
BETSOL is a cloud-first digital transformation and data management company offering products and IT services to enterprises in over 40 countries. The BETSOL team holds several engineering patents and is recognized with industry awards, and the company maintains a net promoter score that is 2x the industry average. BETSOL’s open-source backup and recovery product line, Zmanda (Zmanda.com), delivers up to 80% savings in total cost of ownership (TCO) and best-in-class performance. BETSOL Global IT Services (BETSOL.com) builds and supports end-to-end enterprise solutions, reducing time-to-market for its customers. BETSOL offices are set against the vibrant backdrops of Broomfield, Colorado and Bangalore, India. We take pride in being an employee-centric organization, offering comprehensive health insurance, competitive salaries, a 401(k) plan, volunteer programs, and scholarship opportunities. Office amenities include a fitness center, cafe, and recreational facilities. Learn more at betsol.com.