Site Reliability Engineer

Kathmandu, Nepal
Full-time, Remote

At CloudFactory, we are a mission-driven team passionate about unlocking the potential of AI to transform the world. By combining advanced technology with a global network of talented people, we make unusable data usable, driving real-world impact at scale. 

More than just a workplace, we’re a global community founded on strong relationships and the belief that meaningful work transforms lives. Our commitment to earning, learning, and serving fuels everything we do as we strive to connect one million people to meaningful work and build leaders worth following.

Our Culture

At CloudFactory, we believe in building a workplace where everyone feels empowered, valued, and inspired to bring their authentic selves to work. We are:

  • Mission-Driven: We focus on creating economic and social impact.
  • People-Centric: We care deeply about our team’s growth, well-being, and sense of belonging.
  • Innovative: We embrace change and find better ways to do things together.
  • Globally Connected: We foster collaboration between diverse cultures and perspectives.

If you’re passionate about innovation, collaboration, and making a real impact, we’d love to have you on board!

Role Summary

As a Site Reliability Engineer, you will ensure the reliability, availability, and security of production systems. You’ll collaborate closely with engineers and operators to apply engineering best practices, automation, and operational excellence across infrastructure, reliability, and platform security in a mission-driven environment.

Key Responsibilities

  • Design, build, and maintain scalable, resilient infrastructure that enables developer productivity and platform reliability.
  • Establish and maintain Infrastructure as Code (IaC) standards, best practices, and reusable templates.
  • Deploy, support, monitor, and maintain new and existing services, platforms, and application stacks.
  • Troubleshoot production issues, perform rollbacks and service restorations, and create dashboards to ensure high availability.
  • Create, maintain, and enhance runbooks for on-call and incident resolution.
  • Define and manage availability targets and SLAs for platform products.
  • Ensure production readiness across performance, availability, security, and compliance before go-live.
  • Build and improve monitoring, alerting, logging, and debugging tools.
  • Manage environment capacity planning and performance optimization.
  • Partner with engineering teams to drive performance improvements using metrics (latency, CPU, etc.).

Requirements

Must-Have Knowledge

  • Cloud Architecture: Strong expertise in AWS-based cloud infrastructure and microservices (serverless and containerized).
  • Infrastructure as Code (IaC): Proven experience provisioning and managing infrastructure via code.
  • CI/CD & DevSecOps: Solid understanding of CI/CD pipelines, web security, and DevSecOps practices.
  • Operational Excellence: Experience with monitoring, alerting, incident management, and 24/7 operational support.

Nice-to-Have Knowledge

  • Broader web security principles beyond standard DevSecOps practices.


Skills & Experience

Must-Have Skills

  • AWS Services: Hands-on experience with EC2, CloudFormation, ECS Fargate, Lambda, SQS, SNS, S3, ECR, RDS, and Route 53.
  • IaC Tools: Terraform, CloudFormation, Serverless Framework; scripting with Bash, Python, or Go.
  • Monitoring & Logging: Experience with Grafana, ELK stack, CloudWatch, and/or Prometheus.
  • Containerization & Scripting: Proficiency with Docker and shell scripting.
  • CI/CD Tools: Experience using GitHub Actions.

Nice-to-Have Skills

  • Programming experience in Go, Node.js, or Python.
  • Advanced troubleshooting skills for complex production and customer-facing issues.

General Requirements

  • Ability to collaborate effectively across global teams and time zones.
  • Strong problem-solving skills with the ability to simplify complex issues into actionable solutions.
  • High ownership mindset with the drive to meet deadlines and support team success.
  • Willingness to participate in 24/7 operational support processes.

Benefits

  • Great Mission and Culture
  • Meaningful Work
  • Market-competitive salary
  • Quarterly variable compensation
  • Remote and home working
  • Comprehensive medical cover
  • Group life insurance
  • Personal development and growth opportunities
  • Office snacks and lunch
  • Periodic team building and social events

At CloudFactory, we believe that work should be more than just a job—it should be a platform for growth, impact, and community. Here, you’ll earn with purpose, learn every day, and serve a mission that truly matters. If you're looking for a career where you can develop professionally, contribute meaningfully, and be part of a global movement, we’d love to have you on this journey!

Join us today and be part of our mission to connect people and technology for a better world! Apply now and bring your whole, authentic self to work—we can’t wait to meet you!

CloudFactory is a global leader in combining people and technology to provide a cloud workforce solution for machine learning and core business data processing. Our managed teams have experience with hundreds of AI projects and can process data with high accuracy using virtually any tool. As an impact sourcing service provider (ISSP), CloudFactory creates economic and leadership opportunities for talented people in developing nations. Trusted by 170+ companies, we enrich data for 11 of the world’s top autonomous vehicle companies and process millions of tasks a day for innovators including Microsoft, Hummingbird, Ibotta, Luminar and nuTonomy. We’re on four continents, with offices in the U.K., U.S., Nepal and Kenya.

You will enjoy CloudFactory if creating meaningful work for 1 million people in the developing world excites you, if you value building relationships, if you can be described as both humble and courageous in the same sentence, and if you are passionate about pooling individual talents to win as one unified team. You have developed your own engine for personal growth, and you help others grow by giving both constructive and encouraging feedback. You love to do the crazy hard work upfront to make things simple for others, and your approach is often to think big, start small and then scale fast. If any of this resonates, it is likely you will enjoy and thrive at CloudFactory like nowhere else on earth!

5 Reasons You Should Work at CloudFactory!!

Join us and make a difference in the world!

After submitting your application, all of our communication will be via email, so please check your inbox and spam folders regularly. CloudFactory will at no stage of this process ask candidates to make payments or pay fees of any kind.
