MLOps Support Engineer (L1/L2 Technical Support)

Kathmandu, Nepal
full-time

At CloudFactory, we are a mission-driven team passionate about unlocking the potential of AI to transform the world. By combining advanced technology with a global network of talented people, we make unusable data usable, driving real-world impact at scale. 

More than just a workplace, we’re a global community founded on strong relationships and the belief that meaningful work transforms lives. Our commitment to earning, learning, and serving fuels everything we do as we strive to connect one million people to meaningful work and build leaders worth following.

Our Culture

At CloudFactory, we believe in building a workplace where everyone feels empowered, valued, and inspired to bring their authentic selves to work. We are:

  • Mission-Driven: We focus on creating economic and social impact.
  • People-Centric: We care deeply about our team’s growth, well-being, and sense of belonging.
  • Innovative: We embrace change and find better ways to do things together.
  • Globally Connected: We foster collaboration between diverse cultures and perspectives.

If you’re passionate about innovation, collaboration, and making a real impact, we’d love to have you on board!

Role Summary

The MLOps Support Engineer is an operations-first role, focused on ensuring AI/ML systems remain stable, observable, and supportable in production environments. This is not a data science or feature development role.

The primary objective is to maintain continuous performance of ML models and associated pipelines with minimal disruption to both internal and client-facing services. You will provide Tier 1 and Tier 2 support, escalating to Tier 3 Engineering as needed.

What you’ll do:

  • Provide Tier 1 / Tier 2 operational support for AI/ML solutions.
  • Identify failed jobs, degraded pipelines, or performance anomalies.
  • Triage incidents, investigate issues, and coordinate escalation to Tier 3 Engineering.
  • Participate in on-call rotas once established.
  • Validate that pipelines and jobs complete successfully.
  • Monitor data pipeline health, model execution, and basic performance metrics.
  • Identify operational issues before they impact customers.
  • Notify customers, and respond to their queries, when one of their models has an outage or issue.
  • Support incident management, rollback, and recovery activities.
  • Use and maintain runbooks and operational documentation.
  • Work with Engineering to improve supportability and observability.
  • Contribute to knowledge sharing to reduce single points of failure.
  • Work within defined SLAs and support processes as the service matures.
  • Prepare quarterly business reviews that report on the health of the ML models.
  • Evaluate champion/challenger models to see if a new model should be promoted.
  • Monitor for model drift and performance degradation, while validating that updates (new champion models or added data) do not introduce bias.

Requirements

Essential

  • Experience in operations, DevOps, SRE, or platform support roles.
  • Strong troubleshooting skills in production environments.
  • Proficiency in SQL and scripting (Python, Bash) for developing and automating ML workflows.
  • Familiarity with cloud-hosted systems (AWS, GCP, Azure), including cloud-based ML services.
  • Git: Solid understanding of version control, particularly in collaborative development environments.
  • Comfortable working from runbooks and structured processes.

Desirable

  • Exposure to AI/ML systems in production.
  • Familiarity with monitoring and observability tools (Grafana, Power BI, New Relic).
  • Knowledge of MLOps tooling and data platforms (MLflow, Databricks).
  • Experience supporting customer-facing platforms.
  • Knowledge of containerization and orchestration (Kubernetes) is a plus.
  • Experience with LLM prompt engineering and troubleshooting.
  • Eagerness to learn about complex predictive models.
  • Background in computer science, informatics, or a related field.
  • Passion for Machine Learning and AI: An eager learner who is excited about working with cutting-edge ML technologies and is passionate about optimizing and maintaining ML models in production environments.
  • Early Career in MLOps or ML Engineering: Ideally a junior ML engineer with a strong desire to grow in the field of MLOps and AI operations.
  • A Collaborative Mindset: You thrive in a team setting and are ready to contribute to model improvement, A/B testing, and iterative development.
  • Attention to Detail: A focus on model performance, bias prevention, and ensuring optimal model behavior as new data and models are introduced.

Additional information:

Nepal

  • This role provides MLOps coverage from *07:45pm–15:45am Nepal time* for US-based customers. You will be required to work during these hours and potentially outside of them if a model has issues.
  • Rotational On-Call work will also be required.

Colombia

  • This role provides MLOps coverage from *11am to 9pm Colombia time* for a US-based customer. You will be required to work on a shift rota covering 8-hour blocks within this window, and potentially outside of it if a model has issues.
  • Rotational On-Call work will also be required.

**Note that these hours are subject to change upon review.**

CloudFactory is a global leader in combining people and technology to provide a cloud workforce solution for machine learning and core business data processing. Our managed teams have experience with hundreds of AI projects and can process data with high accuracy using virtually any tool. As an impact sourcing service provider (ISSP), CloudFactory creates economic and leadership opportunities for talented people in developing nations. Trusted by 170+ companies, we enrich data for 11 of the world’s top autonomous vehicle companies and process millions of tasks a day for innovators including Microsoft, Hummingbird, Ibotta, Luminar and nuTonomy. We’re on four continents, with offices in the U.K., U.S., Nepal and Kenya.

You will enjoy CloudFactory if creating meaningful work for 1 million people in the developing world excites you, and if you value building relationships, can be described as both humble and courageous in the same sentence, and are passionate about pooling individual talents to win as one unified team. You have developed your own engine for personal growth, and you help others grow by giving both constructive and encouraging feedback. You love to do the crazy hard work upfront to make things simple for others, and your approach is often to think big, start small and then scale fast. If any of this resonates, it is likely you will enjoy and thrive at CloudFactory like nowhere else on earth!

5 Reasons You Should Work at CloudFactory!

Join us and make a difference in the world!

After you submit your application, all of our communication will be via email, so please check your inbox and spam folders regularly. CloudFactory will at no stage of this process ask candidates to make payments or pay fees of any kind.
