Site Reliability Engineer - Machine Learning Systems

Responsibilities

  • Responsible for ensuring our ML systems run reliably and efficiently for large model deployment, training, evaluation, and inference.
  • Responsible for the stability of offline tasks/services in multi-data center, multi-region, and multi-cloud scenarios.
  • Responsible for managing and planning computing and storage resources, including cost and budget.
  • Responsible for global system disaster recovery, cluster machine governance, stability of business services, and improving resource utilisation and operational efficiency.
  • Build software tools, products and systems to monitor and manage the ML infrastructure and services efficiently.
  • Be part of the global on-call roster that provides system and business support.

Requirements

Qualifications

  • Bachelor's degree or above in Computer Science, Computer Engineering, or a related field;
  • Strong proficiency in at least one programming language such as Go, Python, or Shell in a Linux environment;
  • Strong hands-on experience with Kubernetes and containers, and at least 3 years of relevant operations and maintenance experience.

Preferred Qualifications

  • Excellent logical analysis skills, with the ability to abstract and decompose business logic sensibly; a strong sense of responsibility, good learning and communication skills, self-motivation, and team spirit;
  • Good documentation habits, with the ability to write and update workflow and technical documentation on time as required;
  • Experience operating and maintaining large-scale distributed ML systems;
  • Experience operating and maintaining GPU servers.

At HireIO, we inspire opportunities and create meaningful lives through workforce solutions. As a leading recruitment company, we specialize in candidate sourcing, screening, and interviewing to simplify the hiring process for businesses of all sizes and industries. We believe in the power of diverse and skilled workforces, and we strive to provide personalized services that meet the unique needs of our clients. Our team of experienced professionals is dedicated to building lasting relationships and delivering services that align with our clients' goals.
