Sr. Director - Backend Engineering


    Key Skills and Role Responsibilities:

    This role is for a strategic and technical leader to define, build, and operate the infrastructure orchestration systems that power our organization's cutting-edge Artificial Intelligence (AI) initiatives. The Senior Director will lead a team responsible for ensuring a robust, scalable, cost-efficient, and high-performance platform for all stages of the AI lifecycle, from experimentation and training to deployment and inference.

    Strategy and Leadership

    • Define and execute the long-term vision and roadmap for the company’s AI infrastructure Network Services, aligning it with overall business and AI Services goals.

    • Lead, mentor, and grow a high-performing engineering and operations team focused on AI infrastructure and platform engineering.

    • Manage budget and resource allocation for AI infrastructure Network Services deliverables.

    • Act as a key liaison among AI infrastructure owners, other service owners and consumers, core engineering, cloud infrastructure, and executive leadership.

    AI Infrastructure Development and Operations

    • Oversee the design, implementation, and maintenance of the core network orchestration platforms for large-scale AI model training (e.g., distributed training, hyperparameter tuning) and deployment (e.g., containerization, serverless functions, edge deployment).

    • Ensure reliability, security, and compliance of the AI infrastructure, meeting strict standards for data governance and model integrity.

    • Establish Service Level Objectives (SLOs) and Key Performance Indicators (KPIs) for the AI platform services and lead efforts for continuous optimization and performance tuning.

    Technology and Architecture

    • Evaluate, select, and integrate the core technologies required for the AI stack (e.g., cloud overlay/underlay networking, InfiniBand, load balancing, DNS, core networking, Kubernetes, Ray, GPU/accelerator management, distributed file systems).

    • Champion infrastructure-as-code (IaC) principles to manage and provision AI resources consistently and at scale.

    Qualifications

    Required

    • Education: Bachelor's or Master’s degree in Computer Science, Engineering, or a related technical field.

    • Experience:

      • 15+ years of progressive experience in software engineering, infrastructure, or platform operations.

      • 5+ years of experience leading and managing technical teams, ideally at the Director or Sr. Director level or in an equivalent capacity.

      • Deep, hands-on experience designing and operating large-scale distributed systems and cloud-native network architectures.

      • Proven experience specifically with AI infrastructure orchestration (e.g., using Kubernetes) and managing accelerated compute resources (GPUs, TPUs, etc.).

      • 15+ years of experience in cloud backend engineering, cloud design, deployment, and DevOps.

      • 15+ years of experience leading system design and architecture on private clouds and on public clouds (AWS, Azure, and/or GCP).

      • 10+ years of demonstrable experience building and operating infrastructure as code and infrastructure automation, plus comfort with various flavors of Linux.

      • 15+ years of experience in building high-performance, highly available, and scalable distributed systems in the cloud.

      • 15+ years of experience in building and managing high-performance, highly available, and scalable hybrid cloud environments.

      • Excellent cross-group collaboration and outstanding verbal and written communication skills.

    • Skills:

      • Expert-level knowledge of containerization and orchestration (Docker, Kubernetes).

      • Software-defined cloud networking.

      • Strong background in DevOps and MLOps principles and tooling.

      • Proficiency in at least one modern programming language (e.g., Python, Go).

      • Exceptional strategic planning, organizational, and written/verbal communication skills.

    Preferred

    • Prior experience managing infrastructure for training and inference of large language models (LLMs) or foundation models.

    • Experience in a regulated industry with strict compliance requirements.

    • Experience building and operating an AI private cloud.

    Success Metrics

    A successful Senior Director - AI Infrastructure Orchestration will be measured by:

    • The time to market for building, scaling, and operating AI infrastructure.

    • The resource utilization rate and cost efficiency of the AI compute infrastructure.

    • The reliability and uptime of the core AI platform services.

    • The talent retention and development within the AI Infrastructure team.

 

 

Coupang is a disruptive e-commerce giant in South Korea, offering fast Rocket Delivery and revolutionizing the shopping experience with innovation and customer-centric services.
