Fuku

Infra Support Engineer

TLDR

Provide first- and second-line technical support for AI infrastructure, including GPU/CPU nodes and networking, while collaborating with developers to improve system reliability.

Infra Support Engineer – GMI Global Infrastructure Team

Preferred Location:
- Taiwan
- Malaysia

Responsibilities:
- Provide first- and second-line technical support to customers for AI infrastructure, including GPU/CPU nodes, networking, storage, orchestration, and platform services. Support is delivered via ticketing systems, email, Slack, or other messaging platforms.
- Support GPU cluster delivery, including system provisioning, image deployment, network validation, BIOS/firmware updates, and GPU driver/runtime installation.
- Monitor system health and service-level indicators using alerts and dashboards; respond to alerts during scheduled shifts within a 24x7 rotation.
- Triage incidents by gathering context, verifying scope and impact, and following standard operating procedures and runbooks to perform immediate mitigations.
- Escalate incidents to the global SRE team with clear, concise incident notes and relevant logs/traces.
- Maintain incident logs, update status pages, and communicate timely updates to stakeholders during incidents.
- Perform routine operational tasks such as log checks, health checks, capacity checks, and simple automated fixes.
- Participate in postmortems and contribute actionable follow-ups to reduce recurrence of incidents.
- Help maintain and improve standard operating procedures (SOPs), run periodic runbook validation, and document new procedures.
- Work collaboratively with developers and SRE teams to improve system reliability.

Qualifications:
- Bachelor’s degree in Computer Science or a related field.
- Over 2 years of experience in IT operations, server administration, SRE, DevOps, or technical support.
- Hands-on Linux experience, including shell, kernel, and log management.
- Basic networking knowledge, including TCP/IP, DNS, HTTP, and VLANs.
- Familiarity with monitoring, alerting, and logging tools such as Prometheus, Grafana, and Alertmanager.
- Experience with NVIDIA GPU infrastructure and Kubernetes.
- Comfortable collecting diagnostics, reading logs, and interpreting traces.
- Strong troubleshooting mindset and ability to follow runbooks under pressure.
- Excellent written and verbal communication skills for customer-facing incident handling.
- Willingness to work shifts and participate in on-call rotations.
- Bilingual in English and Chinese.

Fuku is focused on streamlining the transition from legacy systems to modern programming languages, offering enterprise-level AI solutions that also cover code maintenance and documentation. Our services cater to organizations looking to enhance their technological infrastructure and efficiency in a rapidly evolving digital landscape.

Founded: 2023
Industry: Internet Software & Services