Parallel Domain

Senior Site Reliability Engineer

CAD $145,000 – CAD $185,000 per year

TLDR

Work at the core of large-scale simulation for autonomous systems, collaborating closely with engineering teams to enhance world-class cloud infrastructure.

Responsibilities
  • Infrastructure ownership and cloud operations. Design, build, and maintain multi-region AWS infrastructure using Terraform. Operate and scale EKS clusters across production regions: autoscaling, node lifecycle, workload health. Manage networking across environments: VPC design, DNS, load balancing, and cross-region connectivity. Support infrastructure changes, migrations, and expansions into new regions. Contribute to and improve GitOps-based deployment workflows using GitHub Actions, Helm, and Kustomize.

  • Reliability engineering and incident response. Help build and run incident management processes: severity definitions, escalation paths, on-call practices. Lead incident response, debugging, and root-cause analysis. Write postmortems and drive systemic reliability improvements from what they surface. Improve observability across metrics, logging, tracing, and dashboards. Support GPU and batch workloads running on Kubernetes.

  • Security and access management. Provide security-conscious feedback on platform architecture decisions. Own cloud IAM governance: roles, policies, and access boundaries across accounts and services. Lead compliance-adjacent work, including audit readiness, partner certification requirements, and responses to customer security questionnaires.

  • Platform tooling and developer experience. Improve CI/CD pipelines and infrastructure validation. Support engineers with infrastructure debugging, environment setup, and performance issues. Contribute to tooling and automation in Python and Bash. Take on adjacent responsibilities as needed in a startup environment.

Required Qualifications
  • Experience. 5+ years in SRE, DevOps, or infrastructure engineering roles, with a track record of operating production systems across multiple regions.

  • Terraform. Modules, state management, and multi-environment patterns.

  • AWS depth. Solid experience across VPC, IAM, EKS, S3, and CloudWatch.

  • Kubernetes expertise. Cluster operations, autoscaling, RBAC, and Helm.

  • CI/CD and GitOps. Experience with GitHub Actions, ArgoCD, or similar workflows.

  • Networking fundamentals. CIDR, DNS, load balancing, VPN, and cross-region connectivity.

  • Observability. Experience with tooling such as Prometheus and Grafana.

  • Scripting. Comfort with Python and Bash for tooling and automation.

  • Cross-platform familiarity. Working knowledge of both Linux and Windows environments. Operational experience supporting Windows-based workloads is a meaningful advantage.

  • Pragmatism and ownership. Comfortable in a fast-moving startup with evolving priorities. You take ownership of systems while collaborating closely with other teams, and you're pragmatic about tradeoffs between speed, reliability, and complexity.

Preferred Qualifications
  • Windows on Kubernetes. Experience with Windows node pools, Windows AMIs, and GPU-adjacent components on K8s.

  • GPU scheduling. Familiarity with GPU scheduling on Kubernetes, including NVIDIA device plugin configuration.

  • Domain workloads. Experience supporting simulation, ML, or rendering workloads in cloud infrastructure.

  • AWS extras. Exposure to AWS Storage Gateway, Active Directory integrations, or AWS Transfer Family.

  • Service mesh. Familiarity with service proxy or service mesh patterns.

  • Container OS. Experience with container-optimized OS images (e.g., Bottlerocket) and image-build tooling (e.g., Packer).

  • Cost optimization. Cloud cost optimization at scale.

Core Tools
Terraform · AWS · Kubernetes · Helm · Kustomize · ArgoCD · GitHub Actions · Prometheus · Grafana · Docker · Python · Bash

What Makes a Great Candidate
You think in failure modes and proactively surface issues. You hold a principled view on security and push back constructively when designs introduce unnecessary risk. You communicate clearly across engineering, product, and customer-facing teams, flagging issues with urgency proportional to customer impact. You take end-to-end ownership of complex efforts and know when to push for the clean solution versus the pragmatic one.

Base salary range of CAD $145,000–$185,000, depending on skills, qualifications, and experience, plus equity, full health/dental/vision coverage, learning stipend, and generous vacation. This role is remote-friendly across Canada and the US Pacific Northwest.

Parallel Domain builds a platform for testing and validating autonomous systems and Physical AI through high-fidelity virtual simulation. The platform is designed for developers and companies in the AI and robotics sectors, letting them push their systems to the limit in a controlled environment.

Employees: 11-50
Total raised: $2.5M