Lead the SRE division and take end-to-end ownership of reliability across dLocal's platform, defining the strategy and partnering closely with Product and Engineering teams.
Own the global reliability strategy for dLocal’s platforms and services, aligning SRE goals with company and product objectives.
Define and socialize SRE standards and principles (SLIs/SLOs/SLAs, error budgets, production readiness, incident management practices, capacity planning, etc.).
Lead the SRE division: set org structure, define roles and scopes, and drive hiring, performance, and career development.
Build a culture of high ownership, continuous improvement, and data‑driven decisions across all reliability‑related work.
Ensure our most critical systems meet or exceed availability, latency, and performance targets.
Oversee and continuously evolve incident management (on‑call strategy, incident response, communication, postmortems, follow‑ups, and KPIs).
Own the observability, monitoring, and alerting strategy (metrics, logs, traces) across all environments, including tool selection, standards, and adoption.
Drive operational excellence: reduce toil via automation, improve deployment safety, and standardize production practices across teams.
Partner with Architecture, Platform, and Product Engineering leaders to define reliable, scalable architectures for our core systems and critical flows.
Guide the adoption of best practices in automation and Infrastructure as Code (IaC) across SRE and dependent engineering teams.
Sponsor and oversee large cross‑team reliability programs, such as major observability migrations, resilience testing frameworks, or reliability improvements for key products.
Provide senior technical leadership on capacity planning, performance engineering, resilience, and disaster recovery.
Lead, mentor, and coach SRE Leaders, Technical Referents, and senior ICs, helping them grow in both technical depth and leadership.
Collaborate closely with:
Product & Engineering to balance feature delivery and reliability.
Security, Cloud Platform, and Infrastructure to ensure secure and robust foundations.
Business stakeholders (e.g., Operations, Support, Commercial) to align on reliability expectations and SLAs.
Communicate clearly about risk, trade‑offs, and priorities to both technical and non‑technical audiences, including senior leadership.
Solid experience leading SRE / Production Engineering / Platform teams in high‑availability, high‑scale environments (experience in fintech, payments, or similarly critical domains is a plus).
Proven track record managing managers and senior ICs, building and scaling distributed technical teams.
Deep hands‑on expertise in:
Reliability engineering: SLIs/SLOs, error budgets, capacity planning, resilience, and disaster recovery.
Incident management: on‑call models, incident response, postmortems, and continuous improvement of incident processes.
Observability and monitoring: metrics, logs, traces, alerting strategies, and the surrounding ecosystem of tools.
Automation and IaC: strong familiarity with modern CI/CD pipelines, configuration management, and infrastructure as code.
Ability to shape technical strategy, translate it into a clear roadmap, and ensure consistent execution across multiple teams.
Excellent communication and influencing skills; comfortable driving alignment across Engineering, Product, and non‑technical stakeholders.
Strong analytical and problem‑solving skills, able to operate effectively in ambiguous, fast‑changing contexts.
Professional proficiency in English; comfortable working in a global, multi‑time‑zone, multicultural environment.
Experience in payments / fintech or other regulated, mission‑critical industries.
Hands‑on background as an SRE, Senior/Staff Engineer, or Platform Engineer before moving into leadership.
Experience implementing or maturing:
Centralized observability platforms and unified alerting strategies.
Standardized production readiness reviews and reliability sign‑off processes.
Chaos engineering / resilience testing practices.
Flexible Work Hours: We have flexible schedules and we are driven by performance.
Learning Budget: Get access to a Premium Coursera subscription.
dLocal Houses: Want to rent a house to spend one week anywhere in the world coworking with your team? We’ve got your back!
dLocal offers a robust payment processing solution designed for global enterprises to navigate cross-border transactions in emerging markets. By facilitating local payments and payouts in 40 countries, we help major brands enhance conversion rates and streamline their payment operations.