Monitor and optimize cloud systems while collaborating closely with DevOps, Product and Development teams to ensure service reliability and continuous improvement.
The challenge
We’re at a pivotal stage in the evolution of our cloud platform. To continue scaling efficiently and strengthening reliability, we are expanding our Operations & SRE capabilities. Our infrastructure supports mission-critical services for our customers, and ensuring performance, stability, and continuous improvement is at the core of our vision.
As a Site Reliability Engineer / Systems Administrator, your mission will be to monitor and optimize our cloud systems, automate processes, ensure effective incident management, and help us maintain a robust, scalable and secure infrastructure. You will play a key role in minimizing downtime, improving operational efficiency, and supporting sustainable growth.
You’ll be part of a highly collaborative engineering environment, working closely with DevOps, Product and Development teams to build reliable services from the ground up, enforce good operational practices and contribute to ongoing enhancements that impact thousands of users.
Collaboration will be essential. You will support critical infrastructure decisions, lead incident response, proactively detect risks and ensure that both technology and teams can continue to scale confidently.
What we expect from you
Proven experience managing large-scale cloud or MSP infrastructures.
Expert-level Linux systems administration (mandatory).
Experience with Windows Server (2012–2025) in production environments.
Strong troubleshooting skills across systems, networking, storage and application layers.
Solid networking knowledge: TCP/IP, DNS, load balancing, firewalling, BGP and network virtualization.
Experience with network storage solutions such as Ceph, NFS or similar technologies.
Familiarity with IaaS orchestration platforms such as CloudStack or similar.
Experience implementing and maintaining monitoring and observability tools such as Zabbix, Prometheus, Grafana and ELK.
Experience with Infrastructure as Code practices and automation using Ansible.
Experience designing or maintaining CI/CD pipelines.
Database knowledge: MySQL, MariaDB or PostgreSQL (advanced troubleshooting is a plus).
Strong understanding of ITIL processes for incident, problem and change management.
Strong documentation practices and commitment to operational excellence.
Analytical mindset focused on reliability, scalability and continuous improvement.
Excellent communication skills in Spanish and an intermediate level of English.
Nice to have
Hands-on experience with CI/CD pipelines.
Experience optimizing distributed systems performance.
Advanced security and system hardening expertise.
Experience with ticketing systems and operational workflow optimization.
Tools & Technologies
Operating Systems: Linux, Windows Server
Automation: Ansible, scripting (Bash, Python, PowerShell)
CI/CD: Modern pipeline implementations
Monitoring & Observability: Zabbix, Prometheus, Grafana, ELK Stack
Storage: Ceph, NFS or similar
Orchestration: CloudStack, OpenStack
Databases: MySQL, MariaDB, PostgreSQL
Collaboration & ITSM: Tools aligned with ITIL practices
Jotelulu builds a self-managed cloud infrastructure platform tailored specifically for small and medium-sized enterprises in the IT sector. By forging strategic alliances with managed service providers and IT integrators, we enhance collaborative revenue generation and provide tools that streamline IT management and automation at scale. Our focus on the Portuguese IT market sets us apart as a dedicated partner to drive growth in this niche.