Site Reliability Engineer

We are seeking a Site Reliability Engineer (SRE) with 5–8 years of experience to help build and maintain highly available, scalable, and resilient systems. This role will focus on improving reliability through automation, observability, incident response, and engineering excellence. You will work closely with engineering, product, and operations teams to drive best practices in performance, reliability, and continuous improvement.

Ideal candidates have strong foundations in infrastructure automation, systems engineering, and software development, along with a passion for building reliable systems at scale.

Key Responsibilities:

Reliability Engineering & Automation

  • Design and implement Infrastructure as Code (IaC) using Terraform, Pulumi, CloudFormation, or Ansible to provision and manage scalable cloud infrastructure.
  • Build self-healing and auto-scaling infrastructure using Kubernetes, Docker, and managed services across AWS, Azure, or GCP.
  • Automate operational tasks including failovers, backups, and capacity adjustments.

SLI/SLO Monitoring & Observability

  • Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure and maintain service reliability.
  • Build and manage observability stacks using Prometheus, Grafana, Datadog, CloudWatch, AppDynamics, or equivalent.
  • Improve alert quality and reduce noise through intelligent alerting and tuning.

Incident Response & Operational Excellence

  • Lead production incident response, perform root cause analysis (RCA), and write blameless postmortems to drive continuous improvement.
  • Establish and refine runbooks, playbooks, and on-call processes to improve Mean Time to Recovery (MTTR).
  • Participate in on-call rotations to support critical production systems.

CI/CD Reliability & Release Engineering

  • Develop and optimize CI/CD pipelines (Jenkins, GitHub Actions, ArgoCD, TeamCity) with built-in checks for reliability, performance, and security.
  • Implement progressive delivery patterns like canary deployments and blue/green rollouts.
  • Collaborate with developers to ensure release processes are safe, repeatable, and observable.

Security, Compliance & Risk Management

  • Enforce cloud security best practices for IAM, network segmentation, and secret management.
  • Integrate DevSecOps practices and tooling (e.g., Snyk, SonarQube, OWASP ZAP) into pipelines for early vulnerability detection.
  • Ensure systems adhere to regulatory and compliance standards such as SOC 2, ISO 27001, GDPR, and HIPAA.

Collaboration & Mentorship

  • Work cross-functionally with engineering, QA, and platform teams to embed reliability into the SDLC.
  • Provide guidance and mentorship to junior SREs and engineers on reliability practices.
  • Champion a culture of operational excellence, documentation, and knowledge sharing.

Skills & Qualifications:

Technical Expertise

  • Cloud Platforms: Strong hands-on experience with AWS, GCP, or Azure across compute, networking, storage, and identity.
  • Kubernetes: Advanced experience managing production-grade clusters (EKS, GKE, AKS), Helm, and containerized workloads.
  • CI/CD Tools: Jenkins, GitHub Actions, ArgoCD, TeamCity, Spinnaker (preferred).
  • IaC: Terraform, CloudFormation, Pulumi, Ansible (strong experience required).
  • Programming & Scripting: Proficiency in Python, Go, or Java. Strong scripting skills with Bash or PowerShell.
  • Version Control: Expert-level Git usage and branching strategies (GitOps experience is a plus).
  • Monitoring & Logging: Familiarity with Prometheus, Grafana, ELK, Datadog, New Relic, or AppDynamics.

Security & Compliance

  • Strong understanding of cloud security principles and IAM policies.
  • Experience with automated security testing and static code analysis tools.

Soft Skills

  • Analytical thinker with strong troubleshooting and problem-solving skills.
  • Clear communication and ability to drive cross-team collaboration.
  • Strong ownership mindset and bias for action in high-pressure situations.
  • Ability to manage multiple priorities and lead technical initiatives.

Preferred Qualifications:

  • Certifications: AWS Certified Solutions Architect, Google Cloud Professional Cloud DevOps Engineer, or Microsoft Certified: DevOps Engineer Expert.
  • Experience with GitOps and related tooling such as GitHub, Jenkins, ArgoCD, TeamCity, or JFrog.
  • Exposure to serverless architecture (Lambda, GCF, Azure Functions).
  • Experience with chaos engineering and resiliency testing frameworks.

Education & Experience:

  • Bachelor’s degree in Computer Science, Information Technology, or a related field.
  • 5+ years of experience in SRE, DevOps, or cloud infrastructure roles with a focus on system reliability.

About Picarro:

We are the world's leader in timely, trusted, and actionable data using enhanced optical spectroscopy. Our solutions are used in a wide variety of applications, including natural gas leak detection, ethylene oxide emissions monitoring, semiconductor fabrication, pharmaceutical, petrochemical, atmospheric science, air quality, greenhouse gas measurements, food safety, hydrology, ecology, and more. Our software and hardware are designed and manufactured in Santa Clara, California, and are used in over 90 countries worldwide. They are based on over 65 patents related to cavity ring-down spectroscopy (CRDS) technology and are unparalleled in their precision, ease of use, and reliability.

At Picarro, we are committed to fostering a diverse and inclusive workplace. All qualified applicants will receive consideration for employment without regard to race, sex, color, religion, national origin, protected veteran status, gender identity, sexual orientation, or disability. Posted positions are not open to third-party recruiters/agencies, and unsolicited resume submissions will be considered free referrals.

At Picarro, we strive to ensure that all individuals, regardless of their abilities, have equal opportunities. If you are an individual with a disability and require reasonable accommodation to complete any part of the application process, or if you are limited in your ability to access or use this online application process and need an alternative method for applying, please contact Picarro, Inc. at [email protected] for assistance.

 

 
