Cockroach Labs is hiring a

Site Reliability Engineer - New York, NY

New York, United States

Databases are the beating heart of every business in the world.

Cockroach Labs is the creator of CockroachDB, the most highly evolved cloud-native, distributed SQL database on the planet that scales fast, survives anything, and thrives anywhere. We created CockroachDB to unshackle teams from the constraints of their database. Join us on our mission to simplify how businesses build and operate world-changing applications!

About the Role

CockroachDB provides the backbone of storing data on a global scale. As a Site Reliability Engineer you’ll help manage and scale our CockroachCloud service, a fully managed offering of CockroachDB. You will oversee our production system, ensuring that we can provide stable and scalable infrastructure as we deliver CockroachDB to our customers. CockroachCloud is a global service spanning multiple cloud providers. Roughly half of your time will be spent on greenfield development work, with an emphasis on developing tooling and driving automation. In this role you will work across multiple teams within CockroachCloud as well as development and product teams working on CockroachDB.

You Will

  • Manage the infrastructure for cloud services, including running internal production systems and hosting CockroachDB for our external customers.
  • Design, write and deliver software and systems to increase product reliability and operational efficiency.
  • Develop custom tools as necessary.
  • Keep a complex system running and solve problems relating to mission-critical services.
  • Design, implement, operate, and troubleshoot the automation and monitoring of production clusters to maximize performance and availability.
  • Drive the company through disaster recovery tests, where we manually turn down pieces of CockroachDB to test its overall resilience to failures.
  • Participate in an on-call rotation for our production systems and hosted services.

The Expectations

In your first 30 days, you will onboard and be exposed to our current internal and customer-facing production systems. Working with our existing SRE and engineering teams, you will pair on production operations and build out runbooks for the operation of different systems. We believe that it's essential for you to take this first month to become familiar with our technology and our company.

After 3 months, you'll be fully integrated into the team. You will develop and own tooling for reliability, automation, and other issues related to CockroachCloud’s stability and scalability. You will identify new opportunities for automating processes, streamlining delivery, deploying new core functionality, and building great tools. You will help make CockroachCloud the best platform to host CockroachDB on by bringing your expertise to our database.

You Have

  • Expertise in analyzing, monitoring, and troubleshooting large-scale distributed systems.
  • Experience in software development using one or more of the following: Go, C, C++, Python, Java.
  • Proficiency working with algorithms, data structures, and production troubleshooting.
  • Expertise in working with major cloud providers (AWS, Azure, GCP, etc.) and Cloud APIs.
  • Experience debugging and optimizing code and automating routine tasks.
  • Working knowledge of web and network protocols and standards (HTTP, TLS, DNS, etc.).
  • Previous on-call experience, with a sense of urgency.
  • Experience building collaborative relationships with your colleagues. You enjoy being part of the code review process and partnering with your teammates on challenging problems.

The Team 

Our core mission on the SRE team is to operate a secure & reliable Cockroach Cloud product at scale. We are a group of software engineers first & foremost. We use software engineering as a means to achieve our mission; this is the SRE way. The SRE team is currently distributed across North America (6) and India (3).

Reporting to Tom Schmidt - Sr. Manager, Engineering (Site Reliability Engineering)

Tom recently joined Cockroach Labs as manager of Site Reliability Engineering and has taken responsibility for Cockroach Cloud’s production operations. Tom joined Cockroach Labs after 15 years at IBM, where he initially contributed in a wide variety of technical leadership roles, generally focusing on quality and automation across compiler development, test frameworks, CI/CD, and more. Over the past 7 years, Tom has become an enthusiastic advocate of the Site Reliability Engineering discipline, presenting on the topic at conferences, developing certification curriculum, and securing multiple patents. Tom was also a primary contributor towards the establishment of IBM’s formal SRE profession and was recognized as one of the first three SRE Thought Leaders within the company. Most recently, Tom transitioned into a management role where he introduced Site Reliability Engineering to the IBM Business Analytics organization, building an SRE team from the ground up, eventually managing over 20 individuals across 3 unique project areas while establishing practices that now guide over 80 engineers internationally. Cockroach Labs presented a new and unique opportunity to gain experience in a fast-paced startup environment, laying the foundation for scalable reliability as we prepare for the rapid growth of our Cockroach Cloud offering. Beyond the business, Tom is blessed to call himself a proud father of a 4-year-old boy, and otherwise enjoys finding balance between spending time in nature (hiking, camping, exploring) and testing his mettle in competitive gaming.

Jordan Lewis - Senior Director of Engineering

Jordan is the Head of Engineering for CockroachDB Cloud. He’s responsible for the teams that build, maintain and keep CockroachDB Cloud reliably serving the needs of Cockroach Labs’ most demanding customer base. He joined Cockroach Labs as a database engineer in 2016, when it was just 25 people, before moving into engineering leadership and most recently moving to lead the Cloud organization. Jordan lives in his hometown of Brooklyn, NY with his wife. Outside of work he enjoys folk music and riding his electric scooter around town.

Isaac Wong - EVP of Engineering

Isaac is responsible for the health of the engineering organization at Cockroach Labs. He partners closely with teams to ensure we have a balanced culture that promotes quality and innovation in pursuit of our goals. Before joining Cockroach Labs, Isaac was in life sciences for 16 years with Medidata Solutions, where he had a front row seat on the exciting ride from a 30-person startup to more than 2000 people worldwide. But the lure of distributed, resilient, and consistent SQL databases, along with the amazing technology and culture at Cockroach Labs, proved too much. When not working he likes to draw, play the piano and search NYC for cannolis with his wife and kids.

Our Benefits

  • Competitive Health Insurance Coverage (for you & your dependents!)
  • Paid parental leave (with baby bucks)
  • Flex Fridays
  • Flexible time off & flexible hours
  • Education reimbursement
  • Relocation support or home office allowance

Cockroach Labs is proud to be an Equal Opportunity Employer building a diverse and inclusive workforce. If you need additional accommodations to feel comfortable during your interview process, please email us at [email protected].

The annual anticipated base salary range for U.S. candidates for this role is USD $165,000 to $225,000 plus commission if a sales role. We set standard ranges for all U.S.-based roles based on function, level, and geographic location, benchmarked against similar stage growth companies. Actual salaries may vary and fall outside of this range depending on factors such as a candidate’s qualifications, geographic location, skills, experience, and competencies. In addition, we are often open to a wide variety of profiles, and recognize that the person we hire may be less experienced (or more senior) than this job description as posted. Salary is one component of Cockroach Labs’ total rewards package, which includes stock options, health insurance, life and disability insurance, funds towards professional development resources, flexible PTO, paid holidays, and parental leave, to name a few! Salaries for candidates outside the U.S. will vary based on local compensation structures.
