Navi Mumbai, India

Full-Time

SRE, Kubernetes Platform @ PulsePoint

About PulsePoint:

PulsePoint is a fast-growing healthcare technology company (with adtech roots) using real-time data to transform healthcare. We help brands and agencies interpret the hard-to-read signals across the health journey and unify these digital determinants of health with real-world data to produce the most dimensional view of the customer. Our award-winning advertising platforms use machine learning and programmatic automation to seamlessly activate this data, making marketing, predictive analytics, and decision support easy and instantaneous.

SRE, K8s:

As part of the SRE team (working remotely), you will work closely with the platform engineering team to help ensure the reliability and availability of our Kubernetes-based platform and the critical workloads running on it.

What you'll be doing:

You will analyze platform components and identify the potential issues we need to monitor to ensure the reliability and availability of our Kubernetes-based platform.
You will work with the platform engineering team to make sure you have all the tooling needed to provide visibility into the platform's health, efficient monitoring, helpful diagnostic logging and actionable alerting.
You will visualize the health of the platform components, set up alerting channels, configure appropriate diagnostic log levels for components, and define actionable alerts to help detect and troubleshoot platform-related issues.
You will follow the platform alerts, tune their sensitivity, and improve visualization and diagnostic logging to make it easier to pinpoint the root causes of issues.
You will write and improve instructions for the NOC team so that, where possible, they can recover from issues and stabilize the systems outside office hours.
You will research and stage anomaly detection tools.
You will help development teams with visualization and alerting for critical workloads running on the platform.

Requirements:

Time zone: at least 4 hours of overlap with our U.S. East Coast hours, 9am-6pm Eastern Time (6:30pm-3:30am IST during U.S. daylight saving time, 7:30pm-4:30am IST during standard time).
2+ years of SRE experience with Kubernetes.
1+ year of SRE experience with Puppet.
On-prem experience is required.
Good understanding of how to use the Prometheus and Elasticsearch stacks is needed.
Ability to configure and provision Grafana dashboards and data sources to visualize the health of the platform is needed.
Ability to configure actionable alerts and route them to the correct recipients (NOC team/Platform engineering) with the correct severity (critical issue/warning/info) via the correct channels (email/Slack) is needed.
Ability to write automation scripts is needed; ability to write automation code in Golang is welcome.
Ability to formulate monitoring, alerting, and logging requirements for the platform engineering team, in order to keep the platform up and healthy, is needed.
Ability to analyze production issues after they happen and continuously improve alerting and monitoring is needed, either to prevent the same issues from recurring (missing metrics/logs/alerts, wrong recipient, wrong severity, wrong channel) or at least to decrease detection and recovery time for those we cannot avoid.

WebMD and its affiliates are Equal Opportunity/Affirmative Action employers and do not discriminate on the basis of race, ancestry, color, religion, sex, gender, age, marital status, sexual orientation, gender identity, national origin, medical condition, disability, veteran status, or any other basis protected by law.

