Senior Scalability Engineer - Observability

TLDR

Design and develop an organization-wide observability strategy and platform using the LGTM stack to enhance engineering productivity and optimize system debugging and monitoring.

About Judi Health

Judi Health is an enterprise health technology company providing a comprehensive suite of solutions for employers and health plans, including:

  • Capital Rx, a public benefit corporation delivering full-service pharmacy benefit management (PBM) solutions to self-insured employers,
  • Judi Health™, which offers full-service health benefit management solutions to employers, TPAs, and health plans, and
  • Judi®, the industry’s leading proprietary Enterprise Health Platform (EHP), which consolidates all claim administration-related workflows in one scalable, secure platform.

Together with our clients, we’re rebuilding trust in healthcare in the U.S. and deploying the infrastructure we need for the care we deserve. To learn more, visit www.judi.health.

Location: Remote 

Position Summary: 

Join our Scalability team as a Senior Scalability Engineer focused on observability platform development and engineering productivity. In this role, you will define, own, and build Judi Health's organization-wide observability strategy, tooling, and platform products. Beyond maintaining infrastructure, you'll architect and develop a custom observability platform that gives engineering teams powerful, fast, and cost-effective visibility into every layer of our infrastructure, from application logs and metrics to distributed traces. You'll build production-grade internal products using React/TypeScript frontends with Python and Rust backends, creating tools that fundamentally improve how engineers at Judi Health debug, monitor, and optimize their systems. Working closely with leadership and cross-functional teams, you will do work that is foundational to platform stability, performance optimization, and developer productivity across our rapidly growing healthcare platform.

Position Responsibilities: 

In this role, you'll own the observability infrastructure that powers our engineering organization. You will:  

  • Architect observability platform: Design, implement, and maintain the LGTM stack (Loki, Grafana, Tempo, Mimir/Prometheus) as the primary observability platform across all engineering teams, making architectural decisions that balance cost, performance, and developer experience.
  • Build internal observability products: Design and develop production-grade internal platform products with React/TypeScript frontends and Python/Rust backends that provide engineers with powerful log search, metrics visualization, and trace analysis capabilities.
  • Develop custom log indexing systems: Architect and build high-performance log indexing solutions using Rust that process logs and provide sub-second search across billions of log lines at a fraction of the cost.
  • Integrate SQL analytics for logs: Design and implement solutions leveraging AWS Athena or similar SQL query engines (DuckDB, ClickHouse) for ad-hoc log analysis and historical queries, enabling engineers to run complex SQL queries over S3-based log data for deep investigations and trend analysis. 
  • Create advanced query interfaces: Build sophisticated web interfaces that allow engineers to query logs, metrics, and traces with features like saved queries, query templates, correlation analysis, and pattern detection, supporting both full-text search and SQL-based analytics. 
  • Balance cloud-native and open-source: Architect solutions that thoughtfully leverage both AWS-managed services (CloudWatch, Athena, Kinesis) and open-source tooling (LGTM stack, Quickwit) to optimize for cost, performance, and operational flexibility based on use case requirements. 
  • Integrate AWS observability: Design seamless integration between AWS CloudWatch Logs/Metrics and our custom observability platform, providing unified visibility across managed and self-hosted infrastructure. 
  • Build intelligent alerting: Develop smart dashboards, monitors, and alerting systems that reduce noise, detect anomalies, and help teams respond to incidents quickly. 
  • Partner with engineering teams: Work directly with product teams to integrate observability into their services, establish logging and metrics standards, and instrument code effectively, serving as the observability subject matter expert. 
  • Enable performance optimization: Provide the observability foundation that allows the Scalability team to identify performance bottlenecks, track optimization impact, and measure platform stability with data-driven insights. 
  • Establish observability standards: Define and document comprehensive observability standards including structured logging patterns, metric naming conventions, trace instrumentation, dashboard design principles, and query best practices. 
  • Drive platform adoption: Lead workshops, create documentation, and build self-service tooling that democratizes observability across engineering, making it easy for teams to adopt best practices. 
  • Demonstrate technical leadership: Mentor engineers on observability practices, lead architecture reviews for instrumentation approaches, and represent the Scalability team in cross-functional planning. 
  • Work in an Agile/Scrum environment to continually deliver value to stakeholders and clients. 
  • Code of Conduct: Responsible for adherence to the Capital Rx Code of Conduct including reporting of noncompliance. 

Required Qualifications: 

  • 10+ years of software engineering or infrastructure engineering experience with demonstrated progression into technical leadership roles. 
  • Several years of experience leading technical initiatives, building platform products, or serving as a subject matter expert on observability infrastructure. 
  • Strong experience with React/TypeScript for frontend development and Python (Flask/SQLAlchemy) for backend services. 
  • LGTM stack expertise: Deep production experience with Loki, Grafana, Tempo, and Prometheus/Mimir for logs, metrics, and distributed tracing at scale. 
  • AWS observability: Extensive experience with AWS CloudWatch Logs and Metrics, including custom metrics, log insights, dashboard creation, and integration patterns. 
  • SQL analytics for logs: Production experience with SQL-based log analytics using AWS Athena, DuckDB, or similar query engines for analyzing structured and semi-structured data at scale. 
  • Cloud-native and open-source balance: Demonstrated ability to architect solutions leveraging both managed cloud services and open-source tooling, understanding trade-offs between operational overhead, cost, flexibility, and vendor lock-in. 
  • Search and indexing experience: Hands-on experience building or operating search systems using OpenSearch, Elasticsearch, Lucene, Tantivy, or similar search and analytics engines. 
  • Performance-critical systems: Experience building high-performance systems that process large volumes of data efficiently (millions of log lines, high-cardinality metrics). 
  • Systems thinking: Deep understanding of distributed systems, microservices architectures, and the complex observability challenges they present. 
  • Data at scale: Proven track record handling high-volume structured and unstructured logging data, identifying patterns, and building efficient search/query solutions that perform well under load. 
  • Product mindset: Ability to build internal platform products that engineers love to use, with attention to UX, performance, and reliability. 

Preferred Qualifications: 

  • Rust development experience: Production experience with Rust for building high-performance data processing, indexing, or search systems. Strong interest in learning Rust is acceptable if combined with systems programming experience in C/C++/Go. 
  • Infrastructure as code: Experience with Terraform for managing observability infrastructure and AWS resources. 
  • Additional observability platforms: Experience architecting or operating Datadog, New Relic, Splunk, or other enterprise observability platforms. 
  • Advanced query languages: Deep expertise with PromQL, LogQL, SQL optimization, and query optimization for high-cardinality data. 
  • Columnar storage formats: Experience with Parquet, ORC, or other columnar storage formats for efficient log storage and analytics on S3. 
  • Incident management: Experience designing incident response workflows, postmortem processes, and SLO/SLI frameworks that drive reliability improvements. 
  • Cost optimization: Track record of reducing observability costs while maintaining or improving capabilities (e.g., CloudWatch → S3/custom indexing migration). 
  • Data pipelines: Experience with streaming data pipelines, ETL processes, or real-time data processing. 
  • Distributed tracing: Deep knowledge of OpenTelemetry, Jaeger, Zipkin, or distributed tracing architectures. 
  • Git expertise and experience working in a mono repository. 
  • Previous Pharmacy Benefits Manager (PBM) or healthcare technology experience. 
  • Experience building developer tools or internal platforms that improve engineering productivity. 

This range represents the low and high end of the anticipated base salary range for the NY-based position. The actual base salary will depend on several factors, such as experience, knowledge, and skills, and on whether the location of the job changes.

Nothing in this position description restricts management’s right to assign or reassign duties and responsibilities to this job at any time. 

Salary Range
$160,000 – $220,000 USD

All employees are responsible for adherence to the Capital Rx Code of Conduct including the reporting of non-compliance. This position description is designed to be flexible, allowing management the opportunity to assign or reassign duties and responsibilities as needed to best meet organizational goals.

Judi Health values a diverse workplace and celebrates the diversity that each employee brings to the table. We are proud to provide equal employment opportunities to all employees and applicants for employment and prohibit discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, medical condition, genetic information, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws. 

By submitting an application, you agree to the retention of your personal data for consideration for a future position at Judi Health. More details about Judi Health's privacy practices can be found at https://www.judi.health/legal/privacy-policy.

Judi Health is an enterprise health technology company that offers a comprehensive suite of solutions for employers and health plans. With services like Capital Rx for pharmacy benefit management and Judi Health™ for health benefit management, we streamline healthcare services to better support millions of plan members.
