Software Test Lead

Ankara, Türkiye

Overview

Lead a team in designing and implementing test strategies and evaluating AI model outputs while fostering a culture of technical quality and continuous improvement.

Key Responsibilities

 Software Testing & QA Leadership

  • Design, review, and lead the implementation of test plans, test cases, and test strategies for various software components (APIs, services, UI).
  • Oversee the development of test automation scripts using tools such as PyTest, Selenium, Playwright, or Postman (a brief illustrative sketch follows this list).
  • Maintain and optimize test automation pipelines, integrating with CI/CD tools (e.g., Jenkins, GitLab CI, Azure DevOps).
  • Lead functional, regression, smoke, and performance testing efforts to validate system readiness.
  • Ensure traceability from requirements to test cases and bug reports.
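
To give a concrete sense of the automation scripting referenced above, a minimal PyTest-style check might look like the sketch below. It is illustrative only: the base URL, the /health endpoint, and the expected response shape are hypothetical placeholders rather than systems named in this posting.

    # Illustrative sketch only: a minimal PyTest smoke test for an HTTP API.
    # BASE_URL and the /health endpoint are hypothetical placeholders.
    import requests

    BASE_URL = "https://api.example.com"

    def test_health_endpoint_returns_ok():
        """Smoke check: the service responds and reports a healthy status."""
        response = requests.get(f"{BASE_URL}/health", timeout=5)
        assert response.status_code == 200
        assert response.json().get("status") == "ok"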

 LLM Evaluation & Benchmarking

  • Lead a team responsible for the evaluation of Large Language Model (LLM) outputs.
  • Design capability-based evaluation benchmarks (e.g., summarization, reasoning, math, code generation).
  • Guide the development and execution of auto-evaluation scripts using LLM-as-a-judge, rule-based, and metric-based methods (see the sketch after this list).
  • Build and maintain evaluation pipelines to track model accuracy, hallucination rates, robustness, and related quality metrics.
  • Collaborate closely with AI Engineers and Data Scientists to align evaluations with development priorities.
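
As a reference point for the metric-based methods mentioned above, a single auto-evaluation step might resemble the sketch below. It assumes the open-source rouge-score Python package and a made-up reference/prediction pair; it illustrates the technique rather than describing the team's actual pipeline.

    # Illustrative sketch only: a metric-based evaluation step for summarization
    # outputs, assuming the open-source `rouge-score` package is installed.
    from rouge_score import rouge_scorer

    def evaluate_summary(reference: str, prediction: str) -> dict:
        """Score a model summary against a reference using ROUGE-1 and ROUGE-L."""
        scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
        scores = scorer.score(reference, prediction)
        # Report only the F-measures, e.g. for aggregation in an evaluation dashboard.
        return {name: result.fmeasure for name, result in scores.items()}

    if __name__ == "__main__":
        # Hypothetical example pair; a real benchmark would iterate over a dataset.
        print(evaluate_summary(
            "The committee approved the budget after a short debate.",
            "The budget was approved by the committee following a brief debate.",
        ))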

Team Leadership & Technical Coaching

  • Mentor and support a team of QA engineers and model evaluators.
  • Allocate tasks, define sprint goals, and ensure timely and high-quality delivery of testing and evaluation artifacts.
  • Foster a culture of test-first thinking, technical quality, and continuous improvement.
  • Communicate evaluation insights and quality reports to product managers and stakeholders.

Required Qualifications

  • Bachelor's or Master's degree in Computer Science, Software Engineering, AI, or a related field.
  • At least 5 years of experience in software testing, including experience as a Senior QA Engineer or Test Lead.
  • Strong experience in test case writing, test scenario design, and test automation scripting.
  • Proficiency in scripting languages like Python, JavaScript, or Java for test automation.
  • Experience with tools such as PyTest, Selenium, JUnit, Playwright, Postman, etc.
  • Familiarity with LLMs (e.g., DeepSeek, Mistral, LLaMA) and AI evaluation metrics (BLEU, ROUGE, Accuracy, etc.).
  • Experience in building or maintaining benchmark datasets for AI evaluation.
  • Understanding of prompt engineering, response validation, and error case analysis.

Preferred Skills

  • Experience with LLM evaluation libraries/tools like OpenAI Evals, TruLens, LangChain Eval, or custom scripts.
  • Experience working with MLOps or AI pipelines and integrating tests within them.
  • Familiarity with dataset labeling platforms or human-in-the-loop evaluation systems.
  • Strong data analysis and reporting skills using Excel, Python (Pandas/Matplotlib), or dashboards.
  • Ability to define and customize evaluation logic per customer or business domain.

Over 12 years, with approximately 5,000 engineers and researchers, we have contributed to the growth of the ecosystem by training information and communication technology professionals, and we deliver global projects. Together, we are coding the future!
