Senior Engineer (AWS, CloudWatch)

OVERVIEW:

Provide Level 1 support for applications hosted on AWS: monitor, triage, and troubleshoot incidents, and work with application and infrastructure teams to improve operational processes.

REQUIREMENTS:

Basic understanding of AWS infrastructure and services (e.g., EC2, RDS, ALB, S3, CloudWatch, IAM).
Experience with application and infrastructure monitoring tools (e.g., CloudWatch, AppDynamics, New Relic, Dynatrace, Grafana).
Basic knowledge of databases (e.g., MySQL, PostgreSQL, Oracle, RDS) and the ability to perform simple checks.
Understanding of web applications, APIs, HTTP status codes, and application errors.
Ability to perform log and metric analysis for initial troubleshooting.
Familiarity with ITSM tools for incident management (e.g., ServiceNow, Jira, Remedy).

RESPONSIBILITIES:

Provide Level 1 support for all client-facing applications and platforms, ensuring SLAs are met.
Acknowledge, log, triage, and respond to incoming incidents and alerts.
Perform initial troubleshooting by validating impact, collecting logs/metrics, and executing predefined runbook procedures.
Resolve known issues using documented processes and workarounds.
Escalate complex incidents to L2/L3 teams with complete diagnostic information and context.
Participate in major incident response bridges and provide real-time status updates.
Proactively monitor the health and performance of client-facing web/mobile applications, APIs, and integrated services.
Monitor AWS infrastructure (EC2, RDS, ALB, S3) and databases through CloudWatch for alerts and performance degradation.
Conduct routine application and service health checks.
Identify performance anomalies, error patterns, and network latency issues, escalating as required.
Fine-tune monitoring alerts and thresholds to improve signal clarity and reduce noise.
Perform basic database operational checks (e.g., connectivity, disk usage, backup status).
Validate application functionality and user-reported issues at the L1 level.
Coordinate with application owners, infrastructure teams, and third-party vendors for issue resolution.
Maintain and update knowledge base articles, runbooks, and operational documentation.
Document recurring incidents, known errors, and effective workarounds.
Support root cause analysis (RCA) by providing detailed L1 observations and data.
Identify and suggest opportunities to improve monitoring, alerting, and operational processes.

EDUCATION:

Bachelor’s or master’s degree in Computer Science, Information Technology, or a related field.