At clubcloud, reliability isn’t a feature — it’s a requirement. Our platform supports real-time reservations, member access, and operational workflows that clubs rely on every day. Downtime, silent failures, or delayed incident response simply aren’t acceptable.
To meet that bar, we’ve built a proactive observability and alerting system on AWS that combines CloudWatch metrics and logs, Lambda-based investigation workflows, AI-assisted analysis, and AWS Health events — all designed to detect issues early, provide immediate context, and help us act fast.
This post outlines how we do it, the alarms we’ve put in place, and why this approach matters.
Key takeaways
- CloudWatch is our single source of operational truth (metrics, logs, and events).
- Alarms are tuned for real user impact — not noise.
- When alarms fire, automated Lambda workflows collect context and speed up triage.
- AI reduces cognitive load by summarizing patterns and suggesting likely causes.
Observability First: CloudWatch as the Backbone
CloudWatch is the foundation of our operational visibility. Every critical part of the platform emits structured logs and metrics, including:
- API Gateway requests and responses
- Lambda execution metrics and errors
- Authentication flows tied to Amazon Cognito and Odoo
- Reservation and booking workflows
- Background jobs and async processing
From the beginning, we designed clubcloud on the belief that you can’t build a resilient platform if you can’t clearly see how it’s behaving in real time. Observability is a foundational capability, not something you bolt on after the fact, and logs are first-class operational data rather than something you only look at after a failure. That’s why Amazon CloudWatch sits at the center of our operational architecture.
CloudWatch is AWS’s native monitoring and observability service. It collects and correlates metrics, logs, and events from across the entire AWS ecosystem — from infrastructure to managed services to serverless workloads. Because our platform is built heavily on AWS-native services (Lambda, API Gateway, Cognito, EventBridge, and supporting services), CloudWatch gives us deep visibility without brittle third-party agents or custom instrumentation layers.
A Single Source of Operational Truth
One of the biggest challenges in modern distributed systems is fragmentation: logs in one place, metrics in another, alerts somewhere else. CloudWatch allows us to consolidate these signals into a single operational plane.
At clubcloud, CloudWatch acts as our system of record for:
- Request volume and error rates across APIs
- Lambda execution health, duration, and concurrency
- Authentication flows and identity mapping
- Background workflows and async processing
- Platform-wide performance trends over time
When something happens — good or bad — CloudWatch is where the evidence lives.
Logs as Structured Operational Data
We treat logs as structured, queryable data, not just text streams. Every critical service emits logs that are:
- Consistent in format
- Rich in context (request IDs, user flows, status codes)
- Designed for querying with CloudWatch Logs Insights
This allows us to move beyond reactive debugging and toward active investigation. Instead of asking “what went wrong?”, we can ask:
- When did this start?
- Which endpoints or workflows are affected?
- Is this isolated or systemic?
- Has this pattern occurred before?
Because the logs are already centralized and indexed, those answers are available in seconds — not hours.
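As a rough illustration of what structured, queryable logs can look like, here is a minimal Python sketch that emits one JSON object per log line, plus a Logs Insights query over the resulting fields. The field names (`request_id`, `flow`, `status_code`, `path`), the logger name, and the query itself are assumptions made for this example, not our actual schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Hypothetical context fields; a real service would choose its own schema.
CONTEXT_FIELDS = ("request_id", "flow", "status_code", "path")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Copy over any context passed through `extra=...`.
        for key in CONTEXT_FIELDS:
            value = getattr(record, key, None)
            if value is not None:
                entry[key] = value
        return json.dumps(entry)


def build_logger(name="reservations"):
    """Return a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Example Logs Insights query over the (assumed) JSON fields above:
# which paths produced the most 5XX responses in the selected window?
ERRORS_BY_PATH_QUERY = """
fields @timestamp, request_id, path, status_code
| filter status_code >= 500
| stats count(*) as errors by path
| sort errors desc
| limit 20
"""

if __name__ == "__main__":
    build_logger().error(
        "booking failed",
        extra={"request_id": "req-123", "status_code": 502, "path": "/bookings"},
    )
```

Because Logs Insights discovers fields in JSON log events automatically, a query like the one above can group and filter on them without any extra parsing step.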
Metrics That Reflect Real User Impact
Not all metrics are created equal. Rather than tracking everything, we focus CloudWatch metrics on signals that correlate directly with customer experience:
- Error rates instead of raw invocation counts
- Latency percentiles (such as p95 and p99) and their trends instead of simple averages
- Throttling and saturation indicators instead of theoretical limits
These metrics feed directly into our alarms and investigation workflows, ensuring alerts are meaningful and actionable.
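To make "error rates instead of raw invocation counts" concrete, the sketch below builds the arguments for a CloudWatch alarm that uses metric math to divide Lambda `Errors` by `Invocations`. The function name, alarm name, threshold, and evaluation settings are illustrative assumptions, not our production values.

```python
def lambda_error_rate_alarm(function_name, threshold_pct=1.0):
    """Build keyword arguments for CloudWatch's PutMetricAlarm API.

    The alarm watches errors as a percentage of invocations rather than
    an absolute error count, so it behaves the same at low and high traffic.
    """
    dimensions = [{"Name": "FunctionName", "Value": function_name}]

    def metric(query_id, metric_name):
        # Input series for the expression; not alarmed on directly.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{function_name}-error-rate",  # hypothetical naming scheme
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold_pct,
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 3,  # 3 of 5 minutes must breach
        "TreatMissingData": "notBreaching",  # no traffic is not an incident
        "Metrics": [
            metric("errors", "Errors"),
            metric("invocations", "Invocations"),
            {
                "Id": "error_rate",
                "Expression": "100 * errors / invocations",
                "Label": "Error rate (%)",
                "ReturnData": True,  # the series the alarm evaluates
            },
        ],
    }
```

With boto3 available, the result could be passed as `boto3.client("cloudwatch").put_metric_alarm(**lambda_error_rate_alarm("booking-handler"))`, where `booking-handler` is a made-up function name.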
Native Integration with the AWS Ecosystem
Another reason we’ve standardized on CloudWatch is its tight, native integration with the rest of AWS. This allows us to:
- Trigger Lambda investigations automatically from alarms
- Correlate metrics, logs, and AWS Health events
- Respond to infrastructure-level issues without manual glue code
- Maintain consistency across environments and regions
Because CloudWatch is a managed, AWS-native service, it scales with us and evolves alongside the services we depend on.
The Foundation for Automation and AI
Perhaps most importantly, CloudWatch provides the raw signal that powers everything else we do operationally:
- Automated log investigations
- AI-driven error summarization
- Proactive alerting and escalation
- Post-incident analysis and learning
Without high-quality, centralized observability data, automation and AI become guesswork. With CloudWatch, they become force multipliers.
Alarms We Rely On (and Why They Matter)
We’ve intentionally focused our alarms on signals that indicate real user impact or systemic risk, not noisy thresholds. Below are some examples of the alarms we use.
API Gateway 5XX Error Alarms
What it catches: Backend failures, downstream outages, or unhandled Lambda exceptions
Why it matters: 5XXs almost always correlate with broken user experiences
How we use it: Triggers an immediate investigation with log context around the failure window and sends email and Slack notifications to our operations teams
Lambda Error Rate & Throttling Alarms
What it catches: Code regressions, dependency failures, concurrency exhaustion
Why it matters: Lambdas are our execution backbone — failures cascade quickly
How we use it: Alerts include request IDs, paths, and timestamps so we can pinpoint the root cause quickly. We embed the matching log entries into a custom Lambda email template with links to relevant log streams, groups, and AI investigation results.
Latency & Duration Thresholds
What it catches: Slow downstream services, performance regressions, cold start amplification
Why it matters: Latency often degrades UX before outright failure
How we use it: Helps us act before users report slowness
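One way to express a latency threshold like this is a percentile-based alarm on API Gateway's `Latency` metric. The sketch below assumes a REST API (which reports `ApiName` and `Stage` dimensions); the 1500 ms threshold and the alarm name are placeholders rather than our real settings.

```python
def api_latency_alarm(api_name, stage, p95_ms=1500.0):
    """Build PutMetricAlarm arguments for a p95 latency alarm.

    A percentile statistic surfaces tail latency that an average would hide.
    """
    return {
        "AlarmName": f"{api_name}-{stage}-latency-p95",  # hypothetical name
        "Namespace": "AWS/ApiGateway",
        "MetricName": "Latency",
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Stage", "Value": stage},
        ],
        # Percentiles use ExtendedStatistic; Statistic must then be omitted.
        "ExtendedStatistic": "p95",
        "Period": 60,
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 3,
        "Threshold": p95_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
```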
Authentication & Authorization Anomalies
What it catches: Cognito login failures, identity mapping issues with Odoo
Why it matters: If users can’t log in, the platform is effectively down
How we use it: Correlates auth errors with recent deploys or AWS service changes
Lambda-Driven Log Investigations (Automatically)
When an alarm fires, we don’t want engineers manually digging through logs under pressure. Instead, we use a dedicated investigation Lambda that automatically:
- Identifies the alert window, typically ±5 minutes around the alarm trigger
- Queries CloudWatch Logs Insights
  - Relevant log groups (API Gateway, Lambda, auth services)
  - Filters by status codes, error patterns, request IDs
- Aggregates key findings
  - Error counts
  - Repeated stack traces
  - Correlated request paths or user flows
- Packages the results
  - Direct CloudWatch Logs Insights links
  - Summarized findings
  - Raw excerpts when helpful
- Delivers notifications
  - Email and/or Slack integrations
  - Structured so humans can act immediately
This turns an alert from “something broke” into “here’s what broke, where, and when” — often within seconds.
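The core of such an investigation step can be sketched in a few functions: compute the window, run a Logs Insights query, and aggregate the results. This is a simplified illustration, not our actual Lambda. The query text, the helper names, and the 120-character grouping of messages are assumptions, and `logs_client` is expected to be a boto3 CloudWatch Logs client (`boto3.client("logs")`) or any object with the same two methods.

```python
import time
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical query: recent log lines that look like failures.
ERROR_QUERY = """
fields @timestamp, @message, @logStream
| filter @message like /(?i)(error|exception|timeout)/
| sort @timestamp desc
| limit 200
"""


def alert_window(fired_at, minutes=5):
    """Return (start, end) epoch seconds around the alarm trigger time."""
    delta = timedelta(minutes=minutes)
    return int((fired_at - delta).timestamp()), int((fired_at + delta).timestamp())


def run_query(logs_client, log_groups, start, end, query=ERROR_QUERY, timeout_s=30):
    """Start a Logs Insights query and poll until it finishes or times out."""
    query_id = logs_client.start_query(
        logGroupNames=log_groups,
        startTime=start,
        endTime=end,
        queryString=query,
    )["queryId"]
    deadline = time.monotonic() + timeout_s
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        finished = response["status"] not in ("Scheduled", "Running")
        if finished or time.monotonic() > deadline:
            break  # on timeout we keep whatever partial results exist
        time.sleep(1)
    # Each result row is a list of {"field": ..., "value": ...} pairs.
    return [{cell["field"]: cell["value"] for cell in row}
            for row in response["results"]]


def summarize(rows, top=5):
    """Count matching lines and group them by a truncated message prefix."""
    counts = Counter(row.get("@message", "").strip()[:120] for row in rows)
    return {"error_count": len(rows), "top_messages": counts.most_common(top)}


if __name__ == "__main__":
    start, end = alert_window(datetime.now(timezone.utc))
    print(f"Would query logs from {start} to {end}")
```

Passing the client in as a parameter keeps the logic easy to exercise with a stub, which matters for code that only runs when something is already going wrong.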
AI-Assisted Error Analysis
On top of raw data, we layer CloudWatch’s AI-driven analysis to accelerate understanding:
- Pattern recognition across error logs
- Classification of failures (configuration, dependency, timeout, regression)
- Summaries written in human-readable language
- Suggested next steps or likely causes
This is especially powerful during off-hours or high-traffic periods, when fast clarity matters most.
AI doesn’t replace engineering judgment — it reduces cognitive load so engineers can focus on fixing the problem.
Proactive Monitoring with AWS Health Events
Reactive alerting isn’t enough. Some issues originate outside our codebase. That’s why we also integrate AWS Health events directly into our alerting pipeline.
What We Monitor
- Regional service degradations
- API Gateway, Lambda, or Cognito incidents
- Scheduled maintenance that could impact availability
How We Use It
- AWS Health events trigger notifications automatically
- We correlate them with platform metrics
- We can preemptively:
  - Pause deployments
  - Notify internal teams
  - Prepare mitigations before customers feel impact
This gives us situational awareness beyond our own stack.
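A minimal sketch of this wiring: an EventBridge event pattern that matches AWS Health events for the services we care about, and a small function that reduces an event to the fields a notification needs. The service codes in the pattern, the `pause_deployments` policy, and the function names are assumptions for illustration and should be checked against the AWS Health documentation before use.

```python
import json

# EventBridge pattern for AWS Health events. The service codes listed here
# are assumed values; verify them against the AWS Health event reference.
HEALTH_EVENT_PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
        "service": ["LAMBDA", "APIGATEWAY", "COGNITO"],
        "eventTypeCategory": ["issue", "scheduledChange"],
    },
}


def pattern_json():
    """Serialize the pattern for use as an EventBridge rule's EventPattern."""
    return json.dumps(HEALTH_EVENT_PATTERN)


def summarize_health_event(event):
    """Extract the notification-relevant fields from a Health event."""
    detail = event.get("detail", {})
    descriptions = detail.get("eventDescription") or [{}]
    category = detail.get("eventTypeCategory", "unknown")
    return {
        "service": detail.get("service", "unknown"),
        "region": event.get("region", "unknown"),
        "category": category,
        "event_type": detail.get("eventTypeCode", "unknown"),
        "description": descriptions[0].get("latestDescription", ""),
        # Hypothetical policy: hold deployments during active service issues.
        "pause_deployments": category == "issue",
    }
```

The summary dictionary can then be handed to whatever sends the email or Slack message, and the `pause_deployments` flag gives a deployment pipeline a simple signal to check.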
Why This Matters for clubcloud Customers
All of this exists for one reason: to protect the experience of the clubs and members who rely on us.
- Faster detection means less downtime
- Rich context means faster resolution
- Proactive alerts mean fewer surprises
- AI summaries mean clearer communication internally
In short, we’ve designed our platform to fail loudly, visibly, and informatively — and to recover quickly.
Looking Ahead
We continue to evolve this system with:
- Smarter anomaly detection
- Deeper AI-driven root cause analysis
- Automated remediation where appropriate
- Even tighter integration between metrics, logs, and deployments
Reliability is never “done.” It’s an ongoing discipline — and one we take seriously.