We Didn’t Plan for Logging — Our Observability Mistakes in Azure

Author: Explain My Stack Team
Date: July 22, 2022   |   Read time: 4 min

(Image: Azure monitoring dashboard)

We built what we thought was a robust API pipeline on Azure — leveraging best practices in scalability, security, and deployment. But we overlooked one essential aspect that would later haunt us: logging. Months after our go-live, we found ourselves in a reactive spiral, struggling to diagnose performance degradation, troubleshoot issues, and understand user behaviour.

Here’s a breakdown of what we missed, what it cost us, and what we’d do differently if we had the chance to rewind.

🔍 What We Missed

  • App Insights wasn’t wired into all services: Some of our microservices had telemetry configured; others didn’t. The result? Incomplete visibility across our stack. Dependencies, traces, and requests weren’t being monitored uniformly (a minimal wiring sketch follows this list).
  • No structured log format was agreed on: Different teams logged in different formats — or not at all. This made it difficult to correlate events across services or extract meaning from the logs programmatically.
  • Diagnostic settings weren’t enabled by default: Azure provides valuable diagnostic logs for services like API Management, Application Gateway, and Azure Functions. We failed to turn these on early, missing critical events and metrics.
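For illustration, here is roughly what that wiring looks like for a Python service using the Azure Monitor OpenTelemetry distro. This is a minimal sketch, not our production setup: the connection string and the "orders-service" name are placeholders, and the same idea applies with whichever SDK your services actually use. The point is that one uniform instrumentation step per service is what keeps requests, dependencies, and traces visible everywhere.

```python
# Minimal sketch: wiring Application Insights telemetry into a Python service
# with the Azure Monitor OpenTelemetry distro (pip install azure-monitor-opentelemetry).
# The connection string and service name are placeholders.
import logging

from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# One call sets up trace, metric, and log export to Application Insights.
configure_azure_monitor(
    connection_string="InstrumentationKey=00000000-0000-0000-0000-000000000000",
)

tracer = trace.get_tracer("orders-service")   # hypothetical service name
logger = logging.getLogger("orders-service")

def handle_order(order_id: str) -> None:
    # Spans emitted here appear as requests/dependencies in App Insights,
    # so every service instrumented this way shows up in the same traces.
    with tracer.start_as_current_span("handle_order"):
        logger.info("processing order %s", order_id)

if __name__ == "__main__":
    handle_order("demo-123")
```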

⚠️ The Consequences

  • Blind spots during performance incidents: When latency spiked, we had no clear trace of whether delays originated in the application tier, the network, or downstream APIs. Troubleshooting was slow and frustrating.
  • Inability to isolate root causes: Without structured logs or telemetry correlation, every incident became a guessing game. Was it a bug? A configuration issue? A usage spike? We had no answers.
  • Stakeholder frustration: Internally, our stakeholders and leadership couldn’t get clarity on issues, eroding trust in the platform. Externally, users experienced outages we couldn’t explain or prevent from recurring.

✅ What We’d Do Differently

  • Define a logging and observability strategy early: Before writing code, align on how logs, metrics, and traces will be used. Decide what “good” looks like for operational visibility.
  • Use structured logs with correlation IDs: Logging in JSON with consistent fields (timestamp, service, operation ID, correlation ID, and so on) enables effective parsing, filtering, and root-cause tracing; see the logging sketch after this list.
  • Enable diagnostic settings by default: Whether through the Azure Portal, ARM templates, Bicep, or Terraform, ensure diagnostic logging is configured for every resource and sent to a centralized destination such as Log Analytics or a SIEM; see the diagnostic-settings sketch below.
  • Automate and visualize with dashboards: Real-time dashboards, powered by KQL in Log Analytics or built in Grafana or Azure Monitor Workbooks, help surface anomalies and track system health proactively; a sample KQL query is shown below.
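To make the structured-logging point concrete, here is a minimal, standard-library-only Python sketch of JSON logs that carry a correlation ID. The field names (service, operation, correlation_id) and the "orders-service" name are illustrative choices, not a schema we prescribe; what matters is that every service agrees on the fields and propagates the same ID.

```python
# Minimal sketch: structured JSON logging with a correlation ID, using only
# the standard library. Field names are illustrative, not a prescribed schema.
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "orders-service",                      # hypothetical service name
            "operation": getattr(record, "operation", None),
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The same correlation_id is attached to every log line for one request, so
# events can be joined across services that propagate the same ID.
correlation_id = str(uuid.uuid4())
logger.info(
    "order accepted",
    extra={"operation": "create_order", "correlation_id": correlation_id},
)
```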
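We would normally enable diagnostic settings declaratively with ARM, Bicep, or Terraform; purely as an illustration, the sketch below does the same thing imperatively with the azure-mgmt-monitor Python SDK. The subscription ID, resource ID, workspace ID, and log category are placeholders, and category names vary by resource type.

```python
# Sketch: enabling a diagnostic setting on a resource and routing it to a
# Log Analytics workspace with the azure-mgmt-monitor SDK. All IDs below are
# placeholders; an IaC template (ARM/Bicep/Terraform) achieves the same result.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

SUBSCRIPTION_ID = "<subscription-id>"                    # placeholder
RESOURCE_ID = "<resource-id-of-apim-or-app-gateway>"     # placeholder
WORKSPACE_ID = "<log-analytics-workspace-resource-id>"   # placeholder

client = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

client.diagnostic_settings.create_or_update(
    resource_uri=RESOURCE_ID,
    name="send-to-log-analytics",
    parameters={
        "workspace_id": WORKSPACE_ID,
        # Category names differ per resource type; "GatewayLogs" is an
        # example category for API Management.
        "logs": [{"category": "GatewayLogs", "enabled": True}],
        "metrics": [{"category": "AllMetrics", "enabled": True}],
    },
)
```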
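And here is the kind of KQL that might back a latency tile, run here through the azure-monitor-query package so it can be scripted or tested. The AppRequests table, DurationMs column, and AppRoleName field are the workspace-based Application Insights names; adjust the query to whatever your schema actually is.

```python
# Sketch: a latency query of the kind that powers a dashboard tile, executed
# with the azure-monitor-query package. Table/column names assume
# workspace-based Application Insights; the workspace ID is a placeholder.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-guid>"  # placeholder

QUERY = """
AppRequests
| summarize p95_ms = percentile(DurationMs, 95), requests = count()
    by bin(TimeGenerated, 5m), AppRoleName
| order by TimeGenerated asc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(
    workspace_id=WORKSPACE_ID,
    query=QUERY,
    timespan=timedelta(hours=24),
)

# Print each 5-minute bucket of p95 latency per service role.
for table in response.tables:
    for row in table.rows:
        print(row)
```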

📌 Key Takeaway

You can’t fix what you can’t see. Observability is not a nice-to-have — it’s a first-class citizen in any cloud-native system. Treat it with the same priority as your deployment pipeline or security posture.

Logging isn’t just about debugging; it’s about enabling insight, accountability, and confidence in the systems you build. Don’t let it be an afterthought — or you’ll learn the hard way, like we did.