Top 7 Metrics to Track in SolarWinds Exchange Monitor
Monitoring Microsoft Exchange with SolarWinds Exchange Monitor (or the Exchange-specific modules in other SolarWinds products) helps keep mail flow healthy, reduces downtime, and lets you spot capacity and configuration issues early. Below are the seven most valuable metrics to track, why each matters, recommended thresholds or alerting guidance, and quick remediation steps.
1. Mail Queue Length
- Why it matters: Long queues indicate delivery problems, transport bottlenecks, or downstream server issues. Growing queues correlate with delayed mail delivery and user complaints.
- Suggested thresholds: Alert at sustained queue length > 50 messages for 10 minutes (adjust by environment size).
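- Remediation: Verify that the transport services are running and check disk and CPU on the transport servers (Hub Transport, Mailbox, or Edge, depending on Exchange version). Inspect Queue Viewer for stuck messages and NDRs, then clear or reroute problematic messages once you have investigated the cause.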
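A "sustained" threshold should fire only when the queue stays high for the whole window, not on a single spike. The following is a minimal Python sketch of that check, assuming queue-length samples have already been exported as (timestamp, value) pairs. The function name and data shape are illustrative and are not part of any SolarWinds API.

```python
from datetime import datetime, timedelta

def sustained_breach(samples, threshold=50, window=timedelta(minutes=10)):
    """Return True if every sample in the trailing window exceeds threshold.

    samples: list of (timestamp, queue_length) tuples sorted by timestamp.
    Returns False when the history does not yet cover the full window.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    start = latest - window
    # The oldest sample must reach back at least to the start of the window.
    if samples[0][0] > start:
        return False
    recent = [value for ts, value in samples if ts >= start]
    return all(value > threshold for value in recent)
```

Requiring the history to span the full window avoids alerting right after a collector restart, when only a few samples exist.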
2. Mail Flow Latency (Delivery Time)
- Why it matters: Measures the time from submission to delivery, which is critical for SLA compliance and user experience.
- Suggested thresholds: Alert if average delivery time exceeds 2 minutes for internal mail or 10 minutes for external mail (tune per SLA).
- Remediation: Identify slow transport hops, check DNS resolution and MX records, examine connector health and antivirus/antispam scanning delays.
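Because internal and external mail have different limits, the average has to be computed per category. A short Python sketch, assuming delivery times are available as (kind, seconds) pairs; the names and the input shape are hypothetical, and the limits mirror the 2-minute and 10-minute suggestions above.

```python
from statistics import mean

# Limits in seconds, taken from the suggested thresholds; tune per SLA.
LIMITS_SECONDS = {"internal": 120, "external": 600}

def latency_breaches(deliveries, limits=LIMITS_SECONDS):
    """Return {kind: average_seconds} for each kind whose average exceeds its limit.

    deliveries: iterable of (kind, seconds) pairs, e.g. ("internal", 95).
    Kinds without a configured limit are ignored.
    """
    by_kind = {}
    for kind, seconds in deliveries:
        by_kind.setdefault(kind, []).append(seconds)
    averages = {kind: mean(values) for kind, values in by_kind.items()}
    return {kind: avg for kind, avg in averages.items()
            if kind in limits and avg > limits[kind]}
```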
3. Active Database Copy Status / Database Copies Health
- Why it matters: Confirms that high availability through Database Availability Groups (DAGs) is actually working. Unhealthy copies risk data loss and failed failovers.
- Suggested thresholds: Alert on any passive copy with CopyQueueLength > 0 for prolonged periods, or on any copy whose status is not Healthy (Mounted for the active copy).
- Remediation: Check replication network, reseed databases if corrupted, investigate lossy network segments, ensure log shipping isn’t blocked by disk space.
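As an illustration of the copy-health rule, the sketch below flags copies from a single snapshot. It assumes the copy status has been exported to a list of dictionaries with `Name`, `Status`, and `CopyQueueLength` keys (for example, from `Get-MailboxDatabaseCopyStatus` output converted to JSON). That input shape is an assumption, and the "prolonged periods" condition is left out for brevity.

```python
# Statuses treated as normal: Mounted for the active copy, Healthy for passive copies.
OK_STATUSES = {"Healthy", "Mounted"}

def unhealthy_copies(copies, max_copy_queue=0):
    """Return a list of (copy_name, reasons) for copies that need attention."""
    problems = []
    for copy in copies:
        reasons = []
        if copy["Status"] not in OK_STATUSES:
            reasons.append(f"status is {copy['Status']}")
        if copy["CopyQueueLength"] > max_copy_queue:
            reasons.append(f"CopyQueueLength is {copy['CopyQueueLength']}")
        if reasons:
            problems.append((copy["Name"], reasons))
    return problems
```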
4. Mailbox Store Size and Growth Rate
- Why it matters: Prevents storage exhaustion, maintains performance, and supports capacity planning.
- Suggested thresholds: Alert when mailbox database size reaches 75–85% of allocated capacity or growth exceeds expected rate (e.g., >5% monthly).
- Remediation: Archive old mail, enable retention policies, increase storage, move large mailboxes to separate databases.
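The capacity and growth checks are simple arithmetic. Here is a hedged Python sketch using the lower bound of the 75–85% range and the 5% monthly example from above. The function name and parameters are illustrative, and sizes can be in any unit as long as it is consistent.

```python
def capacity_report(size_gb, allocated_gb, size_last_month_gb,
                    warn_pct=75.0, max_growth_pct=5.0):
    """Compute used capacity and month-over-month growth for one database."""
    used_pct = 100.0 * size_gb / allocated_gb
    growth_pct = 100.0 * (size_gb - size_last_month_gb) / size_last_month_gb
    return {
        "used_pct": round(used_pct, 1),
        "growth_pct": round(growth_pct, 1),
        "capacity_alert": used_pct >= warn_pct,
        "growth_alert": growth_pct > max_growth_pct,
    }
```

For example, a database at 800 GB of an allocated 1,000 GB that was 750 GB a month earlier is 80% full and has grown by about 6.7%, so both alerts would fire.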
5. CPU and Memory Utilization on Mailbox and Transport Servers
- Why it matters: High resource utilization degrades transport, indexing, and client access services.
- Suggested thresholds: Alert if CPU > 80% or memory consumed > 85% for sustained periods (5–10 minutes).
- Remediation: Identify runaway processes (e.g., indexing), optimize antivirus exclusions, scale out with additional servers or upgrade hardware, tune concurrent processes.
6. Client Access Service Health (OWA/EAS/Outlook RPC/REST)
- Why it matters: Directly affects end-user access. Login failures or slow OWA responses quickly turn into support tickets.
- Suggested thresholds: Alert on a failed authentication rate of 5% or more, response times > 2s for OWA/API endpoints, or service outages.
- Remediation: Check IIS application pools, certificate validity, Autodiscover configuration, load balancer health and SSL offload settings.
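The two numeric thresholds for client access can be evaluated together. This is a minimal sketch and not a SolarWinds feature: the counters are assumed to come from whatever log or probe data you already collect, and using the mean response time (rather than a percentile) is a simplifying assumption.

```python
from statistics import mean

def client_access_alerts(attempts, failures, response_times_s,
                         max_fail_rate=0.05, max_response_s=2.0):
    """Return the names of the client-access conditions that are breached."""
    alerts = []
    # Failed authentication rate of 5% or more.
    if attempts > 0 and failures / attempts >= max_fail_rate:
        alerts.append("auth_failure_rate")
    # Average response time above 2 seconds.
    if response_times_s and mean(response_times_s) > max_response_s:
        alerts.append("slow_response")
    return alerts
```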
7. Transport Service Errors and Rejections (NDRs / SMTP Errors)
- Why it matters: High error or rejection rates indicate misconfiguration, spam/blacklist issues, or upstream problems.
- Suggested thresholds: Alert when SMTP 4xx/5xx errors spike above baseline or NDR rate increases by >50% over normal.
- Remediation: Review SMTP logs, verify connector authentication and TLS settings, check reputation/blacklist status, adjust anti-spam tuning.
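"Increases by >50% over normal" only makes sense relative to a measured baseline. A small Python sketch of that comparison, assuming NDR counts per interval (hourly, for instance) are already available. The names are hypothetical, and using the mean as the baseline is one reasonable choice among several.

```python
from statistics import mean

def ndr_spike(baseline_counts, current_count, max_increase=0.5):
    """Return True if current_count exceeds the baseline mean by more than max_increase.

    baseline_counts: NDR counts from comparable past intervals.
    max_increase: fractional increase that triggers the alert (0.5 = 50%).
    """
    if not baseline_counts:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(baseline_counts)
    if baseline == 0:
        return current_count > 0
    return (current_count - baseline) / baseline > max_increase
```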
Quick Implementation Tips
- Baseline first: Collect metric baselines for 7–14 days before setting strict thresholds. Alerting tuned to actual behavior reduces noise.
- Use composite alerts: Combine related metrics (e.g., queue length + CPU) to reduce false positives.
- Tag by role/environment: Apply different thresholds to production vs. test systems, and to servers that host high-volume mailboxes.
- Automate runbooks: Link alerts to step-by-step remediation playbooks so responders act consistently.
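The first two tips can be combined: derive each threshold from its baseline, then alert only when all related metrics breach together. The sketch below is illustrative only. Setting the threshold at the mean plus three standard deviations is an assumption, not a SolarWinds default, and the metric names are hypothetical.

```python
from statistics import mean, pstdev

def baseline_threshold(samples, k=3.0):
    """Threshold set k population standard deviations above the baseline mean."""
    return mean(samples) + k * pstdev(samples)

def composite_alert(readings, thresholds):
    """True only when every metric listed in thresholds exceeds its threshold.

    readings:   {"queue_length": 60, "cpu_pct": 90.0}
    thresholds: {"queue_length": 50, "cpu_pct": 80.0}
    """
    return bool(thresholds) and all(
        readings[name] > limit for name, limit in thresholds.items()
    )
```

Requiring every condition at once trades some sensitivity for fewer false positives, so keep single-metric alerts for conditions that are critical on their own.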
Tracking these seven metrics in SolarWinds Exchange Monitor gives you coverage across delivery, availability, performance, capacity, and user experience. Tune thresholds to your environment and iterate based on incident history.