Troubleshooting and Interpreting Results — Lync Server 2013 Stress and Performance Tool
1) Quick checklist before you run tests
- Validate topology: ensure Front-End, Edge, Mediation and PSTN gateways match your production design.
- Clock sync: all test machines, servers, and gateways sync to the same NTP source and stay within 1–2 seconds of each other, so log timestamps can be correlated.
- Certificates & DNS: service certificates valid; internal/external DNS records resolvable by test clients.
- Resources: CPU, memory, disk I/O and NIC interrupts on servers and generators are not saturated.
- Network: verify MTU, QoS, and sufficient bandwidth between load generators and target servers.
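The clock-sync item above can be checked mechanically once each machine's offset has been measured by whatever means you prefer. The following is a minimal sketch, assuming offsets are already collected in seconds; the function name, host names, and the 2-second tolerance are illustrative assumptions, not part of the tool.

```python
def hosts_out_of_sync(offsets_s, tolerance_s=2.0):
    """Return hosts whose clock offset exceeds the tolerance.

    offsets_s maps a host name to its measured offset in seconds
    (positive or negative) relative to the reference time source.
    The 2.0 s default mirrors the upper bound in the checklist.
    """
    return sorted(
        host for host, offset in offsets_s.items()
        if abs(offset) > tolerance_s
    )


# Hypothetical host names and offsets, for illustration only.
measured = {"fe01": 0.12, "edge01": -0.40, "loadgen03": 3.75}
print(hosts_out_of_sync(measured))
```

Any host returned here should be fixed before the run, since skewed clocks make later log correlation unreliable.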
2) Common problems and fixes
- High SIP error rates (4xx/5xx)
  - Cause: misrouted requests, authentication problems, insufficient server capacity, invalid SIP URIs.
  - Fixes: check topology and routing, confirm service account credentials, increase Front-End capacity or reduce simulated user rate, inspect Snooper/centralized logs for exact SIP responses.
- Call setup failures or one-way audio
  - Cause: NAT/firewall blocking RTP, incorrect media ports, codec mismatches, missing SRTP keys.
  - Fixes: open required RTP ports on firewall, validate media bypass and SRTP settings, confirm codecs negotiated in SIP SDP, capture media with Wireshark/Snooper.
- High latency or jitter for media
  - Cause: network congestion, insufficient CPU on media path, virtualization host contention.
  - Fixes: measure path latency and packet loss, enable QoS, move media processors to dedicated hardware or adjust VM resources.
- Address Book, ABS or UC services failing
  - Cause: incorrect ABS web services URLs, auth failures, expired tokens.
  - Fixes: test with Test-CsAddressBookWebQuery, examine Front-End and IIS logs for HTTP errors such as 404, fix certificates and URLs.
- Load generator instability
  - Cause: insufficient generator resources, improper provisioning, DNS/certificate issues for test accounts.
  - Fixes: scale out generators, re-run provisioning with UserProvisioningTool, verify generator machine time and network access.
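To judge whether a SIP error rate is actually "high", it helps to group response codes by class first. The sketch below assumes you have already extracted numeric status codes from your traces; the function name and the sample codes are illustrative, and the extraction step itself is not shown.

```python
from collections import Counter


def sip_error_summary(status_codes):
    """Group SIP status codes by class and compute the 4xx/5xx rate.

    The rate is measured against final responses only (codes >= 200),
    because 1xx responses are provisional and do not end a transaction.
    """
    classes = Counter(f"{code // 100}xx" for code in status_codes)
    finals = sum(1 for code in status_codes if code >= 200)
    errors = classes["4xx"] + classes["5xx"]
    rate_pct = 100.0 * errors / finals if finals else 0.0
    return dict(classes), rate_pct


# Illustrative codes only, not output from a real run.
classes, rate = sip_error_summary([100, 180, 200, 200, 404, 503])
print(classes, rate)
```

A breakdown like this shows quickly whether failures are client-side (4xx, often routing or authentication) or server-side (5xx, often capacity).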
3) Key logs and tools to use
- Centralized Logging + Snooper: primary for SIP dialog analysis and call-flow diagrams.
- LyncPerfTool logs (consolidated.csv, scenario logs): use for aggregated metrics and error counts.
- Windows Performance Monitor (PerfMon): CPU, Memory, Disk Queue Length, and Network Interface counters on Front-End, Mediation, and Edge servers.
- Wireshark: packet-level RTP/SIP troubleshooting, measure jitter/packet loss.
- IIS and Event Viewer: service-level errors, certificate problems, and event IDs.
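If you export per-packet RTP timing from a capture, jitter and loss can be recomputed offline. The following is a rough sketch of the interarrival jitter estimator described in RFC 3550 plus a simple sequence-number loss estimate. It assumes send and receive times are already in milliseconds, and it ignores sequence-number wraparound; the function names are illustrative.

```python
def interarrival_jitter_ms(packets):
    """Running jitter estimate in the style of RFC 3550.

    packets: list of (send_ms, recv_ms) tuples in arrival order.
    Each step moves the estimate 1/16 of the way toward the latest
    absolute change in transit time.
    """
    jitter = 0.0
    prev_transit = None
    for send_ms, recv_ms in packets:
        transit = recv_ms - send_ms
        if prev_transit is not None:
            delta = abs(transit - prev_transit)
            jitter += (delta - jitter) / 16.0
        prev_transit = transit
    return jitter


def loss_percent(seqs):
    """Estimate packet loss from received RTP sequence numbers.

    Assumes no wraparound of the 16-bit sequence counter in the sample.
    """
    if not seqs:
        return 0.0
    expected = max(seqs) - min(seqs) + 1
    received = len(set(seqs))
    return 100.0 * (expected - received) / expected


print(interarrival_jitter_ms([(0, 10), (20, 30), (40, 66)]))
print(loss_percent([1, 2, 4, 5]))
```

Wireshark's own RTP stream analysis reports comparable figures; a script like this is mainly useful for batch-processing many exported streams.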
4) Metrics to inspect and pass/fail guidance
- Success rate: target ≥ 99% for call establishment/IM delivery depending on SLA.
- Average call setup time: baseline against production; a typical target is under 500–1000 ms from SIP INVITE to 200 OK on the same LAN.
- CPU utilization: keep < 70–80% on Front-End during steady-state.
- Memory & handle usage: no steady growth (memory leak) across long runs.
- RTP packet loss/jitter: packet loss < 1–2%, jitter < 30 ms for acceptable voice quality.
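The guidance above can be turned into an automated pass/fail check. This is a minimal sketch: the metric names are hypothetical, and the limits simply take the looser end of each range quoted above, so adjust them to your own SLA and baseline.

```python
# Limits taken from the looser end of the ranges above (an assumption;
# tighten them to match your SLA). "min" means the value must be at
# least the limit, "max" means it must stay strictly below it.
THRESHOLDS = {
    "success_rate_pct": ("min", 99.0),
    "avg_setup_ms": ("max", 1000.0),
    "cpu_pct": ("max", 80.0),
    "packet_loss_pct": ("max", 2.0),
    "jitter_ms": ("max", 30.0),
}


def evaluate(metrics, thresholds=THRESHOLDS):
    """Return a list of human-readable failures; empty means pass."""
    failures = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} is below {limit}")
        elif kind == "max" and value >= limit:
            failures.append(f"{name}: {value} is not below {limit}")
    return failures


# Illustrative numbers only, not results from a real run.
run = {
    "success_rate_pct": 99.4,
    "avg_setup_ms": 620.0,
    "cpu_pct": 85.0,
    "packet_loss_pct": 0.3,
    "jitter_ms": 12.0,
}
print(evaluate(run))
```

Memory and handle growth are not covered here because they are trends over time rather than single values.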
5) Interpreting common LSS outputs
- consolidated.csv: aggregated transaction counts, success/failure counts — sort by failure reason to find hotspots.
- Scenario-level reports: compare different workload mixes (IM vs AV vs conference) to see which workload triggers failures.
- SIP trace call-flow diagrams: follow failing dialog path; identify where 4xx/5xx originate.
- PerfMon timelines vs test timeline: correlate spikes in CPU, disk, or NIC drops with increases in error rates.
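Sorting by failure reason can be scripted when the CSV is large. The sketch below counts failure reasons in CSV text. The column names `Scenario`, `Result`, and `FailureReason` and the sample rows are assumptions for illustration; the actual layout of consolidated.csv may differ, so check the header row and pass the right column names.

```python
import csv
import io
from collections import Counter


def failure_hotspots(csv_text, result_col="Result", reason_col="FailureReason"):
    """Count failure reasons, most frequent first.

    Column names are hypothetical defaults; override them to match
    the header of the file you are analyzing.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(
        row[reason_col]
        for row in reader
        if row[result_col].strip().lower() == "failure"
    )
    return counts.most_common()


# Made-up sample data, not real tool output.
SAMPLE = """Scenario,Result,FailureReason
IM,Success,
AV,Failure,503 Service Unavailable
AV,Failure,503 Service Unavailable
Conference,Failure,401 Unauthorized
"""
print(failure_hotspots(SAMPLE))
```

For a file on disk, read it with `open(path, newline="").read()` and pass the text in.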
6) Triage workflow (fast)
- Reproduce the failing scenario with a small set of users.
- Collect centralized logs + Snooper for the failing time window.
- Correlate LyncPerfTool failure timestamps with PerfMon and network captures.
- Identify the component returning the error (Edge, Front-End, Mediation, or gateway).
- Apply targeted fix (routing, ports, resources, certificates) and re-run.
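The correlation step in this workflow can be sketched as a nearest-earlier-sample lookup: for each failure timestamp, find the most recent PerfMon sample and report its value. This assumes both timelines are already parsed into `datetime` objects on the same clock; the function name, the 30-second gap limit, and the sample data are illustrative.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def cpu_at_failures(failure_times, perf_samples, max_gap=timedelta(seconds=30)):
    """Pair each failure time with the latest counter sample before it.

    perf_samples: list of (datetime, cpu_pct) sorted by time.
    Returns (failure_time, cpu_pct) pairs, with None where no sample
    exists within max_gap before the failure.
    """
    sample_times = [t for t, _ in perf_samples]
    paired = []
    for ft in failure_times:
        i = bisect_right(sample_times, ft) - 1
        if i >= 0 and ft - sample_times[i] <= max_gap:
            paired.append((ft, perf_samples[i][1]))
        else:
            paired.append((ft, None))
    return paired


# Made-up timestamps and values for illustration.
samples = [
    (datetime(2014, 1, 1, 10, 0, 0), 40.0),
    (datetime(2014, 1, 1, 10, 0, 15), 92.0),
]
failures = [datetime(2014, 1, 1, 10, 0, 20)]
print(cpu_at_failures(failures, samples))
```

If most failures line up with high counter values, the problem is likely resource-bound; if not, look at routing, ports, or certificates instead.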
7) Post-test validation
- Run steady-state tests for several hours to spot leaks.
- Compare results to capacity plan and adjust server sizing or QoS as needed.
- Document failing scenarios, root cause, fix applied, and re-test.
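One way to spot a leak in a multi-hour steady-state run is to fit a straight line through periodic memory or handle-count samples and look at the slope. This is a minimal sketch using an ordinary least-squares fit; the function name and sample values are illustrative, and what counts as a worrying slope depends on your servers and run length.

```python
def growth_per_hour(samples):
    """Least-squares slope of a counter over time.

    samples: list of (elapsed_hours, value) pairs, for example
    private bytes in MB sampled during a steady-state run.
    A slope near zero suggests stable usage; a sustained positive
    slope is worth investigating as a possible leak.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must cover more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


# Made-up samples: memory in MB at hourly intervals.
print(growth_per_hour([(0, 100), (1, 110), (2, 120), (3, 130)]))
```

Discard the warm-up period before fitting, since caches filling at the start of a run look like growth.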