Troubleshooting the Stack: A Layer 1 to Layer 7 Field Guide

"It's slow" and "it's down" are Layer 8 statements. Your job is to translate them into a layer, because every layer has different failure signatures, different tools, and a different team that owns the fix. The OSI model gets taught as trivia for cert exams, but its real value is as a debugging discipline: start at the bottom (or bisect), rule out each layer with evidence, and never skip a layer because you assume it's fine.

Here's the walk, with what each layer actually looks like when it breaks.

Layer 1 — Physical

The question: is there signal on the wire?

Bad optics, marginal cables, dirty fiber, duplex mismatches, PoE budget exhaustion. L1 problems are sneaky because they're often partial — the link is up, but errors accumulate.

Failure signatures: CRC/FCS errors incrementing, flapping interfaces, traffic passing in only one direction, speed negotiated below expected (a 1G copper port linking at 100M is almost always a bad pair, because 1000BASE-T needs all four pairs and 100BASE-TX needs only two).

show interfaces ge-0/0/12 extensive   # Junos: look at error counters, PMA/PMD
show interfaces gi1/0/12 | i error|duplex|rate   # IOS
ethtool eth0                           # Linux: speed/duplex/link
show interfaces diagnostics optics     # Junos: optic light levels; know your dBm budget

Modern platforms help: Mist's Marvis flags bad cables from PHY error patterns on EX switches without you asking. But the fundamentals don't change — if errors climb with traffic, suspect the physical path first.
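
To make "errors climb with traffic" concrete, sample the interface counters twice and compare the deltas. This is a minimal sketch, assuming you already have two snapshots of received frames and CRC errors (from SNMP, a CLI scrape, or similar). The field names and the 1e-6 threshold are illustrative choices, not a vendor standard.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One snapshot of an interface's receive counters (hypothetical field names)."""
    rx_frames: int
    crc_errors: int


def crc_error_ratio(before: CounterSample, after: CounterSample) -> float:
    """Return CRC errors per received frame between two snapshots."""
    frames = after.rx_frames - before.rx_frames
    errors = after.crc_errors - before.crc_errors
    if frames < 0 or errors < 0:
        raise ValueError("counters went backwards (reset or wrap); resample")
    if frames == 0:
        return 0.0
    return errors / frames


def looks_physical(before: CounterSample, after: CounterSample,
                   threshold: float = 1e-6) -> bool:
    """Flag the link when the error ratio exceeds the assumed threshold."""
    return crc_error_ratio(before, after) > threshold


if __name__ == "__main__":
    idle = CounterSample(rx_frames=1_000_000, crc_errors=12)
    busy = CounterSample(rx_frames=6_000_000, crc_errors=512)
    print(crc_error_ratio(idle, busy), looks_physical(idle, busy))
```

The absolute counter value tells you little; a port can carry years of history. The ratio over a window you control is what separates an old scar from a live wound.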

Layer 2 — Data Link

The question: can frames reach the next hop?

VLANs, MAC learning, spanning tree, LACP, ARP (the L2/L3 seam). The classic failures: VLAN missing from a trunk, native VLAN mismatch, STP blocking a port you thought was forwarding, a loop melting the broadcast domain, two devices claiming one MAC.

Failure signatures: works on one switch but not across the trunk; everything on a VLAN unreachable; the network dies site-wide when someone plugs a desk switch in backwards (loop); MAC flapping log messages.

show ethernet-switching table          # Junos MAC table
show vlans / show spanning-tree
show interfaces trunk                  # IOS: is the VLAN actually allowed?
show lldp neighbors                    # is this port even connected to what you think?
ip neigh / arp -a                      # host-side ARP sanity

LLDP is the most underrated tool at this layer — half of "L2 problems" are actually "that cable doesn't go where the spreadsheet says."
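
The spreadsheet-versus-reality check is easy to automate once both sides are in the same shape. A rough sketch, assuming you have already parsed the documentation and the LLDP neighbor output into dictionaries; the device and port names below are made up for illustration.

```python
def cabling_mismatches(documented, observed):
    """Compare documented port mappings with LLDP observations.

    Both arguments map a local port to a (neighbor name, neighbor port) tuple.
    Returns local port -> (documented, observed) for every disagreement;
    a port missing on one side shows up as None on that side.
    """
    mismatches = {}
    for port in sorted(set(documented) | set(observed)):
        want = documented.get(port)
        seen = observed.get(port)
        if want != seen:
            mismatches[port] = (want, seen)
    return mismatches


if __name__ == "__main__":
    # Hypothetical data: what the spreadsheet claims vs. what LLDP reports.
    spreadsheet = {
        "ge-0/0/1": ("core-sw1", "xe-0/0/10"),
        "ge-0/0/2": ("ap-lobby", "eth0"),
    }
    lldp = {
        "ge-0/0/1": ("core-sw1", "xe-0/0/10"),
        "ge-0/0/2": ("printer-3f", "eth0"),
        "ge-0/0/3": ("ap-lobby", "eth0"),
    }
    for port, (want, seen) in cabling_mismatches(spreadsheet, lldp).items():
        print(f"{port}: documented {want}, observed {seen}")
```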

Layer 3 — Network

The question: is there a route, in both directions?

IP addressing, subnetting, routing protocols (OSPF, BGP), NAT, MTU. Remember: return path matters. Traffic that arrives fine but replies via a path with no route back looks identical to "down" from the client.
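
The return-path rule can be checked on paper before touching a router. Here is a small sketch using Python's standard ipaddress module to do longest-prefix matching against two route lists. It is a deliberate simplification: it models one table on each side, whereas a real path has a table at every hop, and the prefixes are invented for the example.

```python
import ipaddress


def best_route(routes, destination):
    """Longest-prefix match: most specific prefix covering destination, or None."""
    dst = ipaddress.ip_address(destination)
    networks = [ipaddress.ip_network(r) for r in routes]
    matches = [n for n in networks if dst in n]
    return max(matches, key=lambda n: n.prefixlen, default=None)


def path_check(client_ip, client_side_routes, server_ip, server_side_routes):
    """Return (forward_ok, reverse_ok) for a client/server pair."""
    forward = best_route(client_side_routes, server_ip)
    reverse = best_route(server_side_routes, client_ip)
    return forward is not None, reverse is not None


if __name__ == "__main__":
    # Hypothetical tables: the client side has a default route, the server side does not.
    client_routes = ["10.0.0.0/8", "0.0.0.0/0"]
    server_routes = ["192.168.0.0/16"]
    print(path_check("10.1.2.3", client_routes, "192.168.5.10", server_routes))
```

A forward match with no reverse match is exactly the case where the request arrives and the reply has nowhere to go.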

Failure signatures: TTL exceeded, destination unreachable, asymmetric behavior (ping works one way), traceroute dying at a consistent hop, big packets failing while small ones pass (MTU/fragmentation — the VPN classic).

ping / traceroute / mtr               # mtr = traceroute with statistics over time
ping -s 1472 -M do <dst>              # Linux DF-bit test: hunt the MTU hole
show route <prefix>                    # what does the router actually know?
show bgp summary / show ospf neighbor  # are the protocols even adjacent?
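
The 1472 in the DF-bit ping is not magic: it is a 1500-byte MTU minus a 20-byte IPv4 header (no options) and an 8-byte ICMP echo header. The sketch below shows that arithmetic plus a binary search for the largest size that passes. The probe function is a stand-in you would supply yourself (for example, a wrapper around the ping command), and the 576 to 1500 search range is an assumption.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def ping_payload_for_mtu(mtu):
    """Payload size to pass to ping -s so the whole IPv4 packet is mtu bytes."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def find_path_mtu(probe, low=576, high=1500):
    """Binary-search the largest packet size for which probe(size) succeeds.

    probe is a caller-supplied function returning True when a DF-bit packet
    of that total size gets through. Returns None if even `low` fails.
    """
    if not probe(low):
        return None
    # Invariant: probe(low) is known to succeed.
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


if __name__ == "__main__":
    # Simulated path with a 1400-byte hole somewhere in the middle.
    simulated = lambda size: size <= 1400
    mtu = find_path_mtu(simulated)
    print(mtu, ping_payload_for_mtu(mtu))
```

Bisecting takes about ten probes across that range, which beats stepping down by hand while someone waits on the bridge call.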

Cloud makes L3 weirder, not simpler: route tables per subnet, TGW attachments, NAT gateways, and security groups (strictly an L3/L4 filter) are all network-layer constructs that fail without a single cable involved. VPC Reachability Analyzer is traceroute for a network you can't touch, although it reasons over your configuration rather than sending probe packets.

Layer 4 — Transport

The question: does the connection establish, and does it stay healthy?

TCP handshakes, ports, firewalls, retransmissions, windowing; UDP and its indifference to your feelings. This is the layer where "the network is fine" (ping works!) and "the app is broken" (the SYN dies at a firewall) get confused, because ICMP and TCP take different policy paths.

Failure signatures: connection timed out (silent drop — usually a firewall) vs. connection refused (RST — host reachable, service absent: actually good news). Slow-but-working transfers with high retransmit counts point to loss somewhere below.

nc -vz host 443                        # does the port answer at all?
ss -tnip                               # Linux: retransmits, cwnd, rtt per socket
tcpdump -ni eth0 'tcp port 443 and (tcp[tcpflags] & (tcp-syn|tcp-rst) != 0)'
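
The timed-out versus refused distinction is easy to script when you need to test many hosts. A minimal sketch using only the standard socket module; the wording of the verdicts and the 3-second timeout are my own choices.

```python
import socket


def verdict_for(exc):
    """Map the outcome of a connect attempt to the failure signature it suggests."""
    if exc is None:
        return "open: handshake completed"
    if isinstance(exc, ConnectionRefusedError):
        return "refused: host reachable, nothing listening (RST)"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timed out: silent drop, suspect a firewall or a missing return route"
    return f"other: {exc!r}"


def probe(host, port, timeout=3.0):
    """Attempt one TCP connect and classify the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return verdict_for(None)
    except OSError as exc:  # covers refused, timeout, and name-resolution errors
        return verdict_for(exc)


if __name__ == "__main__":
    print(probe("127.0.0.1", 443))
```

Name-resolution failures land in the "other" bucket on purpose: that is not an L4 problem, and the message should send you to DNS instead.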

In a packet capture, the diagnosis is usually one glance: SYN with no SYN-ACK (filtered or no return route), SYN-ACK with client RST (client-side security software), or a healthy handshake followed by retransmit storms (loss/MTU below).
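
Those one-glance diagnoses can be written down as a decision procedure. The sketch below assumes some other tool has already reduced one flow to an ordered list of labels. The label names and the "three retransmits" cutoff are hypothetical, chosen only to make the logic visible.

```python
def diagnose_handshake(flags):
    """Classify the start of one TCP flow from its ordered flag labels.

    flags is a list such as ["SYN", "SYN-ACK", "ACK"], oldest first.
    "RETRANS" marks a retransmitted segment (an assumed label).
    """
    if not flags or flags[0] != "SYN":
        return "capture does not start at the handshake"
    if "SYN-ACK" not in flags:
        if "RST" in flags:
            return "refused: host reachable, service absent"
        return "no SYN-ACK: filtered, or no return route"
    after_synack = flags[flags.index("SYN-ACK") + 1:]
    if after_synack[:1] == ["RST"]:
        return "client reset after SYN-ACK: check client-side security software"
    if after_synack.count("RETRANS") >= 3:
        return "handshake fine, then retransmits: loss or MTU below"
    return "healthy handshake"


if __name__ == "__main__":
    print(diagnose_handshake(["SYN", "SYN", "SYN"]))
    print(diagnose_handshake(["SYN", "SYN-ACK", "ACK", "RETRANS", "RETRANS", "RETRANS"]))
```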

Layer 5 & 6 — Session and Presentation

The question: do the two ends agree on how to talk?

In modern practice these layers collapse mostly into TLS: certificate validity, chains, SNI, protocol/cipher agreement — plus encoding issues and session-level constructs like SIP dialogs or SMB sessions.

Failure signatures: works in one browser/client but not another (cipher or trust-store differences), works by IP but not by name (SNI/cert mismatch), "certificate expired" at 00:00 UTC on a very predictable date, API clients failing after a TLS-inspection proxy (like a ZIA deployment) was inserted and the corporate root CA isn't trusted.

openssl s_client -connect host:443 -servername host   # the whole story: chain, SNI, protocol
curl -vI https://host                                  # quick client-view check

If you run TLS inspection anywhere (Zscaler, NGINX re-encryption), keep a bypass list and a trust-deployment story — half your "L7 outages" will actually be L6 trust failures.
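
Certificate expiry is the one failure at this layer you can see coming. A short sketch with Python's standard ssl module: connect with SNI set, read the leaf certificate's notAfter field, and convert it to days remaining. The function names are mine. Note that the default context verifies the chain, so an expired or untrusted certificate raises an error before you ever read the date, which is itself the diagnosis.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect with SNI set to host and return the certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days between now (epoch seconds) and a notAfter string such as
    'Jan  5 09:34:43 2018 GMT'. Negative means already expired."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


if __name__ == "__main__":
    # app.example.com is a placeholder; substitute a host you operate.
    not_after = fetch_not_after("app.example.com")
    print(f"{days_until_expiry(not_after):.1f} days left")
```

Run something like this from a synthetic check behind your inspection proxy as well as outside it, since the two vantage points see different certificates.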

Layer 7 — Application

The question: is the application doing the right thing with a working connection?

HTTP status codes, DNS (functionally an L7 service, and the cause of an embarrassing fraction of all outages), APIs, proxies, load balancer routing.

Failure signatures: 4xx (client/request problem — auth, path, host header), 5xx (server side — read the backend logs), 502/504 specifically meaning "your proxy couldn't get a good answer from upstream" (go look at the upstream, not the proxy), DNS returning stale/missing/split-horizon-wrong answers.

dig +short app.example.com             # and: dig @8.8.8.8 vs. internal resolver
curl -v https://app/health             # follow the request like a packet
# NGINX-side: $upstream_response_time vs $request_time in the access log
#   — upstream slow, or client slow? The log already knows.
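
Comparing those two NGINX variables is mechanical enough to script. The sketch below assumes you have already pulled $request_time and $upstream_response_time out of a log line. The upstream field can be "-" when no upstream was contacted, and can hold several comma- or colon-separated values when more than one upstream was tried. The 80% share threshold is an arbitrary starting point, not an NGINX default.

```python
def total_upstream_time(field):
    """Sum an $upstream_response_time value; None when NGINX logged '-'."""
    parts = [p.strip() for p in field.replace(":", ",").split(",")]
    times = [float(p) for p in parts if p and p != "-"]
    return sum(times) if times else None


def where_is_the_time(request_time, upstream_field, share=0.8):
    """Say which side of the proxy accounts for most of the request time."""
    upstream = total_upstream_time(upstream_field)
    if upstream is None:
        return "no upstream contacted: answered or rejected by the proxy itself"
    if request_time <= 0:
        return "too fast to matter"
    if upstream / request_time >= share:
        return "upstream slow"
    return "client or proxy side slow"


if __name__ == "__main__":
    print(where_is_the_time(2.0, "1.9"))
    print(where_is_the_time(2.0, "0.05"))
```

A large gap between the two numbers usually means a slow client connection or a large response body draining slowly, so the backend team can stand down.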

The Meta-Skills

Bisect instead of walking. With a working mental model, split the stack: curl succeeding rules out L1–L6 in one command. Ping succeeding rules out L1–L3 (mostly). Choose tests that eliminate half the stack at a time.
Both directions, always. Requests and replies can take different paths, hit different firewalls, and fail independently.
"Works by IP, fails by name" = DNS or TLS/SNI. "Small works, big fails" = MTU. "Intermittent by flow" = ECMP hashing across one bad path. Learn the signatures; they're stable across decades.
Instrument before you're sick. Interface error baselines, flow data, synthetic checks, and structured proxy logs (the observability stack I've written about elsewhere — New Relic NPM, Mist SLEs, Meraki uplink history, ZDX) turn this entire article from a manual process into a dashboard that points at the failing layer for you.
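
The bisect idea is ordinary binary search over an ordered list of checks. A sketch under one loud assumption: failures are monotonic, meaning a passing check at some layer implies everything beneath it works. That is the same "(mostly)" caveat as the ping example, since a firewall can block ICMP while TCP sails through. The check functions here are placeholders for whatever tests you trust.

```python
def lowest_failing_layer(checks):
    """Bisect an ordered list of (layer_name, check) pairs, lowest layer first.

    Each check is a zero-argument callable returning True on success.
    Returns the name of the lowest failing layer, or None if the top check passes.
    """
    if not checks or checks[-1][1]():
        return None
    low, high = 0, len(checks) - 1
    # Invariant: checks[high] fails; every index below `low` is known to pass.
    while low < high:
        mid = (low + high) // 2
        if checks[mid][1]():
            low = mid + 1
        else:
            high = mid
    return checks[high][0]


if __name__ == "__main__":
    # Simulated stack where everything from L4 upward is broken.
    names = ["L1", "L2", "L3", "L4", "L5/6", "L7"]
    stack = [(name, (lambda i=i: i < 3)) for i, name in enumerate(names)]
    print(lowest_failing_layer(stack))
```

Testing the top of the stack first is the cheap win: if the application-level check passes, you are done in one call.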

The stack is also a career map — L1 cabling to L7 application behavior spans network tech, network engineer, security engineer, SRE. You don't have to master every layer at once. You do have to respect that all of them exist, because the packet doesn't care which ones you skipped.

Related Reading

Understanding TCP/IP: The Foundation of Internet Communication

Juniper Mist and the EX4000 Series: AI-Driven Campus Switching

Cisco Meraki MX85: SD-WAN, AutoVPN, and Branch Observability