Self-telemetry troubleshooting

LinkMesh renders per-component throughput by reading the collector’s own self-telemetry — Alloy’s prometheus.exporter.self, otelcol-contrib’s otelcol_* metrics. If the topology canvas shows the collector but the edges are blank, one link in that chain isn’t delivering. Work through this list top-down; each section diagnoses one specific failure mode.

Symptom: topology canvas shows the collector but no throughput numbers

The most common cause is that the collector hasn’t started pushing its own_metrics yet, or those pushes aren’t reaching the LinkMesh OTLP receiver.

Check 1: did the collector receive an own_metrics offer?

For otelcol-contrib via OpAMP, the LinkMesh server pushes a one-time own_metrics offer immediately after enrollment. The collector’s Events tab should show an own_metrics_offered event within ~5 seconds of the registered event.

If the event is missing, the collector enrolled before LinkMesh server v1.101 (the release that introduced own_metrics offers). Re-enroll the collector to trigger a fresh offer:

  1. Stop the collector + agent on the host
  2. In the LinkMesh UI: Collectors → … → Deregister
  3. Mint a fresh token + re-run the install one-liner

For Alloy via remotecfg, there is no separate “offer” event — the bootstrap config that linkmesh-agent wrote already includes the own_metrics exporter. Skip to Check 2.

Check 2: is the collector actually pushing OTLP?

On the host, watch the collector’s logs for OTLP send activity:

Terminal window
# Alloy
sudo journalctl -u alloy -f | grep -i 'otelcol.exporter.otlphttp'
# otelcol-contrib
sudo journalctl -u otelcol-contrib -f | grep -i 'metrics exporter'

Expected: periodic lines about successful exports every ~30 seconds.
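
To put a number on "periodic", you can count matching lines over a fixed window instead of watching the live tail. The helper below is a minimal sketch: count_matching_lines is a hypothetical name, and the idea that each export logs exactly one matching line is an assumption, so treat the count as a rough signal only.

```shell
# Sketch: count log lines on stdin that match a case-insensitive pattern.
# count_matching_lines is a hypothetical helper, not part of LinkMesh.
count_matching_lines() {
  # grep -c prints the number of matching lines; it exits 1 when that
  # number is 0, so mask the exit status to keep callers simple.
  grep -ci -- "$1" || true
}

# Usage on the collector host (same patterns as the commands above):
#   sudo journalctl -u alloy --since '5 minutes ago' --no-pager \
#     | count_matching_lines 'otelcol.exporter.otlphttp'
#   sudo journalctl -u otelcol-contrib --since '5 minutes ago' --no-pager \
#     | count_matching_lines 'metrics exporter'
```

With one export roughly every 30 seconds, a five-minute window should give a count on the order of ten. A count of zero points at the bootstrap config.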

If you see no OTLP activity at all, the bootstrap config didn’t land correctly. Check the file exists and looks right:

Terminal window
# Alloy
sudo head -n 5 /etc/alloy/config.alloy
# the first line should start with: // linkmesh-bootstrap
# otelcol-contrib (OpAMP-managed): config lives under the supervisor's
# remote-config directory rather than a static file
sudo find /var/lib/otelcol-contrib -name 'effective_config*'
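
The Alloy marker check can also be scripted, for example when checking many hosts. This is a sketch only: has_bootstrap_marker is a hypothetical helper, and it tests nothing beyond the first-line marker described above.

```shell
# Sketch: succeed (exit 0) only if the first line of the given file starts
# with the "// linkmesh-bootstrap" marker. Hypothetical helper name.
has_bootstrap_marker() {
  head -n 1 "$1" | grep -q '^// linkmesh-bootstrap'
}

# Usage (run as a user that can read the file, e.g. from a root shell):
#   if has_bootstrap_marker /etc/alloy/config.alloy; then
#     echo "bootstrap config present"
#   else
#     echo "bootstrap marker missing"
#   fi
```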

Check 3: are OTLP pushes reaching the server?

If the collector logs OTLP activity but throughput still doesn’t show, something between the collector and LinkMesh is dropping the pushes. Check the LinkMesh server-side logs:

Terminal window
# On the LinkMesh server (k8s)
kubectl -n linkmesh logs deploy/linkmesh-server -f | grep -iE 'otlp\.(dispatch|auth)'
# On a single-host install
journalctl -u linkmesh-server -f | grep -iE 'otlp\.(dispatch|auth)'

Three failure modes to look for:

  • otlp.auth: rejected reason=missing_or_malformed_bearer — the collector is reaching the OTLP receiver but not sending the Authorization header. The bootstrap config or the OpAMP own_metrics offer didn’t include the bearer. See Check 1.

  • otlp.auth: rejected reason=token_unknown_or_revoked_or_expired — the bearer the collector is sending no longer verifies. Most common cause: you deregistered the collector and it later came back online with its old token. Re-enroll to mint a fresh token.

  • otlp.dispatch: unknown collectorId — the bearer verifies but the CollectorID it resolves to has been deregistered from the fleet. Re-enroll to recreate the Collector record.
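
The three failure modes above can be turned into a small lookup for triaging a captured log line. This is a sketch: suggest_fix is a hypothetical helper, and it assumes the server log lines contain the message text exactly as quoted in the list, with whatever timestamp or prefix around it.

```shell
# Sketch: map one server log line to the remedy from the list above.
# suggest_fix is a hypothetical helper; it matches on substrings only.
suggest_fix() {
  case "$1" in
    *"otlp.auth: rejected reason=missing_or_malformed_bearer"*)
      echo "bearer missing: revisit Check 1" ;;
    *"otlp.auth: rejected reason=token_unknown_or_revoked_or_expired"*)
      echo "token invalid: re-enroll to mint a fresh token" ;;
    *"otlp.dispatch: unknown collectorId"*)
      echo "collector record gone: re-enroll to recreate it" ;;
    *)
      echo "no known failure mode" ;;
  esac
}

# Usage:
#   suggest_fix "otlp.dispatch: unknown collectorId"
```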

If you see none of these but throughput is still missing, the collector probably can’t reach the LinkMesh OTLP endpoint at all (firewall, DNS, TLS). Test from the host:

Terminal window
curl -v -X POST -H 'Authorization: Bearer test' \
-H 'Content-Type: application/x-protobuf' \
https://your-server.example.com/v1/metrics
# Expected: HTTP 401 (proves the endpoint is reachable + auth is wired)
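
If you want just the status code rather than the full -v trace, curl can print it via --write-out. The sketch below interprets the result. interpret_status is a hypothetical helper; only the 401 expectation comes from this page, and the 000 case relies on curl reporting 000 when it receives no HTTP response at all.

```shell
# Sketch: interpret the HTTP status code from the reachability test.
# interpret_status is a hypothetical helper, not part of LinkMesh.
interpret_status() {
  case "$1" in
    401) echo "reachable: endpoint answered and auth is enforced" ;;
    000) echo "unreachable: no HTTP response (check firewall, DNS, TLS)" ;;
    *)   echo "unexpected status $1: inspect the curl -v output" ;;
  esac
}

# Usage (same request as above, printing only the status code):
#   code=$(curl -s -o /dev/null -w '%{http_code}' -X POST \
#     -H 'Authorization: Bearer test' \
#     -H 'Content-Type: application/x-protobuf' \
#     https://your-server.example.com/v1/metrics)
#   interpret_status "$code"
```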

Check 4: is per-edge throughput working but only on some edges?

Some edges show numbers, others don’t. This means the OTLP path works but specific components aren’t being attributed correctly. Most common cause: the component naming in the collector’s emitted metrics doesn’t match what the topology lookup expects.

Open the collector’s Events tab in the LinkMesh UI and look for component_naming_mismatch warnings. The fix depends on what the warning says:

  • “unknown component” — the metric arrived but the component name doesn’t match any source/destination activation. Usually means the collector is running a config the LinkMesh UI doesn’t know about (manual edit, or an older config still active because reload didn’t fire). Restart the collector.

  • “orphan throughput” — a row exists in ComponentThroughput but the topology lookup can’t find a matching edge. Usually means a source/destination was deleted from the UI while the collector was still pushing for it. Self-resolves within 24h via TTL.

Symptom: host metrics show but per-edge throughput doesn’t

Host metrics (CPU, memory, uptime) come from a different path than per-edge throughput — system.cpu.utilization etc. land via the same OTLP push but get mapped onto the Collector record, not ComponentThroughput. If host metrics work but edges don’t:

  • OTLP receiver auth + transport are fine (host metrics arrived).
  • The component-metric mapping is the issue.

Most common cause: the collector’s pipeline isn’t actually processing data yet. Per-edge throughput needs two consecutive samples 30s apart to compute a rate — a fresh enrollment with no data flowing will show nothing until data arrives.

Push some test data through the collector’s pipeline and wait at least 60s, long enough for two samples 30s apart to arrive.

Symptom: throughput shows but the numbers look wrong

“Records/sec spikes to millions right after collector restart”

You’re looking at the lifetime cumulative counter instead of a rate. The first sample after restart can’t compute a rate (no prior sample), so it gets skipped — but if your dashboard is reading raw ComponentThroughput records directly (not the rendered topology), the counter values are in there. Wait 30 seconds for the second sample; the rate will be sensible.
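
The arithmetic behind this is easy to reproduce if you are reading raw counter values yourself. The function below is a sketch of a generic counter-to-rate calculation, not LinkMesh’s actual implementation: rate_per_sec is a hypothetical name, the division is integer division, and treating a decreasing counter as a reset to skip is an assumption.

```shell
# Sketch: derive a per-second rate from two cumulative counter samples.
# Usage: rate_per_sec PREV CURR INTERVAL_SECONDS
# Prints the integer rate, or "skip" when no rate can be derived.
rate_per_sec() {
  prev=$1
  curr=$2
  interval=$3
  # No prior sample (first sample after a restart): nothing to subtract from.
  if [ -z "$prev" ]; then
    echo "skip"
    return 0
  fi
  # Counter went backwards: assume it was reset and skip this pair.
  if [ "$curr" -lt "$prev" ]; then
    echo "skip"
    return 0
  fi
  # Integer division; any fractional part is truncated.
  echo $(( ($curr - $prev) / $interval ))
}

# Usage:
#   rate_per_sec 1000 4000 30    # (4000 - 1000) / 30 = 100
#   rate_per_sec "" 5000000 30   # first sample: skip
```

A dashboard that plots the raw counter in the second call would show 5,000,000 where the rate is simply not yet defined, which is the spike described above.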

“Numbers are zero but I see data flowing in Grafana Cloud”

Your collector is forwarding data correctly but not pushing self-telemetry. Two possibilities:

  • Alloy’s prometheus.exporter.self is disabled or returning empty — check the bootstrap config has the prometheus.exporter.self and prometheus.scrape blocks intact.
  • otelcol-contrib’s service.telemetry.metrics is configured to send to a different endpoint (often a customer Prometheus). Either reconfigure to push to LinkMesh as well, or accept that LinkMesh won’t show per-edge numbers for this collector.
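
For the Alloy case, a quick presence check on the config file can rule out a stripped bootstrap config. This is a sketch: has_self_telemetry_blocks is a hypothetical helper, and it only confirms that both component names appear somewhere in the file, not that the blocks are wired together correctly.

```shell
# Sketch: succeed (exit 0) only if the file mentions both components that
# the bootstrap config is expected to contain. Hypothetical helper name.
has_self_telemetry_blocks() {
  grep -q 'prometheus\.exporter\.self' "$1" &&
    grep -q 'prometheus\.scrape' "$1"
}

# Usage (run as a user that can read the file):
#   has_self_telemetry_blocks /etc/alloy/config.alloy \
#     || echo "self-telemetry blocks missing from config"
```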

When all else fails

Capture the collector’s effective config + recent logs and file a support ticket — include the collector’s CollectorID, the timestamp range, and which check above you got stuck on:

Terminal window
# Alloy
sudo cat /etc/alloy/config.alloy > /tmp/effective-config.txt
sudo journalctl -u alloy --since '15 minutes ago' > /tmp/collector-logs.txt
# otelcol-contrib (if the find command from Check 2 printed a different path, use that one)
sudo cat /var/lib/otelcol-contrib/effective_config.yaml > /tmp/effective-config.txt
sudo journalctl -u otelcol-contrib --since '15 minutes ago' > /tmp/collector-logs.txt