Skip to content

Enrollment troubleshooting

If a host doesn’t show up in your fleet after pasting the one-liner, work through this list top-down. The script prints clear [FAIL] markers — match the error against the section heading.

”LINKMESH_SERVER env var required” / “LINKMESH_TOKEN env var required”

You ran the script without the env vars. Use sudo -E (preserves env) and set both:

Terminal window
curl -fsSL https://linkmesh.io/install.sh | \
LINKMESH_SERVER=https://your-server.example.com \
LINKMESH_TOKEN=eyJ... \
sudo -E sh

If you used plain sudo sh (no -E), the env vars get stripped before the script sees them.

”Enrollment failed” — token expired

Tokens default to 15 minutes. If the host took longer to install the package (slow network, big mirror), the token’s already gone.

Fix: mint a new token in the wizard and re-run. The script is idempotent — it won’t reinstall the package, just re-enrol.

”Enrollment failed” — server unreachable

The host can’t reach your LinkMesh server. Check:

Terminal window
curl -v https://your-server.example.com/api/v1/health

Common causes:

  • Outbound HTTPS blocked by host firewall
  • Internal DNS doesn’t resolve the server hostname from this host
  • Server is behind a VPN the host isn’t on

For air-gapped fleets, use the server-hosted install variant documented at linkmesh.io/install instead — agent host only needs to reach the LinkMesh server, not the public internet.

”Enrollment failed” — invalid token (410 Gone)

Token was already used. Each token enrols exactly one collector. Mint a fresh one for the next host.

If you didn’t intentionally use it twice, check the audit log — someone or something has already redeemed this token.

”linkmesh-agent restarted” but collector doesn’t appear in the UI

Wait 30 seconds. The agent’s first heartbeat to the server takes up to a heartbeat interval to land. If it’s still missing after 60s:

Terminal window
journalctl -u linkmesh-agent -f

Common failure modes the journal will reveal:

  • Clock skew — host clock is more than ~5 minutes off from server. mTLS cert validation rejects the connection. Fix with timedatectl set-ntp true.
  • gRPC port 50051 blocked — agent uses gRPC on :50051 to talk to the server. Check nc -vz your-server.example.com 50051 from the host.
  • TLS cert mismatch — the journal shows x509: certificate is valid for …, not <addr>. The server’s gRPC cert doesn’t cover the address the agent dials. See “tls: failed to verify certificate” just below for the fix.

”tls: failed to verify certificate” / “certificate is valid for … not …”

The agent’s journal repeats a line like:

Stream connect failed: rpc error: ... tls: failed to verify certificate:
x509: certificate is valid for 10.0.1.3, 127.0.0.1, ::1, not 35.230.148.135

The agent reached the server and trusts its CA, but the server’s gRPC certificate has no SAN for the address the agent connects to (35.230.148.135 above). The agent does strict mTLS and has no skip-verify option by design — the fix is always on the server, and takes ~30 seconds.

Why it happens: install auto-detects SANs from the server’s hostname and its local network-interface IPs. On a cloud VM the external IP is a 1:1 NAT that never appears on the VM’s interface, so it can’t be auto-detected — you reach the box on an address its own cert doesn’t know about.

Fix — on the server, add the address agents use as a SAN. Cert subcommands need exclusive database access, so stop the service first, and run as the linkmesh user so the service can read the new key:

Terminal window
sudo systemctl stop linkmesh-server
# a DNS name (preferred — survives IP changes):
sudo runuser -u linkmesh -- linkmesh-server cert init --san dns:linkmesh.example.com
# …or a bare IP, matching the "not <addr>" in the error:
sudo runuser -u linkmesh -- linkmesh-server cert init --san ip:35.230.148.135
sudo systemctl start linkmesh-server

cert init keeps the auto-detected local SANs and merges in the ones you pass. Confirm with linkmesh-server cert show, then watch the agent reconnect on its own backoff — no action needed on the agent host.

Throwaway lab and can’t touch the server right now? The agent has an escape hatch — certificates.insecureSkipVerify: true in /etc/linkmesh-agent/config.yaml — that connects without verifying the server cert. It disables MITM protection on the control plane and the agent will nag with a WARN on every connect, so treat it as dev-only and never ship it. The fix above is the right one for anything real; details under the agent config.yaml entry in the configuration reference.

”Two operators using the same token”

The first one wins. The second gets 410 Gone. The second operator should mint their own token.

If you’re scripting bulk enrolment via Ansible/Terraform, mint one token per host inside the loop — don’t share.

Still stuck?

  • Check the server’s audit log under Settings → Audit Log for the bootstrap attempt — server-side error messages are richer than what the script can surface to a piped curl.
  • Open the relevant journal: journalctl -u alloy -f (or -u otelcol-contrib) for the collector runtime, journalctl -u linkmesh-agent -f for the management client.