System requirements & sizing

Sizing LinkMesh means sizing two different things that scale on two different axes — and the most common mistake is conflating them:

| You’re sizing… | The node | Grows with | How you scale it |
|---|---|---|---|
| The data plane | your collectors (Grafana Alloy / otelcol) | telemetry bytes/sec & events/sec | add collectors, split busy sources across hosts |
| The control plane | the LinkMesh server | fleet size (number of collectors) | one instance, sized up (or many, for high availability) |

The rest of this page sizes each axis in turn, then covers MongoDB sizing and the network/latency considerations that matter once you go multi-instance.

Sizing collectors (the data plane)

This is where telemetry bytes/sec lives. A collector’s footprint is driven by the volume and shape of the data flowing through it — independent of LinkMesh entirely; it’s ordinary OpenTelemetry Collector / Grafana Alloy capacity planning.

What drives a collector’s CPU and memory:

  • Throughput — events/sec is usually a better predictor than raw bytes/sec; serialization and per-record processing dominate. As a rough starting point from upstream benchmarks, budget on the order of one CPU core per ~10k–20k records/sec of logs, then measure and adjust for your processors.
  • Signal type — traces and high-cardinality metrics cost more per byte than plain logs.
  • Processor chain — masking, transforms, and parsing add CPU per record; a batch processor adds memory proportional to batch size.
  • Export queueing — the sending queue buffers in memory when a destination is slow; size memory for the configured queue, not just steady state.
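
As a rough illustration of the throughput rule of thumb above, the sketch below turns a log rate into a core range using the ~10k–20k records/sec per core figure. The function name and defaults are hypothetical, and the result is only a starting point to measure against, not a guarantee for any particular processor chain.

```python
import math


def estimate_collector_cores(records_per_sec: float,
                             per_core_low: float = 10_000,
                             per_core_high: float = 20_000) -> tuple[int, int]:
    """Return (optimistic, conservative) whole-core estimates for a log workload.

    per_core_low / per_core_high are the assumed records/sec one core can
    handle, taken from the ~10k-20k rule of thumb. Adjust after measuring.
    """
    if records_per_sec < 0:
        raise ValueError("records_per_sec must be non-negative")
    # Best case: every core sustains the high end of the range.
    optimistic = max(1, math.ceil(records_per_sec / per_core_high))
    # Worst case: every core only sustains the low end of the range.
    conservative = max(1, math.ceil(records_per_sec / per_core_low))
    return optimistic, conservative


# Example: a 45k records/sec log stream needs somewhere between 3 and 5 cores.
print(estimate_collector_cores(45_000))  # (3, 5)
```

Traces, high-cardinality metrics, and heavy processor chains would push the real number toward or beyond the conservative end.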

Scaling collectors out

Collectors are independent of one another — there’s no coordination to add a host, which makes the data plane scale horizontally and linearly:

  • One collector per host (the DaemonSet pattern) for host and workload telemetry — each handles only its own node’s volume, so the fleet scales with your infrastructure automatically.
  • Dedicated aggregation collectors for gateway-style ingestion (syslog, OTLP fan-in). When one aggregator saturates a host, run more of them behind a load balancer and split busy sources across them rather than growing a single box without limit.
  • Vertical first, then horizontal — give a collector more cores until per-host limits or blast-radius concerns push you to add instances.

Because each collector holds its own last-applied config and exports directly to your backends, adding collectors adds data-plane capacity without adding load to the LinkMesh server beyond one more entry in the fleet.

Sizing the LinkMesh server (the control plane)

The server’s job is fleet management: serving the UI/API, delivering config to collectors, and ingesting their self-telemetry. Its load scales with the number of collectors, the rate of config changes, and the number of operators using it — not with your telemetry volume.

What drives server load

  • Self-telemetry storage is the dominant driver. Each collector pushes its internal metrics roughly every 30 seconds; the server stores them per component with a 24-hour retention window. The volume is fleet × components-per-collector × samples-per-day (~2,880 samples/day per component at the 30s interval). This is the bulk of the database write load and working set — see MongoDB sizing below.
  • Control connections — one management channel per managed collector. CPU and memory scale gently with fleet size and with how often config changes are pushed.
  • The Git-backed config tree grows with the number of sources, destinations, routes, and collectors — it drives the size of the server’s data volume, not its CPU.
  • Operators — concurrent UI/API users. Minor unless you have a large team hammering the API.
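
The self-telemetry volume formula above (fleet × components-per-collector × samples-per-day) can be sketched as a small calculation. This assumes one stored record per component per sample; the function names and the example component count are illustrative, so substitute the real average for your fleet.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(interval_s: int = 30) -> int:
    """Samples one component produces per day at the given push interval."""
    return SECONDS_PER_DAY // interval_s


def retained_samples(collectors: int,
                     components_per_collector: float,
                     interval_s: int = 30,
                     retention_days: float = 1.0) -> int:
    """Approximate number of self-telemetry samples held within the retention window.

    Assumes one stored record per component per sample (an assumption,
    not a statement about the server's storage layout).
    """
    total = collectors * components_per_collector * samples_per_day(interval_s)
    return round(total * retention_days)


print(samples_per_day())          # 2880
print(retained_samples(1000, 4))  # 11520000
```

With a 24-hour retention window the retained count equals one day of samples, which is why the component count per collector matters as much as the fleet size.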

Baseline and indicative sizing

A single server instance is lightweight. The published chart defaults are a sensible floor for a small fleet:

| Profile | CPU request | Memory request | Memory limit | Data volume |
|---|---|---|---|---|
| Baseline (small fleet) | 20m | 192Mi | 640Mi | 1Gi |

Scale up from there with fleet size. The numbers below are indicative starting points, not measured guarantees — the right move is to start near them, watch the server’s own self-telemetry, and adjust:

| Fleet | CPU request | Memory | Server data volume | Notes |
|---|---|---|---|---|
| ≤ 50 collectors | 20–50m | 256Mi | 1Gi | baseline is plenty |
| ~250 | 100–250m | 512Mi–1Gi | 2Gi | single instance comfortable |
| ~1,000 | 250m–1 core | 1–2Gi | a few GB | the throughput store starts to dominate |
| ~5,000 | 1–2 cores | 2–4Gi | several GB | plan capacity from your real component count |
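
If you want to pick a starting point programmatically, for example when templating deployment values, a lookup like the sketch below could encode the indicative table. Treating each row as an upper bound and rounding a fleet up to the next row is an assumption made here for simplicity, and the names are hypothetical.

```python
from bisect import bisect_left

# (fleet size, CPU request, memory, server data volume), copied from the
# indicative sizing table. These are starting points, not measured guarantees.
TIERS = [
    (50, "20–50m", "256Mi", "1Gi"),
    (250, "100–250m", "512Mi–1Gi", "2Gi"),
    (1000, "250m–1 core", "1–2Gi", "a few GB"),
    (5000, "1–2 cores", "2–4Gi", "several GB"),
]


def starting_point(collectors: int) -> dict:
    """Return the first table row whose fleet size is >= the given fleet."""
    if collectors < 0:
        raise ValueError("collectors must be non-negative")
    idx = bisect_left([tier[0] for tier in TIERS], collectors)
    if idx == len(TIERS):
        raise ValueError("fleet is larger than the table covers; plan from measurements")
    fleet, cpu, memory, volume = TIERS[idx]
    return {"fleet": fleet, "cpu_request": cpu, "memory": memory, "data_volume": volume}


print(starting_point(300))
```

A fleet of 300 collectors lands on the ~1,000 row here, which errs on the generous side; the server's own self-telemetry remains the source of truth for adjusting afterwards.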

Deployment shapes & scaling out the server

Single instance (BoltDB) — the default

One server process with an embedded BoltDB file on a local volume (default 1Gi) and a local Git working tree. Zero external dependencies. This is a permanent, fully-supported mode — ideal for a single host, a trial, or any deployment that doesn’t need the control plane to survive a node failure.

You scale a single instance vertically — give it more CPU and memory as the fleet grows, per the table above.

High availability (MongoDB) — preview

HA replaces “one big instance” with several identical stateless instances sharing one MongoDB replica set and one Git remote. You scale the control plane horizontally:

  • Start at three instances. Three is the usual minimum — it tolerates losing one while keeping a quorum of the MongoDB replica set reachable.
  • Autoscale on CPU. A reference target is 3→6 instances at ~70% CPU, with a pod-disruption budget keeping a minimum available during rollouts.
  • Size each instance per the single-instance table for your fleet — HA is N × that instance for availability, not a way to make each instance smaller.
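
To see how the 3→6 reference target behaves, the sketch below applies the proportional rule used by the Kubernetes Horizontal Pod Autoscaler, assuming that is the autoscaler in use (this page does not say which one is). It omits the real autoscaler's tolerance band and stabilization windows, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int,
                     current_cpu_pct: float,
                     target_cpu_pct: float = 70.0,
                     min_replicas: int = 3,
                     max_replicas: int = 6) -> int:
    """Proportional scaling: ceil(current * observed / target), clamped to bounds."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(3, 95))   # 5: three busy instances scale out
print(desired_replicas(3, 40))   # 3: never drops below the HA minimum
print(desired_replicas(6, 100))  # 6: capped at the maximum
```

If the fleet regularly pins the instance count at the maximum, that is a signal to size each instance up rather than to rely on more replicas.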

Telemetry delivery is unaffected by server count: collectors keep exporting to your backends even if every server instance is down. HA shortens the window where you can’t change config or see fleet status — it doesn’t protect data flow, because data flow never touches the server.

External MongoDB sizing

MongoDB is required for HA and is the backend for larger production deployments. The data set is modest and bounded by TTLs. Indicative volumes for a fleet of ~1,000 collectors:

| Collection | Approx. volume | Comment |
|---|---|---|
| componentthroughputs | ~10M docs (24h TTL) | dominant write load: self-telemetry every ~30s per component |
| collectorevents | ~100k docs (30d TTL) | bursty lifecycle events |
| auditlogs | ~10k docs (365d TTL, configurable) | low write rate |
| collectors | ~1k docs | one per collector |
| otlptokens | ~1k docs | one per collector; tiny, security-critical |

The componentthroughputs collection is the workhorse. Its 24-hour TTL bounds it at about one day of data. For a fleet 10× larger (~10,000 collectors), plan for roughly 100M documents there, with a peak working set of a few GB.

  • Replica set required — the server uses transactions for a few multi-document writes, which MongoDB only offers on a replica set. A standalone is dev-only. Three members is the production minimum.
  • MongoDB 6.0+ recommended (older versions lack some aggregation operators the throughput store uses).
  • WiredTiger cache defaults to 50% of (host RAM − 1 GB), with a 256 MB floor, which is about right for a dedicated Mongo host. For a co-located deploy, cap it explicitly with --wiredTigerCacheSizeGB so it doesn’t contend with the server.
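
For the co-located case, a quick calculation helps choose a value for --wiredTigerCacheSizeGB. The first function below follows MongoDB's documented default (the larger of 50% of RAM minus 1 GB, or 256 MB). The second is only a hypothetical heuristic: the headroom and fraction are assumptions to tune, not a recommendation from MongoDB or LinkMesh.

```python
def default_wiredtiger_cache_gb(host_ram_gb: float) -> float:
    """MongoDB's default cache size: max(50% of (RAM - 1 GB), 0.25 GB)."""
    return max(0.5 * (host_ram_gb - 1.0), 0.25)


def colocated_cache_cap_gb(host_ram_gb: float,
                           server_memory_limit_gb: float,
                           os_headroom_gb: float = 1.0,
                           fraction: float = 0.5) -> float:
    """Hypothetical cap: a fraction of the RAM left after the server and the OS.

    os_headroom_gb and fraction are illustrative assumptions.
    """
    remaining = host_ram_gb - server_memory_limit_gb - os_headroom_gb
    return max(fraction * remaining, 0.25)


# On an 8 GB host shared with a server limited to 2 GB:
print(default_wiredtiger_cache_gb(8))  # 3.5
print(colocated_cache_cap_gb(8, 2))    # 2.5
```

In this example the default would claim 3.5 GB, while the cap leaves room for the server's own memory limit on the same host.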

Full setup is in Deploy with MongoDB.

Network & internal latency

Ports

| Port | Purpose |
|---|---|
| 8080 | HTTP: web UI + REST API |
| 50051 | gRPC: collector control channel |
| 7946 | instance-to-instance coordination (HA only) |

Persistent volumes for a single instance use ReadWriteOnce. Front the HTTP and gRPC ports with TLS — see Run behind a reverse proxy.

Where latency matters

  • Server ↔ MongoDB (latency-sensitive). In HA, every read-modify-write and every transaction is a round-trip to the replica set. Co-locate the server instances and MongoDB in one region (ideally one low-latency zone group). A cross-region database adds that latency to every write path and will dominate request times.
  • Server ↔ server (HA, low-latency LAN). Instances coordinate over the cluster port; keep them on the same low-latency network, as you would any clustered service.
  • Collector ↔ server (not latency-sensitive). A collector’s distance from the server only affects how fresh its status and throughput numbers look in the UI — not telemetry delivery, which goes straight to your backends. Collectors can be spread across regions and clouds freely; only their self-telemetry and config polls cross that link, and both tolerate latency comfortably.