System requirements & sizing
Sizing LinkMesh means sizing two different things that scale on two different axes — and the most common mistake is conflating them:
| You’re sizing… | The node | Grows with | How you scale it |
|---|---|---|---|
| The data plane | your collectors (Grafana Alloy / otelcol) | telemetry bytes/sec & events/sec | add collectors, split busy sources across hosts |
| The control plane | the LinkMesh server | fleet size (number of collectors) | one instance, sized up — or many, for high availability |
The rest of this page sizes each axis in turn, then covers MongoDB sizing and the network/latency considerations that matter once you go multi-instance.
Sizing collectors (the data plane)
This is where telemetry bytes/sec lives. A collector’s footprint is driven by the volume and shape of the data flowing through it — independent of LinkMesh entirely; it’s ordinary OpenTelemetry Collector / Grafana Alloy capacity planning.
What drives a collector’s CPU and memory:
- Throughput — events/sec is usually a better predictor than raw bytes/sec; serialization and per-record processing dominate. As a rough starting point from upstream benchmarks, budget on the order of one CPU core per ~10k–20k records/sec of logs, then measure and adjust for your processors.
- Signal type — traces and high-cardinality metrics cost more per byte than plain logs.
- Processor chain — masking, transforms, and parsing add CPU per record; a batch processor adds memory proportional to batch size.
- Export queueing — the sending queue buffers in memory when a destination is slow; size memory for the configured queue, not just steady state.
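
The throughput rule of thumb above can be turned into a quick first-pass estimate. The sketch below is only that: it uses the ~10k–20k records/sec per core range quoted for plain logs, and the `headroom` multiplier is a hypothetical safety factor, not a LinkMesh or OpenTelemetry setting. Treat the result as a starting range to measure against, not a guarantee.

```python
import math

# Rough range quoted above for plain logs: ~10k-20k records/sec per CPU core.
RECORDS_PER_CORE_LOW = 10_000   # pessimistic end of the range
RECORDS_PER_CORE_HIGH = 20_000  # optimistic end of the range


def estimate_cores(records_per_sec: float, headroom: float = 1.5) -> tuple[int, int]:
    """Return (optimistic, pessimistic) whole-core estimates for a collector tier.

    `headroom` is a hypothetical multiplier for bursts and processor overhead.
    """
    if records_per_sec < 0 or headroom < 1:
        raise ValueError("records_per_sec must be >= 0 and headroom >= 1")
    padded = records_per_sec * headroom
    optimistic = max(1, math.ceil(padded / RECORDS_PER_CORE_HIGH))
    pessimistic = max(1, math.ceil(padded / RECORDS_PER_CORE_LOW))
    return optimistic, pessimistic


if __name__ == "__main__":
    low, high = estimate_cores(60_000)
    print(f"60k records/sec: plan for roughly {low}-{high} cores, then measure")
```

Traces, high-cardinality metrics, and heavy processor chains will land nearer the pessimistic figure or beyond it, so widen `headroom` for those workloads.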
Scaling collectors out
Collectors are independent of one another — there’s no coordination to add a host, which makes the data plane scale horizontally and linearly:
- One collector per host (the DaemonSet pattern) for host and workload telemetry — each handles only its own node’s volume, so the fleet scales with your infrastructure automatically.
- Dedicated aggregation collectors for gateway-style ingestion (syslog, OTLP fan-in). When one aggregator saturates a host, run more of them behind a load balancer and split busy sources across them rather than growing a single box without limit.
- Vertical first, then horizontal — give a collector more cores until per-host limits or blast-radius concerns push you to add instances.
Because each collector holds its own last-applied config and exports directly to your backends, adding collectors adds data-plane capacity without adding load to the LinkMesh server beyond one more entry in the fleet.
Sizing the LinkMesh server (the control plane)
The server’s job is fleet management: serving the UI/API, delivering config to collectors, and ingesting their self-telemetry. Its load scales with the number of collectors, the rate of config changes, and the number of operators using it — not with your telemetry volume.
What drives server load
- Self-telemetry storage is the dominant driver. Each collector pushes its internal metrics roughly every 30 seconds; the server stores them per component with a 24-hour retention window. The volume is `fleet × components-per-collector × samples-per-day` (~2,880 samples/day per component at the 30s interval). This is the bulk of the database write load and working set; see MongoDB sizing below.
- Control connections: one management channel per managed collector. CPU and memory scale gently with fleet size and with how often config changes are pushed.
- The Git-backed config tree grows with the number of sources, destinations, routes, and collectors — it drives the size of the server’s data volume, not its CPU.
- Operators — concurrent UI/API users. Minor unless you have a large team hammering the API.
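
The self-telemetry volume described above is simple arithmetic, and it helps to run it for your own fleet. The following is a minimal sketch: the function name and the figure of four components per collector are illustrative assumptions, and it treats one sample as one stored record, which the text implies but does not state outright.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def throughput_samples(fleet: int, components_per_collector: float,
                       interval_s: int = 30, retention_h: int = 24) -> dict:
    """Estimate self-telemetry volume for a fleet.

    Assumes every component of every collector reports once per interval
    and that samples are dropped after the retention window.
    """
    per_component_per_day = SECONDS_PER_DAY // interval_s
    retained_per_component = (retention_h * 3600) // interval_s
    components = fleet * components_per_collector
    return {
        "samples_per_component_per_day": per_component_per_day,
        "retained_samples": components * retained_per_component,
        "writes_per_sec": components / interval_s,
    }


if __name__ == "__main__":
    # 1,000 collectors with a hypothetical 4 components each.
    estimate = throughput_samples(fleet=1_000, components_per_collector=4)
    for name, value in estimate.items():
        print(f"{name}: {value:,.0f}")
```

Because the result is linear in both fleet size and component count, doubling either doubles the retained sample count and the write rate.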
Baseline and indicative sizing
A single server instance is lightweight. The published chart defaults are a sensible floor for a small fleet:
| | CPU request | Memory request | Memory limit | Data volume |
|---|---|---|---|---|
| Baseline (small fleet) | 20m | 192Mi | 640Mi | 1Gi |
Scale up from there with fleet size. The numbers below are indicative starting points, not measured guarantees — the right move is to start near them, watch the server’s own self-telemetry, and adjust:
| Fleet | CPU request | Memory | Server data volume | Notes |
|---|---|---|---|---|
| ≤ 50 collectors | 20–50m | 256Mi | 1Gi | baseline is plenty |
| ~250 | 100–250m | 512Mi–1Gi | 2Gi | single instance comfortable |
| ~1,000 | 250m–1 core | 1–2Gi | a few GB | the throughput store starts to dominate |
| ~5,000 | 1–2 cores | 2–4Gi | several GB | plan capacity from your real component count |
Deployment shapes & scaling out the server
Single instance (BoltDB) — the default
One server process with an embedded BoltDB file on a local volume (default 1Gi) and a local Git working tree. Zero external dependencies. This is a permanent, fully-supported mode — ideal for a single host, a trial, or any deployment that doesn’t need the control plane to survive a node failure.
You scale a single instance vertically — give it more CPU and memory as the fleet grows, per the table above.
High availability (MongoDB) — preview
HA replaces “one big instance” with several identical stateless instances sharing one MongoDB replica set and one Git remote. You scale the control plane horizontally:
- Start at three instances. Three is the usual minimum — it tolerates losing one while keeping a quorum of the MongoDB replica set reachable.
- Autoscale on CPU. A reference target is 3→6 instances at ~70% CPU, with a pod-disruption budget keeping a minimum available during rollouts.
- Size each instance per the single-instance table for your fleet. HA is `N ×` that instance for availability, not a way to make each instance smaller.
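
To see how the 3→6 range at ~70% CPU plays out, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(current × utilization / target)`, and clamps it to the reference bounds. It assumes the autoscaler in use is the standard HPA, and it ignores the HPA's tolerance band and stabilization window for simplicity. The function name is illustrative.

```python
import math


def desired_replicas(current: int, cpu_utilization: float,
                     target: float = 0.70,
                     min_replicas: int = 3, max_replicas: int = 6) -> int:
    """Approximate the HPA decision for the reference 3->6 range.

    `cpu_utilization` is average CPU usage as a fraction of the CPU request
    (0.95 means 95%). Tolerance and stabilization are deliberately ignored.
    """
    if current < 1 or cpu_utilization < 0 or target <= 0:
        raise ValueError("invalid replica count or utilization")
    raw = math.ceil(current * cpu_utilization / target)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    for util in (0.30, 0.95, 1.40):
        print(f"3 replicas at {util:.0%} CPU -> {desired_replicas(3, util)}")
```

The lower clamp is what preserves availability: even an idle control plane stays at three instances, so scaling down never removes the redundancy HA exists to provide.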
Telemetry delivery is unaffected by server count: collectors keep exporting to your backends even if every server instance is down. HA shortens the window where you can’t change config or see fleet status — it doesn’t protect data flow, because data flow never touches the server.
External MongoDB sizing
MongoDB is required for HA and is the backend for larger production deployments. The data set is modest and bounded by TTLs. Indicative volumes for a fleet of ~1,000 collectors:
| Collection | Approx. volume | Comment |
|---|---|---|
| `componentthroughputs` | ~10M docs (24h TTL) | dominant write load: self-telemetry every ~30s per component |
| `collectorevents` | ~100k docs (30d TTL) | bursty lifecycle events |
| `auditlogs` | ~10k docs (365d TTL, configurable) | low write rate |
| `collectors` | ~1k docs | one per collector |
| `otlptokens` | ~1k docs | one per collector; tiny, security-critical |
The `componentthroughputs` collection is the workhorse. Its 24-hour TTL keeps it bounded at about a day of data. If you are sizing for 10× the fleet (~10,000 collectors), plan for roughly 100M documents there, with a peak working set of a few GB.
- Replica set required — the server uses transactions for a few multi-document writes, which MongoDB only offers on a replica set. A standalone is dev-only. Three members is the production minimum.
- MongoDB 6.0+ recommended (older versions lack some aggregation operators the throughput store uses).
- WiredTiger cache defaults to roughly 50% of host RAM, which is about right for a dedicated Mongo host. For a co-located deploy, cap it explicitly with `--wiredTigerCacheSizeGB` so it doesn’t contend with the server.
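
For a concrete number to pass to `--wiredTigerCacheSizeGB`, it helps to know what the default would be. MongoDB documents the default as the larger of 50% of (RAM − 1 GB) and 256 MB, which the first function below reproduces. The second function is a hypothetical heuristic, not a MongoDB or LinkMesh recommendation: it applies the same rule to whatever RAM is left after reserving memory for the co-located server and other processes.

```python
MIN_CACHE_GB = 0.25  # 256 MB floor used by MongoDB's default


def wiredtiger_default_cache_gb(ram_gb: float) -> float:
    """Default WiredTiger cache size: max(50% of (RAM - 1 GB), 256 MB)."""
    return max(0.5 * (ram_gb - 1.0), MIN_CACHE_GB)


def colocated_cache_gb(host_ram_gb: float, reserved_gb: float) -> float:
    """Hypothetical cap for a shared host.

    Applies the default rule to the RAM left over once `reserved_gb`
    has been set aside for the LinkMesh server and anything else on the box.
    """
    if reserved_gb < 0 or reserved_gb > host_ram_gb:
        raise ValueError("reserved_gb must be between 0 and host_ram_gb")
    return wiredtiger_default_cache_gb(host_ram_gb - reserved_gb)


if __name__ == "__main__":
    print(f"dedicated 16 GB host: {wiredtiger_default_cache_gb(16):.2f} GB")
    print(f"16 GB host, 4 GB reserved: {colocated_cache_gb(16, 4):.2f} GB")
```

On a 16 GB host with 4 GB reserved, this suggests a cap of about 5.5 GB rather than the 7.5 GB default. Validate any cap against observed cache eviction and page-fault behaviour before settling on it.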
Full setup is in Deploy with MongoDB.
Network & internal latency
Ports
| Port | Purpose |
|---|---|
| 8080 | HTTP: web UI + REST API |
| 50051 | gRPC: collector control channel |
| 7946 | instance-to-instance coordination (HA only) |
Persistent volumes for a single instance use ReadWriteOnce. Front the HTTP and gRPC ports with TLS; see Run behind a reverse proxy.
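
Before debugging anything at a higher level, it is often worth confirming that the ports above are reachable from where the collectors or peer instances run. The sketch below is a plain TCP connect check using the Python standard library. It assumes all three ports accept TCP connections (the table does not say which transport the coordination port uses), and the hostname in the usage note is a placeholder.

```python
import socket

# Ports taken from the table above; descriptions are for display only.
LINKMESH_PORTS = {
    8080: "HTTP (web UI + REST API)",
    50051: "gRPC (collector control channel)",
    7946: "instance-to-instance coordination (HA only)",
}


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_ports(host: str, ports=LINKMESH_PORTS, timeout: float = 2.0) -> dict:
    """Map each port number to whether it accepted a TCP connection."""
    return {port: port_open(host, port, timeout) for port in ports}
```

Calling `check_ports("linkmesh.example.internal")` returns a dictionary such as `{8080: True, 50051: True, 7946: False}`. A successful connect only proves the port is open; it says nothing about TLS configuration or whether the service behind it is healthy.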
Where latency matters
- Server ↔ MongoDB (latency-sensitive). In HA, every read-modify-write and every transaction is a round-trip to the replica set. Co-locate the server instances and MongoDB in one region (ideally one low-latency zone group). A cross-region database adds that latency to every write path and will dominate request times.
- Server ↔ server (HA, low-latency LAN). Instances coordinate over the cluster port; keep them on the same low-latency network, as you would any clustered service.
- Collector ↔ server (not latency-sensitive). A collector’s distance from the server only affects how fresh its status and throughput numbers look in the UI — not telemetry delivery, which goes straight to your backends. Collectors can be spread across regions and clouds freely; only their self-telemetry and config polls cross that link, and both tolerate latency comfortably.
Related
- Storage backends — BoltDB vs MongoDB, and why the choice is about high availability, not fleet size
- Deploy with MongoDB — replica-set setup, backups, and configuration
- Run a highly-available deployment — the multi-instance control-plane topology
- Self-telemetry — what each collector reports back, and the 30s / 24h cadence behind the sizing numbers
- Configuration reference: every `storage.*` and `database.*` field