QMX Architecture Spec | QuantumindSSI

§1 · THE PROBLEM

MoE is the dominant architecture. The interconnect is a closed monopoly.

Most frontier LLMs shipped in 2025–2026 use a Mixture of Experts (MoE) architecture, including DeepSeek V3/R1 (256 routed experts), Llama 4, Grok, and Mistral Large. The reason is efficiency: only a fraction of the model weights activate per token, so a model with roughly 400B total parameters can run at close to the compute cost of a 40B dense model.

The problem is that expert routing creates massive communication overhead. For DeepSeek V3 with 256 experts across 64 GPUs, expert weight loading consumes up to 98.9% of decode time. NVIDIA built NVLink 6 specifically to solve this — a proprietary, closed interconnect that only works within NVIDIA hardware.

When you want to run MoE on heterogeneous hardware — mixing Apple Silicon, AMD, FPGA modules, and purpose-built ASIC inference chips — there is no open standard for the interconnect or the routing layer. You are forced to either choose a single vendor's full stack or build bespoke glue code every time.

What exists today and why it falls short

Academic work (Mozart, A3D-MoE, NASiC) proposes chiplet architectures for MoE at wafer scale. It is research-stage, not field-deployable.
MoE Sovereign (April 2026) is a software-only, Docker-based multi-model router. It is useful, but it is purely a software abstraction layer with no hardware specification and no physical module standard, and it is explicitly cloud-flexible rather than sovereign-first.
taitashaw/moe-router-engine is an open-source FPGA MoE token dispatcher. It addresses intra-machine routing overhead but targets homogeneous GPU clusters only.

Nobody has defined an open physical module specification that lets any ASIC inference chip — from any fab — plug into a common interconnect fabric, register its capabilities, and receive routed inference work from a sovereign on-premise orchestrator.

The gap QMX fills

QMX is the missing layer. It specifies:

A physical module interface — the connector, power envelope, and signal standard that any ASIC or FPGA inference module must implement to join a QMX fabric
A capability manifest protocol — how a module advertises its domain strengths, throughput, quantisation support, and current load to the fabric controller
A routing protocol — how inference requests are dispatched to the right expert module based on semantic domain, latency budget, and queue depth
A fabric controller spec — the software interface the Victron platform exposes to manage the full module registry, health monitoring, and upgrade lifecycle

QMX is designed to be silicon-agnostic. A Mac Mini's Neural Engine, an AMD Radeon inference module, an FPGA acceleration card, and a purpose-built ASIC chip can all coexist on the same QMX fabric — contributing their respective strengths to a single coherent intelligence stack.

The target deployment is not a hyperscaler data centre. It is a desk, a server room, or a small rack — owned by a person, a clinic, a law firm, or a government department.

§2 · ARCHITECTURE OVERVIEW

┌──────────────────────────────────────────────────────────────────────────────────┐
│  QMX FABRIC: QUANTUMIND MODULAR EXPERT INTERCONNECT                              │
├──────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│  USER / APPLICATION LAYER                                                        │
│  Chat · API · Agentic Workflows · Document Intelligence · RAG pipelines          │
│                                    │                                             │
│                                    ▼                                             │
│  VICTRON FABRIC CONTROLLER (runs on host: Mac Mini / AMD Node / NVIDIA)          │
│  ┌─────────────────────────────────────────────────────────────────────────┐     │
│  │  QMX ROUTER           MODULE REGISTRY         HEALTH MONITOR            │     │
│  │  Semantic dispatch    Capability manifest     Queue depth · Latency     │     │
│  │  Domain tagging       Domain · TOPS · VRAM    Thermal · Uptime          │     │
│  │  Load balancing       Quantisation support    Upgrade lifecycle         │     │
│  └──────────────┬──────────────────┬──────────────────────┬────────────────┘     │
│                 │                  │                      │                      │
│  QMX INTERCONNECT FABRIC (CXL 3.x / PCIe 5 / 400GbE for distributed)             │
│                 │                  │                      │                      │
│                 ▼                  ▼                      ▼                      │
│          EXPERT MODULE A    EXPERT MODULE B        EXPERT MODULE C               │
│          ┌─────────────┐    ┌─────────────┐        ┌─────────────┐               │
│          │ FPGA ACCEL  │    │ ASIC LLM    │        │ Apple NE /  │               │
│          │ Attention + │    │ Inference   │        │ AMD Radeon  │               │
│          │ Tokeniser   │    │ Chip        │        │ General     │               │
│          │             │    │             │        │ Purpose     │               │
│          │ MANIFEST:   │    │ MANIFEST:   │        │ MANIFEST:   │               │
│          │ CODE, MATH  │    │ REASON,NL   │        │ GENERAL,    │               │
│          │ STRUCTURED  │    │ CREATIVE    │        │ MULTIMODAL  │               │
│          └─────────────┘    └─────────────┘        └─────────────┘               │
│                                                                                  │
│  All modules implement: QMX Module Interface v0.1                                │
│  Physical: QMX-M connector | Protocol: QMX-P | Manifest: QMX-CAP schema          │
└──────────────────────────────────────────────────────────────────────────────────┘

Four-layer model

QMX separates concerns across four layers, each independently specifiable and upgradeable:

L1 — Physical (QMX-M): The module connector standard. Defines pin layout, power rails (12V/5V/3.3V), PCIe 5.0 x4 signal lanes, and a dedicated QMX management bus (I²C variant) for out-of-band capability advertisement and health telemetry.
L2 — Interconnect (QMX-F): The fabric standard. For local multi-module setups: CXL 3.x over the PCIe 5.0 lanes with memory pooling enabled. For rack-scale or distributed sovereign deployments: 400GbE with RDMA (RoCE v2) for sub-5µs inter-node latency.
L3 — Capability Protocol (QMX-CAP): The manifest schema. Every module broadcasts a JSON capability document at boot and on state change, declaring: domain tags, active model identifiers, quantisation level, current TOPS, available VRAM, queue depth, time-to-first-token p50/p95, and thermal state.
L4 — Routing (QMX-R): The dispatch protocol. The Victron fabric controller reads all registered QMX-CAP manifests and maintains a live routing table. Incoming inference requests are tagged by the router with a semantic domain score, then dispatched to the module best matched by domain, latency budget, and current load.
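
As an illustration of how the capability and routing layers fit together, the sketch below keeps a routing table keyed by domain tag and refreshes it whenever a manifest arrives. The class name `RoutingTable` and its methods are hypothetical and not part of the specification; only the manifest field names (`module_id`, `capabilities.domain_tags`) are taken from the QMX-CAP example in §4.

```python
from collections import defaultdict


class RoutingTable:
    """Hypothetical in-memory routing table built from QMX-CAP manifests."""

    def __init__(self):
        self._manifests = {}               # module_id -> latest manifest
        self._by_tag = defaultdict(dict)   # tag -> {module_id: declared weight}

    def upsert(self, manifest):
        """Register a module, or replace its previous manifest on state change."""
        module_id = manifest["module_id"]
        self.remove(module_id)             # drop stale tag entries first
        self._manifests[module_id] = manifest
        for entry in manifest["capabilities"]["domain_tags"]:
            self._by_tag[entry["tag"]][module_id] = entry["weight"]

    def remove(self, module_id):
        """Drop a module, for example after a failed health check."""
        if self._manifests.pop(module_id, None) is None:
            return
        for modules in self._by_tag.values():
            modules.pop(module_id, None)

    def candidates(self, tag):
        """Module ids that declare `tag`, strongest declared weight first."""
        modules = self._by_tag.get(tag, {})
        return sorted(modules, key=modules.get, reverse=True)
```

Re-registering a module with a changed manifest replaces its old tag entries, which is one simple way to keep the table "live" without tracking individual field updates.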

§3 · INTERCONNECT — CXL, PCIe, AND ETHERNET

PRIMARY · LOCAL FABRIC

CXL 3.x

Compute Express Link

The primary interconnect for local multi-module QMX deployments. CXL 3.x provides cache-coherent memory pooling across all attached modules, meaning expert modules can share a unified memory fabric rather than copying tensors between devices. Latency: ~100ns. Bandwidth: up to 256 GB/s bidirectional over a PCIe 6.0 x16 link, the physical layer CXL 3.x is built on (roughly half that on PCIe 5.0 x16). CXL is an open industry standard (not NVIDIA-proprietary), supported by Intel, AMD, Arm, and many ASIC vendors. First CXL 3.x fabric switch silicon (Panmnesia PCIe 6.0/CXL 3.2) became available in late 2025.

FALLBACK · ENTRY TIER

PCIe 5.0

PCI Express 5.0

For early-phase QMX deployments where CXL fabric switches are unavailable. QMX-M modules connect via PCIe 5.0 x4 (about 16 GB/s per direction, 32 GB/s bidirectional, per module). The fabric controller manages memory transfers explicitly. Higher latency than CXL (~1–3µs) but sufficient for batched inference workloads where latency tolerance is ≥5ms.
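
The headline bandwidth figures in this section follow from per-lane signalling rates, and a short calculation makes the per-direction versus bidirectional distinction explicit. This is a back-of-envelope sketch: the function name is hypothetical, and PCIe 6.0 FLIT and FEC overhead is ignored, so those figures are raw link rates. Note that 256 GB/s bidirectional corresponds to a PCIe 6.0 x16 link.

```python
# Per-lane signalling rates in GT/s for each PCIe generation.
GT_PER_S = {"5.0": 32, "6.0": 64}
# PCIe 5.0 uses 128b/130b line coding. PCIe 6.0 uses PAM4 with FLIT framing,
# treated here as 1:1 (assumption: FLIT/FEC overhead is ignored).
ENCODING = {"5.0": 128 / 130, "6.0": 1.0}


def link_gb_per_s(gen, lanes, bidirectional=False):
    """Approximate link bandwidth in GB/s before protocol overhead."""
    per_lane = GT_PER_S[gen] * ENCODING[gen] / 8   # bits to bytes
    total = per_lane * lanes
    return total * 2 if bidirectional else total


for gen, lanes in [("5.0", 4), ("5.0", 16), ("6.0", 16)]:
    one_way = link_gb_per_s(gen, lanes)
    both = link_gb_per_s(gen, lanes, bidirectional=True)
    print(f"PCIe {gen} x{lanes}: {one_way:.1f} GB/s per direction, {both:.1f} GB/s both ways")
```

Sustained throughput on real hardware will be lower once packet headers and flow control are accounted for.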

DISTRIBUTED · RACK SCALE

400GbE + RoCE

Ethernet RDMA

For sovereign deployments spanning multiple physical nodes — different rooms, server racks, or buildings. RDMA over Converged Ethernet v2 (RoCE v2) brings inter-node latency to 1–5µs with kernel bypass. Each node runs its own local QMX fabric; the QMX-R router handles cross-node dispatch as a federated fabric. No NVSwitch, no InfiniBand licence, no vendor lock-in.

FUTURE · UALink

UALink 1.0

Ultra Accelerator Link

UALink 1.0 is on the QMX roadmap as the high-scale fabric option. The standard targets scaling to 1,024 devices and is backed by AMD, Intel, and Astera Labs, with hardware expected in late 2026. QMX-F is designed to be transport-agnostic at the specification level; UALink support will be a QMX-F v1.1 extension.

§4 · THE ROUTING LAYER — QMX-R

Semantic routing, not load balancing

The QMX router is not a simple round-robin or least-loaded dispatcher. It performs semantic domain scoring on every incoming inference request and matches it to the module whose capability manifest best fits the task type.

Domain tags in the QMX-CAP schema are drawn from a controlled vocabulary: GENERAL, CODE, MATH, REASONING, MEDICAL, LEGAL, STRUCTURED_DATA, CREATIVE, MULTILINGUAL, VISION, AUDIO. A module may declare multiple tags with confidence weights.
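
A controlled vocabulary is straightforward to enforce when a manifest is received. The sketch below is one possible validator, not a normative one: the names `DomainTag` and `parse_domain_tags` are hypothetical, and the requirement that weights lie in [0, 1] is an assumption inferred from the example values in the manifest.

```python
from enum import Enum


class DomainTag(str, Enum):
    """The controlled vocabulary listed above."""
    GENERAL = "GENERAL"
    CODE = "CODE"
    MATH = "MATH"
    REASONING = "REASONING"
    MEDICAL = "MEDICAL"
    LEGAL = "LEGAL"
    STRUCTURED_DATA = "STRUCTURED_DATA"
    CREATIVE = "CREATIVE"
    MULTILINGUAL = "MULTILINGUAL"
    VISION = "VISION"
    AUDIO = "AUDIO"


def parse_domain_tags(entries):
    """Validate a manifest's `domain_tags` list and return {DomainTag: weight}."""
    parsed = {}
    for entry in entries:
        tag = DomainTag(entry["tag"])      # raises ValueError for unknown tags
        weight = float(entry["weight"])
        if not 0.0 <= weight <= 1.0:       # assumed range, not stated in the spec
            raise ValueError(f"weight for {tag.value} must be in [0, 1], got {weight}")
        if tag in parsed:
            raise ValueError(f"duplicate tag {tag.value}")
        parsed[tag] = weight
    return parsed
```

Rejecting unknown tags at registration keeps the routing table free of entries the domain classifier can never produce.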

Routing decision pipeline

Each request passes through four stages before dispatch:

Domain classification: a lightweight local classifier (≤100M params, always-hot) tags the request with domain scores in <10ms
Candidate selection: modules with matching domain tags are ranked by an availability score combining queue depth, latency p50, and thermal headroom (shallower queues, lower latency, and more headroom rank higher)
Latency budget check: if the request carries a latency SLA, only modules capable of meeting it are considered
Dispatch + streaming: tokens stream back through the fabric controller to the caller; if the winning module saturates mid-stream, the router can hand off continuation to a secondary module
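
The candidate selection and latency budget stages can be sketched as follows. The scoring formula is an assumption made for illustration (the specification names the inputs but not how they combine), as is the use of `ttft_p95_ms` for the SLA check. The function names are hypothetical; the field names follow the QMX-CAP manifest.

```python
def domain_fit(request_scores, manifest):
    """Dot product of request domain scores and the module's declared tag weights."""
    weights = {t["tag"]: t["weight"] for t in manifest["capabilities"]["domain_tags"]}
    return sum(score * weights.get(tag, 0.0) for tag, score in request_scores.items())


def availability(manifest):
    """Assumed score: more thermal headroom is better; deeper queues and slower p50 are worse."""
    t = manifest["telemetry"]
    headroom = t["thermal_headroom_pct"] / 100
    return headroom / ((1 + t["queue_depth"]) * max(t["ttft_p50_ms"], 1))


def select_module(request_scores, manifests, latency_budget_ms=None):
    """Filter by domain match and latency budget, then rank best candidate first."""
    ranked = []
    for m in manifests:
        fit = domain_fit(request_scores, m)
        if fit <= 0:
            continue   # no matching domain tag
        if latency_budget_ms is not None and m["telemetry"]["ttft_p95_ms"] > latency_budget_ms:
            continue   # module cannot meet the requested SLA
        ranked.append((fit * availability(m), m["module_id"]))
    ranked.sort(reverse=True)
    return [module_id for _, module_id in ranked]
```

Here `request_scores` stands in for the output of the domain classifier, for example `{"CODE": 0.8, "MATH": 0.2}`. The first id in the returned list is the dispatch target, and the rest are natural candidates for mid-stream handoff.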

Multi-expert synthesis

For complex queries that span domains (e.g. a medical coding question requiring both MEDICAL and STRUCTURED_DATA expertise), QMX-R supports parallel expert dispatch with result synthesis. The fabric controller sends sub-queries to two modules simultaneously, and a lightweight synthesis model (also resident on the host) merges the outputs. This applies the MoE idea at module level: experts are consulted in parallel rather than tried in sequence as fallbacks.
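
A minimal sketch of parallel dispatch using `asyncio` is shown below. Everything here is a stand-in: `query_module` simulates a module call with a short sleep, and `synthesise` simply joins the outputs where a real deployment would run the host-resident synthesis model. The module ids are hypothetical.

```python
import asyncio


async def query_module(module_id, sub_query):
    """Stand-in for a real module call; echoes the sub-query after a short delay."""
    await asyncio.sleep(0.01)
    return f"[{module_id}] {sub_query}"


def synthesise(outputs):
    """Placeholder for the synthesis model: joins the outputs in dispatch order."""
    return "\n".join(outputs)


async def multi_expert(sub_queries):
    """Send one sub-query per module concurrently, then merge the results."""
    tasks = [query_module(module_id, q) for module_id, q in sub_queries.items()]
    outputs = await asyncio.gather(*tasks)   # results keep the input order
    return synthesise(outputs)


if __name__ == "__main__":
    print(asyncio.run(multi_expert({
        "medical-asic": "Which diagnosis does this note describe?",
        "structured-fpga": "Map the diagnosis to a billing code.",
    })))
```

Because the calls run concurrently, end-to-end latency is bounded by the slower module rather than the sum of both, which is the practical difference from sequential fallback.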

Capability manifest — QMX-CAP schema

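Every QMX-compliant module broadcasts a manifest in the following form at registration and on any state change. Placeholder values (module_id, module_type, vendor) indicate the expected type or permitted options; the remaining values are examples: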

{
  "qmx_version": "0.1",
  "module_id": "qmx-m-uuid-v4",
  "module_type": "ASIC_INFERENCE | FPGA | GPU | NPU",
  "vendor": "string",
  "model_id": "llama3-70b-q4_k_m",

  "capabilities": {
    "domain_tags": [
      { "tag": "REASONING", "weight": 0.92 },
      { "tag": "CODE",      "weight": 0.78 },
      { "tag": "MATH",      "weight": 0.85 }
    ],
    "context_length": 131072,
    "quantisation": "Q4_K_M",
    "peak_tops": 38.4,
    "vram_gb": 24,
    "multimodal": false
  },

  "telemetry": {
    "queue_depth": 2,
    "ttft_p50_ms": 84,
    "ttft_p95_ms": 210,
    "thermal_headroom_pct": 68,
    "uptime_s": 86401
  },

  "security": {
    "weights_encrypted_at_rest": true,
    "egress_locked": true,
    "pq_crypto": true
  }
}

The manifest is published over the QMX management bus (out-of-band I²C) at boot and pushed to the fabric controller via lightweight UDP multicast on any field change. A pull endpoint is also available at http://[module-ip]:7070/qmx/cap.
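
Pushing "on any field change" implies the module compares its current manifest with the last one it sent. One way to do that is sketched below; the helper names are hypothetical, and the dotted-path representation is an illustrative choice rather than part of QMX-CAP.

```python
_MISSING = object()   # sentinel so a missing key differs from an explicit None


def flatten(doc, prefix=""):
    """Flatten nested dicts into {"a.b": value}; lists are kept as leaf values."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def changed_fields(previous, current):
    """Return the sorted dotted paths whose values differ between two manifests."""
    old, new = flatten(previous), flatten(current)
    return sorted(
        key for key in old.keys() | new.keys()
        if old.get(key, _MISSING) != new.get(key, _MISSING)
    )
```

A module would push only when `changed_fields` returns a non-empty list. A counter such as `uptime_s` changes every second, so an implementation would likely need to exclude or rate-limit such fields to avoid constant multicast traffic.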

§5 · SECURITY MODEL

WEIGHTS

Encrypted at Rest

All model weights stored on a QMX module must be encrypted using AES-256-GCM at minimum. The decryption key is held in the module's secure enclave and never exposed on the fabric. Post-quantum key encapsulation (ML-KEM-768 / Kyber) is required for QMX v1.0 certification.

FABRIC TRAFFIC

Encrypted in Transit

All QMX-F fabric traffic between the fabric controller and modules is encrypted. For CXL/PCIe: AEAD encryption at the QMX-P protocol layer. For Ethernet: mandatory TLS 1.3 with post-quantum hybrid key exchange (X25519 + ML-KEM-768).

MODULE IDENTITY

Hardware Attestation

Each QMX module must present a hardware-rooted identity certificate (stored in a dedicated secure element, TPM 2.0 compatible) at registration. The fabric controller verifies the certificate chain before accepting any module into the routing table. Prevents rogue module injection.

EGRESS

Zero External Calls

QMX-compliant modules must not initiate any network connection outside the local QMX fabric. The fabric controller maintains an egress policy enforced at the OS level. No telemetry, no update pings, no vendor callbacks — unless the operator explicitly provisions an outbound channel.

AUDIT

Immutable Inference Log

All routing decisions, module assignments, and inference completions are logged to an append-only audit store on the fabric controller. Log integrity is protected by a hash chain (SHA-3). Required for regulated sector deployments (GDPR, NHS DSP Toolkit, defence classification).
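
A hash chain makes tampering detectable because each entry's hash covers the previous entry's hash. The sketch below uses SHA3-256 from Python's standard library. The entry layout, the all-zero genesis value, and the function names are assumptions for illustration, not the specified log format.

```python
import hashlib
import json

GENESIS = "0" * 64   # assumed placeholder "previous hash" for the first entry


def _digest(prev_hash, event):
    """SHA3-256 over the previous hash plus a canonical JSON encoding of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha3_256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log, event):
    """Append `event` (a JSON-serialisable dict), linked to the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "event": event, "hash": _digest(prev_hash, event)})


def verify(log):
    """Recompute every link; return the index of the first bad entry, or None."""
    prev_hash = GENESIS
    for i, entry in enumerate(log):
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return i
        prev_hash = entry["hash"]
    return None
```

Editing or deleting any earlier entry changes every hash after it, so an auditor only needs the latest hash, stored somewhere the fabric controller cannot rewrite, to check the whole log.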

AIR GAP

Full Air-Gap Capable

The entire QMX fabric — fabric controller, all modules, routing layer — is designed to operate with zero external network connectivity. Updates are applied via signed offline packages. This is a hard design requirement, not an optional mode.

§6 · COMPETITIVE LANDSCAPE · WHY QMX IS NOVEL

As of June 2026, when this research was conducted, no publicly available project or product occupies the same position as QMX. The table below maps the closest existing work against the QMX specification dimensions.

Project / Product	Open spec	Physical module standard	Heterogeneous ASIC	Sovereign / on-premise first	MoE routing	Personal / SME scale
QMX (QSSI)	✓ Apache 2.0	✓ QMX-M connector	✓ By design	✓ Core requirement	✓ QMX-R semantic router	✓ Mac Mini upward
MoE Sovereign (Apr 2026)	✓ Apache 2.0	✗ Software only	~ GPU clusters	~ Optional, cloud-flexible	✓ Software router	~ Docker required
moe-router-engine (FPGA)	✓ Open source	~ FPGA only	✗ Homogeneous GPU	~ Not specified	✓ Hardware token dispatch	✗ Data centre focus
Mozart / A3D-MoE (academic)	~ Research paper	~ Custom chiplet	✓ Chiplet heterogeneous	✗ Cloud / HPC	✓ Hardware MoE	✗ Wafer-scale only
SambaNova RDU	✗ Proprietary	✗ Proprietary	✗ SambaNova only	✗ Cloud SaaS	✓ MoE inference	✗ Enterprise only
NVLink 6 / NVSwitch	✗ NVIDIA proprietary	✗ NVIDIA only	✗ NVIDIA GPUs only	~ On-premise available	✓ High-bandwidth	✗ $100k+ entry
CXL 3.x (standard)	✓ Open standard	✓ Physical spec	✓ Heterogeneous	✓ On-premise	✗ Interconnect only, no router	~ Server-class hardware

The conclusion is clear: CXL provides the physical interconnect foundation, but no project combines it with an open module specification, a semantic routing protocol, and a sovereign-first design principle targeting personal and SME-scale deployments. QMX is the first specification to occupy this space.

§7 · DEPLOYMENT CONFIGURATIONS

CONFIG A · PERSONAL

Single-node QMX

One Mac Mini (M4 Pro) or AMD workstation acts as both fabric controller and primary expert module. A single FPGA acceleration card connects via QMX-M (PCIe 5.0 x4 initially, CXL 3.x when available). Two experts on one physical machine. Suitable for individual professionals and students.

CONFIG B · TEAM

Multi-module local fabric

One AMD EPYC node as fabric controller. Two to four QMX-M modules on a CXL 3.x fabric switch — e.g. one FPGA module (code/structured data), one general-purpose ASIC module (reasoning/NL), one domain-specific fine-tune (medical/legal). 8–24 concurrent users. The target configuration for the Local AI Instance business tier.

CONFIG C · ENTERPRISE

Distributed federated fabric

Multiple physical nodes (NVIDIA DGX or custom builds), each running a local QMX CXL fabric, interconnected over 400GbE RoCE. The fabric controllers form a federated QMX-R cluster. Suitable for government departments, hospitals, and financial institutions requiring department-level expert segregation and national-scale inference throughput.

§8 · SPECIFICATION ROADMAP

v0.1

ACTIVE · NOW

Software-first validation

Implement the QMX-CAP manifest schema and QMX-R semantic router entirely in software on existing Mac Mini / AMD / NVIDIA hardware. Different locally-running models (Llama 3, Mistral, Qwen) act as software experts. Validate routing logic, latency budgets, domain classification accuracy, and synthesis quality before committing to any custom silicon.

v0.2

NEXT · Q3 2026

PCIe 5.0 module interface — QMX-M draft

Define the physical QMX-M connector and power spec. Prototype first QMX-M module as a reference design using an existing FPGA development board (Xilinx Alveo U55C or similar). Publish the QMX-M draft specification as an open document under Apache 2.0.

v0.3

Q4 2026

CXL 3.x fabric — QMX-F

Integrate CXL 3.x fabric switch (Panmnesia or equivalent) as the primary QMX-F interconnect. Implement memory pooling across fabric controller and FPGA module. Benchmark latency and bandwidth against software-baseline. Publish QMX-F specification draft.

v1.0

2027

Full specification release + ASIC qualification programme

Publish complete QMX v1.0 specification. Launch QMX Certification Programme — a formal process by which ASIC and FPGA vendors can qualify their modules as QMX-compliant. First ASIC inference chip certifications. Integration into the Victron AIO-32-MAX as the fabric controller reference platform.

v1.1

2027–28

UALink 1.0 fabric extension + direct-to-silicon ASIC modules

Add UALink 1.0 as a QMX-F transport option for large-scale sovereign deployments. Qualify first purpose-built ASIC inference chips (direct-to-silicon LLM inference modules) under the QMX-M physical spec. Enable petaflop-class sovereign MoE deployments on commodity rackspace.

QMX — Quantumind
Modular eXpert Interconnect

This specification is open. Contribute to it.
