Dreaming of running giant AI models like Kimi K2.5 right on your desk? It used to require a team of experts and complex coding. Now, with the right hardware and AI-assisted tools, setting up a massive 8-node cluster is easier and more accessible than ever before. Let’s build!
Building a high-performance local AI cluster is an ambitious project, but it is the ultimate way to gain control over your data and compute power. This configuration uses eight individual NVIDIA GB10 nodes, for a combined total of 160 fast ARM cores and 1TB of memory. This is not just a standard computer; it is a distributed system designed for high-speed AI inference. With this much RAM, you can run massive Large Language Models (LLMs) like Qwen 3.5 or Kimi K2.5 at higher-precision quantization levels (think 8-bit instead of 4-bit), meaning the model will be noticeably smarter and more accurate than the heavily compressed versions you typically find online.
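To get a feel for the numbers, here is a quick back-of-the-envelope sketch in Python. The parameter counts are illustrative assumptions rather than the exact sizes of the models above, and it only counts weight storage (KV cache and activations need extra headroom on top):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in gigabytes."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

CLUSTER_RAM_GB = 8 * 128  # eight GB10 nodes at 128 GB each, ~1 TB total

for params_b in (235, 671):           # hypothetical large-model sizes
    for bits in (4, 8, 16):
        need = weight_memory_gb(params_b, bits)
        verdict = "fits" if need < CLUSTER_RAM_GB else "too big"
        print(f"{params_b}B params @ {bits}-bit: ~{need:,.0f} GB ({verdict})")
```

The takeaway: a ~670B-parameter model at 16-bit blows past 1TB, but at 8-bit it fits with room to spare, which is exactly the headroom this cluster buys you.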
The hardware side of this setup is quite diverse. We are using a mix of systems, including the ASUS Ascent GX10, the Dell Pro Max with GB10, the Lenovo PGX, and the NVIDIA DGX Spark. Each of these nodes needs a lot of power: 240W delivered via USB-C. To manage this safely, you will need a professional Power Distribution Unit (PDU), such as the Ubiquiti USP-PDU-PRO, which allows you to monitor and control power for each individual node. Behind the scenes, the cabling can look like a “rat’s nest,” but every connection is vital for the cluster to function as a single giant brain.
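It is worth sanity-checking the power budget before plugging everything into one PDU. Here is a tiny sketch using the 240W-per-node figure above plus a rough, assumed allowance for the switch; the circuit numbers reflect a North American 120V/15A branch circuit, so adjust for your region and breaker rating:

```python
NODES = 8
WATTS_PER_NODE = 240        # USB-C power delivery per node, from above
SWITCH_WATTS = 100          # rough allowance for the 400GbE switch (assumption)

total_w = NODES * WATTS_PER_NODE + SWITCH_WATTS
usable_w = 120 * 15 * 0.8   # one 120 V / 15 A circuit at the 80% continuous rule

print(f"Worst-case draw: {total_w} W vs. {usable_w:.0f} W usable on one circuit")
if total_w > usable_w:
    print("Split the load across two circuits or use a higher-amperage feed.")
```

At full tilt this cluster can pull roughly 2kW, so plan on more than one circuit.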
Networking is the most technical part of this build. To make eight computers act as one, we use Remote Direct Memory Access (RDMA), specifically RoCE v2 (RDMA over Converged Ethernet), which lets the nodes move data directly between each other’s memory without putting a heavy load on the CPU. We use a MikroTik CRS804-4XQ-IN switch with 400GbE ports; QSFP-DD to 2x QSFP56 breakout cables split each port so that every node gets a 200GbE connection. Even though each node’s internal PCIe Gen 5 x4 interface limits actual throughput to about 109Gbps, this high-speed backbone is essential for reducing latency during the “AllReduce” operations that synchronize the nodes while the model runs.
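To see why the link speed matters so much, here is a rough cost model for a single ring AllReduce, using the standard textbook formula where each node transfers about 2(N-1)/N of the tensor. The tensor size is an illustrative assumption:

```python
def ring_allreduce_seconds(tensor_bytes: float, nodes: int, link_gbps: float) -> float:
    """Each node sends and receives roughly 2*(N-1)/N of the tensor."""
    traffic_bits = 2 * (nodes - 1) / nodes * tensor_bytes * 8
    return traffic_bits / (link_gbps * 1e9)

TENSOR_MB = 256  # hypothetical per-step synchronization buffer

for gbps in (10, 109, 200):  # 10GbE mgmt link, PCIe-limited rate, line rate
    t = ring_allreduce_seconds(TENSOR_MB * 1e6, nodes=8, link_gbps=gbps)
    print(f"{gbps:>3} Gbps: ~{t * 1000:.1f} ms per AllReduce")
```

The same sync that takes ~33ms over the 109Gbps fabric would take over 350ms on a 10GbE management link, and that penalty is paid on every step of inference.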
In the past, setting up a cluster like this would take weeks of manual configuration using Ansible playbooks and complex SSH commands. However, we are now entering the era of AI-assisted infrastructure. By using agents like Claude Code or OpenClaw, you can essentially give the AI the login credentials for your nodes and tell it to “set up the cluster.” These agents can handle installing NVIDIA container runtimes, configuring Docker, setting up the vLLM inference engine, and even troubleshooting network mismatches. If one node has a different firmware version or a misconfigured MTU (Maximum Transmission Unit), the AI agent can detect the problem and fix it automatically.
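Here is a minimal sketch of the kind of per-node audit such an agent automates behind the scenes. The IP addresses, username, key path, and interface name are all placeholders, and it assumes the paramiko SSH library is installed (`pip install paramiko`):

```python
import os
import paramiko

NODES = [f"10.0.0.{i}" for i in range(11, 19)]  # hypothetical management IPs
CHECKS = {
    "mtu": "cat /sys/class/net/enp1s0f0/mtu",   # adjust the interface name
    "driver": "nvidia-smi --query-gpu=driver_version --format=csv,noheader",
}

results = {}
for host in NODES:
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username="admin",
                key_filename=os.path.expanduser("~/.ssh/id_ed25519"))
    results[host] = {}
    for name, cmd in CHECKS.items():
        _, stdout, _ = ssh.exec_command(cmd)
        results[host][name] = stdout.read().decode().strip()
    ssh.close()

# Flag any node that disagrees with the first one -- the same kind of
# mismatch (wrong MTU, stale driver) an agent would detect and repair.
baseline = results[NODES[0]]
for host, values in results.items():
    diffs = {k: v for k, v in values.items() if v != baseline[k]}
    print(host, "OK" if not diffs else f"MISMATCH: {diffs}")
```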
When it comes to performance, we focus on Tensor Parallelism (TP). If a model is too large for one node, TP splits the weights of every layer across multiple nodes, which then compute in lockstep. For example, a model might run in TP=8 mode, meaning it uses all eight nodes simultaneously. While smaller models might actually run faster on a single node due to lower networking overhead, the 8-node cluster lets you run “Endgame” models that simply would not fit in the memory of a single machine. We also use a dedicated All-Flash Network Attached Storage (NAS) to host the model weights, so all nodes can pull the same data quickly during the startup phase, making the entire workflow much more efficient.
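For a concrete picture, here is a minimal sketch of what the eventual launch looks like through vLLM’s Python API. The model path is a hypothetical NAS location, and multi-node tensor parallelism in vLLM runs on top of a Ray cluster that must already span the nodes; the exact setup varies by vLLM version, so check the docs:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/mnt/nas/models/example-moe-400b",  # hypothetical path on the NAS
    tensor_parallel_size=8,                    # shard every layer across all 8 nodes
    distributed_executor_backend="ray",        # multi-node execution runs via Ray
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain RDMA in one paragraph."], params)
print(outputs[0].outputs[0].text)
```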
Before we finish, here is the detailed guide on how to get your cluster physically and digitally ready for action.
- Unbox and Power Up: Carefully unbox your eight GB10 nodes and connect each one to your high-wattage PDU using 240W-rated USB-C cables.
- Physical Networking: Plug your breakout cables into the MikroTik switch, then connect one QSFP56 end to the primary network port of each node. Use high-quality copper DAC cables for the shortest runs to save power.
- Switch Configuration: Access your MikroTik switch console. Navigate to the interface settings and turn off “auto-negotiation” for the ports, manually setting them to 200GbE or 100GbE depending on your specific NIC capability.
- MTU and Quality of Service: Set the MTU to 4200 or higher to support RoCE v2 traffic, and make sure every node matches (see the node-side MTU sketch after this list). Ensure ECN (Explicit Congestion Notification) and PFC (Priority Flow Control) are enabled to prevent packet loss during heavy AI workloads.
- Initial Node Access: Log into each node via SSH using a management network (like the built-in 10GbE port or Wi-Fi). Update the Linux kernel and install the latest NVIDIA drivers.
- Agent Orchestration: Launch an AI agent like Claude Code. Provide it with the list of IP addresses for your nodes.
- Software Stack Deployment: Command the agent to install Docker and the NVIDIA Container Toolkit across all nodes. Ask it to verify that all nodes can “see” each other over the RDMA network using ib_write_bw tests (a bandwidth-sweep sketch follows this list).
- Model Loading: Point your vLLM configuration to your NAS storage. Choose a model like Qwen-397B and set the Tensor Parallel degree to 8. Start the inference engine and wait for the “ready” status.
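As promised in the MTU step, here is a node-side companion sketch: the switch MTU only helps if every node’s RDMA interface matches it. The interface name and IP addresses are assumptions to adapt:

```python
import subprocess

NODES = [f"10.0.0.{i}" for i in range(11, 19)]  # hypothetical management IPs
IFACE = "enp1s0f0"                              # RDMA NIC name (assumption)

for host in NODES:
    # Set the MTU, then read it back to confirm the change stuck.
    subprocess.run(["ssh", host, f"sudo ip link set dev {IFACE} mtu 4200"],
                   check=True)
    mtu = subprocess.run(["ssh", host, f"cat /sys/class/net/{IFACE}/mtu"],
                         capture_output=True, text=True, check=True).stdout.strip()
    print(f"{host}: MTU {mtu}" + ("" if mtu == "4200" else "  <-- mismatch!"))
```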
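And here is the bandwidth sweep referenced in the deployment step: for each pair of nodes, it starts an ib_write_bw server on one and drives traffic from the other over SSH. The device name and flags are assumptions; match them to your perftest version and RoCE setup:

```python
import itertools
import subprocess
import time

NODES = [f"10.0.0.{i}" for i in range(11, 19)]  # hypothetical RDMA-side IPs
DEV = "mlx5_0"                                  # RDMA device name (assumption)

for server, client in itertools.combinations(NODES, 2):
    # Start the listener on one node, then drive traffic from its peer.
    srv = subprocess.Popen(
        ["ssh", server, f"ib_write_bw -d {DEV} -R --report_gbits"],
        stdout=subprocess.DEVNULL,
    )
    time.sleep(2)  # give the server side a moment to come up
    result = subprocess.run(
        ["ssh", client, f"ib_write_bw -d {DEV} -R --report_gbits {server}"],
        capture_output=True, text=True,
    )
    srv.wait(timeout=30)
    summary = result.stdout.strip().splitlines()[-2:]  # last lines hold the numbers
    print(f"{client} -> {server}:", *summary, sep="\n")
```

Any pair reporting well below the ~109Gbps ceiling points at a cabling, MTU, or PFC problem worth fixing before you load a model.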
Building an 8-node cluster marks a significant shift in how we approach local AI development today. By utilizing the NVIDIA GB10 ecosystem alongside high-speed RDMA networking, you gain the unprecedented ability to run massive models like Kimi K2.5 with incredible fidelity. While the hardware investment is substantial, the automation provided by tools like OpenClaw removes the traditional technical barriers. I recommend starting with two nodes to understand the networking dynamics before scaling to eight. This setup ensures your local environment remains future-proof as models continue to grow. Embrace this technology to keep your data private and your workflows exceptionally efficient.
