AI Hardware Buyer's Guide
Run AI locally — without the guesswork
For SMBs, professionals, and IT teams choosing hardware to run private AI on their own infrastructure
Trust legend:
🟢 measured (this hardware) ·
🟡 reproduced (third-party benchmark, verified) ·
⚪ estimated (extrapolated from architecture)
Last updated 2026-05-04 14:44 · Pricing snapshot: 2026-05-03 (CAD, Apple Canada / Canada Computers / frame.work). Verify before ordering.
Read this first — the 30-second decision
You want local AI if any of these are true:
- You handle client data, employee records, financials, or anything regulated (Quebec Law 25, HIPAA, PIPEDA)
- Your legal/compliance team has said "we can't use ChatGPT for that"
- You want predictable monthly cost instead of a per-token bill that grows with usage
- You want AI to keep working when the internet doesn't, or when the SaaS is down
You probably don't need this if:
You're a sole user with no privacy constraints and Claude Pro / ChatGPT Plus already covers your needs. The cheapest setup below is ~$5,400 CAD — that's 30+ months of a $140/month cloud subscription.
What changes once you have local AI:
Clean French/English drafting, document Q&A, code review, transcript analysis, basic agent workflows — all of it private, all of it instant, all of it included once you've bought the hardware.
Sizing for a 3-4 year purchase — storage, RAM, chip
A laptop bought today should still be the right tool in 2029-2030. Three axes matter: storage (how much disk to buy), RAM (the real ceiling on what models you can run), and chip class (memory bandwidth = token generation speed). Buy too small and you'll feel pain by year 2; buy too big and you waste money on capacity you won't use. Here's how to think about it.
💾 Storage: 2 TB is the right pick for most
Realistic 4-year laptop disk budget assuming you have a home server or NAS for media + archive:
| Category | Year-4 budget |
| macOS + apps | ~250 GB |
| Personal docs + mail archive | ~350 GB |
| Active dev workspace | ~150 GB |
| Local model library (laptop-runnable subset) | ~200 GB |
| HF cache + temp | ~60 GB |
| Buffer (snapshots, OS upgrades, growth) | ~400 GB |
| Year-4 total | ~1.4 TB |
Verdict: 2 TB leaves ~600 GB free at year 4. Comfortable, not luxurious.
Upgrade to 4 TB if: you record/edit video locally, you run multiple Parallels/Docker VMs, you don't have a home server (laptop = whole stack), or you want zero "manage disk space" thinking. Cost: +$1,400 CAD.
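The budget above is simple arithmetic; a few lines of Python make it easy to re-run with your own numbers. The line items below are the table's estimates, not measurements — swap in your own:

```python
# Year-4 disk budget (GB), taken from the table above; replace with your own estimates.
budget_gb = {
    "macOS + apps": 250,
    "personal docs + mail archive": 350,
    "active dev workspace": 150,
    "local model library": 200,
    "HF cache + temp": 60,
    "buffer (snapshots, upgrades, growth)": 400,
}

total_gb = sum(budget_gb.values())   # ~1.4 TB at year 4
free_on_2tb = 2000 - total_gb        # headroom left on a 2 TB drive

print(f"year-4 total: {total_gb} GB, free on 2 TB: {free_on_2tb} GB")
```

With these estimates the total lands at ~1.41 TB, leaving roughly 600 GB free on a 2 TB drive — the "comfortable, not luxurious" verdict above.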
🧠 RAM: 128 GB is the 4-year-safe floor
RAM is the real ceiling on what models you can run. Disk doesn't matter if the model can't fit in memory.
| Year | Mainstream "good" model | Resident size in RAM |
| 2026 (now) | Gemma 4 31B 8-bit | 31 GB |
| 2027 | 40-50B class quantized | ~40 GB |
| 2028 | 60-80B class | ~60 GB |
| 2029 | 100B class | ~70 GB |
Verdict: 64 GB starts feeling tight by year 2-3: model sizes grow, and you also need RAM for the OS + apps + active work. 128 GB is the only laptop config that's safe through 2029-2030.
Bigger models (200B+ dense, 1T+ MoE) won't run on any laptop regardless of RAM — those live on home servers or clusters (Setups #5 and #7).
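A quick way to sanity-check whether a model fits: resident model size plus whatever the OS, apps, and context cache need must stay under total RAM. The ~28 GB overhead below is an illustrative assumption, not a measured number — tune it to your own workload:

```python
def fits_in_ram(model_gb: float, ram_gb: int, overhead_gb: float = 28.0) -> bool:
    """True if the model can stay resident alongside OS + apps + KV cache.

    overhead_gb is an assumed allowance for macOS, apps, and context cache
    (not a measured figure) -- adjust for your workload.
    """
    return model_gb + overhead_gb <= ram_gb

print(fits_in_ram(40, 64))    # Llama 70B 4-bit (~40 GB) on 64 GB -> False
print(fits_in_ram(40, 128))   # same model on 128 GB -> True
```

This is the arithmetic behind the Setup #1 trade-off note that a ~40 GB model "won't fit alongside macOS + apps" on 64 GB.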
⚡ Chip: M5 Pro hits a wall sooner than M5 Max
Memory bandwidth = token generation speed. The Pro / Max gap widens as model sizes grow because bigger models = more memory to read per token.
| Chip | Bandwidth | 2026 30B | 2029 100B |
| M5 Pro | 307 GB/s | ~30 t/s | ~5 t/s |
| M5 Max | 614 GB/s | ~60 t/s | ~12 t/s |
| M5 Ultra | ~820 GB/s | ~80 t/s | ~18 t/s |
Verdict: M5 Pro is fine for 2026 30B-class daily use. M5 Max stays usable on 100B-class models in 2029. M5 Pro at 100B class would be painful (5 t/s = ~3 minutes for a typical answer).
Estimates above assume 4-bit quantization. Lower quants (Q3, MXFP4) extend the runway by ~30%.
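The bandwidth-bound rule of thumb behind these estimates: a dense model streams all of its weights from memory for every generated token, so bandwidth divided by resident model size gives a theoretical ceiling on tokens per second. Real throughput lands below this ceiling, and MoE models (which read only active-expert weights per token) land above it — the 2:1 Pro/Max ratio is what carries over. The 17 GB figure is an assumed resident size for a 30B model at 4-bit:

```python
def tps_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/sec for a dense model: every token reads all weights once."""
    return bandwidth_gb_s / model_gb

# Assumption: a 30B dense model at 4-bit quant is roughly 17 GB resident.
print(round(tps_ceiling(307, 17), 1))   # M5 Pro  -> 18.1
print(round(tps_ceiling(614, 17), 1))   # M5 Max  -> 36.1 (2x the Pro, as in the table)
```

Note the ceiling scales inversely with model size — which is exactly why the Pro/Max gap widens as you move toward 100B-class models.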
The 3-4 year safe pick: 128 GB RAM + M5 Max + 2 TB SSD (Setup #3 or #4) is the only laptop config that won't feel undersized by 2029. If your AI use is light and you're OK refreshing in 2-3 years, M5 Pro 64 GB (Setup #1 or #2) is fine. Don't pay for 4 TB unless you fit one of the upgrade-if cases above — that money is better spent on the home server (Setup #5).
The shopping list — 9 recommended setups
Each setup is a complete order: machine + accessories + tax + total. Pick the one that matches your situation, hand the list to your IT person, or order it yourself. Prices are CAD, snapshot 2026-05-03, including QC sales tax (TPS+TVQ 14.975%).
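Every turnkey total below follows the same computation — subtotal times (1 + 0.14975) for combined TPS+TVQ, rounded to the dollar. A sketch you can reuse when you swap line items in any of the shopping lists:

```python
QC_TAX = 0.14975  # TPS + TVQ combined, Quebec

def total_with_qc_tax(subtotal_cad: float) -> int:
    """Turnkey total in CAD, rounded to the nearest dollar."""
    return round(subtotal_cad * (1 + QC_TAX))

# Cross-check against the setup cards below:
print(total_with_qc_tax(12_459))   # Setup #6 -> 14325
print(total_with_qc_tax(21_358))   # Setup #7 -> 24556
print(total_with_qc_tax(13_360))   # Setup #8 -> 15361
```

If you substitute a line item (different SSD, a second GPU), recompute the subtotal and run it through the same function before quoting a budget.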
Setup #1
14" MacBook Pro M5 Pro · 64 GB · 2 TB
Total turnkey (CAD, incl. QC tax)
$5,344
Pick this if: Solo mobile professional. Lawyer, accountant, agent, consultant. Light-to-medium AI use a few times a day. You travel.
Typical response
~3-8 sec
Privacy tier
all-tier (offline OK)
Shopping list
Lead time: In stock at Apple Canada
Setup time: Half-day (LM Studio + 2 models + chat UI)
Honest trade-offs
- 307 GB/s memory bandwidth — about 2× the M5 base, but half of M5 Max. Token generation scales nearly linearly with this number (see callout below).
- 64 GB RAM caps you at 30B-class models comfortably. Llama 70B 4-bit (~40 GB) won't fit alongside macOS + apps.
- Battery 16+ hours on non-AI work, ~6-8 hours when running an LLM continuously.
Setup #2
16" MacBook Pro M5 Pro · 64 GB · 2 TB
Total turnkey (CAD, incl. QC tax)
$6,091
Pick this if: Solo desk-first professional. Want a bigger screen built-in, occasional travel. Same AI capability as Setup #1.
Typical response
~3-8 sec
Privacy tier
all-tier (offline OK)
Shopping list
Lead time: In stock at Apple Canada
Setup time: Half-day
Honest trade-offs
- Larger 16.2" screen + better sustained thermals than 14" — but 700g heavier (2.16 kg vs 1.55 kg).
- Same M5 Pro chip as Setup #1 — quality and AI speed identical. You're paying for screen + thermal headroom.
- Better cooling matters less here than in the Max chassis (the Pro chip stays within thermal budget anyway).
Setup #3
14" MacBook Pro M5 Max · 128 GB · 2 TB
Total turnkey (CAD, incl. QC tax)
$7,471
Pick this if: Solo road-warrior who needs Max performance in a smaller body. You travel a lot, you want to run every model on this page, and you accept the thermal compromise of the 14" chassis.
Typical response
~2-5 sec (bursts)
Privacy tier
all-tier (offline OK)
Shopping list
Lead time: Build-to-order, 1-2 weeks
Setup time: Half-day
Honest trade-offs
- 614 GB/s bandwidth + 128 GB unified RAM — runs every model on this page, including Llama 3.3 70B (~40 GB) and Qwen3.6 27B reasoning (~35 GB), with room to spare.
- 14" Max throttles 15-25% under sustained load (Notebookcheck, Digital Trends measurements). Spikes to 96 W, then drops to 42 W. Fine for bursts, painful for all-day inference.
- SSD reaches 100°C under sustained AI workloads in this chassis. Apple's cooling can't keep up with the Max chip in the 14" form factor.
- Same chip as Setup #4 (16" Max 128 GB) but 700g lighter and more portable. If you grind on AI for hours daily, go 16". If your AI use is bursty + you live on the road, this wins.
Setup #4
16" MacBook Pro M5 Max · 128 GB · 2 TB
Total turnkey (CAD, incl. QC tax)
$7,988
Pick this if: Solo cockpit. Heavy AI all day. You want every model on this page to run, no thermal compromise, no swap, no waiting.
Typical response
~2-5 sec
Concurrent users
1 user (or 2-3 light)
Privacy tier
all-tier (offline OK)
Shopping list
Lead time: Build-to-order, 2-3 weeks
Setup time: 1 day (LM Studio + 4-5 models + Cherry agents + Tailscale)
Honest trade-offs
- This is the chip the measurements on this page were taken on (the test bench was the 14" Max 128 GB / 4 TB; the 16" body is the same chip with proper thermals, so performance numbers transfer identically or improve).
- 16" body sustains 62 W indefinitely vs 14"'s 42 W after throttle. 15-25% faster on long workloads.
- Heavier (2.16 kg) and bigger footprint. If you live on planes and coffee shops, Setup #3 is more portable.
- 128 GB unified memory means zero swap even with multiple models loaded + active dev work.
Setup #5
Power user combo: 14" Pro 64 GB laptop + Mac Studio M5 Ultra at home
Total turnkey (CAD, incl. QC tax)
$15,406
Pick this if: Heavy AI user with stable home internet (>99% uptime). You want Studio-grade quality at desk + laptop autonomy on the road. Best long-term value if you can wait for Studio M5 Ultra (rumored June or Oct 2026).
AI quality vs cloud
91% laptop · 95%+ on Studio (70B-class models)
Typical response
~3-8 sec laptop · ~1-3 sec on Studio
Concurrent users
1-3 users on Studio
Privacy tier
all-tier (Tailscale-private remote access)
Shopping list
Lead time: Studio M5 Ultra: not yet released (est. June or Oct 2026). M3 Ultra Studio available now as interim.
Setup time: 1-2 days (Studio + Tailscale mesh + LM Studio on both + remote access)
Honest trade-offs
- Architecture: laptop = thin client when home (queries Studio over Tailscale, sub-ms LAN latency), full local stack when offline. Best of both worlds.
- Studio runs 70B-class models at full bandwidth (820 GB/s on M5 Ultra). Llama 3.3 70B, Qwen3.6 35B, Gemma 4 31B all fast enough for real-time chat.
- Laptop sized for autonomy, not for being the primary AI rig — 64 GB Pro is plenty for the rare offline use case (Gemma 4 26B 8-bit fits comfortably).
- M5 Ultra Studio is rumored, not released. If you need this today, substitute M3 Ultra Mac Studio (~$5,500-9,000 CAD depending on RAM) and upgrade later.
Setup #6
Linux GPU workstation — 4× Intel Arc Pro B70 + Threadripper
Total turnkey (CAD, incl. QC tax)
$14,325
Pick this if: Maximum tokens-per-dollar. You're comfortable with Linux + vLLM serving stack. You don't need Mac apps. Best raw inference speed in this guide.
Typical response
~1-2 sec (vLLM, BF16 concurrency)
Concurrent users
3-5 users
Privacy tier
all-tier (self-hosted)
Shopping list
| Item | Price (CAD) |
| 4× Intel Arc Pro B70 32 GB GDDR6 @ $1,300 CAD ea (128 GB total VRAM) — Canada Computers | $5,200 |
| AMD Ryzen Threadripper 7960X (24-core / 48-thread, sTR5) — verified $2,199 CAD, "Available to Order" — Canada Computers | $2,199 |
| GIGABYTE TRX50 AERO D motherboard (sTR5, EATX, 4× M.2, dual 10 GbE) — verified $889.99 CAD, "Available to Order" — Canada Computers | $890 |
| 128 GB DDR5-5600 UDIMM non-ECC (Kingston Fury Beast 2× 64 GB) — verified $2,619.99 CAD, in stock. ECC RDIMM is sold out nationally + 2-3× more expensive — see notes. — Canada Computers | $2,620 |
| Noctua NH-U14S TR5-SP6 CPU air cooler (Threadripper sTR5) — Amazon.ca | $150 |
| 4 TB NVMe Gen5 SSD (Samsung 990 EVO Plus or WD Black SN850X) — Amazon.ca / Newegg.ca | $600 |
| 1500-1600 W Platinum PSU (Corsair AX1500i or Seasonic Prime TX-1600) — Newegg.ca | $500 |
| Full-tower chassis with 4-GPU support (Fractal Design Define 7 XL or Phanteks Enthoo Pro 2) — Amazon.ca / Newegg.ca | $300 |
| Ubuntu 24.04 + vLLM (free) — ubuntu.com | $0 |
| Subtotal | $12,459 |
| QC sales tax (14.975%) | +$1,866 |
| TOTAL | $14,325 |
Lead time: GPUs in stock at Canada Computers. CPU/MB/RAM 1-2 weeks. Total assembly 1-2 days.
Setup time: 1-2 days (Linux + Intel oneAPI drivers + vLLM + model serving)
Honest trade-offs
- 128 GB total VRAM across 4 cards (32 GB each at 600 GB/s per card vs 614 GB/s shared on Mac M5 Max).
- ~800 W total power draw under load (4× 250 W GPUs + 350 W CPU + overhead). Verify circuit + cooling before ordering.
- Non-ECC UDIMM choice: spec uses Kingston Fury Beast 128 GB (consumer DDR5, $2,620 in stock). ECC RDIMM (Kingston Renegade Pro 128 GB) was $5,850 + sold out nationally as of 2026-05-04. ECC matters for training; for inference workloads silent bit-flips produce wrong tokens, not data corruption — non-ECC is acceptable. If your grant or policy requires ECC, swap the line and budget +$3,000-4,000 + lead time.
- Audible under load — multiple GPU fans + tower cooler. Not a coffee-shop or quiet-office machine.
- Intel oneAPI / IPEX-LLM ecosystem is maturing — fewer tutorials than NVIDIA CUDA, so expect more Stack Overflow time on driver/quant issues.
- vLLM serving makes it shine for multi-user concurrent inference. Single-user is overkill — Setup #4 is faster for one person.
- Best raw tokens-per-dollar in this guide for serving (BF16 c4: 194 t/s on Qwen3-4B per Intel benchmarks).
Setup #7
Framework Cluster — 4× Mainboards in a 10" rack
Total turnkey (CAD, incl. QC tax)
$24,556
Pick this if: Team of 15-50 people sharing private AI. You have rack space and someone comfortable with Linux ops + DIY builds. Scale by adding more nodes.
Typical response
~5-8 sec
Concurrent users
10-20 users
Privacy tier
all-tier (self-hosted, multi-user)
Shopping list
| Item | Price (CAD) |
| 4× Framework Desktop Mainboard (Ryzen AI Max+ 395, 128 GB LPDDR5x soldered, Mini-ITX) @ $3,769 CAD ea — frame.work CA | $15,076 |
| 4× Framework 400W FlexATX PSU @ $209 CAD ea — frame.work CA | $836 |
| 4× Framework Desktop CPU Fan Kit (Noctua) @ $55 CAD ea — frame.work CA | $220 |
| 4× 2 TB NVMe Gen4 SSD (M.2 2280, WD Black SN850X or equivalent) @ ~$250 CAD ea — Amazon.ca | $1,000 |
| MikroTik CRS812-DDQ switch (2× 400G QSFP56-DD + 2× 200G QSFP56 + 8× 50G SFP56 + 2× 10G) — MikroTik Canada Store | $1,815 |
| MikroTik DQ+BC0003-DS+ breakout cable (1× 200G QSFP56 → 4× 50G, 3m) — feeds all 4 nodes from one switch port — MikroTik Canada Store | $111 |
| 4× 50 GbE NICs (PCIe x4) — each node uplink — Amazon.ca / Newegg.ca | $1,200 |
| UPS 2200VA rack-mount (covers ~600 W cluster + switch) — Amazon.ca | $1,100 |
| Ubuntu 24.04 + LibreChat + LiteLLM router (free) — librechat.ai | $0 |
| Subtotal | $21,358 |
| QC sales tax (14.975%) | +$3,198 |
| TOTAL | $24,556 |
Lead time: Framework mainboards + PSUs: 4-8 weeks build-to-order. Switch + optics: 1-2 weeks distributor. Chassis: in stock at MyElectronics or similar.
Setup time: 5-7 days (mechanical assembly per node + switch config + cluster networking + LibreChat + per-user accounts)
Honest trade-offs
- This is a DIY build. Mainboards arrive bare — you supply the chassis, mount the PSU + cooler, install the SSD, and cable it. Plan for one full day per node.
- 512 GB total memory across 4 nodes at ~256 GB/s per node. Scales by node count for concurrent users, not for single-query speed.
- ~20% generation drop scaling 2 → 4 nodes (Alex Ziskind benchmarks) — coordination overhead is real but manageable.
- ~600 W total cluster power draw at sustained inference. The UPS line-item is sized for this; verify your room circuit can handle it.
- Memory is soldered (LPDDR5x) — no future RAM upgrade per node. Locked in at 128 GB.
- Linux-only. No Apple ecosystem. Your IT person will own this.
- Best price-per-concurrent-user in this guide. Cheaper per seat than buying everyone a MacBook.
Setup #8
NVIDIA DGX Spark — 1- or 2-node desktop AI supercomputer
Total turnkey (CAD, incl. QC tax)
$15,361
Pick this if: You want NVIDIA-native CUDA tooling (PyTorch, vLLM, TensorRT-LLM) on a desktop. 1 node for solo use, 2-node direct-cabled for fine-tuning + serving up to 405B models.
AI quality vs cloud
90% (1 node) / 92%+ (2 nodes)
Typical response
~2-4 sec (1 node) / ~1-3 sec (2 nodes)
Concurrent users
1-3 users (1 node) / 5-10 users (2 nodes)
Privacy tier
all-tier (self-hosted)
Shopping list
| Item | Price (CAD) |
| 1× NVIDIA DGX Spark Founders Edition (GB10 Grace Blackwell, 128 GB unified, 4 TB NVMe) @ $4,699 USD — NVIDIA Marketplace / Best Buy / Newegg | $6,580 |
| +1× DGX Spark Founders Edition (for 2-node config — optional) @ $4,699 USD — NVIDIA Marketplace / Best Buy / Newegg | $6,580 |
| 0.5 m 200G QSFP56 DAC cable (direct node-to-node, no switch needed for 2 nodes) — Amazon.ca / NADDOD | $200 |
| Ubuntu DGX OS (preinstalled) + NVIDIA AI software stack (free) — NVIDIA NGC | $0 |
| Subtotal | $13,360 |
| QC sales tax (14.975%) | +$2,001 |
| TOTAL | $15,361 |
Lead time: NVIDIA Marketplace: 2-4 weeks. Best Buy: in-stock varies. DAC cable: in-stock Amazon.ca.
Setup time: Half-day (1 node) or 1 day (2 nodes incl. RoCE config + NCCL test)
Honest trade-offs
- 1 node = $6,580 CAD. 2 nodes (the configuration shown above) = ~$13,360 CAD before tax. Drop the second node line for solo use.
- Native CUDA — PyTorch, vLLM, TensorRT-LLM, ComfyUI, etc. just work. No oneAPI/ROCm hunting like Setup #6 (Intel Arc) requires.
- 2 nodes = direct cable, no switch. ConnectX-7 200 GbE port-to-port via DAC. A switch is only needed at 3+ nodes (separate purchase).
- Memory is unified (GB10 Grace Blackwell) — 128 GB per node, ~273 GB/s bandwidth. Per-node bandwidth is lower than Mac M5 Max (614 GB/s) — Spark wins on CUDA ecosystem, Mac wins on per-node speed.
- ~240 W per node; two nodes ≈ 480 W. Quiet, office-friendly desktop form factor.
- ~17 GB/s NCCL throughput over 200 GbE (real-world, per NVIDIA forum) — fine for inference, the bottleneck for training on larger-than-2-node setups.
- NVIDIA pricing is volatile — Founders Edition jumped 18% in 2025 due to memory shortages. Check current pricing at order time.
- Best fit for: AI researchers, devs who'll exploit CUDA libraries, teams with PyTorch fine-tuning workflows.
Setup #9
NVIDIA RTX PRO Blackwell workstation (48 / 72 / 96 GB VRAM tiers)
Total turnkey (CAD, incl. QC tax)
$15,107
Pick this if: You need single-card high-VRAM CUDA for AI inference, training, or rendering. Pick the VRAM tier that matches your largest expected model.
AI quality vs cloud
92% (48 GB) / 94% (72 GB) / 95%+ (96 GB)
Typical response
<1-2 sec (CUDA + TensorRT-LLM optimized)
Concurrent users
1 user (intensive) / 3-5 users (moderate)
Privacy tier
all-tier (self-hosted)
Shopping list
| Item | Price (CAD) |
| GPU choice (pick ONE): RTX PRO 5000 Blackwell 48 GB GDDR7 @ $4,200 USD → $5,880 CAD — NVIDIA / Newegg / B&H | $5,880 |
| — OR — RTX PRO 5000 Blackwell 72 GB GDDR7 @ ~$6,300 USD → $8,820 CAD (price not yet listed publicly) — NVIDIA / Newegg | $0 |
| — OR — RTX PRO 6000 Blackwell 96 GB GDDR7 ECC @ $8,565 USD → $11,990 CAD — NVIDIA / PNY / Micro Center | $0 |
| AMD Ryzen Threadripper 7960X (24-core / 48-thread, sTR5) — verified $2,199 CAD, "Available to Order" — Canada Computers | $2,199 |
| GIGABYTE TRX50 AERO D motherboard (sTR5, EATX, 4× M.2, dual 10 GbE) — verified $889.99 CAD, "Available to Order" — Canada Computers | $890 |
| 128 GB DDR5-5600 UDIMM non-ECC (Kingston Fury Beast 2× 64 GB) — verified $2,619.99 CAD, in stock. ECC RDIMM is sold out nationally + 2-3× more expensive — see notes. — Canada Computers | $2,620 |
| Noctua NH-U14S TR5-SP6 CPU cooler (Threadripper sTR5) — Amazon.ca | $150 |
| 4 TB NVMe Gen5 SSD — Amazon.ca / Newegg.ca | $600 |
| 1500-1600 W Platinum PSU (single 600 W GPU + 350 W CPU needs ~1300 W headroom) — Newegg.ca | $500 |
| Full-tower chassis (Fractal Define 7 XL or similar, fits dual-slot 600 W workstation GPU) — Amazon.ca / Newegg.ca | $300 |
| Ubuntu 24.04 + NVIDIA CUDA 12 + vLLM / TensorRT-LLM (free) — ubuntu.com / nvidia.com | $0 |
| Subtotal | $13,139 |
| QC sales tax (14.975%) | +$1,968 |
| TOTAL | $15,107 |
Lead time: GPU: 2-6 weeks (RTX PRO line, restricted distribution). Other components: 1-2 weeks.
Setup time: 1-2 days (Linux + CUDA drivers + vLLM/TensorRT-LLM model serving)
Honest trade-offs
- Pricing on the GPU lines: totals shown only count the 48 GB option. The 72 GB and 96 GB lines display $0 to avoid double-counting — substitute the line you actually want when ordering. With the non-ECC RAM build: 72 GB GPU build total ≈ $15,800 CAD pre-tax / $18,200 CAD with QC tax; 96 GB GPU build total ≈ $18,950 CAD pre-tax / $21,790 CAD with QC tax.
- 96 GB GDDR7 ECC is the most VRAM you can put in a single desktop card today. Runs Llama 3 70B at 8-bit, Qwen 2.5 72B, and Mistral Large at high quants without offloading.
- CUDA-native — TensorRT-LLM gives ~2-3× faster inference vs vLLM bf16 on the same card. Industry-standard tooling.
- ~600 W per GPU under load. 1500 W PSU is the realistic minimum; 1600 W gives headroom for transient spikes.
- Non-ECC system RAM: spec uses Kingston Fury Beast 128 GB UDIMM ($2,620 in stock at Canada Computers). ECC RDIMM was $5,850 + sold out nationally as of 2026-05-04. The GPU's 96 GB GDDR7 is ECC; system RAM is non-ECC. For inference workloads this is fine. For training or grant requirements that mandate end-to-end ECC, swap to RDIMM and budget +$3,000-4,000 + lead time.
- Single-card design — for multi-GPU (2× or 4× 96 GB), you need a server platform with PLX switching or ConnectX-7 as an NVLink replacement. Beyond this guide's scope.
- Best fit for: AI fine-tuning, ML research, professional rendering, single-user max-VRAM workloads.
Why M5 Max ≠ just more RAM than M5 Pro 🟡
The biggest under-explained spec when people compare Apple chips for AI: memory bandwidth. AI inference (the part where the model actually generates each word of the answer) is bottlenecked by how fast the chip can read the model's weights from memory — not by raw CPU/GPU compute. That makes bandwidth the single most important number for token generation speed.
| M5 base (MacBook Air, base MBP) | ~150 GB/s | 1× |
| M5 Pro (Setups #1, #2, #5 laptop) | 307 GB/s | ~2× |
| M5 Max (Setups #3, #4) | 614 GB/s | ~4× |
| M5 Ultra (Setup #5 Studio, when released) | ~820 GB/s (est.) | ~5.5× |
What this means for you: on the same model, an M5 Max generates tokens roughly twice as fast as an M5 Pro. M5 Pro is fine for occasional or light AI use. M5 Max is the floor for sustained or daily-driver AI work. Don't pay Pro prices and expect Max performance — and don't pay Max prices for a 14" body that throttles back to Pro speeds anyway (see Setup #3 trade-offs).
Source: Alex Ziskind, "Apple's New M5 Max Changes the Local AI Story" (March 2026, Stream Triad memory bandwidth measurements).
The proof — real measurements 🟢
We didn't make these numbers up. The "AI quality vs cloud" % on every setup card above traces back to this eval: 7 prompts × 6 models = 42 calls, run on a real M5 Max 128 GB, scored 0-5 by hand against a clear rubric. Sonnet 4.6 (cloud) is the reference at 100%; everything else runs on-device. Setups #3-#5 use the same Apple silicon as the test bench. Setups #6-#7 use comparable hardware (Intel Arc, AMD Ryzen AI Max+) — quality scales with model size and bandwidth, not vendor branding.
Test bench: 14" MacBook Pro M5 Max 128 GB / 4 TB / macOS 26 / LM Studio 0.4.12 with MLX runtime app-mlx-generate-mac26-arm64@22. Real audio transcript from a French-speaking real-estate professional, PII-stripped to 615 words. Total OpenRouter cost for the Sonnet 4.6 baseline: under $0.50.
| Model | Quality vs cloud | Score · total time |
| Sonnet 4.6 (cloud ref) | 100% | 35/35 · 89s |
| Gemma 4 31B MLX | 91% | 32/35 · 254s |
| Gemma 4 26B-A4B 8bit | 83% | 29/35 · 93s |
| Qwen3-Coder 30B-A3B | 83% | 29/35 · 48s · fastest local |
| Llama 3.3 70B 4bit | 80% | 28/35 · 311s |
| Qwen3.6 27B MLX | 66% | 23/35 · 943s · reasoning-mode caveats |
Quality Score Heatmap (0-5 per prompt, max 35)
| Model | Reasoning 2 trains | Code review 3 bugs | FR Email relance | Garage lift 2P vs 4P | Consulting 3 questions | FR extract 1 sentence | FR analysis structured | Total |
| Sonnet 4.6 (cloud reference) | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 35 |
| Gemma 4 31B MLX (~16GB · dense) | 5 | 5 | 4 | 5 | 3 | 5 | 5 | 32 |
| Gemma 4 26B-A4B 8bit (~28GB · MoE-4B-active) | 5 | 5 | 4 | 1 | 4 | 5 | 5 | 29 |
| Qwen3-Coder 30B-A3B (~16GB · MoE-3B-active) | 5 | 5 | 5 | 1 | 4 | 5 | 4 | 29 |
| Llama 3.3 70B (~40GB · 4-bit) | 5 | 4 | 4 | 4 | 3 | 5 | 3 | 28 |
| Qwen3.6 27B MLX (~35GB · 8-bit · reasoning) | 5 | 5 | 3 | 5 | 0 | 0 | 5 | 23 |
Latency per Prompt (seconds)
| Model | Reasoning | Code rev | FR email | Garage lift | Consulting | FR extract | FR analysis | Total |
| Sonnet 4.6 | 9.2s | 8.1s | 16.5s | 11.7s | 9.9s | 3.6s | 29.9s | 88.9s |
| Qwen3-Coder 30B-A3B | 8.0s | 6.3s | 7.2s | 5.9s | 6.4s | 4.6s | 9.3s | 47.7s |
| Gemma 4 26B-A4B 8bit | 14.3s | 14.4s | 15.2s | 11.8s | 12.4s | 8.6s | 16.3s | 93.0s |
| Gemma 4 31B MLX | 48.8s | 34.2s | 48.6s | 27.8s | 25.1s | 14.2s | 55.3s | 254.0s |
| Llama 3.3 70B 4bit | 66.1s | 64.5s | 53.0s | 21.6s | 29.9s | 14.1s | 62.2s | 311.4s |
| Qwen3.6 27B MLX | 174.3s | 159.6s | 113.2s | 95.7s | 142.0s | 47.9s | 210.7s | 943.4s |
Findings — what 42 real-task calls actually showed
- Local 80-91% of cloud quality, no asterisks. Gemma 4 31B reached 91% of Sonnet 4.6 on real-world tasks. The 9% gap is judgment-and-jurisdiction-aware consulting + sharper writing instinct, not raw correctness.
- Qwen3-Coder 30B-A3B is the speed king. 48s total for the full 7-prompt suite — faster end-to-end than even Sonnet (89s). Same 83% quality as the much bigger-resident Gemma 4 26B 8bit. The MoE 3B-active architecture is real engineering.
- Llama 3.3 70B is overrated for this workload. 80% quality at 311s total — slower than Gemma 31B, lower scoring than Gemma 26B. The famous 70B is not the local default it once was; smaller modern models match or beat it on judgment tasks.
- Reasoning models (Qwen3.6) need 3-5× more output budget. Three of seven Qwen3.6 prompts truncated at finish=length because the reasoning phase ate the entire token cap. Once given enough room, Qwen3.6 cleared the original break-test (full FR transcript analysis) at score 5.
- Domain reasoning is non-monotonic with size. On the garage lift question, Gemma 26B (smaller) said 4-post (wrong), Gemma 31B (larger, same family) said 2-post (right). Don't assume bigger = better; benchmark on YOUR tasks.
- French is solved on local. All five working local models extracted “les suivis” / “paparasse” correctly from a messy ASR transcript. Three correctly inferred the speaker's profession (real-estate broker) from indirect signals.
- Sonnet 4.6 only justifies cloud cost on judgment-heavy tasks. Cited Quebec Law 25 / Bill 64 by name on the consulting question. For drafting, extraction, code review, math — local models are at parity, run them.
Software stack — the same for all setups
Hardware is half the story. Here's the open-source stack that runs on every setup above. All free unless noted.
- LM Studio (free) — runs the models, manages downloads, provides an OpenAI-compatible API at localhost:1234
- Cherry Studio or LibreChat (free) — chat interface; LibreChat is multi-user and self-hostable, Cherry is single-user and slick
- Tailscale (free for ≤3 users) — secure remote access to your home/office AI from any device
- Vaultwarden / 1Password / Apple Keychain — credential storage for any API keys you use
- Optional: small Proxmox server ($800-1,500 used hardware) for always-on background tasks separate from the main AI box
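Every chat UI in the stack above talks to LM Studio the same way — a plain OpenAI-style POST to localhost:1234. A minimal standard-library sketch; the model name is whatever you've loaded in LM Studio, and `ask_local` assumes the server is actually running:

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> dict:
    """OpenAI-compatible chat payload, as accepted at /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }


def ask_local(model: str, prompt: str, base: str = "http://localhost:1234") -> str:
    """POST the payload to a running LM Studio server and return the reply text."""
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Example (needs LM Studio running with a model loaded; model name is illustrative):
# print(ask_local("gemma-4-31b", "Summarize Law 25 obligations in 3 bullets"))
```

Because the endpoint is OpenAI-compatible, any client library or UI that can point at a custom base URL works unchanged.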
What you do not need to buy
Defensive section — things that look impressive but won't help you:
- ❌ NVIDIA H100 / A100 datacenter GPUs — $25K+ each, overkill for SMB. Useful for training, not inference at your scale.
- ❌ Full NVIDIA DGX server systems — $50K+ tier. Only useful if you're serving hundreds of concurrent users. (The desktop-class DGX Spark in Setup #8 is a different product.)
- ❌ "AI PCs" with Copilot+ branding — marketing label, not capability. The Microsoft NPU isn't compatible with the open-source LLM stack used on this page.
- ❌ Subscription "private AI" SaaS — defeats the purpose. You're back to trusting a third party with your data, just with a different bill.
- ❌ Used Mac Pro towers (Intel-era) — no Apple Silicon means no MLX acceleration. They'll run AI, but slowly and inefficiently.
Choosing between setups — quick decision tree
1. Are you 1 person or a team?
→ Team (15-50 people) → Setup #7
→ Team (3-15 people) sharing one machine → Setup #5 (Studio at home)
→ Solo → continue to step 2
2. Mobile or desk-bound?
→ Mobile most of the day → Setup #1, #3, or the laptop half of Setup #5
→ Desk most of the day → Setup #2, #4, #5, or #6
3. Heavy AI use (multiple hours/day) or occasional?
→ Occasional → Setup #1 or #2 (M5 Pro is enough)
→ Heavy → Setup #4 (16" Max) or #5 (Studio combo)
4. Linux comfort + max raw speed needed?
→ Yes → Setup #6 (Intel Arc); if you want CUDA instead, Setup #8 (DGX Spark) or #9 (RTX PRO workstation)
→ No → stay with Apple
Cluster benchmarks 🟡 — for advanced/team deployments
These numbers help size large multi-user deployments. For SMB and solo use, jump back to Section 2 — the setup cards already incorporate this data.
| Mac Cluster Scaling — Exo 1.0 + RDMA over Thunderbolt 5 |
| Cluster | Model | 1 node | 4 nodes | Scaling |
| 4× M3 Ultra | DevStroll 123B dense | 9.2 t/s | 22 t/s | 2.4× |
| 4× M3 Ultra | Qwen 235B MoE | 30 t/s | 37 t/s | 1.2× |
| Framework Cluster (Setup #7 architecture) |
| Cluster | Model | Generation | Prefill | Notes |
| 4× Framework Max+ 395 | MiniMax M2 Q6 | ~13 t/s | ~152 t/s | ~20% gen drop 2→4 nodes |
| GPU Inference (Setup #6 architecture) |
| GPU | Model | BF16 c1 | BF16 c4 | AWQ c1 |
| Intel Arc Pro B70 | Qwen3-4B | 56 t/s | 194 t/s | 72 t/s |
| NVIDIA RTX Pro 4000 | Qwen3-4B | 51 t/s | 173 t/s | 89 t/s |
Cluster takeaways
- Dense models scale well across nodes (2.4× on a 4-node Mac cluster for 123B dense). Every node does useful compute via tensor parallelism.
- MoE models scale poorly (1.2× on the same hardware) — only active experts compute; the rest is idle VRAM. Pick dense if you must distribute.
- RDMA over Thunderbolt 5 (macOS Tahoe 26.2) is the unlock for Mac clustering. Without it, 10 GbE TCP makes the cluster slower than a single node.
- Intel Arc Pro B70 wins on BF16 concurrency (194 t/s at c4) for vLLM serving. NVIDIA wins on quantized AWQ (89 vs 72 t/s) — CUDA quant kernels are more mature.
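The scaling factors in the table are just 4-node throughput divided by 1-node throughput — worth recomputing when you benchmark your own cluster:

```python
def scaling_factor(single_node_tps: float, cluster_tps: float) -> float:
    """How much faster the cluster runs one query vs a single node."""
    return round(cluster_tps / single_node_tps, 1)

print(scaling_factor(9.2, 22))   # dense 123B on 4x M3 Ultra -> 2.4
print(scaling_factor(30, 37))    # MoE 235B, same cluster    -> 1.2
```

Anything near 1.0 means the extra nodes are buying you concurrency and memory capacity, not single-query speed — the MoE case above.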
Sources: Alex Ziskind benchmark videos (M5 Max + Apple cluster), Jeff Geerling RDMA benchmark.
Methodology & trust
How quality is scored
The 7-prompt eval in Section 4 used a 0-5 rubric per prompt: 0 = no answer / hard fail, 5 = correct, insightful, would ship as-is. Every output was read end-to-end by a human (Artem) and scored against the same rubric. Sonnet 4.6 served as the cloud reference at 35/35.
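The per-model percentages on the eval cards are the rubric totals normalized against the 35-point maximum and rounded:

```python
MAX_SCORE = 7 * 5  # 7 prompts, each scored 0-5

def quality_pct(total_score: int) -> int:
    """Rubric total as a percentage of the cloud reference's 35/35, rounded."""
    return round(total_score / MAX_SCORE * 100)

# Totals from the heatmap above:
for model, score in [("Gemma 4 31B", 32), ("Qwen3-Coder 30B", 29),
                     ("Llama 3.3 70B", 28), ("Qwen3.6 27B", 23)]:
    print(model, f"{quality_pct(score)}%")
```

Running this reproduces the 91% / 83% / 80% / 66% figures quoted on the setup cards and in the findings.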
How latency is measured
Local models: LM Studio's built-in token timer, captured via the OpenAI-compatible /v1/chat/completions endpoint at localhost:1234, averaged across the 7 prompts. Cloud: OpenRouter response timing for the same prompts. All on a single test bench (M5 Max 128 GB / macOS 26 / LM Studio 0.4.12).
Trust tier definitions
- 🟢 Measured — Artem ran this on this hardware. Reproducible from the public test harness on GitHub.
- 🟡 Reproduced — third-party benchmark (Alex Ziskind, Jeff Geerling, Notebookcheck), independently verified or matching multiple sources.
- ⚪ Estimated — extrapolated from architecture or vendor specs. Used sparingly and labeled.
Reproduce the numbers yourself
The full test harness (prompts, scoring rubric, runner.py) is open-source: github.com/radionokia/homelab/projects/llm-benchmark. Run it on any of the setups above, score the outputs against your own rubric, and you'll get reproducible quality + latency data for your hardware.
Who built this
Artem Kravchenko โ AI Automation Consultant + IT Director, 30 years in IT, based in Quebec City. Bilingual (English / French).
I run these setups in production: my own work, my homelab, and at client sites. Numbers on this page are what I measure, not what vendors promise. If a recommendation here costs you money you didn't need to spend, I want to hear about it — that feedback makes the next revision better.
Not sure which setup fits your situation? Free consult, no commitment.
Want to see ROI analysis, per-role breakdowns, and KPI scorecard? → Coming soon at /roi