AI Hardware Buyer's Guide
Run AI locally — without the guesswork
For SMBs, professionals, and IT teams choosing hardware to run private AI on their own infrastructure
Trust legend:
🟢 measured (this hardware) ·
🟡 reproduced (third-party benchmark, verified) ·
⚪ estimated (extrapolated from architecture)
Last updated 2026-05-04 14:44 · Pricing snapshot: 2026-05-03 (CAD, Apple Canada / Canada Computers / frame.work). Verify before ordering.
Read this first — the 30-second decision
You want local AI if any of these are true:
- You handle client data, employee records, financials, or anything regulated (Quebec Law 25, HIPAA, PIPEDA)
- Your legal/compliance team has said "we can't use ChatGPT for that"
- You want predictable monthly cost instead of a per-token bill that grows with usage
- You want AI to keep working when the internet doesn't, or when the SaaS is down
You probably don't need this if:
You're a sole user with no privacy constraints and Claude Pro / ChatGPT Plus already covers your needs. The cheapest setup below is ~$5,400 CAD — that's 30+ months of a $140/month cloud subscription.
What changes once you have local AI:
Clean French/English drafting, document Q&A, code review, transcript analysis, basic agent workflows — all of it private, all of it instant, all of it included once you've bought the hardware.
Sizing for a 3-4 year purchase — storage, RAM, chip
A laptop bought today should still be the right tool in 2029-2030. Three axes matter: storage (how much disk to buy), RAM (the real ceiling on what models you can run), and chip class (memory bandwidth = token generation speed). Buy too small and you'll feel pain by year 2; buy too big and you waste money on capacity you won't use. Here's how to think about it.
💾 Storage: 2 TB is the right pick for most
Realistic 4-year laptop disk budget assuming you have a home server or NAS for media + archive:
| Category | Year-4 budget |
| macOS + apps | ~250 GB |
| Personal docs + mail archive | ~350 GB |
| Active dev workspace | ~150 GB |
| Local model library (laptop-runnable subset) | ~200 GB |
| HF cache + temp | ~60 GB |
| Buffer (snapshots, OS upgrades, growth) | ~400 GB |
| Year-4 total | ~1.4 TB |
Verdict: 2 TB leaves ~600 GB free at year 4. Comfortable, not luxurious.
Upgrade to 4 TB if: you record/edit video locally, you run multiple Parallels/Docker VMs, you don't have a home server (laptop = whole stack), or you want zero "manage disk space" thinking. Cost: +$1,400 CAD.
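The budget above is simple arithmetic; a few lines of Python make it easy to re-run with your own numbers. The line items below are the table's estimates, not measurements — swap in your own:

```python
# Year-4 disk budget (GB), taken from the table above; replace with your own estimates.
budget_gb = {
    "macOS + apps": 250,
    "personal docs + mail archive": 350,
    "active dev workspace": 150,
    "local model library": 200,
    "HF cache + temp": 60,
    "buffer (snapshots, upgrades, growth)": 400,
}

total_gb = sum(budget_gb.values())   # ~1.4 TB at year 4
free_on_2tb = 2000 - total_gb        # headroom left on a 2 TB drive

print(f"year-4 total: {total_gb} GB, free on 2 TB: {free_on_2tb} GB")
```

With these estimates the total lands at ~1.41 TB, leaving roughly 600 GB free on a 2 TB drive — the "comfortable, not luxurious" verdict above.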
🧠 RAM: 128 GB is the 4-year-safe floor
RAM is the real ceiling on what models you can run. Disk doesn't matter if the model can't fit in memory.
| Year | Mainstream "good" model | Resident size in RAM |
| 2026 (now) | Gemma 4 31B 8-bit | 31 GB |
| 2027 | 40-50B class quantized | ~40 GB |
| 2028 | 60-80B class | ~60 GB |
| 2029 | 100B class | ~70 GB |
Verdict: 64 GB starts feeling tight by year 2-3: model sizes grow, and you also need RAM for the OS + apps + active work. 128 GB is the only laptop config that's safe through 2029-2030.
Bigger models (200B+ dense, 1T+ MoE) won't run on any laptop regardless of RAM — those live on home servers or clusters (Setups #5 and #7).
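A quick way to sanity-check whether a model fits: resident model size plus whatever the OS, apps, and context cache need must stay under total RAM. The ~28 GB overhead below is an illustrative assumption, not a measured number — tune it to your own workload:

```python
def fits_in_ram(model_gb: float, ram_gb: int, overhead_gb: float = 28.0) -> bool:
    """True if the model can stay resident alongside OS + apps + KV cache.

    overhead_gb is an assumed allowance for macOS, apps, and context cache
    (not a measured figure) -- adjust for your workload.
    """
    return model_gb + overhead_gb <= ram_gb

print(fits_in_ram(40, 64))    # Llama 70B 4-bit (~40 GB) on 64 GB -> False
print(fits_in_ram(40, 128))   # same model on 128 GB -> True
```

This is the arithmetic behind the Setup #1 trade-off note that a ~40 GB model "won't fit alongside macOS + apps" on 64 GB.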
⚡ Chip: M5 Pro hits a wall sooner than M5 Max
Memory bandwidth = token generation speed. The Pro / Max gap widens as model sizes grow because bigger models = more memory to read per token.
| Chip | Bandwidth | 2026 30B | 2029 100B |
| M5 Pro | 307 GB/s | ~30 t/s | ~5 t/s |
| M5 Max | 614 GB/s | ~60 t/s | ~12 t/s |
| M5 Ultra | ~820 GB/s | ~80 t/s | ~18 t/s |
Verdict: M5 Pro is fine for 2026 30B-class daily use. M5 Max stays usable on 100B-class models in 2029. M5 Pro at 100B class would be painful (5 t/s = ~3 minutes for a typical answer).
Estimates above assume 4-bit quantization. Lower quants (Q3, MXFP4) extend the runway by ~30%.
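The bandwidth-bound rule of thumb behind these estimates: a dense model streams all of its weights from memory for every generated token, so bandwidth divided by resident model size gives a theoretical ceiling on tokens per second. Real throughput lands below this ceiling, and MoE models (which read only active-expert weights per token) land above it — the 2:1 Pro/Max ratio is what carries over. The 17 GB figure is an assumed resident size for a 30B model at 4-bit:

```python
def tps_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/sec for a dense model: every token reads all weights once."""
    return bandwidth_gb_s / model_gb

# Assumption: a 30B dense model at 4-bit quant is roughly 17 GB resident.
print(round(tps_ceiling(307, 17), 1))   # M5 Pro  -> 18.1
print(round(tps_ceiling(614, 17), 1))   # M5 Max  -> 36.1 (2x the Pro, as in the table)
```

Note the ceiling scales inversely with model size — which is exactly why the Pro/Max gap widens as you move toward 100B-class models.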
The 3-4 year safe pick: 128 GB RAM + M5 Max + 2 TB SSD (Setup #3 or #4) is the only laptop config that won't feel undersized by 2029. If your AI use is light and you're OK refreshing in 2-3 years, M5 Pro 64 GB (Setup #1 or #2) is fine. Don't pay for 4 TB unless you fit one of the upgrade-if cases above — that money is better spent on the home server (Setup #5).
The shopping list — 9 recommended setups
Each setup is a complete order: machine + accessories + tax + total. Pick the one that matches your situation, hand the list to your IT person, or order it yourself. Prices are CAD, snapshot 2026-05-03, including QC sales tax (TPS+TVQ 14.975%).
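Every turnkey total below follows the same computation — subtotal times (1 + 0.14975) for combined TPS+TVQ, rounded to the dollar. A sketch you can reuse when you swap line items in any of the shopping lists:

```python
QC_TAX = 0.14975  # TPS + TVQ combined, Quebec

def total_with_qc_tax(subtotal_cad: float) -> int:
    """Turnkey total in CAD, rounded to the nearest dollar."""
    return round(subtotal_cad * (1 + QC_TAX))

# Cross-check against the setup cards below:
print(total_with_qc_tax(12_459))   # Setup #6 -> 14325
print(total_with_qc_tax(21_358))   # Setup #7 -> 24556
print(total_with_qc_tax(13_360))   # Setup #8 -> 15361
```

If you substitute a line item (different SSD, a second GPU), recompute the subtotal and run it through the same function before quoting a budget.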
Setup #1
14" MacBook Pro M5 Pro · 64 GB · 2 TB
Total turnkey (CAD, incl. QC tax)
$5,344
Pick this if: Solo mobile professional. Lawyer, accountant, agent, consultant. Light-to-medium AI use a few times a day. You travel.
Typical response
~3-8 sec
Privacy tier
all-tier (offline OK)
Shopping list
Lead time: In stock at Apple Canada
Setup time: Half-day (LM Studio + 2 models + chat UI)
Honest trade-offs
- 307 GB/s memory bandwidth — about 2× the M5 base, but half of M5 Max. Token generation scales nearly linearly with this number (see callout below).
- 64 GB RAM caps you at 30B-class models comfortably. Llama 70B 4-bit (~40 GB) won't fit alongside macOS + apps.
- Battery 16+ hours on non-AI work, ~6-8 hours when running an LLM continuously.
Setup #2
16" MacBook Pro M5 Pro · 64 GB · 2 TB
Total turnkey (CAD, incl. QC tax)
$6,091
Pick this if: Solo desk-first professional. Want a bigger screen built-in, occasional travel. Same AI capability as Setup #1.
Typical response
~3-8 sec
Privacy tier
all-tier (offline OK)
Shopping list
Lead time: In stock at Apple Canada
Setup time: Half-day
Honest trade-offs
- Larger 16.2" screen + better sustained thermals than 14" — but 700g heavier (2.16 kg vs 1.55 kg).
- Same M5 Pro chip as Setup #1 — quality and AI speed identical. You're paying for screen + thermal headroom.
- Better cooling matters less here than in the Max chassis (the Pro chip stays within thermal budget anyway).
Setup #3
14" MacBook Pro M5 Max · 128 GB · 2 TB
Total turnkey (CAD, incl. QC tax)
$7,471
Pick this if: Solo road-warrior who needs Max performance in a smaller body. You travel a lot, you want to run every model on this page, and you accept the thermal compromise of the 14" chassis.
Typical response
~2-5 sec (bursts)
Privacy tier
all-tier (offline OK)
Shopping list
Lead time: Build-to-order, 1-2 weeks
Setup time: Half-day
Honest trade-offs
- 614 GB/s bandwidth + 128 GB unified RAM — runs every model on this page, including Llama 3.3 70B (~40 GB) and Qwen3.6 27B reasoning (~35 GB), with room to spare.
- 14" Max throttles 15-25% under sustained load (Notebookcheck, Digital Trends measurements). Spikes to 96 W, then drops to 42 W. Fine for bursts, painful for all-day inference.
- SSD reaches 100°C under sustained AI workloads in this chassis. Apple's cooling can't keep up with the Max chip in the 14" form factor.
- Same chip as Setup #4 (16" Max 128 GB) but 700g lighter and more portable. If you grind on AI for hours daily, go 16". If your AI use is bursty + you live on the road, this wins.
Setup #4
16" MacBook Pro M5 Max · 128 GB · 2 TB
Total turnkey (CAD, incl. QC tax)
$7,988
Pick this if: Solo cockpit. Heavy AI all day. You want every model on this page to run, no thermal compromise, no swap, no waiting.
Typical response
~2-5 sec
Concurrent users
1 user (or 2-3 light)
Privacy tier
all-tier (offline OK)
Shopping list
Lead time: Build-to-order, 2-3 weeks
Setup time: 1 day (LM Studio + 4-5 models + Cherry agents + Tailscale)
Honest trade-offs
- This is the chip the measurements on this page were taken on (the test bench was the 14" Max 128 GB / 4 TB; the 16" body is the same chip with proper thermals, so performance numbers transfer identically or improve).
- 16" body sustains 62 W indefinitely vs 14"'s 42 W after throttle. 15-25% faster on long workloads.
- Heavier (2.16 kg) and bigger footprint. If you live on planes and coffee shops, Setup #3 is more portable.
- 128 GB unified memory means zero swap even with multiple models loaded + active dev work.
Setup #5
Power user combo: 14" Pro 64 GB laptop + Mac Studio M5 Ultra at home
Total turnkey (CAD, incl. QC tax)
$15,406
Pick this if: Heavy AI user with stable home internet (>99% uptime). You want Studio-grade quality at desk + laptop autonomy on the road. Best long-term value if you can wait for Studio M5 Ultra (rumored June or Oct 2026).
AI quality vs cloud
91% laptop · 95%+ on Studio (70B-class models)
Typical response
~3-8 sec laptop · ~1-3 sec on Studio
Concurrent users
1-3 users on Studio
Privacy tier
all-tier (Tailscale-private remote access)
Shopping list
Lead time: Studio M5 Ultra: not yet released (est. June or Oct 2026). M3 Ultra Studio available now as interim.
Setup time: 1-2 days (Studio + Tailscale mesh + LM Studio on both + remote access)
Honest trade-offs
- Architecture: laptop = thin client when home (queries Studio over Tailscale, sub-ms LAN latency), full local stack when offline. Best of both worlds.
- Studio runs 70B-class models at full bandwidth (820 GB/s on M5 Ultra). Llama 3.3 70B, Qwen3.6 35B, Gemma 4 31B all fast enough for real-time chat.
- Laptop sized for autonomy, not for being the primary AI rig — 64 GB Pro is plenty for the rare offline use case (Gemma 4 26B 8-bit fits comfortably).
- M5 Ultra Studio is rumored, not released. If you need this today, substitute M3 Ultra Mac Studio (~$5,500-9,000 CAD depending on RAM) and upgrade later.
Setup #6
Linux GPU workstation — 4× Intel Arc Pro B70 + Threadripper
Total turnkey (CAD, incl. QC tax)
$14,325
Pick this if: Maximum tokens-per-dollar. You're comfortable with Linux + vLLM serving stack. You don't need Mac apps. Best raw inference speed in this guide.
Typical response
~1-2 sec (vLLM, BF16 concurrency)
Concurrent users
3-5 users
Privacy tier
all-tier (self-hosted)
Shopping list
| Item | Price (CAD) |
| 4× Intel Arc Pro B70 32 GB GDDR6 @ $1,300 CAD ea (128 GB total VRAM) — Canada Computers | $5,200 |
| AMD Ryzen Threadripper 7960X (24-core / 48-thread, sTR5) — verified $2,199 CAD, "Available to Order" — Canada Computers | $2,199 |
| GIGABYTE TRX50 AERO D motherboard (sTR5, EATX, 4× M.2, dual 10 GbE) — verified $889.99 CAD, "Available to Order" — Canada Computers | $890 |
| 128 GB DDR5-5600 UDIMM non-ECC (Kingston Fury Beast 2× 64 GB) — verified $2,619.99 CAD, in stock. ECC RDIMM is sold out nationally + 2-3× more expensive — see notes. — Canada Computers | $2,620 |
| Noctua NH-U14S TR5-SP6 CPU air cooler (Threadripper sTR5) — Amazon.ca | $150 |
| 4 TB NVMe Gen5 SSD (Samsung 990 EVO Plus or WD Black SN850X) — Amazon.ca / Newegg.ca | $600 |
| 1500-1600 W Platinum PSU (Corsair AX1500i or Seasonic Prime TX-1600) — Newegg.ca | $500 |
| Full-tower chassis with 4-GPU support (Fractal Design Define 7 XL or Phanteks Enthoo Pro 2) — Amazon.ca / Newegg.ca | $300 |
| Ubuntu 24.04 + vLLM (free) — ubuntu.com | $0 |
| Subtotal | $12,459 |
| QC sales tax (14.975%) | +$1,866 |
| TOTAL | $14,325 |
Lead time: GPUs in stock at Canada Computers. CPU/MB/RAM 1-2 weeks. Total assembly 1-2 days.
Setup time: 1-2 days (Linux + Intel oneAPI drivers + vLLM + model serving)
Honest trade-offs
- 128 GB total VRAM across 4 cards (32 GB each at 600 GB/s per card vs 614 GB/s shared on Mac M5 Max).
- ~800 W total power draw under load (4× 250 W GPUs + 350 W CPU + overhead). Verify circuit + cooling before ordering.
- Non-ECC UDIMM choice: spec uses Kingston Fury Beast 128 GB (consumer DDR5, $2,620 in stock). ECC RDIMM (Kingston Renegade Pro 128 GB) was $5,850 + sold out nationally as of 2026-05-04. ECC matters for training; for inference workloads silent bit-flips produce wrong tokens, not data corruption — non-ECC is acceptable. If your grant or policy requires ECC, swap the line and budget +$3,000-4,000 + lead time.
- Audible under load — multiple GPU fans + tower cooler. Not a coffee-shop or quiet-office machine.
- Intel oneAPI / IPEX-LLM ecosystem is maturing — fewer tutorials than NVIDIA CUDA, so expect more Stack Overflow time on driver/quant issues.
- vLLM serving makes it shine for multi-user concurrent inference. Single-user is overkill — Setup #4 is faster for one person.
- Best raw tokens-per-dollar in this guide for serving (BF16 c4: 194 t/s on Qwen3-4B per Intel benchmarks).
Setup #7
Framework Cluster — 4× Mainboards in a 10" rack
Total turnkey (CAD, incl. QC tax)
$24,556
Pick this if: Team of 15-50 people sharing private AI. You have rack space and someone comfortable with Linux ops + DIY builds. Scale by adding more nodes.
Typical response
~5-8 sec
Concurrent users
10-20 users
Privacy tier
all-tier (self-hosted, multi-user)
Shopping list
| Item | Price (CAD) |
| 4× Framework Desktop Mainboard (Ryzen AI Max+ 395, 128 GB LPDDR5x soldered, Mini-ITX) @ $3,769 CAD ea — frame.work CA | $15,076 |
| 4× Framework 400W FlexATX PSU @ $209 CAD ea — frame.work CA | $836 |
| 4× Framework Desktop CPU Fan Kit (Noctua) @ $55 CAD ea — frame.work CA | $220 |
| 4× 2 TB NVMe Gen4 SSD (M.2 2280, WD Black SN850X or equivalent) @ ~$250 CAD ea — Amazon.ca | $1,000 |
| MikroTik CRS812-DDQ switch (2× 400G QSFP56-DD + 2× 200G QSFP56 + 8× 50G SFP56 + 2× 10G) — MikroTik Canada Store | $1,815 |
| MikroTik DQ+BC0003-DS+ breakout cable (1× 200G QSFP56 → 4× 50G, 3m) — feeds all 4 nodes from one switch port — MikroTik Canada Store | $111 |
| 4× 50 GbE NICs (PCIe x4) — each node uplink — Amazon.ca / Newegg.ca | $1,200 |
| UPS 2200VA rack-mount (covers ~600 W cluster + switch) — Amazon.ca | $1,100 |
| Ubuntu 24.04 + LibreChat + LiteLLM router (free) — librechat.ai | $0 |
| Subtotal | $21,358 |
| QC sales tax (14.975%) | +$3,198 |
| TOTAL | $24,556 |
Lead time: Framework mainboards + PSUs: 4-8 weeks build-to-order. Switch + optics: 1-2 weeks distributor. Chassis: in stock at MyElectronics or similar.
Setup time: 5-7 days (mechanical assembly per node + switch config + cluster networking + LibreChat + per-user accounts)
Honest trade-offs
- This is a DIY build. Mainboards arrive bare — you supply the chassis, mount the PSU + cooler, install the SSD, and cable it. Plan for one full day per node.
- 512 GB total memory across 4 nodes at ~256 GB/s per node. Scales by node count for concurrent users, not for single-query speed.
- ~20% generation drop scaling 2 → 4 nodes (Alex Ziskind benchmarks) — coordination overhead is real but manageable.
- ~600 W total cluster power draw at sustained inference. The UPS line-item is sized for this; verify your room circuit can handle it.
- Memory is soldered (LPDDR5x) — no future RAM upgrade per node. Locked in at 128 GB.
- Linux-only. No Apple ecosystem. Your IT person will own this.
- Best price-per-concurrent-user in this guide. Cheaper per seat than buying everyone a MacBook.
Setup #8
NVIDIA DGX Spark — 1- or 2-node desktop AI supercomputer
Total turnkey (CAD, incl. QC tax)
$15,361
Pick this if: You want NVIDIA-native CUDA tooling (PyTorch, vLLM, TensorRT-LLM) on a desktop. 1 node for solo use, 2-node direct-cabled for fine-tuning + serving up to 405B models.
AI quality vs cloud
90% (1 node) / 92%+ (2 nodes)
Typical response
~2-4 sec (1 node) / ~1-3 sec (2 nodes)
Concurrent users
1-3 users (1 node) / 5-10 users (2 nodes)
Privacy tier
all-tier (self-hosted)
Shopping list
| Item | Price (CAD) |
| 1× NVIDIA DGX Spark Founders Edition (GB10 Grace Blackwell, 128 GB unified, 4 TB NVMe) @ $4,699 USD — NVIDIA Marketplace / Best Buy / Newegg | $6,580 |
| +1× DGX Spark Founders Edition (for 2-node config — optional) @ $4,699 USD — NVIDIA Marketplace / Best Buy / Newegg | $6,580 |
| 0.5 m 200G QSFP56 DAC cable (direct node-to-node, no switch needed for 2 nodes) — Amazon.ca / NADDOD | $200 |
| Ubuntu DGX OS (preinstalled) + NVIDIA AI software stack (free) — NVIDIA NGC | $0 |
| Subtotal | $13,360 |
| QC sales tax (14.975%) | +$2,001 |
| TOTAL | $15,361 |
Lead time: NVIDIA Marketplace: 2-4 weeks. Best Buy: in-stock varies. DAC cable: in-stock Amazon.ca.
Setup time: Half-day (1 node) or 1 day (2 nodes incl. RoCE config + NCCL test)
Honest trade-offs
- 1 node = $6,580 CAD. 2 nodes (the configuration shown above) = ~$13,360 CAD before tax. Drop the second node line for solo use.
- Native CUDA — PyTorch, vLLM, TensorRT-LLM, ComfyUI, etc. just work. No oneAPI/ROCm hunting like Setup #6 (Intel Arc) requires.
- 2 nodes = direct cable, no switch. ConnectX-7 200 GbE port-to-port via DAC. A switch is only needed at 3+ nodes (separate purchase).
- Memory is unified (GB10 Grace Blackwell) — 128 GB per node, ~273 GB/s bandwidth. Per-node bandwidth is lower than Mac M5 Max (614 GB/s) — Spark wins on CUDA ecosystem, Mac wins on per-node speed.
- ~240 W per node; two nodes ≈ 480 W. Quiet, office-friendly desktop form factor.
- ~17 GB/s NCCL throughput over 200 GbE (real-world, per NVIDIA forum) — fine for inference, the bottleneck for training on larger-than-2-node setups.
- NVIDIA pricing is volatile — Founders Edition jumped 18% in 2025 due to memory shortages. Check current pricing at order time.
- Best fit for: AI researchers, devs who'll exploit CUDA libraries, teams with PyTorch fine-tuning workflows.
Setup #9
NVIDIA RTX PRO Blackwell workstation (48 / 72 / 96 GB VRAM tiers)
Total turnkey (CAD, incl. QC tax)
$15,107
Pick this if: You need single-card high-VRAM CUDA for AI inference, training, or rendering. Pick the VRAM tier that matches your largest expected model.
AI quality vs cloud
92% (48 GB) / 94% (72 GB) / 95%+ (96 GB)
Typical response
<1-2 sec (CUDA + TensorRT-LLM optimized)
Concurrent users
1 user (intensive) / 3-5 users (moderate)
Privacy tier
all-tier (self-hosted)
Shopping list
| Item | Price (CAD) |
| GPU choice (pick ONE): RTX PRO 5000 Blackwell 48 GB GDDR7 @ $4,200 USD → $5,880 CAD — NVIDIA / Newegg / B&H | $5,880 |
| — OR — RTX PRO 5000 Blackwell 72 GB GDDR7 @ ~$6,300 USD → $8,820 CAD (price not yet listed publicly) — NVIDIA / Newegg | $0 |
| — OR — RTX PRO 6000 Blackwell 96 GB GDDR7 ECC @ $8,565 USD → $11,990 CAD — NVIDIA / PNY / Micro Center | $0 |
| AMD Ryzen Threadripper 7960X (24-core / 48-thread, sTR5) — verified $2,199 CAD, "Available to Order" — Canada Computers | $2,199 |
| GIGABYTE TRX50 AERO D motherboard (sTR5, EATX, 4× M.2, dual 10 GbE) — verified $889.99 CAD, "Available to Order" — Canada Computers | $890 |
| 128 GB DDR5-5600 UDIMM non-ECC (Kingston Fury Beast 2× 64 GB) — verified $2,619.99 CAD, in stock. ECC RDIMM is sold out nationally + 2-3× more expensive — see notes. — Canada Computers | $2,620 |
| Noctua NH-U14S TR5-SP6 CPU cooler (Threadripper sTR5) — Amazon.ca | $150 |
| 4 TB NVMe Gen5 SSD — Amazon.ca / Newegg.ca | $600 |
| 1500-1600 W Platinum PSU (single 600 W GPU + 350 W CPU needs ~1300 W headroom) — Newegg.ca | $500 |
| Full-tower chassis (Fractal Define 7 XL or similar, fits dual-slot 600 W workstation GPU) — Amazon.ca / Newegg.ca | $300 |
| Ubuntu 24.04 + NVIDIA CUDA 12 + vLLM / TensorRT-LLM (free) — ubuntu.com / nvidia.com | $0 |
| Subtotal | $13,139 |
| QC sales tax (14.975%) | +$1,968 |
| TOTAL | $15,107 |
Lead time: GPU: 2-6 weeks (RTX PRO line, restricted distribution). Other components: 1-2 weeks.
Setup time: 1-2 days (Linux + CUDA drivers + vLLM/TensorRT-LLM model serving)
Honest trade-offs
- Pricing on the GPU lines: totals shown only count the 48 GB option. The 72 GB and 96 GB lines display $0 to avoid double-counting — substitute the line you actually want when ordering. With the non-ECC RAM build: 72 GB GPU build total ≈ $15,800 CAD pre-tax / $18,200 CAD with QC tax; 96 GB GPU build total ≈ $18,950 CAD pre-tax / $21,790 CAD with QC tax.
- 96 GB GDDR7 ECC is the most VRAM you can put in a single desktop card today. Runs Llama 3 70B at 8-bit, Qwen 2.5 72B, and Mistral Large at high quants without offloading.
- CUDA-native — TensorRT-LLM gives ~2-3× faster inference vs vLLM bf16 on the same card. Industry-standard tooling.
- ~600 W per GPU under load. 1500 W PSU is the realistic minimum; 1600 W gives headroom for transient spikes.
- Non-ECC system RAM: spec uses Kingston Fury Beast 128 GB UDIMM ($2,620 in stock at Canada Computers). ECC RDIMM was $5,850 + sold out nationally as of 2026-05-04. The GPU's 96 GB GDDR7 is ECC; system RAM is non-ECC. For inference workloads this is fine. For training or grant requirements that mandate end-to-end ECC, swap to RDIMM and budget +$3,000-4,000 + lead time.
- Single-card design — for multi-GPU (2× or 4× 96 GB), you need a server platform with PLX switching or ConnectX-7 as an NVLink replacement. Beyond this guide's scope.
- Best fit for: AI fine-tuning, ML research, professional rendering, single-user max-VRAM workloads.
Why M5 Max ≠ just more RAM than M5 Pro 🟡
The biggest under-explained spec when people compare Apple chips for AI: memory bandwidth. AI inference (the part where the model actually generates each word of the answer) is bottlenecked by how fast the chip can read the model's weights from memory — not by raw CPU/GPU compute. That makes bandwidth the single most important number for token generation speed.
| M5 base (MacBook Air, base MBP) | ~150 GB/s | 1× |
| M5 Pro (Setups #1, #2, #5 laptop) | 307 GB/s | ~2× |
| M5 Max (Setups #3, #4) | 614 GB/s | ~4× |
| M5 Ultra (Setup #5 Studio, when released) | ~820 GB/s (est.) | ~5.5× |
What this means for you: on the same model, an M5 Max generates tokens roughly twice as fast as an M5 Pro. M5 Pro is fine for occasional or light AI use. M5 Max is the floor for sustained or daily-driver AI work. Don't pay Pro prices and expect Max performance — and don't pay Max prices for a 14" body that throttles back to Pro speeds anyway (see Setup #3 trade-offs).
Source: Alex Ziskind, "Apple's New M5 Max Changes the Local AI Story" (March 2026, Stream Triad memory bandwidth measurements).
The proof — real measurements 🟢
We didn't make these numbers up. The "AI quality vs cloud" % on every setup card above traces back to this eval: 7 prompts × 6 models = 42 calls, run on a real M5 Max 128 GB, scored 0-5 by hand against a clear rubric. Sonnet 4.6 (cloud) is the reference at 100%; everything else runs on-device. Setups #3-#5 use the same Apple silicon as the test bench. Setups #6-#7 use comparable hardware (Intel Arc, AMD Ryzen AI Max+) — quality scales with model size and bandwidth, not vendor branding.
Test bench: 14" MacBook Pro M5 Max 128 GB / 4 TB / macOS 26 / LM Studio 0.4.12 with MLX runtime app-mlx-generate-mac26-arm64@22. Real audio transcript from a French-speaking real-estate professional, PII-stripped to 615 words. Total OpenRouter cost for the Sonnet 4.6 baseline: under $0.50.
| Model | Quality vs cloud | Score · total time |
| Sonnet 4.6 (cloud ref) | 100% | 35/35 · 89s |
| Gemma 4 31B MLX | 91% | 32/35 · 254s |
| Gemma 4 26B-A4B 8bit | 83% | 29/35 · 93s |
| Qwen3-Coder 30B-A3B | 83% | 29/35 · 48s · fastest local |
| Llama 3.3 70B 4bit | 80% | 28/35 · 311s |
| Qwen3.6 27B MLX | 66% | 23/35 · 943s · reasoning-mode caveats |
Quality Score Heatmap (0-5 per prompt, max 35)
| Model | Reasoning 2 trains | Code review 3 bugs | FR Email relance | Garage lift 2P vs 4P | Consulting 3 questions | FR extract 1 sentence | FR analysis structured | Total |
| Sonnet 4.6 (cloud reference) | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 35 |
| Gemma 4 31B MLX (~16GB · dense) | 5 | 5 | 4 | 5 | 3 | 5 | 5 | 32 |
| Gemma 4 26B-A4B 8bit (~28GB · MoE-4B-active) | 5 | 5 | 4 | 1 | 4 | 5 | 5 | 29 |
| Qwen3-Coder 30B-A3B (~16GB · MoE-3B-active) | 5 | 5 | 5 | 1 | 4 | 5 | 4 | 29 |
| Llama 3.3 70B (~40GB · 4-bit) | 5 | 4 | 4 | 4 | 3 | 5 | 3 | 28 |
| Qwen3.6 27B MLX (~35GB · 8-bit · reasoning) | 5 | 5 | 3 | 5 | 0 | 0 | 5 | 23 |
Latency per Prompt (seconds)
| Model | Reasoning | Code rev | FR email | Garage lift | Consulting | FR extract | FR analysis | Total |
| Sonnet 4.6 | 9.2s | 8.1s | 16.5s | 11.7s | 9.9s | 3.6s | 29.9s | 88.9s |
| Qwen3-Coder 30B-A3B | 8.0s | 6.3s | 7.2s | 5.9s | 6.4s | 4.6s | 9.3s | 47.7s |
| Gemma 4 26B-A4B 8bit | 14.3s | 14.4s | 15.2s | 11.8s | 12.4s | 8.6s | 16.3s | 93.0s |
| Gemma 4 31B MLX | 48.8s | 34.2s | 48.6s | 27.8s | 25.1s | 14.2s | 55.3s | 254.0s |
| Llama 3.3 70B 4bit | 66.1s | 64.5s | 53.0s | 21.6s | 29.9s | 14.1s | 62.2s | 311.4s |
| Qwen3.6 27B MLX | 174.3s | 159.6s | 113.2s | 95.7s | 142.0s | 47.9s | 210.7s | 943.4s |
Findings — what 42 real-task calls actually showed
- Local 80-91% of cloud quality, no asterisks. Gemma 4 31B reached 91% of Sonnet 4.6 on real-world tasks. The 9% gap is judgment-and-jurisdiction-aware consulting + sharper writing instinct, not raw correctness.
- Qwen3-Coder 30B-A3B is the speed king. 48s total for the full 7-prompt suite — faster end-to-end than even Sonnet (89s). Same 83% quality as the much bigger-resident Gemma 4 26B 8bit. The MoE 3B-active architecture is real engineering.
- Llama 3.3 70B is overrated for this workload. 80% quality at 311s total — slower than Gemma 31B, lower scoring than Gemma 26B. The famous 70B is not the local default it once was; smaller modern models match or beat it on judgment tasks.
- Reasoning models (Qwen3.6) need 3-5× more output budget. Three of seven Qwen3.6 prompts truncated at finish=length because the reasoning phase ate the entire token cap. Once given enough room, Qwen3.6 cleared the original break-test (full FR transcript analysis) at score 5.
- Domain reasoning is non-monotonic with size. On the garage lift question, Gemma 26B (smaller) said 4-post (wrong), Gemma 31B (larger, same family) said 2-post (right). Don't assume bigger = better; benchmark on YOUR tasks.
- French is solved on local. All five working local models extracted “les suivis” / “paparasse” correctly from a messy ASR transcript. Three correctly inferred the speaker's profession (real-estate broker) from indirect signals.
- Sonnet 4.6 only justifies cloud cost on judgment-heavy tasks. Cited Quebec Law 25 / Bill 64 by name on the consulting question. For drafting, extraction, code review, math — local models are at parity, run them.
Software stack — the same for all setups
Hardware is half the story. Here's the open-source stack that runs on every setup above. All free unless noted.
- LM Studio (free) — runs the models, manages downloads, provides an OpenAI-compatible API at localhost:1234
- Cherry Studio or LibreChat (free) — chat interface; LibreChat is multi-user and self-hostable, Cherry is single-user and slick
- Tailscale (free for ≤3 users) — secure remote access to your home/office AI from any device
- Vaultwarden / 1Password / Apple Keychain — credential storage for any API keys you use
- Optional: small Proxmox server ($800-1,500 used hardware) for always-on background tasks separate from the main AI box
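Every chat UI in the stack above talks to LM Studio the same way — a plain OpenAI-style POST to localhost:1234. A minimal standard-library sketch; the model name is whatever you've loaded in LM Studio, and `ask_local` assumes the server is actually running:

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> dict:
    """OpenAI-compatible chat payload, as accepted at /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }


def ask_local(model: str, prompt: str, base: str = "http://localhost:1234") -> str:
    """POST the payload to a running LM Studio server and return the reply text."""
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Example (needs LM Studio running with a model loaded; model name is illustrative):
# print(ask_local("gemma-4-31b", "Summarize Law 25 obligations in 3 bullets"))
```

Because the endpoint is OpenAI-compatible, any client library or UI that can point at a custom base URL works unchanged.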
What you do not need to buy
Defensive section — things that look impressive but won't help you:
- ❌ NVIDIA H100 / A100 datacenter GPUs — $25K+ each, overkill for SMB. Useful for training, not inference at your scale.
- ❌ Full NVIDIA DGX server systems — $50K+ tier. Only useful if you're serving hundreds of concurrent users. (The desktop-class DGX Spark in Setup #8 is a different product.)
- ❌ "AI PCs" with Copilot+ branding — marketing label, not capability. The Microsoft NPU isn't compatible with the open-source LLM stack used on this page.
- ❌ Subscription "private AI" SaaS — defeats the purpose. You're back to trusting a third party with your data, just with a different bill.
- ❌ Used Mac Pro towers (Intel-era) — no Apple Silicon means no MLX acceleration. They'll run AI, but slowly and inefficiently.
Choosing between setups — quick decision tree
1. Are you 1 person or a team?
→ Team (15-50 people) → Setup #7
→ Team (3-15 people) sharing one machine → Setup #5 (Studio at home)
→ Solo → continue to step 2
2. Mobile or desk-bound?
→ Mobile most of the day → Setup #1, #3, or the laptop half of Setup #5
→ Desk most of the day → Setup #2, #4, #5, or #6
3. Heavy AI use (multiple hours/day) or occasional?
→ Occasional → Setup #1 or #2 (M5 Pro is enough)
→ Heavy → Setup #4 (16" Max) or #5 (Studio combo)
4. Linux comfort + max raw speed needed?
→ Yes → Setup #6 (Intel Arc); if you want CUDA instead, Setup #8 (DGX Spark) or #9 (RTX PRO workstation)
→ No → stay with Apple
Cluster benchmarks 🟡 — for advanced/team deployments
These numbers help size large multi-user deployments. For SMB and solo use, jump back to Section 2 — the setup cards already incorporate this data.
| Mac Cluster Scaling — Exo 1.0 + RDMA over Thunderbolt 5 |
| Cluster | Model | 1 node | 4 nodes | Scaling |
| 4× M3 Ultra | DevStroll 123B dense | 9.2 t/s | 22 t/s | 2.4× |
| 4× M3 Ultra | Qwen 235B MoE | 30 t/s | 37 t/s | 1.2× |
| Framework Cluster (Setup #7 architecture) |
| Cluster | Model | Generation | Prefill | Notes |
| 4× Framework Max+ 395 | MiniMax M2 Q6 | ~13 t/s | ~152 t/s | ~20% gen drop 2→4 nodes |
| GPU Inference (Setup #6 architecture) |
| GPU | Model | BF16 c1 | BF16 c4 | AWQ c1 |
| Intel Arc Pro B70 | Qwen3-4B | 56 t/s | 194 t/s | 72 t/s |
| NVIDIA RTX Pro 4000 | Qwen3-4B | 51 t/s | 173 t/s | 89 t/s |
Cluster takeaways
- Dense models scale well across nodes (2.4× on a 4-node Mac cluster for 123B dense). Every node does useful compute via tensor parallelism.
- MoE models scale poorly (1.2× on the same hardware) — only active experts compute; the rest is idle VRAM. Pick dense if you must distribute.
- RDMA over Thunderbolt 5 (macOS Tahoe 26.2) is the unlock for Mac clustering. Without it, 10 GbE TCP makes the cluster slower than a single node.
- Intel Arc Pro B70 wins on BF16 concurrency (194 t/s at c4) for vLLM serving. NVIDIA wins on quantized AWQ (89 vs 72 t/s) — CUDA quant kernels are more mature.
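The scaling factors in the table are just 4-node throughput divided by 1-node throughput — worth recomputing when you benchmark your own cluster:

```python
def scaling_factor(single_node_tps: float, cluster_tps: float) -> float:
    """How much faster the cluster runs one query vs a single node."""
    return round(cluster_tps / single_node_tps, 1)

print(scaling_factor(9.2, 22))   # dense 123B on 4x M3 Ultra -> 2.4
print(scaling_factor(30, 37))    # MoE 235B, same cluster    -> 1.2
```

Anything near 1.0 means the extra nodes are buying you concurrency and memory capacity, not single-query speed — the MoE case above.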
Sources: Alex Ziskind benchmark videos (M5 Max + Apple cluster), Jeff Geerling RDMA benchmark.
Methodology & trust
How quality is scored
The 7-prompt eval in Section 4 used a 0-5 rubric per prompt: 0 = no answer / hard fail, 5 = correct, insightful, would ship as-is. Every output was read end-to-end by a human (Artem) and scored against the same rubric. Sonnet 4.6 served as the cloud reference at 35/35.
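The per-model percentages on the eval cards are the rubric totals normalized against the 35-point maximum and rounded:

```python
MAX_SCORE = 7 * 5  # 7 prompts, each scored 0-5

def quality_pct(total_score: int) -> int:
    """Rubric total as a percentage of the cloud reference's 35/35, rounded."""
    return round(total_score / MAX_SCORE * 100)

# Totals from the heatmap above:
for model, score in [("Gemma 4 31B", 32), ("Qwen3-Coder 30B", 29),
                     ("Llama 3.3 70B", 28), ("Qwen3.6 27B", 23)]:
    print(model, f"{quality_pct(score)}%")
```

Running this reproduces the 91% / 83% / 80% / 66% figures quoted on the setup cards and in the findings.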
How latency is measured
Local models: LM Studio's built-in token timer, captured via the OpenAI-compatible /v1/chat/completions endpoint at localhost:1234, averaged across the 7 prompts. Cloud: OpenRouter response timing for the same prompts. All on a single test bench (M5 Max 128 GB / macOS 26 / LM Studio 0.4.12).
Trust tier definitions
- 🟢 Measured — Artem ran this on this hardware. Reproducible from the public test harness on GitHub.
- 🟡 Reproduced — third-party benchmark (Alex Ziskind, Jeff Geerling, Notebookcheck), independently verified or matching multiple sources.
- ⚪ Estimated — extrapolated from architecture or vendor specs. Used sparingly and labeled.
Reproduce the numbers yourself
The full test harness (prompts, scoring rubric, runner.py) is open-source: github.com/radionokia/homelab/projects/llm-benchmark. Run it on any of the setups above, score the outputs against your own rubric, and you'll get reproducible quality + latency data for your hardware.
Who built this
Artem Kravchenko โ AI Automation Consultant + IT Director, 30 years in IT, based in Quebec City. Bilingual (English / French).
I run these setups in production: my own work, my homelab, and at client sites. Numbers on this page are what I measure, not what vendors promise. If a recommendation here costs you money you didn't need to spend, I want to hear about it — that feedback makes the next revision better.
Not sure which setup fits your situation? Free consult, no commitment.
Want to see ROI analysis, per-role breakdowns, and KPI scorecard? → Coming soon at /roi