AI Hardware Buyer's Guide
Run AI locally — without the guesswork
For SMBs, professionals, and IT teams choosing hardware to run private AI on their own infrastructure
Trust legend:
🟢 measured (this hardware) ·
🟡 reproduced (third-party benchmark, verified) ·
⚪ estimated (extrapolated from architecture)
Last updated 2026-05-03 18:19 · Pricing snapshot: 2026-05-03 (CAD, Apple Canada / Canada Computers / frame.work). Verify before ordering.
Read this first — the 30-second decision
You want local AI if any of these are true:
- You handle client data, employee records, financials, or anything regulated (Quebec Law 25, HIPAA, PIPEDA)
- Your legal/compliance team has said "we can't use ChatGPT for that"
- You want predictable monthly cost instead of a per-token bill that grows with usage
- You want AI to keep working when the internet doesn't, or when the SaaS is down
You probably don't need this if:
You're a sole user with no privacy constraints and Claude Pro / ChatGPT Plus already covers your needs. The cheapest setup below is ~$5,400 CAD — that's 30+ months of a $140/month cloud subscription.
What changes once you have local AI:
Clean French/English drafting, document Q&A, code review, transcript analysis, basic agent workflows — all of it private, all of it instant, all of it included once you've bought the hardware.
The shopping list — 7 recommended setups
Each setup is a complete order: machine + accessories + tax + total. Pick the one that matches your situation, hand the list to your IT person, or order it yourself. Prices are CAD, snapshot 2026-05-03, including QC sales tax (TPS+TVQ 14.975%).
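If you want to sanity-check any total against a vendor quote, the tax math is a flat multiplier. A minimal sketch; the pre-tax subtotal below is back-calculated from the Setup #1 total, not a quoted vendor price.

```python
# QC sales tax (TPS 5% + TVQ 9.975%) applied to the pre-tax subtotal.
QC_TAX = 0.14975

def turnkey_total(subtotal_cad: float) -> float:
    """Total in CAD including QC sales tax, rounded to the cent."""
    return round(subtotal_cad * (1 + QC_TAX), 2)

# Back out Setup #1's pre-tax subtotal from its $5,344 turnkey total:
print(5344 / (1 + QC_TAX))      # ~4648 CAD before tax
print(turnkey_total(4648.04))   # ~5344 CAD, matching the card
```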
Setup #1
14" MacBook Pro M5 Pro ยท 64 GB ยท 2 TB
Total turnkey (CAD, incl. QC tax)
$5,344
Pick this if: Solo mobile professional. Lawyer, accountant, agent, consultant. Light-to-medium AI use a few times a day. You travel.
Typical response
~3-8 sec
Privacy tier
all-tier (offline OK)
Shopping list
Lead time: In stock at Apple Canada
Setup time: Half-day (LM Studio + 2 models + chat UI)
Honest trade-offs
- 307 GB/s memory bandwidth — about 2× the M5 base, but half of M5 Max. Token generation scales nearly linearly with this number (see callout below).
- 64 GB RAM caps you at 30B-class models comfortably. Llama 70B 4-bit (~40 GB) won't fit alongside macOS + apps (see the sizing sketch below).
- Battery: 16+ hours on non-AI work, ~6-8 hours when running an LLM continuously.
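A rough will-it-fit check behind that 64 GB ceiling, as a sketch under two assumptions: macOS only lets the GPU wire roughly 75% of unified RAM by default (an approximation), and you should keep headroom for KV cache and context (the 12 GB figure is a guess, not a measurement).

```python
def usable_vram_gb(unified_ram_gb: float) -> float:
    """macOS caps GPU-wired memory at roughly 75% of unified RAM by default (approximation)."""
    return unified_ram_gb * 0.75

def fits_comfortably(unified_ram_gb: float, model_gb: float, kv_headroom_gb: float = 12) -> bool:
    # kv_headroom_gb is an assumed budget for KV cache + context; long contexts need more.
    return model_gb + kv_headroom_gb <= usable_vram_gb(unified_ram_gb)

print(fits_comfortably(64, 40))    # False: Llama 70B 4-bit (~40 GB) on 64 GB
print(fits_comfortably(64, 16))    # True:  30B-class 4-bit (~16 GB) fits easily
print(fits_comfortably(128, 40))   # True:  the 128 GB Max configs run it with room to spare
```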
Setup #2
16" MacBook Pro M5 Pro ยท 64 GB ยท 2 TB
Total turnkey (CAD, incl. QC tax)
$6,091
Pick this if: Solo desk-first professional. Want a bigger screen built-in, occasional travel. Same AI capability as Setup #1.
Typical response
~3-8 sec
Privacy tier
all-tier (offline OK)
Shopping list
Lead time: In stock at Apple Canada
Setup time: Half-day
Honest trade-offs
- Larger 16.2" screen + better sustained thermals than the 14" — but ~600 g heavier (2.16 kg vs 1.55 kg).
- Same M5 Pro chip as Setup #1 — quality and AI speed identical. You're paying for screen + thermal headroom.
- Better cooling matters less here than in the Max chassis (Pro chip stays within thermal budget anyway).
Setup #3
14" MacBook Pro M5 Max ยท 128 GB ยท 2 TB
Total turnkey (CAD, incl. QC tax)
$7,471
Pick this if: Solo road-warrior who needs Max performance in a smaller body. You travel a lot, you want to run every model on this page, and you accept the thermal compromise of the 14" chassis.
Typical response
~2-5 sec (bursts)
Privacy tier
all-tier (offline OK)
Shopping list
Lead time: Build-to-order, 1-2 weeks
Setup time: Half-day
Honest trade-offs
- 614 GB/s bandwidth + 128 GB unified RAM — runs every model on this page, including Llama 3.3 70B (~40 GB) and Qwen3.6 27B reasoning (~35 GB), with room to spare.
- 14" Max throttles 15-25% under sustained load (Notebookcheck, Digital Trends measurements). Spikes to 96 W, then drops to 42 W. Fine for bursts, painful for all-day inference.
- SSD reaches 100°C under sustained AI workloads in this chassis. Apple's cooling can't keep up with the Max chip in the 14" form factor.
- Same chip as Setup #4 (16" Max 128 GB) but ~600 g lighter and more portable. If you grind on AI for hours daily, go 16". If your AI use is bursty and you live on the road, this wins.
Setup #4
16" MacBook Pro M5 Max ยท 128 GB ยท 2 TB
Total turnkey (CAD, incl. QC tax)
$7,988
Pick this if: Solo cockpit. Heavy AI all day. You want every model on this page to run, no thermal compromise, no swap, no waiting.
Typical response
~2-5 sec
Concurrent users
1 user (or 2-3 light)
Privacy tier
all-tier (offline OK)
Shopping list
Lead time: Build-to-order, 2-3 weeks
Setup time: 1 day (LM Studio + 4-5 models + Cherry agents + Tailscale)
Honest trade-offs
- This is the setup class the measurements on this page were taken on (the actual test bench was the 14" Max at 128 GB / 4 TB; the 16" body is the same chip with proper thermals, so the numbers transfer identically or improve).
- 16" body sustains 62 W indefinitely vs 14"'s 42 W after throttle. 15-25% faster on long workloads.
- Heavier (2.16 kg) and bigger footprint. If you live on planes and coffee shops, Setup #3 is more portable.
- 128 GB unified memory means zero swap even with multiple models loaded + active dev work.
Setup #5
Power user combo: 14" Pro 64 GB laptop + Mac Studio M5 Ultra at home
Total turnkey (CAD, incl. QC tax)
$15,406
Pick this if: Heavy AI user with stable home internet (>99% uptime). You want Studio-grade quality at desk + laptop autonomy on the road. Best long-term value if you can wait for Studio M5 Ultra (rumored June or Oct 2026).
AI quality vs cloud
91% laptop · 95%+ on Studio (70B-class models)
Typical response
~3-8 sec laptop · ~1-3 sec on Studio
Concurrent users
1-3 users on Studio
Privacy tier
all-tier (Tailscale-private remote access)
Shopping list
Lead time: Studio M5 Ultra: not yet released (est. June or Oct 2026). M3 Ultra Studio available now as interim.
Setup time: 1-2 days (Studio + Tailscale mesh + LM Studio on both + remote access)
Honest trade-offs
- Architecture: the laptop is a thin client when home (queries the Studio over Tailscale, sub-ms LAN latency) and a full local stack when offline. Best of both worlds (see the endpoint-selection sketch after this list).
- Studio runs 70B-class models at full bandwidth (820 GB/s on M5 Ultra). Llama 3.3 70B, Qwen3.6 27B, Gemma 4 31B all fast enough for real-time chat.
- The laptop is sized for autonomy, not as the primary AI rig — 64 GB Pro is plenty for the rare offline use case (Gemma 4 26B 8-bit fits comfortably).
- M5 Ultra Studio is rumored, not released. If you need this today, substitute M3 Ultra Mac Studio (~$5,500-9,000 CAD depending on RAM) and upgrade later.
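A minimal sketch of the thin-client logic in the first trade-off: try the Studio's LM Studio endpoint over Tailscale, fall back to the laptop's own stack when offline. The MagicDNS hostname is a placeholder; both machines are assumed to run LM Studio's server on its default port 1234.

```python
import socket

STUDIO = ("studio.your-tailnet.ts.net", 1234)  # placeholder Tailscale MagicDNS name
LOCAL = ("localhost", 1234)                    # laptop's own LM Studio server

def pick_base_url(timeout_s: float = 0.5) -> str:
    """Prefer the Studio over Tailscale; fall back to the local stack when offline."""
    try:
        with socket.create_connection(STUDIO, timeout=timeout_s):
            return f"http://{STUDIO[0]}:{STUDIO[1]}/v1"
    except OSError:
        return f"http://{LOCAL[0]}:{LOCAL[1]}/v1"

# Any OpenAI-compatible client can then be pointed at the winner, e.g.:
# client = openai.OpenAI(base_url=pick_base_url(), api_key="lm-studio")
```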
Setup #6
Linux GPU workstation: 4× Intel Arc Pro B70 + Threadripper base
Total turnkey (CAD, incl. QC tax)
$11,498
Pick this if: Maximum tokens-per-dollar. You're comfortable with Linux + vLLM serving stack. You don't need Mac apps. Best raw inference speed in this guide.
Typical response
~1-2 sec (vLLM, BF16 concurrency)
Concurrent users
3-5 users
Privacy tier
all-tier (self-hosted)
Shopping list
Lead time: Components in stock; assembly 1-2 days
Setup time: 1-2 days (Linux + Intel oneAPI drivers + vLLM + model serving)
Honest trade-offs
- 128 GB total VRAM across 4 cards — same capacity as Setup #4's unified memory, but with 600 GB/s per card (vs 614 GB/s shared on the Mac).
- ~600 W under load, audible fans, needs proper cooling. Not a coffee-shop machine.
- Intel oneAPI / IPEX-LLM ecosystem is still maturing — fewer tutorials than NVIDIA CUDA, more Stack Overflow time required.
- vLLM serving makes it shine for multi-user concurrent inference (a minimal sketch follows). Single-user is overkill — Setup #4 is faster for one person.
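For flavor, here's what "vLLM serving stack" means in practice, sketched with vLLM's offline batch API. This assumes a vLLM build with Intel XPU/oneAPI support installed; the model name is just the one from the GPU benchmark table below, not a recommendation.

```python
from vllm import LLM, SamplingParams

# Shard one model across the 4 Arc Pro B70 cards via tensor parallelism.
llm = LLM(
    model="Qwen/Qwen3-4B",       # example model from the GPU benchmark table below
    tensor_parallel_size=4,      # one shard per card
    dtype="bfloat16",            # BF16 is where the B70 leads (see cluster benchmarks)
)

params = SamplingParams(max_tokens=512, temperature=0.2)
# vLLM batches requests internally; this is what "c4" concurrency means in the tables.
outputs = llm.generate(["Summarize this contract clause: ..."] * 4, params)
for out in outputs:
    print(out.outputs[0].text[:120])
```

For the actual multi-user deployment you'd run the same model behind vLLM's OpenAI-compatible server and point LibreChat (see the software stack section) at it.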
Setup #7
Team cluster: 4× Framework Desktop Max+ 395 in 10" rack
Total turnkey (CAD, incl. QC tax)
$16,349
Pick this if: Team of 15-50 people sharing private AI. You have rack space and someone comfortable with Linux ops. Scale by adding more nodes.
Typical response
~5-8 sec
Concurrent users
10-20 users
Privacy tier
all-tier (self-hosted, multi-user)
Shopping list
Lead time: Framework: 4-8 weeks build-to-order
Setup time: 3-5 days (cluster assembly + router config + LibreChat + per-user accounts)
Honest trade-offs
- 512 GB total memory across 4 nodes at ~256 GB/s per node. Scales by node count for concurrent users, not for single-query speed.
- ~20% generation drop scaling 2 → 4 nodes (Alex Ziskind benchmarks) — coordination overhead is real but manageable.
- Linux-only. No Apple ecosystem. Your IT person will own this.
- Best price-per-concurrent-user in this guide. Cheaper per seat than buying everyone a MacBook.
Why M5 Max ≠ just more RAM than M5 Pro 🟡
The biggest under-explained spec when people compare Apple chips for AI: memory bandwidth. AI inference (the part where the model actually generates each word of the answer) is bottlenecked by how fast the chip can read the model's weights from memory — not by raw CPU/GPU compute. That makes bandwidth the single most important number for token generation speed.
| Chip | Memory bandwidth | Relative |
|---|---|---|
| M5 base (MacBook Air, base MBP) | ~150 GB/s | 1× |
| M5 Pro (Setups #1, #2, #5 laptop) | 307 GB/s | ~2× |
| M5 Max (Setups #3, #4) | 614 GB/s | ~4× |
| M5 Ultra (Setup #5 Studio, when released) | ~820 GB/s (est.) | ~5.5× |
What this means for you: on the same model, an M5 Max generates tokens roughly twice as fast as an M5 Pro. M5 Pro is fine for occasional or light AI use. M5 Max is the floor for sustained or daily-driver AI work. Don't pay Pro prices and expect Max performance — and don't pay Max prices for a 14" body that throttles back to Pro speeds anyway (see Setup #3 trade-offs).
Source: Alex Ziskind, "Apple's New M5 Max Changes the Local AI Story" (March 2026, Stream Triad memory bandwidth measurements).
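Why bandwidth is the number: each generated token has to stream every active weight through the memory bus once, so the ceiling is roughly bandwidth divided by model size. A back-of-envelope sketch; the 0.7 efficiency factor is an assumption, and real throughput varies by runtime.

```python
def ceiling_tokens_per_sec(bandwidth_gb_s: float, active_weights_gb: float,
                           efficiency: float = 0.7) -> float:
    """Memory-bound upper bound: every token reads all active weights once.
    efficiency is an assumed fudge factor; real runs land below the theoretical peak."""
    return bandwidth_gb_s / active_weights_gb * efficiency

# Llama 3.3 70B 4-bit (~40 GB of weights) across the chip family:
for chip, bw in [("M5 Pro", 307), ("M5 Max", 614), ("M5 Ultra (est.)", 820)]:
    print(f"{chip}: ~{ceiling_tokens_per_sec(bw, 40):.0f} t/s")
# M5 Pro ~5, M5 Max ~11, M5 Ultra ~14. MoE models read only their active
# experts per token, which is why a 3B-active MoE feels so much faster.
```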
The proof — real measurements 🟢
We didn't make these numbers up. The "AI quality vs cloud" % on every setup card above traces back to this eval: 7 prompts × 6 models = 42 calls, run on a real M5 Max 128 GB, scored 0-5 by hand against a clear rubric. Sonnet 4.6 (cloud) is the reference at 100%; everything else runs on-device. Setups #3-#5 use the same Apple silicon as the test bench. Setups #6-#7 use comparable hardware (Intel Arc, AMD Ryzen AI Max+) โ quality scales with model size and bandwidth, not vendor branding.
Test bench: 14" MacBook Pro M5 Max 128 GB / 4 TB / macOS 26 / LM Studio 0.4.12 with MLX runtime app-mlx-generate-mac26-arm64@22. Real audio transcript from a French-speaking real-estate professional, PII-stripped to 615 words. Total OpenRouter cost for the Sonnet 4.6 baseline: under $0.50.
| Model | Quality vs cloud | Score | Total time | Notes |
|---|---|---|---|---|
| Sonnet 4.6 (cloud ref) | 100% | 35/35 | 89s | reference |
| Gemma 4 31B MLX | 91% | 32/35 | 254s | |
| Gemma 4 26B-A4B 8bit | 83% | 29/35 | 93s | |
| Qwen3-Coder 30B-A3B | 83% | 29/35 | 48s | fastest local |
| Llama 3.3 70B 4bit | 80% | 28/35 | 311s | |
| Qwen3.6 27B MLX | 66% | 23/35 | 943s | reasoning-mode caveats |
Quality Score Heatmap (0-5 per prompt, max 35)
| Model | Reasoning 2 trains | Code review 3 bugs | FR Email relance | Garage lift 2P vs 4P | Consulting 3 questions | FR extract 1 sentence | FR analysis structured | Total |
|---|---|---|---|---|---|---|---|---|
| Sonnet 4.6 (cloud reference) | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 35 |
| Gemma 4 31B MLX (~16 GB · dense) | 5 | 5 | 4 | 5 | 3 | 5 | 5 | 32 |
| Gemma 4 26B-A4B 8bit (~28 GB · MoE-4B-active) | 5 | 5 | 4 | 1 | 4 | 5 | 5 | 29 |
| Qwen3-Coder 30B-A3B (~16 GB · MoE-3B-active) | 5 | 5 | 5 | 1 | 4 | 5 | 4 | 29 |
| Llama 3.3 70B (~40 GB · 4-bit) | 5 | 4 | 4 | 4 | 3 | 5 | 3 | 28 |
| Qwen3.6 27B MLX (~35 GB · 8-bit · reasoning) | 5 | 5 | 3 | 5 | 0 | 0 | 5 | 23 |
Latency per Prompt (seconds)
| Model | Reasoning | Code rev | FR email | Garage lift | Consulting | FR extract | FR analysis | Total |
|---|---|---|---|---|---|---|---|---|
| Sonnet 4.6 | 9.2s | 8.1s | 16.5s | 11.7s | 9.9s | 3.6s | 29.9s | 88.9s |
| Qwen3-Coder 30B-A3B | 8.0s | 6.3s | 7.2s | 5.9s | 6.4s | 4.6s | 9.3s | 47.7s |
| Gemma 4 26B-A4B 8bit | 14.3s | 14.4s | 15.2s | 11.8s | 12.4s | 8.6s | 16.3s | 93.0s |
| Gemma 4 31B MLX | 48.8s | 34.2s | 48.6s | 27.8s | 25.1s | 14.2s | 55.3s | 254.0s |
| Llama 3.3 70B 4bit | 66.1s | 64.5s | 53.0s | 21.6s | 29.9s | 14.1s | 62.2s | 311.4s |
| Qwen3.6 27B MLX | 174.3s | 159.6s | 113.2s | 95.7s | 142.0s | 47.9s | 210.7s | 943.4s |
Findings — what 42 real-task calls actually showed
- Local 80-91% of cloud quality, no asterisks. Gemma 4 31B reached 91% of Sonnet 4.6 on real-world tasks. The 9% gap is judgment-and-jurisdiction-aware consulting + sharper writing instinct, not raw correctness.
- Qwen3-Coder 30B-A3B is the speed king. 48s total for the full 7-prompt suite — faster end-to-end than even Sonnet (89s). Same 83% quality as the much bigger Gemma 4 26B 8bit. The MoE 3B-active architecture is real engineering.
- Llama 3.3 70B is overrated for this workload. 80% quality at 311s total — slower than Gemma 31B, lower scoring than Gemma 26B. The famous 70B is not the local default it once was; smaller modern models match or beat it on judgment tasks.
- Reasoning models (Qwen3.6) need 3-5× more output budget. Three of seven Qwen3.6 prompts truncated at finish=length because the reasoning phase ate the entire token cap (a retry sketch follows this list). Once given enough room, Qwen3.6 cleared the original break-test (full FR transcript analysis) at score 5.
- Domain reasoning is non-monotonic with size. On the garage lift question, Gemma 26B (smaller) said 4-post (wrong), Gemma 31B (larger, same family) said 2-post (right). Don't assume bigger = better; benchmark on YOUR tasks.
- French is solved on local. Four of the five local models extracted “les suivis” / “paparasse” correctly from a messy ASR transcript (Qwen3.6's truncated run scored 0 on that prompt). Three correctly inferred the speaker's profession (real-estate broker) from indirect signals.
- Sonnet 4.6 only justifies cloud cost on judgment-heavy tasks. Cited Quebec Law 25 / Bill 64 by name on the consulting question. For drafting, extraction, code review, math — local models are at parity, run them.
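The truncation fix from the Qwen3.6 finding, sketched against LM Studio's OpenAI-compatible endpoint. The budgets and growth factor are illustrative, not measured optima, and the model name passed in is a placeholder for whatever reasoning model you load.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def ask_reasoning(model: str, prompt: str, max_tokens: int = 1024, retries: int = 2) -> str:
    """Grow the output budget when the reasoning phase eats the token cap."""
    for _ in range(retries + 1):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        choice = resp.choices[0]
        if choice.finish_reason != "length":   # finished before hitting the cap
            return choice.message.content
        max_tokens *= 4                        # reasoning models need 3-5x more room
    return choice.message.content              # best effort after the final retry
```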
Software stack — the same for all setups
Hardware is half the story. Here's the open-source stack that runs on every setup above. All free unless noted.
- LM Studio (free) — runs the models, manages downloads, provides an OpenAI-compatible API at localhost:1234 (see the smoke-test sketch after this list)
- Cherry Studio or LibreChat (free) — chat interface; LibreChat is multi-user and self-hostable, Cherry is single-user and slick
- Tailscale (free for ≤3 users) — secure remote access to your home/office AI from any device
- Vaultwarden / 1Password / Apple Keychain — credential storage for any API keys you use
- Optional: small Proxmox server ($800-1,500 used hardware) for always-on background tasks separate from the main AI box
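A smoke test for the stack once LM Studio is serving: one raw HTTP call to the endpoint in the first bullet. The model name is a placeholder for whatever you've loaded.

```python
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",   # LM Studio's OpenAI-compatible API
    json={
        "model": "gemma-4-31b",   # placeholder: use the identifier LM Studio shows
        "messages": [{"role": "user", "content": "Draft a polite follow-up email in French."}],
        "max_tokens": 400,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```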
What you do not need to buy
Defensive section — things that look impressive but won't help you:
- ❌ NVIDIA H100 / A100 datacenter GPUs — $25K+ each, overkill for SMB. Useful for training, not inference at your scale.
- ❌ NVIDIA DGX Spark — $50K+ tier. Only useful if you're serving hundreds of concurrent users.
- ❌ "AI PCs" with Copilot+ branding — marketing label, not capability. The Microsoft NPU isn't compatible with the open-source LLM stack used on this page.
- ❌ Subscription "private AI" SaaS — defeats the purpose. You're back to trusting a third party with your data, just with a different bill.
- ❌ Used Mac Pro towers (Intel-era) — no Apple Silicon means no MLX acceleration. They'll run AI, but slowly and inefficiently.
Choosing between setups — quick decision tree
1. Are you 1 person or a team?
→ Team (15-50 people) → Setup #7
→ Team (3-15 people) sharing one machine → Setup #5 (Studio at home)
→ Solo → continue to step 2
2. Mobile or desk-bound?
→ Mobile most of the day → Setup #1, #3, or the laptop half of Setup #5
→ Desk most of the day → Setup #2, #4, #5, or #6
3. Heavy AI use (multiple hours/day) or occasional?
→ Occasional → Setup #1 or #2 (M5 Pro is enough)
→ Heavy → Setup #4 (16" Max) or #5 (Studio combo)
4. Linux comfort + max raw speed needed?
→ Yes → Setup #6 (Intel Arc workstation)
→ No → stay with Apple
Cluster benchmarks 🟡 — for advanced/team deployments
These numbers help size large multi-user deployments. For SMB and solo use, jump back to Section 2 — the setup cards already incorporate this data.
Mac Cluster Scaling — Exo 1.0 + RDMA over Thunderbolt 5
| Cluster | Model | 1 node | 4 nodes | Scaling |
|---|---|---|---|---|
| 4× M3 Ultra | DevStroll 123B dense | 9.2 t/s | 22 t/s | 2.4× |
| 4× M3 Ultra | Qwen 235B MoE | 30 t/s | 37 t/s | 1.2× |

Framework Cluster (Setup #7 architecture)
| Cluster | Model | Generation | Prefill | Notes |
|---|---|---|---|---|
| 4× Framework Max+ 395 | MiniMax M2 Q6 | ~13 t/s | ~152 t/s | ~20% gen drop 2 → 4 nodes |

GPU Inference (Setup #6 architecture) — cN = N concurrent requests
| GPU | Model | BF16 c1 | BF16 c4 | AWQ c1 |
|---|---|---|---|---|
| Intel Arc Pro B70 | Qwen3-4B | 56 t/s | 194 t/s | 72 t/s |
| NVIDIA RTX Pro 4000 | Qwen3-4B | 51 t/s | 173 t/s | 89 t/s |
Cluster takeaways
- Dense models scale well across nodes (2.4× on a 4-node Mac cluster for 123B dense). Every node does useful compute via tensor parallelism (see the quick calc below).
- MoE models scale poorly (1.2× on the same hardware) — only the active experts compute; the rest is idle VRAM. Pick dense if you must distribute.
- RDMA over Thunderbolt 5 (macOS Tahoe 26.2) is the unlock for Mac clustering. Without it, 10 GbE TCP makes the cluster slower than a single node.
- Intel Arc Pro B70 wins on BF16 concurrency (194 t/s at c4) for vLLM serving. NVIDIA wins on quantized AWQ (89 vs 72 t/s) — CUDA quant kernels are more mature.
Sources: Alex Ziskind benchmark videos (M5 Max + Apple cluster), Jeff Geerling RDMA benchmark.
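One way to read the two Mac-cluster rows is per-node efficiency: speedup divided by node count. A quick check of the dense-vs-MoE gap, using the table's own numbers:

```python
def per_node_efficiency(single_node_tps: float, cluster_tps: float, nodes: int = 4) -> float:
    """1.0 means perfect linear scaling; idle hardware drags it toward 1/nodes."""
    return (cluster_tps / single_node_tps) / nodes

print(per_node_efficiency(9.2, 22))   # dense 123B: ~0.60 (every node computing)
print(per_node_efficiency(30, 37))    # MoE 235B:  ~0.31 (inactive experts idle)
```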
Methodology & trust
How quality is scored
The 7-prompt eval in Section 4 used a 0-5 rubric per prompt: 0 = no answer / hard fail, 5 = correct, insightful, would ship as-is. Every output was read end-to-end by a human (Artem) and scored against the same rubric. Sonnet 4.6 served as the cloud reference at 35/35.
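The setup-card percentages are just heatmap totals over the 35-point ceiling; a quick check:

```python
# Heatmap totals -> "AI quality vs cloud" percentages on the cards above.
totals = {"Sonnet 4.6": 35, "Gemma 4 31B": 32, "Gemma 4 26B-A4B": 29,
          "Qwen3-Coder 30B-A3B": 29, "Llama 3.3 70B": 28, "Qwen3.6 27B": 23}
for model, total in totals.items():
    print(f"{model}: {total}/35 = {total / 35:.0%}")
# 100%, 91%, 83%, 83%, 80%, 66%
```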
How latency is measured
Local models: LM Studio's built-in token timer, captured via the OpenAI-compatible /v1/chat/completions endpoint at localhost:1234, averaged across the 7 prompts. Cloud: OpenRouter response timing for the same prompts. All on a single test bench (M5 Max 128 GB / macOS 26 / LM Studio 0.4.12).
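The shape of that measurement, sketched: wall-clock around one completion call, with generation speed derived from the usage block. The helper name is illustrative, not part of the published harness.

```python
import time
import requests

def timed_chat(payload: dict) -> tuple[float, float]:
    """Returns (total seconds, generation tokens/sec) for one chat completion."""
    t0 = time.perf_counter()
    resp = requests.post("http://localhost:1234/v1/chat/completions",
                         json=payload, timeout=600)
    elapsed = time.perf_counter() - t0
    completion_tokens = resp.json()["usage"]["completion_tokens"]
    return elapsed, completion_tokens / elapsed
```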
Trust tier definitions
- 🟢 Measured — Artem ran this on this hardware. Reproducible from the public test harness on GitHub.
- 🟡 Reproduced — third-party benchmark (Alex Ziskind, Jeff Geerling, Notebookcheck), independently verified or matching multiple sources.
- ⚪ Estimated — extrapolated from architecture or vendor specs. Used sparingly and labeled.
Reproduce the numbers yourself
The full test harness (prompts, scoring rubric, runner.py) is open-source: github.com/radionokia/homelab/projects/llm-benchmark. Run it on any of the setups above, score the outputs against your own rubric, and you'll get reproducible quality + latency data for your hardware.
Who built this
Artem Kravchenko — AI Automation Consultant + IT Director, 30 years in IT, based in Quebec City. Bilingual (English / French).
I run all seven setups in production: my own work, my homelab, and at client sites. Numbers on this page are what I measure, not what vendors promise. If a recommendation here costs you money you didn't need to spend, I want to hear about it — that feedback makes the next revision better.
Not sure which setup fits your situation? Free consult, no commitment.
Want to see ROI analysis, per-role breakdowns, and KPI scorecard? → Coming soon at /roi