The State of Local and Affordable Inference in October 2025


Introduction

After two years of GPU shortages, inflated prices, and high power consumption, things are finally looking up for local inference hardware. Prices are dropping, efficiency is improving, and multiple viable platforms now compete for the same developer and enthusiast audience: Nvidia, AMD, Intel, and Apple all have a role to play. In this post, I’ll summarize where local inference hardware stands in late 2025, highlight what’s worth buying, and discuss how cloud inference and NPUs fit into the future.

A quick note: inference can be run on anything from a six-year-old CPU to old GPUs if you keep the model and context size small. The recommendations here are aimed at higher speed, larger models, and longer contexts. Only Apple and some of the newer platforms from other vendors come as packages that run out of the box; the rest of the hardware assumes you already have a PC to put it in.
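
To make “keep the model and context size small” concrete, here is a rough back-of-envelope memory estimate. The function and the numbers in it are my own illustrative assumptions, not exact figures for any particular model.

```python
# Rough memory estimate for a quantized LLM: weights plus an fp16 KV cache.
# All numbers are illustrative assumptions, not vendor figures.

def estimate_memory_gb(params_b: float, bits_per_weight: float,
                       layers: int, context: int, kv_dim: int) -> float:
    weights_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
    # KV cache: 2 tensors (K and V) per layer, 2 bytes per fp16 value
    kv_cache_gb = 2 * layers * context * kv_dim * 2 / 1e9
    return weights_gb + kv_cache_gb

# Hypothetical 8B model, ~4.5 bits/weight, 32 layers, 8k context, 1024-wide KV
print(f"{estimate_memory_gb(8, 4.5, 32, 8192, 1024):.1f} GB")  # ~5.6 GB
```

Something in that range fits on a 12 GB card or a 16 GB Mac; push the model size or the context much higher and the requirement quickly outgrows budget hardware.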

Apple: The Best Local and Efficient AI Platform

For many developers, Apple Silicon remains the easiest and most consistent platform for local inference, particularly for small and medium models. For laptop-based inference it really cannot be beaten, and most MacBooks pair solid hardware with excellent build quality. Apple’s integration of MLX, Core ML, and Metal provides the smoothest “it just works” experience. While less flexible than custom GPU rigs, Mac systems are unmatched in polish, stability, and power efficiency for on-device AI tasks. They are also extremely efficient, often consuming an order of magnitude less power during inference. It is a nice feeling to run an LLM at full speed on battery.
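
As a minimal sketch of that “it just works” experience, this is roughly all it takes to run a small quantized model with the mlx-lm package (the model name is just an example of a 4-bit community conversion, not a recommendation):

```python
# Local inference on Apple Silicon with mlx-lm (pip install mlx-lm).
# The model name is an example 4-bit community conversion.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
response = generate(model, tokenizer,
                    prompt="Explain the benefits of on-device inference.",
                    max_tokens=200)
print(response)
```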

As of October 2025:

  • Mac Studio (M2 Max, 32 GB): went on sale for around €1150, excellent value.
  • Mac Studio (M2 Ultra, 64 GB): about €1999, extremely capable for most inference workloads.
  • Used 14” MacBook Pro (M1 Pro and M2 Pro): now regularly below €1000 on the secondhand market, sometimes with 32 GB of RAM.

Nvidia: Still Premium, Still Power Hungry

Nvidia remains the performance leader, mainly due to the momentum of CUDA, but at a steep premium in both cost and energy usage. The melting 16-pin power connector is still a serious issue, and Nvidia keeps that connector standard across most of its current lineup.

  • RTX 3090: around €600–€700. Still excellent in 2025 and a top choice on the used market. Despite being four years old, it offers strong FP16/BF16 performance and wide software compatibility, though its age is starting to show. In a few months, the 50-series Super cards may finally provide a replacement.
  • RTX 3060 12 GB: around €150–€200. The aging budget king, only now being replaced by the Intel Arc B580. Be careful to avoid the 8 GB variants.
  • RTX 5060 Ti: around €400. Decent and very power efficient, but limited by its 128-bit bus, a bottleneck for larger models.
  • RTX 5070: roughly €529. It would have been a great card, but 12 GB of VRAM holds it back. For that price, three RTX 3060 12 GB cards (36 GB of VRAM in total) might be better if your workflow supports multi-GPU setups (see the sketch after this list).
  • RTX 5090: excellent performance, but at ~€2000 it’s not for most people, and it shares the melting power connector issue.
  • RTX 6000 Pro (96 GB GDDR7): roughly €8000. A dream card for all of us, but beyond reach for typical users.
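
To make the multi-GPU option above concrete, here is a hedged sketch of spreading one model across several smaller cards using Hugging Face transformers with accelerate. The model name is only an example, and whether it actually fits depends on quantization and context length.

```python
# Splitting one model across several small GPUs (e.g. three RTX 3060 12 GB cards).
# Requires the transformers and accelerate packages; the model name is only an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"  # example model, assumes you have access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # accelerate spreads the layers across all visible GPUs
)

inputs = tokenizer("Multi-GPU inference test:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```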

There are other cards in the pipeline, but given their VRAM and prices they are not really worth it. For the first time since the 2022 crypto crash, GPU prices are trending downward, and scalping has largely disappeared. Nvidia’s DGX Spark system aims to capture the small-scale inference and edge compute markets, but reviews suggest it underwhelms relative to its cost, especially against AMD’s and Apple’s alternatives. I have written more about this in NVIDIA DGX Spark: underwhelming and late to the Party.

AMD: Getting Better with Hardware and Software

AMD’s ROCm stack has matured significantly in 2025, and most LLM software now works out of the box (a quick sanity check is sketched after the list below). The main bottleneck remains VRAM capacity; consumer cards above 16 GB are still rare.

  • RX 9070 / 9070 XT: finally dropping below €600, making them very appealing.
  • RX 7900 XT: can still be found for around €700, offering great value and solid inference throughput, though not as efficient as the newer cards.
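
The sanity check mentioned above looks the same as it would on Nvidia hardware, because ROCm builds of PyTorch reuse the torch.cuda API surface. A minimal sketch, assuming a ROCm build of PyTorch is installed:

```python
# Sanity check on a ROCm (AMD) build of PyTorch: the torch.cuda API is reused,
# so most LLM code written for Nvidia GPUs runs unchanged.
import torch

print(torch.cuda.is_available())      # True on a working ROCm install
print(torch.version.hip)              # HIP/ROCm version string (None on CUDA builds)
print(torch.cuda.get_device_name(0))  # e.g. an RX 7900 XT
```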

Meanwhile, AMD’s Strix Halo APU systems, which combine CPU, GPU, and AI cores, have outperformed Nvidia’s DGX Spark in local inference benchmarks at half to a third of the price. However, these mini-PCs (mostly from smaller Chinese OEMs) suffer from limited upgradeability and uncertain long-term quality. Larger vendors like Framework and Lenovo are stepping in with higher quality Strix Halo based systems, but prices still hover between €1500 and €2000, so they are not exactly cheap.

Intel: Playing Catch-Up

Intel’s Arc and Arc Pro series have been quietly redefining what’s possible in low-power local inference.

  • Arc B580: definitely the budget king, with 12 GB of VRAM and excellent support on Windows and Linux.
  • Arc Pro B50: around €350, runs entirely on PCIe slot power (~70 W), no external connectors needed. It’s a quiet, cool, and efficient card that’s perfect for home inference setups.
  • Arc Pro B60: coming soon in 24 GB and 48 GB configurations, expected to make Intel a serious player in midrange AI compute.

Intel’s push into efficient inference hardware suggests a future where silent, small-form inference PCs become the norm rather than the exception.

NPUs: Great for Some AI Task Acceleration, Not Yet for Inference

2025 saw an explosion of NPU (Neural Processing Unit) hardware: AMD’s Strix Point, Intel’s Lunar Lake, Qualcomm’s chips, and Apple’s M-series. According to the marketing, these “AI-ready” devices excel at on-device AI acceleration for background tasks, image enhancement, speech recognition, and small transformer models. An NPU can handle many of the tasks a GPU can, at a fraction of the power usage. However, NPU-based inference has not yet translated into meaningful performance gains for larger, general-purpose LLMs. While they’re efficient for embedded or real-time tasks, NPUs currently lack the flexibility and memory needed for heavier inference workloads.

Cloud Inference: Cheap and Convenient, But With Caveats

One of the strongest arguments against investing heavily in local hardware that will eventually become obsolete is just how cheap and accessible cloud-based inference has become. As of October 2025, you can subscribe to any of the major AI assistants for roughly €20–€23 per month, and while you could hold multiple subscriptions, realistically only one is needed for most use cases. Here are some current pricing examples:

  • ChatGPT Plus (OpenAI): €23/month
  • Claude Pro (Anthropic): €22/month
  • Google AI Pro: €22/month

That means ~€22/month gives you access to state-of-the-art models and infrastructure without having to configure GPUs or deal with drivers, cooling, and setup costs, provided all you need is “good enough” inference and you’re fine with cloud constraints. You also get a discount if you pay for a year. Current cloud pricing is likely unsustainable long term, though: as infrastructure costs rise, providers may raise prices, and 2025 already brought lower usage limits and more expensive plans for higher usage tiers. Your data is probably not going to stay private in the long run either, since many of these model providers raise privacy and data ownership concerns. A hybrid strategy often works well: use cloud inference for many tasks, especially bursty or unpredictable workloads, and maintain a modest local inference setup for consistent, privacy-sensitive, or high-control tasks.
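
One reason the hybrid strategy is painless is that local servers such as llama.cpp’s llama-server or Ollama expose the same OpenAI-compatible API as the cloud providers, so switching between them is a one-line change. A minimal sketch, with the endpoint URL and model names as illustrative assumptions:

```python
# Hybrid setup sketch: the same client talks either to the cloud or to a local
# OpenAI-compatible server (llama-server, Ollama, etc.). The URL and model
# names below are illustrative assumptions.
import os
from openai import OpenAI

def make_client(private: bool) -> tuple[OpenAI, str]:
    if private:
        # A local server keeps the data on your own machine.
        return OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed"), "local-model"
    return OpenAI(api_key=os.environ["OPENAI_API_KEY"]), "gpt-4o-mini"

client, model = make_client(private=True)
reply = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Summarize this private document."}],
)
print(reply.choices[0].message.content)
```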

If the AI Bubble Pops: A Second Revolution in Private Compute

There’s growing talk that the AI investment bubble, inflated by corporate hype and VC-backed cloud spending, might eventually collapse or cool down. If that happens, history may repeat itself: just as the crypto bust of 2022 flooded the market with affordable GPUs, an AI downturn could trigger a second hardware renaissance. Mini AI servers and inference boxes could become affordable for individual developers, accelerating the growth of private AI compute and allowing small teams and hobbyists to run large models locally, no cloud subscription required.