NVIDIA vs custom ASICs — what TPU, Trainium, and MTIA actually do (and why most of them target inference, not training)
Google TPU, AWS Trainium, Meta MTIA, Microsoft Maia. Four hyperscaler custom-silicon programs in flight, three of them inference-first, one (TPU) that has reached training parity at scale. The technical reason matters: training is where CUDA's moat lives; inference is where it's leakiest. This is the actual competitive landscape under the headline.
The bull case on $NVDA is "CUDA defends." The bear case is "custom ASICs eat." Both are right at different layers of the stack. Custom ASICs are eating share — at specific workloads, in specific deployment patterns, with specific economic profiles. They are not (yet) eating the part of the AI workload that drives $NVDA's order book.
Understanding which is which requires unpacking what each major custom-ASIC program actually does. Most published trader takes lump the four together as "custom silicon" and treat the category as a single competitor. The category is heterogeneous: TPU and AWS's flagship Trainium serve both training and inference, while MTIA, Maia, and AWS's Inferentia are inference-targeted. That distinction marks where the bear case has teeth and where it doesn't.
This article maps each major program, what workload it's actually deployed against, and how to read the disclosures.
The TL;DR. Training is where NVDA's CUDA moat is hardest to displace: frontier-model training needs the full CUDA library stack, the NCCL multi-GPU collectives, and the Nsight tooling for kernel optimization. Inference is materially leakier: the workload is more static, the model shape is known, and the kernels can be hand-tuned once and reused for the model's serving lifetime. The four hyperscaler ASIC programs target this difference: TPU went after training (and succeeded), the AWS and Meta programs are inference-first (with Trainium as AWS's training ramp), and Maia is inference-only.
Why training is different from inference
The first step in any custom-silicon assessment is splitting AI compute into the two regimes:
Training — large-scale, multi-week, multi-billion-dollar runs on clusters of thousands of GPUs. The workload is dynamic (loss landscape changes, hyperparameters get tuned, the kernel mix shifts as the model converges), the model architecture is evolving (new attention variants, new optimizers, new activation functions get tried weekly at frontier labs), and the failure modes are subtle (a single bit-flip in a parameter gradient can corrupt the run). The software stack must be flexible, debuggable, and extensible. This is where CUDA's library stack — cuDNN, NCCL, the kernel tooling — has been compounding for 15 years. Custom silicon for training is hard because the substrate has to keep up with a moving target.
Inference — production serving of trained models. The workload is static (the model architecture is frozen, the weights are loaded once), the kernels are known in advance (you optimize once, deploy for the model's serving lifetime), and the cost optimization target is throughput-per-watt and latency-per-token. This is where custom silicon has the strongest economic case — a fixed-architecture model running 24/7 doesn't need the full CUDA flexibility; it needs the cheapest possible silicon that runs this specific shape of computation efficiently.
The economic math:
- Training spend at a hyperscaler is ~25-30% of AI compute capex
- Inference spend is ~70-75%, growing as deployed model count grows
- Custom silicon mostly targets the larger, easier-to-attack inference pool
This is why "hyperscaler custom silicon" needs to be read at the workload level, not the program level.
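The split can be turned into a rough sizing exercise. The sketch below is illustrative only: the budget of 100 units and the 50% absorption rate are hypothetical inputs rather than disclosed figures, and the helper name `addressable_pools` is our own.

```python
def addressable_pools(ai_capex, training_frac, custom_share_of_inference):
    """Split an AI compute budget into training and inference pools and
    estimate how much of the inference pool shifts to custom silicon.

    Every input is an assumption supplied by the caller.
    """
    if not 0.0 <= training_frac <= 1.0:
        raise ValueError("training_frac must be between 0 and 1")
    if not 0.0 <= custom_share_of_inference <= 1.0:
        raise ValueError("custom_share_of_inference must be between 0 and 1")

    training = ai_capex * training_frac
    inference = ai_capex - training
    displaced = inference * custom_share_of_inference
    return {
        "training": training,
        "inference": inference,
        "displaced_inference": displaced,
        # Share of the whole budget that moves off merchant GPUs.
        "displaced_frac_of_total": displaced / ai_capex if ai_capex else 0.0,
    }


# Bracket the 25-30% training range with a hypothetical budget of 100 units
# and a hypothetical 50% custom-silicon absorption of the inference pool.
for training_frac in (0.25, 0.30):
    pools = addressable_pools(100.0, training_frac, custom_share_of_inference=0.5)
    print(training_frac, pools)
```

The point of the exercise is that the displaced fraction is always computed against the inference pool, so the training pool is untouched no matter how aggressive the absorption assumption is.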
Google TPU — the training success story
Google began TPU development in 2013, deployed TPU v1 internally in 2015, and announced it publicly in 2016; the architecture paper followed at ISCA in 2017. The system architecture is a systolic-array matmul engine with the now-iconic chip-mesh interconnect, optimized first for inference (TPU v1) and then for training and inference (v2 onward).
TPU v5e shipped in 2023 and Ironwood (TPU v7) in 2025. Across generations, the program has built:
- Training parity at scale. Frontier-model training (Gemini, and Anthropic's Claude on the Google TPU pool announced in 2024) runs on TPU at near-NVIDIA throughput. The XLA compiler + JAX combination has become a credible alternative to PyTorch + CUDA for new training projects.
- Inference dominance for Google's own workloads. Search ranking, ads click-through prediction, YouTube recommendation, all run on TPU at production scale.
- External cloud availability. TPU is sold on Google Cloud at competitive pricing vs NVIDIA instances on equivalent workloads.
Why TPU succeeded where others have struggled:
- Time. 10+ years of compounding investment in the software stack (XLA, JAX, Pathways orchestration).
- Tight model-substrate co-design. Google's research teams ship models that work because they're co-designed with TPU's strengths (large matmul tiles, hardware-supported sparsity, the chip-mesh interconnect).
- Captive internal customer. Google's own ML platform team is the first customer. They have no political reason to prefer NVIDIA over TPU. The internal pull validates the substrate before external customers see it.
The result: Google's internal NVIDIA spend, as a fraction of internal compute, is materially lower than that of the other top-4 hyperscalers. TPU is the existence proof that a hyperscaler can bring the full stack in-house. It's the canary for the customer concentration risk on NVDA.
AWS Trainium and Inferentia — the inference-first play
AWS started with Inferentia in 2019 — a pure-inference chip optimized for transformer-style models at low latency and competitive price/perf. Inferentia 2 (2023) extended the architecture. AWS uses Inferentia heavily for internal inference workloads (Alexa, Rekognition, Comprehend, the Bedrock managed LLM service) and offers Inf1/Inf2 EC2 instances to external customers.
Trainium launched in 2020 as the training counterpart. The first generation lagged NVIDIA materially on training throughput; Trainium2 (2024) closed much of the gap on specific workloads. Anthropic announced a major Trainium2 capacity commitment in 2024, the second large frontier-lab commitment to non-NVIDIA training silicon at material scale (the first being Anthropic's own TPU arrangement with Google).
The strategic positioning:
- AWS optimizes Inferentia for AWS internal workloads (high volume, fixed model shapes, latency-sensitive).
- Trainium targets external customer training workloads where the price/perf advantage is large enough to overcome the software-stack learning curve. Anthropic's commitment is the proof point.
- The Neuron SDK + AWS PyTorch integration is the software-stack effort. It's behind CUDA but ahead of every non-Google alternative.
Concentration impact on NVDA:
- AWS internal inference spend is the most NVDA-vulnerable line. Inferentia is already absorbing it.
- AWS internal training spend is partially NVDA-vulnerable depending on Trainium2 ramp and Trainium3 timing.
- AWS Cloud customer NVIDIA spend (EC2 P-series and G-series instances) is the part that grows with AI demand and is structurally protected — external customers pick NVIDIA because they've already built on CUDA.
The net effect on NVDA revenue from AWS over 2026-2028: probably flat-to-down on internal share, up on customer-cloud demand. The aggregate is workload-mix-dependent and only partially in AWS's control.
Meta MTIA — the catch-up program
Meta launched MTIA (Meta Training and Inference Accelerator) v1 in 2023, originally inference-focused. MTIA v2 (2024-2025) targeted larger-scale inference for Meta's internal ad-ranking and content-recommendation models — workloads that consume enormous compute at known model shapes, exactly the inference sweet spot.
Meta has explicitly said in earnings calls that MTIA will not (initially) replace NVIDIA for frontier-model training. The Llama series — Meta's frontier LLM family — has trained on NVIDIA H100s and B200s through 2025. MTIA's mandate is the recommendation and ad-stack workload, which is the larger share of Meta's compute spend but lower-profile externally.
The competitive threat to NVDA from MTIA:
- Near-term (2025-2026): material; Meta is plausibly absorbing 20-30% of its internal inference compute on MTIA, which directly displaces NVIDIA spend at that workload.
- Medium-term (2027-2028): unknown; depends on whether MTIA v3 reaches training parity, which Meta has not publicly committed to. The Llama 4 / Llama 5 training substrate is the canonical signal.
- Long-term (2029+): if Meta follows the Google path, MTIA becomes a structural reduction in NVDA's TAM at Meta. Whether they do depends on Meta-internal politics and capex allocation that's hard to forecast.
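The near-term bullet can be restated as a share of total AI compute. The sketch below multiplies the 20-30% MTIA absorption range by an inference share of 70-75%. Applying that general hyperscaler inference share to Meta specifically is our assumption, as is the helper name `displaced_range`.

```python
from itertools import product


def displaced_range(inference_shares, absorption_rates):
    """Return the (low, high) fraction of total AI compute that moves to
    custom silicon, given candidate inference shares and absorption rates."""
    combos = [s * a for s, a in product(inference_shares, absorption_rates)]
    return min(combos), max(combos)


# Assumed inputs: inference at 70-75% of AI compute, MTIA absorbing 20-30% of it.
low, high = displaced_range((0.70, 0.75), (0.20, 0.30))
print(f"{low:.1%} to {high:.1%} of total AI compute")
```

Under those assumptions the displaced slice lands at roughly 14% to 22.5% of total AI compute, which is material but leaves the training pool and most of the inference pool untouched.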
The hyperscaler concentration analysis frames Meta as the second-most-likely hyperscaler (after Google) to reach "majority internal on custom silicon" given the size and economic profile of their workload.
Microsoft Maia — the latest entrant
Microsoft announced Maia 100 in 2023 and shipped it in 2024-2025. The architecture is purpose-built for transformer inference at large scale, with explicit positioning around Azure's OpenAI-partner workload (ChatGPT and downstream model serving).
Maia is the youngest of the four programs and the least mature in software stack. Microsoft has not (as of mid-2026) publicly disclosed Maia capacity as a percentage of Azure AI compute. The directional read from industry sources: Maia is absorbing some Azure internal inference workload but Microsoft remains heavily NVIDIA-dependent for both training (the OpenAI partnership runs on NVIDIA Blackwell at scale) and the majority of customer inference.
Microsoft's position is structurally different from Meta's or Amazon's:
- The OpenAI partnership is a multi-year capacity commitment to specific frontier-model training and inference workloads, mostly on NVIDIA.
- Azure's customer AI workload is heavily NVIDIA-defaulted because Azure's selling proposition includes "you can train on the same substrate you've already built on" (NVIDIA + CUDA).
- Maia's path to material share runs through the Microsoft-internal workload only: Office Copilot inference, Bing AI, and Microsoft 365 AI features. That is a substantial share of Microsoft's AI compute spend, but not a majority.
Net read: Microsoft is the slowest-moving of the four hyperscalers on custom silicon, which makes Microsoft structurally NVDA-bullish for the medium term. The bear scenario for NVDA at Microsoft requires Maia v2 or v3 to ramp meaningfully in 2027-2028, and Microsoft has not publicly signaled that timeline.
Where this leaves the NVDA thesis
The four hyperscaler custom-silicon programs, sorted by competitive threat to NVDA:
| Rank | Program | Status | NVDA impact 2026-2028 |
|------|---------|--------|----------------------|
| 1 | Google TPU | Mature, training + inference | Material; Google internal NVDA spend already structurally reduced |
| 2 | AWS Trainium/Inferentia | Inference mature, training ramping | Material on inference, partial on training (Trainium2 ramp) |
| 3 | Meta MTIA | Inference scaling, training TBD | Inference share active; training neutral through 2027 |
| 4 | Microsoft Maia | Inference, early stage | Limited; OpenAI partnership keeps NVDA spend high |
Oracle has no custom-silicon program and remains structurally long NVDA across both training and inference.
A clean way to read the landscape: NVDA's TAM at the top-4 hyperscalers is bifurcating. The training share — protected by the CUDA moat — stays largely NVDA. The inference share is leakier and is actively being absorbed by custom silicon, especially at Google and AWS. Meta is mid-cycle. Microsoft is late.
The trade-relevant version. The custom-silicon bear case on NVDA is real but narrower than the headline suggests. It targets the inference slice of hyperscaler spend (~70% of dollar volume but lower margin per unit of compute). The training slice (smaller dollar volume, higher margin per unit) is structurally protected by CUDA. Net effect: NVDA gross margins may compress 200-500 bps over 2027-2029 as inference shifts to custom silicon, but the training revenue base holds. Bears expecting a step-function decline are mispositioned; bulls expecting current concentration to hold indefinitely are too. The right read is "gradual margin erosion, training revenue intact, multiple compresses 10-20%."
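To make the magnitudes concrete, here is a minimal sketch of the arithmetic behind a 200-500 bps margin compression and a 10-20% multiple compression. The 75% baseline gross margin is a hypothetical placeholder rather than a reported figure, and the function names are ours.

```python
def compressed_margin(base_margin, compression_bps):
    """Apply a gross-margin compression quoted in basis points (1 bp = 0.01%)."""
    return base_margin - compression_bps / 10_000.0


def price_change(earnings_growth, multiple_change):
    """Price = earnings x multiple, so the two changes compound multiplicatively."""
    return (1.0 + earnings_growth) * (1.0 + multiple_change) - 1.0


def breakeven_earnings_growth(multiple_change):
    """Earnings growth needed to leave the price unchanged after a multiple change."""
    return 1.0 / (1.0 + multiple_change) - 1.0


BASE_GROSS_MARGIN = 0.75  # hypothetical baseline, for illustration only

for bps in (200, 500):
    print(f"{bps} bps -> margin {compressed_margin(BASE_GROSS_MARGIN, bps):.2%}")

for multiple_change in (-0.10, -0.20):
    flat = price_change(0.0, multiple_change)
    needed = breakeven_earnings_growth(multiple_change)
    print(f"multiple {multiple_change:+.0%}: price {flat:+.1%} on flat earnings, "
          f"breakeven earnings growth {needed:.1%}")
```

The useful takeaway is the asymmetry: a 20% multiple compression needs 25% earnings growth to offset, not 20%, which is why "gradual" erosion can still cap returns even when training revenue holds.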
Three signals that would force a thesis revision
1. A second frontier lab ships a frontier-model training paper on custom silicon. Anthropic on TPU was the first (2024). If another lab publishes an equivalent result on Trainium, MTIA, or Maia by 2027, the "CUDA defends training" thesis weakens. Watch for papers covering Anthropic on Trainium2, a Llama model on MTIA, or any frontier lab on Microsoft Maia.
2. PyTorch backend telemetry shifts. The trigger is Meta-published PyTorch backend share data (rarely disclosed) showing more than 5% of PyTorch training compute on non-CUDA backends. The current estimate is under 3%.
3. NVDA's "compute" vs "networking" revenue mix. NVDA discloses the split each quarter. If the compute share starts compressing while networking grows (because customers keep buying Mellanox/InfiniBand fabric for their own ASIC clusters), that's the early signal that the customer-substrate is shifting away from NVDA GPUs while NVDA networking remains the default fabric.
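The third signal is mechanical enough to screen for. The sketch below flags a run of quarters in which the compute share of (compute + networking) revenue falls while networking revenue rises. The revenue figures are made up for illustration, and the two-quarter streak threshold is our assumption, not a rule from the disclosures.

```python
def compute_share(compute_rev, networking_rev):
    """Compute revenue as a fraction of compute + networking revenue."""
    total = compute_rev + networking_rev
    if total <= 0:
        raise ValueError("total revenue must be positive")
    return compute_rev / total


def mix_shift_signal(quarters, min_streak=2):
    """Return True if, over the last `min_streak` quarter-over-quarter steps,
    compute share fell AND networking revenue grew in every step.

    `quarters` is a chronological list of (compute_rev, networking_rev) tuples.
    """
    if len(quarters) < min_streak + 1:
        return False  # not enough history to evaluate the streak
    recent = quarters[-(min_streak + 1):]
    for (prev_c, prev_n), (cur_c, cur_n) in zip(recent, recent[1:]):
        share_fell = compute_share(cur_c, cur_n) < compute_share(prev_c, prev_n)
        networking_grew = cur_n > prev_n
        if not (share_fell and networking_grew):
            return False
    return True


# Hypothetical quarterly figures (arbitrary units), oldest first.
shifting = [(30.0, 3.0), (32.0, 4.0), (33.0, 5.5)]
steady = [(30.0, 3.0), (35.0, 3.2), (40.0, 3.4)]

print(mix_shift_signal(shifting))
print(mix_shift_signal(steady))
```

Requiring both conditions matters: compute share can also dip because compute revenue itself stumbles, and that is a different story from customers buying NVDA fabric for their own ASIC clusters.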
Bottom line
Custom ASICs are real competitors to NVDA, but the competitive surface is inference-first for three of the four programs. Training — the most CUDA-dependent workload — remains structurally NVDA. TPU is the exception that proves the rule: Google has demonstrated the migration is possible given a decade of compounding software investment, and the implicit playbook is being followed at varying depths by AWS, Meta, and Microsoft.
The thesis read: long NVDA on training revenue and the CUDA moat; bearish on margin trajectory as inference shifts to custom silicon at the top-4 hyperscalers. Net: a multi-year multiple compression rather than a step-function decline. The supply-side constraint caps the ramp regardless. The customer concentration is the variable that determines where in that band the stock prints.
NVDA dashboard on QuantAbundancia — thesis panel with current marks.
The CUDA moat — the software ecosystem that defends training but leaks on inference.
NVIDIA's HBM bottleneck — the supply-side ceiling that gates the ramp regardless of customer composition.
Hyperscaler customer concentration — the demand-side concentration risk that custom ASICs are designed to exploit.