Energy efficiency of AI hardware: a systematic review of GPU, TPU, and NPU architectures in the LLM era


Gül F.

JOURNAL OF KING SAUD UNIVERSITY - COMPUTER AND INFORMATION SCIENCES, vol. 101, no. 1, pp. 1, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 101 Issue: 1
  • Publication Date: 2026
  • DOI: 10.1007/s44443-026-00900-6
  • Journal Name: JOURNAL OF KING SAUD UNIVERSITY - COMPUTER AND INFORMATION SCIENCES
  • Journal Indexes: Scopus, Technology Collection (ProQuest), Aerospace Database, Science Citation Index Expanded (SCI-EXPANDED), INSPEC, Directory of Open Access Journals
  • Pages: pp. 1
  • Affiliated with Recep Tayyip Erdoğan University: Yes

Abstract

The energy consumption of artificial intelligence (AI) systems has emerged as a critical sustainability challenge, yet no comprehensive systematic review compares graphics processing unit (GPU), tensor processing unit (TPU), and neural processing unit (NPU) energy efficiency in the large language model (LLM) era. This paper presents a PRISMA 2020-compliant systematic review synthesizing 62 unique studies (January 2019–March 2026) from IEEE Xplore, Scopus, and Web of Science, supplemented by a delineated grey-literature corpus of MLPerf benchmark reports and vendor technical reports. We propose a three-tier normalization framework – TFLOP/W (hardware-normalized), Tokens/J (workload-normalized), and gCO₂eq/1000 tokens (system-normalized) – to resolve the metric heterogeneity that currently precludes cross-platform synthesis. Under standardized LLM inference conditions (7B-parameter model, batch size 32, sequence length 512), Google TPU v5e achieves approximately 10.66 Tokens/J, estimated to exceed the NVIDIA H100 (6.00 Tokens/J) by approximately 78% under normalized conditions (±15–25% uncertainty). GPU platforms demonstrate superior quantization flexibility, with native FP8 support yielding approximately 1.9× efficiency gains over FP16 baselines. Cloud NPU platforms are constrained by software ecosystem immaturity, while edge NPU efficiency is systematically underestimated by SoC-level reporting. Software stack optimization contributes 15–30% efficiency variation independent of hardware selection. Five critical research gaps are identified: absence of standardized benchmarks, lack of NPU-isolated power measurement, insufficient LLM inference workload characterization, limited lifecycle embodied energy assessment, and underrepresentation of emerging inference paradigms.
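
The headline comparison in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, Tokens/J is assumed to mean generated tokens divided by energy in joules, and the conversion to gCO₂eq/1000 tokens assumes a simple multiplication by a grid carbon intensity, whose value here (400 gCO₂eq/kWh) is a made-up placeholder. The paper's own definitions may differ.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def tokens_per_joule(tokens: float, energy_joules: float) -> float:
    """Workload-normalized efficiency (assumed definition: tokens / energy)."""
    return tokens / energy_joules


def relative_gain_percent(candidate: float, baseline: float) -> float:
    """How much the candidate exceeds the baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


def gco2eq_per_1000_tokens(tokens_per_j: float, grid_gco2_per_kwh: float) -> float:
    """System-normalized metric under an assumed energy-times-intensity model."""
    joules_per_1000_tokens = 1000.0 / tokens_per_j
    kwh_per_1000_tokens = joules_per_1000_tokens / JOULES_PER_KWH
    return kwh_per_1000_tokens * grid_gco2_per_kwh


# Figures quoted in the abstract (Tokens/J under the normalized conditions).
TPU_V5E_TOKENS_PER_J = 10.66
H100_TOKENS_PER_J = 6.00

gain = relative_gain_percent(TPU_V5E_TOKENS_PER_J, H100_TOKENS_PER_J)
print(f"TPU v5e vs. H100: {gain:.1f}% more tokens per joule")  # prints 77.7%

# Hypothetical grid intensity, chosen only to show the unit conversion.
HYPOTHETICAL_GRID_GCO2_PER_KWH = 400.0
for name, tpj in [("TPU v5e", TPU_V5E_TOKENS_PER_J), ("H100", H100_TOKENS_PER_J)]:
    carbon = gco2eq_per_1000_tokens(tpj, HYPOTHETICAL_GRID_GCO2_PER_KWH)
    print(f"{name}: {carbon:.4f} gCO2eq per 1000 tokens")
```

The computed 77.7% matches the abstract's "approximately 78%". The carbon figures depend entirely on the placeholder grid intensity and should not be read as results from the review.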