It sounds like a smart term. In simple words, it is about making models faster and cheaper. It involves techniques to reduce computational costs, latency, and memory usage while maintaining or improving model accuracy.
If a chatbot generates a response of 300 tokens, with each token taking 10 milliseconds to produce, the user experience will be significantly better than if each token took 100 milliseconds to generate.
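Just to make that concrete, here is the back-of-the-envelope arithmetic behind the example (my own numbers, taken straight from the sentence above):

```python
# Rough arithmetic for the example above (my own numbers, matching the text).
tokens = 300
fast_ms_per_token = 10    # 10 ms per token
slow_ms_per_token = 100   # 100 ms per token

print(f"At 10 ms/token:  {tokens * fast_ms_per_token / 1000:.0f} s total")   # 3 s
print(f"At 100 ms/token: {tokens * slow_ms_per_token / 1000:.0f} s total")   # 30 s
```

Waiting 3 seconds for a full answer versus 30 seconds is the difference between a usable chatbot and one people abandon.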
At this moment, I do not know much about inference optimization techniques. Perplexity lists techniques such as pruning, quantization, knowledge distillation, weight sharing, low-rank factorization, early exit mechanisms, deployment strategies, caching and memoization, and parallelism and batching. And probably more.
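I wanted to at least see what one of these looks like in practice, so here is a toy sketch of quantization: storing weights at lower precision to save memory. This is my own illustration with NumPy, not code from the book, and real implementations (per-channel scales, calibration, int4, and so on) are more involved:

```python
import numpy as np

# Toy sketch of symmetric int8 quantization (my own illustration, not from the book):
# store weights as 1-byte integers plus a scale, instead of 4-byte floats.
weights = np.random.randn(4, 4).astype(np.float32)

scale = np.abs(weights).max() / 127.0                   # one scale for the whole tensor
q_weights = np.round(weights / scale).astype(np.int8)   # 4x smaller in memory
dequantized = q_weights.astype(np.float32) * scale      # approximate originals at use time

print("max absolute error:", np.abs(weights - dequantized).max())
```

The point is the trade-off: the weights take a quarter of the memory, at the cost of a small approximation error.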
Sources:
AI Engineering by Chip Huyen (O'Reilly, 2025), ISBN 978-1-098-16630-4
Perplexity