MQAdapt explores whether Multi-Query Attention (MQA) can be adaptively applied at inference time to reduce compute and memory usage in transformer models with minimal performance drop. Motivated by the success of inference-time optimizations such as grouped-query attention (GQA) and post-training quantization (PTQ), we investigate three research questions: RQ1: Does MQA reduce inference latency? RQ2: Are some transformer layers more sensitive to MQA than others? RQ3: How can we best select layers for MQA under a given budget?
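
As background, the sketch below shows a minimal MQA self-attention block in PyTorch and why it saves memory: the key/value projections, and therefore the KV cache, hold a single head that is shared by all query heads. The module and its names are illustrative assumptions for exposition, not the paper's implementation.

```python
# Minimal sketch (assumed names/shapes) of multi-query self-attention:
# queries keep n_heads heads, but keys and values have a single shared head,
# shrinking the K/V projections and KV cache by a factor of n_heads.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MQASelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)      # per-head queries
        self.k_proj = nn.Linear(d_model, self.d_head)  # one shared key head
        self.v_proj = nn.Linear(d_model, self.d_head)  # one shared value head
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Single K/V head, broadcast (as a view, no copy) across query heads.
        k = self.k_proj(x).view(b, t, 1, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, 1, self.d_head).transpose(1, 2)
        k = k.expand(b, self.n_heads, t, self.d_head)
        v = v.expand(b, self.n_heads, t, self.d_head)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```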

Our experiments show a linear improvement in compute and memory efficiency as more layers use MQA. However, layer sensitivity varies significantly: lower layers are more critical to overall performance, and greedy strategies that apply MQA to consecutive layers lead to steep accuracy drops. Alternating the converted layers instead stabilizes performance, enabling up to 5 layers to use MQA with negligible loss. MQAdapt thus offers a practical path to inference-time efficiency gains without retraining, and is especially promising for long-context generation tasks.
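
The two budgeted selection strategies contrasted above can be summarized with a small sketch; the function names and the exact alternating pattern (converting every other layer from the top down, sparing the lower layers) are assumptions for illustration, not the precise MQAdapt procedure.

```python
# Hypothetical layer-selection strategies for a fixed MQA budget.
def consecutive_selection(n_layers: int, budget: int) -> list[int]:
    """Greedy baseline: convert the first `budget` layers in order."""
    return list(range(budget))


def alternating_selection(n_layers: int, budget: int) -> list[int]:
    """Convert every other layer starting from the top, so converted layers
    are interleaved with untouched ones and the lower layers are spared."""
    candidates = list(range(n_layers - 1, -1, -2))  # top layer, then every other
    return sorted(candidates[:budget])


if __name__ == "__main__":
    # Example: a 12-layer model with a budget of 5 MQA layers.
    print(consecutive_selection(12, 5))   # [0, 1, 2, 3, 4]
    print(alternating_selection(12, 5))   # [3, 5, 7, 9, 11]
```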