We’ve just rolled out a powerful update for the ERNIE-4.5-21B-A3B and ERNIE-4.5-300B-A47B models that speeds up inference for long-context tasks. By integrating a new sparse attention technology, you can now process long documents and conversations much faster, with minimal impact on accuracy.
This update is currently available for the “-Paddle” versions of the ERNIE-4.5-21B-A3B and ERNIE-4.5-300B-A47B models when deployed with FastDeploy.
What You’ll Experience: A Major Performance Boost 🚀
With this update, you’ll see significant improvements in speed and efficiency. The gains are substantial across both updated models:
Performance Gains for ERNIE-4.5-21B-A3B
| Metric | Before (Full Attention) | After (Sparse Attention) | Improvement |
|---|---|---|---|
| Queries Per Second (QPS) | 0.101 | 0.150 | +48% |
| Decode Speed (token/s) | 13.32 | 18.12 | +36% |
| Time to First Token (s) | 8.082 | 5.466 | -48% |
| End-to-End Latency (s) | 61.400 | 42.157 | -46% |
Performance Gains for ERNIE-4.5-300B-A47B
| Metric | Before (Full Attention) | After (Sparse Attention) | Improvement |
|---|---|---|---|
| Queries Per Second (QPS) | 0.066 | 0.081 | +23% |
| Decode Speed (token/s) | 5.07 | 6.75 | +33% |
| Time to First Token (s) | 13.812 | 10.584 | -30% |
| End-to-End Latency (s) | 164.704 | 132.745 | -24% |
Performance was evaluated on the longbook_sum_eng subset of InfiniteBench, with a mean input length of ~113K tokens.
Introducing PLAS
This speed-up is powered by PLAS (Pluggable Lightweight Attention for Sparsity), a novel sparse attention mechanism.
Instead of the traditional attention method that compares every single token in a long text against every other token, PLAS works smarter. It divides the text into blocks and uses a small, learnable module to intelligently select only the most relevant blocks for its calculations.
The best part is its “pluggable” nature. We can add PLAS to a fully trained model without changing the original weights, ensuring that the model’s core knowledge remains intact.
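To make the block-selection idea concrete, here is a minimal PyTorch sketch of block-level top-k attention. It is only an illustration of the general pattern described above: the mean-pooled block summaries, the fixed `top_k`, and the single-query shape are simplifying assumptions for readability, not the actual PLAS module or the FastDeploy kernels.

```python
import torch
import torch.nn.functional as F


def block_sparse_attention_sketch(q, k, v, block_size=128, top_k=16):
    """Illustrative block-sparse attention (not the real PLAS kernel):
    score key blocks with a cheap summary, keep only the top-k blocks
    for this query, and attend to those blocks alone."""
    seq_len, dim = k.shape
    n_blocks = (seq_len + block_size - 1) // block_size

    # Summarize each key block. Here we mean-pool for simplicity;
    # PLAS uses a small learnable module to build block representations.
    pad = n_blocks * block_size - seq_len
    k_padded = F.pad(k, (0, 0, 0, pad))
    block_summaries = k_padded.view(n_blocks, block_size, dim).mean(dim=1)

    # Score blocks against the query and keep only the most relevant ones.
    block_scores = q @ block_summaries.T                      # (n_blocks,)
    top_blocks = block_scores.topk(min(top_k, n_blocks)).indices

    # Gather the selected blocks and run ordinary attention over them only.
    idx = torch.cat([
        torch.arange(b * block_size, min((b + 1) * block_size, seq_len))
        for b in top_blocks.tolist()
    ])
    k_sel, v_sel = k[idx], v[idx]
    attn = torch.softmax(q @ k_sel.T / dim ** 0.5, dim=-1)
    return attn @ v_sel


# Toy usage: one query attending over a 4K-token context.
q = torch.randn(64)
k = torch.randn(4096, 64)
v = torch.randn(4096, 64)
out = block_sparse_attention_sketch(q, k, v)   # shape: (64,)
```

Because only the selected blocks enter the full attention computation, the cost scales with `top_k * block_size` rather than the full context length, which is where the long-context speed-up comes from.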
Impact on Accuracy
We know that performance can’t come at the cost of accuracy. The PLAS method was specifically designed to be nearly lossless. Our evaluations on long-context benchmarks show that the difference in precision is negligible for both models.
| Model | Benchmark | Full Attention | Sparse Attention (PLAS) |
|---|---|---|---|
| ERNIE-4.5-21B-A3B | LongBenchV2 | 31.48 | 31.45 |
| ERNIE-4.5-21B-A3B | Ruler (128K) | 25.48 | 25.05 |
| ERNIE-4.5-300B-A47B | LongBenchV2 | 41.02 | 41.05 |
| ERNIE-4.5-300B-A47B | Ruler (128K) | 58.18 | 57.85 |
Evaluation results show minimal precision changes, ensuring reliable model output.
How to Get Started
If you’re using a -Paddle version of the ERNIE-4.5-21B-A3B or ERNIE-4.5-300B-A47B models with FastDeploy, enabling sparse attention is simple.
Just set the environment variable and add the PLAS configuration to your launch command.
```bash
# Set the environment variable to enable the PLAS attention backend
export FD_ATTENTION_BACKEND="PLAS_ATTN"

# Launch the API server with your model (e.g., ERNIE-4.5-300B-A47B-Paddle) and PLAS configuration
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-300B-A47B-Paddle \
    --port 8180 \
    --metrics-port 8181 \
    --quantization wint4 \
    --tensor-parallel-size 4 \
    --engine-worker-queue-port 8182 \
    --max-model-len 131072 \
    --max-num-seqs 32 \
    --max-num-batched-tokens 8192 \
    --enable-chunked-prefill \
    --plas-attention-config '{"plas_encoder_top_k_left": 50, "plas_encoder_top_k_right": 60, "plas_decoder_top_k_left": 100, "plas_decoder_top_k_right": 120}'
```
Command example from the ERNIE-4.5-300B-A47B-Paddle model card.
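Once the server is up, it exposes an OpenAI-compatible HTTP API on the port you chose (8180 in the example above). The client call below is a hypothetical sketch that follows the standard chat-completions request format; the host, prompt, and `max_tokens` value are placeholders to adapt to your deployment.

```python
import requests

# Send a chat-completions request to the FastDeploy server launched above.
# Adjust host/port and the payload to match your own deployment.
response = requests.post(
    "http://localhost:8180/v1/chat/completions",
    json={
        "model": "baidu/ERNIE-4.5-300B-A47B-Paddle",
        "messages": [
            {"role": "user", "content": "Summarize the following long report: ..."}
        ],
        "max_tokens": 512,
    },
    timeout=600,
)
print(response.json()["choices"][0]["message"]["content"])
```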
For more technical details, you can refer to the official PLAS Attention documentation.