We’ve just rolled out a powerful update for the ERNIE-4.5-21B-A3B and ERNIE-4.5-300B-A47B models that speeds up inference for long-context tasks. By integrating a new sparse attention technology, you can now process long documents and conversations much faster, with minimal impact on accuracy.
This update is currently available for the “-Paddle” versions of the ERNIE-4.5-21B-A3B and ERNIE-4.5-300B-A47B models when deployed with FastDeploy.
What You’ll Experience: A Major Performance Boost 🚀
With this update, you’ll see significant improvements in speed and efficiency. The gains are substantial across both updated models:
Performance Gains for ERNIE-4.5-21B-A3B
| Metric | Before (Full Attention) | After (Sparse Attention) | Improvement |
|---|---|---|---|
| Queries Per Second (QPS) | 0.101 | 0.150 | +48% |
| Decode Speed (token/s) | 13.32 | 18.12 | +36% |
| Time to First Token (s) | 8.082 | 5.466 | -48% |
| End-to-End Latency (s) | 61.400 | 42.157 | -46% |
Performance Gains for ERNIE-4.5-300B-A47B
| Metric | Before (Full Attention) | After (Sparse Attention) | Improvement |
|---|---|---|---|
| Queries Per Second (QPS) | 0.066 | 0.081 | +23% |
| Decode Speed (token/s) | 5.07 | 6.75 | +33% |
| Time to First Token (s) | 13.812 | 10.584 | -30% |
| End-to-End Latency (s) | 164.704 | 132.745 | -24% |
Performance was evaluated on the longbook_sum_eng subset of InfiniteBench, with a mean input length of ~113K tokens.
Introducing PLAS
This speed-up is powered by PLAS (Pluggable Lightweight Attention for Sparsity), a novel sparse attention mechanism.
Instead of the traditional attention method that compares every single token in a long text against every other token, PLAS works smarter. It divides the text into blocks and uses a small, learnable module to intelligently select only the most relevant blocks for its calculations.
The best part is its “pluggable” nature. We can add PLAS to a fully trained model without changing the original weights, ensuring that the model’s core knowledge remains intact.
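To make the block-selection idea concrete, here is a minimal PyTorch sketch of block-level top-k attention. It is only an illustration of the general pattern described above: the mean-pooled block summaries, the fixed `top_k`, and the single-query shape are simplifying assumptions for readability, not the actual PLAS module or the FastDeploy kernels.

```python
import torch
import torch.nn.functional as F


def block_sparse_attention_sketch(q, k, v, block_size=128, top_k=16):
    """Illustrative block-sparse attention (not the real PLAS kernel):
    score key blocks with a cheap summary, keep only the top-k blocks
    for this query, and attend to those blocks alone."""
    seq_len, dim = k.shape
    n_blocks = (seq_len + block_size - 1) // block_size

    # Summarize each key block. Here we mean-pool for simplicity;
    # PLAS uses a small learnable module to build block representations.
    pad = n_blocks * block_size - seq_len
    k_padded = F.pad(k, (0, 0, 0, pad))
    block_summaries = k_padded.view(n_blocks, block_size, dim).mean(dim=1)

    # Score blocks against the query and keep only the most relevant ones.
    block_scores = q @ block_summaries.T                      # (n_blocks,)
    top_blocks = block_scores.topk(min(top_k, n_blocks)).indices

    # Gather the selected blocks and run ordinary attention over them only.
    idx = torch.cat([
        torch.arange(b * block_size, min((b + 1) * block_size, seq_len))
        for b in top_blocks.tolist()
    ])
    k_sel, v_sel = k[idx], v[idx]
    attn = torch.softmax(q @ k_sel.T / dim ** 0.5, dim=-1)
    return attn @ v_sel


# Toy usage: one query attending over a 4K-token context.
q = torch.randn(64)
k = torch.randn(4096, 64)
v = torch.randn(4096, 64)
out = block_sparse_attention_sketch(q, k, v)   # shape: (64,)
```

Because only the selected blocks enter the full attention computation, the cost scales with `top_k * block_size` rather than the full context length, which is where the long-context speed-up comes from.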
Impact on Accuracy
We know that performance can’t come at the cost of accuracy. The PLAS method was specifically designed to be nearly lossless. Our evaluations on long-context benchmarks show that the difference in precision is negligible for both models.
| Model | Benchmark | Full Attention | Sparse Attention (PLAS) |
|---|---|---|---|
| ERNIE-4.5-21B-A3B | LongBenchV2 | 31.48 | 31.45 |
| ERNIE-4.5-21B-A3B | Ruler (128K) | 25.48 | 25.05 |
| ERNIE-4.5-300B-A47B | LongBenchV2 | 41.02 | 41.05 |
| ERNIE-4.5-300B-A47B | Ruler (128K) | 58.18 | 57.85 |
Evaluation results show minimal precision changes, ensuring reliable model output.
How to Get Started
If you’re using a -Paddle version of the ERNIE-4.5-21B-A3B or ERNIE-4.5-300B-A47B models with FastDeploy, enabling sparse attention is simple.
Just set the environment variable and add the PLAS configuration to your launch command.
```bash
# Set the environment variable to enable the PLAS attention backend
export FD_ATTENTION_BACKEND="PLAS_ATTN"

# Launch the API server with your model (e.g., ERNIE-4.5-300B-A47B-Paddle) and PLAS configuration
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-300B-A47B-Paddle \
    --port 8180 \
    --metrics-port 8181 \
    --quantization wint4 \
    --tensor-parallel-size 4 \
    --engine-worker-queue-port 8182 \
    --max-model-len 131072 \
    --max-num-seqs 32 \
    --max-num-batched-tokens 8192 \
    --enable-chunked-prefill \
    --plas-attention-config '{"plas_encoder_top_k_left": 50, "plas_encoder_top_k_right": 60, "plas_decoder_top_k_left": 100, "plas_decoder_top_k_right": 120}'
```
Command example from the ERNIE-4.5-300B-A47B-Paddle model card.
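Once the server is up, it exposes an OpenAI-compatible HTTP API on the port you chose (8180 in the example above). The client call below is a hypothetical sketch that follows the standard chat-completions request format; the host, prompt, and `max_tokens` value are placeholders to adapt to your deployment.

```python
import requests

# Send a chat-completions request to the FastDeploy server launched above.
# Adjust host/port and the payload to match your own deployment.
response = requests.post(
    "http://localhost:8180/v1/chat/completions",
    json={
        "model": "baidu/ERNIE-4.5-300B-A47B-Paddle",
        "messages": [
            {"role": "user", "content": "Summarize the following long report: ..."}
        ],
        "max_tokens": 512,
    },
    timeout=600,
)
print(response.json()["choices"][0]["message"]["content"])
```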
For more technical details, you can refer to the official PLAS Attention documentation.