Enhanced AI Performance with NVIDIA TensorRT 10.0's Weight-Stripped Engines
NVIDIA has unveiled TensorRT 10.0, a significant upgrade to its inference library that introduces weight-stripped engines designed to slim down AI application deployment. According to the NVIDIA Technical Blog, these new engines ship only execution code, reducing engine shipment size by more than 95%.
What is a Weight-Stripped Engine?
Weight-stripped engines, introduced in TensorRT 10.0, contain only the execution code (CUDA kernels) and omit the weights, making them significantly smaller than traditional engines. Weights are stripped during the build phase, and the engine retains only the small subset of weights essential for performance optimization; the feature supports ONNX models as well as other network definitions. Because the full weights are supplied later by refitting, they can be changed without rebuilding the engine, and the refitted engine preserves TensorRT's quick deserialization and high inference performance.
Benefits of Weight-Stripping
Traditional TensorRT engines embed all network weights, so an application that ships multiple hardware-specific engines duplicates the same weights in each one, inflating the application binary. Weight-stripped engines eliminate this duplication, achieving over 95% compression for convolutional neural networks (CNNs) and large language models (LLMs), so more AI functionality can be packed into an application without increasing its size. Furthermore, these engines are compatible with TensorRT minor-version updates and can run using a lean runtime of approximately 40 MB.
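As a sketch of how an application might opt in to that lean-runtime compatibility (using the public TensorRT Python API; the flags below exist in TensorRT 10, but treat the exact combination as illustrative rather than the blog's prescribed recipe):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Allow the plan to run on a newer minor version of TensorRT.
config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)
# Keep the lean runtime out of the plan itself; the app ships the
# ~40 MB lean runtime once and reuses it across all its engines.
config.set_flag(trt.BuilderFlag.EXCLUDE_LEAN_RUNTIME)
```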
Building and Deploying Weight-Stripped Engines
Building a weight-stripped engine still uses the real weights to make optimization decisions, such as folding constant nodes and fusing layers, which ensures consistent performance once the engine is refitted later. TensorRT Cloud, available in early access for select partners, can also build weight-stripped engines from ONNX models.
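For illustration, a weight-stripped build from an ONNX model might look like the following minimal sketch with the TensorRT 10 Python API (`model.onnx` is a placeholder path):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)

# Parse the full ONNX model: the real weights still drive optimization
# decisions such as constant folding and layer fusion.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.STRIP_PLAN)       # omit refittable weights from the plan
config.set_flag(trt.BuilderFlag.REFIT_IDENTICAL)  # refits must use the build-time weights

plan = builder.build_serialized_network(network, config)
with open("model_stripped.plan", "wb") as f:
    f.write(plan)
```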
Deploying these engines is straightforward: an application refits the weight-stripped engine with weights from the ONNX file on the end-user device within seconds. Once serialized, the refitted engine retains the quick deserialization efficiency TensorRT is known for, with no recurring refit cost. The TensorRT 10.0 lean runtime (~40 MB) supports this process, ensuring compatibility with next-generation GPUs without requiring app updates.
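On the end-user device, the refit-and-reserialize step could be sketched as follows (assuming TensorRT 10's ONNX parser refit helper; file paths are placeholders):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

with open("model_stripped.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Stream the weights back in from the original ONNX file.
refitter = trt.Refitter(engine, logger)
parser_refitter = trt.OnnxParserRefitter(refitter, logger)
assert parser_refitter.refit_from_file("model.onnx")
assert refitter.refit_cuda_engine()

# Serialize the refitted engine so later launches pay no refit cost.
with open("model_refitted.plan", "wb") as f:
    f.write(engine.serialize())
```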
Case Study and Performance Metrics
A case study on an NVIDIA GeForce RTX 4090 GPU demonstrated more than 99% compression with Stable Diffusion XL (SDXL). The table below compares engine sizes:
| SDXL fp16 | Full engine size (MB) | Weight-stripped engine size (MB) |
| --- | ---: | ---: |
| clip | 237.51 | 4.37 |
| clip2 | 1329.40 | 8.28 |
| unet | 6493.25 | 58.19 |
Support for weight-stripped TensorRT-LLM engines is coming soon, with internal builds already showing significant compression on various LLMs.
Limitations and Future Developments
Currently, weight-stripped functionality in TensorRT 10.0 is limited to refitting with weights identical to the build-time weights, which ensures maximum performance. Users cannot make layer-level decisions about which weights to strip, a limitation that may be addressed in future releases.
Integration with ONNX Runtime
TensorRT 10.0's weight-stripped functionality is integrated into ONNX Runtime (ORT), starting with ORT 1.18.1. This integration exposes the same functionality through ORT APIs, reducing shipment sizes when catering to diverse customer hardware. It builds on ORT's EPContext node mechanism, which embeds a serialized TensorRT engine within an ONNX model, bypassing the need for builder resources and significantly reducing session setup time.
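A sketch of what enabling this through the ORT Python API might look like (the TensorRT execution provider option names below are our reading of the ORT 1.18 documentation and should be verified against your ORT version):

```python
import onnxruntime as ort

trt_ep_options = {
    # Option names assumed from the ORT TensorRT EP documentation:
    "trt_weight_stripped_engine_enable": True,  # build/load weight-stripped engines
    "trt_onnx_model_folder_path": "./models",   # weight-full ONNX files used for refit
    "trt_engine_cache_enable": True,            # cache serialized engines between runs
}

session = ort.InferenceSession(
    "model_ctx.onnx",  # placeholder: an EPContext model that references the TRT engine
    providers=[("TensorrtExecutionProvider", trt_ep_options)],
)
```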
Conclusion
TensorRT 10.0's weight-stripped engines pack extensive AI functionality into applications without increasing their size, while retaining TensorRT's peak performance on NVIDIA GPUs. On-device refitting lets applications ship improved weights continuously without rebuilding engines, paving the way for rapidly evolving generative AI models.