CAT-Serve

Content-Aware Caching and Scheduling for Transformer Model Serving

March 31st 2025 @ Rotterdam, The Netherlands

Held in conjunction with ASPLOS 2025

About

The increasing deployment of transformer models in production environments has revealed fundamental limitations of traditional caching and scheduling approaches that treat all requests equally. This workshop focuses on the emerging paradigm of content-aware system design for transformer serving, in which caching decisions and scheduling policies adapt to the semantic patterns and computational characteristics of incoming requests. As transformer applications range from simple queries to complex chain-of-thought reasoning and RAG-enhanced processing, the variance in resource requirements and execution patterns demands more sophisticated serving strategies.

CAT-Serve brings together researchers and practitioners to explore novel approaches that leverage an understanding of request content for system optimization.
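To make the idea concrete, here is a minimal, purely illustrative sketch (not any system presented at the workshop) of a content-aware scheduler: it inspects each prompt, estimates its cost from simple content cues such as length and chain-of-thought phrasing, and serves cheaper requests first. All names and the cost heuristic are hypothetical.

```python
# Hypothetical sketch of content-aware scheduling: route requests by an
# estimated cost derived from the request's content, not arrival order.
from dataclasses import dataclass, field
import heapq


@dataclass(order=True)
class Request:
    priority: int
    prompt: str = field(compare=False)


def estimate_cost(prompt: str) -> int:
    """Crude content-aware cost proxy (illustrative only): longer prompts
    and chain-of-thought cues suggest more decode work."""
    cost = len(prompt.split())
    if "step by step" in prompt.lower():  # chain-of-thought cue
        cost *= 4
    return cost


class ContentAwareScheduler:
    """Shortest-estimated-job-first queue keyed on request content."""

    def __init__(self) -> None:
        self._heap: list[Request] = []

    def submit(self, prompt: str) -> None:
        heapq.heappush(self._heap, Request(estimate_cost(prompt), prompt))

    def next(self) -> str:
        return heapq.heappop(self._heap).prompt


sched = ContentAwareScheduler()
sched.submit("Explain step by step how attention works.")
sched.submit("What is 2+2?")
print(sched.next())  # the cheap query is served first
```

Real serving systems would of course use far richer signals (KV-cache reuse potential, retrieval context size, SLO class); the point of the sketch is only that scheduling decisions key off request content.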

Agenda

Time        | Topic                                                                                   | Speaker                        | Institution
8:30-9:00   | Coffee and Registration                                                                 |                                |
9:00-9:10   | Welcome speech                                                                          | Prof. Freddy Gabbay            | HUJI, Israel
9:10-9:30   | Keynote: Optimizing Transformer Model Serving: From Training to Content-Aware Inference | Gil Bloch, Principal Architect | NVIDIA
9:30-10:00  | Context Fast Fusion for Enhanced LLM Scheduling                                         | Dr. Joseph Kampeas             | Huawei Tel-Aviv Research Center, Israel
10:00-10:30 | Boosting Transformer Efficiency with Decomposition and Adaptive Scheduling              | Dr. Ori Schweitzer             | Technion - Israel Institute of Technology, Israel
10:30-11:00 | Coffee Break                                                                            |                                |
11:00-11:30 | CacheTrie: Optimizing Reuse of KV Cache via Path-compressed Trie                        | Chou Weizhong                  | Huawei Computing Product Line, in cooperation with University of Science and Technology of China
11:30-12:00 | SSSD: Simply-Scalable Speculative Decoding for Large Batch LLM Inference                | Michele Marzollo               | Huawei Zurich Research Center
12:00-12:30 | Energy-aware and Adaptive Model Serving for DL Inference Stream (invited talk)          | Prof. Demetris Trihinas        | University of Nicosia, Cyprus
12:30-13:00 | Planning Rust Features for System Programming Requirements (invited talk)               | Prof. Yijun Yu                 | The Open University, UK, and Director of the Ada Language Engineering Lab at Huawei Ireland Research Center
13:00-14:00 | Lunch Break                                                                             |                                |

Important Dates

All deadlines are 11:59pm Anywhere on Earth (AoE).

Call for Papers

CAT-Serve emphasizes solutions that bridge hardware capabilities, system software, and ML serving frameworks to enable intelligent resource management.

Topics include, but are not limited to:

Aiming to find novel ways to:

Submission Guidelines

Paper submission system:

Submit your paper here
Please note that you must be a registered CMT user to submit.
Register for CMT here

Workshop Committee

Prof. Freddy Gabbay, Hebrew University

Prof. Avi Mendelson, Technion

Dr. Xiuqiao Li, Huawei

Igor Gov, Huawei

Avigail Oron, Huawei

Contact Us

Any questions may be directed to: cat-serve2025@googlegroups.com


The Microsoft CMT service was used for managing the peer-reviewing process for this conference. This service was provided for free by Microsoft and they bore all expenses, including costs for Azure cloud services as well as for software development and support.