CAT-Serve

Content-Aware Caching and Scheduling for Transformer Model Serving

March 31st 2025 @ Rotterdam, The Netherlands

Held in conjunction with ASPLOS 2025

About

The increasing deployment of transformer models in production environments has revealed fundamental limitations of traditional caching and scheduling approaches that treat all requests equally. This workshop focuses on the emerging paradigm of content-aware system design for transformer serving, in which caching decisions and scheduling policies adapt to the semantic patterns and computational characteristics of incoming requests. As transformer applications range from simple queries to complex chain-of-thought reasoning and RAG-enhanced processing, the variance in resource requirements and execution patterns demands more sophisticated serving strategies.

CAT-Serve brings together researchers and practitioners to explore novel approaches that leverage an understanding of request content for system optimization.
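To make the idea concrete, here is a minimal, purely illustrative sketch (not any system presented at the workshop) of a content-aware scheduler: it inspects each prompt, estimates its cost from simple content cues such as length and chain-of-thought phrasing, and serves cheaper requests first. All names and the cost heuristic are hypothetical.

```python
# Hypothetical sketch of content-aware scheduling: route requests by an
# estimated cost derived from the request's content, not arrival order.
from dataclasses import dataclass, field
import heapq


@dataclass(order=True)
class Request:
    priority: int
    prompt: str = field(compare=False)


def estimate_cost(prompt: str) -> int:
    """Crude content-aware cost proxy (illustrative only): longer prompts
    and chain-of-thought cues suggest more decode work."""
    cost = len(prompt.split())
    if "step by step" in prompt.lower():  # chain-of-thought cue
        cost *= 4
    return cost


class ContentAwareScheduler:
    """Shortest-estimated-job-first queue keyed on request content."""

    def __init__(self) -> None:
        self._heap: list[Request] = []

    def submit(self, prompt: str) -> None:
        heapq.heappush(self._heap, Request(estimate_cost(prompt), prompt))

    def next(self) -> str:
        return heapq.heappop(self._heap).prompt


sched = ContentAwareScheduler()
sched.submit("Explain step by step how attention works.")
sched.submit("What is 2+2?")
print(sched.next())  # the cheap query is served first
```

Real serving systems would of course use far richer signals (KV-cache reuse potential, retrieval context size, SLO class); the point of the sketch is only that scheduling decisions key off request content.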

Agenda

Time        | Topic                                                                                   | Speaker                        | Institution
8:30-9:00   | Coffee and Registration                                                                 |                                |
9:00-9:10   | Welcome speech                                                                          | Prof. Freddy Gabbay            | HUJI, Israel
9:10-9:30   | Keynote: Optimizing Transformer Model Serving: From Training to Content-Aware Inference | Gil Bloch, Principal Architect | NVIDIA
9:30-10:00  | Context Fast Fusion for Enhanced LLM Scheduling                                         | Dr. Joseph Kampeas             | Huawei Tel-Aviv Research Center, Israel
10:00-10:30 | Boosting Transformer Efficiency with Decomposition and Adaptive Scheduling              | Dr. Ori Schweitzer             | Technion - Israel Institute of Technology, Israel
10:30-11:00 | Coffee Break                                                                            |                                |
11:00-11:30 | CacheTrie: Optimizing Reuse of KV Cache via Path-compressed Trie                        | Chou Weizhong                  | Huawei Computing Product Line, in cooperation with University of Science and Technology of China
11:30-12:00 | SSSD: Simply-Scalable Speculative Decoding for Large Batch LLM Inference                | Michele Marzollo               | Huawei Zurich Research Center
12:00-12:30 | Energy-aware and Adaptive Model Serving for DL Inference Stream (invited talk)          | Prof. Demetris Trihinas        | University of Nicosia, Cyprus
12:30-13:00 | Planning Rust Features for System Programming Requirements (invited talk)               | Prof. Yijun Yu                 | The Open University, UK, and Director of the Ada Language Engineering Lab at Huawei Ireland Research Center
13:00-14:00 | Lunch Break                                                                             |                                |

Important Dates

All deadlines are 11:59pm Anywhere on Earth (AoE).

Call for Papers

CAT-Serve emphasizes solutions that bridge hardware capabilities, system software, and ML serving frameworks to enable intelligent resource management.

Topics include, but are not limited to:

Aiming to find novel ways to:

Submission Guidelines

Paper submission system:

Submit your paper here
Please note that you must be a registered CMT user to submit.
Register for CMT here

Workshop Committee

Prof. Freddy Gabbay, Hebrew University

Prof. Avi Mendelson, Technion

Dr. Xiuqiao Li, Huawei

Igor Gov, Huawei

Avigail Oron, Huawei

Contact Us

Any questions may be directed to: cat-serve2025@googlegroups.com


The Microsoft CMT service was used for managing the peer-reviewing process for this conference. This service was provided for free by Microsoft and they bore all expenses, including costs for Azure cloud services as well as for software development and support.