<html xmlns:bkstg="http://www.atypon.com/backstage-ns" xmlns:urlutil="java:com.atypon.literatum.customization.UrlUtil" xmlns:pxje="java:com.atypon.frontend.services.impl.PassportXslJavaExtentions"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta http-equiv="Content-Style-Type" content="text/css"><style type="text/css">
            #DLtoc {
            font: normal 12px/1.5em Arial, Helvetica, sans-serif;
            }

            #DLheader {
            }
            #DLheader h1 {
            font-size:16px;
            }

            #DLcontent {
            font-size:12px;
            }
            #DLcontent h2 {
            font-size:14px;
            margin-bottom:5px;
            }
            #DLcontent h3 {
            font-size:12px;
            padding-left:20px;
            margin-bottom:0px;
            }

            #DLcontent ul{
            margin-top:0px;
            margin-bottom:0px;
            }

            .DLauthors li{
            display: inline;
            list-style-type: none;
            padding-right: 5px;
            }

            .DLauthors li:after{
            content:",";
            }
            .DLauthors li.nameList.Last:after{
            content:"";
            }

            .DLabstract {
            padding-left:40px;
            padding-right:20px;
            display:block;
            }

            .DLformats li{
            display: inline;
            list-style-type: none;
            padding-right: 5px;
            }

            .DLformats li:after{
            content:",";
            }
            .DLformats li.formatList.Last:after{
            content:"";
            }

            .DLlogo {
            vertical-align:middle;
            padding-right:5px;
            border:none;
            }

            .DLcitLink {
            margin-left:20px;
            }

            .DLtitleLink {
            margin-left:20px;
            }

            .DLotherLink {
            margin-left:0px;
            }

        </style><title>SoCC '24: Proceedings of the ACM Symposium on Cloud Computing</title></head><body><div id="DLtoc"><div id="DLheader"><h1>SoCC '24: Proceedings of the ACM Symposium on Cloud Computing</h1><a class="DLcitLink" title="Go to the ACM Digital Library for additional information about this proceeding" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/proceedings/10.1145/3698038"><img class="DLlogo" alt="Digital Library logo" height="30" src="https://dl.acm.org/specs/products/acm/releasedAssets/images/footer-logo1.png">
                Full Citation in the ACM Digital Library
            </a></div><div id="DLcontent"><h2>SESSION: Systems Supporting Machine Learning I: Scheduling</h2>
						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698515">Hops: Fine-grained heterogeneous sensing, efficient and fair Deep Learning cluster scheduling system</a></h3><ul class="DLauthors"><li class="nameList">Qinghe Wang</li><li class="nameList">Futian Wang</li><li class="nameList Last">Xinwei Zheng</li></ul><div class="DLabstract"><div style="display:inline">
		<p>In recent years, the number of clusters and cloud platforms dedicated to deep learning acceleration has increased, and research on multi-tenant deep learning (DL) cluster scheduling systems has also advanced quickly. However, we have observed several shortcomings in these systems. Firstly, resources exhibit heterogeneity, but even the most advanced heterogeneity-aware schedulers can only reach the GPU-type level. In addition, most scheduling systems cannot perform well in balancing efficiency and fairness, which leads to unfair resource allocation and reduced user satisfaction. Moreover, we have noticed the phenomenon of cluster fragmentation and job starvation.</p> <p>In this paper, we propose a new scheduling architecture: Hops, which includes (1) fine-grained heterogeneity awareness and accurate throughput estimators, which allows for heterogeneity awareness at the server entity level. (2) Hops performs resource allocation by executing prior weighted integer linear programming (ILP) for specific placement locations, effectively balancing fairness and efficiency. (3) Hops introduces "latency ratio fairness" (LRF) as a user fairness criterion, which helps reduce starvation and enhance user experience. (4) To address cluster fragmentation, Hops intentionally uses low-sensitivity jobs to fill fragments. The final experimental results show that, in physical experiments, compared with the state-of-the-art scheduling architectures: Sia [17] and Gavel [32], Hops reduces cluster completion time by 18.5% to 34.2%, shortens average job completion time (JCT) by 27.4% to 45.9%, lowers waiting latency by 35.4% to 54.9%, significantly reduces cluster fragmentation, and performs significantly better in fairness metrics compared to Sia and Gavel. In the 512-GPU simulation experiments, Hops not only improves system efficiency but also reduces the maximum job latency ratio by over 21× and decreases cluster fragmentation to less than 1 GPU per round on average.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698523">Queue Management for SLO-Oriented Large Language Model Serving</a></h3><ul class="DLauthors"><li class="nameList">Archit Patke</li><li class="nameList">Dhemath Reddy</li><li class="nameList">Saurabh Jha</li><li class="nameList">Haoran Qiu</li><li class="nameList">Christian Pinto</li><li class="nameList">Chandra Narayanaswami</li><li class="nameList">Zbigniew Kalbarczyk</li><li class="nameList Last">Ravishankar Iyer</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Large language model (LLM) serving is becoming an increasingly critical workload for cloud providers. Existing LLM serving systems focus on interactive requests, such as chatbots and coding assistants, with tight latency SLO requirements. However, when such systems execute batch requests that have relaxed SLOs along with interactive requests, it leads to poor multiplexing and inefficient resource utilization. To address these challenges, we propose QLM, a queue management system for LLM serving. QLM maintains batch and interactive requests across different models and SLOs in a request queue. Optimal ordering of the request queue is critical to maintain SLOs while ensuring high resource utilization. To generate this optimal ordering, QLM uses a Request Waiting Time (RWT) Estimator that estimates the waiting times for requests in the request queue. These estimates are used by a global scheduler to orchestrate LLM Serving Operations (LSOs) such as request pulling, request eviction, load balancing, and model swapping. Evaluation on heterogeneous GPU devices and models with a real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems. QLM's evaluation is based on the production requirements of a cloud provider. QLM is publicly available at https://www.github.com/QLM-project/QLM.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698532">Kale: Elastic GPU Scheduling for Online DL Model Training</a></h3><ul class="DLauthors"><li class="nameList">Ziyang Liu</li><li class="nameList">Renyu Yang</li><li class="nameList">Jin Ouyang</li><li class="nameList">Weihan Jiang</li><li class="nameList">Tianyu Ye</li><li class="nameList">Menghao Zhang</li><li class="nameList">Sui Huang</li><li class="nameList">Jiaming Huang</li><li class="nameList">Chengru Song</li><li class="nameList">Di Zhang</li><li class="nameList">Tianyu Wo</li><li class="nameList Last">Chunming Hu</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Large-scale GPU clusters have been widely used for effectively training both online and offline deep learning (DL) jobs. However, elastic scheduling in most resource schedulers is dedicated to offline model training, where resource adjustment is planned ahead of time. The native autoscaling policy is based on a pre-defined threshold and, if applied directly in online model training, often suffers from belated resource adjustment, leading to diminished model accuracy. In this paper, we present Kale, a novel elastic GPU scheduling system to improve the performance of online DL model training. Through traffic forecasting and resource-throughput modeling, Kale automatically pinpoints the number of required GPUs that best accommodate the on-the-fly data samples before performing stabilized autoscaling. An advanced data shuffling strategy is further employed for balancing uneven samples among different training workers, thereby improving the runtime efficacy. Experiments show that Kale substantially outperforms the state-of-the-art solutions. Compared with the default HPA autoscaling strategy, Kale reduces the accumulated lag and downtime by 69.2% and 33.1%, respectively, whilst lowering the SLO violation rate from 19.57% to just 2.6%. Kale has been deployed at Kuaishou's production-level GPU clusters and successfully underpins real-time video recommendation and advertisement at scale.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698559">FedCaSe: Enhancing Federated Learning with Heterogeneity-aware Caching and Scheduling</a></h3><ul class="DLauthors"><li class="nameList">Redwan Ibne Seraj Khan</li><li class="nameList">Arnab K. Paul</li><li class="nameList">Yue Cheng</li><li class="nameList">Xun Steve Jian</li><li class="nameList Last">Ali R. Butt</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Federated learning (FL) has emerged as a new paradigm of machine learning (ML) with the goal of collaborative learning on the vast pool of private data available across distributed edge devices. The focus of most existing works in FL systems has been on addressing the challenges of computation and communication heterogeneity inherent in training with edge devices. However, the crucial impact of I/O and the role of limited on-device storage have not been fully explored in the FL context. Without policies to exploit the on-device storage for placement of client data samples, and schedule clients based on I/O benefits, FL training can lead to inefficiencies, such as increased training time and impacted accuracy convergence.</p> <p>In this paper, we propose FedCaSe, a framework for efficiently caching client samples in-situ on limited on-device storage and scheduling client participation. FedCaSe boosts the I/O performance by exploiting a unique characteristic: the experience, i.e., relative impact on overall performance, of data samples and clients. FedCaSe utilizes this information in adaptive caching policies for sample placement inside the limited memory of edge clients. The framework also exploits the experience information to orchestrate the future selection of clients. Our experiments with representative workloads and policies show that compared to the state of the art, FedCaSe improves the training time by 2.06× for accuracy convergence at the scale of thousands of clients.</p>
	</div></div>

					<h2>SESSION: Machine Learning Supporting Systems</h2>
						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698569">SQLStateGuard: Statement-Level SQL Injection Defense Based on Learning-Driven Middleware</a></h3><ul class="DLauthors"><li class="nameList">Xin Liu</li><li class="nameList">Yuanyuan Huang</li><li class="nameList">Tianyi Wang</li><li class="nameList">Song Li</li><li class="nameList">Weina Niu</li><li class="nameList">Jun Shen</li><li class="nameList">Qingguo Zhou</li><li class="nameList Last">Xiaokang Zhou</li></ul><div class="DLabstract"><div style="display:inline">
		<p>SQL injection is a significant and persistent threat to web services. Most existing protections against SQL injections rely on traffic-level anomaly detection, which often results in high false-positive rates and can be easily bypassed by attackers. This paper introduces SQLStateGuard, the world's first middleware-driven statement-level SQL injection defense approach, to address these issues. The SQLStateGuard uses a custom SQL middleware based on the idea of Runtime Application Self-Protection to capture raw SQL statements. These statements are then analyzed by SQLSG-Net, a database-oriented detection network based on gated linear units. If SQLSG-Net detects malicious SQL statements, the SQL middleware will block them. Experiments show that the detection accuracy of SQLStateGuard exceeds 99%, outperforming existing approaches, and it can identify the type of a specific SQL injection. Additionally, SQLStateGuard has no fingerprint and does not respond to SQL syntax errors, making it more challenging for attackers to gather information. This paper also presents a novel dataset generation process for SQLStateGuard and shares two statement-level SQL injection datasets with the research community, including over 145,000 malicious SQL statements categorized by the type of SQL injection.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698519">Vista: Machine Learning based Database Performance Troubleshooting Framework in Amazon RDS</a></h3><ul class="DLauthors"><li class="nameList">Vikramank Singh</li><li class="nameList">Zhao Song</li><li class="nameList">Balakrishnan Murali Narayanaswamy</li><li class="nameList">Kapil Eknath Vaidya</li><li class="nameList Last">Tim Kraska</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Database performance troubleshooting is a complex multi-step process that broadly involves three key stages: (a) Detection: determining what's wrong and when; (b) Root Cause Analysis (RCA): reasoning about why the performance is poor; (c) Resolution: identifying a fix. A plethora of techniques exist to address each of these problems, but they hardly work in the real world at scale. First, real-world customer workloads are noisy, non-stationary and quasi-periodic in nature, rendering traditional detectors ineffective. Second, real-world production databases execute a highly diverse set of queries that skew the database statistics into long-tail distributions, causing traditional RCA methods to fail. Third, these databases typically execute millions of such diverse queries every minute, rendering traditional methods inefficient when deployed at scale.</p> <p>In this paper we describe Vista, a machine learning based performance troubleshooting framework for databases, and dive deep into how it addresses the 3 real-world problems outlined above. Vista deploys a deep auto-regressive model trained on a large and diverse Amazon Relational Database Service (RDS) fleet with custom skip connections and periodicity alignment features to model long range and varying periodicity in customer workloads, and detects performance bottlenecks in the form of outliers. Furthermore, it efficiently filters only a top few dominating SQL queries from millions in a problematic workload, and uses a robust causal inference framework to identify the culprit queries and their statistics, leading to a low false-positive and false-negative rate. Currently, Vista runs on hundreds of thousands of RDS databases and analyzes millions of workloads every day, bringing down the troubleshooting time for RDS customers from hours to seconds. At the end, we also describe several challenges and learnings from implementing and deploying Vista at Amazon scale.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698525">Building AI Agents for Autonomous Clouds: Challenges and Design Principles</a></h3><ul class="DLauthors"><li class="nameList">Manish Shetty</li><li class="nameList">Yinfang Chen</li><li class="nameList">Gagan Somashekar</li><li class="nameList">Minghua Ma</li><li class="nameList">Yogesh Simmhan</li><li class="nameList">Xuchao Zhang</li><li class="nameList">Jonathan Mace</li><li class="nameList">Dax Vandevoorde</li><li class="nameList">Pedro Las-Casas</li><li class="nameList">Shachee Mishra Gupta</li><li class="nameList">Suman Nath</li><li class="nameList">Chetan Bansal</li><li class="nameList Last">Saravan Rajmohan</li></ul><div class="DLabstract"><div style="display:inline">
		<p>The rapid growth in the use of Large Language Models (LLMs) and AI Agents as part of software development and deployment is revolutionizing the information technology landscape. While code generation receives significant attention, a higher-impact application lies in using agents for the operational resilience of cloud services, which currently require significant human effort and domain knowledge. There is a growing interest in AI for IT Operations (AIOps), which aims to automate complex operational tasks, like fault localization and root cause analysis, reducing human intervention and customer impact. However, achieving the vision of autonomous and self-healing clouds through AIOps is hampered by the lack of standardized frameworks for building, evaluating, and improving AIOps agents. This vision paper lays the groundwork for such a framework by framing the requirements and then discussing design decisions that satisfy them. We also propose AIOpsLab, a prototype implementation leveraging agent-cloud-interface that orchestrates an application, injects real-time faults using chaos engineering, and interfaces with an agent to localize and resolve the faults. We report promising results and lay the groundwork to build a modular and robust framework for building, evaluating, and improving agents for autonomous clouds.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698533">Zero-SAD: Zero-Shot Learning Using Synthetic Abnormal Data for Abnormal Behavior Detection on Private Cloud</a></h3><ul class="DLauthors"><li class="nameList">Jae-Seok Kim</li><li class="nameList">Joonho Seo</li><li class="nameList">Seon-Jin Hwang</li><li class="nameList">Jinmyeong Shin</li><li class="nameList Last">Yoon-Ho Choi</li></ul><div class="DLabstract"><div style="display:inline">
		<p>While many studies have been conducted to detect abnormal behavior in cloud environments by analyzing system call sequences, these studies often cannot be applied to real-world cloud environments since they do not consider actual user behavior and rely on publicly available datasets. In actual cloud environments, the frequency of duplicate system calls is significantly higher than that observed in these datasets. This discrepancy necessitates a considerably larger scale of analysis to fully understand the sequential relationships among system calls. In this paper, we propose a practical abnormal behavior detection system for private cloud environments. The proposed system comprises a deduplicated embedding process that efficiently represents duplicate system calls occurring within the cloud as a single embedding vector and a zero-shot abnormal behavior detection process that rapidly analyzes the large volume of system call sequences generated by numerous users through a zero-shot learning model. To demonstrate the practicality of our proposed system, we use both publicly available datasets and datasets directly collected from real cloud environments by implementing attacks from the MITRE ATT&amp;CK framework as a proof of concept (PoC). Experimental results show that our system achieved an accuracy of 92.13%, and it can detect attacks 5.48 times faster than existing research.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698564">Forecasting Algorithms for Intelligent Resource Scaling: An Experimental Analysis</a></h3><ul class="DLauthors"><li class="nameList">Yanlei Diao</li><li class="nameList">Dominik Horn</li><li class="nameList">Andreas Kipf</li><li class="nameList">Oleksandr Shchur</li><li class="nameList">Ines Benito</li><li class="nameList">Wenjian Dong</li><li class="nameList">Davide Pagano</li><li class="nameList">Pascal Pfeil</li><li class="nameList">Vikram Nathan</li><li class="nameList">Balakrishnan Narayanaswamy</li><li class="nameList Last">Tim Kraska</li></ul><div class="DLabstract"><div style="display:inline">
		<p>There has been a growing demand for making modern cloud-based data analytics systems cost-effective and easy to use. AI-powered intelligent resource scaling is one such effort, aiming at automating scaling decisions for serverless offerings like Amazon Redshift Serverless. The foundation of intelligent resource scaling lies in the ability to forecast query workloads and their resource consumption accurately. Although the forecasting problem has been extensively studied across various domains, there is a lack of thorough analysis of existing forecasting algorithms for large-scale, real-world cloud query workloads. This paper fills this gap by providing an in-depth analysis of forecasting algorithms for real-world cloud workloads, covering the fundamental data characteristics that distinguish query workload forecasting from prior problems and evaluating the strengths and limitations of existing algorithms in this new domain. We anticipate that our findings will provide valuable insights in informing the design of an efficient and effective solution for production use, as well as in steering the forecasting community toward more effective algorithms of high real-world impact.</p>
	</div></div>

					<h2>SESSION: Speed and Scale in Serverless</h2>
						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698513">Snapipeline: Accelerating Snapshot Startup for FaaS Containers</a></h3><ul class="DLauthors"><li class="nameList">Yuqiao Lan</li><li class="nameList">Xiaohui Peng</li><li class="nameList Last">Yifan Wang</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Due to the frequent starts and stops of numerous services in FaaS (Function as a Service), reducing cold start overhead is a core issue in improving the performance of container-based FaaS services. Snapshot and restore-based mechanisms effectively reduce the cold start time of containers by transforming container initialization overhead into restoration overhead. Consequently, this mechanism has become a research hotspot in accelerating the cold start of FaaS containers. Researchers introduce snapshot compression and decompress the snapshots to reduce the storage cost before starting instances. However, existing works have the following shortcomings: (1) File-mapped memory pages are not processed during snapshot compression, resulting in a significant amount of redundant data in memory; (2) The serial execution of snapshot decompression and instance restoration leads to high instance startup latency.</p> <p>To address these shortcomings, we propose the Snapipeline mechanism, which implements the following optimizations: (1) In the snapshot compression phase, it extends deduplication to file-mapped memory; (2) In the restoration phase, it leverages the hot and cold distinction in FaaS application memory to prioritize the memory pages restoration, pipelining snapshot decompression, memory pages restoration, and function execution. This mechanism hides the expensive snapshot decompression latency behind the instance restoration and function execution latency and removes the snapshot decompression from the critical path of instance startup. Evaluation on real-world FaaS application datasets shows that Snapipeline reduces memory usage by up to 28% and decreases end-to-end latency by an average of 53%, compared to the baseline.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698529">En4S: Enabling SLOs in Serverless Storage Systems</a></h3><ul class="DLauthors"><li class="nameList">Minghao Xie</li><li class="nameList">Chen Qian</li><li class="nameList Last">Heiner Litz</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Serverless computing promises scalability and cost-efficiency by decomposing monolithic tasks into small, stateless, self-contained functions. As functions only reserve hardware resources during their lifetime, and serverless providers such as Amazon Lambda define strict data size limits [50], data required for the whole lifetime of a monolithic task needs to be kept in an external ephemeral data store. This approach increases costs and introduces performance variability, causing serverless applications to violate service level objectives (SLOs). Traditional cloud storage solutions, such as AWS S3 and Redis, fail to provide low cost and SLO enforcement, while prior works on disaggregated data stores do not scale sufficiently due to: (1) increased scheduling costs when supporting many SLOs; (2) performance degradation in the presence of burst allowances and worsened interference with lenient ones; and (3) failed service differentiation with an increased number of SLOs. These challenges make SLO enforcement in serverless environments difficult, leading to unpredictable performance and costs that undermine the benefits of serverless computing.</p> <p>We introduce En4S, a high-performance, flash-based storage system designed for data-intensive serverless applications. En4S employs a profile-based scheduling framework with adaptive strategies to efficiently scale to many tenants with different SLOs. Key features include dynamic tenant handling, adaptive burst control, token reclaim control, and various optimizations to minimize scheduling costs while maintaining superior performance. By re-enabling SLO enforcement for disaggregated flash storage in cloud-native environments, En4S is crucial for modern serverless applications. Our implementation on Amazon EC2 and Lambda demonstrates substantial performance and cost improvements while reliably ensuring SLO compliance, enhancing the viability of serverless storage systems.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698509">Pre-Warming is Not Enough: Accelerating Serverless Inference With Opportunistic Pre-Loading</a></h3><ul class="DLauthors"><li class="nameList">Yifan Sui</li><li class="nameList">Hanfei Yu</li><li class="nameList">Yitao Hu</li><li class="nameList">Jianxun Li</li><li class="nameList Last">Hao Wang</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Serverless computing has rapidly prospered as a new cloud computing paradigm with agile scalability, pay-as-you-go pricing, and easy-to-use features for Machine Learning (ML) inference tasks. Users package their ML code into lightweight serverless functions and execute them using containers. Unfortunately, a notorious problem, called cold-starts, hinders serverless computing from providing low-latency function executions. To mitigate cold-starts, pre-warming, which keeps containers warm predictively, has been widely accepted by academia and industry. However, pre-warming fails to eliminate the unique latency incurred by loading ML artifacts. We observed that for ML inference functions, the loading of libraries and models takes significantly more time than container warming. Consequently, pre-warming alone is not enough to mitigate the ML inference function's cold-starts.</p> <p>This paper introduces InstaInfer, an opportunistic preloading technique to achieve instant inference by eliminating the latency associated with loading ML artifacts, thereby achieving minimal time cost in function execution. InstaInfer fully utilizes the memory of warmed containers to preload the function's libraries and model, striking a balance between maximum acceleration and resource wastage. We design InstaInfer to be transparent to providers and compatible with existing pre-warming solutions. Experiments on OpenWhisk with real-world workloads show that InstaInfer reduces loading latency by up to 93% and achieves up to 8× speedup compared to state-of-the-art pre-warming solutions.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698512">Faascale: Scaling MicroVM Vertically for Serverless Computing with Memory Elasticity</a></h3><ul class="DLauthors"><li class="nameList">Xinmin Zhang</li><li class="nameList">Qiang He</li><li class="nameList">Hao Fan</li><li class="nameList Last">Song Wu</li></ul><div class="DLabstract"><div style="display:inline">
		<p>This paper quantitatively analyses the potential of vertically scaling MicroVMs in serverless computing. Our analysis shows that under real-world serverless workloads, vertical scaling can significantly improve execution performance and resource utilization. However, we also find that the memory scaling of MicroVMs is the bottleneck that hinders vertical scaling from reaching the performance ceiling. We propose Faascale, a novel mechanism that efficiently scales the memory of MicroVMs for serverless applications. Faascale employs a series of techniques to tackle this bottleneck: 1) it sizes up/down the memory for a MicroVM by blocks that bind with a function instance instead of general pages; and 2) it pre-populates physical memory for function instances to reduce the delays introduced by lazy population. Compared with existing memory scaling mechanisms, Faascale improves the memory scaling efficiency by 2 to 3 orders of magnitude. We implement Faascale on Amazon Firecracker to evaluate its gains for the serverless platform. The results of experiments conducted on eight serverless benchmark functions demonstrate that, compared with horizontal scaling strategies based on the state-of-the-art snapshot technique, Faascale reduces the time for cold-starting MicroVMs by 89.01% and function execution time by 23.93% on average.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698561">Rethinking the Networking Stack for Serverless Environments: A Sidecar Approach</a></h3><ul class="DLauthors"><li class="nameList">Vishwanath Seshagiri</li><li class="nameList">Abhinav Gupta</li><li class="nameList">Vahab Jabrayilov</li><li class="nameList">Avani Wildani</li><li class="nameList Last">Kostis Kaffes</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Serverless platforms rely on legacy networking stacks for communication and data movement. We quantitatively analyze the performance of these stacks and show their mismatch with highly consolidated, virtualized modern serverless environments, focusing on Firecracker, the most common serverless virtualization framework. As serverless applications grow in complexity and interaction, the resulting network bottleneck is a prime source of user-perceived, end-to-end latency. In this paper, we present a detailed vision of a new, sidecar-based networking stack for serverless environments. Our primary design goal is to provide low-overhead networking while maintaining existing security guarantees. We outline the research challenges in both the control and the data plane that the community needs to tackle before such a sidecar architecture can be used in practice.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698567">Process-as-a-Service: Unifying Elastic and Stateful Clouds with Serverless Processes</a></h3><ul class="DLauthors"><li class="nameList">Marcin Copik</li><li class="nameList">Alexandru Calotoiu</li><li class="nameList">Gyorgy Rethy</li><li class="nameList">Roman Böhringer</li><li class="nameList">Rodrigo Bruno</li><li class="nameList Last">Torsten Hoefler</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Fine-grained serverless functions power many new applications that benefit from elastic scaling and a pay-as-you-use billing model with minimal infrastructure management overhead. To achieve these properties, Function-as-a-Service (FaaS) platforms disaggregate compute and state and, consequently, introduce non-trivial costs due to the loss of data locality when accessing state, complex control plane interactions, and expensive inter-function communication. We revisit the foundations of FaaS and propose a new cloud abstraction, the cloud process, that retains all the benefits of FaaS while significantly reducing the overheads that result from disaggregation. We show how established operating system abstractions can be adapted to provide powerful granular computing on dynamically provisioned cloud resources while building our Process as a Service (PraaS) platform. PraaS improves current FaaS by offering data locality, fast invocations, and efficient communication. PraaS delivers remote invocations up to 17× faster and reduces communication overhead by up to 99%.</p>
	</div></div>

					<h2>SESSION: The Elastic Cloud</h2>
						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698530">AutoBurst: Autoscaling Burstable Instances for Cost-effective Latency SLOs</a></h3><ul class="DLauthors"><li class="nameList">Rubaba Hasan</li><li class="nameList">Timothy Zhu</li><li class="nameList Last">Bhuvan Urgaonkar</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Burstable instances provide a low-cost option for consumers using the public cloud, but they come with significant resource limitations. They can be viewed as "fractional instances" where one receives a fraction of the compute and memory capacity at a fraction of the cost of regular instances. The fractional compute is achieved via rate limiting, where a unique characteristic of the rate limiting is that it allows for the CPU to burst to 100% utilization for limited periods of time. Prior research has shown how this ability to burst can be used to serve specific roles such as a cache backup and handling flash crowds. Our work provides a general-purpose approach to meeting latency SLOs via this burst capability while optimizing for cost. AutoBurst is able to achieve this by controlling both the number of burstable and regular instances along with how/when they are used. Evaluations show that our system is able to reduce cost by up to 25% over the state-of-the-art while maintaining latency SLOs.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698527">Is It Time To Put Cold Starts In The Deep Freeze?</a></h3><ul class="DLauthors"><li class="nameList">Carlos Segarra</li><li class="nameList">Ivan Durev</li><li class="nameList Last">Peter Pietzuch</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Cold-start times have been the "end-all, be-all" metric for research in serverless cloud computing over the past decade. Reducing the impact of cold starts matters, because they can be the biggest contributor to a serverless function's end-to-end execution time. Recent studies from cloud providers, however, indicate that, in practice, a majority of serverless functions are triggered by non-interactive workloads. To substantiate this, we study the types of serverless functions used in 35 publications and find that over 80% of functions are not semantically latency sensitive. If a function is non-interactive and latency insensitive, is end-to-end execution time the right metric to optimize in serverless? What if cold starts do not matter that much, after all?</p> <p>In this vision paper, we explore what serverless environments in which cold starts do not matter would look like. We make the case that serverless research should focus on supporting latency-insensitive, i.e., batch, workloads. Based on this, we explore the design space for DFaaS, a serverless framework with an execution model in which functions can be arbitrarily delayed. DFaaS users annotate each function with a delay tolerance and, as long as the deadline has not passed, the runtime may interrupt or migrate function execution. Our micro-benchmarks suggest that, by targeting batch workloads, DFaaS can substantially improve the resource usage of serverless clouds and lower costs for users.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698543">Towards Swap-Free, Continuous Ballooning for Fast, Cloud-Based Virtual Machine Migrations</a></h3><ul class="DLauthors"><li class="nameList">Kevin Alarcón Negy</li><li class="nameList">Tycho Nightingale</li><li class="nameList">Hakim Weatherspoon</li><li class="nameList Last">Zhiming Shen</li></ul><div class="DLabstract"><div style="display:inline">
		<p>We have a production need to reduce the time for customers to live migrate their application virtual machine (VM) in the cloud. A single customer of ours migrates their nested, cloud-based, user virtual machines tens of thousands of times a month.</p> <p>Ballooning is one technique for modifying the size of a virtual machine and has been used to speed up VM migration and increase VM consolidation. However, it has a significant risk: the ominous out-of-memory (OOM) error. The issue is that it is infeasible to use ballooning during high-risk scenarios, namely during giant memory spikes and during live migration, for fear of swapping or worse, OOM errors.</p> <p>We advance the state of the art by optimizing the Linux balloon driver for VM migration in a non-overcommitted context, resulting in being able to handle both high-risk scenarios without relying on swapping and without causing OOM errors. We add a user-space continuous ballooning program that, in tandem with our balloon driver modifications, can handle memory spikes of hundreds of gigabytes, as well as survive an indefinite number of migrations.</p> <p>In this paper, we discuss our minimal changes to Linux, describe our continuous ballooning program, and evaluate our now in-production, cloud solution on real-world applications. Our tests are designed to measure resilience in the face of several memory spikes and live migrations. In our tests, we add at most 8% overhead, yet can provide a migration speedup of at least 52% for giant VMs with memory intensive applications reaching almost 600 GB.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698545">PCLive: Pipelined Restoration of Application Containers for Reduced Service Downtime</a></h3><ul class="DLauthors"><li class="nameList">Shiv Bhushan Tripathi</li><li class="nameList Last">Debadatta Mishra</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Application containers are widely used in contemporary cloud computing environments. Migration of containers across hosts provides cost-effective cloud management by enabling improved server consolidation, load balancing and enhanced fault tolerance. One of the primary objectives of container migration is to reduce the service downtime of applications hosted in containers. The service downtime depends on performing the migration activities efficiently, specifically from the time the container is stopped on the source host until it is restored and fully functional at the destination host.</p> <p>In this paper, we show that the state-of-the-art pre-copy migration strategy for containers using checkpoint and restore techniques (e.g., CRIU) inflates the downtime due to its inherent limitations in the restoration procedures, particularly for containers with a large memory working set size. We propose PCLive to address this bottleneck using a pipelined restore mechanism. Compared to the baseline CRIU pre-copy migration, PCLive results in up to ~38.8x reduction in restoration time, which leads to a reduction of service downtime by up to ~2.7x for migration of a container hosting the Redis key-value store over a one Gbps network. We also present a comprehensive comparative analysis of the resource cost for the proposed solution, along with additional optimizations, to demonstrate that PCLive can reduce the application downtime in a resource-efficient manner by leveraging its flexible and efficient design choices.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698522">Scheduling for Reduced Tail Task Latencies in Highly Utilized Datacenters</a></h3><ul class="DLauthors"><li class="nameList">Smita Vijayakumar</li><li class="nameList">Anil Madhavapeddy</li><li class="nameList Last">Evangelia Kalyvianaki</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Modern datacenters run diverse workloads that increasingly comprise data-parallel computational jobs. There has been a steady rise in their demand leading to high-volume traffic. To meet these demands, datacenter providers operate their clusters at levels of high utilization. We show that under such conditions, existing schedulers impose large wait times on tail tasks, leading to long job completion time. We propose a new decentralized scheduler, Murmuration, that reduces the total wait time of tasks. It employs multiple communicating schedulers to schedule tasks of jobs such that their start times are as close together as possible, ensuring small tail task completion time and better average job completion time.</p> <p>Our evaluation of Murmuration using publicly available workloads on a real-world cluster shows 15% to 25% faster job completion time than that of the default Kubernetes scheduler for different arrival characteristics. We show that Murmuration scales to incoming workloads by scheduling more than a million tasks in a matter of minutes. We further enhance the design of Murmuration by incorporating queue reordering techniques to order the scheduling and execution of jobs and tasks. Simulations evaluated on two industry workloads show that with queue reordering, Murmuration outperforms other schedulers with a 100× better median job completion time than that of current schedulers.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698566">Krios: Scheduling Abstractions and Mechanisms for Enabling a LEO Compute Cloud</a></h3><ul class="DLauthors"><li class="nameList">Vaibhav Bhosale</li><li class="nameList">Ada Gavrilovska</li><li class="nameList Last">Ketan Bhardwaj</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Low Earth Orbit (LEO) satellites are an important facet of global connectivity providing high speed Internet, cellular, IoT connectivity and so on. Combined with the rich resource availability on each satellite, LEO satellites represent a new, emerging cloud frontier - the LEO Compute Cloud. However, satellite mobility introduces non-trivial challenges when orchestrating applications for a LEO compute cloud, making it harder to deploy applications without increasing the latency and bandwidth costs. In this paper, we identify the concrete challenges in using state-of-the-art terrestrial orchestrators for a LEO compute cloud. We present Krios - a LEO compute cloud orchestration system that hides the complexities introduced by satellite mobility and enables a practical LEO compute cloud. The design of Krios is centered around a novel LEO zones abstraction that allows application providers to specify where their applications should be available. Krios provides crucial system support to enable the LEO zones abstraction, ensuring uninterrupted availability of applications in LEO zones via proactive and predictive application handovers. Our experimental evaluation of Krios with representative applications demonstrates a practical and efficient LEO compute cloud, without requiring any disruptive changes in applications and with modest system overheads. With Krios, LEO orchestration requires just ~1 application instance at a time to maintain the same availability as what prior work achieves by deploying application instances on all satellites or by performing 6-10 times more frequent expensive handovers.</p>
	</div></div>

					<h2>SESSION: When Things Go Wrong in the Cloud</h2>
						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698568">Demystifying the Fight Against Complexity: A Comprehensive Study of Live Debugging Activities in Production Cloud Systems</a></h3><ul class="DLauthors"><li class="nameList">P. C. Sruthi</li><li class="nameList">Zinan Guo</li><li class="nameList">Deming Chu</li><li class="nameList">Zhengyan Chen</li><li class="nameList Last">Yongle Zhang</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Debugging in production cloud systems (or live debugging) is a critical yet challenging task for on-call developers due to the financial impact of cloud service downtime and the inherent complexity of cloud systems. Unfortunately, how debugging is performed, and the unique challenges faced in the production cloud environment have not been investigated in detail.</p> <p>In this paper, we perform the first fine-grained, observational study of 93 real-world debugging experiences of production cloud failures in 15 widely adopted open-source distributed systems including distributed storage systems, databases, computing frameworks, message passing systems, and container orchestration systems. We examine each debugging experience with a fine-grained lens and categorize over 1700 debugging steps across all incidents. Our study provides a detailed picture of how developers perform various diagnosis activities including failure reproduction, anomaly analysis, program analysis, hypothesis formulation, information collection and online experiments.</p> <p>Highlights of our study include: (1) Analyses of the taxonomies and distributions of both live debugging activities and the underlying reasons for hypothesis forking, which confirm the presence of expert debugging strategies in production cloud systems, and offer insights to guide the training of novice developers and the development of tools that emulate expert behavior. (2) The identification of the primary challenge in anomaly detection (or, observability) for end-to-end debugging: the collection of system-specific data (17.1% of data collected). In comparison, nearly all (96%) invariants utilized to detect anomalies are already present in existing monitoring tools. 
(3) The identification of the importance of online interventions (i.e., in-production experiments that alter system execution) for live debugging - they are performed as frequently as information collection - with an investigation of different types of interventions and challenges. (4) An examination of novel debugging techniques developers utilized to overcome debugging challenges inherent to or amplified in cloud systems, which offer insights for the development of enhanced debugging tools.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698534">Deoxys: A Causal Inference Engine for Unhealthy Node Mitigation in Large-scale Cloud Infrastructure</a></h3><ul class="DLauthors"><li class="nameList">Chaoyun Zhang</li><li class="nameList">Randolph Yao</li><li class="nameList">Si Qin</li><li class="nameList">Ze Li</li><li class="nameList">Shekhar Agrawal</li><li class="nameList">Binit R. Mishra</li><li class="nameList">Tri Tran</li><li class="nameList">Minghua Ma</li><li class="nameList">Qingwei Lin</li><li class="nameList">Murali Chintalapati</li><li class="nameList Last">Dongmei Zhang</li></ul><div class="DLabstract"><div style="display:inline">
		<p>The presence of unhealthy nodes in cloud infrastructure signals the potential failure of machines, which can significantly impact the availability and reliability of cloud services, resulting in negative customer experiences. Effectively addressing unhealthy node mitigation is therefore vital for sustaining cloud system performance. This paper introduces Deoxys, a causal inference engine tailored to recommending mitigation actions for unhealthy nodes in cloud systems to minimize virtual machine downtime and interruptions during unhealthy events. It employs double machine learning combined with causal forest to produce precise and reliable mitigation recommendations based solely on limited observational data collected from historical unhealthy events. To enhance the causal inference model, Deoxys further incorporates a policy fallback mechanism based on model uncertainty and action overriding mechanisms to (i) improve the reliability of the system, and (ii) strike a good tradeoff between downtime reduction and resource utilization, thereby enhancing the overall system performance.</p> <p>After deploying Deoxys in a large-scale cloud infrastructure at Microsoft, our observations demonstrate that Deoxys significantly reduces average VM downtime by 53% compared to a legacy policy, while leading to a 49.5% lower VM interruption rate. This substantial improvement enhances the reliability and stability of cloud platforms, resulting in a seamless customer experience.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698508">INS: Identifying and Mitigating Performance Interference in Clouds via Interference-Sensitive Paths</a></h3><ul class="DLauthors"><li class="nameList">Ziwei Huang</li><li class="nameList">Mengyao Xie</li><li class="nameList">Shibo Tang</li><li class="nameList">Zihao Chang</li><li class="nameList">Zhicheng Yao</li><li class="nameList">Yungang Bao</li><li class="nameList Last">Sa Wang</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Identifying and managing performance interference in clouds has long been a critical and challenging task for cloud providers. They keep seeking useful performance indicators from underlying systems to monitor cloud applications accurately. However, state-of-the-art indicators are either sensitive to only a limited set of applications and resource contention or are not robust to continually changing production environments. A practical and efficient indicator for production environments is still lacking.</p> <p>This paper proposes INS, a cloud runtime system that can effectively detect the performance fluctuation of online cloud applications and reallocate resources to curb performance interference. It proposes INSPath as the new performance indicator to describe the degree of performance degradation and pinpoint the resource bottlenecks. Our evaluation of nine widely-used applications demonstrates that INS can detect the SLO violations of applications and identify the resource bottleneck accurately. Meanwhile, INS outperforms state-of-the-art PARTIES with more responsive and effective resource tuning and fewer SLO violations.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698554">TailClipper: Reducing Tail Response Time of Distributed Services Through System-Wide Scheduling</a></h3><ul class="DLauthors"><li class="nameList">Nathan Ng</li><li class="nameList">Abel Souza</li><li class="nameList">Ahmed Ali-Eldin</li><li class="nameList">David Irwin</li><li class="nameList">Don Towsley</li><li class="nameList Last">Prashant Shenoy</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Reducing tail latency has become a crucial issue for optimizing the performance of online cloud services and distributed applications. In distributed applications, there are many causes of high end-to-end tail latency, including operating system delays, request reordering due to fan-out/fan-in, and network congestion. Although recent research has focused on reducing tail latency for individual application components, such as by replicating requests and scheduling, in this paper, we argue for a holistic approach for reducing the end-to-end tail latency across application components. We propose TailClipper, a distributed scheduler that tags each arriving request with an arrival timestamp, and propagates it across the microservices' call chain. TailClipper then uses arrival timestamps to implement an oldest-request-first scheduler that combines global first-come, first-served with a limited form of processor sharing to reduce end-to-end tail latency. In doing so, TailClipper can counter the performance degradation caused by request reordering in multi-tiered and microservices-based applications. We implement TailClipper as a userspace Linux scheduler and evaluate it using cloud workload traces and a real-world microservices application. Compared to state-of-the-art schedulers, our experiments reveal that TailClipper improves the 99th percentile response time by up to 81%, while also improving the mean response time and the system throughput by up to 54% and 29% respectively under high loads.</p>
	</div></div>

					<h2>SESSION: Systems Supporting Machine Learning II</h2>
						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698510">On-demand and Parallel Checkpoint/Restore for GPU Applications</a></h3><ul class="DLauthors"><li class="nameList">Yanning Yang</li><li class="nameList">Dong Du</li><li class="nameList">Haitao Song</li><li class="nameList Last">Yubin Xia</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Leveraging serverless computing for cloud-based machine learning services is on the rise, promising cost-efficiency and flexibility that are crucial for ML applications relying on high-performance GPUs and substantial memory. However, despite modern serverless platforms handling diverse devices like GPUs seamlessly on a pay-as-you-go basis, a longstanding challenge remains: startup latency, a well-studied issue when serverless is CPU-centric. For example, initializing GPU apps with small GPU models, like MobileNet, demands several seconds. For more intricate models such as GPT-2, startup latency can escalate to around 10 seconds, vastly overshadowing the short computation time for GPU-based inference. Prior solutions tailored for CPU serverless setups, like fork() and Checkpoint/Restore, cannot be directly and effectively applied due to differences between CPUs and GPUs.</p> <p>This paper presents gCROP (GPU Checkpoint/Restore made On-demand and Parallel), the first GPU runtime that achieves &lt;100ms startup latency for GPU apps with up to 774 million parameters (3.1GB GPT-2-Large model). The key insight behind gCROP is to selectively restore essential states on demand and in parallel during boot from a prepared checkpoint image. To this end, gCROP first introduces a global service, GPU Restore Server, which can break the existing barrier between restore stages and achieve parallel restore. In addition, gCROP leverages both CPU and GPU page faults, and can restore both CPU and GPU data on demand in a profile-guided order to mitigate costs caused by faults. Moreover, gCROP designs a multi-checkpoint mechanism to increase the common contents among checkpoint images and utilizes deduplication to reduce storage costs. Implementation and evaluations on AMD GPUs show significant improvement in startup latency, 6.4x-24.7x compared with booting from scratch and 3.9x-23.5x over the state-of-the-art method (CRIU).</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698521">MoEsaic: Shared Mixture of Experts</a></h3><ul class="DLauthors"><li class="nameList">Umesh Deshpande</li><li class="nameList">Travis Janssen</li><li class="nameList">Mudhakar Srivatsa</li><li class="nameList Last">Swaminathan Sundararaman</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Mixture of Experts (MoE) models consist of several experts, each specializing in a specific task. During inference, a subset of the experts is invoked based on their relevance to the request. MoE's modular architecture lets users compose their model from popular off-the-shelf experts. This leads to multiple MoE deployments with identical experts. The duplication of experts across model instances results in excessive GPU memory consumption and increased model serving cost. Moreover, since not all experts are invoked for each request, individual experts rarely receive enough requests to exploit the GPUs' computational capabilities, resulting in low GPU utilization. To address these problems, we propose Shared Mixture of Experts in MoEsaic. MoEsaic automatically identifies and deduplicates identical experts across model instances, thus reducing their memory footprint. Moreover, it batches the requests directed toward the identical experts belonging to different clients, which also improves the processing efficiency. We show that for the Mixtral-8x7B model, when compared to deploying dedicated MoE instances, MoEsaic can serve 7X more model instances with little impact on inference performance.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698548">FaPES: Enabling Efficient Elastic Scaling for Serverless Machine Learning Platforms</a></h3><ul class="DLauthors"><li class="nameList">Xiaoyang Zhao</li><li class="nameList">Siran Yang</li><li class="nameList">Jiamang Wang</li><li class="nameList">Lansong Diao</li><li class="nameList">Lin Qu</li><li class="nameList Last">Chuan Wu</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Serverless computing platforms have become increasingly popular for running machine learning (ML) tasks due to their user-friendliness and decoupling from underlying infrastructure. However, auto-scaling to efficiently serve incoming requests remains a challenge, especially for distributed ML training or inference jobs in a serverless GPU cluster. Distributed training and inference jobs are highly sensitive to resource configurations, and demand high model efficiency throughout their lifecycle. We propose FaPES, a FaaS-oriented Performance-aware Elastic Scaling system to enable efficient resource allocation in serverless platforms for ML jobs. FaPES enables flexible resource loaning between virtual clusters for running training and inference jobs. For running inference jobs, servers are reclaimed on demand with minimal preemption overhead to guarantee service level objectives (SLOs); for training jobs, optimal GPU allocation and model hyperparameters are jointly adapted based on an ML-based performance model and a resource usage prediction board, relieving users of model tuning and resource specification. Evaluation on a 128-GPU testbed demonstrates up to 24.8% job completion time reduction and 1.8× goodput improvement, as compared to representative elastic scaling schemes.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698555">KACE: Kernel-Aware Colocation for Efficient GPU Spatial Sharing</a></h3><ul class="DLauthors"><li class="nameList">Bing-Shiun Han</li><li class="nameList">Tathagata Paul</li><li class="nameList">Zhenhua Liu</li><li class="nameList Last">Anshul Gandhi</li></ul><div class="DLabstract"><div style="display:inline">
		<p>GPU spatial sharing among jobs is an effective approach to increase resource utilization and reduce the monetary and environmental costs of running deep learning workloads. While hardware support for GPU spatial sharing already exists, accurately predicting GPU interference between colocated workloads remains a concern. This makes it challenging to improve GPU utilization by sharing the GPU between workloads without severely impacting their performance. Existing approaches to identify and mitigate GPU interference often require extensive profiling and/or hardware modifications, making them difficult to deploy in practice.</p> <p>This paper presents KACE, a lightweight, prediction-based approach to effectively colocate workloads on a given GPU. KACE adequately predicts colocation interference via exclusive kernel metrics using limited training data and minimal training time, eliminating the need for extensive online profiling of each new workload colocation. Experimental results using various training and inference workloads show that KACE outperforms existing rule-based and prediction-based policies by 16% and 11%, on average, respectively, and is within 10% of the performance achieved by an offline-optimal oracle policy.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698557">Pack: Towards Communication-Efficient Homomorphic Encryption in Federated Learning</a></h3><ul class="DLauthors"><li class="nameList">Zeyuan Zuo</li><li class="nameList">Ningxin Su</li><li class="nameList">Baochun Li</li><li class="nameList Last">Teng Zhang</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Federated learning allows multiple clients to collaboratively train a shared model without sharing local private data. It is regarded as privacy-preserving since only model updates are communicated. Unfortunately, it has been shown in the recent literature that model updates transmitted by participating clients can be used by a malicious server in gradient leakage attacks to obtain private training data. To prevent such potential leakage from occurring, it has widely been acknowledged that homomorphic encryption can be used to encrypt these model updates before sending them to the server, which performs computations directly on encrypted data. Although homomorphic encryption has a strong guarantee on privacy, its practical use increases communication overhead by around 17×, even with its most efficient implementation, called CKKS. In this paper, we present Pack, a novel communication-efficient mechanism over CKKS, designed specifically to reduce the communication overhead by a substantial margin. In addition, we propose new error correction and weight filtering mechanisms in Pack to improve the accuracy of the trained model. Compared to vanilla CKKS, Pack reduces the communication overhead by 3.1×, while increasing the accuracy by 5.5% and 2.5% under the i.i.d. and non-i.i.d. settings.</p>
	</div></div>

					<h2>SESSION: The Green Cloud</h2>
						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698556">InferCool: Enhancing AI Inference Cooling through Transparent, Non-Intrusive Task Reassignment</a></h3><ul class="DLauthors"><li class="nameList">Qiangyu Pei</li><li class="nameList">Lin Wang</li><li class="nameList">Dong Zhang</li><li class="nameList">Bingheng Yan</li><li class="nameList">Chen Yu</li><li class="nameList Last">Fangming Liu</li></ul><div class="DLabstract"><div style="display:inline">
		<p>The increasing power consumption of AI inference in modern datacenters has escalated cooling demands significantly, necessitating the adoption of potent cooling approaches like water cooling. Unlike traditional cloud workloads, AI inference has unique characteristics that create substantial gaps in achieving optimal cooling efficiency. In this work, we present the first comprehensive measurement study of AI inference cooling across various models within an industrial-ready scheduling framework, highlighting significant inefficiencies and their causes. To fill the gap while following the fundamental requirements of cooling systems, we explore a new opportunity presented by modern Multi-Instance GPU-enabled inference serving, where the scheduling dimension is naturally orthogonal to the cooling dimension. Building on this insight, we develop InferCool, a cooling middleware designed to enhance cooling efficiency for inference serving through transparent, non-intrusive task reassignment. It includes a streamlined power and temperature prediction approach and a thermal-aware, adaptive application deployment and request scheduling mechanism. Real-world experiments on a water-cooled testbed and a three-node cluster demonstrate that InferCool can reduce the maximum GPU temperature by 5°C across eight A100 GPUs, equivalent to cooling energy savings of about 20%. Importantly, InferCool requires no modifications to existing cooling infrastructures and is compatible with existing scheduling systems.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698516">CDN-Shifter: Leveraging Spatial Workload Shifting to Decarbonize Content Delivery Networks</a></h3><ul class="DLauthors"><li class="nameList">Jorge Murillo</li><li class="nameList">Walid A. Hanafy</li><li class="nameList">David Irwin</li><li class="nameList">Ramesh Sitaraman</li><li class="nameList Last">Prashant Shenoy</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Content Delivery Networks (CDNs) are Internet-scale systems that deliver streaming and web content to users from many geographically distributed edge data centers. Since large CDNs can comprise hundreds of thousands of servers deployed in thousands of global data centers, they can consume a large amount of energy for their operations and thus are responsible for large amounts of Greenhouse Gas (GHG) emissions. As these networks scale to cope with increased demand for bandwidth-intensive content, their emissions are expected to rise further, making sustainable design and operation an important goal for the future. Since different geographic regions vary in the carbon intensity and cost of their electricity supply, in this paper, we consider spatial shifting as a key technique to jointly optimize the carbon emissions and energy costs of a CDN. We present two forms of shifting: spatial load shifting, which operates within the time scale of minutes, and VM capacity shifting, which operates at a coarse time scale of days or weeks. The proposed techniques jointly reduce carbon and electricity costs while considering the performance impact of increased request latency from such optimizations. Using real-world traces from a large CDN and carbon intensity and energy prices data from electric grids in different regions, we show that increasing the latency by 60ms can reduce carbon emissions by up to 35.5%, 78.6%, and 61.7% across the US, Europe, and worldwide, respectively. In addition, we show that capacity shifting can increase carbon savings by up to 61.2%. Finally, we analyze the benefits of spatial shifting and show that it increases carbon savings from added solar energy by 68% and 130% in the US and Europe, respectively.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698531">Accountable Carbon Footprints and Energy Profiling For Serverless Functions</a></h3><ul class="DLauthors"><li class="nameList">Prateek Sharma</li><li class="nameList Last">Alexander Fuerst</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Cloud computing is a significant and growing cause of carbon emissions. Understanding the energy consumption and carbon footprints of cloud applications is a fundamental prerequisite to raising awareness, designing sustainability metrics, and creating targeted system optimizations. In this paper, we address the challenges of providing accurate and full-system (not just CPU) carbon footprints for serverless (FaaS) functions. To the best of our knowledge, this is the first work which develops an energy and carbon metrology framework for FaaS.</p> <p>Carbon footprints require a new approach to energy profiling. We use FaaS workload properties such as locality to develop a simple and practical online statistical disaggregation approach. Our fine-grained per-invocation carbon footprints also include shared hardware and software emissions, and use insights from Shapley values to fairly account for both operational and embodied emissions. Owing to the growing importance of carbon measurement, we develop a new rigorous marginal energy based validation methodology which results in accountable, complete, and fair footprints. Over a wide range of FaaS workloads and hardware platforms, our energy footprints have an accuracy of &gt; 99%.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698542">The Sunk Carbon Fallacy: Rethinking Carbon Footprint Metrics for Effective Carbon-Aware Scheduling</a></h3><ul class="DLauthors"><li class="nameList">Noman Bashir</li><li class="nameList">Varun Gohil</li><li class="nameList">Anagha Belavadi Subramanya</li><li class="nameList">Mohammad Shahrad</li><li class="nameList">David Irwin</li><li class="nameList">Elsa Olivetti</li><li class="nameList Last">Christina Delimitrou</li></ul><div class="DLabstract"><div style="display:inline">
		<p>The rapid increase in computing demand and corresponding energy consumption have focused attention on computing's impact on the climate and sustainability. Prior work proposes metrics that quantify computing's carbon footprint across several lifecycle phases, including its supply chain, operation, and end-of-life. Industry uses these metrics to optimize the carbon footprint of manufacturing hardware and running computing applications. Unfortunately, prior work on optimizing datacenters' carbon footprint often succumbs to the sunk cost fallacy by considering embodied carbon emissions (a sunk cost) when making operational decisions (i.e., job scheduling and placement), which leads to operational decisions that do not always reduce the total carbon footprint.</p> <p>In this paper, we evaluate carbon-aware job scheduling and placement on a given set of servers for several carbon accounting metrics. Our analysis reveals that state-of-the-art carbon accounting metrics that include embodied carbon emissions when making operational decisions can increase the total carbon footprint of executing a set of jobs. We study the factors that affect the added carbon cost of such suboptimal decision-making. We then use a real-world case study from a datacenter to demonstrate how the sunk carbon fallacy manifests itself in practice. Finally, we discuss the implications of our findings in better guiding effective carbon-aware scheduling in on-premise and cloud datacenters.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698544">Exploring the Efficiency of Renewable Energy-based Modular Data Centers at Scale</a></h3><ul class="DLauthors"><li class="nameList">Jinghan Sun</li><li class="nameList">Zibo Gong</li><li class="nameList">Anup Agarwal</li><li class="nameList">Shadi Noghabi</li><li class="nameList">Ranveer Chandra</li><li class="nameList">Marc Snir</li><li class="nameList Last">Jian Huang</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Modular data centers (MDCs), which can be placed right at energy farms and powered mostly by renewable energy, are a flexible and effective approach to lowering the carbon footprint of data centers. However, the main challenge of using renewable energy is the high variability of the power produced, which implies large volatility in powering computing resources at MDCs and degraded application performance due to task evictions and migrations. This makes it challenging for platform operators to decide on MDC deployment.</p> <p>To this end, we present SkyBox, a framework that employs a learning-based approach for platform operators to explore the efficient use of renewable energy with MDC deployment across geographical regions. SkyBox is driven by insights from our study of real-world power traces from a variety of renewable energy farms: the predictable production of renewable energy and the complementary nature of energy production patterns across different renewable energy sources and locations. With these insights, SkyBox uses the coefficient of variation metric to select qualified renewable farms, and then identifies a set of farms with complementary energy production patterns using a subgraph identification algorithm. After that, SkyBox enables smart workload placement and migrations to further tolerate the power variability. Our experiments with real power traces and datacenter workloads show that SkyBox has the lowest carbon emissions compared with existing approaches. SkyBox also minimizes the negative impact of the power variability on cloud applications, making it an effective solution for utilizing renewable energy in modern data centers.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698546">The Hidden Carbon Footprint of Serverless Computing</a></h3><ul class="DLauthors"><li class="nameList">Rohan Basu Roy</li><li class="nameList">Raghavendra Kanakagiri</li><li class="nameList">Yankai Jiang</li><li class="nameList Last">Devesh Tiwari</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Due to the unique aspects of serverless computing like keep-alive and co-location of functions, it is challenging to account for its carbon footprint. This is the first work to introduce the need for systematic methodologies for carbon accounting in the serverless environment, propose new methodologies and in-depth analysis, and highlight how the carbon footprint estimation can vary based on the chosen methodology. It discusses how serverless-specific scheduling choices can impact the tradeoffs between performance and carbon footprint, with an aim toward standardizing methodological choices and identifying opportunities for future improvements.</p>
	</div></div>

					<h2>SESSION: The Basics</h2>
						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698518">uIO: Lightweight and Extensible Unikernels</a></h3><ul class="DLauthors"><li class="nameList">Masanori Misono</li><li class="nameList">Peter Okelmann</li><li class="nameList">Charalampos Mainas</li><li class="nameList Last">Pramod Bhatotia</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Unikernels specialize operating systems by tailoring the kernel for a specific application at compile time. While the specialized library OS approach provides a smaller OS image---thus improving the bootup process, performance, migration costs, and reliable/trusted computing base---at the same time, unikernels lack run-time extensibility, which is imperative to support "on-demand" auxiliary tasks and tools, e.g., debugging, monitoring, re-configuration, and system management and deployment in a typical cloud environment. Consequently, unikernels present a fundamental trade-off between slimness of the OS image size at the compile time vs. flexibility of supported auxiliary functionality at the run-time.</p> <p>This work strives to balance this trade-off by keeping the unikernel system image as minimal as possible to solely support the application functionality in the "common case", while providing "on-demand" extensibility for auxiliary tasks at run-time. The key challenge is to support run-time extensibility through a generic interface in a safe manner.</p> <p>To this end, the paper presents uIO---a "safe overlay" abstraction to provide runtime extensibility in unikernels, while maintaining the unikernel benefits. In particular, uIO leverages a generic VirtIO-based interface to provide an overlay for auxiliary programs, i.e., users can load external programs into the unikernels' address space and run them, i.e., "on-demand" extensibility through a generic file system interface. To provide safe execution within an overlay, uIO provides isolation mechanisms leveraging hardware-assisted memory isolation (MPK) and language-runtime-based execution (eBPF). We implement a prototype of uIO based on Unikraft and demonstrate its applicability to support a range of auxiliary use cases. uIO incurs negligible performance overheads for application execution in the common case while providing run-time extensibility to support auxiliary use cases.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698511">Racos: Improving Erasure Coding State Machine Replication using Leaderless Consensus</a></h3><ul class="DLauthors"><li class="nameList">Jonathan Zarnstorff</li><li class="nameList">Lucas Lebow</li><li class="nameList">Christopher Siems</li><li class="nameList">Dillon Remuck</li><li class="nameList">Colin Ruiz</li><li class="nameList Last">Lewis Tseng</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Cloud storage systems often adopt state machine replication (SMR) to ensure reliability and availability. Most SMR systems use "full-copy" replication across all nodes, which leads to degraded performance for data-intensive workloads, due to high disk and network I/O costs. Erasure coding has recently been integrated with leader-based SMR systems to reduce the costs, e.g., RS-Paxos, CRaft, HRaft, and FRaft. However, these systems still have bottlenecks at the leader, limiting their performance when handling large datasets.</p> <p>To address the bottlenecks, this paper proposes Racos, which integrates erasure coding with a recent leaderless SMR protocol, Rabia. Unlike Paxos or Raft, Rabia uses a leaderless design for reaching consensus, making it suitable for our purpose. Compared to a leader-based design, Racos distributes workload evenly, alleviating the bottlenecks.</p> <p>We integrate our system Racos with etcd, a distributed key-value storage that powers many production systems including Kubernetes. Our evaluation, using YCSB, shows that Racos outperforms the closest competitors by up to 2.26x in throughput within local-area networks and reduces median latency by up to 76.8% in a wide variety of workloads.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698514">Occam's Razor for Distributed Protocols</a></h3><ul class="DLauthors"><li class="nameList">Ziliang Lai</li><li class="nameList">Fan Cui</li><li class="nameList">Hua Fan</li><li class="nameList">Eric Lo</li><li class="nameList">Wenchao Zhou</li><li class="nameList Last">Feifei Li</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Optimizing distributed protocols has traditionally been a real pain, requiring experts to figure out where improvements can be made, along with rigorous correctness proofs and meticulous implementation. This paper presents a theory to systematize this process. The proposed theory can optimize any existing distributed protocol while preserving all its original quality attributes (e.g., generality, correctness). Crucially, applying the optimizations derived from this theory does not necessitate touching the original implementation --- all you need is just to bolt on a few new message adapters. Case studies demonstrate the effectiveness of this approach. For instance, applying the theory to optimize Spanner's atomic commit protocol requires a mere 4 new message adapters, yet improves its latency by up to 51% and peak throughput by up to 56%.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698550">VWeiST: A Scalable and Efficient Proof-of-Stake Blockchain Consensus</a></h3><ul class="DLauthors"><li class="nameList">Hang Xiong</li><li class="nameList">Cheng Qu</li><li class="nameList Last">Jing Li</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Due to the susceptibility to nothing-at-stake and long-range attacks, the Proof-of-Stake consensus faces challenges in securely and efficiently confirming blocks. We propose a new Proof-of-Stake consensus, Voted Weightest Sub-Tree (VWeiST) consensus. It assigns weights to each block through voting, and nodes confirm blocks by calculating the probability that each block's weight can be exceeded by other competitors. We employ a multi-round voting approach, where a small number of nodes are randomly selected as the committee nodes to vote in each round. This approach results in particularly low communication overhead per block, allowing for scalability to a large number of nodes. Compared to other consensus mechanisms, our mechanism requires fewer rounds of voting to confirm a block, offering advantages in throughput and transaction latency. In the experiments, VWeiST achieves latency and round reductions down to 40% and 29% of the comparison method's levels at most. Furthermore, we theoretically prove that the consensus ensures liveness and probabilistic safety.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698562">Securing a Multiprocessor KVM Hypervisor with Rust</a></h3><ul class="DLauthors"><li class="nameList">Yu-Hsun Chiang</li><li class="nameList">Wei-Lin Chang</li><li class="nameList">Shih-Wei Li</li><li class="nameList Last">Jan-Ting Tu</li></ul><div class="DLabstract"><div style="display:inline">
		<p>As computations have increasingly shifted to virtual machines (VMs) running on a hypervisor, the security of the hypervisor is of critical concern. Rust has gained significant traction among developers due to its software safety guarantees and performance efficiency. This work explores building on Rust's safety features to construct a secure KVM hypervisor. We retrofit KVM to incorporate a Rust-based core to protect virtual machines. We build on Rust's type and lifetime system in a novel way to secure the core's memory accesses in a concurrent environment. Our resulting KVM implementation, KrustVM, incorporates a data race and deadlock-free core to protect VM confidentiality and integrity against privileged attackers who control the host Linux kernel while preserving KVM's commodity features and performance.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698558">SURE: Secure Unikernels Make Serverless Computing Rapid and Efficient</a></h3><ul class="DLauthors"><li class="nameList">Federico Parola</li><li class="nameList">Shixiong Qi</li><li class="nameList">Anvaya B. Narappa</li><li class="nameList">K. K. Ramakrishnan</li><li class="nameList Last">Fulvio Risso</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Current serverless platforms introduce non-trivial overheads when chaining and orchestrating loosely-coupled microservices. Containerized function runtimes are also constrained by insufficient isolation and excessive startup time. This motivates our exploration of a more efficient, secure, and rapid serverless design. We describe SURE, a unikernel-based serverless framework for fast function startup, equipped with a high-performance and secure data plane. SURE's data plane supports distributed zero-copy communication via the seamless interaction between zero-copy protocol stack (Z-stack) and local shared memory processing. To establish a lightweight service mesh, SURE uses library-based sidecars instead of individual userspace sidecars. We leverage Intel's Memory Protection Keys (MPK) as a lightweight capability to ensure safe access to the shared memory data plane. It also isolates the Trusted Computing Base (TCB) components in SURE's function runtime (e.g., library-based sidecar, scheduler, etc.) from untrusted user code, while preserving the efficient single-address-space nature of unikernels. In particular, SURE prevents unintended privilege escalation involving MPK with an enhanced TCB. These combined efforts create a more secure and robust data plane while improving throughput up to 79X over Knative, a representative open-source serverless platform.</p>
	</div></div>

					<h2>SESSION: Bits on Disk</h2>
						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698528">TianMen: a DPU-based storage network offloading structure for disaggregated datacenters</a></h3><ul class="DLauthors"><li class="nameList">Weiyue Zhao</li><li class="nameList">Jingya Wu</li><li class="nameList">Wenyan Lu</li><li class="nameList">Xiaowei Li</li><li class="nameList Last">Guihai Yan</li></ul><div class="DLabstract"><div style="display:inline">
		<p>In modern disaggregated datacenters, the storage network which interconnects the compute and memory pools becomes the performance bottleneck. The high-end RDMA devices cannot meet the complex requirements of storage networks, due to the limited RDMA semantics and throughput. Existing solutions essentially follow the monolithic design, so they suffer from underutilized resources and high scaling costs. In this paper, we design TianMen, which offloads the storage network by extending RDMA semantics and customizing communication hardware structure. Specifically, we use DPU as the infrastructure, leveraging the rich storage and compute resources. TianMen enables a fully disaggregated storage system that bypasses the server-side CPU, and supports elastic resource pools. Experimental results show that, compared with state-of-the-art solutions: 1) TianMen achieves 1 RTT for GET/PUT operations and up to 6× access acceleration; 2) TianMen provides CPU bypass storage network management, including 3.2× speedup of metadata consistency management, per-request load balancing, and fault recovery latency in the tens of microseconds; 3) TianMen saturates the communication bandwidth when processing small payload, increasing the bandwidth utilization by 34.2%. TianMen also achieves 2.27× throughput compared to the commercial RNICs.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698507">H2C-Dedup: Reducing I/O and GC Amplification for QLC SSDs from the Deduplication Metadata Perspective</a></h3><ul class="DLauthors"><li class="nameList">Yunsheng Dong</li><li class="nameList">Boju Chen</li><li class="nameList">Yanqi Pan</li><li class="nameList">Xiangyu Zou</li><li class="nameList Last">Wen Xia</li></ul><div class="DLabstract"><div style="display:inline">
		<p>QLC SSDs have gained increasing popularity in cloud computing, PCs, and smartphones due to their low prices and high density, but they suffer from extremely limited endurance. Deduplication can convert redundant chunk writes into fine-grained metadata updates, thereby promising to alleviate QLC wear. Nevertheless, our observation shows that existing deduplication approaches cause even more I/Os than non-deduplication systems. We find that the amplification comes from two sources: (1) I/O amplification due to the mismatched granularity between SSD I/O size (e.g., 4--16 KiB page) and deduplication metadata I/O size, and (2) SSD garbage collection (GC) amplification due to deduplication metadata updates for eliminating redundant chunks (i.e., incrementing the reference count). To address the above problem, this paper proposes H2C-Dedup, which employs two essential techniques. First, to address I/O amplification, the cold2hot-heating technique utilizes a log-structured metadata I/O scheme, which ensures that the deduplication metadata is not flushed until enough of it has accumulated in the I/O cache to match the OS I/O granularity. Second, to address GC amplification, the hot2cold-suppression technique divides metadata into hot (e.g., to store reference count) and cold (e.g., to store fingerprint) segments, delta-encoding the hot entries to ensure both hot and cold metadata will not be modified once they are durable. As a result, H2C-Dedup significantly reduces the deduplication-induced I/O and GC amplification. We implement H2C-Dedup based on F2FS. Extensive experiments on the FEMU platform using microbenchmarks and real-world traces indicate that H2C-Dedup can extend lifespan by up to 3.3× and 3.4× while improving I/O performance by 26% and 37% compared to SmartDedup and HF-Dedupe, respectively.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698539">RomeFS: A CXL-SSD Aware File System Exploiting Synergy of Memory-Block Dual Paths</a></h3><ul class="DLauthors"><li class="nameList">Yekang Zhan</li><li class="nameList">Haichuan Hu</li><li class="nameList">Xiangrui Yang</li><li class="nameList">Shaohua Wang</li><li class="nameList">Qiang Cao</li><li class="nameList">Hong Jiang</li><li class="nameList Last">Jie Yao</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Compute eXpress Link (CXL) based Solid-State Drives (CXL-SSDs), such as the Samsung CMM-H model, promise to offer CXL.mem memory and CXL.io block dual-mode interfaces. Nonetheless, whether and how cloud applications with diverse and varying access patterns benefit from such dual-mode CXL-SSDs remains an open question for academia and industry.</p> <p>This paper proposes RomeFS, the first CXL-SSD aware file system, handling file operations by synergistically yet preferentially utilizing the complementary CXL.mem and CXL.io data paths. To this end, RomeFS presents key enabling techniques including 1) dual-path data layout to statically partition metadata and file data into CXL.mem and CXL.io data-zones respectively; 2) dual-path access for file data using the two data paths synergistically at runtime; 3) hybrid parallel file indexing for efficient per-file mapping to locate dispersed file data across the two data paths; 4) data defragmentation to merge dispersed file data to the CXL.io data-zone; and 5) metadata journaling and synergistic dual-path transactional write to ensure crash consistency with low overhead. We implement and evaluate RomeFS under two emulated hardware platforms. The experiments show that RomeFS outperforms state-of-the-art block-based file systems and PM-based file systems by up to 14.24× and 4.89× respectively.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698538">SmartGraph: A Framework for Graph Processing in Computational Storage</a></h3><ul class="DLauthors"><li class="nameList">Soheil Khadirsharbiyani</li><li class="nameList">Nima Elyasi</li><li class="nameList">Armin Haj Aboutalebi</li><li class="nameList">Chun-Yi Liu</li><li class="nameList">Changho Choi</li><li class="nameList Last">Mahmut Taylan Kandemir</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Graph processing plays a pivotal role in numerous large-scale applications, including social and transportation networks. One of the primary challenges in handling large-scale graph data is its tendency to surpass DRAM capacities. Conventional methods focus on minimizing I/O latency by decreasing disk I/O requests via predictive value calculations. However, these techniques often suffer from inefficient partitioning strategies that elevate DRAM needs, underutilized predictive calculations, and considerable synchronization overheads.</p> <p>In our research, we introduce and assess SmartGraph, a new graph partitioning and processing framework that is optimized for both CPU-driven systems and near-storage processing units, such as SmartSSDs. SmartGraph is designed to enhance data-flow within and between processing iterations, drastically reducing the execution latency of graph algorithms and removing synchronization overheads. This framework is especially advantageous in cloud environments, where scalability and efficient data management are paramount. Our experimental findings demonstrate that SmartGraph achieves an average improvement of 1.27x across four graph applications compared to the state-of-the-art framework, LUMOS, when tested on SmartSSDs using datasets like Friendster, LiveJournal, and Twitter. Our empirical analysis underscores the benefits of integrating SmartSSD technology with cloud-based graph processing to boost performance.</p>
	</div></div>

					<h2>SESSION: In the Cloud</h2>
						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698520">ConMonitor: Lightweight Container Protection with Virtualization and VM Functions</a></h3><ul class="DLauthors"><li class="nameList">Shaowen Xu</li><li class="nameList">Qihang Zhou</li><li class="nameList">Zhicong Zhang</li><li class="nameList">Xiaoqi Jia</li><li class="nameList">Donglin Liu</li><li class="nameList">Heqing Huang</li><li class="nameList">Haichao Du</li><li class="nameList Last">Zhenyu Song</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Containers are widely used in multi-tenant cloud computing for their ease of deployment, minimal overhead, and fast start-up. However, the intrinsic shared kernel model of containers poses significant security threats, putting confidentiality and integrity at risk from co-located containers or a compromised OS. Researchers have proposed various methods to protect containers from an untrusted OS, but few consider both universality and efficiency. In this paper, we present ConMonitor, a lightweight and efficient container protection architecture. ConMonitor protects the security of container application data by introducing compact virtualization software, called ConVisor, as a trusted computing base. ConVisor enforces isolation of the physical memory between containers and the kernel, and monitors the sensitive operations performed by the OS. To ensure the security of ConMonitor, we implement a Container Guardian to serve as an intermediary for the kernel, managing sensitive operations. We also leverage the VMFUNC feature to achieve fast context switching, thereby mitigating the performance penalty associated with frequent context switching. We have implemented ConMonitor on Intel CPUs with Virtualization Technology, and the evaluation results show that ConMonitor can protect the security of container applications with a negligible performance overhead.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698536">ByteMQ: A Cloud-native Streaming Data Layer in ByteDance</a></h3><ul class="DLauthors"><li class="nameList">Yancan Mao</li><li class="nameList">Ruohang Yin</li><li class="nameList">Liyuan Lei</li><li class="nameList">Peng Ye</li><li class="nameList">Shengfu Zou</li><li class="nameList">Shizheng Tang</li><li class="nameList">Yunzhe Guo</li><li class="nameList">Ye Yuan</li><li class="nameList">Xiaochen Yu</li><li class="nameList">Bo Wan</li><li class="nameList">Yunfei Gong</li><li class="nameList">Changli Gao</li><li class="nameList">Guanghui Zhang</li><li class="nameList">Jian Shen</li><li class="nameList">Rui Shi</li><li class="nameList Last">Richard T. B. Ma</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Real-time streaming data is generated in high volumes and consumed for statistical and analytical purposes, requiring efficient and effective management by Message Queuing Systems (MQS) that ensure high throughput and low latency. ByteDance relies extensively on MQS to handle its massive streaming data across various applications. However, existing MQS solutions often fall short of meeting ByteDance's high-volume, diverse requirements. To address these challenges, we propose ByteMQ (BMQ), a cloud-native streaming data layer designed to manage ByteDance's extensive streaming data needs efficiently in the cloud. BMQ features three key designs: 1) separation of messaging and storage, utilizing ByteDance's Federated Distributed File System (DFS) for high-performance data storage; 2) adaptive resource scheduling to balance workloads and redistribute resources across multiple availability zones; and 3) historical data restructuring to support offline applications with efficient structured data management. ByteDance has migrated 99.76% of its Kafka clusters to BMQ infrastructure, achieving about a 70% reduction in resource costs. This paper shares our journey of designing and implementing BMQ, providing insights that may benefit other organizations facing similar challenges.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698537">Dynamic Idle Resource Leasing To Safely Oversubscribe Capacity At Meta</a></h3><ul class="DLauthors"><li class="nameList">Nishant Gupta</li><li class="nameList">Iyswarya Narayanan</li><li class="nameList">Shivam Handa</li><li class="nameList">Sayak Chakraborti</li><li class="nameList">Pankit Thapar</li><li class="nameList">Baohua Shan</li><li class="nameList">Ariel Rao</li><li class="nameList">Yuanlai Liu</li><li class="nameList">Pengyuan Wang</li><li class="nameList">Yuqing Wu</li><li class="nameList">Qingyi Gao</li><li class="nameList">Chris Chao-Chun Cheng</li><li class="nameList">Sihan You</li><li class="nameList">Louis Huang</li><li class="nameList">Jingyuan Fan</li><li class="nameList">Kenny Yu</li><li class="nameList">Kevin Lin</li><li class="nameList">Tengfei Mu</li><li class="nameList">Parth Malani</li><li class="nameList">Haiying Wang</li><li class="nameList">Trey Lu</li><li class="nameList Last">Peter Zhang</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Meta maintains additional capacity within its infrastructure to ensure high availability for business workloads, accommodating user growth, temporal traffic variations, and unforeseen regional failures. However, this strategic choice inherently leads to underutilization of resources. We employ oversubscription as an effective strategy to mitigate infrastructure underutilization.</p> <p>To ensure safe oversubscription, our design must guarantee that it does not result in capacity shortages for business workloads. To encourage adoption, (i) our system must also support workloads with stringent availability Service Level Objectives (SLOs), and (ii) usability is important, as service owners should be able to seamlessly utilize repurposed capacity with minimal code and configuration modifications.</p> <p>To address these challenges, we first present a system that dynamically identifies idle capacity in the fleet, which can be safely repurposed. We next present a Dynamic Resource Leasing Platform that transparently leases idle capacity with SLOs, minimizing service owner involvement. We discuss our design choices and share our experience in deploying this system across Meta's infrastructure.</p> <p>This system has been in production for 3.5 years, supporting user-facing production workloads and real-time internal workloads that require capacity availability SLOs, and offline workloads. It has served as a vital supply source during capacity shortages and overload scenarios. The system has enabled safe oversubscription of the infrastructure, and reduced our physical footprint by approximately 25% on a scale of millions of machines.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698547">Byways: High-Performance, Isolated Network Functions for Multi-Tenant Cloud Servers</a></h3><ul class="DLauthors"><li class="nameList">Xinyu Han</li><li class="nameList">Yuan Gao</li><li class="nameList">Gabriel Parmer</li><li class="nameList Last">Timothy Wood</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Network functions (NFs) have become pervasive in data centers as a means to monitor and transform traffic as it flows between services. Softwarization of the network has further added to the diversity of functions that can be deployed, yet managing the performance, efficiency, tenant-customizability, and security of these functions remains a major challenge. We present Byways, an abstraction that provides facilities to safely deploy NFs alongside end-host VMs in a multi-tenant cloud environment. Byways guarantee strict isolation between the host system, the network functions, and VM-based cloud applications, while still maintaining high performance. A Byway manages a specific set of services, and an associated NF only processes flows associated with those services, using per-Byway resources (e.g., processing time). This separation of end-host traffic across Byways provides strong fault isolation: a failing NF does not impact other services. Byways augment this isolation with per-Byway access rights that restrict an NF's access (e.g., read, write, drop) to the flow, limiting the impact of a faulty NF on even its own services.</p> <p>We have implemented BywayOS, a μ-kernel instantiation of Byways, and evaluated its performance, efficiency, and isolation properties compared to state-of-the-art virtual machine networking technologies. A Byway processing memcached traffic through an isolated NF demonstrates throughput and latency competitive with, and often better than, Linux host performance (i.e., without an NF or a VM), and throughput 1.25x-6.43x higher than other host NF+VM technologies, while offering stronger isolation and a trusted computing base (TCB) more than 20x smaller.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698551">Cloud-native Workflow Scheduling using a Hybrid Priority Rule, Dynamic Resource Allocation, and Dynamic Task Partition</a></h3><ul class="DLauthors"><li class="nameList">Jungeun Shin</li><li class="nameList">Diana Arroyo</li><li class="nameList">Asser Tantawi</li><li class="nameList">Chen Wang</li><li class="nameList">Alaa Youssef</li><li class="nameList Last">Rakesh Nagi</li></ul><div class="DLabstract"><div style="display:inline">
		<p>As cloud-native workflow orchestration tools become increasingly important for complex data science workloads, there is a growing need for more efficient scheduling. Existing cloud schedulers rely on basic heuristics and user choice for task partitioning for parallel computing, leading to under-utilization of cluster resources and prolonged job completion times. To address this, we propose a novel workflow scheduling algorithm that leverages workflow characteristics to enhance resource utilization and reduce weighted job completion time. The algorithm combines three sub-algorithms, each reflecting a distinct aspect of the scheduling strategy: 1) a hybrid Maximum Children (MC)-Weighted Shortest Critical Path Time (WSCPT) rule, which alternates between two heuristics, MC and WSCPT, that prioritize jobs based on workflow structure and critical path, respectively, with the choice between them adjusted dynamically according to the cluster queue size; 2) Dynamic Resource Allocation (DRA), which dynamically adjusts the number of executors assigned to each workflow; and 3) Dynamic Task Partition (DTP), which autonomously determines the task parallelism level. We tested our algorithm with extensive experiments on various workflow types using Spark-imitated simulation. Our algorithm outperformed other schedulers, including learning-based models, reducing the combined performance of average job completion time and makespan by 21-47% for unweighted workflows and reducing weighted job completion time by at least 50% for weighted workflows.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698552">Streamlining Cloud-Native Application Development and Deployment with Robust Encapsulation</a></h3><ul class="DLauthors"><li class="nameList">Pawissanutt Lertpongrujikorn</li><li class="nameList">Hai Duc Nguyen</li><li class="nameList Last">Mohsen Amini Salehi</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Current serverless abstractions (e.g., FaaS) poorly support non-functional requirements (e.g., QoS and constraints), are provider-dependent, and are incompatible with other cloud abstractions (e.g., databases). As a result, application developers have to undergo numerous rounds of development and manual deployment refinements to finally achieve their desired quality and efficiency. In this paper, we present Object-as-a-Service (OaaS), a novel serverless paradigm that borrows object-oriented programming concepts to encapsulate business logic, data, and non-functional requirements into a single deployment package, thereby streamlining provider-agnostic cloud-native application development. We also propose a declarative interface for the non-functional requirements of applications that relieves developers from daunting refinements to meet their desired QoS and deployment constraint targets. We realized the OaaS paradigm through a platform called Oparaca and evaluated it against various real-world applications and scenarios. The evaluation results demonstrate that Oparaca can enhance application performance by 60× and improve reliability by 50× through latency, throughput, and availability enforcement, all with remarkably less development and deployment time and effort.</p>
	</div></div>

					<h2>SESSION: Algorithms and Applications</h2>
						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698517">Komet: A Serverless Platform for Low-Earth Orbit Edge Services</a></h3><ul class="DLauthors"><li class="nameList">Tobias Pfandzelter</li><li class="nameList Last">David Bermbach</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Low-Earth orbit satellite networks can provide global broadband Internet access using constellations of thousands of satellites. Integrating edge computing resources in such networks can enable global low-latency access to compute services, supporting end users in rural areas, remote industrial applications, or the IoT. To achieve this, resources must be carefully allocated to various services from multiple tenants. Moreover, applications must navigate the dynamic nature of satellite networks, where orbital mechanics necessitate frequent client hand-offs. Therefore, managing applications on the low-Earth orbit edge will require the right platform abstractions.</p> <p>We introduce Komet, a serverless platform for low-Earth orbit edge computing. Komet integrates Function-as-a-Service compute with data replication, enabling on-demand elastic edge resource allocation and frequent service migration against satellite orbital trajectories to keep services deployed in the same geographic region. We implement Komet as a proof-of-concept prototype and demonstrate how its abstractions can be used to build low-Earth orbit edge applications with high availability despite constant mobility. Further, we propose simple heuristics for service migration scheduling in different application scenarios and evaluate them in simulation based on our experiment traces, showing the trade-off between selecting an optimal satellite server at every instance and minimizing service migration frequency.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698526">A Data Optimizer for Region-Aware Self-describing Files in Scientific Computing</a></h3><ul class="DLauthors"><li class="nameList">Yanjie Song</li><li class="nameList">Tianyuan Wu</li><li class="nameList">Yuanhao Li</li><li class="nameList">Guancheng Li</li><li class="nameList">Yuchen Liu</li><li class="nameList">Shu Yin</li><li class="nameList">Wei Xue</li><li class="nameList Last">Junchao Wang</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Acquiring data from scientific simulations for analytical purposes is inherently challenging due to the complex and irregularly shaped regions within which the data resides, particularly when using self-describing data formats. The process of region-based data distillation becomes even more arduous when employing persistent memory or parallel file systems. To tackle this challenge, we introduce RASTER (Region-Aware Self-describing daTa optimizER), a lightweight middleware designed for region-aware data preprocessing. RASTER dynamically reorganizes data into variable groups based on regional identifiers during runtime, thereby eliminating the need for sequential searches to locate the required data. We have developed a prototype of RASTER and successfully integrated it into three distinct computing environments: a single-node server equipped with Intel® Optane™ DC persistent memory, the Huawei® OceanStor cloud storage platform, and the Sunway TaihuLight supercomputer. We then conducted a thorough evaluation of the RASTER prototype on the latter two platforms using a real-world scientific application, CESM (Community Earth System Model). Our experimental results demonstrate that RASTER enhances data acquisition performance by up to 2.83× and achieves a 2.36× speedup over conventional netCDF and the state-of-the-art ADIOS2. Additionally, RASTER significantly reduces memory usage by up to 400%, showcasing its scalability potential.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698540">Rethinking State Management in Actor Systems for Cloud-Native Applications</a></h3><ul class="DLauthors"><li class="nameList">Yijian Liu</li><li class="nameList">Rodrigo Laigner</li><li class="nameList Last">Yongluan Zhou</li></ul><div class="DLabstract"><div style="display:inline">
		<p>The actor model has gained increasing popularity. However, it lacks support for complex state management tasks, such as enforcing foreign key constraints and ensuring data replication consistency across actors. These are crucial properties in partitioned application designs, such as microservices. To fill this gap, we start by analyzing the key impediments in state-of-the-art actor systems. We find it difficult for developers to express complex data relationships across actors and reason about the impact of state updates on performance due to opaque state management abstractions. To solve this conundrum, we develop SmSa, a novel data management layer for actor systems, allowing developers to declare data dependencies that cut across actors, including foreign keys, data replications, and other dependencies. SmSa can transparently enforce the declared dependencies, reducing the burden on developers. Furthermore, SmSa employs novel logging and concurrency control algorithms to support transactional maintenance of data dependencies.</p> <p>We demonstrate that SmSa can support core data management tasks where dependencies across components appear frequently without jeopardizing application logic expressiveness and performance. Our experiments show that SmSa significantly reduces the logging overhead and leads to an increased concurrency level, improving by up to 2X the performance of state-of-the-art deterministic scheduling approaches. As a result, SmSa will make it easier to design and implement highly partitioned and distributed applications.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698524">IncBoost: Scaling Incremental Graph Processing for Edge Deletions and Weight Updates</a></h3><ul class="DLauthors"><li class="nameList">Xizhe Yin</li><li class="nameList">Zhijia Zhao</li><li class="nameList Last">Rajiv Gupta</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Incremental query evaluation is key to efficiently processing rapidly changing graph data. By focusing on the parts of the query results affected by updates, it avoids unnecessary computations, allowing for faster query evaluation. While this technique works well in the cases of edge insertions, its benefit quickly diminishes when the volumes of edge deletions and edge weight updates increase.</p> <p>To address the above scalability issue, this work introduces several techniques for handling large update batches that include many edge deletions and weight updates. First, for edge deletions, this work introduces a bottom-up dependency tracing method to identify the affected vertices. Unlike the existing top-down tracing, it completely avoids traversing the underlying graph and is thus more scalable for large deletion batches. Second, for edge weight updates, existing graph systems treat each weight change as an edge deletion (with old weight) followed by an edge insertion (with new weight). This "two-round" method is computationally excessive. This work shows that it is, in fact, possible to handle weight updates directly. Finally, this work shows the benefits of adjusting the processing strategy according to the update volume. We integrated the above ideas into a graph system called IncBoost. Based on our evaluation, IncBoost can scale incremental query evaluation to large update batches that represent 30-60% of the graph size. By contrast, the state-of-the-art streaming graph system (RisGraph) typically fails to yield benefits when the batch size reaches 5-15% of the graph size. Regarding the absolute processing time, IncBoost consistently outperforms RisGraph with 3.1× and 5.2× speedups for edge deletions and weight updates on large batches, respectively.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698565">Memory Management in Complex Join Queries: A Re-evaluation Study</a></h3><ul class="DLauthors"><li class="nameList">Shiva Jahangiri</li><li class="nameList">Michael J. Carey</li><li class="nameList Last">Johann-Christoph Freytag</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Efficient multi-join query processing is crucial but remains a complex, ongoing challenge for high-performance database management systems (DBMSs). This paper studies the impact of different memory distribution techniques among join operators on different classes of multi-join query plans under different assumptions regarding memory availability and storage devices such as HDD and SSD on Amazon Web Services (AWS). We re-evaluate the results of one of the early impactful studies from the 1990s that was originally done using a simulator for the Gamma database system.</p> <p>The main goal of our study is to scientifically re-evaluate and build upon previous studies whose results have become the basis for the design of past and modern database systems, and to provide a solid foundation for understanding basic "join physics", which is essential for eventually designing a resource-based scheduler for concurrent complex workloads.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698560">FAAStloop: Optimizing Loop-Based Applications for Serverless Computing</a></h3><ul class="DLauthors"><li class="nameList">Shruti Mohanty</li><li class="nameList">Vivek M. Bhasi</li><li class="nameList">Myungjun Son</li><li class="nameList">Mahmut Taylan Kandemir</li><li class="nameList Last">Chita Das</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Serverless computing has garnered significant interest for executing High-Performance Computing (HPC) applications in recent years, attracting attention for its elastic scalability, reduced entry barriers, and pay-per-use pricing model. Specifically, highly parallel HPC apps can be divided and offloaded to multiple Serverless Functions (SFs) that execute their respective tasks concurrently and, finally, their results are stored/aggregated. While state-of-the-art user-side serverless frameworks have attempted to fine-tune task division amongst the SFs to optimize for performance and/or cost, they have either used static task division parameters or have only focused on minimizing the number of SFs through task packing. However, these methods treat the HPC code as a black box and usually require significant manual intervention to find the optimal task division. Since a significant portion of HPC applications have a loop structure, in this work, we try to answer the following two questions: (i) Can modifying the loop structure in the HPC code, originally optimized for monolithic (non-serverless) frameworks, enhance performance and reduce costs in a serverless architecture? and (ii) Can we develop a framework that allows for an efficient transition of monolithic code to serverless, with minimum user input?</p> <p>To this end, we propose a novel framework, FAAStloop, which intelligently employs loop-based optimizations (as well as task packing) in SF containers to optimally execute HPC apps across SFs. FAAStloop chooses the relevant optimization parameters using statistical models (constructed via app profiling) that are able to predict the relevant performance/cost metrics as a function of our choice of parameters. Our extensive experimental evaluation of FAAStloop on the AWS Lambda platform reveals that our framework outperforms state-of-the-art works by up to 3.3× and 2.1×, in terms of end-to-end execution latency and cost, respectively.</p>
	</div></div>

					<h2>SESSION: Systems Supporting Machine Learning III: Training</h2>
						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698535">Distributed Training of Large Language Models on AWS Trainium</a></h3><ul class="DLauthors"><li class="nameList">Xinwei Fu</li><li class="nameList">Zhen Zhang</li><li class="nameList">Haozheng Fan</li><li class="nameList">Guangtai Huang</li><li class="nameList">Mohammad El-Shabani</li><li class="nameList">Randy Huang</li><li class="nameList">Rahul Solanki</li><li class="nameList">Fei Wu</li><li class="nameList">Ron Diamant</li><li class="nameList Last">Yida Wang</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Large language models (LLMs) are ubiquitously powerful but prohibitively expensive to train, often requiring thousands of compute devices, typically GPUs. To reduce the cost of training LLMs for customers, Amazon Web Services (AWS) launched the Amazon EC2 trn1 instances, powered by AWS Trainium, Amazon's homegrown deep learning accelerator, as an alternative for distributed LLM training. The trn1 instances provide a high-performance LLM training solution at a lower cost compared to their GPU-based counterpart, the p4d instances, which are powered by Nvidia A100 GPUs. This paper describes the design and development of the Neuron Distributed Training Library, a component of the AWS Neuron SDK, which enables distributed training of large language models on AWS Trainium. The Neuron Distributed Training Library supports a variety of existing distributed training techniques with unified interfaces, and provides features to address trn1-specific challenges as well. Our evaluation shows that trn1 instances, specifically the trn1.32xlarge, achieve better or comparable performance (up to 24.6% improvement) while offering significantly lower costs (up to 46.3% cost saving) in selected workloads when compared to p4d.24xlarge instances. As a result, AWS Trainium has been adopted for training numerous external and internal models, showcasing its high performance and cost-effectiveness. Several supported open-source LLMs are accessible via HuggingFace Optimum Neuron.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698541">Near-Lossless Gradient Compression for Data-Parallel Distributed DNN Training</a></h3><ul class="DLauthors"><li class="nameList">Xue Li</li><li class="nameList">Cheng Guo</li><li class="nameList">Kun Qian</li><li class="nameList">Menghao Zhang</li><li class="nameList">Mengyu Yang</li><li class="nameList Last">Mingwei Xu</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Data parallelism has become a cornerstone in scaling up the training of deep neural networks (DNNs). However, the communication overhead associated with synchronizing gradients across multiple nodes has emerged as a significant bottleneck, adversely affecting training efficiency and leading to a surge in large-scale distributed model training costs. By leveraging insights into the statistical characteristics of gradients, we present GComp, a near-lossless gradient compression scheme designed to reduce the communication burden during data-parallel training significantly. GComp develops an optimized Huffman encoding/decoding strategy to compress gradient exponents effectively. Additionally, it introduces an innovative multi-level quantization method for mantissa, complemented by a pruning strategy that eliminates zero-valued gradients. These integrated approaches significantly reduce the volume of data for synchronization, while virtually not affecting the DNN model's training accuracy. We conduct comprehensive evaluations of GComp, demonstrating that our method can decrease the communication volume by as much as 67.1%, and enhance training speed by up to 1.9×.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698549">Accelerating Transfer Learning with Near-Data Computation on Cloud Object Stores</a></h3><ul class="DLauthors"><li class="nameList">Diana Petrescu</li><li class="nameList">Arsany Guirguis</li><li class="nameList">Do Le Quoc</li><li class="nameList">Javier Picorel</li><li class="nameList">Rachid Guerraoui</li><li class="nameList Last">Florin Dinu</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Storage disaggregation underlies today's cloud and is naturally complemented by pushing down some computation to storage, thus mitigating the potential network bottleneck between the storage and compute tiers. We show how ML training benefits from storage pushdowns by focusing on transfer learning (TL), the widespread technique that democratizes ML by reusing existing knowledge on related tasks. We propose HAPI, a new TL processing system centered around two complementary techniques that address challenges introduced by disaggregation. First, applications must carefully balance execution across tiers for performance. HAPI judiciously splits the TL computation during the feature extraction phase yielding pushdowns that not only improve network time but also improve total TL training time by overlapping the execution of consecutive training iterations across tiers. Second, operators want resource efficiency from the storage-side computational resources. HAPI employs storage-side batch size adaptation allowing increased storage-side pushdown concurrency without affecting training accuracy. HAPI yields up to 2.5× training speed-up while choosing in 86.8% of cases the best performing split point or one that is at most 5% off from the best.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698553">Inshrinkerator: Compressing Deep Learning Training Checkpoints via Dynamic Quantization</a></h3><ul class="DLauthors"><li class="nameList">Amey Agrawal</li><li class="nameList">Sameer Reddy</li><li class="nameList">Satwik Bhattamishra</li><li class="nameList">Venkata Prabhakara Sarath Nookala</li><li class="nameList">Vidushi Vashishth</li><li class="nameList">Kexin Rong</li><li class="nameList Last">Alexey Tumanov</li></ul><div class="DLabstract"><div style="display:inline">
		<p>The likelihood of encountering in-training failures rises substantially with larger Deep Learning (DL) training workloads, leading to lost work and resource wastage. Such failures are typically offset by checkpointing, which comes at the cost of storage and network bandwidth overhead. State-of-the-art approaches involve lossy model compression mechanisms, which induce a tradeoff between the resulting model quality and compression ratio. We make a key enabling observation that the sensitivity of model weights to compression varies during training, and different weights benefit from different quantization levels, ranging from retaining full precision to pruning. We propose (1) a non-uniform quantization scheme that leverages this variation, (2) an efficient search mechanism that dynamically finds the best quantization configurations, and (3) a quantization-aware delta compression mechanism that rearranges weights to minimize checkpoint differences and thereby improve compression.</p> <p>We instantiate these contributions in Inshrinkerator, an in-training checkpoint compression system for DL workloads. Our experiments show that Inshrinkerator consistently achieves a better tradeoff between accuracy and compression ratio compared to prior works, enabling a compression ratio up to 39x and withstanding up to 10 restores with negligible accuracy impact in fault-tolerant training. Inshrinkerator achieves at least an order of magnitude reduction in checkpoint size for failure recovery and transfer learning without any loss of accuracy.</p>
	</div></div>


						<h3><a class="DLtitleLink" title="Full Citation in the ACM Digital Library" referrerpolicy="no-referrer-when-downgrade" href="https://dl.acm.org/doi/10.1145/3698038.3698563">ParaGAN: A Scalable Distributed Training Framework for Generative Adversarial Networks</a></h3><ul class="DLauthors"><li class="nameList">Ziji Shi</li><li class="nameList">Jialin Li</li><li class="nameList Last">Yang You</li></ul><div class="DLabstract"><div style="display:inline">
		<p>Recent advances in Generative Artificial Intelligence have fueled numerous applications, particularly those involving Generative Adversarial Networks (GANs), which are essential for synthesizing realistic photos and videos. However, efficiently training GANs remains a critical challenge due to their computationally intensive and numerically unstable nature. Existing methods often require days or even weeks for training, posing significant resource and time constraints.</p> <p>In this work, we introduce ParaGAN, a scalable distributed GAN training framework that leverages asynchronous training and an asymmetric optimization policy to accelerate GAN training. ParaGAN employs a congestion-aware data pipeline and hardware-aware layout transformation to enhance accelerator utilization, resulting in over 30% improvements in throughput. With ParaGAN, we reduce the training time of BigGAN from 15 days to 14 hours while achieving 91% scaling efficiency. Additionally, ParaGAN enables unprecedented high-resolution image generation using BigGAN.</p>
	</div></div>

					</div></div></body></html>