Papers

Cowbird: Freeing CPUs to Compute by Offloading the Disaggregation of Memory
Xinyi Chen, Liangcheng Yu, Vincent Liu, Qizhen Zhang
Annual Conference of the ACM Special Interest Group on Data Communication, SIGCOMM 2023
Foundation Cloud Networks Resource Disaggregation Communication Offloading

Memory disaggregation allows applications running on compute servers to expand their pool of available memory capacity by leveraging remote resources through low-latency networks. Unfortunately, in existing software-level disaggregation frameworks, the simple act of issuing requests to remote memory, paid on every access, can consume many CPU cycles. This overhead imposes a direct cost on disaggregation, affecting not only the throughput of remote memory access but also application logic, which must contend with the framework's CPU overheads.
In this paper, we present Cowbird, a memory disaggregation architecture that frees compute servers to fulfill their stated purpose by removing disaggregation-related logic from their CPUs. Our experimental evaluation shows that Cowbird eliminates disaggregation overhead on compute-server CPUs and can improve end-to-end application performance by up to 3.5x compared to RDMA-only communication.
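To make the offloading idea concrete, here is a minimal Python sketch (hypothetical names, not Cowbird's actual interface): the application thread only posts remote-read descriptors, while a separate "offload engine" (a thread here, standing in for Cowbird's off-CPU logic) performs the remote accesses and posts completions, so application CPU cycles are not spent driving remote memory I/O.

# Illustrative sketch only; the queues stand in for Cowbird's request/completion channels.
import threading, queue

remote_memory = {addr: addr * 2 for addr in range(1024)}   # stand-in for a remote memory region

requests = queue.Queue()      # descriptors posted by the application
completions = queue.Queue()   # filled in by the offload engine

def offload_engine():
    while True:
        addr = requests.get()
        if addr is None:
            break
        completions.put((addr, remote_memory[addr]))   # the "remote access" happens off the app CPU

engine = threading.Thread(target=offload_engine, daemon=True)
engine.start()

# Application thread: issue a batch of reads, then collect completions.
for addr in range(8):
    requests.put(addr)
results = dict(completions.get() for _ in range(8))
requests.put(None)
print(results)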

@inproceedings{CowbirdSIGCOMM2023, author = {Xinyi Chen and Liangcheng Yu and Vincent Liu and Qizhen Zhang}, title = {Cowbird: Freeing CPUs to Compute by Offloading the Disaggregation of Memory}, booktitle = {{SIGCOMM} '23: Annual Conference of the ACM Special Interest Group on Data Communication}, pages = {1060--1073}, publisher = {{ACM}}, year = {2023} }

Disaggregated Database Systems
Jianguo Wang, Qizhen Zhang
ACM International Conference on Management of Data, SIGMOD 2023 Tutorial
Rethinking DBMSs Resource Disaggregation

Disaggregated database systems achieve unprecedented elasticity and resource utilization at the cloud scale and have recently gained great momentum from both industry and academia. Such systems are developed in response to the emerging trend of disaggregated data centers where resources are physically separated and connected through fast data center networks. Database management systems have traditionally been built on monolithic architectures, so disaggregation fundamentally challenges their designs. On the other hand, disaggregation offers benefits like independent scaling of compute, memory, and storage. Nonetheless, there is a lack of systematic investigation into new research challenges and opportunities in recent disaggregated database systems.
To provide database researchers and practitioners with insights into different forms of resource disaggregation, we take a snapshot of state-of-the-art disaggregated database systems and related techniques and present an in-depth tutorial. The primary goal is to better understand the enabling techniques and characteristics of resource disaggregation and its implications for next-generation database systems. To that end, we survey recent work on storage disaggregation, which separates secondary storage devices (e.g., SSDs) from compute servers and is widely deployed in current cloud data centers, and memory disaggregation, which further splits compute and memory with Remote Direct Memory Access (RDMA) and is driving the transformation of clouds. In addition, we mention two techniques that bring novel perspectives to the above two paradigms: persistent memory and Compute Express Link (CXL). Finally, we identify several directions that shed light on the future development of disaggregated database systems.

@inproceedings{DisaggDBMSSIGMOD2023, author = {Jianguo Wang and Qizhen Zhang}, title = {Disaggregated Database Systems}, booktitle = {Companion of the 2023 International Conference on Management of Data, {SIGMOD/PODS} 2023}, pages = {37--44}, publisher = {{ACM}}, year = {2023} }

FlexChain: An Elastic Disaggregated Blockchain
Chenyuan Wu, Mohammad Javad Amiri, Jared Asch, Heena Nagda, Qizhen Zhang, Boon Thau Loo
International Conference on Very Large Data Bases, VLDB 2023
Foundation Transactions Blockchains Resource Disaggregation

While permissioned blockchains enable a family of data center applications, existing systems suffer from imbalanced loads across compute and memory, exacerbating the underutilization of cloud resources. This paper presents FlexChain, a novel permissioned blockchain system that addresses this challenge by physically disaggregating CPUs, DRAM, and storage devices to process different blockchain workloads efficiently. Disaggregation allows blockchain service providers to upgrade and expand hardware resources independently to support a wide range of smart contracts with diverse CPU and memory demands. Moreover, it ensures efficient resource utilization and hence prevents resource fragmentation in a data center. We have explored the design of XOV blockchain systems in a disaggregated fashion and developed a tiered key-value store that can elastically scale its memory and storage. Our design significantly speeds up the execution stage. We have also leveraged several techniques to parallelize the validation stage in FlexChain to further improve the overall blockchain performance. Our evaluation results show that FlexChain can provide independent compute and memory scalability, while incurring at most 12.8% disaggregation overhead. FlexChain achieves almost identical throughput as the state-of-the-art distributed approaches with significantly lower memory and CPU consumption for compute-intensive and memory-intensive workloads respectively.
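The tiered key-value store is the part of the design that a short sketch clarifies best. The following Python is illustrative only (the class and policy are hypothetical, not FlexChain's code): a bounded memory tier spills cold entries to a slower storage tier and promotes them back on access, which is the basic shape of a state store whose memory and storage can scale independently.

from collections import OrderedDict

class TieredKV:
    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.memory_tier = OrderedDict()   # disaggregated DRAM, kept in LRU order
        self.storage_tier = {}             # disaggregated storage

    def put(self, key, value):
        self.memory_tier[key] = value
        self.memory_tier.move_to_end(key)
        while len(self.memory_tier) > self.memory_capacity:
            cold_key, cold_val = self.memory_tier.popitem(last=False)
            self.storage_tier[cold_key] = cold_val      # evict the coldest entry

    def get(self, key):
        if key in self.memory_tier:
            self.memory_tier.move_to_end(key)
            return self.memory_tier[key]
        value = self.storage_tier.pop(key)               # promote on access
        self.put(key, value)
        return value

kv = TieredKV(memory_capacity=2)
for i in range(4):
    kv.put(f"k{i}", i)
print(kv.get("k0"), len(kv.memory_tier), len(kv.storage_tier))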

@article{FlexChainVLDB2023, author = {Chenyuan Wu and Mohammad Javad Amiri and Jared Asch and Heena Nagda and Qizhen Zhang and Boon Thau Loo}, title = {FlexChain: An Elastic Disaggregated Blockchain}, journal = {Proc. {VLDB} Endow.}, volume = {16}, number = {1}, pages = {23--36}, year = {2023} }

Templating Shuffles
Qizhen Zhang, Jiacheng Wu, Ang Chen, Vincent Liu, Boon Thau Loo
Conference on Innovative Data Systems Research, CIDR 2023
Foundation Data Analytics Shuffle Cloud Data Centers

Cloud data centers are evolving fast. At the same time, today's large-scale data analytics applications require non-trivial performance tuning that is often specific to the applications, workloads, and data center infrastructure. We propose TeShu, which makes network shuffling an extensible unified service layer common to all data analytics. Since an optimal shuffle depends on a myriad of factors, TeShu introduces parameterized shuffle templates, instantiated by accurate and efficient sampling that enables TeShu to dynamically adapt to different application workloads and data center layouts. Our preliminary experimental results show that TeShu efficiently enables shuffling optimizations that improve performance and adapt to a variety of data center network scenarios.
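A minimal Python sketch of the template idea, under stated assumptions (the API and the instantiation policy are hypothetical, not TeShu's): the template fixes the shuffle's shape, and a cheap sample of the data instantiates its parameters before the shuffle runs.

import random
from collections import defaultdict

def sample_key_skew(records, sample_size=100):
    # Estimate how concentrated the key distribution is from a small sample.
    sample = random.sample(records, min(sample_size, len(records)))
    counts = defaultdict(int)
    for key, _ in sample:
        counts[key] += 1
    return max(counts.values()) / len(sample)   # share owned by the most popular key

def hash_shuffle(records, num_partitions):
    # Template body: plain hash partitioning, parameterized by fan-out.
    partitions = defaultdict(list)
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

records = [(f"key{i % 10}", i) for i in range(10_000)]
skew = sample_key_skew(records)
# Instantiate the template: fewer, larger partitions when keys are skewed,
# more partitions when the distribution is flat (illustrative policy only).
num_partitions = 4 if skew > 0.3 else 16
print(skew, num_partitions, len(hash_shuffle(records, num_partitions)))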

@inproceedings{TeShuCIDR2023, author = {Qizhen Zhang and Jiacheng Wu and Ang Chen and Vincent Liu and Boon Thau Loo}, title = {Templating Shuffles}, booktitle = {{CIDR} '23: Conference on Innovative Data Systems Research}, publisher = {www.cidrdb.org}, year = {2023} }

Optimizing Data-intensive Systems in Disaggregated Data Centers with TELEPORT
Qizhen Zhang, Xinyi Chen, Sidharth Sankhe, Zhilei Zheng, Ke Zhong, Sebastian Angel, Ang Chen, Vincent Liu, Boon Thau Loo
ACM International Conference on Management of Data, SIGMOD 2022
Foundation Data Processing Compute Pushdown Resource Disaggregation

Recent proposals for the disaggregation of compute, memory, storage, and accelerators in data centers promise substantial operational benefits. Unfortunately, for resources like memory, this comes at the cost of performance overhead due to the potential insertion of network latency into every load and store operation. This effect is particularly felt by data-intensive systems due to the size of their working sets, the frequency at which they need to access memory, and the relatively low computation per access. This performance impairment offsets the elasticity benefit of disaggregated memory. This paper presents TELEPORT, a compute pushdown framework for data-intensive systems that run on disaggregated architectures; compared to prior work on compute pushdown, TELEPORT is unique in its efficiency and flexibility. We have developed optimization principles for several popular systems including a columnar in-memory DBMS, a graph processing system, and a MapReduce system. The evaluation results show that using TELEPORT to push down simple operators improves the performance of these systems on state-of-the-art disaggregated OSes by an order of magnitude, thus fully exploiting the elasticity of disaggregated data centers.
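The following Python sketch illustrates the pushdown intuition only (it is not TELEPORT's interface): rather than paging every row across the network and filtering on the compute node, the compute node ships a small predicate to the memory node and receives only the matching rows.

rows = [{"id": i, "value": i % 100} for i in range(100_000)]   # lives in remote memory

def scan_without_pushdown(predicate):
    transferred = len(rows)                       # every row crosses the network
    return [r for r in rows if predicate(r)], transferred

def scan_with_pushdown(predicate):
    matches = [r for r in rows if predicate(r)]   # executed at the memory node
    return matches, len(matches)                  # only matches cross the network

pred = lambda r: r["value"] < 5
local, moved_all = scan_without_pushdown(pred)
remote, moved_few = scan_with_pushdown(pred)
assert local == remote
print(f"rows over the network: {moved_all} without pushdown vs. {moved_few} with pushdown")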

@inproceedings{TeleportSIGMOD2022, author = {Qizhen Zhang and Xinyi Chen and Sidharth Sankhe and Zhilei Zheng and Ke Zhong and Sebastian Angel and Ang Chen and Vincent Liu and Boon Thau Loo}, title = {Optimizing Data-intensive Systems in Disaggregated Data Centers with {TELEPORT}}, booktitle = {{SIGMOD} '22: International Conference on Management of Data}, pages = {1345--1359}, publisher = {{ACM}}, year = {2022} }

Redy: Remote Dynamic Memory Cache
Qizhen Zhang, Philip Bernstein, Daniel Berger, Badrish Chandramouli
International Conference on Very Large Data Bases, VLDB 2022
Foundation Cloud Data Systems Caching Harvested Resources

Redy is a cloud service that provides high-performance caches using RDMA-accessible remote memory. An application can customize the performance of each cache with a service level objective (SLO) for latency and throughput. By using remote memory, it can leverage stranded memory and spot VM instances to reduce the cost of its caches and improve data center resource utilization. Redy automatically customizes the resource configuration for the given SLO, handles the dynamics of remote memory regions, and recovers from failures. The experimental evaluation shows that Redy can deliver its promised performance and robustness under remote memory dynamics in the cloud. We augment a production key-value store, FASTER, with a Redy cache. When the working set exceeds local memory, using Redy is significantly faster than spilling to SSDs.
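A hedged Python sketch of how an application might declare a cache with a performance SLO; the classes, fields, and sizing policy are hypothetical stand-ins, since the paper describes the interface only at a high level.

from dataclasses import dataclass

@dataclass
class CacheSLO:
    p99_latency_us: float     # tail-latency target for a cache access
    throughput_mops: float    # sustained operations per second, in millions

@dataclass
class CacheConfig:
    num_remote_regions: int   # RDMA-registered memory regions backing the cache
    queue_pairs: int          # connections used to reach the throughput target

def plan_configuration(slo: CacheSLO) -> CacheConfig:
    # Toy sizing policy: scale connections with the throughput target and add
    # regions when the latency budget is tight (illustrative numbers only).
    qps = max(1, round(slo.throughput_mops))
    regions = 2 if slo.p99_latency_us < 10 else 1
    return CacheConfig(num_remote_regions=regions, queue_pairs=qps)

print(plan_configuration(CacheSLO(p99_latency_us=8, throughput_mops=4)))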

@article{RedyVLDB2022, author = {Qizhen Zhang and Philip A. Bernstein and Daniel S. Berger and Badrish Chandramouli}, title = {Redy: Remote Dynamic Memory Cache}, journal = {Proc. {VLDB} Endow.}, volume = {15}, number = {4}, pages = {766--779}, year = {2022} }

CompuCache: Remote Computable Caching using Spot VMs
Qizhen Zhang, Philip Bernstein, Daniel Berger, Badrish Chandramouli, Vincent Liu, Boon Thau Loo
Conference on Innovative Data Systems Research, CIDR 2022
Foundation Cloud Data Systems Caching Harvested Resources

Data management systems are hungry for main memory, and cloud data centers are awash in it. But that memory is not always easily accessible and often too expensive. To bridge this gap, we propose a new cloud service, CompuCache, that allows data-intensive systems to opportunistically offload their in-memory data, and computation over that data, to inexpensive cloud resources. To reduce cost, each cache is hosted on spot virtual machine (VM) instances when possible, or on provisioned VMs when not. CompuCache provides a byte-array abstraction and stored procedures so users can easily allocate inexpensive caches and specify their behavior. It distributes each stored procedure execution across the instances. In this paper, we discuss challenges in designing the interface, execution strategy, and fault tolerance mechanisms for CompuCache. We propose initial solutions for them, describe types of applications that can benefit from CompuCache, and report on the performance of an initial prototype. It executes 126 million stored procedure invocations per second on one VM with 16 threads.
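An illustrative Python sketch of the byte-array-plus-stored-procedure abstraction (hypothetical API, not CompuCache's): the cache can run a registered procedure next to the data, so the caller ships a small invocation instead of pulling the bytes back.

class ComputableCache:
    def __init__(self, size):
        self.data = bytearray(size)          # the byte-array abstraction
        self.procedures = {}

    def write(self, offset, payload: bytes):
        self.data[offset:offset + len(payload)] = payload

    def register(self, name, fn):
        self.procedures[name] = fn           # stored procedure, runs cache-side

    def invoke(self, name, *args):
        return self.procedures[name](self.data, *args)

cache = ComputableCache(size=1024)
cache.write(0, bytes(range(64)))
# Sum a region without transferring it back to the client.
cache.register("sum_range", lambda data, off, n: sum(data[off:off + n]))
print(cache.invoke("sum_range", 0, 64))   # -> 2016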

@inproceedings{CompuCacheCIDR2022, author = {Qizhen Zhang and Philip A. Bernstein and Daniel S. Berger and Badrish Chandramouli and Vincent Liu and Boon Thau Loo}, title = {CompuCache: Remote Computable Caching using Spot VMs}, booktitle = {{CIDR} '22: Conference on Innovative Data Systems Research}, publisher = {www.cidrdb.org}, year = {2022} }

MimicNet: Fast Performance Estimates for Data Center Networks with Machine Learning
Qizhen Zhang, Kelvin K.W. Ng, Charles W. Kazer, Shen Yan, João Sedoc, Vincent Liu
Annual Conference of the ACM Special Interest Group on Data Communication, SIGCOMM 2021
Foundation Data Center Networks Performance Prediction Machine Learning

At-scale evaluation of new data center network innovations is becoming increasingly intractable. This is true for testbeds, where few, if any, can afford a dedicated, full-scale replica of a data center. It is also true for simulations, which, while originally designed for precisely this purpose, have struggled to cope with the size of today's networks. This paper presents an approach for quickly obtaining accurate performance estimates for large data center networks. Our system, MimicNet, provides users with the familiar abstraction of a packet-level simulation for a portion of the network while leveraging redundancy and recent advances in machine learning to quickly and accurately approximate portions of the network that are not directly visible. MimicNet can provide over two orders of magnitude speedup compared to regular simulation for a data center with thousands of servers. Even at this scale, MimicNet estimates of the tail FCT, throughput, and RTT are within 5% of the true results.
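A deliberately simplified Python sketch of the core idea, under stated assumptions (the delay model and composition are toy stand-ins, not MimicNet's learned ingress/egress models): simulate one cluster in detail, fit a model to the behavior it produces, then substitute that model for packet-level simulation of the clusters that are not observed.

import random
random.seed(0)

def simulate_cluster_delay():
    # Stand-in for an expensive packet-level simulation of one cluster.
    return random.expovariate(1 / 50.0)     # per-cluster delay in microseconds

# 1. Detailed simulation of a single cluster yields training data.
observed = [simulate_cluster_delay() for _ in range(10_000)]

# 2. "Train" a model of that cluster; here just the empirical distribution,
#    where MimicNet would train a learned model.
def mimic_cluster_delay():
    return random.choice(observed)

# 3. Estimate an end-to-end path that crosses many clusters by composing one
#    simulated hop with mimicked hops, instead of simulating all of them.
def end_to_end_delay(num_clusters=8):
    return simulate_cluster_delay() + sum(mimic_cluster_delay() for _ in range(num_clusters - 1))

samples = sorted(end_to_end_delay() for _ in range(5_000))
print("median:", samples[len(samples) // 2], "p99:", samples[int(0.99 * len(samples))])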

@inproceedings{MimicNetSIGCOMM2021, author = {Qizhen Zhang and Kelvin K. W. Ng and Charles Kazer and Shen Yan and Jo\~{a}o Sedoc and Vincent Liu}, title = {{MimicNet}: Fast Performance Estimates for Data Center Networks with Machine Learning}, booktitle = {{SIGCOMM} '21: Annual Conference of the ACM Special Interest Group on Data Communication}, pages = {287--304}, publisher = {{ACM}}, year = {2021} }

Understanding the Effect of Data Center Resource Disaggregation on Production DBMSs
Qizhen Zhang, Yifan Cai, Xinyi Chen, Sebastian Angel, Ang Chen, Vincent Liu, Boon Thau Loo
International Conference on Very Large Data Bases, VLDB 2020
Rethinking DBMSs Resource Disaggregation

Resource disaggregation is a new architecture for data centers in which resources like memory and storage are decoupled from the CPU, managed independently, and connected through a high-speed network. Recent work has shown that although disaggregated data centers (DDCs) provide operational benefits, applications running on DDCs experience degraded performance due to extra network latency between the CPU and their working sets in main memory. DBMSs are an interesting case study for DDCs for two main reasons: (1) DBMSs normally process data-intensive workloads and require data movement between different resource components; and (2) disaggregation drastically changes the assumption that DBMSs can rely on their own internal resource management.
We take the first step to thoroughly evaluate the query execution performance of production DBMSs in disaggregated data centers. We evaluate two popular open-source DBMSs (MonetDB and PostgreSQL) and test their performance with the TPC-H benchmark in a recently released operating system for resource disaggregation. We evaluate these DBMSs with various configurations and compare their performance with that of single-machine Linux with the same hardware resources. Our results confirm that significant performance degradation does occur, but, perhaps surprisingly, we also find settings in which the degradation is minor or where DDCs actually improve performance.

@article{DDCDBMSVLDB2020, author = {Qizhen Zhang and Yifan Cai and Xinyi Chen and Sebastian Angel and Ang Chen and Vincent Liu and Boon Thau Loo}, title = {Understanding the Effect of Data Center Resource Disaggregation on Production DBMSs}, journal = {Proc. {VLDB} Endow.}, volume = {13}, number = {9}, pages = {1568--1581}, year = {2020} }

Rethinking Data Management Systems for Disaggregated Data Centers
Qizhen Zhang, Yifan Cai, Sebastian Angel, Ang Chen, Vincent Liu, Boon Thau Loo
Conference on Innovative Data Systems Research, CIDR 2020
Rethinking DBMSs Resource Disaggregation

One recent trend of cloud data center design is resource disaggregation. Instead of having server units with "converged" compute, memory, and storage resources, a disaggregated data center (DDC) has pools of resources of each type connected via a network. While the systems community has been investigating the research challenges of DDC by designing new OS and network stacks, the implications of DDC for next-generation database systems remain unclear. In this paper, we take a first step towards understanding how DDCs might affect the design of relational databases, discuss the potential advantages and drawbacks in the context of data processing, and outline research challenges in addressing them.

@inproceedings{DDCDMCIDR2020, author = {Qizhen Zhang and Yifan Cai and Sebastian Angel and Ang Chen and Vincent Liu and Boon Thau Loo}, title = {Rethinking Data Management Systems for Disaggregated Data Centers}, booktitle = {{CIDR} '20: Conference on Innovative Data Systems Research}, publisher = {www.cidrdb.org}, year = {2020} }

Optimizing Declarative Graph Queries at Large Scale
Qizhen Zhang, Akash Acharya, Hongzhi Chen, Simran Arora, Ang Chen, Vincent Liu, Boon Thau Loo
ACM International Conference on Management of Data, SIGMOD 2019
Foundation Graph Processing Query Optimization Cloud Data Centers

This paper presents GraphRex, an efficient, robust, scalable, and easy-to-program framework for graph processing on datacenter infrastructure. To users, GraphRex presents a declarative, Datalog-like interface that is natural and expressive. Underneath, it compiles those queries into efficient implementations. A key technical contribution of GraphRex is the identification and optimization of a set of global operators whose efficiency is crucial to the good performance of datacenter-based, large graph analysis. Our experimental results show that GraphRex significantly outperforms existing frameworks--both high- and low-level--in scenarios ranging across a wide variety of graph workloads and network conditions, sometimes by two orders of magnitude.
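To give a flavor of the declarative interface, here is a reachability query in Datalog-like form (the exact GraphRex surface syntax may differ) together with a naive single-machine fixpoint evaluation in Python; GraphRex itself compiles such rules into optimized, datacenter-aware global operators rather than evaluating them this way.

#   reach(X, Y) :- edge(X, Y).
#   reach(X, Z) :- reach(X, Y), edge(Y, Z).
edges = {(1, 2), (2, 3), (3, 4), (2, 5)}

reach = set(edges)                       # first rule: every edge is a reach fact
changed = True
while changed:                           # second rule: iterate to a fixpoint
    new_facts = {(x, z) for (x, y) in reach for (y2, z) in edges if y == y2}
    changed = not new_facts <= reach
    reach |= new_facts

print(sorted(reach))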

@inproceedings{GraphRexSIGMOD2019, author = {Qizhen Zhang and Akash Acharya and Hongzhi Chen and Simran Arora and Ang Chen and Vincent Liu and Boon Thau Loo}, title = {Optimizing Declarative Graph Queries at Large Scale}, booktitle = {{SIGMOD} '19: International Conference on Management of Data}, pages = {1411--1428}, publisher = {{ACM}}, year = {2019} }

Predicting Startup Crowdfunding Success through Longitudinal Social Engagement Analysis
Qizhen Zhang, Tengyuan Ye, Meryem Essaidi, Shivani Agarwal, Vincent Liu, Boon Thau Loo
ACM International Conference on Information and Knowledge Management, CIKM 2017
Application Crowdfunding Social Networks Machine Learning

A key ingredient to a startup's success is its ability to raise funding at an early stage. Crowdfunding has emerged as an exciting new mechanism for connecting startups with potentially thousands of investors. Nonetheless, little is known about its effectiveness or about the strategies that entrepreneurs should adopt in order to maximize their rate of success. In this paper, we perform a longitudinal data collection and analysis of AngelList--a popular crowdfunding social platform for connecting investors and entrepreneurs. Over a 7-10 month period, we track companies that are actively fund-raising on AngelList, and record their level of social engagement on AngelList, Twitter, and Facebook. Through a series of measures on social engagement (e.g., number of tweets, posts, new followers), our analysis shows that active engagement on social media is highly correlated to crowdfunding success. In some cases, the engagement level is an order of magnitude higher for successful companies. We further apply a range of machine learning techniques (e.g., decision tree, SVM, KNN) to predict the ability of a company to successfully raise funding based on its social engagement and other metrics. Since fund-raising is a rare event, we explore various techniques to deal with class imbalance issues. We observe that some metrics (e.g., AngelList followers and Facebook posts) are more significant than other metrics in predicting fund-raising success. Furthermore, despite the class imbalance, we are able to predict crowdfunding success with 84% accuracy.
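A hedged Python sketch of the modeling setup, on synthetic data: fund-raising success is the rare class, so the classifier is trained with class weighting to counter the imbalance. This is one of several standard remedies; the features and numbers below are placeholders, not the paper's dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in for social-engagement features (tweets, posts, followers).
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = DecisionTreeClassifier(class_weight="balanced", max_depth=5, random_state=0)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))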

@inproceedings{CrowdfundingCIKM2017, author = {Qizhen Zhang and Tengyuan Ye and Meryem Essaidi and Shivani Agarwal and Vincent Liu and Boon Thau Loo}, title = {Predicting Startup Crowdfunding Success through Longitudinal Social Engagement Analysis}, booktitle = {{CIKM} '17: International Conference on Information and Knowledge Management}, pages = {1937--1946}, publisher = {{ACM}}, year = {2017} }

Architectural Implications on the Performance and Cost of Graph Analytics Systems
Qizhen Zhang, Hongzhi Chen, Da Yan, James Cheng, Boon Thau Loo, Purushotham Bangalore
ACM Symposium on Cloud Computing, SoCC 2017
Rethinking Graph Processing System Architectures Performance vs. Cost

Graph analytics systems have gained significant popularity due to the prevalence of graph data. Many of these systems are designed to run in a shared-nothing architecture whereby a cluster of machines can process a large graph in parallel. In more recent proposals, others have argued that a single-machine system can achieve better performance and/or is more cost-effective. There is, however, no clear consensus on which approach is better. In this paper, we classify existing graph analytics systems into four categories based on architectural differences, i.e., processing infrastructure (centralized vs. distributed) and memory consumption (in-memory vs. out-of-core). We select eight open-source systems to cover all categories, and perform a comparative measurement study to compare their performance and cost characteristics across a spectrum of input data, applications, and hardware settings. Our results show that the best performing configuration can depend on the type of applications and input graphs, and there is no dominant winner across all categories. Based on our findings, we summarize the trends in performance and cost, and provide several insights that help illuminate the performance and resource cost tradeoffs across different graph analytics systems and categories.

@inproceedings{GraphPACSoCC2017, author = {Qizhen Zhang and Hongzhi Chen and Da Yan and James Cheng and Boon Thau Loo and Purushotham Bangalore}, title = {Architectural Implications on the Performance and Cost of Graph Analytics Systems}, booktitle = {{SoCC} '17: Symposium on Cloud Computing}, pages = {40--51}, publisher = {{ACM}}, year = {2017} }