Hyperscale Data Management
The largest deployments of cloud DBMSs have reached hyperscale, where a single database manages petabytes of data across hundreds to thousands of machines. At this scale, communication is crucial to efficiency because massive amounts of data move between components over the data center network. Network internals, such as topology, routing, congestion control, and packet processing, have a rapidly growing effect on end-to-end data processing performance. Yet DBMSs treat the network as a black box. In this research, we investigate how data center networks can speed up or slow down cloud data processing, and we build network-aware DBMSs and DBMS-friendly networks to sustain high efficiency for hyperscale data management.
Disaggregated Data Management
Cloud providers are embarking on a brave journey to evolve their infrastructures into disaggregated data centers (DDCs). Unlike traditional monolithic servers, which bundle compute, memory, and storage, DDCs physically separate hardware resources of different types, manage them in independent resource pools, and connect them via fast networking. Disaggregation solves many critical issues of existing data centers, but it poses new challenges for data-intensive systems. In this research, we study the implications of resource disaggregation and co-design data systems and cloud infrastructure to unlock the full benefits of disaggregation.
Accelerated Data Management
The end of Dennard scaling and the slowdown of Moore's law have made it increasingly challenging to advance the performance of general-purpose processors. CPUs are not getting faster, yet data continues to grow. We hence face an unprecedented need for more efficient compute resources for data processing, a fundamental challenge in data management today. In this research, we rethink how data management systems and hardware interact and investigate how best to leverage the hardware accelerators widely available in cloud data centers to improve the performance and cost of data tasks at scale.