ML for Systems
Our vision for research on ML for Systems is laid out in SageDB, a new type of data processing system that specializes itself to a particular application through code synthesis and machine learning. This vision is also a focus of MIT DSAIL.
Here, we provide an overview of the data systems components we are currently working on, with more detailed project descriptions available through the links, as well as a list of our open-source repositories.
For high-level descriptions of our research, you can check out our Learned Systems Blog.
If you want to find out more about the exciting work in this area, we have also compiled a list of ML for Systems Papers.
Data Access
Machine Learning just ate Algorithms in one large bite, thx to @tim_kraska, @alexbeutel, @edchi, @JeffDean & Polyzotis at @Google—faster, smaller trees, hashes, bloom filters
— Christopher Manning, Professor at Stanford
Storage layout and index structures are the most important factors for guaranteeing efficient data access, and both can be enhanced by learned data and workload models. The original work on learned indexes showed that replacing traditional index structures with learned models can improve data access times while reducing memory footprint. Since then, we have worked to make learned indexes more practical, developed learned multi-dimensional indexes, created a benchmark for learned index structures, and applied learned indexes to DNA sequence search.
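The core idea is easy to sketch: instead of traversing a B-tree, a learned index approximates the cumulative distribution function (CDF) of the sorted keys with a model, predicts where a key should be, and then searches only within the model's worst-case error bound. The following is a minimal sketch of that idea, assuming a single linear model over sorted keys; the class name and structure are illustrative and not the recursive model index (RMI) from the paper.

```python
# Minimal learned-index sketch: approximate the CDF of a sorted key array with a
# linear model, then use the prediction plus a recorded error bound for lookups.
import bisect
import numpy as np

class SimpleLearnedIndex:
    def __init__(self, keys):
        # keys must be sorted; fit position ~ slope * key + intercept (a linear CDF model)
        self.keys = np.asarray(keys)
        positions = np.arange(len(self.keys))
        self.slope, self.intercept = np.polyfit(self.keys, positions, 1)
        # record the maximum prediction error so that lookups stay correct
        preds = self.slope * self.keys + self.intercept
        self.max_err = int(np.ceil(np.max(np.abs(preds - positions))))

    def lookup(self, key):
        # predict a position, then binary-search only within [pred - err, pred + err]
        pred = int(self.slope * key + self.intercept)
        lo = max(0, pred - self.max_err)
        hi = min(len(self.keys), pred + self.max_err + 1)
        i = lo + bisect.bisect_left(self.keys[lo:hi].tolist(), key)
        return i if i < len(self.keys) and self.keys[i] == key else None

keys = np.sort(np.random.randint(0, 1_000_000, size=100_000))
idx = SimpleLearnedIndex(keys)
assert idx.lookup(int(keys[42])) is not None
```

A single linear model only works well for near-uniform key distributions; the RMI instead uses a small hierarchy of models so that each leaf model covers a region of the key space it can fit tightly, which is what yields both smaller error bounds and a smaller memory footprint than a B-tree.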
Publications:
- The Case for Learned Index Structures. SIGMOD 2018. Tim Kraska, Alex Beutel, Ed Chi, Jeffrey Dean, Neoklis Polyzotis
- SOSD: A Benchmark for Learned Indexes. NeurIPS Workshop on ML for Systems 2019. Andreas Kipf*, Ryan Marcus*, Alexander van Renen*, Mihail Stoian, Alfons Kemper, Tim Kraska, Thomas Neumann
- LISA: Towards Learned DNA Sequence Search. NeurIPS Workshop on Systems for ML 2019. Darryl Ho, Jialin Ding, Sanchit Misra, Nesime Tatbul, Vikram Nathan, Vasimuddin Md, Tim Kraska
- RadixSpline: A Single-Pass Learned Index. aiDM Workshop @ SIGMOD 2020. Andreas Kipf*, Ryan Marcus*, Alexander van Renen*, Mihail Stoian, Alfons Kemper, Tim Kraska, Thomas Neumann
- Learning Multi-dimensional Indexes. SIGMOD 2020. Vikram Nathan*, Jialin Ding*, Mohammad Alizadeh, Tim Kraska
- CDFShop: Exploring and Optimizing Learned Index Structures. SIGMOD 2020 (demo). Ryan Marcus, Emily Zhang, Tim Kraska
- ALEX: An Updatable Adaptive Learned Index. SIGMOD 2020. Jialin Ding, Umar Farooq Minhas, Jia Yu, Chi Wang, Jaeyoung Do, Yinan Li, Hantian Zhang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, David Lomet, Tim Kraska
- Benchmarking Learned Indexes. VLDB 2021. Ryan Marcus, Andreas Kipf, Alexander van Renen, Mihail Stoian, Sanchit Misra, Alfons Kemper, Thomas Neumann, Tim Kraska
- Partitioned Learned Bloom Filter. ICLR 2021. Kapil Vaidya, Eric Knorr, Michael Mitzenmacher, Tim Kraska
- Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads. VLDB 2021. Jialin Ding, Vikram Nathan, Mohammad Alizadeh, Tim Kraska
- LEA: A Learned Encoding Advisor for Column Stores. aiDM Workshop @ SIGMOD 2021. Lujing Cen, Andreas Kipf, Ryan Marcus, Tim Kraska
- When Are Learned Models Better Than Hash Functions? AIDB Workshop @ VLDB 2021. Ibrahim Sabek, Kapil Vaidya, Dominik Horn, Andreas Kipf, Tim Kraska
- Towards Practical Learned Indexing. AIDB Workshop @ VLDB 2021. Mihail Stoian, Andreas Kipf, Ryan Marcus, Tim Kraska
- Bounding the Last Mile: Efficient Learned String Indexing. AIDB Workshop @ VLDB 2021. Benjamin Spector, Andreas Kipf, Kapil Vaidya, Chi Wang, Umar Farooq Minhas, Tim Kraska
Query Execution
Learned models of the data distribution can also be used to speed up query execution. In particular, we are applying learned techniques to sorting, scheduling, and joins.
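Learned Sort illustrates the idea: a model of the data's CDF scatters each element close to its final sorted position, and a cheap cleanup pass repairs the remaining inaccuracies. Below is a minimal sketch under simplifying assumptions (a linear CDF model fit on a sample and a per-bucket sort as the cleanup step); the paper's implementation uses a more accurate multi-layer CDF model and a more careful repair phase.

```python
# Minimal Learned Sort sketch: learn an approximate CDF from a sample, scatter
# elements into buckets by predicted quantile, then sort each small bucket.
import numpy as np

def learned_sort(arr, sample_size=1_000):
    arr = np.asarray(arr, dtype=float)
    n = len(arr)
    # 1. Model the empirical CDF with a linear fit on a random sample (illustrative model).
    sample = np.sort(np.random.choice(arr, size=min(sample_size, n), replace=False))
    ranks = np.linspace(0.0, 1.0, len(sample))
    slope, intercept = np.polyfit(sample, ranks, 1)

    # 2. Scatter elements into buckets indexed by their predicted quantile.
    num_buckets = max(1, n // 64)
    cdf = np.clip(slope * arr + intercept, 0.0, 1.0)
    bucket_ids = np.minimum((cdf * num_buckets).astype(int), num_buckets - 1)
    buckets = [[] for _ in range(num_buckets)]
    for x, b in zip(arr, bucket_ids):
        buckets[b].append(x)

    # 3. Cleanup: sort each (small) bucket and concatenate.
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))
    return out

data = np.random.exponential(scale=10.0, size=100_000)
assert learned_sort(data) == sorted(data.tolist())
```

Because the CDF model is monotone, concatenating the sorted buckets always yields a correctly sorted output; a better-fitting model simply makes the buckets more evenly sized, which is where the speedup over comparison sorting comes from.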
Publications:
- Learning Scheduling Algorithms for Data Processing Clusters. SIGCOMM 2019. Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, Mohammad Alizadeh
- The Case for a Learned Sorting Algorithm. SIGMOD 2020. Ani Kristo*, Kapil Vaidya*, Ugur Cetintemel, Sanchit Misra, Tim Kraska
- LSched: A Workload-Aware Learned Query Scheduler for Analytical Database Systems. SIGMOD 2022. Ibrahim Sabek, Tenzin Samten Ukyab, Tim Kraska
Query Optimization
Traditional query optimizers are extremely hard to build and maintain, and they often yield sub-optimal query plans. This brittleness and complexity make the optimizer a good candidate to be learned. We have introduced the first end-to-end learned query optimizer, and we are also exploring learning-based cardinality estimation techniques.
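One practical route, taken by Bao, is to steer the existing optimizer rather than replace it: for each incoming query, the system chooses a set of planner hints and learns from the observed runtimes which hints help which kinds of queries. The sketch below reduces that idea to an epsilon-greedy bandit over a few hypothetical hint-set names keyed by query template; Bao itself featurizes whole query plans with a tree-convolution value model and uses Thompson sampling, so this is only an illustration of the feedback loop.

```python
# Minimal steering sketch: pick a hint set per query template, observe runtime,
# and gradually prefer the hint sets that have performed best so far.
import random
from collections import defaultdict

HINT_SETS = ["default", "no_nested_loop_join", "no_index_scan"]  # hypothetical arm names

class HintSteerer:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        # observed runtimes per (query_template, hint_set)
        self.runtimes = defaultdict(list)

    def choose(self, query_template):
        # explore occasionally; otherwise exploit the arm with the lowest mean runtime
        if random.random() < self.epsilon:
            return random.choice(HINT_SETS)
        def mean_runtime(hint_set):
            obs = self.runtimes[(query_template, hint_set)]
            return sum(obs) / len(obs) if obs else 0.0  # untried arms look attractive
        return min(HINT_SETS, key=mean_runtime)

    def record(self, query_template, hint_set, runtime_secs):
        self.runtimes[(query_template, hint_set)].append(runtime_secs)

steerer = HintSteerer()
hint = steerer.choose("q7")        # pick a hint set for this query template
steerer.record("q7", hint, 1.8)    # feed back the measured execution time
```

Because the underlying optimizer still produces the plans, this style of learned optimization can fall back to the default behavior at any time, which is a large part of what makes it deployable in practice.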
Publications:
- Neo: A Learned Query Optimizer. VLDB 2019. Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, Nesime Tatbul
- Cost-Guided Cardinality Estimation: Focus Where it Matters. SMDB 2020. Parimarjan Negi, Ryan Marcus, Hongzi Mao, Nesime Tatbul, Tim Kraska, Mohammad Alizadeh
- Bao: Making Learned Query Optimization Practical. SIGMOD 2021. Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Alizadeh, Tim Kraska
- Flow-Loss: Learning Cardinality Estimates That Matter. VLDB 2021. Parimarjan Negi, Ryan Marcus, Andreas Kipf, Hongzi Mao, Nesime Tatbul, Tim Kraska, Mohammad Alizadeh
Open Source
We have open-sourced a number of our projects:
- Search on Sorted Data Benchmark (SOSD). https://github.com/learnedsystems/SOSD.
- Reference implementation of recursive model index (RMI). https://github.com/learnedsystems/RMI.
- RadixSpline: A Single-Pass Learned Index. https://github.com/learnedsystems/RadixSpline.
- Park: An Open Platform for Learning Augmented Computer Systems. https://github.com/park-project/park.
- Reference implementation of Learned Sort. https://github.com/learnedsystems/LearnedSort.
- ALEX: An Updatable Adaptive Learned Index. https://github.com/microsoft/ALEX.
- Bao: Learning to Steer Query Optimizers. https://github.com/learnedsystems/baoforpostgresql.
- Self-Organizing Data Containers (SDC). Will be released soon.