ML for Systems
Our vision for research on ML for Systems is laid out in SageDB, a new type of data processing system that specializes itself to a particular application through code synthesis and machine learning. This vision is also a focus of MIT DSAIL.
Here, we provide an overview of the data systems components we are currently working on, with more detailed project descriptions available through the links, as well as a list of our open-source repositories.
For high-level descriptions of our research, you can check out our Learned Systems Blog.
If you want to find out more about the exciting work in this area, we have also compiled a list of ML for Systems Papers.
Data Access
Machine Learning just ate Algorithms in one large bite, thx to @tim_kraska, @alexbeutel, @edchi, @JeffDean & Polyzotis at @Google—faster, smaller trees, hashes, bloom filters
— Christopher Manning, Professor at Stanford
Storage layout and index structures are the most important factors for guaranteeing efficient data access, and both can be enhanced by learned data and workload models. The original work on learned indexes showed that replacing traditional index structures with learned models can improve data access times while reducing memory footprint. Since then, we have worked to make learned indexes more practical, developed learned multi-dimensional indexes, created a benchmark for learned index structures, and applied learned indexes to DNA sequence search.
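The core idea is easy to sketch: instead of traversing a B-tree, a learned index approximates the cumulative distribution function (CDF) of the sorted keys with a model, predicts where a key should be, and then searches only within the model's worst-case error bound. The following is a minimal sketch of that idea, assuming a single linear model over sorted keys; the class name and structure are illustrative and not the recursive model index (RMI) from the paper.

```python
# Minimal learned-index sketch: approximate the CDF of a sorted key array with a
# linear model, then use the prediction plus a recorded error bound for lookups.
import bisect
import numpy as np

class SimpleLearnedIndex:
    def __init__(self, keys):
        # keys must be sorted; fit position ~ slope * key + intercept (a linear CDF model)
        self.keys = np.asarray(keys)
        positions = np.arange(len(self.keys))
        self.slope, self.intercept = np.polyfit(self.keys, positions, 1)
        # record the maximum prediction error so that lookups stay correct
        preds = self.slope * self.keys + self.intercept
        self.max_err = int(np.ceil(np.max(np.abs(preds - positions))))

    def lookup(self, key):
        # predict a position, then binary-search only within [pred - err, pred + err]
        pred = int(self.slope * key + self.intercept)
        lo = max(0, pred - self.max_err)
        hi = min(len(self.keys), pred + self.max_err + 1)
        i = lo + bisect.bisect_left(self.keys[lo:hi].tolist(), key)
        return i if i < len(self.keys) and self.keys[i] == key else None

keys = np.sort(np.random.randint(0, 1_000_000, size=100_000))
idx = SimpleLearnedIndex(keys)
assert idx.lookup(int(keys[42])) is not None
```

A single linear model only works well for near-uniform key distributions; the RMI instead uses a small hierarchy of models so that each leaf model covers a region of the key space it can fit tightly, which is what yields both smaller error bounds and a smaller memory footprint than a B-tree.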
Publications:
- The Case for Learned Index Structures. SIGMOD 2018. Tim Kraska, Alex Beutel, Ed Chi, Jeffrey Dean, Neoklis Polyzotis
- SOSD: A Benchmark for Learned Indexes. NeurIPS Workshop on ML for Systems 2019. Andreas Kipf*, Ryan Marcus*, Alexander van Renen*, Mihail Stoian, Alfons Kemper, Tim Kraska, Thomas Neumann
- LISA: Towards Learned DNA Sequence Search. NeurIPS Workshop on Systems for ML 2019. Darryl Ho, Jialin Ding, Sanchit Misra, Nesime Tatbul, Vikram Nathan, Vasimuddin Md, Tim Kraska
- RadixSpline: A Single-Pass Learned Index. aiDM Workshop @ SIGMOD 2020. Andreas Kipf*, Ryan Marcus*, Alexander van Renen*, Mihail Stoian, Alfons Kemper, Tim Kraska, Thomas Neumann
- Learning Multi-dimensional Indexes. SIGMOD 2020. Vikram Nathan*, Jialin Ding*, Mohammad Alizadeh, Tim Kraska
- CDFShop: Exploring and Optimizing Learned Index Structures. SIGMOD 2020 (demo). Ryan Marcus, Emily Zhang, Tim Kraska
- ALEX: An Updatable Adaptive Learned Index. SIGMOD 2020. Jialin Ding, Umar Farooq Minhas, Jia Yu, Chi Wang, Jaeyoung Do, Yinan Li, Hantian Zhang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, David Lomet, Tim Kraska
- Benchmarking Learned Indexes. VLDB 2021. Ryan Marcus, Andreas Kipf, Alexander van Renen, Mihail Stoian, Sanchit Misra, Alfons Kemper, Thomas Neumann, Tim Kraska
- Partitioned Learned Bloom Filter. ICLR 2021. Kapil Vaidya, Eric Knorr, Michael Mitzenmacher, Tim Kraska
- Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads. VLDB 2021. Jialin Ding, Vikram Nathan, Mohammad Alizadeh, Tim Kraska
- LEA: A Learned Encoding Advisor for Column Stores. aiDM Workshop @ SIGMOD 2021. Lujing Cen, Andreas Kipf, Ryan Marcus, Tim Kraska
- When Are Learned Models Better Than Hash Functions? AIDB Workshop @ VLDB 2021. Ibrahim Sabek, Kapil Vaidya, Dominik Horn, Andreas Kipf, Tim Kraska
- Towards Practical Learned Indexing. AIDB Workshop @ VLDB 2021. Mihail Stoian, Andreas Kipf, Ryan Marcus, Tim Kraska
- Bounding the Last Mile: Efficient Learned String Indexing. AIDB Workshop @ VLDB 2021. Benjamin Spector, Andreas Kipf, Kapil Vaidya, Chi Wang, Umar Farooq Minhas, Tim Kraska
Query Execution
Learned models of the data distribution can also be used to speed up query execution. In particular, we are applying learned techniques to sorting, scheduling, and joins.
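Learned Sort illustrates the idea: a model of the data's CDF scatters each element close to its final sorted position, and a cheap cleanup pass repairs the remaining inaccuracies. Below is a minimal sketch under simplifying assumptions (a linear CDF model fit on a sample and a per-bucket sort as the cleanup step); the paper's implementation uses a more accurate multi-layer CDF model and a more careful repair phase.

```python
# Minimal Learned Sort sketch: learn an approximate CDF from a sample, scatter
# elements into buckets by predicted quantile, then sort each small bucket.
import numpy as np

def learned_sort(arr, sample_size=1_000):
    arr = np.asarray(arr, dtype=float)
    n = len(arr)
    # 1. Model the empirical CDF with a linear fit on a random sample (illustrative model).
    sample = np.sort(np.random.choice(arr, size=min(sample_size, n), replace=False))
    ranks = np.linspace(0.0, 1.0, len(sample))
    slope, intercept = np.polyfit(sample, ranks, 1)

    # 2. Scatter elements into buckets indexed by their predicted quantile.
    num_buckets = max(1, n // 64)
    cdf = np.clip(slope * arr + intercept, 0.0, 1.0)
    bucket_ids = np.minimum((cdf * num_buckets).astype(int), num_buckets - 1)
    buckets = [[] for _ in range(num_buckets)]
    for x, b in zip(arr, bucket_ids):
        buckets[b].append(x)

    # 3. Cleanup: sort each (small) bucket and concatenate.
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))
    return out

data = np.random.exponential(scale=10.0, size=100_000)
assert learned_sort(data) == sorted(data.tolist())
```

Because the CDF model is monotone, concatenating the sorted buckets always yields a correctly sorted output; a better-fitting model simply makes the buckets more evenly sized, which is where the speedup over comparison sorting comes from.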
Publications:
- Learning Scheduling Algorithms for Data Processing Clusters. SIGCOMM 2019. Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, Mohammad Alizadeh
- The Case for a Learned Sorting Algorithm. SIGMOD 2020. Ani Kristo*, Kapil Vaidya*, Ugur Cetintemel, Sanchit Misra, Tim Kraska
- LSched: A Workload-Aware Learned Query Scheduler for Analytical Database Systems. SIGMOD 2022. Ibrahim Sabek, Tenzin Samten Ukyab, Tim Kraska
Query Optimization
Traditional query optimizers are extremely hard to build and maintain, and they often yield sub-optimal query plans. This brittleness and complexity make the optimizer a good candidate to be learned. We have introduced the first end-to-end learned query optimizer, and we are also exploring learning-based cardinality estimation techniques.
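One practical route, taken by Bao, is to steer the existing optimizer rather than replace it: for each incoming query, the system chooses a set of planner hints and learns from the observed runtimes which hints help which kinds of queries. The sketch below reduces that idea to an epsilon-greedy bandit over a few hypothetical hint-set names keyed by query template; Bao itself featurizes whole query plans with a tree-convolution value model and uses Thompson sampling, so this is only an illustration of the feedback loop.

```python
# Minimal steering sketch: pick a hint set per query template, observe runtime,
# and gradually prefer the hint sets that have performed best so far.
import random
from collections import defaultdict

HINT_SETS = ["default", "no_nested_loop_join", "no_index_scan"]  # hypothetical arm names

class HintSteerer:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        # observed runtimes per (query_template, hint_set)
        self.runtimes = defaultdict(list)

    def choose(self, query_template):
        # explore occasionally; otherwise exploit the arm with the lowest mean runtime
        if random.random() < self.epsilon:
            return random.choice(HINT_SETS)
        def mean_runtime(hint_set):
            obs = self.runtimes[(query_template, hint_set)]
            return sum(obs) / len(obs) if obs else 0.0  # untried arms look attractive
        return min(HINT_SETS, key=mean_runtime)

    def record(self, query_template, hint_set, runtime_secs):
        self.runtimes[(query_template, hint_set)].append(runtime_secs)

steerer = HintSteerer()
hint = steerer.choose("q7")        # pick a hint set for this query template
steerer.record("q7", hint, 1.8)    # feed back the measured execution time
```

Because the underlying optimizer still produces the plans, this style of learned optimization can fall back to the default behavior at any time, which is a large part of what makes it deployable in practice.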
Publications:
- Neo: A Learned Query Optimizer. VLDB 2019. Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, Nesime Tatbul
- Cost-Guided Cardinality Estimation: Focus Where it Matters. SMDB 2020. Parimarjan Negi, Ryan Marcus, Hongzi Mao, Nesime Tatbul, Tim Kraska, Mohammad Alizadeh
- Bao: Making Learned Query Optimization Practical. SIGMOD 2021. Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Alizadeh, Tim Kraska
- Flow-Loss: Learning Cardinality Estimates That Matter. VLDB 2021. Parimarjan Negi, Ryan Marcus, Andreas Kipf, Hongzi Mao, Nesime Tatbul, Tim Kraska, Mohammad Alizadeh
Open Source
We have open-sourced a number of our projects:
- Search on Sorted Data Benchmark (SOSD). https://github.com/learnedsystems/SOSD.
- Reference implementation of recursive model index (RMI). https://github.com/learnedsystems/RMI.
- RadixSpline: A Single-Pass Learned Index. https://github.com/learnedsystems/RadixSpline.
- Park: An Open Platform for Learning Augmented Computer Systems. https://github.com/park-project/park.
- Reference implementation of Learned Sort. https://github.com/learnedsystems/LearnedSort.
- ALEX: An Updatable Adaptive Learned Index. https://github.com/microsoft/ALEX.
- Bao: Learning to Steer Query Optimizers. https://github.com/learnedsystems/baoforpostgresql.
- Self-Organizing Data Containers (SDC). Will be released soon.