Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine (Andrew Lamb)

Summary notes created by Deciphr AI

https://youtu.be/iJhRbDFJjbg?si=c-Afb7oPOWpl0c1l

Abstract

The seminar, sponsored by Google, introduces the Database Building Blocks series, which focuses on modular libraries such as Apache Data Fusion for building database systems. Andrew Lamb, a staff engineer at InfluxDB and chair of the Apache Arrow Data Fusion PMC, discusses the architecture and functionality of Data Fusion, likening it to LLVM for databases. Data Fusion lets developers customize OLAP databases without rewriting core components, emphasizing ease of use and industrial best practices. The seminar highlights Data Fusion's modular design, performance characteristics, and community contributions, with a focus on scalability and efficient resource management using Rust and Tokio.

Summary Notes

Introduction to Database Building Blocks Seminar Series

  • The seminar series focuses on database libraries that can be combined to build comprehensive systems.
  • Data Fusion is highlighted as a trending library in this space.

"This semester dat seminar series is on what we're calling the database building blocks."

  • The series is designed to explore foundational libraries for database systems.

Overview of Data Fusion

  • Data Fusion is likened to LLVM for databases, enabling innovation without reinventing foundational components.
  • It allows developers to focus on application-specific features rather than rebuilding core database functionalities.

"In my mind, the best analogy is that it's the LLVM for databases."

  • Data Fusion serves as a shared foundation for building various data systems, particularly OLAP databases.
  • It provides a full-featured SQL query engine that can be customized extensively.

"You get a full-featured SQL query engine, and then you can basically customize everything."

Architectural Goals of Data Fusion

  • Designed to work out of the box with minimal setup, allowing for quick deployment.
  • Prioritizes industrial best practices over the latest research to ensure reliability and ease of understanding.

"The goal is that it works out of the box... you get up and running out of the box as soon as possible."

  • Encourages experimentation and customization in real production systems.

Use Cases and Applications

  • Used to build entire database systems, streaming systems, and time-series databases.
  • Employed as an execution engine for various front ends and academic experiments.

"We have people who use it to build entire database systems... and time-series databases."

  • Supports implementations of table formats like Delta Lake, Iceberg, and Hudi.

Performance Considerations

  • Despite being modular, Data Fusion achieves competitive performance through well-understood module boundaries.
  • Emphasizes engineering resources as the primary factor for performance optimization.

"We have very good performance in Data Fusion... it's how much engineering effort you can invest."

  • Benefits from the Arrow format, which allows for efficient data handling and kernel optimization.

Community and Development

  • Data Fusion was founded independently and donated to the Apache Arrow project.
  • The community is robust, with hundreds of contributors and regular releases.

"It was actually founded by Andy Grove... now has hundreds of distinct contributors."

  • Positioned to support the increasing specialization of database systems.

Architecture of Data Fusion

  • The architecture includes a catalog, various data sources, front ends, logical and physical plan representations, and an execution engine.
  • Extensible at multiple points, allowing for custom functions, operators, and execution streams.

"It's got a catalog and various data sources... you can basically extend the system anywhere."

  • Uses Apache Arrow internally for data handling, leveraging its optimized kernels and avoiding conversion costs.

"It uses Arrow internally... you can take advantage of all the engineering effort that goes in there."

Implementation in Rust

  • Written in Rust for memory and thread safety, modern tooling, and community interest.
  • Rust's safety features have minimized memory issues and improved reliability.

"Rust, the upside of Rust, the pitch was it has memory and thread safety."

  • The choice of Rust has attracted contributors interested in learning the language and has proven effective for database implementations.

Memory Safety and Compiler Enforcement in Open Source Projects

  • Memory safety constraints enforced by the compiler are crucial for large open source projects.
  • The compiler minimizes the risk of errors such as race conditions, which are common in languages like C++.
  • Compiler enforcement allows more contributors to participate in the project without causing significant damage.

"By having the compiler enforce the memory safety constraints, it actually helps a lot when what you're building is this big open source project."

  • The compiler's role is vital in ensuring memory safety in large-scale projects, reducing reliance on expert code reviewers.

Package Management and Dependencies in Rust

  • Rust's package manager simplifies starting new projects, exemplified by the ease of setting up a project with Data Fusion.
  • Adding dependencies in Rust is straightforward, leading to a large dependency tree, similar to Python projects.
  • While beneficial, the ease of adding dependencies can result in a non-trivial dependency footprint.

"You run cargo blah blah blah my new project, you go into the directory and then you go cargo add data fusion and you build it and like now you're ready to rock."

  • Rust's package manager allows for a quick setup of projects, making it accessible for developers to start working immediately.
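
The flow the quote describes is easy to sketch end to end. Below is a minimal, hedged example of what such a project might look like; the table name, file name, and query are placeholders, and exact APIs shift slightly between DataFusion releases.

```rust
// Shell setup, roughly as described above:
//   cargo new my_new_project && cd my_new_project
//   cargo add datafusion
//   cargo add tokio --features full
//
// src/main.rs: run a SQL query against a local CSV file.
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // "example.csv" is a placeholder; any CSV with a header row works.
    ctx.register_csv("example", "example.csv", CsvReadOptions::new()).await?;

    // Full SQL support out of the box.
    let df = ctx.sql("SELECT count(*) FROM example").await?;
    df.show().await?;
    Ok(())
}
```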

SQL Planning and Execution in Data Fusion

  • SQL planning in Data Fusion involves transforming a SQL query into a logical plan, followed by physical planning and execution.
  • Logical plans are represented as relational operator trees, processed from bottom to top.
  • The output of execution is a stream of Arrow record batches, each holding a chunk of rows stored column by column as Arrow arrays.

"When you see you get to SQL query, it starts a text... turns into something called the logical plan."

  • SQL planning in Data Fusion involves multiple stages, starting from parsing and resulting in logical and physical plans.
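
A hedged sketch of those stages through the Rust API (table and file names are placeholders; method names follow recent DataFusion releases and may differ in older ones):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("t", "data.csv", CsvReadOptions::new()).await?;

    // Parsing + logical planning: SQL text becomes a DataFrame wrapping a LogicalPlan.
    let df = ctx.sql("SELECT a, count(*) FROM t GROUP BY a").await?;
    println!("{}", df.logical_plan().display_indent());

    // Physical planning: the optimized logical plan becomes an ExecutionPlan tree.
    let _physical = df.clone().create_physical_plan().await?;

    // Execution: the plan runs and produces Arrow record batches.
    let batches = df.collect().await?;
    println!("{} record batches", batches.len());
    Ok(())
}
```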

Logical and Physical Plans in Data Fusion

  • Logical plans in Data Fusion are data flow graphs representing operations like project, filter, window, aggregate, and sort.
  • Physical plans are lower-level execution plans that involve vectorized operations on data.
  • Data Fusion allows for extension nodes in logical plans, enabling custom operations not built into the system.

"Logical plans are basically data flow graphs... they represent data flowing from sources down at the bottom flowing up through the plan."

  • Logical plans represent the flow of data through relational operations, while physical plans focus on execution details.
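
One easy way to see both representations for a query is EXPLAIN, which prints the logical plan above the physical plan. A small hedged sketch (table, column, and file names are placeholders):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("t", "data.csv", CsvReadOptions::new()).await?;

    // EXPLAIN shows the logical plan (projection, filter, aggregate, sort nodes)
    // and the physical plan (the vectorized operators that will actually run).
    ctx.sql("EXPLAIN SELECT a, count(*) FROM t WHERE b > 10 GROUP BY a ORDER BY a")
        .await?
        .show()
        .await?;
    Ok(())
}
```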

User-Defined Functions and Extensions in Data Fusion

  • Data Fusion supports creating user-defined functions using a consistent API.
  • Both built-in and user-defined functions are implemented using the same interface.
  • The system allows for creating custom query languages through logical plan extensions.

"All the functions are actually implemented using the same API... you can implement scaler aggregate window functions table functions."

  • Data Fusion provides a flexible API for implementing various types of functions, ensuring extensibility and customization.
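
As a rough illustration of that API, here is a sketch of a scalar UDF registered on a SessionContext. The function name and logic are invented, and the exact create_udf signature (for example, DataType vs. Arc<DataType> for the return type, and the helper methods on ColumnarValue) varies between DataFusion releases.

```rust
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Float64Array};
use datafusion::arrow::datatypes::DataType;
use datafusion::error::Result;
use datafusion::logical_expr::{create_udf, ColumnarValue, Volatility};
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // UDF body: operates on whole Arrow arrays, not one row at a time.
    let square = Arc::new(|args: &[ColumnarValue]| -> Result<ColumnarValue> {
        let arrays = ColumnarValue::values_to_arrays(args)?;
        let input = arrays[0]
            .as_any()
            .downcast_ref::<Float64Array>()
            .expect("argument should be Float64");
        let out: Float64Array = input.iter().map(|v| v.map(|x| x * x)).collect();
        Ok(ColumnarValue::Array(Arc::new(out) as ArrayRef))
    });

    // Built-in functions go through the same interface; this just adds another one.
    let udf = create_udf(
        "square",
        vec![DataType::Float64],
        DataType::Float64, // some releases expect Arc<DataType> here
        Volatility::Immutable,
        square,
    );

    let ctx = SessionContext::new();
    ctx.register_udf(udf);
    ctx.sql("SELECT square(2.0)").await?.show().await?;
    Ok(())
}
```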

Catalog API and File-Based Catalogs in Data Fusion

  • The catalog API in Data Fusion provides information on tables and schemas, with implementations for memory-based and file-based catalogs.
  • File-based catalogs follow a directory structure, treating directories as tables and subdirectories as partitions.
  • The catalog API is asynchronous, allowing for network-based implementations.

"Catalog is like what tables exist what schemas exist that kind of stuff... what comes with data Fusion is two this memory based one... and then there's also one that's like a directory of files."

  • Data Fusion's catalog API is designed to be simple yet flexible, supporting both in-memory and file-based catalogs for managing table metadata.
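
A hedged sketch of that API using the built-in in-memory implementations; the catalog and schema names "datafusion" and "public" are the out-of-the-box defaults in recent releases.

```rust
use std::sync::Arc;

use datafusion::arrow::array::Int32Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Build a tiny in-memory table and register it in the default schema.
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;
    let table = MemTable::try_new(schema, vec![vec![batch]])?;
    ctx.register_table("nums", Arc::new(table))?;

    // The catalog API answers "what catalogs, schemas, and tables exist?"
    let catalog = ctx.catalog("datafusion").expect("default catalog");
    let schema_provider = catalog.schema("public").expect("default schema");
    println!("tables: {:?}", schema_provider.table_names());
    Ok(())
}
```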

Data Fusion Overview

  • Data Fusion is a platform that supports various table formats and data processing optimizations.
  • It includes common file formats like CSV, JSON, Parquet, and Arrow, with community-provided interfaces for databases like SQLite and MySQL.
  • The Table Provider API is central to how Data Fusion reads data, managing projections, filters, and execution plans.

"The prepackaged one in terms of table providers and table formats is a whole B like it includes common file formats like CSV, Json, Parquet, and Arrow."

  • Data Fusion supports a wide range of file formats, providing flexibility in data integration.

"When Data Fusion wants to read data, it calls scan and it passes projection filter push down projections push down filters push down Li so the which columns to read which Expressions to apply if you have some limit of number of rows and it returns an execution plan."

  • The Table Provider API in Data Fusion manages data reading, applying filters, and creating execution plans.
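
To make the shape of that API concrete, here is a hedged sketch of a custom TableProvider that simply delegates to an in-memory table; a real provider would read files, object storage, or a remote system. The scan signature shown follows recent releases (older ones take &SessionState rather than &dyn Session), and the async-trait crate is assumed, so treat this as illustrative rather than exact.

```rust
use std::any::Any;
use std::sync::Arc;

use async_trait::async_trait;
use datafusion::arrow::datatypes::SchemaRef;
use datafusion::catalog::Session;
use datafusion::datasource::{MemTable, TableProvider};
use datafusion::error::Result;
use datafusion::logical_expr::TableType;
use datafusion::physical_plan::ExecutionPlan;
use datafusion::prelude::Expr;

/// Hypothetical provider that wraps a MemTable.
#[derive(Debug)]
struct MyTable {
    inner: MemTable,
}

#[async_trait]
impl TableProvider for MyTable {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn schema(&self) -> SchemaRef {
        TableProvider::schema(&self.inner)
    }

    fn table_type(&self) -> TableType {
        TableType::Base
    }

    // DataFusion pushes the projection (which columns), filters (which
    // expressions), and row limit into scan, which returns an ExecutionPlan
    // that actually performs the read.
    async fn scan(
        &self,
        state: &dyn Session,
        projection: Option<&Vec<usize>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        self.inner.scan(state, projection, filters, limit).await
    }
}
```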

Parquet Reader and Optimizations

  • Data Fusion's Parquet reader is highly sophisticated, offering features like row group and data page pruning, late materialization, and metadata caching.
  • It supports advanced optimizations like IO pushdowns and external index information.

"If you look at the Parquet reader that's in Data Fusion, it's about it has about as all the tricks that I know that you can play with Parquet readers."

  • The Parquet reader in Data Fusion is advanced, incorporating numerous optimization techniques for efficient data processing.

"It does late materialization like they'll actually read only some columns apply the filters and then only materialize values from the subsequent columns that match the first."

  • Late materialization in Data Fusion allows for efficient data processing by only materializing necessary data.
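
A hedged sketch of how that looks from the user side: row-group and page pruning happen automatically from Parquet statistics, while filter pushdown into the decoder (late materialization) is controlled by a config flag in recent releases. The file and table names are placeholders, and config keys and constructors can differ across versions.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Opt in to evaluating filters while decoding Parquet, so only matching
    // rows are materialized from the remaining columns.
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true);
    let ctx = SessionContext::new_with_config(config);

    // Row-group and data-page pruning based on min/max statistics happens
    // automatically when the predicate can be evaluated against them.
    ctx.register_parquet("hits", "hits.parquet", ParquetReadOptions::default())
        .await?;
    ctx.sql("SELECT count(*) FROM hits WHERE url LIKE 'https://%'")
        .await?
        .show()
        .await?;
    Ok(())
}
```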

Logical Query Optimization

  • Logical query optimization involves rewriting logical plans to compute the same results more efficiently.
  • Built-in optimizations include push-down projections, limits, filters, expression simplifications, and subquery decorrelations.

"What does logical query optimization mean? It's basically you have a logical plan there's a rewrite that goes on that tries to make the basically compute the same thing but does it faster."

  • Logical query optimization in Data Fusion focuses on enhancing query performance by rewriting plans.

"There's an API so if you don't like what's in there you need something new you can add your own and the API literally is you get the logical plan and you have to return a logical plan."

  • The API allows users to customize logical query optimizations by adding new rules or modifying existing ones.
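
A hedged sketch of that shape: a do-nothing optimizer rule. The exact trait methods have shifted across releases (older ones use try_optimize or optimize, newer ones use rewrite returning a Transformed plan), so this follows one of the older signatures purely for illustration.

```rust
use datafusion::error::Result;
use datafusion::logical_expr::LogicalPlan;
use datafusion::optimizer::{OptimizerConfig, OptimizerRule};

// Hypothetical rule: the essential contract is "LogicalPlan in, LogicalPlan out".
#[derive(Debug)]
struct MyRule;

impl OptimizerRule for MyRule {
    fn name(&self) -> &str {
        "my_rule"
    }

    fn try_optimize(
        &self,
        _plan: &LogicalPlan,
        _config: &dyn OptimizerConfig,
    ) -> Result<Option<LogicalPlan>> {
        // Return Some(rewritten_plan) if this rule changed the plan, None otherwise.
        Ok(None)
    }
}
```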

Join Ordering and Syntactic Optimization

  • Join ordering in Data Fusion uses a syntactic optimizer: joins are planned in the order they appear in the query.
  • Complex join ordering is challenging due to unsolved issues in cardinality estimation.

"What semantic optimizer means basically is you plan the joins in the orders like they appear in the query that you like wrote them."

  • Syntactic optimization in Data Fusion preserves the join order as written in the query.

"Cardinality estimation is really hard it's basically unsolved people still write papers about it it's still unsolved and without that complex rejoin ordering is really really hard."

  • The complexity of join ordering is attributed to the challenges in accurate cardinality estimation.

Physical Planning and Execution

  • Physical planning transforms logical plans into execution plans that generate streams of record batches.
  • The executor model in Data Fusion is a pull-based, volcano-style system that operates on batches of rows.

"The way physical works you start with logical plan and eventually you get to this execution plan thing and the execution plan thing eventually executes and generates multiple streams of these AR record batches."

  • Physical planning in Data Fusion involves generating execution plans that produce streams of data for processing.

"Data Fusion uses a like it's classic volcano style pull based I'll explain how this works it works great has exchange operators or the equivalent it's a streaming engine that means it doesn't buffer stuff internally unless it has to."

  • The executor model in Data Fusion is pull-based and streaming, minimizing internal buffering.
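
A hedged sketch of that pull-based model from the caller's side: each partition of the physical plan is a stream, and polling it yields one Arrow record batch at a time (the futures crate provides StreamExt; names and paths are placeholders).

```rust
use datafusion::physical_plan::execute_stream;
use datafusion::prelude::*;
use futures::StreamExt;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("t", "data.csv", CsvReadOptions::new()).await?;

    let df = ctx.sql("SELECT a, count(*) FROM t GROUP BY a").await?;
    let plan = df.create_physical_plan().await?; // the physical ExecutionPlan tree

    // Volcano-style pull: each poll of the stream produces one RecordBatch,
    // and nothing is buffered beyond what the operators themselves need.
    let mut stream = execute_stream(plan, ctx.task_ctx())?;
    while let Some(batch) = stream.next().await {
        println!("{} rows", batch?.num_rows());
    }
    Ok(())
}
```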

Optimizer Soundness and Correctness

  • Ensuring the soundness and correctness of optimizer passes relies on schema checks and user responsibility for semantic accuracy.
  • There's no theoretical validation for ensuring optimized plans produce the same results as unoptimized ones.

"There's a check that make sure at least the schemas come out that are the same but otherwise I don't think there there's no I'm not sure there's a theoretical way to validate that they compute the same same answer."

  • Schema checks are conducted to ensure correctness, but semantic validation is largely user-dependent.

Memory Management and Resource Tracking

  • Data Fusion uses cooperative memory management, tracking large allocations while ignoring smaller ones.
  • The default memory pool gives operators as much memory as they request until depletion, with alternative strategies for fair distribution.

"Memory dis management is done by cooperatively reporting the Giant allocations and ignoring the small ones and assuming they're accounted for in some slop."

  • Memory management in Data Fusion focuses on tracking significant allocations while assuming smaller ones are covered in general overhead.

"The prepackaged memory of course is in trait so you can provide your own implementation memory pool but the basic one is just it'll give operators as much memory as they ask for until it runs out."

  • Custom memory pool implementations can be provided, allowing for tailored memory management strategies.
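
A hedged sketch of swapping in a bounded pool: here a FairSpillPool capped at roughly 1 GiB, shared across spilling operators, replaces the default give-until-it-runs-out behavior. Constructor names and module paths differ somewhat between releases.

```rust
use std::sync::Arc;

use datafusion::execution::memory_pool::FairSpillPool;
use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
use datafusion::prelude::*;

fn make_ctx() -> datafusion::error::Result<SessionContext> {
    // Large ("tracked") allocations are charged against this pool; small
    // allocations are assumed to be covered by general overhead.
    let pool = Arc::new(FairSpillPool::new(1024 * 1024 * 1024));
    let runtime = RuntimeEnv::new(RuntimeConfig::new().with_memory_pool(pool))?;
    Ok(SessionContext::new_with_config_rt(
        SessionConfig::new(),
        Arc::new(runtime),
    ))
}

fn main() -> datafusion::error::Result<()> {
    let _ctx = make_ctx()?;
    Ok(())
}
```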

Scheduling and Thread Management

  • Data Fusion utilizes Tokio for thread scheduling, benefiting from its high-quality work-stealing scheduler.
  • The use of async features in Rust allows for efficient handling of IO and CPU-bound tasks.

"Data Fusion uses something called Tokyo it's Implement like it implements the CPU bound work on a Tokyo uh thread pool it's called a runtime Tokyo is this thing that came out of the rust ecosystem it was designed for networking."

  • Tokio provides an efficient thread scheduling system for Data Fusion, leveraging Rust's async capabilities.

"If you have to do IO or something right in your middle of doing one of these operations uh the Tokyo schedule handles that and can basically take off and and do your IO and come back and rerun you when you're ready."

  • Tokio's scheduling system allows for seamless handling of IO operations, improving overall system efficiency.
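
A hedged sketch of the runtime setup this implies: DataFusion's operators are async, so they run on whichever Tokio runtime polls them, commonly a multi-threaded runtime with one worker per core (crate features and exact setup are up to the embedding application; names and paths are placeholders).

```rust
use datafusion::prelude::*;

fn main() -> datafusion::error::Result<()> {
    let workers = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(4);

    // A work-stealing, multi-threaded Tokio runtime for CPU-bound query work.
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(workers)
        .enable_all()
        .build()
        .expect("failed to build Tokio runtime");

    runtime.block_on(async {
        let ctx = SessionContext::new();
        ctx.register_csv("t", "data.csv", CsvReadOptions::new()).await?;
        // Whenever an operator awaits IO, the worker thread is free to run
        // other tasks, and the query resumes when the IO completes.
        ctx.sql("SELECT count(*) FROM t").await?.show().await
    })
}
```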

Rust Stream Ecosystem and Scheduling

  • Discussion on the challenges and advantages of different scheduling systems within the Rust ecosystem.
  • Comparison between push-based and pull-based scheduling, with specific reference to Tokio and other systems.
  • Emphasis on the complexity and inefficiency of implementing a push-based, morsel-driven scheduler compared to using Tokio.

"I think a lot of people complain about the Rust stream ecosystem... it's bad but compared to everything else it's like the least bad of all the other options."

  • The Rust stream ecosystem, despite its flaws, is seen as a comparatively better option than others.

"We actually tried to implement a push-based morsel scheduler... but basically he couldn't show there was no significant benefits over using Tokyo."

  • Attempts to implement a push-based scheduler did not yield significant improvements over Tokio, highlighting Tokio's efficiency.

Performance and Control in Scheduling

  • Analysis of performance benefits and limitations of using Tokio.
  • Discussion on the control plane and data plane separation in scheduling and its impact on performance.
  • The importance of managing CPU and IO operations efficiently without running them on the same thread pool.

"With Tokyo, like you can tell it to cancel a pending... it's essentially coroutine without like that mechanism."

  • Tokio offers efficient cancellation mechanisms, enhancing control over scheduling tasks.

"You don't have to handle all that canceling explicitly because it's all sort of automatically been generated."

  • Automatic handling of task cancellation in Tokyo simplifies the process and enhances performance.
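
A hedged sketch of what that cancellation looks like in practice: the query is just a Tokio task pulling batches from a stream, and aborting the task (or dropping the stream) stops the generated state machines from being polled, with no explicit cancellation flags inside operators (names and paths are placeholders).

```rust
use datafusion::physical_plan::execute_stream;
use datafusion::prelude::*;
use futures::StreamExt;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("t", "data.csv", CsvReadOptions::new()).await?;
    let plan = ctx
        .sql("SELECT * FROM t ORDER BY 1")
        .await?
        .create_physical_plan()
        .await?;
    let task_ctx = ctx.task_ctx();

    // Run the query as a background task that pulls record batches.
    let handle = tokio::spawn(async move {
        let mut stream = execute_stream(plan, task_ctx)?;
        while let Some(batch) = stream.next().await {
            batch?;
        }
        Ok::<_, datafusion::error::DataFusionError>(())
    });

    // Cancelling the query is just aborting the task; pending work is dropped.
    handle.abort();
    Ok(())
}
```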

Work Stealing and Parallelism

  • Examination of work stealing as a core feature of Tokio, enabling efficient task distribution across threads.
  • Limitations of setting parallelism at plan time without dynamic scaling during execution.
  • Discussion on resource management, specifically CPU and RAM, during task execution.

"The benefits of Tokyo is like from that work stealing... you have to basically set the plan the parallelism at plan time."

  • Tokio's work stealing is a major advantage, but it requires setting parallelism upfront, limiting dynamic scaling.

"In my experience needing to switch CPU down during execution is actually typically like the last resource... you almost always run out of RAM first."

  • CPU resource management is less critical compared to RAM, which often becomes a bottleneck first.
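
As noted above, parallelism is fixed when the plan is created. A hedged sketch of how that degree of parallelism is chosen, via the target_partitions setting (file and table names are placeholders; constructor names vary slightly between releases):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Operators are planned with this many partitions (concurrent streams);
    // the count does not grow or shrink while the query runs.
    let config = SessionConfig::new().with_target_partitions(16);
    let ctx = SessionContext::new_with_config(config);

    ctx.register_parquet("t", "data.parquet", ParquetReadOptions::default())
        .await?;
    ctx.sql("SELECT a, count(*) FROM t GROUP BY a")
        .await?
        .show()
        .await?;
    Ok(())
}
```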

Data Fusion vs. DuckDB Performance

  • Detailed comparison of Data Fusion and DuckDB performance across various benchmarks.
  • Emphasis on engineering effort rather than architectural differences as the primary factor in performance outcomes.
  • Exploration of specific performance metrics and areas for improvement in Data Fusion.

"The differences in performance is 100% engineering effort really than the architecture."

  • Performance differences are attributed to engineering efforts rather than fundamental architectural disparities.

"There's no fundamental difference in the scheduler that precludes you from getting good performance scalability up to 172 cores."

  • Both systems exhibit similar scalability, indicating no inherent scheduler limitations affecting performance.

Optimization and Statistics in Data Fusion

  • Overview of the optimization strategies used in Data Fusion, focusing on heuristic rather than cost-based methods.
  • Discussion on the absence of extensive statistics in Data Fusion's catalog and its impact on optimization capabilities.
  • Consideration of parallel operations and their effect on concurrency and contention.

"There's no cost-based optimizer in there... there's a bunch of heuristic reorders."

  • Data Fusion relies on heuristic optimizations rather than cost-based methods due to limited statistics.

"When you parallelize these operations on multiple cores... reshuffle data especially when it does hash partitioning... is definitely a compute-intensive operation."

  • Parallel operations, especially data reshuffling, are computationally intensive and require careful management to avoid contention.

Future Directions and Community Involvement

  • Encouragement for community contributions to enhance Data Fusion, particularly in areas like high cardinality aggregation.
  • Mention of ongoing projects like OPD (a new query project based on Data Fusion) and potential collaborations.
  • Call for more involvement and investment to drive innovation and improvements in Data Fusion.

"I'm trying to coax the same kind of excitement around Data Fusion to get the same level of investment."

  • There is a need for increased community engagement and investment to drive improvements in Data Fusion.

"We'd love to make it easier for you to do whatever you're doing... we'd love to collaborate more."

  • Open invitation for collaboration and contributions to enhance Data Fusion's capabilities and performance.
