In a recent seminar, Mehmet Ozan Kabak, CEO and co-founder of Synnada, discussed building a unified compute engine leveraging Apache DataFusion. Synnada aims to simplify data and AI infrastructures by creating a "Spark 2.0" that integrates modern technological advancements. The talk highlighted the need for engines that seamlessly unify batch and streaming data processing using standard SQL, address out-of-order data efficiently, and incorporate AI natively. Synnada's approach involves utilizing DataFusion's capabilities, developing a flexible distribution layer, and enhancing observability for robust data application development. Their goal is to offer a scalable, composable infrastructure that reduces complexity in building data-intensive applications.
Introduction to Unified Compute Engines
- Unified compute engines are designed to integrate various data processing tasks into a single framework.
- They aim to simplify the development and maintenance of data-intensive applications by reducing the need for multiple separate tools.
- These engines are particularly relevant in modern contexts where both batch and streaming data processing are necessary.
"We are going to talk about how we use data Fusion to build a unified computer engine."
- The focus is on building a unified compute engine using Data Fusion, highlighting its importance and applicability in modern data processing.
Background of the Speaker and Company
- Mehmet Ozan Kabak is the CEO and co-founder of Synnada, with a background in stream processing, machine learning, and big data.
- Synnada is heavily involved in open-source projects and has contributed significantly to Apache DataFusion.
- The company aims to develop a modern compute engine that reflects current technological advancements.
"I've been working in the fields of stream processing and machine learning and big data for like for a long time now more than 10 years."
- Mmed Oan Kabak has extensive experience in relevant fields, providing credibility to his insights on unified compute engines.
The Need for Unified Compute Engines
- The complexity of current data and AI infrastructures necessitates a unified approach to simplify processes.
- Existing technologies like Spark and Flink were designed under outdated assumptions, leading to the need for multiple additional tools.
- A modern compute engine could reduce the complexity and number of tools required for data processing.
"The centerpiece technologies that play a large role in these infrastructure diagrams like spark and Flink... they're old and they were designed in a world where the prevailing technical ideas and assumptions were different than today."
- Current technologies are outdated, necessitating a new approach to handle modern data processing needs effectively.
Building a Modern Compute Engine
- Synnada aims to create a "Spark 2.0" that simplifies data and AI infrastructures by integrating modern technological advancements.
- The goal is to provide a unified solution that reduces the need for multiple tools and simplifies the architecture of data-intensive applications.
"We want to build this spark 2.0 we want to simplify data Ai infrastructures and sin then aspires to be I guess data briak 2.0."
- The vision is to develop a modern compute engine that integrates various functionalities, reducing complexity and improving efficiency.
Example: Forecasting Traffic from Microservices
- Developing a data application for forecasting traffic involves both batch and streaming data processing considerations.
- The complexity of building such applications arises from the need to integrate multiple tools and technologies.
- A unified compute engine could streamline this process by handling both batch and streaming data within a single framework.
"Building this data app is not as simple as it seems from a simple description of what you want to do but I just just wanted to you know forecast the traffic like why is it so complicated."
- The complexity of data applications stems from the need to integrate diverse tools, which a unified compute engine could simplify.
Simplifying with Unified Compute Engines
- A unified compute engine could potentially handle various considerations like memory management, fault tolerance, and model updates automatically.
- By providing a single interface, such as SQL, users can focus on business logic without worrying about underlying technical complexities.
"Let's try to imagine a world where I can just give this to the engine and the engine does its best to take care of all these questions."
- The vision is to create a compute engine that abstracts complex technical details, allowing users to focus solely on business logic.
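To make this concrete, here is a minimal sketch of what the "just give it to the engine" experience could look like with DataFusion's Rust API. The `request_logs` table, its columns, and the file path are hypothetical stand-ins for the microservice traffic example above; the point is that the user states the business logic in SQL and leaves execution concerns to the engine.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Hypothetical source: per-request logs emitted by the microservices.
    ctx.register_csv("request_logs", "request_logs.csv", CsvReadOptions::new())
        .await?;

    // The user states *what* they want, hourly traffic counts, in plain
    // SQL; memory management, fault tolerance, and (eventually) model
    // updates are the engine's problem. A model-backed `forecast` UDF
    // could be layered on top (see the UDF sketch later in this section).
    let df = ctx
        .sql(
            "SELECT date_trunc('hour', ts) AS hour, COUNT(*) AS requests
             FROM request_logs
             GROUP BY date_trunc('hour', ts)
             ORDER BY hour",
        )
        .await?;
    df.show().await?;
    Ok(())
}
```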
Conclusion
- The development of a unified compute engine is driven by the need to simplify data processing infrastructures.
- By integrating modern technologies and providing a unified interface, these engines aim to reduce complexity and improve efficiency in data-intensive applications.
"Is it possible to really do this I think the answer is yes maybe not not in its entirety."
- While achieving a completely unified compute engine may not be fully possible yet, significant strides can be made towards simplifying data processing.
Unified Compute for Stream and Batch Processing
- The Kappa architecture suggests a unified compute system that integrates stream and batch processing.
- A unified engine should process both types of data using standard SQL without requiring changes to the user-facing API.
- Queries that are theoretically executable on streaming data should run seamlessly without engine-specific modifications.
"A unified engine should accept standard SQL to process both kinds of data. The data can be available to you as a whole, could be at rest, and you can process it that way, or maybe it's coming to you piecemeal."
- This quote emphasizes the need for a unified engine that handles both batch and streaming data using standard SQL, ensuring user experience remains consistent.
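As a small illustration of this requirement, the sketch below (hypothetical `events` table and file path) plans ordinary SQL against a bounded source with DataFusion. The unification claim is that the same query text, pointed at a source registered as unbounded, should run continuously with no dialect changes.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Bounded (batch) source: a file at rest.
    ctx.register_csv("events", "events.csv", CsvReadOptions::new())
        .await?;

    // Standard SQL, no engine-specific dialect. On a bounded table this
    // runs to completion; pointed at a source registered as unbounded,
    // a unified engine should plan the same text as a continuously
    // running, incremental query.
    ctx.sql("SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id")
        .await?
        .show()
        .await?;
    Ok(())
}
```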
Challenges with Current Engines
- Current engines like Spark and Flink require query modifications due to internal engine limitations.
- A truly unified engine should not require query alterations unless the query is theoretically non-executable.
"Spark and Flink unfortunately require you to change your query to appease the internals of the engine."
- The quote highlights the limitations of current engines in processing queries without requiring user intervention to modify them according to engine constraints.
Example of Unified Engine Query
- An example query involves converting sales transactions from various currencies to US dollars using exchange rates streamed from an API.
- The query should execute efficiently if the engine is capable of handling it without unnecessary modifications.
"For every unique transaction, look at the exchange rates that are in the past that came before this transaction, take the last one, and dollarize using that exchange rate."
- This quote illustrates a practical example of a query that should run efficiently on a unified engine, highlighting the need for engines to handle such scenarios without issues.
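The described "dollarization" is an as-of join. As a sketch in portable SQL (hypothetical `transactions` and `rates` tables), the last exchange rate at or before each transaction can be expressed with a correlated subquery:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("transactions", "transactions.csv", CsvReadOptions::new())
        .await?;
    ctx.register_csv("rates", "rates.csv", CsvReadOptions::new())
        .await?;

    // For each transaction, take the most recent exchange rate at or
    // before it (an "as-of" join) and dollarize with that rate.
    ctx.sql(
        "SELECT t.id,
                t.amount * (SELECT r.rate
                            FROM rates r
                            WHERE r.currency = t.currency AND r.ts <= t.ts
                            ORDER BY r.ts DESC
                            LIMIT 1) AS amount_usd
         FROM transactions t",
    )
    .await?
    .show()
    .await?;
    Ok(())
}
```

Whether an engine runs this as written, and how efficiently, is precisely the unification test: a planner that tracks sorted inputs can execute it incrementally, while one without that machinery may reject the query or demand a rewrite.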
Key Requirements for a Unified Engine
- Fully leverage data ordering during planning and maintain order-preserving variants of fundamental operators.
- Plan with awareness of unbounded streams and use rigorous techniques like interval arithmetic for range calculations.
"You have to keep track of data ordering across all operators in your plan without any loss of information."
- The quote underscores the importance of maintaining data order throughout the planning process to ensure efficient query execution.
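A toy sketch of the interval-arithmetic idea (not DataFusion's actual implementation): given ranges for input columns, the planner can bound the range of a whole expression without reading any rows, which is what makes rigorous pruning decisions possible.

```rust
// Minimal interval arithmetic over expression ranges.
#[derive(Clone, Copy, Debug)]
struct Interval {
    lo: f64,
    hi: f64,
}

impl Interval {
    fn add(self, other: Interval) -> Interval {
        Interval { lo: self.lo + other.lo, hi: self.hi + other.hi }
    }

    fn mul(self, other: Interval) -> Interval {
        // The product's range is bounded by the extreme corner products.
        let c = [self.lo * other.lo, self.lo * other.hi,
                 self.hi * other.lo, self.hi * other.hi];
        Interval {
            lo: c.iter().cloned().fold(f64::INFINITY, f64::min),
            hi: c.iter().cloned().fold(f64::NEG_INFINITY, f64::max),
        }
    }
}

fn main() {
    // Known column ranges: a in [0, 10], b in [-2, 3].
    let a = Interval { lo: 0.0, hi: 10.0 };
    let b = Interval { lo: -2.0, hi: 3.0 };
    // Bound the expression a + a*b without seeing any rows.
    let range = a.add(a.mul(b));
    println!("a + a*b lies within [{}, {}]", range.lo, range.hi); // [-20, 40]
}
```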
Handling Out-of-Order Data
- Out-of-order data is a significant challenge in streaming infrastructures, requiring effective handling strategies.
- Two common approaches include the pure data flow model and the pure update table model, each with its advantages and drawbacks.
"Out-of-order data is the bane of streaming infrastructure's existence."
- The quote emphasizes the challenge posed by out-of-order data and the necessity for robust handling mechanisms in streaming systems.
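To ground the trade-off, here is a toy sketch of the pure dataflow model's core move, with made-up types: buffer until a watermark promises no earlier data, emit in order, and drop anything that still arrives late. The update-table model avoids the dropping at the cost of keeping amendable state.

```rust
use std::collections::BTreeMap;

struct Event {
    ts: u64,
    value: f64,
}

struct WatermarkBuffer {
    watermark: u64,
    pending: BTreeMap<u64, Vec<f64>>, // event time -> buffered values
}

impl WatermarkBuffer {
    fn new() -> Self {
        Self { watermark: 0, pending: BTreeMap::new() }
    }

    // Buffer an event, or drop it if it is already behind the watermark
    // (the classic dataflow-model trade-off for out-of-order data).
    fn push(&mut self, e: Event) {
        if e.ts < self.watermark {
            eprintln!("late event at t={} dropped", e.ts);
            return;
        }
        self.pending.entry(e.ts).or_default().push(e.value);
    }

    // Advance the watermark and emit everything now known to be in order.
    fn advance(&mut self, wm: u64) -> Vec<(u64, f64)> {
        self.watermark = wm;
        let keep = self.pending.split_off(&wm);
        let out: Vec<(u64, f64)> = self
            .pending
            .iter()
            .flat_map(|(t, vs)| vs.iter().map(move |v| (*t, *v)))
            .collect();
        self.pending = keep;
        out
    }
}

fn main() {
    let mut buf = WatermarkBuffer::new();
    buf.push(Event { ts: 5, value: 1.0 });
    buf.push(Event { ts: 3, value: 2.0 });
    println!("{:?}", buf.advance(4)); // emits (3, 2.0); keeps t=5 buffered
    buf.push(Event { ts: 2, value: 9.0 }); // behind the watermark: dropped
}
```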
Future Directions and Experimentation
- There is ongoing experimentation to find a balance between the pure data flow and pure update table models.
- The goal is to develop a more efficient and flexible approach to handling data ordering and out-of-order events.
"There could be a way to actually have two cha..."
- The incomplete quote suggests ongoing exploration into hybrid models that combine elements of existing approaches for better handling of streaming data challenges.
Stream Processing and Data Ordering
- Stream processing involves handling in-order and out-of-order data streams.
- In-order data is processed efficiently, while out-of-order updates are handled separately.
- The approach aims to converge to a final true result while maintaining efficiency.
"The in-order output of that top-level operator is your provisional result... there's this side stream of out-of-order updates that you can choose to apply and amend the provisional result or ignore it."
- This quote explains the dual-stream approach where the in-order stream provides a provisional result, and the out-of-order stream provides updates that can be applied as needed.
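A toy sketch of that contract (the types and windowing are made up): one operator, two output kinds, with the consumer free to apply or ignore amendments.

```rust
use std::collections::HashMap;

// Hypothetical output contract for the dual-stream approach.
#[derive(Debug)]
enum Output {
    // In-order, provisional result: usable immediately downstream.
    Provisional { window: u64, count: u64 },
    // Out-of-order correction to an earlier window: apply or ignore.
    Amendment { window: u64, count: u64 },
}

// Toy windowed counter: late arrivals that touch an already-emitted
// window become amendments on the side stream instead of blocking.
struct Counter {
    counts: HashMap<u64, u64>,
    frontier: u64, // newest window emitted in order so far
}

impl Counter {
    fn push(&mut self, window: u64) -> Output {
        let c = self.counts.entry(window).or_insert(0);
        *c += 1;
        if window < self.frontier {
            Output::Amendment { window, count: *c }
        } else {
            self.frontier = window;
            Output::Provisional { window, count: *c }
        }
    }
}

fn main() {
    let mut op = Counter { counts: HashMap::new(), frontier: 0 };
    println!("{:?}", op.push(1)); // Provisional { window: 1, count: 1 }
    println!("{:?}", op.push(2)); // Provisional { window: 2, count: 1 }
    println!("{:?}", op.push(1)); // Amendment  { window: 1, count: 2 }
}
```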
Challenges in Unified Engines
- Unified engines must handle both streaming and batch processing.
- Specialized engines often overfit to their niche, lacking generalizable techniques for expression evaluation.
- Implementing general techniques can improve engine flexibility and efficiency.
"You have to have these generalizable approaches to finding ranges of expressions which you can use in your joint operators or elsewhere."
- The quote highlights the need for generalizable techniques in unified engines to handle a variety of expressions efficiently.
Order-Preserving Variants of Operators
- Order-preserving variants of operators are crucial for maintaining data order through processing stages.
- Efficient joins with data pruning are an example of order-preserving operations.
- General techniques can be used to implement these variants.
"Now you have a composable thing... you preserve the ordering which means that subsequent operator can also do the same thing."
- This quote emphasizes the importance of maintaining data order through composable operations, ensuring efficiency and correctness in processing.
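A toy sketch of an order-preserving join with pruning, under the simplifying assumption that a driver feeds both inputs in global timestamp order: because rates arrive sorted, only the newest one must be retained, and because transactions arrive sorted, the output inherits their order, which is what makes the operator composable.

```rust
// Toy order-preserving as-of join with pruning. Simplifying assumption:
// the caller feeds rates and transactions in global timestamp order.
struct AsofJoin {
    last_rate: Option<(u64, f64)>, // (ts, rate) of the newest rate seen
}

impl AsofJoin {
    // Rates arrive sorted by time, so only the latest one can ever match
    // a future transaction; everything older is pruned immediately.
    fn push_rate(&mut self, ts: u64, rate: f64) {
        self.last_rate = Some((ts, rate));
    }

    // Transactions also arrive sorted, so outputs are produced in the
    // same order, and downstream operators can rely on that ordering.
    fn push_txn(&mut self, ts: u64, amount: f64) -> Option<f64> {
        match self.last_rate {
            Some((rts, rate)) if rts <= ts => Some(amount * rate),
            _ => None, // no applicable rate yet
        }
    }
}

fn main() {
    let mut join = AsofJoin { last_rate: None };
    join.push_rate(1, 0.9);
    println!("{:?}", join.push_txn(2, 100.0)); // Some(90.0)
    join.push_rate(3, 1.1);
    println!("{:?}", join.push_txn(4, 100.0)); // ~Some(110.0)
}
```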
Integration of Existing Techniques
- The dream of a unified engine is achievable by integrating existing techniques from data systems literature.
- Apache DataFusion serves as a foundational library for building on these techniques.
- Synnada aims to build a unified engine on top of this foundation.
"The dream of this unified engine is possible... by digesting what are the best practices and techniques in the data system literature."
- The quote underscores the strategy of integrating existing techniques to achieve a unified engine, rather than inventing new ones from scratch.
AI Model Integration in Unified Engines
- Integrating AI models as stateful functions requires decoupling model design from compilation.
- Efficient integration involves compiling models into UDFs for use in query engines.
- The data format should support vectorization and GPU utilization for efficiency.
"Design your model once in a very comfortable way... you compile it into a UDF so that you can build this forecasting."
- This quote outlines the process of integrating AI models into query engines by decoupling design and compilation, facilitating efficient use in data processing.
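A hedged sketch of the "compile the model into a UDF" step using DataFusion's `create_udf` helper (its exact signature has shifted across DataFusion releases, so treat this as illustrative). `predict_linear` is a stand-in for a compiler-emitted model function, kept vectorized over Arrow arrays so the engine's columnar execution is preserved.

```rust
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Float64Array};
use datafusion::arrow::datatypes::DataType;
use datafusion::error::DataFusionError;
use datafusion::logical_expr::{create_udf, ColumnarValue, Volatility};
use datafusion::prelude::*;

// Stand-in for a compiler-emitted model: a vectorized function over
// Arrow arrays, so the engine keeps its columnar batch-at-a-time flow.
fn predict_linear(xs: &Float64Array) -> Float64Array {
    xs.iter().map(|x| x.map(|v| 2.0 * v + 1.0)).collect()
}

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    let udf = create_udf(
        "predict",
        vec![DataType::Float64],
        DataType::Float64, // some DataFusion versions expect Arc<DataType>
        Volatility::Immutable,
        Arc::new(|args: &[ColumnarValue]| {
            let ColumnarValue::Array(arr) = &args[0] else {
                return Err(DataFusionError::Execution("expected array".into()));
            };
            let xs = arr.as_any().downcast_ref::<Float64Array>().unwrap();
            Ok(ColumnarValue::Array(Arc::new(predict_linear(xs)) as ArrayRef))
        }),
    );
    ctx.register_udf(udf);

    // The "model" is now callable from plain SQL like any other function.
    ctx.sql("SELECT predict(column1) AS y FROM (VALUES (1.0), (2.0)) AS v")
        .await?
        .show()
        .await?;
    Ok(())
}
```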
Efficient Inference and Hyperparameter Management
- Efficient inference requires operators and plans to utilize available devices, avoiding data sloshing (shuffling data back and forth between devices).
- Determining when to update models is a challenge; it must be deterministic and automatable.
- Hyperparameter management is crucial, with options like AutoML being resource-intensive.
"At every data point is typically not realistic... it has to be automatable... how do you deal with that?"
- The quote highlights the challenges in model updating and hyperparameter management, emphasizing the need for deterministic and automated solutions.
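One way to make the "when to update" decision deterministic and automatable, as a toy sketch (the thresholds are made up): retrain when either a row budget is exhausted or a rolling error metric drifts past a bound.

```rust
// Toy deterministic retraining policy: identical inputs always yield
// identical retrain decisions, so the pipeline stays reproducible.
struct UpdatePolicy {
    rows_since_update: u64,
    max_rows: u64,      // retrain at least every N rows...
    err_sum: f64,
    err_threshold: f64, // ...or sooner, if the rolling mean error drifts
}

impl UpdatePolicy {
    fn observe(&mut self, abs_error: f64) -> bool {
        self.rows_since_update += 1;
        self.err_sum += abs_error;
        let mean_err = self.err_sum / self.rows_since_update as f64;
        let retrain =
            self.rows_since_update >= self.max_rows || mean_err > self.err_threshold;
        if retrain {
            self.rows_since_update = 0;
            self.err_sum = 0.0;
        }
        retrain
    }
}

fn main() {
    let mut policy = UpdatePolicy {
        rows_since_update: 0,
        max_rows: 1000,
        err_sum: 0.0,
        err_threshold: 5.0,
    };
    for (i, err) in [1.0, 2.0, 13.0].into_iter().enumerate() {
        println!("row {i}: retrain = {}", policy.observe(err));
    }
}
```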
Mithril Library for Model Compilation
- Mithril is a library created to separate model design from compilation.
- It acts as a compiler, supporting various frameworks and infrastructure formats.
- The library aims to bring logical vs. physical distinctions into machine learning.
"You can build machine learning models for different frameworks... it acts as a compiler."
- This quote describes Mithril's role in facilitating model integration by acting as a compiler, supporting multiple frameworks and formats.
Model Compilation and Deployment
- Discusses compiling models into Arrow-based functions with 16-bit precision for CPU deployment.
- Highlights the separation of concerns in model design, experimentation, and deployment.
- Emphasizes the ability to mix and match different models and bridge the gap between models and DataFusion.
"The idea is to have this as a compiler available to you, separate SE bring separation of concern into model design and experimentation and deployment."
- Discusses the importance of a compiler in separating concerns between model design and deployment.
"It should give you the ability to compose and mix and match different models, UDF generation, bridge the gap between models and data fusion."
- Highlights the flexibility in model composition and the integration of user-defined functions (UDFs) in data fusion.
Open Source Project and Community Involvement
- Introduces an open-source project with an Apache license, inviting community involvement.
- Describes the project's aim to bring best practices from query engines to machine learning.
"It is not an Apache project, maybe it will be in the future, I don't know, and it will be released open source very soon."
- Indicates the project's potential transition to an Apache project and its imminent open-source release.
"If you're interested in bringing the best practices from query engines to the world of machine learning, come join us."
- Encourages community participation in integrating query engine best practices into machine learning.
Unified Engine and Flexible Distribution
- Discusses the need for a unified engine with flexible distribution across heterogeneous clusters.
- Highlights the cost-reduction potential of cross-platform embeddable engines.
"A very good property to have would be flexible distribution like if we had a cross-platform embeddable engine."
- Stresses the importance of flexible distribution for cost reduction and efficiency.
"The task here is to write a distribution layer supporting a dynamic cluster of heterogeneous devices that have different capabilities."
- Describes the engineering challenge of creating a distribution layer for diverse device capabilities.
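A toy sketch of the scheduling problem behind that distribution layer, with hypothetical capability flags: tasks declare what they need (a GPU, a minimum amount of memory) and the layer picks a device that currently satisfies them.

```rust
// Toy capability-aware placement for a heterogeneous cluster.
#[derive(Debug)]
struct Device {
    name: &'static str,
    has_gpu: bool,
    free_mem_gb: u32,
}

struct Task {
    needs_gpu: bool,
    mem_gb: u32,
}

// Pick the first device satisfying the task's requirements; a real
// distribution layer would also track load, locality, and membership churn.
fn place<'a>(task: &Task, devices: &'a [Device]) -> Option<&'a Device> {
    devices.iter().find(|d| {
        (!task.needs_gpu || d.has_gpu) && d.free_mem_gb >= task.mem_gb
    })
}

fn main() {
    let cluster = [
        Device { name: "edge-arm", has_gpu: false, free_mem_gb: 2 },
        Device { name: "gpu-box", has_gpu: true, free_mem_gb: 32 },
    ];
    let inference = Task { needs_gpu: true, mem_gb: 8 };
    println!("{:?}", place(&inference, &cluster)); // Some(gpu-box)
}
```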
Observability and Error Handling
- Emphasizes the need for full observability and error handling in a unified engine.
- Discusses the integration of operational data and metrics for improved observability.
"A unified engine should collect and expose all operational data and metrics to offer full observability."
- Highlights the importance of operational data for comprehensive observability.
"You need to categorize, consider, and handle all error possibilities which goes back to like this stream and batch unification."
- Underlines the necessity of disciplined error handling in stream and batch processes.
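DataFusion already exposes per-operator runtime metrics, and one easy way to surface them is `EXPLAIN ANALYZE`, sketched below (the `events` table and file path are hypothetical).

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("events", "events.csv", CsvReadOptions::new())
        .await?;

    // EXPLAIN ANALYZE runs the query and annotates each operator in the
    // plan with runtime metrics (rows produced, elapsed compute time,
    // etc.), raw material for the observability described here.
    ctx.sql("EXPLAIN ANALYZE SELECT COUNT(*) FROM events")
        .await?
        .show()
        .await?;
    Ok(())
}
```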
DataFusion and Community Collaboration
- Describes the collaboration with DataFusion, which aims to do for data systems what LLVM did for compilers.
- Highlights the community's positive reception and collaboration on open-source contributions.
"The reason why we joined data Fusion is I somehow come across data Fusion Andrew Lamb's one of the very earliest talks."
- Explains the motivation behind joining Data Fusion inspired by Andrew Lamb's vision.
"It's a great vehicle to learn, it's a great vehicle to innovate because it gives you all these API hooks."
- Emphasizes the learning and innovation opportunities provided by DataFusion's API hooks.
Future Plans and Roadmap
- Outlines plans to release a workbench for prototyping data applications.
- Discusses the roadmap for building a composable, malleable, and lean infrastructure.
"We're going to release this thing called workbench, so it's going to be like a notion notebook in which you can prototype your data applications."
- Announces the upcoming release of a workbench for data application prototyping.
"What we want to build is composable, malleable, and lean infrastructure solving a lot of the problems at a fundamental lower level."
- Describes the vision of creating a flexible and efficient infrastructure for data processing.
- Highlights the importance of performance and extensibility in DataFusion development.
- Emphasizes the need for a simple yet powerful core and user-friendly interfaces.
"Performance, performance, performance. I think it's very important."
- Stresses the critical importance of performance in DataFusion-based systems.
"Keeping the DataFusion core simple but powerful and avoiding unnecessary API churn."
- Advocates for maintaining simplicity and power in the DataFusion core to enhance usability.