Processing big data effectively often requires multiple database engines, each specialized to a purpose. Databases that are very good at event-oriented real-time processing are likely not good at batch analytics against large volumes. Here’s a quick look at another of the Fast Data recipes from the ebook “Fast Data: Smart and at Scale,” which Ryan Betts and I authored.
Data arriving at high-velocity, ingest-oriented systems needs to be processed and captured into volume-oriented systems. In more advanced cases, reports, analytics, and predictive models generated from volume-oriented systems need to be communicated to velocity-oriented systems to support real-time applications. Real-time analytics from the velocity side need to be integrated into operational dashboards or downstream applications that process real-time alerts, alarms, insights, and trends.
In practice, this means that many big data applications sit on top of a platform of tools. Data and processing outputs move between all of these systems. Designing that dataflow, the processing pipeline that coordinates these different platform components, is key to solving many big data challenges.
Pattern: Use Streaming Transformations to Avoid ETL
New events being captured into a long-term repository often require transformation, filtering, or processing before they are available for reporting use cases. There are at least two approaches to running these transformations:
- All of the data can be landed to a long-term repository and then extracted, transformed, and re-loaded back in its final form.
- The transformations can be executed in a streaming fashion before the data reaches the long-term repository, as sketched below.
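
To make the second approach concrete, here is a minimal sketch of an in-stream transformation step, assuming a simple dictionary event schema and a hypothetical `write_batch` callable that appends to the long-term repository; a real pipeline would run this inside a stream processor or ingest tier.

```python
from typing import Optional

def transform(raw_event: dict) -> Optional[dict]:
    """Filter and reshape one event; return None to drop it before it lands."""
    if raw_event.get("status") == "heartbeat":       # filter out noise events
        return None
    return {
        "user_id": raw_event["user_id"],
        "action": raw_event["action"].lower(),        # normalize casing
    }

def stream_to_repository(raw_events, write_batch, batch_size=1000):
    """Apply transform() to each incoming event and flush batches downstream."""
    batch = []
    for raw in raw_events:
        event = transform(raw)
        if event is None:
            continue
        batch.append(event)
        if len(batch) >= batch_size:
            write_batch(batch)      # hypothetical writer: bulk-insert or append
            batch = []
    if batch:
        write_batch(batch)

# Toy usage: two raw events, one of which is filtered out in-stream.
raw = [
    {"user_id": 1, "action": "CLICK", "status": "ok"},
    {"user_id": 2, "action": "ping", "status": "heartbeat"},
]
stream_to_repository(raw, write_batch=print)
```

The point of the pattern is that data arrives in the repository already in its final, query-ready shape, so no separate extract-and-reload job is needed.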
Pattern: Connect Big Data Analytics to Real-Time Stream Processing
Real-time applications processing incoming events often require analytics from backend systems. This introduces a few important requirements. First, the velocity-oriented fast data application requires a data management system capable of holding the state generated by the batch system; second, this state needs to be updated regularly or replaced in full. There are a few common ways to manage the refresh cycle; the best tradeoff will depend on your specific application.
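
One common refresh strategy is sketched below under assumed names (`fetch_latest_model` stands in for whatever query or export the batch system provides): rebuild the batch-generated state off to the side on a timer, then swap the reference in a single step so event processing never reads a half-updated copy.

```python
import threading
import time

class RefreshableState:
    """Holds batch-produced lookup state and replaces it in full on a schedule."""
    def __init__(self, loader, refresh_seconds=300):
        self._loader = loader
        self._state = loader()                  # initial full load
        self._lock = threading.Lock()
        threading.Thread(target=self._refresh_loop,
                         args=(refresh_seconds,), daemon=True).start()

    def _refresh_loop(self, refresh_seconds):
        while True:
            time.sleep(refresh_seconds)
            new_state = self._loader()          # rebuild off to the side
            with self._lock:
                self._state = new_state         # swap the reference in one step

    def lookup(self, key, default=None):
        with self._lock:
            return self._state.get(key, default)

# Hypothetical stand-in for a query against the batch analytics system.
def fetch_latest_model():
    return {"user-42": 0.87, "user-99": 0.12}

scores = RefreshableState(fetch_latest_model, refresh_seconds=600)
print(scores.lookup("user-42"))
```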
Some applications can tolerate state that is refreshed piecemeal; others require the analytics data to be strictly consistent. For these, it is not enough for each record to be internally consistent: the set of records as a whole requires a consistency guarantee, so producing a correct result requires that the full data set be consistent. A reasonable approach to transferring this report data from the batch analytics system to the real-time system is to write the data to a shadow table. Once the shadow table is completely written, it can be atomically renamed, or swapped, with the main table that the application addresses. The application will see either only data from the previous version of the report or only data from the new version, never a mix of the two in a single query.
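
Here is a minimal sketch of the shadow-table swap, using SQLite only so the example is self-contained; the table and column names are hypothetical, and a production system would express the same steps in its own database’s dialect.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the explicit
# BEGIN/COMMIT below delimits exactly the statements that form the atomic swap.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE report (user_id TEXT, score REAL)")

def publish_report(rows):
    # 1. Build the new report off to the side; the application never reads it.
    conn.execute("DROP TABLE IF EXISTS report_shadow")
    conn.execute("CREATE TABLE report_shadow (user_id TEXT, score REAL)")
    conn.executemany("INSERT INTO report_shadow VALUES (?, ?)", rows)
    # 2. Swap the shadow table into place atomically: readers see either the
    #    old report or the new one, never a mix of the two.
    conn.execute("BEGIN")
    conn.execute("ALTER TABLE report RENAME TO report_old")
    conn.execute("ALTER TABLE report_shadow RENAME TO report")
    conn.execute("DROP TABLE report_old")
    conn.execute("COMMIT")

publish_report([("user-42", 0.91), ("user-99", 0.07)])
print(conn.execute("SELECT * FROM report").fetchall())
```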
Pattern: Use Loose Coupling to Improve Reliability
When connecting multiple systems, it is imperative that all systems have an independent fate. Any part of the pipeline should be able to fail while leaving other systems available and functional. If the batch back end is offline, the high-velocity front end should still be operating, and vice versa.
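
One simple way to get that independence is sketched below with an in-process bounded queue and a stub `write` function: buffer between the velocity tier and the batch writer so the front end keeps accepting events while the back end retries on its own schedule. A real deployment would usually put a durable queue or log between the two tiers rather than process memory.

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=10_000)   # stand-in for a durable queue or log

def ingest(event):
    """Front end: never blocks on the back end; sheds load if the buffer fills."""
    try:
        buffer.put_nowait(event)
    except queue.Full:
        pass  # count/alert on dropped events rather than failing the front end

def backend_writer(write, retry_seconds=5):
    """Back end consumer: retries on its own; its outages do not stop ingest."""
    while True:
        event = buffer.get()
        while True:
            try:
                write(event)                  # e.g., load into the repository
                break
            except ConnectionError:
                time.sleep(retry_seconds)     # back end offline; keep retrying

threading.Thread(target=backend_writer, args=(print,), daemon=True).start()
ingest({"user_id": 1, "action": "click"})
time.sleep(0.1)  # give the writer thread a moment in this toy example
```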
In every pipeline there is, by definition, a slowest component: a bottleneck. When designing, explicitly choose the component that will be your bottleneck. If many systems each have identical performance, a minor degradation to any one of them creates a new overall bottleneck, which is operationally painful. It is often better to make your most reliable component, or your most expensive resource, the bottleneck; overall you will achieve a more predictable level of reliability.
This was a quick overview of effective data pipeline recipes. If you are interested in this topic and would like to read more, download the ebook “Fast Data: Smart and at Scale.”