Congratulations! You’ve mastered machine learning and can now generate a model that will help your enterprise succeed. There’s just one problem. Or rather several, actually. Below we discuss the challenges you face getting an ML model into production and how Volt Active Data can help.
For the purposes of this article, let’s assume that your model’s job is to predict whether customers of your airline are going to have their flights delayed. Your most frequent ‘Platinum’ flyers spend lots of money with you and need to be kept happy, so the purpose of the model is to identify journeys that are ‘at risk’ and rebook them before the customer is inconvenienced. The system will send the customer a text message explaining the problem and the proposed solution, to which they reply ‘YES’ if they want it done. All of this sounds straightforward, and after much development you now have a model that predicts delays for such customers. But even though the model is done, putting it into production is quite another matter. In this series of blogs we’ll show how Volt Active Data can help, starting at the business problem level and drilling down to examples.
The challenges you’ll face as you try to get Machine Learning into production
The first challenge is that our model needs lots of data, from lots of different sources:
- Historical information about on-time performance is publicly available, but it only reports the outcome of a given flight, without any information explaining why it was late or early. It’s possible to generate a model using this alone, but its usefulness is limited.
- Airline Schedule Data appears to be static but isn’t, as not only can times change but the aircraft type and capacity can change as well. Most of these changes are on the day of travel.
- Customer booking and preference data. Sometimes customers make multiple separate bookings for different legs of what is actually the same trip.
- Industrial Action/Congestion/Equipment Failure/Weather – Sometimes predictable, frequently not.
- Crew time limits. Flight crews are prohibited from being on duty for more than a certain number of hours, which means that all delays will eventually affect the crew.
The second challenge is that all of these feeds operate at different time scales and sometimes lag behind reality. Historical data is published monthly, schedule data arrives in near real time, weather reports are behind reality by around 20 minutes, and so on.
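To make the lag problem concrete, here is a minimal sketch of a freshness check that could run before a journey is scored. The class name, the feed names and the per-feed lag limits are all assumptions made up for illustration; they are not part of any Volt Active Data API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: flags feeds whose newest data is older than we are willing to tolerate. */
final class FeedFreshness {

    /**
     * Returns the names of feeds that are too old to trust, in alphabetical order.
     * A feed with no configured limit is treated as stale, to be on the safe side.
     */
    static List<String> staleFeeds(Map<String, Instant> lastObserved,
                                   Map<String, Duration> maxLag,
                                   Instant now) {
        List<String> stale = new ArrayList<>();
        // Copy into a TreeMap so the result order is deterministic.
        for (Map.Entry<String, Instant> feed : new TreeMap<>(lastObserved).entrySet()) {
            Duration allowed = maxLag.get(feed.getKey());
            Duration age = Duration.between(feed.getValue(), now);
            if (allowed == null || age.compareTo(allowed) > 0) {
                stale.add(feed.getKey());
            }
        }
        return stale;
    }
}
```

A caller might tolerate a 30 minute lag on weather but only a minute or so on schedule data, and hold back a rebooking offer while any feed it depends on is stale.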
The third challenge, and the biggest in this case, is that our user’s experience is driven by chain reactions of consequences, not a single score in a single pass of the model. If I am flying from San Francisco, CA to Nantucket, MA, I might actually have three legs: SFO to Charlotte, Charlotte to Boston, and Boston to Nantucket. What’s obvious is that failing to leave SFO on time will create a risk of not getting to Nantucket. What’s not obvious is that in this case the plane that’s supposed to go from Charlotte to Boston started its day in Newark and is currently stuck on the ground in Tampa due to a thunderstorm. So the weather in Florida could prevent us from reaching our destination on time, even though our journey has nothing to do with the Sunshine State.
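The chain reaction can be sketched as a walk back along the aircraft's earlier legs. This is a toy model, not the article's actual algorithm: it assumes each leg knows its inbound leg, its own local delay, and how many minutes of turnaround slack can absorb a late arrival. The class, the field names, the exact rotation and every number used with it are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: propagates upstream aircraft delays onto a later leg. */
final class DelayPropagation {

    /** One leg of an aircraft's day; the fields are assumptions, not a real schema. */
    static final class Leg {
        final String flightId;
        final String inboundFlightId; // previous leg flown by the same aircraft, or null
        final int localDelayMinutes;  // delay caused at this leg's own departure airport
        final int slackMinutes;       // turnaround buffer that can absorb inbound delay

        Leg(String flightId, String inboundFlightId, int localDelayMinutes, int slackMinutes) {
            this.flightId = flightId;
            this.inboundFlightId = inboundFlightId;
            this.localDelayMinutes = localDelayMinutes;
            this.slackMinutes = slackMinutes;
        }
    }

    /** Expected delay of a leg: its local delay plus whatever inbound delay the slack cannot absorb. */
    static int expectedDelay(Map<String, Leg> legsById, String flightId) {
        return expectedDelay(legsById, flightId, new HashSet<>());
    }

    private static int expectedDelay(Map<String, Leg> legsById, String flightId, Set<String> visited) {
        Leg leg = legsById.get(flightId);
        if (leg == null || !visited.add(flightId)) {
            return 0; // unknown leg, or a cycle in bad data: contribute no delay
        }
        int inbound = leg.inboundFlightId == null
                ? 0
                : expectedDelay(legsById, leg.inboundFlightId, visited);
        int inherited = Math.max(0, inbound - leg.slackMinutes);
        return leg.localDelayMinutes + inherited;
    }
}
```

If we assume the aircraft flies Newark to Tampa, Tampa to Charlotte and then Charlotte to Boston, a 90 minute thunderstorm hold in Tampa with 30 minutes of slack in Charlotte leaves the Charlotte to Boston leg an hour late, even though nothing is wrong in Charlotte itself.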
Last but not least, while the model may be stateless, our customers are not, and our interactions with them are transactional in nature: we don’t want to bombard them with multiple text messages, and once they agree to be rerouted the data about their journey will change, leading to another round of recalculations.
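A minimal sketch of that per-customer state, assuming three states per journey and treating a trimmed, case-insensitive ‘YES’ as acceptance. Here an in-memory map guarded by `synchronized` stands in for the database; in a real deployment the check and the update would need to happen in one transaction so that two scoring passes cannot both decide to send a text. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: makes sure each at-risk journey gets at most one outstanding offer. */
final class RebookingOffers {

    enum State { NONE, OFFER_SENT, ACCEPTED }

    private final Map<String, State> stateByJourney = new HashMap<>();

    /** Returns true only if a text message should be sent now. */
    synchronized boolean tryOffer(String journeyId) {
        State current = stateByJourney.getOrDefault(journeyId, State.NONE);
        if (current != State.NONE) {
            return false; // already offered or already accepted: stay quiet
        }
        stateByJourney.put(journeyId, State.OFFER_SENT);
        return true;
    }

    /** Records the customer's reply; returns true if the journey should now be rebooked. */
    synchronized boolean accept(String journeyId, String reply) {
        boolean offerPending = stateByJourney.get(journeyId) == State.OFFER_SENT;
        boolean saidYes = reply != null && "YES".equalsIgnoreCase(reply.trim());
        if (!offerPending || !saidYes) {
            return false;
        }
        stateByJourney.put(journeyId, State.ACCEPTED);
        return true;
    }

    /** Call once the new itinerary exists, so that it can be scored and offered afresh. */
    synchronized void rebooked(String journeyId) {
        stateByJourney.remove(journeyId);
    }
}
```

However many times the model flags the same journey, `tryOffer` returns true once, and a duplicate or late ‘YES’ cannot trigger a second rebooking.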
The bottom line is that even though we may have an algorithm that works, using it at scale and in real time is far from trivial. Here at Volt Active Data we’ve seen people try to create real time recommendation engines using open source stacks, but the common problem (other than complexity) is that while all the individual components are fast, when you glue them together the resulting stack is slow. In the scenario we describe above, Volt Active Data allows you to maintain all the reference data you need to feed your engine in RAM and make sure it follows the normal rules of a relational database. It also allows you to keep track of each individual user’s state in an ACID compliant way.
Integrating Machine Learning platforms with Volt Active Data
Volt Active Data has a C++ core. You interact with this core using Java classes that run on the same host and implement a specific Java interface. Clients send messages to Volt Active Data; each message is routed to the relevant stored procedure class, which then uses a JNI bridge to send SQL to the C++ core.
To integrate a Machine Learning engine with Volt Active Data, it needs to be standalone and have a Java (or JVM) runtime. It’s also possible to speak to a C++ runtime using JNI, but that’s outside the scope of this article. In the next two articles I’ll show practical examples of how to integrate Volt Active Data with H2O.ai and JPMML.