In the previous articles, Getting Machine Learning into Production and Using Volt Active Data to Get H20 into Production, I discussed the broad problems you face when you try to commercialize machine learning and showed how to integrate h2o. In this article I’ll show how you can use PMML inside Volt Active Data.
There are two ways to use JPMML inside Volt Active Data.
- Instantiated from inside a stored procedure.
- As part of a user defined function.
JPMML is tricker to use than H20, for a number of reasons:
- Creating the JPMML engine can take anything up to 700 milliseconds. Given that in the Volt Active Data universe a millisecond is a long time we therefore need to be careful that we aren’t instantiating JPPML each time.
- In H20 we had a customer-generated Java POJO that implemented the model. In JPMML we have to feed an XML definition into an engine. Given that a Volt Active Data cluster can have anything up to 30 nodes or more making sure that the right XML is in the right place at the right time is non-trivial.
- If your business use case requires that you access two models H20 will give you two fundamentally different POJOS; In JPMML you’ll have two instances of the same class, with different properties because they were fed different XML.
To solve these problems I used Apache Commons Pools to create a pool of JPMML instances of different kids. You probably don’t need to do this. If you do the code is here. You will probably find Volt Active DataJPMMLWrangler useful, as it has a whole series of helper methods to do things like translate JPPML data types to Volt Active Data and vice-versa.
Integrating JPMML into Volt Active Data stored procedures
An example stored procedure is below. Note that to keep this as simple as possible it doesn’t actually interact with the database tables. A real-world deployment would presumably only accept Primary Key information (‘id’), retrieve the record and feed it into JPMML.
package jppml.generated; package jppml; import java.io.File; import org.voltdb.VoltProcedure; import org.voltdb.VoltTable; import org.voltdb.jpmml.JPMMLImpl; import org.voltdb.jpmml.JPMMLImplPool; import org.voltdb.jpmml.Volt Active DataJPMMLWrangler; public class AuditTreeStoredProcNoPool extends VoltProcedure { final String modelName = "AuditTree.xml"; final String modelDir = "/foo/models"; VoltTable pmmlOut = null; JPMMLImpl i = null; Volt Active DataJPMMLWrangler w = null; public VoltTable run(int id, double age, String employment, String education, String marital, String occupation, double income, String gender, double deductions, double hours) { try { // Only incur the cost of creating the model if we // have to... if (w == null) { w = new Volt Active DataJPMMLWrangler(new File(modelDir + File.separator + modelName)); } // Call a helpful method to give us a parameter object to fill in. // IRL this VoltTable could also be the result of a Volt SQL Query. VoltTable paramtable = w.getEmptyTable(modelName); // Load parameters. Note table is only suppose to have 1 row. paramtable.addRow(age, employment, education, marital, occupation, income, gender, deductions, hours); // Run model and use wrangler method to convert output to Volt Active Data format. pmmlOut = w.runModel(modelName, paramtable); // Return results. Note that in this example we never touched the database tables. return pmmlOut; } catch (Exception e) { System.err.println(e.getMessage()); e.printStackTrace(); return null; } } }
Table Of Contents
Creating a new SQL function in Volt Active Data that uses JPMML
Volt Active Data allows you to create new SQL functions from Java classes. There’s no reason it can’t call JPMML. In the example below we define such a function in Java. it uses the pool utility we mentioned earlier to gain access to a PMML engine, and then returns the first column of the first result row as a Java String.
The Java code is here.
package jppml.generated; import org.voltdb.VoltTable; import org.voltdb.jpmml.JPMMLImpl; import org.voltdb.jpmml.JPMMLImplPool; import org.voltdb.jpmml.Volt Active DataJPMMLWrangler; public class AuditTreeProcedure { public String auditTree(double age, String employment, String education, String marital, String occupation, double income, String gender, double deductions, double hours) { final String modelName = "AuditTree.xml"; VoltTable pmmlOut = null; JPMMLImpl i = null; Volt Active DataJPMMLWrangler w = null; try { i = JPMMLImpl.getInstance(); JPMMLImplPool p = i.getPool(); w = p.borrowObject(); VoltTable paramtable = w.getEmptyTable(modelName); paramtable.addRow(age, employment, education, marital, occupation, income, gender, deductions, hours); pmmlOut = w.runModel(modelName, paramtable); return pmmlOut.getString(0); } catch (Exception e) { System.err.println(e.getMessage()); e.printStackTrace(); return null; } finally { if (i != null && w != null) { i.getPool().returnObject(w); } } } }
To make it usable we need to load it into the database and then create it as a function using SQL:
CREATE FUNCTION auditTree FROM METHOD ppml.generated.AuditTreeProcedure.auditTree;
We can then access it via SQL, either directly or by creating a wrapper procedure around it:
CREATE PROCEDURE auditId PARTITION ON TABLE auditdata COLUMN id AS select id, age, employment, education, marital, occupation, income, gender, deductions, hours , auditTree(age, employment, education, marital, occupation, income, gender, deductions, hours) from auditdata where id = ? order by id;