Home < Blog < Using UDF for Machine Learning

Using UDF for Machine Learning

6 min read

We added support for scalar user-defined functions in our recent release, Volt Active Data V7.6. With this feature, users can write their own SQL functions in Java and load them into Volt Active Data. Those functions, once loaded, can be used in all kinds of SQL queries just like any other built-in SQL functions. User-defined functions can give you great flexibility in customizing your business logic inside the database. In this blog post, we will use a specific example to show how user-defined functions can benefit machine learning applications.

For example, classification is a fundamental problem in machine learning and statistics. Given an input data set, a classification model can make a prediction and assign each data sample to one of several predefined classes. Commonly used classification models include logistic regression, naive Bayesian classifier, support vector machines, decision trees, neural networks, etc. Here, we will choose the naive Bayesian classifier and the k-nearest neighbors algorithm as two examples to illustrate how user-defined functions can simplify business applications.

The data set we are going to use in our examples is the iris data set. The iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher. It contains in total 150 data samples selected evenly from three species of iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured for each sample: the length and the width of both the sepals and petals, in centimeters.

The original data set comes in the CSV format. To store it in the database, we first define the iris table as follows:

Table iris includes four columns for different feature values, and a target class column which records the real species of each sample. To load the data, Volt Active Data provides a command line utility, CSVLoader, which can import the contents of a CSV file and insert it into a Volt Active Data table.

A typical machine learning application involves two parts: learning and scoring. Volt Active Data strives for providing solid support for fast data applications, where the fast incoming data stream is summarized and processed in real-time. That is, Volt Active Data can apply machine learning models on the incoming data stream for real-time decision making with high throughput and low latency. Building the models on the historical data, however, is normally done on a back-end large-scale big data analytics platform. Volt Active Data can export the live transaction data into those platforms for incremental model update, and sync up with the back-end system to get the updated model for future data processing.

In our example, we include the learning part for completeness. So we first build the classification models (the naive Bayesian classifier and the k-nearest neighbor classifier) from the iris data set. The naive Bayesian classifier is a simple probabilistic classifier based on applying the Bayes’ theorem with strong (naive) independence assumptions between the features. While the k-nearest neighbor classifier classifies the samples based on the target class of their nearest neighbors. The technical discussion about how those models can be computed from the training data set is out of the scope of this blog post. We encourage the audience to refer to numerous resources that are available on the Internet to get more familiarized with those models if you are interested. Here, we create a Java stored procedure in Volt Active Data to scan the input table and learn the models leveraging the algorithm implementations provided by the Java-ML library.

Please note that the input data table in this example is replicated, i.e., no partitioning column is defined and each partition will contain a full copy of the table data. This is only for the simplicity of the example because every partition will run the same learning method on the same data, no model synchronization between partitions is required. The data set can be partitioned in order to achieve maximum performance if Volt Active Data is only used for real-time decision making, i.e., data scoring.

The code for the Volt Active Data Java stored procedure to train the k-nearest neighbor classification model is shown below (the code for the naive Bayesian classifier is almost the same):

The classifier knnc is a static member of the IrisKNNClassifier. When the stored procedure is run on all partitions, it learns the model and stores it with the class. In a real-world use case, this procedure should get its model from an external analytics system.

To apply the model, we create a Java method in the same class as the stored procedure method in order to access the model data. The code for the Java method is shown below:

The Java method takes four double type parameters. Volt Active Data supports both boxed types and primitive types in the UDF method. it automatically maps the Java types to the Volt Active Data SQL data types. For more information, see our documentation. Basically, what this function does is it gets the data that is passed in as parameters, applies the model, and returns the class prediction.

The compilation process of a UDF is the same as that of a stored procedure. It is also perfectly fine to define a stored procedure and UDFs together. After the JAR file is compiled, we can easily load the procedures and the UDFs by executing the following commands:

LOAD CLASSES VoltMachineLearning.jar;
CREATE PROCEDURE FROM CLASS IrisNBClassifier;
CREATE PROCEDURE FROM CLASS IrisKNNClassifier;
CREATE FUNCTION nb_classify FROM METHOD IrisNBClassifier.classify;
CREATE FUNCTION knn_classify FROM METHOD IrisKNNClassifier.classify;

Please note that in order to run with the Java-ML library, we need to copy the related JAR files for Java-ML to /lib/extension folder where Volt Active Data is installed, so Volt Active Data can load them at startup.

To build the classification models, we run the stored procedures:

EXEC IrisNBClassifier;
EXEC IrisKNNClassifier;

We can then apply the models on the same data set to see how accurate their predictions are:

The result of the query above is:

NB_CORRECTKNN_CORRECTTOTAL
77145150

In our example, out of 150 samples, the k-nearest classifier got 145 correct classifications (96.6%), while the naive Bayesian classifier only got 77 correct (51.3%).

In this example, we show the potential of Java user-defined functions with a brief use case. User-defined functions unlock the full power and flexibility of the java programming language and environment. With tons of external libraries that UDFs can link to, there is a whole bunch of things you can do inside the database. The source code for this example is available on GitHub. We hope you find the feature helpful in designing your fast data application, and please let us know what you think!