Deleting records is often an afterthought when you’re first building a proof of concept with Volt Active Data, but it is an important function that deserves deliberate attention in any design. Few applications can retain data forever, and in-memory databases in particular require careful planning for how much data should be stored in RAM.
Before data is deleted, you want to make sure it’s already been copied to another data store. In Volt Active Data this can be done continuously using the Export feature. Normally, data is exported downstream shortly after it arrives, rather than just prior to deletion, so the export and deletion processes are often separate.
A common pitfall to avoid is a batch deletion process that deletes millions of records in a single SQL statement. Just as you would avoid running a long SQL query that selects millions of records at a time, you should not try to delete millions of records at once. Not only do these records need to be scanned and deleted, but the process must be reversible if an error occurs, so the data is copied into an in-memory UNDO log that is kept for the duration of the transaction. This is normally a very lightweight operation for discrete OLTP transactions, but for a transaction involving many records it can take much longer. Then, once the data is deleted, the memory blocks that held it may become sparse enough to trigger a compaction process that consolidates data into fewer blocks. Together these processes can produce a long-running delete transaction that makes other requests wait longer than expected.
Automating Deletion
The following best practices should be considered when designing a deletion process to ensure it does not impact the performance of other workloads on the database.
- Limit the number of records to be deleted in a single transaction
For this you want to check how many records match the criteria before using those criteria to delete. For example, if you are deleting all records prior to a given timestamp, you might use the following query to check:
SELECT COUNT(*) FROM timedata WHERE update_ts <= ?;
Then, if the count is high, you could find a different cutoff timestamp that affects fewer records. The offset parameter in the following query is the number of records you want to delete at a time. Use a size between 100 and 1000, but test to find the best size for your record size:
SELECT update_ts FROM timedata ORDER BY update_ts ASC OFFSET ? LIMIT 1;
Then delete this limited chunk of records:
DELETE FROM timedata WHERE update_ts <= ?;
This process can be repeated in a loop, running periodically or continuously, until the original target timestamp has been reached.
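Putting these steps together, the chunked delete can live in a single stored procedure. Below is a minimal sketch, assuming the table is partitioned on a column whose value is passed in as partitionKey; the class name, signature, and chunking details are illustrative assumptions, and the windowing example's DeleteAfterDate.java is the authoritative version:

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;
import org.voltdb.types.TimestampType;

// Illustrative chunked delete-by-timestamp procedure, loosely modeled on
// the windowing example's DeleteAfterDate.java.
public class DeleteChunkAfterDate extends VoltProcedure {

    final SQLStmt countMatching = new SQLStmt(
        "SELECT COUNT(*) FROM timedata WHERE update_ts <= ?;");
    final SQLStmt findCutoff = new SQLStmt(
        "SELECT update_ts FROM timedata ORDER BY update_ts ASC OFFSET ? LIMIT 1;");
    final SQLStmt deleteOlderThan = new SQLStmt(
        "DELETE FROM timedata WHERE update_ts <= ?;");

    // partitionKey only routes the transaction to one partition;
    // it is not referenced by any of the queries.
    public long run(long partitionKey, TimestampType targetTs, long maxRowsToDelete) {
        voltQueueSQL(countMatching, targetTs);
        long matching = voltExecuteSQL()[0].asScalarLong();

        TimestampType cutoff = targetTs;
        if (matching > maxRowsToDelete) {
            // Too many rows match: back off to the timestamp of the
            // maxRowsToDelete-th oldest row so this chunk stays bounded
            // (ties on update_ts can push it slightly over the limit).
            voltQueueSQL(findCutoff, maxRowsToDelete - 1);
            VoltTable t = voltExecuteSQL()[0];
            t.advanceRow();
            cutoff = t.getTimestampAsTimestamp(0);
        }

        voltQueueSQL(deleteOlderThan, cutoff);
        return voltExecuteSQL(true)[0].asScalarLong(); // rows deleted
    }
}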
Alternatively, if you want to delete records that exceed a target number of records to retain, you first check whether the count exceeds the target:
SELECT COUNT(*) FROM timedata;
If there are more than the target number of records present, subtract the target from the count to see how many records should be deleted. If that difference is greater than the maximum number of records to delete at a time (say 100, or perhaps up to 1000), use the maximum instead as the offset for this query to get the appropriate cutoff timestamp:
SELECT update_ts FROM timedata ORDER BY update_ts ASC OFFSET ? LIMIT 1;
Then you can delete just those rows:
DELETE FROM timedata WHERE update_ts <= ?;
These example queries are taken from the windowing example application, which is provided with Volt Active Data. This example also shows how to accomplish the other best practices for deletion.
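In procedure form, the count-based variant differs only in how the cutoff is chosen. Here is a minimal sketch of the run() method, assuming a countAll statement (SELECT COUNT(*) FROM timedata;) alongside SQLStmt fields like those in the previous sketch; the shipped DeleteOldestToTarget.java is the authoritative version:

// Illustrative run() body for a delete-to-target-count procedure.
// When run as a partitioned procedure, the count and target apply
// to the records in this one partition.
public long run(long partitionKey, long targetRowCount, long maxRowsToDelete) {
    voltQueueSQL(countAll);
    long total = voltExecuteSQL()[0].asScalarLong();

    long excess = total - targetRowCount;
    if (excess <= 0) {
        return 0; // already at or below the target, nothing to delete
    }

    // Never delete more than the per-transaction maximum.
    long rowsToDelete = Math.min(excess, maxRowsToDelete);

    // Find the timestamp of the rowsToDelete-th oldest row...
    voltQueueSQL(findCutoff, rowsToDelete - 1);
    VoltTable t = voltExecuteSQL()[0];
    t.advanceRow();
    TimestampType cutoff = t.getTimestampAsTimestamp(0);

    // ...and delete everything at or before it.
    voltQueueSQL(deleteOlderThan, cutoff);
    return voltExecuteSQL(true)[0].asScalarLong();
}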
- If possible, delete records using a single-partition transaction
This can still be done in cases where the partitioning key column is not part of the criteria for the delete query. The key is to put the queries above in a Java stored procedure that is partitioned on the table. Notice that the partitioning column is not part of the query criteria; that is fine, because as long as a partitioning parameter is passed in when calling the procedure, the procedure will run on one partition and will perform the deletion on the subset of records present in that partition. To iterate through all of the partitions in the database, use the “Run Everywhere” pattern: call the @GetPartitionKeys system procedure to get a set of partitioning values, then use them to call the procedure once in each partition. There is a basic tutorial here, but there is also an example in the windowing example application in the ContinuousDeleter class.
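As a client-side sketch of this pattern, the loop below fetches one partitioning value per partition from @GetPartitionKeys and invokes a partitioned delete procedure for each one; the procedure name and its extra parameters are assumptions carried over from the earlier sketch:

import org.voltdb.VoltTable;
import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.types.TimestampType;

// Illustrative Run Everywhere loop: one single-partition call per partition.
public class RunEverywhereDelete {
    public static void main(String[] args) throws Exception {
        Client client = ClientFactory.createClient();
        client.createConnection("localhost");

        // Hypothetical cutoff: discard anything older than one hour
        // (TimestampType takes microseconds since the epoch).
        TimestampType targetTs = new TimestampType(
            (System.currentTimeMillis() - 60L * 60L * 1000L) * 1000L);

        // One representative partitioning value per partition.
        VoltTable keys = client.callProcedure("@GetPartitionKeys", "INTEGER")
                               .getResults()[0];
        while (keys.advanceRow()) {
            long partitionKey = keys.getLong("PARTITION_KEY");
            // Hypothetical procedure from the earlier sketch; each call runs
            // as a single-partition transaction against one partition.
            client.callProcedure("DeleteChunkAfterDate", partitionKey, targetTs, 500);
        }
        client.close();
    }
}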
- Delete continuously, or consider integrating the deletion and ingestion processes
You could run the delete process from a thread that runs in the background, as with the ContinuousDeleter class, or you could run it as a periodic batch process.
Another option is to delete as you go by integrating the deletion process into the same procedures that are used for inserting new data. This has the benefit of not needing any additional client code. The windowing example application does it both ways, with a command-line parameter to control the configuration. Below are the procedures that perform the deletions, by timestamp or count, deleting alone or deleting with ingestion.
InsertAndDeleteAfterDate.java – This procedure inserts one new record and deletes records that are older than a given timestamp. If too many records qualify, it finds a nearer cut-off point so that it deletes no more than a given maximum number of records, and the next transaction can take it further.
InsertAndDeleteOldestToTarget.java – This procedure inserts a new record and deletes a limited number of records that exceed a target number of records per partition.
DeleteAfterDate.java and DeleteOldestToTarget.java – These are similar to the first two, except they only delete, so they must be called separately. In the benchmark client code, they are driven by the ContinuousDeleter.java class which runs as a thread and just continuously calls one of these procedures at a governed rate.
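For completeness, here is a hedged sketch of how the combined insert-and-delete variant might be structured, folding one insert into the front of the chunked delete shown earlier. The insertRow statement and the uuid, val, and update_ts columns are assumptions; see InsertAndDeleteAfterDate.java for the real implementation:

// Illustrative run() body combining ingestion with a bounded delete,
// loosely modeled on InsertAndDeleteAfterDate.java. Assumes an insertRow
// SQLStmt plus the statements from the first sketch; here the table is
// assumed to be partitioned on uuid, so no separate routing parameter
// is needed.
public long run(String uuid, long val, TimestampType updateTs,
                TimestampType newestToDiscard, long maxRowsToDelete) {
    // Ingest the new record and count deletion candidates in one batch.
    voltQueueSQL(insertRow, uuid, val, updateTs);
    voltQueueSQL(countMatching, newestToDiscard);
    long matching = voltExecuteSQL()[1].asScalarLong();

    TimestampType cutoff = newestToDiscard;
    if (matching > maxRowsToDelete) {
        // Bound the work: back off to a nearer cutoff, as before.
        voltQueueSQL(findCutoff, maxRowsToDelete - 1);
        VoltTable t = voltExecuteSQL()[0];
        t.advanceRow();
        cutoff = t.getTimestampAsTimestamp(0);
    }

    voltQueueSQL(deleteOlderThan, cutoff);
    return voltExecuteSQL(true)[0].asScalarLong(); // rows deleted this time
}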