My name is Yetian Xia. I worked as a software engineering intern on the core team at Volt Active Data this summer. My favorite project was optimizing the performance of deleting entries from a low-cardinality, non-unique tree index.
An index is used to speed up row lookups and is heavily used in almost all database workloads. Logically, it can be regarded as a map in which the key is a search key generated from some columns of a row and the value is the address of that row. Volt Active Data has two types of indexes: hash indexes and balanced tree indexes. This architecture works well most of the time, but in rare cases Volt Active Data can be slow when deleting rows from tables. These cases all involve low-cardinality indexes on some tables, meaning the indexes have few unique keys.
A low-cardinality index is often cited as a cause of bad performance. When deleting a row from a non-unique index, the Volt Active Data engine generates the key from the row and then iterates over all entries sharing that key to find the exact entry for the row, because the key contains no information about the row's address. If the index has low cardinality, this linear search over duplicates is comparable to a full table scan, making deletion slow. In short, the worst case for deleting a tuple entry from a low-cardinality tree index is O(n).
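To make the cost concrete, here is a minimal sketch of the old deletion path, using `std::multimap` as a stand-in for the engine's non-unique tree index. The names (`TupleAddress`, `eraseEntry`) are illustrative, not VoltDB source:

```cpp
#include <cstdint>
#include <map>

using TupleAddress = const void*;

// Hypothetical non-unique index: many entries share the same key.
using NonUniqueIndex = std::multimap<int32_t, TupleAddress>;

// O(log n) to find the first matching key, then O(k) to scan the k
// duplicates; with low cardinality, k approaches n.
bool eraseEntry(NonUniqueIndex& index, int32_t key, TupleAddress addr) {
    auto range = index.equal_range(key);
    for (auto it = range.first; it != range.second; ++it) {
        if (it->second == addr) {   // linear search for the exact tuple
            index.erase(it);
            return true;
        }
    }
    return false;                   // tuple not present under this key
}
```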
The method we used to resolve this problem was to create a new key type by appending the tuple's address to the original key, which makes every key unique. Finding a specific tuple then reduces to searching for a single node in a balanced tree, which takes only O(log n) time.
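Here is a sketch of the composite-key idea under the same illustrative setup; `CompositeKey` and `UniqueIndex` are my own names for the example, not the engine's actual classes:

```cpp
#include <cstdint>
#include <set>

struct CompositeKey {
    int32_t key;          // original search key (e.g., the AGE column)
    uintptr_t address;    // tuple address appended as a tiebreaker

    bool operator<(const CompositeKey& other) const {
        if (key != other.key) return key < other.key;
        return address < other.address;  // duplicates ordered by address
    }
};

// The tree now stores unique composite keys; no separate value field is
// needed because the address lives inside the key itself.
using UniqueIndex = std::set<CompositeKey>;

// Deletion is a single balanced-tree lookup: O(log n).
bool eraseEntry(UniqueIndex& index, int32_t key, uintptr_t addr) {
    return index.erase(CompositeKey{key, addr}) > 0;
}
```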
Two more items needed extra work. First, the value field was no longer necessary for an entry, since the address was already stored in the new key. Rather than waste 64 bits per row, I refactored the tree node implementation to eliminate the field. Second, several search functions of the index had to be slightly modified. Looking up a tuple by the new (key, address) scheme is efficient for deletion, but it is not correct when you want to find all tuples that share the same key; in those cases, index lookups must compare only the key portion.
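Continuing the sketch above, a key-only lookup can seek to the smallest possible composite key for the search key and walk forward while the key portion still matches; again, the names are illustrative:

```cpp
#include <cstdint>
#include <set>
#include <vector>

struct CompositeKey {                // as in the previous sketch
    int32_t key;
    uintptr_t address;
    bool operator<(const CompositeKey& other) const {
        if (key != other.key) return key < other.key;
        return address < other.address;
    }
};
using UniqueIndex = std::set<CompositeKey>;

// Find every tuple sharing a search key, ignoring the address portion.
std::vector<uintptr_t> findAllWithKey(const UniqueIndex& index, int32_t key) {
    std::vector<uintptr_t> result;
    // Start at the smallest composite key with this search key...
    auto it = index.lower_bound(CompositeKey{key, 0});
    // ...and collect entries while only the key portion matches.
    for (; it != index.end() && it->key == key; ++it) {
        result.push_back(it->address);
    }
    return result;
}
```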
Also worth mentioning: we used C++ templates in the index. This reduced the number of lines of code, making the code simpler and easier to read and maintain. It also moved type dispatch from runtime to compile time, boosting performance.
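As a rough illustration of the template point (a sketch, not the engine's actual class), parameterizing the tree on the key type lets the compiler inline the comparator instead of dispatching through a virtual call at every comparison:

```cpp
#include <set>

template <typename KeyType>
class TreeIndex {
public:
    // KeyType::operator< is resolved and inlined at compile time; the
    // same source serves plain keys and composite (key, address) keys.
    void insert(const KeyType& k) { m_tree.insert(k); }
    bool erase(const KeyType& k)  { return m_tree.erase(k) > 0; }
private:
    std::set<KeyType> m_tree;
};

// Usage with the composite key from the earlier sketch:
//   TreeIndex<CompositeKey> index;
//   index.insert(CompositeKey{5, someAddress});
```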
Below is an example of the performance improvement on my own desktop: an 8-core Intel i7, 12 GB of RAM, and Ubuntu 14.04.
The DDL I used was:

```sql
CREATE TABLE P1 (
    ID BIGINT NOT NULL UNIQUE,
    AGE INTEGER,
    NUM INTEGER,
    GENDER TINYINT,
    PRIMARY KEY (ID)
);
CREATE INDEX idx_AGE_TREE ON P1(AGE);
CREATE INDEX idx_NUM_TREE ON P1(NUM);
PARTITION TABLE P1 ON COLUMN ID;
```
AGE was evenly distributed between 5 and 99, NUM was evenly distributed between 0 and 9, and GENDER was a random binary variable. We then executed the SQL query:

```sql
DELETE FROM P1 WHERE AGE = 5;
```
This query deletes roughly 1% of the rows in table P1. The results are shown in the table below.
| # of rows | old (delete 1%) | new (delete 1%) |
|-----------|-----------------|-----------------|
| 1M        | 0.77s           | 0.08s           |
| 4M        | 12.00s          | 0.12s           |
| 16M       | 189.62s         | 0.23s           |
The new method is not only faster but also scales far better as the table grows. If we had defined an index on the binary column GENDER, the performance improvement would have been even greater.