TreeCache: a Tree Structured Replicated Transactional Cache
Bela Ban, Ben Wang, Nov 2003
This document and a companion document describe the TreeCache, a tree-structured replicated transactional cache, and the TreeCacheAop, an AOP-enabled subclass of TreeCache that allows Plain Old Java Objects (POJOs) to be inserted and replicated transactionally between nodes in a cluster. The TreeCache is configurable: aspects of the system such as replication mode (async, sync, or none), transaction isolation level, and transaction manager can all be configured.
TreeCache can also be used independently, but TreeCacheAop requires both TreeCache and the JBoss 4.0 AOP standalone subsystem.
The cache is structured as a tree of nodes. Each node has a name and zero or more children. A node can have only one parent; there is currently no support for graphs. A node can be reached by navigating from the root recursively through children until the node is found. It can also be accessed by giving a fully qualified name (FQN), which consists of the concatenation of all node names from the root down to that particular node.
A TreeCache can have multiple roots, allowing for a number of different trees to be present in the cache. Note that a one-level tree is essentially a hashtable. Each node in the tree has a hashmap of keys and values. For a replicated cache, all keys and values have to be serializable. Serializability is not a requirement for TreeCacheAop, where reflection and AOP are used to replicate any type.
A TreeCache can be either local or replicated. Local trees exist only inside the VM in which they are created, whereas replicated trees propagate any changes to all other replicated trees in the same cluster.
The first version of a cache for JBoss was a hashmap. However, the decision was taken to go with a tree-structured cache because (a) it is more flexible and efficient and (b) a tree can always be reduced to a hashmap, thereby offering both possibilities. The efficiency argument was driven by concerns over replication overhead: a value may itself be a rather sophisticated object, aggregating other objects or containing many fields. A small change to the object would therefore cause the entire object (possibly the transitive closure over the object graph) to be serialized and propagated to the other nodes in the cluster. With a tree, only the modified nodes in the tree need to be serialized. This is not necessarily a concern for TreeCache, but it is a vital requirement for TreeCacheAop (as we will see in the TreeCacheAop documentation).
When a change is made to the TreeCache in the context of a transaction, replication is deferred until the TX commits successfully. All modifications are kept in a list associated with the caller's transaction. When the TX commits, we replicate the changes; on a rollback, we simply undo the changes and release the locks, resulting in no replication traffic. For example, if a caller makes 100 modifications and then rolls back the TX, we will not replicate anything, resulting in no network traffic.
If a caller has no TX associated with it (and the isolation level is not NONE), we replicate right after each modification; in the above case we would send 100 messages, plus an additional message for the rollback. In this sense, running without a transaction can be thought of as having autocommit switched on, in JDBC terminology, where each operation is committed automatically.
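To make this concrete, here is a minimal sketch (assuming tree is a replicated TreeCache and tx is a UserTransaction, obtained as shown in the transaction example later in this document):

// With a transaction: modifications are collected and replicated
// only on commit; a rollback causes no network traffic at all.
tx.begin();
for (int i = 0; i < 100; i++)
    tree.put("/a/b", "key" + i, "val" + i);
tx.rollback(); // nothing is replicated

// Without a transaction (autocommit semantics): each put()
// is replicated individually, i.e. 100 messages.
for (int i = 0; i < 100; i++)
    tree.put("/a/b", "key" + i, "val" + i);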
There is an API for plugging in different transaction managers: all that is required is a way to get the TX associated with the caller's thread. A dummy TransactionManager implementation is provided, as well as one for JBoss.
Finally, we use pessimistic locking of the cache by default (optimistic locking is on the todo list). As mentioned previously, the local locking policy can be configured to correspond to the JDBC-style transaction isolation levels, i.e., SERIALIZABLE, REPEATABLE_READ, READ_COMMITTED, READ_UNCOMMITTED and NONE. Transaction isolation levels are discussed in more detail later. Note that the cluster-wide isolation level is by default read-uncommitted, because we don't acquire a cluster-wide lock on touching an object for which we don't yet have a lock (this would result in too much messaging overhead)[1].
[Figure: two Java VMs, each running a TreeCache instance, replicating to each other via JGroups]
The architecture is shown above. The example shows 2 Java VMs, each of which has created an instance of TreeCache. These VMs can be located on the same machine, or on 2 different machines. The setup of the underlying group communication subsystem is done using JGroups (http://www.jgroups.org).
Any modification (see API below) in one cache will be replicated to the other cache[2] and vice versa. Depending on the TX settings, this will be done either after each modification or at the end of a TX (at commit time). When a new cache is created, it can optionally acquire the contents from one of the existing caches.
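Whether a newly started cache acquires this initial state is controlled by the FetchStateOnStartup attribute (see the configuration table at the end of this document); a minimal excerpt from the XML configuration:

<attribute name="FetchStateOnStartup">true</attribute>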
Here’s some sample code before we dive into the API itself:
TreeCache tree = new TreeCache();
tree.setClusterName("demo-cluster");
tree.setClusterProperties("default.xml");
tree.setCacheMode(TreeCache.REPL_SYNC);
tree.start(); // kick start tree cache
tree.put("/a/b/c", "name", "Ben");
tree.put("/a/b/c/d", "uid", new Integer(322649));
Integer tmp = (Integer)tree.get("/a/b/c/d", "uid");
tree.remove("/a/b");
tree.stop();
The sample code first creates a TreeCache instance and then configures it. There is another constructor which accepts a number of configuration options. However, the TreeCache can be configured entirely from an XML file (shown later), so we don't recommend manual configuration as shown in the sample.
The cluster name, properties of the underlying JGroups stack, and cache mode (synchronous replication) are configured first (a list of configuration options is shown later). Then we start the TreeCache. If replication is enabled, this will make the TreeCache join the cluster, and (optionally) acquire initial state from an existing node.
Then we add 2 items to the cache: the first put() creates a node "a" with a child node "b", which in turn has a child node "c" (TreeCache by default creates intermediary nodes that don't exist). The key "name" is then inserted into the "/a/b/c" node, with a value of "Ben".
The second put() creates just the subnode "d" of "c", because "/a/b/c" already exists. It binds the integer 322649 under the key "uid".
The resulting tree looks like this:
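/
+-- a
    +-- b
        +-- c        {"name" => "Ben"}
            +-- d    {"uid" => 322649}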
The TreeCache now has 4 nodes: "a", "b", "c" and "d". Node "/a/b/c" has the key "name" associated with the value "Ben" in its hashmap, and node "/a/b/c/d" has the key "uid" with the value 322649.
Each node can be retrieved by its absolute name (e.g. "/a/b/c") or by navigating from parent to children (e.g. navigate from "a" to "b", then from "b" to "c").
The next method in the example gets the value associated with key="uid" in node "/a/b/c/d", which is the integer 322649.
The remove() method then removes node "/a/b" and all subnodes recursively from the cache. In this case, nodes "/a/b/c/d", "/a/b/c" and "/a/b" will be removed, leaving only "/a".
Finally, the TreeCache is stopped. This causes it to leave the cluster, and every node in the cluster is notified. Note that a TreeCache can be stopped and started again: when it is stopped, all of its contents are deleted, and when it is restarted and joins a cache group, the state is replicated to it initially, so the contents can potentially be restored.
In the sample, replication was enabled, which caused the 2 put() and the 1 remove() calls to replicate their changes to all nodes in the cluster. The get() method was executed on the local cache only.
Keys into the cache can be either strings separated by slashes ('/'), e.g. "/a/b/c", or fully qualified names (Fqns). An Fqn is essentially a list of Objects that need to implement hashCode() and equals(). All strings are actually transformed into Fqns internally. Fqns are more efficient than strings, for example:
String n1 = "/300/322649";
Fqn n2 = new Fqn(new Object[]{new Integer(300), new Integer(322649)});
In this example, we want to access a node that holds the information for the employee with id=322649 in the department with id=300. The string version needs 2 hashmap lookups on Strings, whereas the Fqn version needs 2 hashmap lookups on Integers. In a large hashtable, the hashCode() values of Strings may collide, leading to actual string comparisons. Also, clients of the cache may already have identifiers for their objects in Object form, and don't want to transform between Objects and Strings, thus avoiding unnecessary copying.
Note that the modification methods are put() and remove(); the only retrieval method is get().
There are 2 put() methods[3]: put(Fqn node, Object key, Object value) and put(Fqn node, HashMap values). The former takes the node name, creates the node if it doesn't yet exist, puts the key and value into the node's hashmap, and returns the previous value. The latter takes a hashmap of keys and values and adds them to the node's hashmap, overwriting existing keys and values. Content that is not in the new hashmap remains in the node's hashmap.
There are 3 remove() methods: remove(Fqn node, Object key), remove(Fqn node) and removeData(Fqn node). The first removes the given key from the node, the second removes the entire node including all subnodes, and the third removes all elements from the given node's hashmap.
The get methods are get(Fqn node) and get(Fqn node, Object key). The former returns a Node[4] object, allowing for direct navigation; the latter returns the value stored under the given key in a node. A short sketch of all of these methods follows below.
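A minimal sketch of these methods in action, assuming the String helper overloads from footnote [3] exist for all of them:

// put(): single key/value vs. a whole map
tree.put("/a/b/c", "name", "Ben");       // creates the node if needed, returns the previous value
java.util.HashMap m = new java.util.HashMap();
m.put("uid", new Integer(322649));
tree.put("/a/b/c", m);                   // merges m into the node's hashmap

// get(): a single value, or the Node itself
Object uid = tree.get("/a/b/c", "uid");  // -> 322649
Node node = tree.get("/a/b/c");          // allows direct navigation from here

// remove(): one key, all data, or the whole subtree
tree.remove("/a/b/c", "name");           // removes just the key "name"
tree.removeData("/a/b/c");               // clears the node's hashmap
tree.remove("/a/b");                     // removes "/a/b" and all subnodes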
Also, the TreeCache has a number of getters and setters. Since the API may change at any time, we recommend consulting the Javadoc for up-to-date information.
The TreeCache can be configured to be either local or replicated.
Local caches don't join a cluster and don't replicate changes to other nodes in a cluster; therefore their elements don't need to be serializable[5]. Replicated caches, on the other hand, replicate all changes to the other TreeCaches (nodes) in the cluster. Replication happens either after each modification (no transactions) or at the end of a transaction (at commit time).
Replication can be synchronous or asynchronous; which one to use is application dependent. Synchronous replication blocks the caller (e.g. on a put()) until the modifications have been replicated successfully to all nodes in the cluster. Asynchronous replication performs replication in the background (the put() returns immediately). TreeCache also offers a replication queue, where modifications are replicated periodically (i.e. interval-based), when the queue size exceeds a given number of elements, or a combination of both.
Asynchronous replication is faster (no caller blocking), because synchronous replication requires acknowledgments from all nodes in the cluster that they received and applied the modification successfully (round-trip time). However, when a synchronous replication call returns successfully, the caller knows for sure that all modifications have been applied at all nodes, whereas this may or may not be the case with asynchronous replication; there, errors are simply written to a log.
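For example, interval-based asynchronous replication with a replication queue could be configured with the attributes from the configuration table at the end of this document (the values shown here are illustrative):

<attribute name="CacheMode">REPL_ASYNC</attribute>
<attribute name="UseReplQueue">true</attribute>
<!-- flush the queue every 100 ms, or as soon as it holds 10 elements -->
<attribute name="ReplQueueInterval">100</attribute>
<attribute name="ReplQueueMaxElements">10</attribute>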
TreeCache currently uses pessimistic locking to prevent concurrent access to the same data. Locking is not exposed directly to the user; instead, the locking behavior is controlled by a configurable transaction isolation level.
Locking internally is done on a node-level, so for example when we want to access "/a/b/c", a lock will be acquired for nodes "a", "b" and "c". When the same transaction wants to access "/a/b/c/d", since we already hold locks for "a", "b" and "c", we only need to acquire a lock for "d".
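In code, the effect looks like this (the locking itself is internal and not visible to the caller):

tree.put("/a/b/c", "key", "value");   // acquires locks on nodes "a", "b" and "c"
tree.put("/a/b/c/d", "key", "value"); // same TX: only node "d" needs a new lock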
Lock owners are either transactions (the call comes in on a transaction) or threads (no transaction is associated with the incoming call). Either way, the local transaction or thread is internally transformed into an instance of GlobalTransaction, which serves as a globally unique ID for modifications across the cluster. E.g. when we run a two-phase commit protocol (see below) across the cluster, the GlobalTransaction uniquely identifies the unit of work across the cluster.
TreeCache supports transaction isolation levels in the manner of the JDBC model. A user can configure an instance-wide isolation level (as defined in java.sql.Connection), i.e., NONE, READ_UNCOMMITTED, READ_COMMITTED, REPEATABLE_READ and SERIALIZABLE. The mapping is summarized here (a configuration example follows the list).
1. NONE. No transaction support is needed, and there is no locking at this level; users have to manage data integrity themselves.
2. READ_UNCOMMITTED. Data can be read at any time, while writes are exclusive. Note that this level doesn't prevent the so-called 'dirty read', where data modified by one thread inside a transaction can be read by another thread before the transaction completes.
3. READ_COMMITTED. Data can be read at any time as long as no write is in progress, while writes do not block on reads. This level prevents the dirty read, but not the so-called 'non-repeatable read', where one thread reading the same data twice can get different results.
4. REPEATABLE_READ. Data can be read while there is no write, and vice versa. This level prevents the 'non-repeatable read', but not the so-called 'phantom read', where new data can be inserted between two reads.
5. SERIALIZABLE. Access is fully synchronized: only a single read or write can proceed at any one time. This is the maximum level of locking.
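For example, to run a cache at READ_COMMITTED, set the IsolationLevel attribute in the configuration file (shown in full later):

<attribute name="IsolationLevel">READ_COMMITTED</attribute>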
A TreeCache can be configured to use transactions to bundle units of work, which can then be replicated as one unit. Alternatively, if transaction support is disabled, the behavior is equivalent to autocommit being switched on, where modifications are potentially[6] replicated after every change (if replication is enabled).
What the TreeCache needs to do on every incoming call (e.g. put()) is (a) get the transaction associated with the thread and (b) register (if not already done) with the transaction manager to be notified when the transaction commits or is rolled back. In order to do this, the cache has to be configured with an instance of TransactionManagerLookup, which returns a javax.transaction.TransactionManager.
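The interface is minimal; judging from the implementation shown below, it is essentially:

public interface TransactionManagerLookup {
    javax.transaction.TransactionManager getTransactionManager() throws Exception;
}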
There are currently 2 implementations of TransactionManagerLookup available: DummyTransactionManagerLookup and JBossTransactionManagerLookup. The former returns a dummy TransactionManager and can be used for standalone TreeCache applications (running outside an application server); it is intended for demo purposes only (e.g. the standalone demo runs with it). The latter is to be used when TreeCache runs inside JBoss, and its implementation looks like this:
public class JBossTransactionManagerLookup implements TransactionManagerLookup {

    public JBossTransactionManagerLookup() {}

    public TransactionManager getTransactionManager() throws Exception {
        Object tmp = new InitialContext().lookup("java:/TransactionManager");
        return (TransactionManager)tmp;
    }
}
The implementation looks up the JBoss TransactionManager in JNDI and returns it.
When a call comes in, the TreeCache gets the current transaction and records the modification under the transaction as the key. (If there is no transaction, the modification is applied immediately and possibly replicated.) Thus, over the lifetime of the transaction, all modifications are recorded and associated with it. In addition, when it first encounters the transaction, the TreeCache registers with it, to be notified when the transaction commits or is rolled back.
When a transaction rolls back, we undo the changes in the cache and release all locks.
When the transaction commits, we initiate a two-phase commit protocol[7]: in the first phase, a PREPARE containing all modifications for the current transaction is sent to all nodes in the cluster. Each node acquires all necessary locks and applies the changes, and then sends back a success message. If a node in a cluster cannot acquire all locks, or fails otherwise, it sends back a failure message.
The coordinator of the two-phase commit protocol waits for all responses (or a timeout, whichever occurs first). If one of the nodes in the cluster responds with FAIL (or we hit the timeout), then a rollback phase is initiated: a ROLLBACK message is sent to all nodes in the cluster. On reception of the ROLLBACK message, every node undoes the changes for the given transaction, and releases all locks held for the transaction.
If all responses are OK, a COMMIT message is sent to all nodes in the cluster. On reception of a COMMIT message, each node applies the changes for the given transaction and releases all locks associated with the transaction.
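A sketch of the coordinator's decision logic during commit (the names used here, such as broadcast() and PrepareMessage, are hypothetical and only illustrate the protocol; they are not the actual TreeCache internals):

// First phase: send a PREPARE with all modifications and collect the responses.
List responses = broadcast(new PrepareMessage(gtx, modifications), timeout);

// Any FAIL response, or a missing response (timeout), forces a rollback.
boolean allOk = true;
for (Iterator it = responses.iterator(); it.hasNext();) {
    Object rsp = it.next();
    if (rsp == null /* timed out */ || FAIL.equals(rsp)) {
        allOk = false;
        break;
    }
}

// Second phase: COMMIT applies the changes, ROLLBACK undoes them;
// both cause every node to release the locks held for gtx.
broadcast(allOk ? new CommitMessage(gtx) : new RollbackMessage(gtx), timeout);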
Note that where we referred to a 'transaction' above, we actually mean the global representation of the local transaction (the GlobalTransaction mentioned earlier), which uniquely identifies the transaction across the cluster.
Let's look at an example of how to use the standalone (e.g. outside an appserver) TreeCache with dummy transactions:
Properties prop = new Properties();
prop.put(Context.INITIAL_CONTEXT_FACTORY, "org.jboss.cache.transaction.DummyContextFactory");
UserTransaction tx = (UserTransaction)new InitialContext(prop).lookup("UserTransaction");
TreeCache tree = new TreeCache();
PropertyConfigurator config = new PropertyConfigurator();
config.configure(tree, "META-INF/replSync-service.xml");
tree.start(); // kick start tree cache
try {
    tx.begin();
    tree.put("/classes/cs-101", "description", "the basics");
    tree.put("/classes/cs-101", "teacher", "Ben");
    tx.commit();
}
catch(Throwable ex) {
    try { tx.rollback(); } catch(Throwable t) {}
}
The first lines obtain a user transaction using the 'J2EE way' via JNDI. Note that we could also say
UserTransaction tx = new DummyUserTransaction(DummyTransactionManager.getInstance());
Then we create a new TreeCache and configure it using a PropertyConfigurator class and a configuration XML file (see below for a list of all configuration options).
Next we start the cache. Then, we start a transaction (and associate it with the current thread internally). Any methods invoked on the cache will now be collected and only applied when the transaction is committed. In the above case, we create a node "/classes/cs-101" and add 2 elements to its hashmap. Assuming that the cache is configured to use synchronous replication, on transaction commit the modifications are replicated. If there is an exception in the methods (e.g. lock acquisition failed), or in the two-phase commit protocol applying the modifications to all nodes in the cluster, the transaction is rolled back.
All properties of the cache are configured via setters and can be retrieved via getters. This can be done either manually, or via the PropertyConfigurator and an XML file. A sample configuration file is shown below (stripped of comments etc):
<?xml version="1.0" encoding="UTF-8"?>
<server>
   <mbean code="org.jboss.cache.TreeCache" name="jboss.cache:service=TreeCache">
      <depends>jboss:service=Naming</depends>
      <depends>jboss:service=TransactionManager</depends>
      <attribute name="TransactionManagerLookupClass">org.jboss.cache.DummyTransactionManagerLookup</attribute>
      <!-- Node locking level: SERIALIZABLE, REPEATABLE_READ (default),
           READ_COMMITTED, READ_UNCOMMITTED, NONE -->
      <attribute name="IsolationLevel">REPEATABLE_READ</attribute>
      <!-- Valid modes are LOCAL, REPL_ASYNC, REPL_SYNC -->
      <attribute name="CacheMode">REPL_SYNC</attribute>
      <!-- Cluster name. Needs to be the same for all nodes, to find each other -->
      <attribute name="ClusterName">TreeCache-Cluster</attribute>
      <!-- JGroups protocol stack properties -->
      <attribute name="ClusterConfig">
         <config>
            <UDP mcast_addr="228.1.2.3" mcast_port="45566" loopback="false"/>
            <MERGE2 min_interval="10000" max_interval="20000"/>
            <FD shun="true" up_thread="true" down_thread="true"/>
            <VERIFY_SUSPECT timeout="1500"/>
            <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"/>
            <pbcast.STABLE desired_avg_gossip="20000"/>
            <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10"/>
            <FRAG frag_size="8192" down_thread="false" up_thread="false"/>
            <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true"/>
            <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
         </config>
      </attribute>
      <!-- The max amount of time (in milliseconds) we wait until the initial state
           (i.e. the contents of the cache) is retrieved from existing members in a
           clustered environment -->
      <attribute name="InitialStateRetrievalTimeout">20000</attribute>
      <!-- Max msecs to wait until all responses for a sync repl have been received -->
      <attribute name="SyncReplTimeout">10000</attribute>
      <!-- Max number of milliseconds to wait for a lock acquisition -->
      <attribute name="LockAcquisitionTimeout">15000</attribute>
   </mbean>
</server>
The PropertyConfigurator.configure() method takes as an argument a filename which must be found on the classpath; it uses this file to configure the TreeCache from the properties defined in it. Note that this configuration file is used to configure the TreeCache both as a standalone cache and as an MBean when run inside the JBoss container[8].
A list of properties is shown below:
Name | Description
CacheMode | LOCAL, REPL_SYNC or REPL_ASYNC
ClusterName | Name of the cluster. Needs to be the same for all nodes in a cluster in order to find each other.
ClusterConfig | The configuration of the underlying JGroups stack. See cluster-service.xml for an example.
EvictionPolicyClass | The name of a class implementing EvictionPolicy. Not currently used.
FetchStateOnStartup | Whether or not to acquire the initial state from existing members. Allows for warm/hot caches (true/false).
InitialStateRetrievalTimeout | Time in milliseconds to wait for initial state retrieval.
IsolationLevel | Node locking level: SERIALIZABLE, REPEATABLE_READ (default), READ_COMMITTED, READ_UNCOMMITTED and NONE. Case doesn't matter. See the documentation on locking for details.
LockAcquisitionTimeout | Time in milliseconds to wait for a lock to be acquired. If a lock cannot be acquired, an exception will be thrown.
LockLeaseTimeout | The time in milliseconds a lock is held. Not currently used and may be removed in the future; we detect crashed members and remove their locks.
MaxCapacity | The maximum number of elements in the cache. Can be used by an eviction policy.
ReplQueueInterval | Time in milliseconds between replications of elements from the replication queue.
SyncReplTimeout | For synchronous replication: time in milliseconds to wait until replication acks have been received from all nodes in the cluster.
ReplQueueMaxElements | Maximum number of elements in the replication queue before replication kicks in.
TransactionManagerLookupClass | The fully qualified name of a class implementing TransactionManagerLookup. Default is JBossTransactionManagerLookup.
UseReplQueue | For asynchronous replication: whether or not to use a replication queue (true/false).
[1] We will offer a cluster-wide serializable policy once we implement optimistic locking, which will be more efficient than implementing a serializable policy over a pessimistic locking scheme.
[2] Note that you can have more than 2 caches in a cluster.
[3] Plus their equivalent helper methods taking a String as node name.
[4] This is mainly used internally, and we may decide to remove public access to the Node in a future release.
[5] However, we recommend making them serializable, enabling a user to change the cache mode at any time.
[6] Depending on whether interval-based replication is used.
[7] Only on synchronous replication.
[8] Actually, we will use an XMBean in the next release.