TreeCache: a Tree Structured Replicated Transactional Cache

Bela Ban, Ben Wang, Nov 2003

bela@jboss.org

ben.wang@jboss.org

 

 

This document and its companion describe the TreeCache, a tree-structured replicated transactional cache, and the TreeCacheAop, an AOP-enabled subclass of TreeCache that allows Plain Old Java Objects (POJOs) to be inserted and replicated transactionally between nodes in a cluster. The TreeCache is configurable: aspects of the system such as replication (async, sync, or none), transaction isolation level, and transaction manager can all be set.

TreeCache can be used on its own, but TreeCacheAop requires both TreeCache and the JBoss 4.0 AOP standalone subsystem.

1.  Overview

The structure of the cache is a tree with nodes. Each node has a name and zero or more children. A node can only have 1 parent; there is currently no support for graphs. A node can be reached by navigating from the root recursively through children, until the node is found. It can also be accessed by giving a fully qualified name (FQN), which consists of the concatenation of all node names from the root to this particular node.

A TreeCache can have multiple roots, allowing for a number of different trees to be present in the cache. Note that a one-level tree is essentially a hashtable. Each node in the tree has a hashmap of keys and values. For a replicated cache, all keys and values have to be serializable. Serializability is not a requirement for TreeCacheAop, where reflection and AOP are used to replicate any type.
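
The node structure described above can be illustrated with a minimal sketch. The class names SketchTree and SketchNode are hypothetical and are not part of TreeCache; the sketch assumes string node names separated by slashes and a single hidden root whose children play the role of the multiple roots. It only models a local tree, with no locking, transactions or replication.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a local tree cache; this is not the TreeCache class. */
class SketchTree {
    private final SketchNode root = new SketchNode();

    /** Walks from the root along a name such as "/a/b/c", optionally creating missing nodes. */
    private SketchNode find(String fqn, boolean create) {
        SketchNode current = root;
        for (String name : fqn.split("/")) {
            if (name.isEmpty()) continue;              // skip the empty part before the leading slash
            SketchNode child = current.children.get(name);
            if (child == null) {
                if (!create) return null;              // node does not exist
                child = new SketchNode();
                current.children.put(name, child);
            }
            current = child;
        }
        return current;
    }

    /** Stores key/value in the node's hashmap and returns the previous value, if any. */
    Object put(String fqn, Object key, Object value) {
        return find(fqn, true).data.put(key, value);
    }

    Object get(String fqn, Object key) {
        SketchNode node = find(fqn, false);
        return node == null ? null : node.data.get(key);
    }

    boolean exists(String fqn) {
        return find(fqn, false) != null;
    }

    public static void main(String[] args) {
        SketchTree tree = new SketchTree();
        tree.put("/a/b/c", "name", "Ben");
        System.out.println(tree.get("/a/b/c", "name"));
    }
}

/** Hypothetical node: children indexed by name, plus a hashmap of keys and values. */
class SketchNode {
    final Map<String, SketchNode> children = new HashMap<>();
    final Map<Object, Object> data = new HashMap<>();
}
```

A tree that only ever uses one level of names below the root behaves like a plain hashtable of hashmaps, which is the reduction mentioned above.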

A TreeCache can be either local or replicated. Local trees exist only inside the VM in which they are created, whereas replicated trees propagate any changes to all other replicated trees in the same cluster.

The first version of a cache for JBoss was a hashmap. However, the decision was taken to go with a tree-structured cache because (a) it is more flexible and efficient and (b) a tree can always be reduced to a hashmap, thereby offering both possibilities. The efficiency argument was driven by concerns over replication overhead: a value can itself be a rather sophisticated object, with aggregation pointing to other objects, or an object containing many fields. A small change in the object would therefore trigger the entire object (possibly the transitive closure over the object graph) to be serialized and propagated to the other nodes in the cluster. With a tree, only the modified nodes in the tree need to be serialized. This is not necessarily a concern for TreeCache, but is a vital requirement for TreeCacheAop (as we will see in the TreeCacheAop documentation).

When a change is made to the TreeCache in the context of a transaction, we defer replication until the TX commits successfully. All modifications are kept in a list associated with the caller's transaction. When the TX commits, we replicate the changes. Otherwise (on a rollback), we simply undo the changes and release the locks. For example, if a caller makes 100 modifications and then rolls back the TX, we will not replicate anything, resulting in no network traffic.

If a caller has no TX associated with it (and the isolation level is not NONE), we will replicate right after each modification, e.g. in the above case we would send 100 messages, plus an additional message for the rollback. In this sense, running without a transaction can be thought of as having autocommit on in JDBC terminology, where each operation is committed automatically.
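
The difference between the two cases can be sketched as follows. TxReplicationSketch is a hypothetical class, not TreeCache code: it assumes a single transaction at a time, models the network as a list of sent messages, and assumes that all changes of a committed TX travel in one message. Local undo, locking and the extra rollback message mentioned above are not modeled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: buffers modifications per transaction and replicates them on commit. */
class TxReplicationSketch {
    /** Stands in for the network: every entry is one message sent to the cluster. */
    final List<String> sentMessages = new ArrayList<>();

    /** Modifications recorded for the transaction in progress, or null if there is none. */
    private List<String> pending;

    void begin() {
        pending = new ArrayList<>();
    }

    void modify(String modification) {
        if (pending != null) {
            pending.add(modification);          // inside a TX: defer replication
        } else {
            sentMessages.add(modification);     // no TX: replicate right away ("autocommit")
        }
    }

    void commit() {
        if (pending == null) throw new IllegalStateException("no transaction in progress");
        if (!pending.isEmpty()) {
            sentMessages.add(String.join(";", pending)); // assumption: one message per TX
        }
        pending = null;
    }

    void rollback() {
        pending = null;                         // nothing was sent, so nothing goes over the wire
    }

    public static void main(String[] args) {
        TxReplicationSketch cache = new TxReplicationSketch();
        cache.begin();
        for (int i = 0; i < 100; i++) cache.modify("put #" + i);
        cache.rollback();
        System.out.println("messages after rollback: " + cache.sentMessages.size());
    }
}
```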

There is an API for plugging in different transaction managers: all it requires is to get the TX associated with the caller’s thread. There is a dummy TransactionManager implementation, and one for JBoss.

Finally, we use pessimistic locking of the cache by default (optimistic locking is on the todo list). As mentioned previously, we can configure the local locking policy corresponding to the JDBC-style transaction isolation level, i.e. SERIALIZABLE, REPEATABLE_READ, READ_COMMITTED, READ_UNCOMMITTED and NONE. Transaction isolation levels are discussed in more detail later. Note that the cluster-wide isolation level is by default read-uncommitted because we don’t acquire a cluster-wide lock on touching an object for which we don’t yet have a lock (this would result in too much overhead for messaging)[1].

 

2.  Architecture

 

The architecture is shown above. The example shows 2 Java VMs, each of which has created an instance of TreeCache. These VMs can be located on the same machine, or on 2 different machines. The underlying group communication subsystem is set up using JGroups (http://www.jgroups.org).

Any modification (see API below) in one cache will be replicated to the other cache[2] and vice versa. Depending on the TX settings, this will be done either after each modification or at the end of a TX (at commit time). When a new cache is created, it can optionally acquire the contents from one of the existing caches.

 

3.  API

Here’s some sample code before we dive into the API itself:

TreeCache tree = new TreeCache();

tree.setClusterName("demo-cluster");

tree.setClusterProperties("default.xml");

tree.setCacheMode(TreeCache.REPL_SYNC);

tree.start();  // kick start tree cache

tree.put("/a/b/c", "name", "Ben");

tree.put("/a/b/c/d", "uid", new Integer(322649));

Integer tmp=(Integer)tree.get("/a/b/c/d", "uid");

tree.remove("/a/b");

tree.stop();

The sample code first creates a TreeCache instance and then configures it. There is another constructor which accepts a number of configuration options. However, the TreeCache can be configured entirely from an XML file (shown later), so we don't recommend manual configuration as shown in the sample.

The cluster name, properties of the underlying JGroups stack, and cache mode (synchronous replication) are configured first (a list of configuration options is shown later). Then we start the TreeCache. If replication is enabled, this will make the TreeCache join the cluster, and (optionally) acquire initial state from an existing node.

Then we add 2 items into the cache: the first element creates a node "a" with a child node "b" that has a child node "c". (TreeCache by default creates intermediary nodes that don't exist). The key "name" is then inserted into the "/a/b/c" node, with a value of "Ben".

The other element will create just the subnode "d" of "c" because "/a/b/c" already exists. It binds the integer 322649 under key "uid".

The resulting tree looks like this:

The TreeCache has 4 nodes "a", "b", "c" and "d". Node "/a/b/c" has the key "name" associated with the value "Ben" in its hashmap, and node "/a/b/c/d" has the key "uid" associated with the value 322649.

Each node can be retrieved by its absolute name (e.g. "/a/b/c") or by navigating from parent to children (e.g. navigate from "a" to "b", then from "b" to "c").

The next method in the example gets the value associated with key="uid" in node "/a/b/c/d", which is the integer 322649.

The remove() method then removes node "/a/b" and all subnodes recursively from the cache. In this case, nodes "/a/b/c/d", "/a/b/c" and "/a/b" will be removed, leaving only "/a".

Finally, the TreeCache is stopped. This will cause it to leave the cluster, and every node in the cluster will be notified. Note that TreeCache can be stopped and started again. When it is stopped, all contents will be deleted. And when it is restarted, if it joins a cache group, the state will be replicated initially. So potentially you can recreate the contents.

In the sample, replication was enabled, which caused the 2 put() calls and the 1 remove() call to replicate their changes to all nodes in the cluster. The get() method was executed on the local cache only.

Keys into the cache can be either strings separated by slashes ('/'), e.g. "/a/b/c", or fully qualified names (Fqns). An Fqn is essentially a list of Objects that need to implement hashCode() and equals(). All strings are actually transformed into Fqns internally. Fqns are more efficient than strings, for example:

String n1="/300/322649";

Fqn n2=new Fqn(new Object[]{new Integer(300), new Integer(322649)});

In this example, we want to access a node that has information for the employee with id=322649 in the department with id=300. The string version needs 2 hashmap lookups on Strings, whereas the Fqn version needs 2 hashmap lookups on Integers. In a large hashtable, the hashCode() method for String may have collisions, leading to actual string comparisons. Also, clients of the cache may already have identifiers for their objects in Object form; using Fqns spares them the conversion between Objects and Strings and the unnecessary copying it entails.
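
To make the "list of Objects" idea concrete, here is a minimal sketch of such a name type. SketchFqn is a hypothetical class written for this illustration and is not the real Fqn class; it only shows how equals() and hashCode() can be delegated to the element list, and that a name parsed from a string consists of String elements while a name built from Integers keeps its Integer elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for an Fqn: an immutable list of name elements. */
final class SketchFqn {
    private final List<Object> elements;

    SketchFqn(Object... elements) {
        // Copy the array so later changes by the caller cannot affect this name.
        this.elements = Collections.unmodifiableList(Arrays.asList(elements.clone()));
    }

    /** Parses "/300/322649" into the two String elements "300" and "322649". */
    static SketchFqn fromString(String path) {
        List<Object> parts = new ArrayList<>();
        for (String name : path.split("/")) {
            if (!name.isEmpty()) parts.add(name);
        }
        return new SketchFqn(parts.toArray());
    }

    int size() {
        return elements.size();
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof SketchFqn && elements.equals(((SketchFqn) other).elements);
    }

    @Override
    public int hashCode() {
        return elements.hashCode();
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Object element : elements) sb.append('/').append(element);
        return sb.toString();
    }

    public static void main(String[] args) {
        SketchFqn byId = new SketchFqn(300, 322649);
        SketchFqn byString = SketchFqn.fromString("/300/322649");
        System.out.println(byId + " equals " + byString + "? " + byId.equals(byString));
    }
}
```

In this sketch the two names print identically but are not equal, because Integer and String elements never compare equal. A client therefore has to use one form consistently for a given node.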

Note that the modification methods are put() and remove(). The only get method is get().

There are 2 put() methods[3]: put(Fqn node, Object key, Object value) and put(Fqn node, Hashmap values). The former takes the node name, creates the node if it doesn't yet exist, and puts the key and value into the node's hashmap, returning the previous value. The latter takes a hashmap of keys and values and adds them to the node's hashmap, overwriting existing keys and values. Content that is not in the new hashmap remains in the node's hashmap.

There are 3 remove() methods: remove(Fqn node, Object key), remove(Fqn node) and removeData(Fqn node). The first removes the given key from the node. The second removes the entire node and all subnodes, and the third removes all elements from the given node's hashmap.

The get methods are get(Fqn node) and get(Fqn node, Object key). The former returns a Node[4] object, allowing for direct navigation; the latter returns the value for the given key for a node.

Also, the TreeCache has a number of getters and setters. Since the API may change at any time, we recommend consulting the Javadoc for up-to-date information.

4.  Replication

The TreeCache can be configured to be either local or replicated.

Local caches don't join a cluster and don't replicate changes to other nodes in a cluster. Therefore their elements don't need to be serializable[5]. On the other hand, replicated caches replicate all changes to the other TreeCaches (nodes) in the cluster. Replication can either happen after each modification (no transactions), or at the end of a transaction (commit time).

Replication can be synchronous or asynchronous. Which one to use depends on the application. Synchronous replication blocks the caller (e.g. on a put()) until the modifications have been replicated successfully to all nodes in a cluster. Asynchronous replication performs replication in the background (the put() returns immediately). TreeCache also offers a replication queue, where modifications are replicated periodically (i.e. interval-based), when the queue size exceeds a number of elements, or a combination thereof.
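
The replication queue idea can be sketched as follows. ReplQueueSketch is a hypothetical class, not the TreeCache implementation: it assumes that a flush is triggered as soon as the queue reaches a configured number of elements, and that some external timer calls onInterval() periodically. Sent batches are collected in a list instead of being put on the network.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a replication queue with size-based and interval-based flushing. */
class ReplQueueSketch {
    private final int maxElements;
    private final List<String> queue = new ArrayList<>();

    /** Stands in for the network: each entry is one batch replicated to the cluster. */
    final List<List<String>> batchesSent = new ArrayList<>();

    ReplQueueSketch(int maxElements) {
        this.maxElements = maxElements;
    }

    /** Queues a modification; assumption: flush as soon as the size limit is reached. */
    synchronized void add(String modification) {
        queue.add(modification);
        if (queue.size() >= maxElements) flush();
    }

    /** Would be called by a timer each time the configured interval elapses. */
    synchronized void onInterval() {
        flush();
    }

    private void flush() {
        if (queue.isEmpty()) return;                 // nothing to replicate
        batchesSent.add(new ArrayList<>(queue));     // send a copy of the queued modifications
        queue.clear();
    }

    public static void main(String[] args) {
        ReplQueueSketch replQueue = new ReplQueueSketch(3);
        replQueue.add("put 1");
        replQueue.add("put 2");
        replQueue.add("put 3");                      // reaches the limit, triggers a flush
        replQueue.add("put 4");
        replQueue.onInterval();                      // timer tick flushes the remainder
        System.out.println("batches sent: " + replQueue.batchesSent.size());
    }
}
```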

Asynchronous replication is faster (no caller blocking), because synchronous replication requires acknowledgments from all nodes in a cluster that they received and applied the modification successfully (round-trip time). However, when a synchronous replication returns successfully, the caller knows for sure that all modifications have been applied at all nodes, whereas this may or may not be the case with asynchronous replication. With asynchronous replication, errors are simply written to a log.

5.  Transaction Isolation Level

TreeCache currently uses pessimistic locking to prevent concurrent access to the same data. Locking is not exposed directly to the user. Instead, the user configures a transaction isolation level, which determines the locking behavior.

Locking internally is done on a node-level, so for example when we want to access "/a/b/c", a lock will be acquired for nodes "a", "b" and "c". When the same transaction wants to access "/a/b/c/d", since we already hold locks for "a", "b" and "c", we only need to acquire a lock for "d".
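
The path-wise acquisition can be sketched like this. PathLockSketch is a hypothetical class that only tracks which node names a single owner already holds; it does not block, does not distinguish read from write locks, and is not the TreeCache lock implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: tracks the node locks held by one owner (transaction or thread). */
class PathLockSketch {
    /** Fully qualified names of the nodes this owner already holds locks for. */
    private final Set<String> held = new LinkedHashSet<>();

    /** Locks every node on the path and returns only the nodes that were newly locked. */
    List<String> lockPath(String fqn) {
        List<String> newlyLocked = new ArrayList<>();
        StringBuilder prefix = new StringBuilder();
        for (String name : fqn.split("/")) {
            if (name.isEmpty()) continue;            // skip the empty part before the leading slash
            prefix.append('/').append(name);
            if (held.add(prefix.toString())) {       // add() returns false if already held
                newlyLocked.add(prefix.toString());
            }
        }
        return newlyLocked;
    }

    /** Called on commit or rollback: releases everything the owner holds. */
    void releaseAll() {
        held.clear();
    }

    public static void main(String[] args) {
        PathLockSketch owner = new PathLockSketch();
        System.out.println(owner.lockPath("/a/b/c"));
        System.out.println(owner.lockPath("/a/b/c/d"));
    }
}
```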

Lock owners are either transactions (the call comes in on a transaction) or threads (no transaction is associated with the incoming call). Either way, a local transaction or a thread is internally transformed into an instance of GlobalTransaction, which is used as a globally unique ID for modifications across a cluster. E.g. when we run a two-phase commit protocol (see below) across the cluster, the GlobalTransaction uniquely identifies the unit of work across a cluster.

TreeCache supports transaction isolation levels in the manner of the JDBC model. A user can configure an instance-wide isolation level (one of those defined in java.sql.Connection), i.e. NONE, READ_UNCOMMITTED, READ_COMMITTED, REPEATABLE_READ, and SERIALIZABLE. The mapping is summarized here.

1.      NONE. No transaction support is needed. There is no locking at this level, i.e. users have to manage data integrity themselves.

2.      READ_UNCOMMITTED. Data can be read at any time, while write operations are exclusive. Note that this level doesn’t prevent the so-called ‘dirty read’, where data modified by one thread inside a transaction can be read by another thread.

3.      READ_COMMITTED. Data can be read at any time as long as there is no write, while write is non-blocking. This level prevents the dirty read. But it doesn’t prevent the so-called ‘non-repeatable read’, where a thread that reads the same data twice can get different results.

4.      REPEATABLE_READ. Data can be read while there is no write and vice versa. This level prevents the ‘non-repeatable read’ but it does not prevent the so-called ‘phantom read’, where new data can be inserted.

5.      SERIALIZABLE. Access is synchronized so that only a read or a write can proceed at any one time. This is the maximum level of locking.
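
Since the level names follow java.sql.Connection, the correspondence can be written down as a small lookup. IsolationLevelSketch is a hypothetical helper for illustration only; whether TreeCache uses these JDBC constants internally is not stated here, so treat the mapping as an assumption. The lookup is case-insensitive, in line with the configuration option.

```java
import java.sql.Connection;
import java.util.Locale;

/** Hypothetical helper: maps an isolation level name to the java.sql.Connection constant. */
class IsolationLevelSketch {
    static int toJdbcConstant(String name) {
        switch (name.trim().toUpperCase(Locale.ROOT)) {
            case "NONE":
                return Connection.TRANSACTION_NONE;
            case "READ_UNCOMMITTED":
                return Connection.TRANSACTION_READ_UNCOMMITTED;
            case "READ_COMMITTED":
                return Connection.TRANSACTION_READ_COMMITTED;
            case "REPEATABLE_READ":
                return Connection.TRANSACTION_REPEATABLE_READ;
            case "SERIALIZABLE":
                return Connection.TRANSACTION_SERIALIZABLE;
            default:
                throw new IllegalArgumentException("unknown isolation level: " + name);
        }
    }

    public static void main(String[] args) {
        System.out.println(toJdbcConstant("repeatable_read") == Connection.TRANSACTION_REPEATABLE_READ);
    }
}
```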

/**** Let’s skip the following. Users need not know what kind of locking mechanism to use. We abstract it to transaction isolation level, a la JDBC style. This is not an issue in the case of no tx since each operation is atomic in that case (autocommit on).

Locks can be read-write or read-only. Write locks serialize read and write access, whereas read-only locks only serialize read access. When a write lock is held, no other write or read locks can be acquired. When a read lock is held, others can acquire read locks. However, to acquire write locks, one has to wait until all read locks have been released. When scheduled concurrently, write locks always have precedence over read locks. Note that (if enabled) read locks can be upgraded to write locks.

Using read-write locks helps in the following scenario: consider a tree with entries "/a/b/n1" and "/a/b/n2". With write-locks, when Tx1 accesses "/a/b/n1", Tx2 cannot access "/a/b/n2" until Tx1 has completed and released its locks. However, with read-write locks this is possible, because Tx1 acquires read-locks for "/a/b" and a read-write lock for "/a/b/n1". Tx2 is then able to acquire read-locks for "/a/b" as well, plus a read-write lock for "/a/b/n2". This allows for more concurrency in accessing the cache.

****/

 

6.  Transactional Support

A TreeCache can be configured to use transactions to bundle units of work, which can then be replicated as one unit. Alternatively, if transaction support is disabled, it is equivalent to setting AutoCommit to on where modifications are potentially[6]  replicated after every change (if replication is enabled).

What TreeCache needs to do on every incoming call (e.g. put()) is (a) get the transaction associated with the thread and (b) register (if not already done) with the transaction manager to be notified when a transaction commits or is rolled back. In order to do this, the cache has to be configured with an instance of TransactionManagerLookup which returns a javax.transaction.TransactionManager.

There are currently 2 implementations of TransactionManagerLookup available: DummyTransactionManagerLookup and JBossTransactionManagerLookup. The former is a dummy implementation of a TransactionManager which can be used for standalone TreeCache applications (running outside an appserver). This is just for demo purposes (e.g. the standalone demo runs with this one). The latter is to be used when TreeCache is used inside JBoss, and its implementation looks like this:

import javax.naming.InitialContext;

import javax.transaction.TransactionManager;

public class JBossTransactionManagerLookup implements TransactionManagerLookup {

    public JBossTransactionManagerLookup() {;}

    public TransactionManager getTransactionManager() throws Exception {

        Object tmp=new InitialContext().lookup("java:/TransactionManager");

        return (TransactionManager)tmp;

    }

}

The implementation looks up the JBoss TransactionManager from JNDI and returns it.

When a call comes in, the TreeCache gets the current transaction and records the modification under the transaction as key. (If there is no transaction, the modification is applied immediately and possibly replicated.) So over the lifetime of the transaction all modifications will be recorded and associated with the transaction. Also, when it first encounters a transaction, the TreeCache registers with it to be notified when the transaction is committed or aborted.

When a transaction rolls back, we undo the changes in the cache and release all locks.

When the transaction commits, we initiate a two-phase commit protocol[7]: in the first phase, a PREPARE containing all modifications for the current transaction is sent to all nodes in the cluster. Each node acquires all necessary locks and applies the changes, and then sends back a success message. If a node in a cluster cannot acquire all locks, or fails otherwise, it sends back a failure message.

The coordinator of the two-phase commit protocol waits for all responses (or a timeout, whichever occurs first). If one of the nodes in the cluster responds with FAIL (or we hit the timeout), then a rollback phase is initiated: a ROLLBACK message is sent to all nodes in the cluster. On reception of the ROLLBACK message, every node undoes the changes for the given transaction, and releases all locks held for the transaction.

If all responses are OK, a COMMIT message is sent to all nodes in the cluster. On reception of a COMMIT message, each node applies the changes for the given transaction and releases all locks associated with the transaction.
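
The message flow of the protocol can be sketched as follows. TwoPhaseCommitSketch and SketchParticipant are hypothetical classes: participants are called directly instead of over the network, each participant merely records the messages it receives, the outcome of PREPARE is fixed up front, and timeouts are not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical coordinator for the PREPARE / COMMIT / ROLLBACK exchange. */
class TwoPhaseCommitSketch {
    /** Returns true if the transaction was committed on all members. */
    static boolean run(String tx, List<String> modifications, List<SketchParticipant> members) {
        boolean allOk = true;
        for (SketchParticipant member : members) {   // phase 1: PREPARE to every member
            if (!member.prepare(tx, modifications)) allOk = false;
        }
        for (SketchParticipant member : members) {   // phase 2: COMMIT or ROLLBACK to every member
            if (allOk) member.commit(tx); else member.rollback(tx);
        }
        return allOk;
    }

    public static void main(String[] args) {
        List<SketchParticipant> members =
            Arrays.asList(new SketchParticipant(true), new SketchParticipant(false));
        boolean committed = run("tx1", Arrays.asList("put /a/b/c name=Ben"), members);
        System.out.println("committed: " + committed + ", first member saw " + members.get(0).log);
    }
}

/** Hypothetical cluster member; it records the messages it receives. */
class SketchParticipant {
    private final boolean canAcquireLocks;           // fixed outcome of PREPARE in this sketch
    final List<String> log = new ArrayList<>();

    SketchParticipant(boolean canAcquireLocks) {
        this.canAcquireLocks = canAcquireLocks;
    }

    /** Returns true for a success response, false for a failure response. */
    boolean prepare(String tx, List<String> modifications) {
        log.add("PREPARE " + tx);
        return canAcquireLocks;
    }

    void commit(String tx) {
        log.add("COMMIT " + tx);
    }

    void rollback(String tx) {
        log.add("ROLLBACK " + tx);
    }
}
```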

When we refer to a 'transaction', we actually mean a global representation of a local transaction, which uniquely identifies a transaction across a cluster.

 

6.1.  Example

Let's look at an example of how to use the standalone (i.e. outside an appserver) TreeCache with dummy transactions:

Properties prop = new Properties();

prop.put(Context.INITIAL_CONTEXT_FACTORY,

    "org.jboss.cache.transaction.DummyContextFactory");

UserTransaction tx=(UserTransaction)new InitialContext(prop).lookup("UserTransaction");

TreeCache tree = new TreeCache();

PropertyConfigurator config = new PropertyConfigurator();

config.configure(tree, "META-INF/replSync-service.xml");

tree.start();  // kick start tree cache

try {

    tx.begin();

    tree.put("/classes/cs-101", "description", "the basics");

    tree.put("/classes/cs-101", "teacher", "Ben");

    tx.commit();

}

catch(Throwable ex) {

  try { tx.rollback(); } catch(Throwable t) {}

}

The first lines obtain a user transaction using the 'J2EE way' via JNDI. Note that we could also say

UserTransaction tx=new DummyUserTransaction(DummyTransactionManager.getInstance());

Then we create a new TreeCache and configure it using a PropertyConfigurator class and a configuration XML file (see below for a list of all configuration options).

Next we start the cache. Then, we start a transaction (and associate it with the current thread internally). Any methods invoked on the cache will now be collected and only applied when the transaction is committed. In the above case, we create a node "/classes/cs-101" and add 2 elements to its hashmap. Assuming that the cache is configured to use synchronous replication, on transaction commit the modifications are replicated. If there is an exception in the methods (e.g. lock acquisition failed), or in the two-phase commit protocol applying the modifications to all nodes in the cluster, the transaction is rolled back.

 

7.  Configuration

All properties of the cache are configured via setters and can be retrieved via getters. This can be done either manually, or via the PropertyConfigurator and an XML file. A sample configuration file is shown below (stripped of comments etc):

 

<?xml version="1.0" encoding="UTF-8"?>

<server>

  <mbean code="org.jboss.cache.TreeCache" name="jboss.cache:service=TreeCache">

    <depends>jboss:service=Naming</depends>

    <depends>jboss:service=TransactionManager</depends>

 

    <attribute name="TransactionManagerLookupClass">

        org.jboss.cache.DummyTransactionManagerLookup

    </attribute>

 

    <!--

        Node locking level : SERIALIZABLE

                             REPEATABLE_READ (default)

                             READ_COMMITTED

                             READ_UNCOMMITTED

                             NONE

    -->

    <attribute name="IsolationLevel">REPEATABLE_READ</attribute>

 

    <!-- Valid modes are LOCAL, REPL_ASYNC, REPL_SYNC -->

    <attribute name="CacheMode">REPL_SYNC</attribute>

 

    <!-- Cluster name. Needs to be the same for all nodes, to find each other -->

    <attribute name="ClusterName">TreeCache-Cluster</attribute>

 

    <!-- JGroups protocol stack properties -->

    <attribute name="ClusterConfig">

        <config>

        <UDP mcast_addr="228.1.2.3" mcast_port="45566" loopback="false" />

        <PING timeout="2000" num_initial_members="3" />

        <MERGE2 min_interval="10000" max_interval="20000" />

        <FD shun="true" up_thread="true" down_thread="true" />

        <VERIFY_SUSPECT timeout="1500"/>

        <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"/>

        <pbcast.STABLE desired_avg_gossip="20000"/>

        <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10"/>

        <FRAG frag_size="8192" down_thread="false" up_thread="false" />

        <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true"/>

        <pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />

        </config>

    </attribute>

 

    <!--

        The max amount of time (in milliseconds) we wait until the

        initial state (ie. the contents of the cache) are retrieved from

        existing members in a clustered environment

    -->

    <attribute name="InitialStateRetrievalTimeout">20000</attribute>

 

    <!--

        Max msecs to wait until all responses for a sync repl have been received.

    -->

    <attribute name="SyncReplTimeout">10000</attribute>

 

    <!-- Max number of milliseconds to wait for a lock acquisition -->

    <attribute name="LockAcquisitionTimeout">15000</attribute>

  </mbean>

</server>

The PropertyConfigurator.configure() method takes as its argument the name of a file, which must be found on the classpath, and uses it to configure the TreeCache from the properties defined in it. Note that this configuration file is used to configure the TreeCache both as a standalone cache, and as an MBean if run inside the JBoss container[8].

A list of properties is shown below:

| Name | Description |
| --- | --- |
| CacheMode | LOCAL, REPL_SYNC or REPL_ASYNC |
| ClusterName | Name of cluster. Needs to be the same for all nodes in a cluster in order to find each other |
| ClusterConfig | The configuration of the underlying JGroups stack. See cluster-service.xml for an example. |
| EvictionPolicyClass | The name of a class implementing EvictionPolicy. Not currently used |
| FetchStateOnStartup | Whether or not to acquire the initial state from existing members. Allows for warm/hot caches (true/false) |
| InitialStateRetrievalTimeout | Time in milliseconds to wait for initial state retrieval |
| IsolationLevel | Node locking level: SERIALIZABLE, REPEATABLE_READ (default), READ_COMMITTED, READ_UNCOMMITTED, and NONE. Case doesn't matter. See documentation on locking for details. |
| LockAcquisitionTimeout | Time in milliseconds to wait for a lock to be acquired. If a lock cannot be acquired an exception will be thrown |
| LockLeaseTimeout | The time in milliseconds a lock is held. (Not currently used). May be removed in the future. We detect crashed members and do remove their locks. |
| MaxCapacity | The max number of elements in the cache. Can be used by an eviction policy. |
| ReplQueueInterval | Time in milliseconds for elements from the replication queue to be replicated. |
| SyncReplTimeout | For synchronous replication: time in milliseconds to wait until replication acks have been received from all nodes in the cluster |
| ReplQueueMaxElements | Max number of elements in the replication queue until replication kicks in |
| TransactionManagerLookupClass | The fully qualified name of a class implementing TransactionManagerLookup. Default is JBossTransactionManagerLookup |
| UseReplQueue | For asynchronous replication: whether or not to use a replication queue (true/false). |

 

 



[1] We will offer a cluster-wide serializable policy once we implement optimistic locking, which will be more efficient than implementing a serializable policy over a pessimistic locking scheme.

[2] Note that you can have more than 2 caches in a cluster.

[3] Plus their equivalent helper methods taking a String as node name.

[4] This is mainly used internally, and we may decide to remove public access to the Node in a future release.

[5] However, we recommend making them serializable, enabling a user to change the cache mode at any time.

[6] Depending on whether interval-based replication is used.

[7] Only on synchronous replication.

[8] Actually, we will use an XMBean in the next release.