Coherent caching


I had lunch yesterday with a friend who is using ehcache (BTW, did you notice that “ehcache” is a palindrome?) on a production web site in a cluster. He’s had a problem where one server in the cluster goes “bad” and becomes CPU-bound. When that happens, it starts to “infect” the other servers in the cluster because, as it tries to maintain coherency, it communicates with them to spread the cache updates. Ultimately, this causes the whole cluster to die. My friend had a 30-minute outage yesterday due to this problem.

Anyhow, we were discussing how Terracotta’s clustered version of ehcache might help alleviate this problem. At the moment, this is just a thought exercise; I haven’t run any tests to back this up. Terracotta works by introducing a set of one or more Terracotta servers in addition to your application servers. The application servers then communicate with the Terracotta servers instead of directly with each other.

If a particular server becomes “bad”, the other servers will generally be unaffected. Their coherency and state are managed through the Terracotta server, not through communication with other clients. The bad server can then be killed and restarted, after which it will rejoin the cluster and see the contents of the distributed cache as necessary. In Terracotta, the distributed shared memory is faulted into each VM as needed.
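To make the "faulted in as needed" idea concrete, here is a minimal sketch in plain Java. `CentralStore` and `ClientView` are hypothetical names of my own, not Terracotta's API, and the sketch ignores invalidation of stale local copies entirely. It only shows that clients talk to one central store rather than to each other, and that a restarted client starts empty and pulls entries in on first access.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical classes for illustration only; this is not Terracotta's API.
class CentralStore {
    // Stands in for the Terracotta server: the single authoritative copy.
    final Map<String, String> data = new ConcurrentHashMap<>();
}

class ClientView {
    private final CentralStore store;
    // Entries this JVM has pulled in so far; empty right after a restart.
    private final Map<String, String> local = new ConcurrentHashMap<>();

    ClientView(CentralStore store) {
        this.store = store;
    }

    // Writes go to the central store, never directly to other clients.
    void put(String key, String value) {
        store.data.put(key, value);
        local.put(key, value);
    }

    // Reads fault the entry in from the central store on first access.
    // If the store has no such key, nothing is cached and null is returned.
    String get(String key) {
        return local.computeIfAbsent(key, store.data::get);
    }

    // Number of entries currently held in this client's local view.
    int faultedIn() {
        return local.size();
    }
}
```

A "restarted" server in this sketch is just a fresh `ClientView` over the same `CentralStore`: its local view is empty until it asks for a key.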

Of course, if the bad server is holding distributed locks, it will still affect the other servers, as they will not be able to proceed until the lock is released. We kicked around some ideas internally of how this could be improved with timed locks or perhaps even timed wait/notify.
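For a feel of what a timed lock buys you, here is a rough sketch using an ordinary in-JVM `java.util.concurrent.locks.ReentrantLock`. This is not a Terracotta distributed lock, and `TimedLockSketch` and its methods are hypothetical names for illustration. The point is only that a waiter gives up after a timeout instead of blocking forever behind a stuck lock holder.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper for illustration; uses a local lock, not a clustered one.
class TimedLockSketch {
    private final ReentrantLock lock = new ReentrantLock();

    // Runs the update if the lock is acquired within timeoutMillis.
    // Returns false instead of blocking indefinitely behind a stuck holder.
    boolean tryUpdate(Runnable update, long timeoutMillis) {
        try {
            if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return false; // holder looks stuck; caller can skip or retry
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            update.run();
            return true;
        } finally {
            lock.unlock();
        }
    }

    // Simulates a "bad" server: a background thread takes the lock and
    // holds it until the returned latch is counted down.
    CountDownLatch holdInBackground() {
        CountDownLatch acquired = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            lock.lock();
            try {
                acquired.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        });
        holder.setDaemon(true);
        holder.start();
        try {
            acquired.await(); // wait until the holder really owns the lock
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return release;
    }
}
```

While the background holder owns the lock, `tryUpdate(..., 50)` returns `false` after roughly 50 ms rather than hanging; once the latch is released, a later call can succeed. What the caller should do on a timeout (skip the update, retry, or raise an alert) is a separate design question that this sketch leaves open.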

Comments

3 Responses to “Coherent caching”
  1. Billy says:

    Guess it depends on what’s bad. A bad JVM updating the cache will do the same thing unless Terracotta can figure out that it’s bad. Maybe it’s just busy, who knows. What’s the difference between bad and very busy, if you see my point? That the JVM is using lots of CPU may be normal; that it’s generating a lot of updates may be normal. Probably the best way to help here is to use a local partitioned cache instead. Now, if a server goes bad and makes lots of updates on the primary partitions in that JVM, only the replicas (usually 1 JVM) are impacted. The partitioning rather than peer-to-peer approach helps isolate this type of issue to at most a pair of JVMs.

  2. Taras Tielkes says:

    To follow up on the comment by Billy above, what’s the difference between a JVM gone bad and an in-progress GC? :)

Trackbacks

Check out what others are saying about this post...
  1. [...] As you might expect, this first day was mostly about setting up my computer. Alex and I did talk about what I would work on first, which is the distributed testing framework that is in development and a set of test use cases centered around Terracotta’s distributed cache functionality. [...]


