"Server, Heal Thyself"
Autonomic features in WebSphere v6: Peer Recovery of Transactions

At the forefront of IBM's autonomic computing strategy is WebSphere Application Server v6. This version of the application server is designed to automatically detect problems ranging from small network glitches to large-scale power failures and, in a matter of seconds, save and process Web-based business transactions that could take hours or days to recover under older systems. This is referred to as "self-healing" software.

As soon as the application server detects a problem, the transaction and its related data are automatically moved to another server, either within the same data center or, in the case of a power outage or disaster, in another geographic location. This new functionality is known as "peer recovery of transactions." This article discusses how it works and how to configure it in WebSphere Application Server v6.

High Availability Manager
Peer recovery of transactions is tightly coupled with the configuration of WebSphere's high availability environment. So let's start by taking a brief look at the WebSphere high-availability architecture.

WebSphere's high-availability architecture uses a high-availability manager (HAM) to monitor key services provided by the application server. These services encompass messaging, transaction managers, workload management controllers and other application servers in the cluster. Today, many enterprise application architectures include network-attached storage (NAS) devices. A requirement of peer recovery is that the HAM uses such a device to store the transaction logs from each application server in the cluster.

The HAM is responsible for automatic recovery of all in-flight as well as in-doubt work for any application server that fails in the high-availability cluster. This automatic functionality allows the WebSphere cluster to re-stabilize itself if one or more of the cluster members should fail.

In WebSphere v6, all of the application servers in a cell are defined as members of a core group. A core group is a statically-defined set of application servers that can be divided up into various high-availability groups. Each core group has only one logical high-availability manager that continually polls all of the group members to verify they are active. It is also responsible for making the services within the core group available and scalable.
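
The polling idea can be illustrated with a small sketch. This is a toy model, not WebSphere's implementation: the `CoreGroupMonitor` class, the `timeout_s` parameter, and the member names are all hypothetical, and the real high-availability manager's failure detection is not described in this article.

```python
from dataclasses import dataclass, field


@dataclass
class CoreGroupMonitor:
    """Toy model of a core group that tracks member liveness by heartbeat."""
    timeout_s: float  # hypothetical: seconds of silence before a member counts as failed
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, member: str, now: float) -> None:
        # Record the latest time a member responded to a poll.
        self.last_seen[member] = now

    def failed_members(self, now: float) -> list:
        # A member is treated as failed once it has been silent longer than the timeout.
        return sorted(m for m, t in self.last_seen.items() if now - t > self.timeout_s)


monitor = CoreGroupMonitor(timeout_s=10.0)
for name in ("S1", "S2", "S3"):
    monitor.heartbeat(name, now=0.0)

# S2 and S3 keep answering polls; S1 goes silent.
monitor.heartbeat("S2", now=12.0)
monitor.heartbeat("S3", now=12.0)
print(monitor.failed_members(now=15.0))  # ['S1']
```

Passing the clock in as `now` keeps the sketch deterministic; a real monitor would read the system time and would have to tolerate transient network delays.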

WebSphere uses policy matching to localize and partition policy-driven work into high-availability groups. In other words, when a core group member (in this case, an application server) fails, the high-availability manager can dynamically reassign the failed member's work to another component in the same high-availability group (in this case, a healthy application server). With NAS devices serving as common logging facilities, the component that has been assigned the work can recover and process the in-doubt and in-flight work of the failed component. This is the heart of peer recovery of transactions.
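
As a rough illustration of that reassignment, the sketch below moves the work owned by a failed member to a healthy peer in the same group. The `reassign_work` function, the `tranlog-*` item names, and the "first healthy member wins" rule are assumptions made for the example; the actual selection is governed by the core group policy.

```python
def reassign_work(ha_group, owner_of, failed):
    """Move every work item owned by `failed` to a healthy member of the same group.

    ha_group: ordered list of member names in one high-availability group
    owner_of: dict mapping a work item (here, a transaction log) to its current owner
    failed:   name of the member that has been detected as failed
    """
    healthy = [m for m in ha_group if m != failed]
    if not healthy:
        raise RuntimeError("no healthy peer left in the high-availability group")
    new_owner = healthy[0]  # assumption: simply pick the first healthy member
    return {item: (new_owner if owner == failed else owner)
            for item, owner in owner_of.items()}


owners = {"tranlog-S1": "S1", "tranlog-S2": "S2", "tranlog-S3": "S3"}
after_failure = reassign_work(["S1", "S2", "S3"], owners, failed="S1")
print(after_failure)
# {'tranlog-S1': 'S2', 'tranlog-S2': 'S2', 'tranlog-S3': 'S3'}
```

The function returns a new mapping instead of mutating the old one, which makes the before and after states easy to compare.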

Transaction Service Logs
With that architecture in mind, we can look at the logs that make recovery possible. The WebSphere transaction service logs are essential for recovering transactions: they enable transactional work that was in progress at the time of a server failure to be resolved when the server is restarted. These logs contain information about active transactional work that is preserved across a server crash.

WebSphere v6 allows you to configure the location of the transaction log directory, either using the WebSphere administrative console or commands. Special logic has been added to the administrative console to facilitate the migration of the transaction log configuration from earlier versions of WebSphere. In older versions the transaction log directory configuration was stored in the server.xml server-level configuration file. In the current version it is stored in the serverindex.xml node-level configuration file.

Prior to WebSphere v6, when an application server restarted from a crash, the server's recovery processes included retrieving the transaction service logs, processing the recorded information, recovering the transactional work and completing the in-doubt transactions (including the release of database locks). The server had to first completely restart and process the logs before the completion of the transactional work would take place. Thus, if the server was slow to recover or if it required manual intervention, the transactional work could not complete, and access to the database was disrupted.

Depending on the industry, IBM points to estimates that the failure of an e-business application can cost a company as much as $110,000 per minute in lost revenue and productivity. To minimize the cost and disruption of a failure, WebSphere v6 employs the strategy known as peer recovery of transactions.

Peer Recovery of Transactions
The peer recovery strategy is provided only within a server cluster, moving the high-availability requirements away from individual servers and onto the server cluster. As shown in Figure 1, a recovery process runs alongside the standard application server processes, enabling one server in the cluster to recover the work for another failed server in the same cluster. Thus, to recover the failed server, you do not need to start up a new application server.

The peer-recovery process is the logical equivalent of restarting the failed server, without actually doing so. It is important to note that the recovery process completes outstanding work, but does not start new work or provide "forward processing" functionality. The Workload Manager (WLM) of the cluster can then dispatch new work onto the remaining servers. Both transactions and the compensation service fail over together to the same peer server. The only difference from the user's perspective is the potential drop in overall system throughput.

Self-Healing Example
Let's consider an example of the peer recovery process that takes place if a single server fails. Three servers (S1, S2 and S3), as detailed in Figure 2, are running in a clustered WebSphere Application Server environment with the Workload Manager balancing the work between the servers. Note that database locks are being held on behalf of each of the application servers.

During normal processing, let's suppose the S1 application server unexpectedly crashes. Its locks remain held in the database. The application servers S2 and S3 are able to continue processing their existing transactions, but future transactional processing may be impeded by the locks still held on behalf of the crashed S1 application server.

In WebSphere v6, once S1 had failed, a peer recovery process for S1 would begin running inside S3 (or S2 depending on the configuration). The transaction service portion of the recovery process would retrieve the information persisted by S1, and use that information to complete any in-doubt transactions.

In addition, to recover any in-flight transactions, the endpoint references for S1 would be redirected to S3. Thus, the system would remain in a stable state with just two servers, between which the WLM engine can balance workload. At some future point when S1 is brought back online, it will have no recovery processing of its own to perform (see Figure 3).
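
The recovery step in this example can be sketched as follows. This is a simplified model based on generic two-phase commit behavior, not WebSphere's actual recovery logic: the `State` enum, the `recover_log` function, the `decisions` mapping, and the rule "roll back anything without a logged commit decision" are all assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"            # in-flight: work was still running at the crash
    PREPARED = "prepared"        # in-doubt: prepared, outcome not yet applied
    COMMITTED = "committed"
    ROLLED_BACK = "rolled_back"


def recover_log(log, decisions):
    """Resolve every unfinished transaction found in a failed server's log.

    log:       dict of transaction id -> State, as persisted by the failed server
    decisions: dict of transaction id -> True if a commit decision was logged
    """
    resolved = {}
    for txid, state in log.items():
        if state is State.PREPARED and decisions.get(txid, False):
            # The commit decision survived the crash, so the peer finishes the commit.
            resolved[txid] = State.COMMITTED
        elif state in (State.PREPARED, State.ACTIVE):
            # Assumption: with no logged commit decision, roll back to release locks.
            resolved[txid] = State.ROLLED_BACK
        else:
            # Already finished before the crash; nothing to do.
            resolved[txid] = state
    return resolved


# S3 reads the log that S1 persisted to shared storage before it crashed.
s1_log = {
    "tx1": State.PREPARED,
    "tx2": State.PREPARED,
    "tx3": State.ACTIVE,
    "tx4": State.COMMITTED,
}
recovered = recover_log(s1_log, decisions={"tx1": True})
for txid, state in recovered.items():
    print(txid, state.value)
```

After the peer has processed the log, no transaction is left in the `ACTIVE` or `PREPARED` state, which corresponds to the point in the example where S1's database locks can be released.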

Configuring the transaction properties required for peer recovery is part of the overall task for configuring a cluster to use high availability support. Note that a cluster can house both v5 and v6 application servers; however, the peer recovery support is only available with clusters where all the members are v6 servers.

To configure the transaction properties required for peer recovery, complete the following steps:

  1. Configure the transaction log directory setting for each server in the cluster. You can configure the location of the transaction log directory using either the WebSphere administrative console or commands. Keep the following in mind:
    a.  For peer recovery, each server in the cluster must be able to access the log directories of other servers in the same cluster. Interactions between the High Availability framework and the recovery log service prevent concurrent access to a single physical recovery log.
    b.  When using WebSphere Application Server without High Availability support, you can leave the recovery log configuration for persistent services (such as the transaction service) unset. The application server assumes a default location within the appropriate profile directory. When High Availability support is enabled, this default may not be visible from all servers in the cluster (for example, if they are in different profiles or on different physical nodes). As a result, it is recommended that the recovery log location be configured for each server in the cluster before enabling High Availability.
    c.  The storage mechanism used to host recovery log files, and access to that mechanism, must support the file-based force operation that is used by the recovery log service to force data to disk.
  2. Enable the High Availability function for the cluster by completing the following steps on the cluster configuration panel of the WebSphere administrative console:
    a.  In the administrative console, click Servers > Clusters > your_cluster_name.
    b.  To enable high availability for the cluster, select the Enable high availability for persistent services option.
  3. To change the peer recovery characteristics for a specific server in the cluster, you need to create a new policy that defines the criteria for the specific server:
    a.  Change the WAS_TRANSACTION policy in the core group configuration. In the administrative console, click Servers > Core Groups > DefaultCoreGroup > Policies > Cluster TM Policy.
    b.  Change the policy properties to suit your recovery requirements.
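
Before enabling High Availability, it can be useful to sanity-check the planned log directory layout. The sketch below is a hedged illustration that runs outside WebSphere: the `check_log_dirs` and `supports_force` helpers, the `/nas/tranlogs` mount point, and the rule that every server gets its own directory under one shared root are assumptions for the example, not WebSphere APIs or documented requirements.

```python
import os
import tempfile


def check_log_dirs(log_dirs, shared_root):
    """Return a list of problems found in a server -> log directory mapping."""
    problems = []
    seen = {}
    root = os.path.normpath(shared_root)
    for server, path in sorted(log_dirs.items()):
        norm = os.path.normpath(path)
        # Assumption: each server should have its own log directory.
        if norm in seen:
            problems.append(f"{server} shares {norm} with {seen[norm]}")
        seen.setdefault(norm, server)
        # Assumption: every log directory lives under the shared (NAS) root.
        if not (norm == root or norm.startswith(root + os.sep)):
            problems.append(f"{server} log directory {norm} is outside {root}")
    return problems


def supports_force(directory):
    """Probe whether a write followed by fsync succeeds in `directory`."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as probe:
            probe.write(b"probe")
            probe.flush()
            os.fsync(probe.fileno())  # ask the OS to force the data to disk
        return True
    except OSError:
        return False


planned = {
    "S1": "/nas/tranlogs/S1",
    "S2": "/nas/tranlogs/S2",
    "S3": "/nas/tranlogs/S3",
}
print(check_log_dirs(planned, "/nas/tranlogs"))  # []
```

A successful `fsync` probe only shows that the call does not fail; it does not prove that a particular NAS configuration honors the force semantics the recovery log service relies on, so the storage vendor's guidance still applies.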

Once you have completed these steps, you are ready to take advantage of peer recovery of transactions, which is at the center of WebSphere's "self-healing" functionality. By incorporating autonomic computing strategies into WebSphere v6, IBM has taken a large step toward self-healing, self-configuring, and self-protecting systems. With systems that can immediately begin to process in-doubt and in-flight work when they detect a server problem, your applications can achieve a greater level of reliability.

About Dennis Ashby
Dennis W. Ashby is a senior J2EE architect and WebSphere developer who specializes in server-side and back-end systems. Having joined Prolifics in 1996, Dennis is a key player in Prolifics' highly specialized group of WebSphere consultants. Headquartered in New York City, Prolifics is retained by IBM to apply its specialized WebSphere skills at any phase of the customer's project life cycle. Dennis has applied Prolifics' proven methodology and expertise in the design of J2EE applications and service-oriented architectures to clients all over the world including Pep Boys, Citigroup, KeySpan Energy, Allmerica, American Express, FEMA, Bank Zachodni, and Gelco.
