Thursday, March 24, 2011

Concerns about multi-subnet failover clustering

According to Microsoft’s Denali Books Online, a multi-subnet failover cluster is a cluster in which each node can be located in a different subnet.  In my case, I want a two-node active/passive failover cluster, with each node in a different geographic location.  This is known as a stretch cluster.  Reading through Microsoft’s documentation, I noticed this point:
As there is no shared storage that all the nodes can access, data should be replicated between the data storage on the multiple subnets.
In our previous implementation, we replicated the data between the two locations using an EMC-based hardware mirror at the SAN level.  I was naively hoping that Microsoft's implementation would own the replication when I read the statement that followed:
With data replication, there is more than one copy of the data available. Therefore, a multi-subnet failover cluster provides a disaster recovery solution in addition to high availability.
After reading through various documents, I get the impression that Microsoft still expects us to use third-party technology to replicate the data between the sites.  For example, the graphic displayed by this guy shows "SAN Replication".  For now, I guess we will continue to use EMC's SAN-level replication.  It might be worth checking into SteelEye -- at last year's TechEd conference, a Microsoft rep pointed me in their direction for a software-based mechanism that could do the necessary replication.  Oh well, at least our network guys will be happy about doing away with the stretch VLAN.
Other thoughts about multi-site clustering:

A few paragraphs later, Books Online gives as an example exactly what we plan to do:
SQL Server failover cluster SQLCLUST1 includes Node1 and Node2. Node1 is connected to Subnet1. Node2 is connected to Subnet2. SQL Server sees this configuration as a multi-subnet cluster and sets the IP address resource dependency to OR.
This bit about setting the IP address resource dependency to OR is rather intriguing and points to one of the changes in SQL 2012.  It seems that each IP address is no longer owned by every node in the failover cluster.  The IP address resource dependency can be set to either “OR” or “AND”.  Books Online offers several examples where we would use “OR” as well as an example where we would use “AND”.  Of particular interest is the following example consisting of three nodes:
SQL Server failover cluster SQLCLUST1 includes Node1, Node2, and Node3. Node1 and Node2 are connected to Subnet1. Node3 is connected to Subnet2. SQL Server sees this configuration as a multi-subnet cluster and sets the IP address resource dependency to OR. Because Node1 and Node2 are on the same subnet, this configuration provides additional local high availability.
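To convince myself what OR versus AND means here, I put together a small Python sketch. It is only a toy model of the idea, not anything the cluster service exposes: the function name, the resource names (IP_Subnet1, IP_Subnet2), and the online/offline flags are all made up for illustration. The assumption is that on Node3 only the Subnet2 address can be brought online.

```python
from typing import Dict


def network_name_can_start(ip_states: Dict[str, bool], mode: str) -> bool:
    """Return True if the IP address dependency is satisfied.

    ip_states maps a (hypothetical) IP address resource name to whether
    that resource is online on the current node.
    mode is "OR" (any one address online is enough) or
    "AND" (every address must be online).
    """
    if mode == "OR":
        return any(ip_states.values())
    if mode == "AND":
        return all(ip_states.values())
    raise ValueError(f"unknown dependency mode: {mode!r}")


# Assumed state after failing over to Node3: the Subnet1 address cannot
# come online there, the Subnet2 address can.
after_failover = {"IP_Subnet1": False, "IP_Subnet2": True}

print(network_name_can_start(after_failover, "OR"))   # True
print(network_name_can_start(after_failover, "AND"))  # False
```

With AND, the instance could never start on a node that sits in only one of the subnets, which is presumably why setup picks OR for the multi-subnet case.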
The three-node example raises another concern – that of DNS latency.  Recall that nodes 1 and 2 are in the same subnet, while node 3 is on a different subnet.  If SQL fails over to node 3, a DNS record must be updated to point to the new IP address on node 3.  Books Online indicates that this could be a problem if we have multiple DNS servers (think DNS synchronization issues).  My real concern is this statement:
The SQL Server cluster will not come online on Node3 until the DNS synchronization is complete.
So I guess this is why some shops choose to go with three nodes in a cluster instead of our typical two.  It's something that I will have to consider going forward.
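To get a feel for how long clients might keep resolving the old address after a cross-subnet failover, here is a rough back-of-the-envelope sketch in Python. It is a simplified model of my own, not something from Books Online: it assumes the client caches the host record for its full TTL, re-queries only when the cache expires, and gets the old address again if its DNS server has not yet received the update. The function and parameter names are hypothetical.

```python
def seconds_until_new_ip(replication_delay_s: float,
                         record_ttl_s: float,
                         cache_age_s: float) -> float:
    """Estimate seconds after failover until a client resolves the new IP.

    replication_delay_s: time until the client's DNS server has the update
    record_ttl_s: TTL of the host record (assumed re-issued in full each query)
    cache_age_s: how long before the failover the client cached the old answer
    """
    if record_ttl_s <= 0:
        # No client-side caching: only DNS replication matters.
        return replication_delay_s
    # First moment the client's cached answer expires and it asks again.
    t = max(record_ttl_s - cache_age_s, 0.0)
    # Every query made before replication finishes returns the old address,
    # which the client then caches for another full TTL.
    while t < replication_delay_s:
        t += record_ttl_s
    return t


# Made-up numbers, purely for illustration.
print(seconds_until_new_ip(300, 1200, 200))  # long TTL dominates
print(seconds_until_new_ip(900, 300, 100))   # slow replication dominates
```

Even in this crude model, the delay a client sees is driven by whichever is worse, the record's TTL or the replication lag between DNS servers, so both would need to be looked at with the network team.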

One other concern – be aware that multi-subnet failover clusters are only supported on SQL Datacenter, Enterprise, Developer, and Evaluation editions.  Notice what is missing – Standard (and Workgroup).  Bummer.


