I have been fiddling about with setting up a SQL Server 2012 failover cluster using an EqualLogic SAN. After a whole lot of digging I found two different posts on two different sites which got me about 90% of the way there. However, there were some key "gotchas" and other information missing in both cases, and I wanted to document those here in addition to referencing the articles I followed for my setup.
BTW – Just my 2-cents, but setting up clustering is complicated… especially when you throw SQL in the mix. It isn’t bad once you have done it a few times (I tested again, and again, and again in a virtual environment) but there are honestly like 50+ considerations to take into account to ensure everything goes correctly.
I am assuming if you are here you already have a general understanding of failover clustering and know what you want to do and why. This article also doesn't cover all aspects of high availability; I don't discuss how your SAN(s) should be networked, for example, though I do touch on a few items in that area. This isn't meant to be comprehensive, and a lot of it is just for personal reference.
So here are some tips if this is your first go around. These are in NO particular order or grouping (this is very “stream of thought”) so I would suggest reading this from start to finish at least once rather than referencing it as you are going through your setup.
- Most articles mention them, but I will go ahead and also cover a few general requirements:
- You need two matched servers on the same Windows domain.
- For a proper failover cluster with redundancy you need 6 (yes, SIX) network connections on each server if you are using an iSCSI SAN (like me): 2 teamed NICs for regular networking, 2 teamed NICs for cluster communications/heartbeat, and 2 or more iSCSI SAN NICs using MPIO, NOT TEAMED!
EDIT 03/01/2017: Actually you only need 5 network connections. The heartbeat setup is what I got wrong: I was using direct crossover connections between the servers. This is a MISTAKE. If one server gets turned off, the other end of the heartbeat link goes dark, so to the still-running server it just looks like the heartbeat connection is disconnected. This is BAD. So for the heartbeat connection, put a switch in between, and furthermore, run only one cable to each server for this connection. Hence: one line for heartbeat, and your second path is the management connection, which can also carry heartbeat traffic over the regular network. Things get screwy/complicated (at least they did for me) if you try to have two dedicated connections for heartbeat communications.
- You need some local storage on each node: 2+ hard drives in at least a RAID 1 dedicated to the OS, 2+ hard drives in a RAID 1 for other program files (like SQL), and 2+ hard drives for TempDB. SQL Server 2012 lets you use local storage for TempDB in a clustered instance, and you should take advantage of that if possible.
- Regarding the networking, the NIC groups listed up top need to be physically or logically isolated, with non-overlapping subnets. With iSCSI, I think best practice is to also use two separate, non-overlapping networks. However, MPIO will work just fine if both of your iSCSI NICs and all of your SAN controllers are on the same network segment; just note that the Failover Cluster Validation Tool will throw a warning about the iSCSI NICs being on the same network.
- Also regarding networking: for the cluster communication/heartbeat network, you can in theory use crossover cables to directly connect the two servers if you are only building a 2-node cluster, but see the 03/01/2017 edit above; in practice this caused me problems, so put a switch in between. If more than 2 nodes, you should have a separate VLAN at least, if not a separate set of switches.
- Okay, with that out of the way, next tip: SQL Server 2012 doesn't support Cluster Shared Volumes, so don't mess with them. Some articles on failover clustering (generally) talk about using them; that is because they work fine with roles like Hyper-V, but not for SQL Server 2012.
- When you are setting up volumes in your SAN for the cluster, all of them need to be marked "yes" during volume setup where it asks "Do you want to allow simultaneous access to this volume from more than one iSCSI initiator?" Each one of those volumes, after being attached to the servers, needs to be brought online, initialized, and formatted NTFS. All of them… including your cluster witness disk.
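If you prefer scripting, the online/initialize/format step can be sketched with the Windows Server 2012+ storage cmdlets. The disk number and volume label below are placeholders, so check Get-Disk on your own nodes first:

```powershell
# List the newly attached iSCSI disks to find their disk numbers
Get-Disk | Where-Object { $_.BusType -eq 'iSCSI' }

# Bring one disk online, initialize it, and format it NTFS
# (disk number 2 and the "SQLData" label are examples only)
Set-Disk -Number 2 -IsOffline $false
Set-Disk -Number 2 -IsReadOnly $false
Initialize-Disk -Number 2 -PartitionStyle GPT
New-Partition -DiskNumber 2 -UseMaximumSize -AssignDriveLetter |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel "SQLData" -Confirm:$false
```

Repeat for each cluster volume, witness disk included.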
- This is a BIG tip… If you are using the Dell HIT KIT… which you very much should… you need to FIRST install the Microsoft MPIO feature (from the Add Roles and Features wizard) BEFORE installing the Dell HIT KIT. The HIT KIT depends on this being installed first. Older versions used to check during the HIT KIT deployment whether Microsoft MPIO was already installed and stop you if it wasn't. Apparently that check somehow got lost in the newer versions, and MPIO will not function correctly. So install the MS MPIO feature first, reboot, then install the HIT KIT, reboot, be on your way.
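In PowerShell terms, the order looks roughly like this (run on each node; the HIT KIT itself is a normal installer, not a cmdlet):

```powershell
# Install the Microsoft MPIO feature FIRST, then reboot
Install-WindowsFeature -Name Multipath-IO
Restart-Computer

# After the reboot: run the Dell HIT KIT installer, then reboot again.
```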
- As far as the HIT KIT is concerned. I recommend NOT using it to remotely manage or connect to your SAN Group. So don’t enter CHAP Credentials and connect it. The reason why? This is my gross understanding, but when you do this, it establishes an iSCSI connection to some kind of “VSS Control” volume on the SAN to talk with it. This volume will show up in Disk Management on the server, and it will cause Cluster Validation to throw errors. AFAIK you don’t “need” this feature so don’t set it up.
- Oh, if you have never used the HIT KIT before: Dell, confusingly, named the management console the "Auto Snapshot Manager" or "ASM". It isn't used just for managing snapshots; it is also used for managing MPIO settings. Dear Dell, please rename this application/console to something like, I dunno'… perhaps "Dell iSCSI Management Console"??
- The main thing you need to configure in the HIT KIT is which networks can be used for iSCSI connections. You need to go in and manually “exclude” all of the other networks or Dell/Windows will try to make them participate in MPIO.
- Oh yeah, on your iSCSI NICs and your heartbeat NICs, you need to turn off IPv6 (if you aren't using it). Also, when you set the static IP addresses on these, go into the advanced TCP/IP settings and, on the DNS tab, untick the box to register this connection's addresses in DNS.
- On your iSCSI NICs you need to enable Jumbo Frames.
- Don't forget to turn off IPv6 on your management NICs as well.
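All three of those NIC tweaks can be scripted with the built-in networking cmdlets on Server 2012+. The adapter names and the jumbo frame value below are assumptions; match them to your own NIC names and driver:

```powershell
# Adapter names are examples only; check Get-NetAdapter for yours
$nics = 'iSCSI1', 'iSCSI2', 'Heartbeat'

foreach ($nic in $nics) {
    # Unbind IPv6 from the adapter
    Disable-NetAdapterBinding -Name $nic -ComponentID ms_tcpip6
    # Do not register these interfaces in DNS
    Set-DnsClient -InterfaceAlias $nic -RegisterThisConnectionsAddress $false
}

# Enable jumbo frames on the iSCSI NICs only
# (the exact value is driver-dependent; 9014 is common)
Set-NetAdapterAdvancedProperty -Name 'iSCSI1', 'iSCSI2' `
    -RegistryKeyword '*JumboPacket' -RegistryValue 9014
```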
- Just installing the HIT KIT and fixing the MPIO networks does NOT enable MPIO for all of your SAN connections. This is done elsewhere, of course, and the referenced articles talk about it. When you open the Microsoft iSCSI console, the "Discovered Targets" box on the "Targets" tab should populate with a long IQN for each volume you have made available to the server. You need to click on each one and hit the "Connect" button. A dialog will pop up; leave the "add to favorites blah blah" option ticked, and also tick the "Enable multi-path" option. Do this for each IQN you are using for the cluster, and go through this exercise on all servers in the cluster.
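If you'd rather not click through each IQN, the same connect-with-MPIO exercise can be sketched with the iSCSI cmdlets (run on each node; this assumes your targets have already been discovered via the SAN's discovery portal):

```powershell
# Connect every discovered-but-unconnected target,
# persistently (survives reboot) and with MPIO enabled
Get-IscsiTarget | Where-Object { -not $_.IsConnected } |
    Connect-IscsiTarget -IsPersistent $true -IsMultipathEnabled $true
```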
- Your cluster witness disk doesn't need to be any larger than 1 GB.
- If you did all of the above, and you are still in the iSCSI initiator properties dialog box, then click on the Dell Equallogic MPIO tab. You should now see multiple connections to each SAN iSCSI volume indicating that MPIO is working.
- Oh yes… I was wondering HOW exactly failover clustering actually handles the iSCSI volumes. Reading, and especially writing, to the same NTFS-formatted LUN from two servers just struck me as "wrong", and several articles about SANs in general seem to indicate that it is. Hence I wasn't sure whether all of the volumes needed to allow access from multiple initiators… Yes, yes they do. And there is nothing magical about what failover clustering does: simply put, once implemented, it "offlines" the attached SAN disks on the passive node. In the event of a failover, it "onlines" those disks to bring them up. Hence your failover cluster is never doing simultaneous writes from two different servers to the same NTFS volume, based on my gross understanding of things.
- Once all of the above is taken into consideration, you need to install the Microsoft Failover Clustering feature (it is listed under Features, not Roles, in the wizard) on both machines, BEFORE trying to install SQL Server on either box.
- Oh, and from the same "Add Roles and Features" wizard, be sure to add .NET Framework 3.5 to both servers.
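Both feature installs (plus cluster validation, which you'll want to run before creating the cluster anyway) can be sketched in PowerShell. The node names are placeholders, and note that on some builds installing .NET 3.5 requires pointing -Source at your installation media:

```powershell
# On both nodes: install Failover Clustering and .NET Framework 3.5
Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools
Install-WindowsFeature -Name NET-Framework-Core

# Then, from one node, run cluster validation before creating the cluster
Test-Cluster -Node 'SQLNODE1', 'SQLNODE2'
```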
Okay, all of the above being said, I relied HEAVILY on these two articles (note my caveats)…
- For Dell Equallogic Specific setup for a Failover cluster:
This is for general failover clustering. Very helpful overall in walking through HIT KIT, Dell SAN MGMT, and the Failover Cluster Role installation. My only caveats are:
- Ignore the section where it talks about entering CHAP Credentials in the “Auto Snapshot Manager” console on the server. You don’t want to connect to that VSS Control Volume.
- Ignore the last part where it mentions setting up Cluster Shared Volumes (CSVs); they are fine for Hyper-V, but not for SQL. That article wasn't written for SQL.
- For actually installing and setting up SQL Server 2012 clustering:
This is an article about failover clustering and SQL, but has nothing specific to EqualLogic. Part one is a very good overview of clustering and the requirements. However, use the first article I referenced for setting up your SAN connections and installing the Failover Clustering feature. Part 2 and onward is where it gets to the "meat". This is an excellent aid for a first-time go-around of actually doing the SQL Server clustered instance install.
- Finally, I would recommend you more or less avoid this article: https://www.1e.com/blogs/2014/08/11/creating-a-windows-cluster-using-iscsi-to-connect-to-shared-storage/
It is a LOT of reading, and I mean no disrespect to the author, but I am pretty sure there is some incorrect information. Granted, at the point in time I was reading it I was probably a bit confused. Parts 3 and 4 do have some good notes/examples on validating that failover is actually working, though. Even so, I found the first two articles much more helpful and succinct.