SQL HA with MIM

MIM 2016 SP2 (and 4.4.1459.0 or later supports SQL Server Always On Availability Groups (AG)! Yeah!

Ok let’s implement it!

 But wait it won’t give us all we hope for!

Up to the moment distributed backup of the data — yes!

Automatic instant failover — no!

Let’s discuss what AG’s give us over the old SQL Failover cluster

Failover Clustering

With SQL Server (prior to SQL Server 2012) Failover clustering we have two or more servers in the same subnet (they can be in different datacenters via a stretch VLAN). When the active node goes offline then one of the other nodes will try to grab the necessary resources for that service, and bring them online. One of the critical pieces to note is that the Service has a virtual server name and virtual IP address. DNS points the virtual server name at the virtual IP address. During a failover, the new active node grabs the virtual IP address and sends out a gratuitous ARP request, which causes all devices on the subnet (including the router) to update the MAC address associated with the Virtual IP Address. So almost instantly requests for that IP address will be sent to the newly active node.

Starting with SQL Server 2012 we gained the ability to do failover clustering  over multiple subnets. However, failover is much slower as we would now depend on the cluster updating DNS, and DNS needs to replicate as well as DNS records have a time to live (TTL) which tells clients how long to cache the DNS record. The default TTL for cluster registered DNS records in 20 minutes. Which means that it could take 20 mins plus DNS replication time (default is every 180 minutes) for all clients to find the new server. Of course you can lower this, but I wouldn’t go below 60 seconds. If your DNS doesn’t or isn’t taking dynamic registrations from your SQL Server then you have to update it by hand.

SQL Mirroring

With SQL Mirroring we have two or more servers in the same subnet or different subnets, where one server is principal for a database and the others are mirroring/receiving the transactions either synchronously or asynchronously. Synchronous mirroring with a witness provides for instant failover to a database server in a remote subnet, provided that the client could support the failover_partner parameter in a connection string, or the application would allow you to do that. Of course, if the server specified in the connection string as the data source is up in can pass the client the name of the failover partner. But it didn’t provide for other databases to failover with it as a unit. It also meant that you needed to manage the connection strings, not such a daunting task for only a few clients.

Always On Availability Groups

SQL 2012 also brought Always On Availability Groups or AGs. With an AG you can group databases together and have them fail over as a unit. You create a listener (a Virtual Server Name and set of IP addresses). The cluster registers the Name with DNS and depending on the cluster parameter RegisterAllProvidersIP it will register all of the IP addresses for all of the nodes hosting the availability group. If in your connection string you set MultiSubNetFailover=TRUE, if your provider supports it, then your client will simultaneously try all of the IP addresses listed and then whichever one responds first (the one hosting the primary replica) will be used. Otherwise, if the MultiSubNetFailover=FALSE or your provider doesn’t support it, but RegisterAllProvidersIP =1 on the cluster the client will try all of the IP address returned from DNS, serially, rather than in parallel, which could result in substantial delays connecting to the server.

Now we can turn out attention back to MIM. MIM uses the SQL Native Client 11.0 OLEDB which doesn’t support the MultiSubNetFailover keyword. Additional Info on timeouts. In fact this provider was deprecated until March 2018 . When they released version 18 which adds support for MultiSubnetFailover 

Does MIM use this? Maybe! No mention of it in SP2 release notes  but in the hotfix prior to MIM 2016 SP2, 4.5.412.0, the article references this new ole db provider: “TLS 1.2 support is added to the MIM Service and Portal installer. This update will install if TLS 1.2 is the only enabled protocol *.  After you install this update, the change-mode setup of the MIM Service and Portal will succeed by having only TLS 1.2 enabled and SQL OLE DB driver installed.  * Microsoft OLE DB Driver 18 for SQL Server must be installed:” But this only references the MIM Service and Portal and not the sync engine. Furthermore, the connection string would still have to change to MultiSubnetFailover=TRUE.

Even if it didn’t we could still cheat if we could control the connection string: “If an availability group possesses only one secondary replica and is not configured to allow read-access to the secondary replica, clients can connect to the primary replica by using a database mirroring connection string.”

So MIM supports High Availability Groups for SQL  but “Synchronization server HA is not supported” and  “The SQLNCLI OLE DB Provider does not support the MultiSubnetFailover keyword. To use the MultiSubnetFailover keyword, use the ODBC driver.” Hmm, could we switch all of the services to use the ODBC provider? NO! They don’t provide a way to do that so this can only refer to SQL connections made using an MA.

So what does the “support” for High Availability Groups give us? We get mirroring that can automatically failover, albeit very slowly.  subject to DNS TTL. In the meantime the services will have failed. The FIM Service will retry 10 times with a 6 second timeout after which it fails. The FIM Synchronization service timeout and retries isn’t specified. But it too will fail, especially if in the middle of a sync. So we can configure the Services to restart after a short delay. So if we lower the HostRecordTTL cluster parameter to 60 seconds and configure services to restart we might have an automatic failover.

If you set the RegisterAllProvidersIP to 1 then you will get timeouts trying to start the sync service and the FIM service some of the time. If it set to 0 then you will get timeouts after a failover.

So for a lot of headache we get slightly improved mirroring and SQL transactions replication/mirroring slowing things down. So don’t use this is Sync performance or Portal performance is important. Use an AG for MIM PAM or if you have a smaller environment and are doing Password Sync.

Otherwise, doing a staging server or a standby server with SQL Log Shipping.